Tuesday, 18 April 2017

Sqoop Commands

Sqoop is mainly used for import and export purpose. But sqoop is capable of doing many operations. We will see different used cases of sqoop.

  • To enquire the Database
If you want to see all the database in the server then you can see through this through the sqoop command also.
$ sqoop list-databases —connect jdbc:mysql://192.168.0.1 —username root —password root
If you want to see the list of tables of a particular database, using sqoop command.
$ sqoop list-tables —connect jdbc:mysql://192.168.0.1/db100 —username root —password root  
If you want to see the result of the sql query through the sqoop command then sqoop has a tool called EVAL tool, which is used to execute the query and show the result of the query.(query can be for selecting, inserting new rows, deleting new row ,updating records etc.)
$ sqoop eval —connect jdbc:mysql://192.168.0.1/db100 —username root —password root  —query “select * from tbl1;” 
[Here while sqoop in no import and export has been done, here just the query is executed and the result is show in your standard output.]

  • To import a table from Mysql server
Import a table to HDFS from RDBMS
$ sqoop import —connect jdbc:mysql://192.168.0.1/db100 —username root —password root   —table tbl1 —target-dir /tbl1_In_HDFS
This import command imports table tbl1 from database db100 to HDFS path /tbl1_In_HDFS
—target-dir is an argument which takes HDFS path as value to import the table and save it in the given HDFS Location.
Importing Subset of Table

Subset of tables means a set from the base table it may be selected number of rows of base table or selected number of columns of base table or may be selected no of rows as well as columns of that table.

I) selected number of rows
$ sqoop import —connect jdbc:mysql://localhost/db100 —username root —password root —table tbl1 —target-dir /tbl10 —where “age>
II) Selected number of columns 
$ sqoop import —connect jdbc:mysql://localhost/db100 —username root —password root —table tbl1 —target-dir /tbl10 —coumns ID,Name,Gender

 III) Select Number of Rows as well as Columns
$ sqoop import —connect jdbc:mysql://localhost/db100 —username root —password root —table tbl1 —target-dir /tbl10 —coumns ID,Name,Gender —where “age>20”

 The main features of Sqoop is that it can import the table parallel (using more than one mapper).
In all the above import case we have not defined the number of mappers mean sqoop will execute its default number of mappers which is 4. It means in all the above import sqoop command the number of mapper executed are 4. If you go and see your hdfs path where the import has been done you can see the 4 part-m* files. These files are mappers output means these output ar the 4 mappers output.



  • Controlling the Parallelism
Controlling the parallelism mean defining the number of mappers in sqoop commands. If we give the number of mappers in Sqoop command then in that command where we mention the number of mapper, the parallelism of data import will be controlled. It totally depends on the number of mapper you mention in the Sqoop Command.
$ sqoop import —connect jdbc:mysql://192.168.0.1/db100 —username root —password root   —table tbl1 —target-dir /tbl1_In_HDFS -m 2

Here, -m (—num-mappers) means no of mappers 
Here, in this sqoop command the number of parallel import will be 2 not 4(default no. of mapper).
Choosing higher number of mapper is not the best way to optimise the sqoop import rather it will be slow because more number of mapper mean more number of connection hitting to database to read data from the table and more number of connection in database mean more load to the database due to which database will respond slow which will affect the sqoop import. We should choose always less number of mapper as much as we can.
Importing with more number of mappers will create more number of files in HDFS which will increase NameNode load.

  • Importing more than one Table
Sqoop is capable of importing more than one table in single sqoop command. It can be done with the sqoop tool import-all-tables where we can’t use —target-dir argument because the importing of multiple tables happens in sequential manner because the importing of second table cannot be done because of WORM feature of HDFS. To overcome this issue while importing multiple table —warehouse-dir is used which will creates parent directory in HDFS and the corresponding table will create its directory inside the parent directory through which sqoop is capable of importing multiple tables in single command.
 $ sqoop import-all-tables —connect jdbc:mysql://192.168.0.1/db100 —username root —password root —warehouse-dir /tbl1_In_HDFS -m 2

  • How Sqoop can slice Table?
when sqoop executes any import command using only one mapper, then there no slicing of table is done ,also called sequential import. But when sqoop command uses more than 1 mapper then the slicing of table has to be done. 
Sqoop when executes the command it internally search for Primary key and run the query “select min(id), max(id) from tbl1;” which mean the columns id is primary key (unique and can’t be null) then sqoop will assumes total number of rows and divide the total number of rows with the total number of mappers. And then generated the query for all the mappers which will contain the query like if the total number of rows is 10 and the number of mappers is 2. Then sqoop will create jar file where the specified conditions are to be met.
 select * from tbl1 where id>0 && id =<5; all the rows meeting this condition will be imported by 1st mapper
 Select * from tbl1 where id>5 && id=<10 ; all the rows meeting this condition will be imported by 2nd st mapper

Here , main table is sliced into two sub-files 1st which contain the data which satisfies the first mapper and the second file which satisfied the second mapper.


To be Continued . . .

3 comments:

How to play with csv data in Spark?

Lets see how we can read the csv data [in my case from LFS] and converting the RDD to Dataframe using case class. Sample Data:- 1,Ram,Si...