Wednesday 26 April 2017

Sqoop Use Cases


  • Importing all tables from a Database
Sqoop can import more than one table in a single command with the import-all-tables tool. With this tool the --target-dir argument cannot be used: the tables are imported one after another, and because of the write-once (WORM) nature of HDFS the second table cannot be written into the directory that already holds the first. To overcome this, the --warehouse-dir argument is used instead. It creates a parent directory in HDFS, and each table gets its own sub-directory inside that parent, which is how Sqoop imports multiple tables in a single command.
 $ sqoop import-all-tables --connect jdbc:mysql://192.168.0.1/db100 --username root --password root --warehouse-dir /tbl1_In_HDFS -m 2
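Assuming, purely for illustration, that db100 contains two tables named tbl1 and tbl2, the import leaves one sub-directory per table under the warehouse directory, each holding one output file per mapper, roughly like this:
/tbl1_In_HDFS/tbl1/part-m-00000, part-m-00001
/tbl1_In_HDFS/tbl2/part-m-00000, part-m-00001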
  • Importing all tables from a Database excluding some
If a database contains, say, 10 tables and the user wants to import only 8 of them, Sqoop can import all tables while excluding the ones that are not needed. Suppose tbl8 and tbl9 should not be imported:
sqoop import-all-tables --connect jdbc:mysql://192.168.1.0/db100 --username root --password root --exclude-tables tbl8,tbl9 -m 2

  • How to import a table in parallel when it has no primary key
When the source table has no primary key (the column that uniquely identifies each row, which is how Sqoop confirms that the imported data is not ambiguous), Sqoop cannot slice the table on its own. Either we do a sequential import (no parallelism), or we pick a column that we know is unique and not null and tell Sqoop to slice the table on that column with --split-by; Sqoop can then split the table and import it in parallel.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --split-by uniqueColName
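Under the hood, Sqoop first determines the boundaries of the split column and then gives each mapper a disjoint range. For -m 2 with --split-by uniqueColName the generated queries look roughly like the sketch below (the exact SQL Sqoop emits differs slightly):
select min(uniqueColName), max(uniqueColName) from tbl1;
select * from tbl1 where uniqueColName >= <min> and uniqueColName < <mid>;   -- mapper 1
select * from tbl1 where uniqueColName >= <mid> and uniqueColName <= <max>;  -- mapper 2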

  • Speeding up the Import and Export
When Sqoop imports or exports, it uses a generic JDBC connection to transfer data between the RDBMS and Hadoop, which works with any database that ships a JDBC driver. But if Sqoop can use a native tool built specifically for transferring data to or from a particular database, the transfer is much faster. Two things to keep in mind: the native tool must also be available on the nodes that run the map tasks (the TaskTrackers), and transfers done through native utilities produce plain-text output only, not binary formats (SequenceFile, Avro, etc.). For example, MySQL has the native tools mysqldump and mysqlimport, and PostgreSQL has pg_dump, which are efficient at moving data out of or into the RDBMS. Sqoop currently supports this for MySQL and PostgreSQL.
Simply by passing the --direct argument, Sqoop will use the corresponding native tool, which is already defined in Sqoop. To use this, the native tools have to be installed and configured on the Hadoop cluster nodes (TaskTrackers) as well.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --direct

  • Importing Incremental Data (only the newly added data from the source table)
Once we have imported a table into HDFS, the source table may receive new records. To import only these newly added records we use incremental import, which needs two more pieces of information: which column to check (typically the primary key) and the last value of that column that was already imported. Sqoop then imports only the newly added rows.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --incremental append --check-column id --last-value 200
After importing the incremental data, Sqoop prints information that is useful for setting up the next incremental import.
    25/04/17 08:16:36 INFO tool.ImportTool: Incremental import complete!
    25/04/17 08:16:36 INFO tool.ImportTool:  --incremental append
    25/04/17 08:16:36 INFO tool.ImportTool:   --check-column id
    25/04/17 08:16:36 INFO tool.ImportTool:   --last-value 400
Here we need to preserve the column name and the last value for the next incremental import. Sqoop has a very useful feature called the Sqoop job, which saves this information for future runs.
  • Exploring Sqoop Job
Sqoop uses its built-in metastore to store this information. Users can create and execute jobs with the sqoop job tool. The most important benefit of the built-in metastore is in conjunction with incremental import: Sqoop automatically serializes the last imported value back into the metastore after each successful incremental job. This way, users do not need to remember the last imported value after each execution; everything is handled automatically.

$ sqoop job --create tbl1import -- import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --incremental append --check-column id --last-value 400

To list the Sqoop jobs that have been created: $ sqoop job --list
To see the details of a job: $ sqoop job --show jobname
To execute a job: $ sqoop job --exec jobname
To delete a job: $ sqoop job --delete jobname
To override arguments of a saved job at execution time: $ sqoop job --exec jobname -- -m 1
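For example, the incremental job created above can be run as shown below; after each successful run Sqoop serializes the new last value back into its metastore, so the next execution continues from where the previous one stopped.
$ sqoop job --exec tbl1import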

  • Importing data from more than one table (using complex SQL queries)
Importing data with a free-form query is expensive (it uses more resources and takes longer). Sqoop normally relies on the metadata of the table being imported, but when joining tables there is no such table in the database; the result set is just the outcome of the join query.
So Sqoop cannot use table metadata here, and the import becomes more expensive, using more resources and taking more time.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --query 'select tbl1.id, tbl1.name,tbl2.P_Name, tbl2.phoneno from tbl1 join tbl2 USING(id) where $CONDITIONS' -m 2 --target-dir /freeform2 --split-by id

Sqoop uses the table’s primary key by default and generates the query select min(col), max(col) from tbl (for table tbl and split column col). In the case of a free-form query import, there is no table that Sqoop can use for fetching those values; instead, it will use the entire query specified on the command line as a subquery in place of the table name, resulting in the query select min(col), max(col) from ($YOUR_QUERY). Sqoop therefore has to materialise the output result set before moving any data, just for the purpose of getting the import boundaries. If Sqoop is given the boundaries up front (with --boundary-query), it does not need to materialise the join query first, and the transfer of data is faster.
$ sqoop import --connect jdbc:mysql://localhost/movielens --username root --query 'select tbl1.id, tbl1.name,tbl2.P_Name, tbl2.phoneno from tbl1 join tbl2 USING(id) where $CONDITIONS' -m 2 --target-dir /freeform2 --split-by id --boundary-query "select min(id),max(id) from tbl1;"

  •  Importing From RDBMS to HIVE [Special Case]
 ******SQOOP TO HIVE*******************
1) Import only the schema (only create the table in Hive)
2) Import only the data (only load data into an existing Hive table)
3) Import schema + data (create the table in Hive and load the data)


sqoop create-hive-table --connect jdbc:mysql://localhost/movielens --username root --table genre
(Use this when the data is already in HDFS and you only want the table schema to be created in Hive.)

sqoop import --connect jdbc:mysql://localhost/movielens --username root --table genre --hive-import --hive-table genre
(Use this when the table already exists in Hive and you only need to load the data into it.)

sqoop import --connect jdbc:mysql://localhost/movielens --username root --table movie --hive-import --hive-table hivemovie --create-hive-table
(This is the most common case: create the table in Hive and load the data into it.)
  •  Exporting data from HDFS to RDBMS
Sqoop export works in the reverse manner of Sqoop import: it transfers data from HDFS to the RDBMS. It is used to transfer generated, processed, or backed-up data from Hadoop to your database. After you execute the command, Sqoop generates and compiles a Java class; the generated class is used in the submitted MapReduce job that exports your data.
As with import, no data is transferred through the Sqoop client itself. All transfers are done in the MapReduce job, with Sqoop overseeing (controlling) the process.
The destination table must exist prior to exporting. By default, Sqoop exports data by performing INSERT statements.

$ sqoop export --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --export-dir /tbl1Hdfs/

Internally, Sqoop exports each row with a separate INSERT statement. There are several ways to make Sqoop insert data more efficiently. One is to enable the JDBC batching feature with the --batch parameter: some database drivers use this ability to send multiple rows to the remote database inside one request for better performance, while others might simply send each query separately.
$ sqoop export --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --export-dir /tbl1Hdfs/ --batch

Alternatively, we can set the property sqoop.export.records.per.statement to a value of two or more, in which case Sqoop will create the following query:
INSERT INTO table VALUES (...), (...), (...), ...; 
As the query is completely generated by Sqoop, the JDBC driver doesn’t alter it, sending it to the remote database as is.
$ sqoop export -Dsqoop.export.records.per.statement=10 --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --export-dir /tbl1Hdfs/

  • Export all data or nothing
Sometimes we need the export to be all-or-nothing: either the destination table ends up with all the data, or it is left empty; we do not want half-transferred or incomplete data. For this, Sqoop uses a temporary table with exactly the same structure as the destination table, called the staging table. Sqoop first exports all data into this staging table (specified with --staging-table) instead of the main table. Only if all parallel tasks successfully transfer their data does Sqoop open a new transaction to move the data from the staging table to the final destination. We have to create the staging table ourselves, as Sqoop does not create tables in the RDBMS. This approach makes sure that the data is exported completely or not at all, but it comes with disadvantages on the database side:
  1. For a period of time the database holds the data twice (once in the staging table and once in the final table), consuming double the storage.
  2. As the data is first loaded somewhere else and then moved to the final table, using a staging table will always be slower than exporting directly to the final table. 
$ sqoop export --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --staging-table staging_tbl1 --export-dir /tbl1HDfs

The staging table needs to be empty in order to end up with consistent data. You can specify the parameter --clear-staging-table to instruct Sqoop to clean the staging table for you automatically. If supported by the database, Sqoop will use a TRUNCATE operation to clean up the staging table as quickly as possible.
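For example, reusing the export command above, the staging table can be cleared automatically before each run:
$ sqoop export --connect jdbc:mysql://192.168.1.1/db100 --username root --password root --table tbl11 --staging-table staging_tbl1 --clear-staging-table --export-dir /tbl1HDfs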
