Sunday 16 April 2017

Sqoop Tutorial

Sqoop practicals
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases. Sqoop can be used to import data from a relational database such as MySQL or Oracle into the Hadoop Distributed File System (HDFS). It can also be used to transform data stored in Hadoop as part of a MapReduce job and then export the data back into a relational database.
After installing Sqoop successfully, you can import and export data between any RDBMS and Hadoop HDFS in order to process it with Hive, MapReduce, or Spark.


Installation of Sqoop
Download Sqoop from http://mirror.fibergrid.in/apache/sqoop/1.4.6/
Download the MySQL connector from http://dev.mysql.com/downloads/connector/j/
After both downloads finish, go to the Downloads folder.
Step 1.
Untar the Sqoop .tar file:
$tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Step 2.
Move the extracted Sqoop directory to /usr/local/sqoop-1.4.6.
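A sketch of this step (the extracted directory name follows the tarball above; adjust the paths if yours differ):
$sudo mv ~/Downloads/sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/local/sqoop-1.4.6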

Step 3.
Then go to the sqoop-1.4.6 conf directory and
rename sqoop-env-template.sh to sqoop-env.sh:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh

Step 4.
Add the Hadoop paths to sqoop-env.sh:
export HADOOP_COMMON_HOME=/usr/local/hadoop-2.7.0
export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.7.0
Step 5.
open ~/.bashrc file
$sudo gedit ~/.bashrc
Configure the environment variables in .bashrc file
export SQOOP_HOME=/usr/local/sqoop-1.4.6
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export SQOOP_CLASS_PATH=$SQOOP_CONF_DIR
export PATH=$PATH:$SQOOP_HOME/bin
Save and close, then reload the .bashrc file:
$source ~/.bashrc
Step 6.
Untar mysql-connector-java-5.1.38.tar.gz in the Downloads folder:
$tar -zxvf mysql-connector-java-5.1.38.tar.gz
From inside the extracted mysql-connector folder, copy mysql-connector-java-5.1.38-bin.jar into Sqoop's lib directory:
$cd /home/user/Downloads/mysql-connector-java-5.1.38/
$sudo cp mysql-connector-java-5.1.38-bin.jar $SQOOP_HOME/lib

Step 7.
To verify that the Sqoop installation was done properly:

$cd $SQOOP_HOME/bin
$sqoop version
It should show something like:
16/08/28 21:48:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6
git commit id c0c5a81723759fa575844a0a1eae8f510fa32c25
Compiled by root on Mon Apr 27 14:38:36 CST 2015

After successful configuration of Sqoop, we are now ready to work with Sqoop.

In Sqoop's working mechanism only a mapper job runs; no reducer is required because the import and export process does not involve any aggregation.
Sqoop ships with a number of tools (import, export, import-all-tables, list-databases, list-tables, job, and so on); import and export are the ones used most often.
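If you want to see the full list on your own installation, sqoop help prints every available tool (indicative output; the exact list can vary between versions):
$sqoop help
# typically lists: codegen, create-hive-table, eval, export, help, import,
# import-all-tables, job, list-databases, list-tables, merge, metastore, version
$sqoop help import
# prints the arguments accepted by a single tool, here the import tool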

Here, I am assuming MySQL is hosted at 192.168.1.0 (if it is hosted on the same machine you can use localhost) and the database is db1.

IMPORT FROM RDBMS TO HDFS
The import tool is used when you want to transfer data from a relational database into Hadoop.

sqoop import --connect jdbc:mysql://192.168.1.0/databasename --username root --password root --table tablename --target-dir /user/data
--connect:- the JDBC URL of your database. The URL syntax differs for each database.
--username and --password:- the credentials Sqoop uses when connecting to the database.
--table:- the name of the table to transfer.
--target-dir:- the directory on HDFS where Sqoop should import your data.

PROCESS:-
First, Sqoop connects to the database to fetch the table metadata (the number of columns, their names, the associated data types, and whether the table is partitioned). At this point Sqoop does not transfer any data between the database and your machine; rather, it queries the catalog tables and views. Based on the retrieved metadata, Sqoop generates a Java class and compiles it using the JDK and the Hadoop libraries available on your machine.
Next, Sqoop connects to your Hadoop cluster and submits a MapReduce job. Each mapper of the job then transfers a slice of the table's data.
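If you want to see this code-generation step in isolation, Sqoop's codegen tool runs just that part without transferring any data (a sketch; the output directories below are only examples):
$sqoop codegen --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --outdir /tmp/sqoop-codegen-src --bindir /tmp/sqoop-codegen-bin
# writes tbl1.java (plus the compiled .class and .jar) describing one row of tbl1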


Sqoop supports different ways of importing data depending on the use case. Let's explore the Sqoop import cases.

Here, the database is MySQL running on machine 192.168.1.0, the database name is db1, and the credentials are username root and password root.



Case 1. Importing one table from a database
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --target-dir tbl1data

Case 2. Importing only a subset of a table

Importing a subset of a table brings three cases:
  • selected rows
  • selected columns
  • selected rows and columns
Sqoop can import data in all three cases.
1. Selected rows
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --where "age>20" --target-dir /tbl1data
Here, only the rows that satisfy the condition will be imported.

2. Selected columns
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --columns id,name,age --target-dir /tbl1data
Here, only the id, name, and age columns will be imported.

3. Selected rows and columns
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --columns id,name,age --where "age>20" --target-dir /tbl1data
Here, only id, name, and age will be imported for the rows where age is greater than 20.

Case 3. Importing all tables from a database
sqoop import-all-tables --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --warehouse-dir /database1
Tables are imported one after another (sequentially), but each individual table is imported in parallel. By default, Sqoop creates a directory with the same name as the imported table inside your home directory on HDFS and imports all data there.
[Here, --warehouse-dir is used rather than --target-dir, because --warehouse-dir creates a parent folder under which each table gets its own sub-directory. This doesn't happen with --target-dir; note also that Sqoop will reject importing into an existing directory to prevent accidental overwriting of data.]
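For example, after the import-all-tables command above the HDFS layout would look roughly like this (directory and file names are only illustrative; they depend on your tables and the number of mappers):
$hdfs dfs -ls /database1
# one sub-directory per imported table, each holding the mapper output files, e.g.:
#   /database1/tbl1/part-m-00000, part-m-00001, ...
#   /database1/tbl2/part-m-00000, part-m-00001, ...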

Case 4. Importing all tables excluding some tables
sqoop import-all-tables --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --exclude-tables tbl1,tbl2 --warehouse-dir /database2

In this case, all the tables inside database db1 will be imported except the tables given in the --exclude-tables argument (tbl1, tbl2).

Case 5. Speeding up transfers
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl2 --direct --target-dir /datadirectory
[Rather than using the JDBC interface for transferring data, direct mode delegates the job of transferring data to the native utilities provided by the database vendor.]
Note:- Not all databases have native utilities available, so this mode is not available for every supported database. As of now, Sqoop supports direct mode only for MySQL and PostgreSQL.

Case 6. Overriding type mapping
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --map-column-java id=Long --target-dir /datadirectory
Here id is the column name and Long is the Java type you want to map it to. If you want to change the types of multiple columns:- --map-column-java col1=Float,col2=String,col3=String,.......

Case 7. Controlling Parallelism
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --direct --target-dir /datadirectory --num-mappers 10
(--num-mappers can also be written as -m)
Note: It is not advisable to use a larger number of mappers unless the size of the data is very large. By default Sqoop launches 4 mappers, which is suitable for most cases. Increasing the number of mappers may put extra load on your source database server.
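One more point on parallelism: Sqoop needs a column to split the data on, which by default is the table's primary key. For a table without a primary key you have to either name a split column or fall back to a single mapper (tbl3 below is just a hypothetical table used for illustration):
$sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl3 --split-by id --target-dir /tbl3data
# or import sequentially with a single mapper
$sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl3 -m 1 --target-dir /tbl3data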

Case 8. Incremental Importing
When a table keeps appending new rows and you need to periodically sync its state to Hadoop for further processing, we use incremental import. It takes two arguments, --check-column and --last-value; based on these, Sqoop imports only the rows added after the last value.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --incremental append --check-column Id --last-value 25 --target-dir /datadirectory

15/05/04 01:22:43 INFO tool.ImportTool: --incremental append
15/05/04 01:22:43 INFO tool.ImportTool: --check-column Id
15/05/04 01:22:43 INFO tool.ImportTool: --last-value 1500
15/05/04 01:22:43 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')


When running in incremental mode, Sqoop always prints out the value of the last imported row. This allows you to easily pick up where you left off. Since it is hard to remember the last value and the column on which the import was based, Sqoop lets you create and save a job; each time the job is executed, it automatically saves the new last-value.

Case 9. Preserving the last-value imported value (Sqoop job)
We can take advantage of the built-in Sqoop metastore, which allows you to save all parameters for later reuse. The Sqoop metastore is a powerful part of Sqoop that allows you to retain your job definitions and to easily run them anytime. Each saved job has a logical name that is used for referencing.
sqoop job --create jobname -- import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl2 --incremental append --check-column id --last-value 1500 --target-dir /datadirectory

Case 10. Sqoop Job commands
[Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job]
sqoop job --exec jobname      [Executes the saved sqoop job]
sqoop job --list              [Lists the saved sqoop jobs]
sqoop job --show jobname      [Describes the saved sqoop job]
sqoop job --delete jobname    [Deletes the saved sqoop job]

Case 11: Importing the table data with the Date Reference.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root -P --table tbl1 --incremental lastmodified --check-column last_update_date --last-value "2015-05-4 01:05:14" --target-dir /dateIncrementdirectory

Note: make sure the check column is a date column stored in the correct format in the table.


The incremental mode lastmodified requires a column holding a date value (suitable types are date, time, datetime, and timestamp) containing information as to when each row was last updated. Sqoop will import only those rows that were updated after the last import. This column should be populated to the current time on every new row insertion or on a change to an existing row.
Internally, the lastmodified incremental import consists of two standalone MapReduce jobs. The first job will import the delta of changed data similarly to normal import. This import job will save data in a temporary directory on HDFS. The second job will take both the old and new data and will merge them together into the final output, preserving only the last updated value for each row.
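If the target directory already holds data from an earlier import, the merge step needs to know which column identifies a row; this is supplied with --merge-key. A sketch, assuming id is the primary key of tbl1:
$sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root -P --table tbl1 --incremental lastmodified --check-column last_update_date --last-value "2015-05-4 01:05:14" --merge-key id --target-dir /dateIncrementdirectory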

Case 12: Listing all the databases:
sqoop list-databases --connect jdbc:mysql://192.168.1.0 --username root -P
To view the list of all databases.


Case 13: Listing all the tables of a database:
sqoop list-tables --connect jdbc:mysql://192.168.1.0/db1 --username root -P
To view all the tables inside the given database.

Case 14: Playing with the file format: SequenceFile
Sqoop supports three different file formats; one of these is text, and the other two are binary. The binary formats are Avro and Hadoop's SequenceFile. You can enable import into SequenceFile using the --as-sequencefile parameter:
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --table persons2 --username root -P --as-sequencefile --target-dir /sf

(The default file format is the text file format, and it has some limitations:
1. it cannot store binary data types
2. it consumes more space)
These limitations are addressed by the sequence file format: since it stores everything in binary form, it consumes far less space than the text file format.

Still, the sequence file format has the limitation that it does not carry the schema of the data; if you want to use the data, you have to supply the schema separately. That problem is solved by the Avro file format, which stores the schema together with the data in the same file.

Case 15: Playing with the file format: Avro file
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --table persons2 --username root -P --as-avrodatafile --target-dir /avro
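Since every Avro data file carries its schema, you can pull one of the imported files down and inspect it with the avro-tools jar (the jar version and file name below are assumptions for illustration):
$hdfs dfs -get /avro/part-m-00000.avro .
$java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro
# prints the JSON schema that Sqoop generated from the table's metadata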


Case 16: Importing from two different tables (using a join)
Importing using complex (free-form) SQL queries.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root -P --query 'SELECT tbl1.id, tbl1.name, tbl2.country, tbl2.salary FROM tbl1 JOIN tbl2 USING (id) WHERE $CONDITIONS' --split-by id --target-dir /joindata

[A free-form query can hurt parallel processing, so it is advised to run the join in the database, save the result into a temporary table, and then import that table through Sqoop.]
To help Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the WHERE clause of your query. Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task.
We can also optimise the free-form query import by supplying a --boundary-query:
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root -P --query 'SELECT tbl1.id, tbl1.name, tbl2.country, tbl2.salary FROM tbl1 JOIN tbl2 USING (id) WHERE $CONDITIONS' --split-by id --boundary-query 'select min(id), max(id) from tbl1' --target-dir /joindata

Case 17: Creating Hive Table and loading data from HDFS
The biggest advantage of using Sqoop for populating tables in Hive is that it can automatically populate the metadata for you. If the table in Hive does not exist yet, Sqoop will simply create it based on the metadata fetched for your table or query.

sqoop create-hive-table --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --fields-terminated-by ','

[This will create the table in Hive; if you have already imported the data into HDFS, you can then load it from HDFS into the Hive table.]
LOAD DATA INPATH '/path/of/loaded/data' INTO TABLE tbl1;
[This three-step process of importing data to HDFS, creating the Hive table, and then loading the HDFS-resident data into Hive can be shortened to one step if you know that you want to import straight from the database directly into Hive by using --hive-import.]

Case 18: Importing Direct into the Hive Table
If the table already exists, Sqoop will import data into the existing table.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --hive-import

Case 19: Importing partitioned Hive tables
Sqoop supports Hive partitioning out of the box. In order to take advantage of this functionality, you need to specify two additional parameters: --hive-partition-key, which contains the name of the partition column, and --hive-partition-value, which specifies the desired value.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --hive-import --hive-partition-key Id --hive-partition-value 500
Sqoop mandates that the partition column be of type STRING. The current implementation is limited to a single partition level.

Case 20: Importing data into HBase
Instead of importing data into an HDFS file or a Hive table, you may want to transfer it into HBase, Hadoop's real-time database. Sqoop has out-of-the-box support for HBase. To enable import into HBase, you need to supply two additional parameters: --hbase-table and --column-family. The parameter --hbase-table specifies the name of the HBase table into which you want to import your data, and --column-family specifies the column family the data is written to.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --hbase-table tbl1 --column-family details
Sqoop can transfer data from different relational database systems, such as Oracle, MySQL, PostgreSQL, and Microsoft SQL Server, as well as from enterprise data warehouses, such as Netezza and Teradata.

Case 21 Importing from PostgreSQL Database
sqoop import --connect jdbc:postgresql://192.168.1.0/db1 --username root --password root --table tbl1 --target-dir tbl1data

Case 22: Importing from Oracle Database
sqoop import --connect jdbc:oracle:thin:@192.168.1.0:1521/db1 --username root --password root --table user.tbl1 --target-dir tbl1data

Case 23: Saving the connection string in a file
Create a file (connectionstring.txt) with the following contents:
list-tables
--connect
jdbc:mysql://localhost/db1

--username
root
--password
root
Save it on the local file system, then run:
$sqoop --options-file /full/path/of/file
Alternatively, you can put only the connection string in the file and supply the remaining parameters on the sqoop command line, as shown below.
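A minimal sketch of that second approach, keeping only the connection details in the options file and passing the tool and the remaining arguments on the command line (the file name connectionstring.txt is just an example):
# contents of /home/user/connectionstring.txt
--connect
jdbc:mysql://localhost/db1
--username
root
--password
root
# use it with any tool and add the remaining arguments on the command line
$sqoop import --options-file /home/user/connectionstring.txt --table tbl1 --target-dir /tbl1data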


EXPORT FROM HDFS TO RDBMS
Sqoop's export feature allows us to transfer data from the Hadoop ecosystem to relational databases. Instead of pulling data from the relational database with SELECT queries, Sqoop now writes to it using INSERT statements. Sqoop's export workflow matches the import case with slight differences. After you execute the Sqoop command, Sqoop connects to your database to fetch various metadata about your table, including the list of all columns with their appropriate types. Using this metadata, Sqoop generates and compiles the Java class. The generated class is used in the submitted MapReduce job that will export your data. As in import mode, no data is transferred through the Sqoop client itself; all transfers are done in the MapReduce job, with Sqoop overseeing the process from your machine.

Case 25: Exporting to RDBMS from HDFS
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1

Case 26: Inserting Data in Batches
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1 --batch
[--batch uses the JDBC batching interface to queue multiple INSERT statements together]
sqoop export -Dsqoop.export.records.per.statement=10 --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1
[This builds multi-row inserts, like INSERT INTO TBL1 VALUES (),(),()......;]
sqoop export -Dsqoop.export.statements.per.transaction=10 --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1
[This controls how many statements are committed per transaction]
These are three different methods of batching the export from HDFS to the RDBMS. We can also combine them to optimise the export, as sketched below.
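For example, a sketch combining both -D properties with --batch in a single export (the values of 10 are arbitrary and would need tuning for your data and database):
$sqoop export -Dsqoop.export.records.per.statement=10 -Dsqoop.export.statements.per.transaction=10 --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1 --batch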

Case 27: Exporting with All-or-Nothing
Sqoop will first export all data into this staging table instead of the main table that is present in the parameter --table. Sqoop opens a new transaction to move data from the staging table to the final destination, if and only if all parallel tasks successfully transfer data. Sqoop requires that the structure of the staging table be the same as that of the target table.
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --staging-table temp_tbl1 --export-dir /data1
Here, for a period of time the database holds two copies of the data, which consumes extra space. So after the export we may want to clear the staging table:

Case 28: Clearing Staging-table
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --staging-table temp_tbl1 --export-dir /data1 --clear-staging-table

Case 29: Data transfer from Hive or HBase to RDBMS
Sqoop does not support transferring data directly from Hive or HBase to an RDBMS. This is done with a two-step approach (sketched below):
1. Extract the data from Hive or HBase into HDFS (either as a text or Avro file)
2. Use Sqoop to export the output of the previous step to the RDBMS
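A minimal sketch of the two-step approach for a Hive table (the table name hive_tbl1 and the directory /tmp/hive_export are only illustrative; assumes Hive 0.11+, which supports ROW FORMAT on INSERT OVERWRITE DIRECTORY):
# Step 1: from Hive, write the table (or a query result) to an HDFS directory as delimited text
$hive -e "INSERT OVERWRITE DIRECTORY '/tmp/hive_export' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM hive_tbl1"
# Step 2: export that directory to the RDBMS, telling Sqoop how the fields are delimited
$sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /tmp/hive_export --input-fields-terminated-by ','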




Thank you!! If you have any doubts, please feel free to contact me at hadooppravin@gmail.com

