Thursday 21 December 2017

How to play with csv data in Spark?

Let's see how we can read CSV data [in my case from the local file system] and convert the resulting RDD to a DataFrame using a case class.
Sample Data:-

1,Ram,Sita,Ayodha,32
2,Shiva,Parwati,Kailesh,33
3,Bishnu,Laxmi,Sagar,34
4,Brahma,Swarswati,Brahmanda,35
5,Krishna,Radha,Mathura,36

First, let's read the file:

scala> val rdd1=sc.textFile("file:///home/cloudera/Desktop/Sampledata") 

[if you want to read from HDFS, just use hdfs://hostname:8020/user/cloudera/Sampledata as the path]
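For example (hostname and port here are placeholders for your own NameNode address, not values from this post):

scala> val rdd1 = sc.textFile("hdfs://hostname:8020/user/cloudera/Sampledata")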

Now let's define the schema for the dataset using a case class:

scala> case class dataSchema(id: Int, name: String, wName: String,place: String,age: Int)

After defining the case class, let's map the schema onto the RDD:

scala> val newdataset1 = rdd1.map(line => line.split(",")).map(record => dataSchema(record(0).trim.toInt, record(1), record(2), record(3), record(4).trim.toInt)).toDF()

Here we apply the schema defined by the case class to the RDD rdd1 and convert it to a DataFrame.

scala> newdataset1.show() [it will show you the data]
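One note: toDF() works here because spark-shell already provides a SQLContext and imports its implicits for you. In a standalone Spark 1.x application (an assumption beyond this shell session, not part of the original example) you would need something like:

// standalone Spark 1.x application (not needed in spark-shell)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // brings toDF() into scope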

Screen Shot:- [rdd to dataframe]

We can also register the dataset as a table and query it with SQL.

Let's register the dataset as a table in Spark:

scala> newdataset1.registerTempTable("Newtbl")
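A quick note: on Spark 2.x, registerTempTable is deprecated in favour of createOrReplaceTempView, so the equivalent call there would be:

scala> newdataset1.createOrReplaceTempView("Newtbl")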

The dataset [newdataset1] is now registered as Newtbl.
Let's query the table using SQL syntax:

scala> newdataset1.sqlContext.sql("select * from Newtbl where age>33").collect.foreach(println)

This will run the SQL query and print the output. We can perform many complex tasks in this easy manner.
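The same filter can also be written with the DataFrame API instead of SQL; a small sketch using the DataFrame created above:

scala> newdataset1.filter("age > 33").show()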




Thank you!!!!


Wednesday 26 April 2017

Sqoop Use Cases


  • Importing all tables from a Database
Sqoop can import more than one table in a single command using the import-all-tables tool. With this tool we cannot use the --target-dir argument: the tables are imported sequentially, and because of the write-once nature of HDFS a second table cannot be written into a directory that already exists. Instead we use --warehouse-dir, which creates a parent directory in HDFS; each table then gets its own sub-directory inside it, allowing Sqoop to import multiple tables in a single command.
 $ sqoop import-all-tables --connect jdbc:mysql://192.168.0.1/db100 --username root --password root --warehouse-dir /tbl1_In_HDFS -m 2
  • Importing all tables from a Database excluding some
If a database contains, say, 10 tables and the user wants to import only 8 of them, Sqoop can import all tables while excluding the ones that are not needed. Suppose tbl8 and tbl9 should not be imported:
$ sqoop import-all-tables --connect jdbc:mysql://192.168.1.0/db100 --username root --password root --exclude-tables tbl8,tbl9 -m 2

  • How to import tables in parallel when there is no primary key
When the source table has no primary key (the key uniquely identifies each row, which is how Sqoop ensures the imported data is unambiguous), Sqoop cannot slice the table on its own. Either we do a sequential import (a single mapper, no parallelism), or we confirm that some column in the table is unique and not null and tell Sqoop to slice the table on that column with --split-by; Sqoop can then slice the table and import in parallel.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --split-by uniqueColName

  • Speeding up the import and export
When Sqoop imports or exports, it uses a generic JDBC connection to transfer data between the RDBMS and Hadoop, since JDBC drivers are available on both sides. But if Sqoop can use a native tool built specifically for bulk transfer with a particular database, the data moves much faster. To use such a native tool, it must also be installed on the task-tracker nodes, and transfers done through native utilities produce plain-text output only, not binary formats (SequenceFile, Avro, etc.). For example, MySQL has the native tools mysqldump and mysqlimport, and PostgreSQL has pg_dump, which are very efficient at moving data in and out of the database. Sqoop currently supports this for MySQL and PostgreSQL.
Simply by adding the --direct argument, Sqoop will use the corresponding native tool that it already knows about; for other databases you would also need to configure such native tools on the Hadoop cluster (task trackers).
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --direct

  • Importing incremental data (only newly added rows from the source table)
After we have imported a table into HDFS, the source table may receive new records. To import only these newly added records we use incremental import, which needs two extra pieces of information: the column to check (usually the primary key) and its last imported value. Sqoop then imports only the rows added after that value.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --incremental append --check-column id --last-value 200
After importing the incremental data, Sqoop prints information that is useful for creating the next incremental import:
         25/04/17 08:16:36 INFO tool.ImportTool: Incremental import complete!
    25/04/17 08:16:36 INFO tool.ImportTool:  --incremental append
    25/04/17 08:16:36 INFO tool.ImportTool:   --check-column id
    25/04/17 08:16:36 INFO tool.ImportTool:   --last-value 400
Here we need to preserve the column name and the last value for the next incremental import. Sqoop has a useful feature called a Sqoop job, where this information can be saved for future use.
  • Exploring Sqoop Job
Sqoop uses its built-in metastore to store this information. Users can create and execute jobs with the sqoop job tool. The most important benefit of the built-in metastore comes in conjunction with incremental import: Sqoop automatically serializes the last imported value back into the metastore after each successful incremental job, so users do not need to remember it between executions; everything is handled automatically.

$ sqoop job --create tbl1import -- import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /Tbl1 -m 2 --incremental append --check-column id --last-value 400

To see the list of Sqoop jobs created: $ sqoop job --list
To see the details of a job: $ sqoop job --show jobname
To execute a job: $ sqoop job --exec jobname
To delete a created job: $ sqoop job --delete jobname
To override arguments of a saved job at execution time: $ sqoop job --exec jobname -- -m 1

  • Importing data from more than one table (using complex SQL queries)
Importing data using a free-form query is expensive (it uses more resources and takes longer). Sqoop normally relies on a table's metadata to drive the import, but the result of a join is not a table in the database; it is just the outcome of the join query, so there is no table metadata for Sqoop to use. This makes the free-form import noticeably more resource- and time-consuming.
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --query 'select tbl1.id, tbl1.name,tbl2.P_Name, tbl2.phoneno from tbl1 join tbl2 USING(id) where $CONDITIONS' -m 2 --target-dir /freeform2 --split-by id

Sqoop uses the table's primary key by default and generates the query select min(col), max(col) from tbl (for table tbl and split column col). In the case of a free-form query import there is no table that Sqoop can use for fetching those values; instead, it uses the entire query specified on the command line as a subquery in place of the table name, resulting in a query like select min(col), max(col) from ($YOUR_QUERY). Sqoop therefore has to materialise the output result set before moving any data, just to obtain the import boundaries. If Sqoop is told the boundaries up front, without materialising the join query, the transfer is faster; that is what --boundary-query provides:
$ sqoop import --connect jdbc:mysql://localhost/movielens --username root --query 'select tbl1.id, tbl1.name,tbl2.P_Name, tbl2.phoneno from tbl1 join tbl2 USING(id) where $CONDITIONS' -m 2 --target-dir /freeform2 --split-by id --boundary-query "select min(id),max(id) from tbl1"

  •  Importing From RDBMS to HIVE [Special Case]
 ******SQOOP TO HIVE*******************
1) Import only the schema (only creates the table in Hive)
2) Import only the data (only loads the data into an existing Hive table)
3) Import schema + data (creates the table in Hive and loads the data)


sqoop create-hive-table --connect jdbc:mysql://localhost/movielens --username root --table genre
(Use this when the data is already in HDFS and you only want the table schema created in Hive.)

sqoop import --connect jdbc:mysql://localhost/movielens --username root --table genre --hive-import --hive-table genre
(Use this when the table already exists in Hive and you only need to load the data into it.)

sqoop import --connect jdbc:mysql://localhost/movielens --username root --table movie --hive-import --hive-table hivemovie --create-hive-table
(This is the most common case: create the table in Hive and load the data into it.)
  •  Exporting data from HDFS to RDBMS
Sqoop export works in the reverse direction of Sqoop import: it transfers data from HDFS to an RDBMS. It is used to push generated, processed, or backed-up data from Hadoop into your database. After you execute the command, Sqoop generates and compiles a Java class; the generated class is used in the submitted MapReduce job that exports your data.
As with import, no data is transferred through the Sqoop client itself. All transfers are done in the MapReduce job, with Sqoop overseeing the process.
The destination table must exist prior to exporting. Sqoop exports data by issuing INSERT statements.

$ sqoop export --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --export-dir /tbl1Hdfs/

Internally, Sqoop exports row by row, with each row inserted in a separate INSERT statement. There are several ways to make this more efficient. One is enabling JDBC batching with the --batch parameter: some database drivers can send multiple rows to the remote database inside one request for better performance, while others simply send each query separately.
$ sqoop export --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --export-dir /tbl1Hdfs/ --batch

Another option is to set the property sqoop.export.records.per.statement to a value of two or more, in which case Sqoop will create the following kind of query:
INSERT INTO table VALUES (...), (...), (...), ...; 
As the query is completely generated by Sqoop, the JDBC driver doesn’t alter it, sending it to the remote database as is.
$ sqoop export -Dsqoop.export.records.per.statement=10 --connect jdbc:mysql://192.168.1.1/db100  --username root --password root --table tbl11 --export-dir /tbl1Hdfs/

  • Export all data or nothing
Sometimes we need the export to be all-or-nothing: either the destination ends up with all the data, or it stays empty (we do not want half-exported, incomplete data). For this Sqoop uses a temporary table, identical in structure to the destination table, called the staging table. Sqoop first exports all data into this staging table (--staging-table) instead of the main table, and only if all parallel tasks successfully transfer data does it open a new transaction to move the data from the staging table to the final destination. We have to create the staging table ourselves, as Sqoop does not create tables in the RDBMS. This approach guarantees the data is exported fully or not at all, but it has drawbacks on the database side:
  1. For a period of time the database holds the data twice (once in the staging table and once in the final table), consuming double the storage.
  2. As the data is first loaded somewhere else and then moved to the final table, using a staging table will always be slower than exporting directly to the final table. 
$ sqoop export --connect jdbc:mysql://192.168.1.1/db100 --username root --password root --table tbl11 --staging-table staging_tbl1 --export-dir /tbl1HDfs

The staging table needs to be empty in order to end up with consistent data. You can specify the parameter --clear-staging-table to instruct Sqoop to clean the staging table automatically; if the database supports it, Sqoop will use a TRUNCATE operation to clear it as quickly as possible.
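For example, a sketch reusing the connection details and table names from the previous command (they are just the sample values from above, not required names):

$ sqoop export --connect jdbc:mysql://192.168.1.1/db100 --username root --password root --table tbl11 --staging-table staging_tbl1 --clear-staging-table --export-dir /tbl1HDfs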

Tuesday 18 April 2017

Sqoop Commands

Sqoop is mainly used for import and export, but it is capable of many other operations. We will look at different use cases of Sqoop.

  • To enquire the database
If you want to see all the databases on the server, you can do that through a Sqoop command as well.
$ sqoop list-databases --connect jdbc:mysql://192.168.0.1 --username root --password root
If you want to see the list of tables in a particular database:
$ sqoop list-tables --connect jdbc:mysql://192.168.0.1/db100 --username root --password root
If you want to see the result of an SQL query through Sqoop, there is a tool called eval, which executes the query and shows its result (the query can select, insert new rows, delete rows, update records, etc.).
$ sqoop eval --connect jdbc:mysql://192.168.0.1/db100 --username root --password root --query "select * from tbl1;"
[Here no import or export is done; the query is simply executed and the result is shown on your standard output.]

  • To import a table from the MySQL server
Import a table from the RDBMS into HDFS:
$ sqoop import --connect jdbc:mysql://192.168.0.1/db100 --username root --password root --table tbl1 --target-dir /tbl1_In_HDFS
This command imports table tbl1 from database db100 to the HDFS path /tbl1_In_HDFS.
--target-dir is the argument that takes an HDFS path as its value; the imported table is saved at that location.
Importing a subset of a table

A subset of a table is a selection from the base table: it may be a selected set of rows, a selected set of columns, or selected rows and columns together.

I) Selected number of rows
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /tbl10 --where "age>20"
II) Selected number of columns
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /tbl10 --columns ID,Name,Gender

 III) Selected number of rows as well as columns
$ sqoop import --connect jdbc:mysql://localhost/db100 --username root --password root --table tbl1 --target-dir /tbl10 --columns ID,Name,Gender --where "age>20"

 One of the main features of Sqoop is that it can import a table in parallel (using more than one mapper).
In all the import commands above we did not specify the number of mappers, so Sqoop uses its default of 4. If you look at the HDFS path where the import was done, you will see 4 part-m* files; each of these files is the output of one mapper.
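For instance, you can list the HDFS target directory of the first import above (the path is just the sample one used earlier) and you should see part-m-00000 through part-m-00003:

$ hdfs dfs -ls /tbl1_In_HDFS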



  • Controlling the Parallelism
Controlling the parallelism means defining the number of mappers in the Sqoop command. The degree of parallelism of the data import is determined entirely by the number of mappers you specify.
$ sqoop import --connect jdbc:mysql://192.168.0.1/db100 --username root --password root --table tbl1 --target-dir /tbl1_In_HDFS -m 2

Here -m (--num-mappers) specifies the number of mappers.
In this command the number of parallel imports will be 2, not the default 4.
Choosing a higher number of mappers is not the best way to optimise the import; it can actually make it slower. More mappers mean more connections hitting the database to read the table, more connections mean more load on the database, and a slower-responding database slows the import down. We should always use as few mappers as we reasonably can.
Importing with more mappers also creates more files in HDFS, which increases the NameNode load.

  • Importing more than one table
Sqoop can import more than one table in a single command using the import-all-tables tool. With this tool we cannot use the --target-dir argument: the tables are imported sequentially, and because of the write-once nature of HDFS a second table cannot be written into a directory that already exists. Instead we use --warehouse-dir, which creates a parent directory in HDFS; each table then gets its own sub-directory inside it, allowing Sqoop to import multiple tables in a single command.
 $ sqoop import-all-tables --connect jdbc:mysql://192.168.0.1/db100 --username root --password root --warehouse-dir /tbl1_In_HDFS -m 2

  • How does Sqoop slice a table?
When Sqoop executes an import with only one mapper, no slicing of the table is done; this is called a sequential import. When the command uses more than one mapper, the table has to be sliced.
When Sqoop executes the command it internally looks for the primary key, say column id (unique and not null), and runs the query "select min(id), max(id) from tbl1;". From this range it works out the span of rows and divides it by the number of mappers, then generates a condition for each mapper. For example, if the ids run from 1 to 10 and there are 2 mappers, Sqoop builds a jar whose mappers apply conditions like:
 select * from tbl1 where id > 0 and id <= 5;  (all rows meeting this condition are imported by the 1st mapper)
 select * from tbl1 where id > 5 and id <= 10; (all rows meeting this condition are imported by the 2nd mapper)

Here the main table ends up sliced into two output files: the first contains the rows that satisfy the first mapper's condition and the second contains the rows that satisfy the second mapper's condition.


To be Continued . . .

Sunday 16 April 2017

Sqoop Tutorial

Sqoop practicals
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases. Sqoop can be used to import data from a relational database such as MySQL or Oracle into the Hadoop Distributed File System (HDFS). It can also be used to transform data stored in Hadoop as part of a MapReduce job and then export the data back into a relational database.
After installing Sqoop successfully, you can import and export data between any RDBMS and HDFS in order to process it with Hive, MapReduce or Spark.


Installation of Sqoop
Download sqoop from the link http://mirror.fibergrid.in/apache/sqoop/1.4.6/
Download Mysql connector from the link. http://dev.mysql.com/downloads/connector/j/
After both downloads complete, go to the Downloads folder.
Step 1.
Untar sqoop .tar file
$tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Step 2.
Move the untarred Sqoop directory to /usr/local/sqoop-1.4.6

Step 3.
Then go to the sqoop-1.4.6 conf directory and rename sqoop-env-template.sh to sqoop-env.sh:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh

Step 4.
Add the Hadoop paths in sqoop-env.sh:
export HADOOP_COMMON_HOME=/usr/local/hadoop-2.7.0
export HADOOP_MAPRED_HOME=/usr/local/hadoop-2.7.0
Step 5.
open ~/.bashrc file
$sudo gedit ~/.bashrc
Configure the environment variables in .bashrc file
export SQOOP_HOME=/usr/local/sqoop-1.4.6
export SQOOP_CONF_DIR=$SQOOP_HOME/conf
export SQOOP_CLASS_PATH=$SQOOP_CONF_DIR
export PATH=$PATH:$SQOOP_HOME/bin
save and close
$ source ~/.bashrc   (to reload the .bashrc file)
Step 6.
Untar mysql-connector-java-5.1.38.tar.gz in Downloads folder
$tar -zxvf mysql-connector-java-5.1.38.tar.gz
From inside the untarred mysql-connector folder, copy mysql-connector-java-5.1.38-bin.jar into Sqoop's lib directory:
$cd /home/user/Downloads/mysql-connector-java-5.1.38/
$sudo cp mysql-connector-java-5.1.38-bin.jar $SQOOP_HOME/lib

Step 7.
To verify that the Sqoop installation is done properly:

$cd $SQOOP_HOME/bin
$ sqoop version
it should show-
16/08/28 21:48:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6
git commit id c0c5a81723759fa575844a0a1eae8f510fa32c25
Compiled by root on Mon Apr 27 14:38:36 CST 2015

After successfully configuring Sqoop, we are now ready to work with it.

In Sqoop's working mechanism only map tasks run; no reducer is required because the import and export process doesn't involve any aggregation.
Sqoop provides a number of tools (import, export, import-all-tables, list-databases, list-tables, eval, job, create-hive-table, etc.); import and export are the ones used most often.

Here I am assuming MySQL is hosted on 192.168.1.0 (if it is hosted on the same machine you can use localhost) and the database is db1.

IMPORT FROM RDBMS TO HDFS
The import tool is used when you want to transfer data from the relational databases into Hadoop.

sqoop import --connect jdbc:mysql://192.168.1.0/databasename --username root --password root --table tablename --target-dir /user/data
--connect: contains the JDBC URL of your database. The syntax of the URL is specific to each database.
--username, --password: the credentials Sqoop uses to connect to the database.
--table: the name of the table to transfer.
--target-dir: the directory on HDFS where Sqoop should import your data.

PROCESS:-
First, Sqoop will connect to the database to fetch the table metadata (the number of columns, their names, the associated data types, and whether the table is partitioned or not). At this point Sqoop does not transfer any data between the database and your machine; it is only querying the catalog tables and views. Based on the retrieved metadata, Sqoop generates a Java class and compiles it using the JDK and Hadoop libraries available on your machine.
Next, Sqoop will connect to your Hadoop cluster and submit a MapReduce job. Each mapper of the job will then transfer a slice of the table's data.
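If you want to look at that generated Java class without running an import, Sqoop's codegen tool will produce it on its own; a small sketch reusing the sample connection details from this post:

sqoop codegen --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1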


Data can be imported with Sqoop in different ways depending on the use case. Let's explore the Sqoop import cases.

Here the database is MySQL running on machine 192.168.1.0, the database is db1, and the credentials are username root and password root.



  • Case 1. Importing one table from database
    sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --target-dir tbl1data


    Case 2. Importing only a subset of a table

    A subset of a table brings three cases:
  •   Selected number of rows
  •   Selected number of columns
  •   Selected number of rows and columns
    Sqoop can import data in all three cases.
1. Selected number of Rows
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --where "age>20" --target-dir /tbl1data
Here only the rows which satisfy the condition will be imported.

2. Selected number of Columns
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --columns id,name,age --target-dir /tbl1data
Here only the id, name and age columns will be imported.

3. Selected number of Rows and Columns
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --columns id,name,age --where "age>20" --target-dir /tbl1data
Here only id, name and age will be imported, for rows where age is greater than 20.

Case 3. Importing all tables from a database
sqoop import-all-tables --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --warehouse-dir /database1
The tables are imported in a sequential manner, but the import of each single table can be done in parallel. By default Sqoop will create a directory with the same name as the imported table inside your home directory on HDFS and import all data there.
[Here --warehouse-dir is used instead of --target-dir, because --warehouse-dir creates a parent folder under which all the tables are imported. This doesn't happen with --target-dir; Sqoop will refuse to import into an existing directory to prevent accidental overwriting of data.]

Case 4. Importing all tables excluding some tables
sqoop import-all-tables --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --exclude-tables tbl1,tbl2 --warehouse-dir /database2

In this case all the tables inside database db1 will be imported except those listed in the --exclude-tables argument (tbl1, tbl2).

Case 5. Speeding up transfers
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl2 --direct --target-dir /datadirectory
[Rather than using the JDBC interface for transferring data, direct mode delegates the job of transferring data to the native utilities provided by the database vendor.]
Note: not all databases have native utilities available, so this mode is not supported for every database. Sqoop supports direct mode for MySQL and PostgreSQL only as of now.

Case 6. Overriding type mapping
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --map-column-java id=Long --target-dir /datadirectory
Here id is the column name and Long is the Java type you want to map it to. To change the types of multiple columns: --map-column-java col1=Float,col2=String,col3=String,...

Case 7. Controlling Parallelism
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --direct --target-dir /datadirectory --num-mappers 10   (or -m 10)
Note: it is not advisable to use a higher number of mappers unless the data size is very large. Sqoop launches 4 mappers by default, which is suitable for most cases. Increasing the number of mappers may put extra load on your source database server.

Case 8. Incremental Importing
When the source table keeps appending new rows and you need to periodically sync its state to Hadoop for further processing, we use incremental import. It takes two arguments, --check-column and --last-value; based on these, Sqoop imports only the rows added after the last value.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --incremental append --check-column Id --last-value 25 --target-dir /datadirectory

15/05/04 01:22:43 INFO tool.ImportTool: --incremental append
15/05/04 01:22:43 INFO tool.ImportTool: --check-column Id
15/05/04 01:22:43 INFO tool.ImportTool: --last-value 1500
15/05/04 01:22:43 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')


When running in incremental mode, Sqoop always prints out the value of the last imported row, which lets you easily pick up where you left off. Since it is hard to remember the last value and check column after every run, Sqoop lets you create and save a job; when the job is executed it automatically saves the last value for you.

Case 9. Preserving the last-value imported value (Sqoop job)
We can take advantage of the built-in Sqoop metastore, which allows you to save all parameters for later reuse. The Sqoop metastore is a powerful part of Sqoop that lets you retain your job definitions and easily run them anytime. Each saved job has a logical name that is used for referencing.
sqoop job --create jobname -- import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl2 --incremental append --check-column id --last-value 1500 --target-dir /datadirectory

Case 10. Sqoop Job commands
[Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job]
sqoop job --exec jobname      [Executes the sqoop job]
sqoop job --list              [Lists the sqoop jobs]
sqoop job --show jobname      [Describes the sqoop job]
sqoop job --delete jobname    [Deletes the sqoop job]

Case 11: Importing the table data with the Date Reference.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root -P --table tbl1 --incremental lastmodified --check-column last_update_date --last-value "2015-05-4 01:05:14" --target-dir /dateIncrementdirectory

Note: specify the date column with the correct format in the table.


The incremental mode lastmodified requires a column holding a date value (suitable types are date, time, datetime, and timestamp) containing information as to when each row was last updated. Sqoop will import only those rows that were updated after the last import. This column should be populated to the current time on every new row insertion or on a change to an existing row.
Internally, the lastmodified incremental import consists of two standalone MapReduce jobs. The first job will import the delta of changed data similarly to normal import. This import job will save data in a temporary directory on HDFS. The second job will take both the old and new data and will merge them together into the final output, preserving only the last updated value for each row.
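The same merge can also be run by hand with Sqoop's merge tool. A rough sketch only: the directories, jar file and class name below are illustrative placeholders (the jar and class are what Sqoop's code generation produces for the table), not values from the original post:

sqoop merge --new-data /dateIncrement_new --onto /dateIncrementdirectory --target-dir /dateIncrement_merged --jar-file tbl1.jar --class-name tbl1 --merge-key id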

Case 12: Listing all the databases:
sqoop list-databases --connect jdbc:mysql://192.168.1.0 --username root -P
To view the list of all databases.


case 13: Listing all the tables of a database:
sqoop list-tables --connect jdbc:mysql://192.168.1.0/db1 --username root -P
To view all the tables inside the given database.

case 14: Playing with the file format: SequenceFile
Sqoop supports three different file formats; one of these is text, and the other two are binary. The binary formats are Avro and Hadoop's SequenceFile. You can enable import into SequenceFile using the --as-sequencefile parameter:
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --table persons2 --username root -P --as-sequencefile --target-dir /sf

(The default file format is plain text, which has some limitations:
1. it cannot reliably store binary data types
2. it consumes more space)
These limitations are addressed by the sequence file format: since it stores everything in binary form, it consumes much less space than the text format.

Still, the sequence file format has the limitation that it does not carry the schema of the data; to use the data you need the schema from somewhere else. That problem is solved by the Avro file format, which stores the schema together with the data in the file.

case 15: Playing with the file format: Avro
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --table persons2 --username root -P --as-avrodatafile --target-dir /avro


Case 16: Importing from two different tables (using a join)
Importing using complex SQL queries.
sqoop import --connect jdbc:mysql://192.168.1.0/dn1 --username root -P --query 'SELECT tbl1.id, tbl1.name,tbl2.country,tbl2.salary FROM tbl1 JOIN tbl2 USING (id) WHERE $CONDITIONS' --split-by id --target-dir /joindata

[A free-form query affects parallel processing, so it is advisable to run the join in the database, save the result into a temporary table, and then import that table through Sqoop.]
To help Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the WHERE clause of your query. Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task.
We can optimise the free-form query by supplying --boundary-query:
sqoop import --connect jdbc:mysql://192.168.1.0/dn1 --username root -P --query 'SELECT tbl1.id, tbl1.name,tbl2.country,tbl2.salary FROM tbl1 JOIN tbl2 USING (id) WHERE $CONDITIONS' --split-by id --boundary-query "select min(id), max(id) from tbl1" --target-dir /joindata

Case 17: Creating Hive Table and loading data from HDFS
The biggest advantage of using Sqoop for populating tables in Hive is that it can automatically populate the metadata for you. If the table in Hive does not exist yet, Sqoop will simply create it based on the metadata fetched for your table or query.

sqoop create-hive-table --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --fields-terminated-by ','

[This will create the table in Hive; if you have already imported the data into HDFS, you can then load it from HDFS into the Hive table:]
load data inpath '/path/of/loaded/data' into table tbl1;
[This three-step process of importing data to HDFS, creating the Hive table, and then loading the HDFS-resident data into Hive can be shortened to one step if you know that you want to import straight from the database directly into Hive by using --hive-import.]

Case 18: Importing Direct into the Hive Table
If the table already exists, Sqoop will import data into the existing table.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --hive-import

Case 19: Importing into partitioned Hive tables
Sqoop supports Hive partitioning out of the box. In order to take advantage of this functionality, you need to specify two additional parameters: --hive-partition-key, which contains the name of the partition column, and --hive-partition-value, which specifies the desired value.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --hive-import --hive-partition-key Id --hive-partition-value 500
Sqoop mandates that the partition column be of type STRING. The current implementation is limited to a single partition level.

Case 20: Importing data into HBase
Instead of importing data into an HDFS file or a Hive table, you can transfer it into HBase, Hadoop's real-time database. Sqoop has out-of-the-box support for HBase. To enable import into HBase you need to supply two additional parameters: --hbase-table and --column-family. The parameter --hbase-table specifies the name of the table in HBase into which you want to import your data.
sqoop import --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --hbase-table tbl1 --column-family details
Sqoop can transfer data from different relational database systems, such as Oracle, MySQL, PostgreSQL, and Microsoft SQL Server, as well as from enterprise data warehouses, such as Netezza and Teradata.

Case 21 Importing from PostgreSQL Database
sqoop import --connect jdbc:postgresql://192.168.1.0/db1 --username root --password root --table tbl1 --target-dir tbl1data

Case 22: Importing from Oracle Database
sqoop import --connect jdbc:oracle:thin:@192.168.1.0:1521/db1 --username root --password root --table user.tbl1 --target-dir tbl1data

Case 23: Saving the connection string in a file
Create a file (connectingstring.txt) and put each option on its own line:
list-tables
--connect
jdbc:mysql://localhost/db1
--username
root
--password
root
Save it in the local file system.
$ sqoop --options-file /full/path/of/file
Or you can put only the connection string in the file and provide the remaining parameters on the sqoop command line.
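A small sketch of that second approach, assuming a file named conn.txt that holds only the --connect, --username and --password lines (the file name, table and target directory here are illustrative):

sqoop import --options-file /home/user/conn.txt --table tbl1 --target-dir /tbl1data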


             Exporting from HDFS to RDBMS
Sqoop's export feature allows us to transfer data from the Hadoop ecosystem to relational databases. Instead of reading from the database with SELECT queries as import does, export writes to the database using INSERT statements. Sqoop's export workflow matches the import case with slight differences. After you execute the Sqoop command, Sqoop connects to your database to fetch various metadata about your table, including the list of all columns with their appropriate types. Using this metadata, Sqoop generates and compiles the Java class; the generated class is used in the submitted MapReduce job that exports your data. As in import mode, no data is transferred through the Sqoop client itself: all transfers are done in the MapReduce job, with Sqoop overseeing the process from your machine.

Case 25: Exporting to RDBMS from HDFS
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1

Case 26: Inserting Data in Batches
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1 --batch
sqoop export -Dsqoop.export.records.per.statement=10 --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1
[It is like INSERT INTO TBL1 VALUES(),(),()......;]
sqoop export -Dsqoop.export.statements.per.transaction=10 --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /data1
These are three different methods of exporting data from HDFS to an RDBMS; they can also be combined to optimise the export.

Case 27: Exporting with All-or-Nothing
Sqoop will first export all data into this staging table instead of the main table that is present in the parameter --table. Sqoop opens a new transaction to move data from the staging table to the final destination, if and only if all parallel tasks successfully transfer data. Sqoop requires that the structure of the staging table be the same as that of the target table.
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --staging-table temp_tbl1 --export-dir /data1
Here, for a period of time the database holds two copies of the data, which consumes extra storage. So after exporting we may need to clear the staging table; to clear it:

Case 28: Clearing Staging-table
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --staging-table temp_tbl1 --export-dir /data1 --clear-staging-table

Case 29: Data transfer from Hive or HBase to RDBMS
Sqoop does not support transferring data directly from Hive or HBase to an RDBMS. This is done with a two-step approach, as sketched below:
1. Extract data from Hive or HBase into HDFS (either as a text or Avro file)
2. Use Sqoop to export the output of the previous step to the RDBMS
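A minimal sketch of those two steps for a Hive table, assuming Hive 0.11 or later and using illustrative names (Hive table tbl1, scratch directory /tmp/tbl1_export, and the MySQL connection from the earlier examples):

hive> INSERT OVERWRITE DIRECTORY '/tmp/tbl1_export' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM tbl1;
sqoop export --connect jdbc:mysql://192.168.1.0/db1 --username root --password root --table tbl1 --export-dir /tmp/tbl1_export --input-fields-terminated-by ','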




Thank you!! If you have any doubts, please feel free to contact me at hadooppravin@gmail.com
