Sunday, July 30, 2017

Sqoop : CLI

Sqoop is a command-line tool for bulk data transfer between an RDBMS and Hadoop.
It can copy data in both directions: import from an RDBMS into Hadoop and export back as needed.

Use Case # 1 : ELT

Use Case # 2 : ETL for DWH (data warehouse)
Use Case # 3 : Data Analysis
RDBMSs do not scale well for heavy analytical workloads; Hadoop does.

Use Case # 4 : Data Archival
Use Case # 5 : Move Reports to Hadoop
Move reports that are too demanding for the RDBMS.
Move reports that run on archived data.

Note that if you create Hive tables with the same schema as the RDBMS tables, any existing reports will just work; you only need to modify a connection string.
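As an illustrative sketch of this scenario (the host, database, user, and table names are made-up placeholders), a Hive import that preserves the schema might look like:

```shell
# Hypothetical example: import an RDBMS table into a Hive table of the same
# name and schema, so existing reports only need a new connection string.
sqoop import \
  --connect jdbc:mysql://mysql_server/reporting_db \
  --username report_user -P \
  --table sales_history \
  --hive-import \
  --create-hive-table \
  --hive-table sales_history
```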

Use Case # 6 : Data Consolidation
Save all data of an org in Hadoop.

Sqoop can be installed in
  1. Client model : you execute the sqoop command line directly
  2. Server model : Sqoop can also be deployed as a service (server model).
Here is how sqoop works
  1. Execute the sqoop CLI with the RDBMS connection details, table names, and Hadoop details
  2. Sqoop examines the structure of the tables in the RDBMS
  3. It generates a Java class to import the data from the RDBMS into Hadoop
  4. It executes a map-only MapReduce job
  • sqoop import --connect jdbc:mysql://mysql_server/db_name --username foo --table myTable1
  • sqoop --options-file /user/loc/abc.txt --table=my_table1
  • use -P instead of --password to read the password from the console rather than putting it on the command line or in an options file.
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba -P --table products --target-dir <hdfs-location>
If you don't give --target-dir, by default the data will reside in your HDFS home directory, in a subdirectory named after the table.
To check: "hdfs dfs -cat /dir/name"

You can import only selected columns of a table:
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba -P --table products --columns "one,two,three" --target-dir <hdfs-location>

You can import a subset of rows using a where condition:

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
-m 1 \
--where "city = 'sec-bad'" \
--target-dir /folder/name1

Free-Form Queries
A sqoop import can also use a free-form query (--query) instead of --table.
Such a query must contain the literal token '$CONDITIONS' in its WHERE clause.

if you give a query like

sqoop import --connect jdbc:teradata://192.168.xx.xx/DBS_PORT=1025,DATABASE=ds_tbl_db \
--driver com.teradata.jdbc.TeraDriver --username dbc --password dbc \
--query 'select * from reason where id>20' --hive-import \
--hive-table reason_hive --target-dir <hdfs-location> -m 1

you get error
Query [select * from reason where id>20] must contain '$CONDITIONS' in WHERE clause

you have to add AND \$CONDITIONS (the backslash escapes $ inside double quotes; in a single-quoted query use plain $CONDITIONS)
--query "select * from reason where id>20 AND \$CONDITIONS"

Sqoop needs to access the table's metadata, for example column type information. The placeholder $CONDITIONS is initially set to '1 = 0' to ensure that sqoop receives only type information, not rows. So,
  1. after executing the sqoop command,
  2. the first query that gets fired uses the default $CONDITIONS ('1 = 0').
  3. Later on, the placeholder is substituted with different range predicates, based on the number of mappers (-m), the --split-by column, or --boundary-query, so that the entire data set is divided into slices (chunks) that can be imported in parallel with as much concurrency as is available.
  4. Sqoop automatically substitutes the placeholder with generated conditions specifying which slice of the data should be transferred by each individual task.
You can avoid the parallel run by setting -m 1, which makes the import run sequentially. Note that with --query the $CONDITIONS token is still required; with a single mapper it is simply replaced by a trivial condition.
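For example (the connection details, query, and target path are placeholders), a sequential single-mapper import of a free-form query might look like:

```shell
# Hypothetical example: -m 1 forces a single mapper, so no range splitting
# occurs. The \$CONDITIONS token must still appear in the query; Sqoop
# replaces it with a trivial predicate instead of range conditions.
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_dba -P \
  --query "select * from products where product_price > 100 AND \$CONDITIONS" \
  --target-dir /user/retail_dba/expensive_products \
  -m 1
```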

--split-by specifies the column used to slice the data into multiple parallel tasks. It defaults to the primary key of the table.
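As a sketch (the table and column names are assumptions), a table without a usable primary key can be split on an explicitly chosen column:

```shell
# Hypothetical example: --split-by names the column Sqoop slices on; Sqoop
# finds MIN(order_id) and MAX(order_id) and divides that range among the
# mappers requested with -m.
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_dba -P \
  --table orders \
  --split-by order_id \
  -m 4 \
  --target-dir /user/retail_dba/orders
```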

Sqoop CLI Parallel Import

Parallel import can greatly improve the performance of the import.
The default degree of parallelism (number of mappers) is 4.
Sqoop splits the work using a splitting column.

But do not set the number of mappers higher than the number of available resources (map slots, database connections); that will hurt performance badly.

What sqoop does is
  • Identify the primary key of the table from the DB
  • Get the low and high values of that key from the DB
  • Divide the range into 4 (or more) select queries, each with its own min and max bounds
  • Each map task operates on an evenly sized slice of the data
  • Sqoop cannot split on more than 1 column
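To illustrate the splitting (the id range 1..100000 is invented for this sketch), a 4-mapper import of myTable1 would behave roughly like this:

```shell
# Sketch: Sqoop first runs SELECT MIN(id), MAX(id) FROM myTable1, then
# generates one bounded query per mapper, roughly:
#   mapper 1: SELECT * FROM myTable1 WHERE id >= 1     AND id < 25001
#   mapper 2: SELECT * FROM myTable1 WHERE id >= 25001 AND id < 50001
#   mapper 3: SELECT * FROM myTable1 WHERE id >= 50001 AND id < 75001
#   mapper 4: SELECT * FROM myTable1 WHERE id >= 75001 AND id <= 100000
sqoop import \
  --connect jdbc:mysql://mysql_server/db_name \
  --username foo -P \
  --table myTable1 \
  -m 4
```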
