My Tech Learnings: Sqoop : CLI

It's basically used for RDBMS / Hadoop data copy
It can be used to copy from and to RDBMS as needed.

Use Case # 1 ELT

Use Case # 2 : ETL for DWG

Use Case # 3 : Data Analysis
As RDBMS are not very scalable.Hadoop is.

Use Case # 4 : Data Archival

Use Case # 5 : Move Reports to Hadoop
Reports if they are too demanding.
Reports on archival data.

Note that if you choose to create HIVE tables with same schema as RDBMS then any existing reports will just work, by just modifying a connection string.

Use Case # 6 : Data Consolidation

Save all data of an org in Hadoop.

Scoop can be installed in

Client based model : execute command line
Server model : Can be deployed in service or server model.

Here is how scoop works

Execute sqoop cli with RDBMS connection tables , hadoop details , table names
Examine structure of tables in RDBMS
Generate java class to import data from RDBMS to Hadoop
Execute map only job

Example

sqoop import --connect jdbc:mysql://mysql_server/db_name --username foo --table myTable1
sqoop --options-file /user/loc/abc.txt --table=my_table1
use -P instead of --password in order to read password from console instead of props file.

Import
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba -P --table products --target-dir
if you dont give --target-dir by default data will reside in user home directory
to check "hdfs dfs -cat /dir/name"

can import columns of table
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba -P --table products --target-dir --columns "one,two,three"

can import subset using where condition

$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where “city =’sec-bad’” \
--target-dir /folder/name1

Where Clause
Sqoop import can also contain a where clause.
It must contain '$CONDITIONS' in WHERE clause.

if you give a query like

sqoop import --connect jdbc:teradata://192.168.xx.xx/DBS_PORT=1025,DATABASE=ds_tbl_db

--driver com.teradata.jdbc.TeraDriver --username dbc --password dbc

--query 'select * from reason where id>20' --hive-import

--hive-table reason_hive --target-dir <hdfs-location> -m 1

you get error
Query [select * from reason where id>20] must contain '$CONDITIONS' in WHERE clause

Solution
you have to add AND \$CONDITIONS
--query "select * from reason where id>20 AND \$CONDITIONS"

Reason
Sqoop requires to access metadata of table for example column type information. Placeholder $CONDITIONS is by default set to '1 = 0' to ensure that sqoop receives only type information. So,

after executing sqoop command
you will see first query that gets fired is with default $CONDITIONS.
Later on, it is substituted by different values defining different ranges based on number of mappers (-m) or --split-by column or --boundary-query so that entire data set can be divided into different data slices or chunks and chunks can be imported in parallel with as much as concurrency available.
Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task

Can force sqoop to skip parallel run by skipping $CONDITIONS & setting -numMapper=1. but this will cause MR to run sequentially.

split-by is used to slicing your data into multiple parallel tasks. Usually defaults to primary key of the main table.

Scoop CLI Parallel Import

This can greatly improve the performance of the import.
Default value for parallel data import is 4.
Uses splitting column magic.

but do not give value of parallel resource which is greater than number of available resources. this will hurt the performance greatly.

What scoop does is

Identify primary key from DB
get low and high values of it from DB
it will divide into 4 ( or more ) select queries ... it will give minimum & max values accordingly.
Each map operates on evenly sized data set
sqoop cannot split on more than 1 column

My Tech Learnings

Sunday, July 30, 2017

Sqoop : CLI

No comments:

About Me