Sunday, July 30, 2017

Sqoop : Incremental Imports

Incremental import is a technique that imports only the rows that are new since the previous import. To perform an incremental import you must add the --incremental, --check-column, and --last-value options.

Syntax
--incremental <mode> : specifies the incremental mode, either append or lastmodified.
--check-column <column name> : specifies the column to be examined when deciding which rows to import.
--last-value <value> : specifies the maximum value of the check column from the previous import; only rows beyond this value are imported.

If you run the Sqoop import in lastmodified mode, the assumption is that the table has a date or timestamp column that is updated whenever a row is inserted or modified.

Example

Consider a table with 3 records which you have already imported to HDFS using Sqoop:

+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2014-01-25 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
+------+------------+----------+------+------------+

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P

Now you have additional records in the table but no updates on existing records
 
+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2014-01-25 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
|  103 | Charlotte  | NC       |    9 | 2013-04-22 |
|  103 | Greenville | SC       |    9 | 2013-05-12 |
|  103 | Atlanta    | GA       |   11 | 2013-08-21 |
+------+------------+----------+------+------------+
Here you should use --incremental append with --check-column, which specifies the column to be examined when determining which rows to import.
 
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root \
 -P --check-column rank --incremental append --last-value 7

The above command imports only the new rows, i.e. those whose rank is greater than the last value (7). Now consider a second case, where existing rows have also been updated:
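Under the hood, append mode adds a boundary condition on the check column to the query it runs against the source table. A rough, hypothetical sketch of that condition (not the literal SQL Sqoop generates; table and column names are taken from the example above):

```shell
# Hypothetical sketch of the boundary condition in append mode: only rows whose
# check column is strictly greater than --last-value are selected.
CHECK_COLUMN="rank"
LAST_VALUE=7
echo "SELECT * FROM yloc WHERE ${CHECK_COLUMN} > ${LAST_VALUE}"
# → SELECT * FROM yloc WHERE rank > 7
```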

+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2015-01-01 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
|  103 | Charlotte  | NC       |    9 | 2013-04-22 |
|  103 | Greenville | SC       |    9 | 2013-05-12 |
|  103 | Atlanta    | GA       |   11 | 2013-08-21 |
|  104 | Dallas     | Texas    |    4 | 2015-02-02 |
|  105 | Phoenix    | Arizona  |   17 | 2015-02-24 |
+------+------------+----------+------+------------+

Here we use --incremental lastmodified, which fetches all rows whose rDate is more recent than the last value. (When the target directory already exists, lastmodified mode also requires --append or --merge-key.)
 
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
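In lastmodified mode the window is bounded on both sides: rows from the last value up to (but not including) the time the new import starts. A rough, hypothetical sketch of that condition (not the literal SQL Sqoop generates; the upper bound below is a stand-in for the actual job start timestamp):

```shell
# Hypothetical sketch of the window in lastmodified mode: rows whose check
# column falls in [last-value, import start time).
CHECK_COLUMN="rDate"
LAST_VALUE="2014-01-25"
IMPORT_START="2015-03-01"   # stand-in; Sqoop uses the real job start timestamp
echo "SELECT * FROM yloc WHERE ${CHECK_COLUMN} >= '${LAST_VALUE}' AND ${CHECK_COLUMN} < '${IMPORT_START}'"
```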
 
Nice Example
https://shalishvj.wordpress.com/2014/08/12/sqoop-incremental-imports-using-last-modified-mode/
 
Example # 2


sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba \
--password cloudera --table products --target-dir /lav/sqoop/retail_db1 \
--check-column product_id --incremental lastmodified --last-value 10

Tried running in lastmodified mode with a check column that is not a timestamp.
Got a runtime exception:


17/07/30 04:10:48 ERROR manager.SqlManager: Column type is neither timestamp nor date!
17/07/30 04:10:48 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Column type is neither timestamp nor date!
java.lang.RuntimeException: Column type is neither timestamp nor date!
 at org.apache.sqoop.manager.ConnManager.datetimeToQueryString(ConnManager.java:787)
 at org.apache.sqoop.tool.ImportTool.initIncrementalConstraints(ImportTool.java:332)
 at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:498)
 at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
 at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
 at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

Sqoop Jobs
You can use 'sqoop job --create' to create saved jobs for incremental imports, so that they can be re-run later without specifying the new last value by hand.

sqoop job --create job_name_1 -- import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value 7

(Everything after the '--' separator is the tool to run and its arguments; note the space between '--' and 'import'.)

Running this command will not actually import or export any data; it only creates the job definition.

To see the list of saved jobs:
sqoop job --list

To see the details of a job:
sqoop job --show job_name_1

To execute a job:
sqoop job --exec job_name_1

Sqoop remembers the last highest value of the check column and automatically continues from it on the next run.
It stores this last value in the job's metadata.


Where does it store the last incremented value?

Sqoop stores it in its job metastore, which by default lives under $HOME/.sqoop.

Note it is on the local file system, and not on HDFS.
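On a machine with Sqoop installed you can inspect the metastore files directly. The commands below assume the default private metastore (HSQLDB files under $HOME/.sqoop); the exact file and property names may vary with the Sqoop version:

```shell
# Default location of the private job metastore on the local file system
ls ~/.sqoop
# The saved boundary for each job is recorded as the
# incremental.last.value property inside the metastore script file
grep incremental.last.value ~/.sqoop/metastore.db.script
```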


