Incremental import is a technique that imports only the newly added rows in a table. You must add the --incremental, --check-column, and --last-value options to perform an incremental import.
If you run the Sqoop import in 'lastmodified' mode, the assumption is that the table has a column holding a last-modified timestamp for each row, and that this column is updated whenever a row is inserted or modified.
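For that assumption to hold, something must keep the check column current. One way to do it in MySQL for the yloc table used below (an illustrative sketch, not part of the original walkthrough; it also assumes rDate can be converted from DATE to TIMESTAMP):
mysql -u root -p ydb -e 'ALTER TABLE yloc MODIFY rDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;'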
Consider a table with 3 records which you have already imported to HDFS using Sqoop:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2014-01-25 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
+------+------------+----------+------+------------+
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P
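You can verify the first import by reading the files Sqoop wrote (assuming Sqoop's default target directory, /user/<user>/<table>):
hdfs dfs -cat /user/$USER/yloc/part-m-*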
Now you have additional records in the table, but no updates on existing records:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2014-01-25 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
+------+------------+----------+------+------------+
Here you should use --incremental append with --check-column, which specifies the column to be examined when determining which rows to import.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root \
-P --check-column rank --incremental append --last-value 7
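To preview which rows this append import will pick up, you can run the equivalent bound query directly in MySQL (an illustrative sketch; the exact SQL Sqoop generates may differ):
mysql -u root -p ydb -e 'SELECT * FROM yloc WHERE `rank` > 7;'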
The append import brings in all new rows whose rank is greater than the last value (7). Now consider a second case, where some existing rows have been updated in addition to new rows being added:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2015-01-01 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
| 104 | Dallas | Texas | 4 | 2015-02-02 |
| 105  | Phoenix    | Arizona  | 17   | 2015-02-24 |
+------+------------+----------+------+------------+
Here we use --incremental lastmodified, which fetches both the new rows and the updated rows based on the date column.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
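Note that if the target directory already exists, lastmodified mode requires either --append or --merge-key so Sqoop knows how to reconcile updated rows with the files from earlier imports. A sketch with --merge-key (for illustration only: --merge-key needs a unique key column, and sid is not actually unique in this sample table):
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
--check-column rDate --incremental lastmodified --last-value 2014-01-25 \
--merge-key sid --target-dir yloc/loc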
A nice worked example of lastmodified mode:
https://shalishvj.wordpress.com/2014/08/12/sqoop-incremental-imports-using-last-modified-mode/
Example # 2
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba \
--password cloudera --table products --target-dir /lav/sqoop/retail_db1 \
--check-column product_id --incremental lastmodified --last-value 10
Running this in lastmodified mode with a check column that is not a timestamp or date fails with a runtime exception:
17/07/30 04:10:48 ERROR manager.SqlManager: Column type is neither timestamp nor date!
17/07/30 04:10:48 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Column type is neither timestamp nor date!
java.lang.RuntimeException: Column type is neither timestamp nor date!
at org.apache.sqoop.manager.ConnManager.datetimeToQueryString(ConnManager.java:787)
at org.apache.sqoop.tool.ImportTool.initIncrementalConstraints(ImportTool.java:332)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:498)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
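For a numeric check column like product_id, the fix is to use append mode instead; a sketch reusing the same connection parameters:
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba \
--password cloudera --table products --target-dir /lav/sqoop/retail_db1 \
--check-column product_id --incremental append --last-value 10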
Sqoop Jobs
You can use 'sqoop job --create' to create jobs for incremental imports, so that you can re-run them in the future without manually tracking new last values. For example, to save the append import from earlier as a job:
sqoop job --create job_name_1 -- import --connect jdbc:mysql://localhost:3306/ydb --table yloc \
--username root -P --check-column rank --incremental append --last-value 7
Running this command will not actually import or export any data; it only creates the job definition.
To see the list of saved Sqoop jobs:
sqoop job --list
To see a job's saved parameters:
sqoop job --show job_name_1
To execute a job:
sqoop job --exec job_name_1
Sqoop remembers the highest value of the check column from the previous run and automatically imports only the rows beyond it; it stores this last value in the job's metadata.
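You can inspect the stored value after a run (assuming the property is named incremental.last.value in your Sqoop version's --show output):
sqoop job --show job_name_1 | grep incremental.last.value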
Where does it store the last incremented value? In the Sqoop metastore. Note that it is on the local file system, not on HDFS.
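By default the metastore is a local HSQLDB instance under the home directory of the user who created the job (file names may vary by version):
ls ~/.sqoop/
# typically contains metastore.db.script and metastore.db.properties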