Incremental import is a technique that imports only the rows that are new (or newly updated) in a table. The '--incremental', '--check-column', and '--last-value' options are required to perform an incremental import.
Following Syntax
--incremental <mode> : specifies the mode, either 'append' or 'lastmodified'.
--check-column <column name> : specifies which column will be examined to determine the incremental value.
--last-value <last check column value> : specifies the maximum value of the check column from the previous import.
If you run the sqoop import in 'lastmodified' mode, the assumption is that the table has a last-modified column of type timestamp or date that is updated whenever a row changes.
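For instance, a minimal append-mode import might look like this (the connection string, table name, and column names below are placeholders, not from a real setup):

sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username myuser -P \
  --table mytable \
  --incremental append \
  --check-column id \
  --last-value 100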
Example
Consider a table with 3 records which you have already imported to HDFS using Sqoop:
+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2014-01-25 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
+------+------------+----------+------+------------+
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P
Now suppose new records have been added to the table, but no existing records have been updated:
+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2014-01-25 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
|  103 | Charlotte  | NC       |    9 | 2013-04-22 |
|  103 | Greenville | SC       |    9 | 2013-05-12 |
|  103 | Atlanta    | GA       |   11 | 2013-08-21 |
+------+------------+----------+------+------------+
Here you should use '--incremental append' with '--check-column', which specifies the column to be examined when determining which rows to import.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
  --check-column rank --incremental append --last-value 7
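At the end of an incremental import, Sqoop logs the value to pass as --last-value on the next run. Here the new maximum rank is 11, so a follow-up import would look like this (a sketch, using the same connection details as above):

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
  --check-column rank --incremental append --last-value 11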
The above command will append all new rows whose rank is greater than the last value. Now consider a second case, where existing rows are updated:
+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2015-01-01 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
|  103 | Charlotte  | NC       |    9 | 2013-04-22 |
|  103 | Greenville | SC       |    9 | 2013-05-12 |
|  103 | Atlanta    | GA       |   11 | 2013-08-21 |
|  104 | Dallas     | Texas    |    4 | 2015-02-02 |
|  105 | Phoenix    | Arizona  |   17 | 2015-02-24 |
+------+------------+----------+------+------------+
Here we use '--incremental lastmodified', which fetches all rows inserted or updated after the given date, based on the rDate column:
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
  --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
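One caveat: in recent Sqoop 1.4.x releases, a lastmodified import whose target directory already exists must also pass --append or --merge-key; with --merge-key, Sqoop folds the updated rows into the existing data instead of duplicating them. A sketch, using a hypothetical unique key column id (the sample table above shows no unique key):

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
  --check-column rDate --incremental lastmodified --last-value 2014-01-25 \
  --target-dir yloc/loc --merge-key id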
Nice Example
https://shalishvj.wordpress.com/2014/08/12/sqoop-incremental-imports-using-last-modified-mode/
Example # 2
sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba \
  --password cloudera --table products --target-dir /lav/sqoop/retail_db1 \
  --check-column product_id --incremental lastmodified --last-value 10
This was run in lastmodified mode with a check column that is not a timestamp, and it failed with a runtime exception:
17/07/30 04:10:48 ERROR manager.SqlManager: Column type is neither timestamp nor date!
17/07/30 04:10:48 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Column type is neither timestamp nor date!
java.lang.RuntimeException: Column type is neither timestamp nor date!
    at org.apache.sqoop.manager.ConnManager.datetimeToQueryString(ConnManager.java:787)
    at org.apache.sqoop.tool.ImportTool.initIncrementalConstraints(ImportTool.java:332)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:498)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
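Since product_id is an integer column, append mode is the right fit here. A corrected sketch (the new target directory name is illustrative):

sqoop import --connect jdbc:mysql://localhost/retail_db --username retail_dba \
  --password cloudera --table products --target-dir /lav/sqoop/retail_db2 \
  --check-column product_id --incremental append --last-value 10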
Sqoop Jobs
You can use 'sqoop job --create' to create jobs for incremental imports, so that they can be re-run in the future without retyping the parameters.
sqoop job --create job_name_1 -- import <import arguments>
Running this command will not actually import or export any data; it will just create the job definition.
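For example, a saved job wrapping the earlier append import could be created like this (a sketch reusing the yloc example above; note the space after '--' before the tool name):

sqoop job --create job_name_1 \
  -- import \
  --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P \
  --check-column rank --incremental append --last-value 7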
To see a list of sqoop jobs:
sqoop job --list
To see additional details of a job:
sqoop job --show job_name_1
To execute a job:
sqoop job --exec job_name_1
Sqoop remembers the last highest value of the check column and automatically resumes from the rows after it on the next run.
It stores this last highest value in its metastore.
Where does it store the last incremented value? By default, the private metastore lives under ~/.sqoop in the home directory of the user who created the job.
Note that it is on the local file system, and not on HDFS.
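A quick way to check, assuming the default private metastore location:

ls ~/.sqoop

This directory typically holds metastore.db.properties and metastore.db.script (HSQLDB files), which contain the saved job definitions along with their last recorded incremental values.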