RDB to HDFS
Incremental Loading of Data
Load only specific columns from a table
Import all tables from RDB (all tables should have a primary key)
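A minimal sketch of these imports, assuming a hypothetical MySQL database retail_db, a table emp with columns empno, ename and sal, and a user sqoop_user (all placeholder names). Incremental loading is shown under Options below.

# Basic import of one table into HDFS (connection string, user and table are placeholders)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp \
  --target-dir /user/<username>/emp

# Import only specific columns from the table
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp \
  --columns "empno,ename,sal" \
  --target-dir /user/<username>/emp_cols

# Import every table in the database (each table needs a primary key)
sqoop import-all-tables \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P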
NOTE
- When using a text column for --split-by, an extra parameter needs to be enabled: -Dorg.apache.sqoop.splitter.allow_text_splitter=true (see the example after these notes)
- The data is stored on HDFS under /user/<username> (the Sqoop staging area). We can save the data to other locations, but it is always loaded into the staging area first.
- Sqoop can only load data into a given target directory once (the existing data needs to be dropped to reload the entire dataset)
- For tables that don’t have a primary key we need to specify the number of mappers (e.g. -m 1) or a --split-by column (default mappers: 4)
- Mappers specify the number of parallel processes that will be used to load the data
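A hedged example tying these notes together, assuming a hypothetical table log_events that has no primary key and a text column event_id:

# No primary key: split on a text column, so the text splitter property must be enabled;
# --num-mappers sets how many parallel processes load the data (-m 1 would avoid splitting entirely)
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table log_events \
  --split-by event_id \
  --num-mappers 4 \
  --target-dir /user/<username>/log_events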
HDFS to RDB
IMPORTANT
Before exporting data, make sure the target table already exists; if not, an error will be thrown.
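A sketch of an export, assuming the target table emp_export has already been created in the hypothetical retail_db database:

# Export data from HDFS back into an existing RDB table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp_export \
  --export-dir /user/<username>/emp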
RDB to Hive
We don’t have to create a table in Hive; it can be created automatically. The data is first written into the staging area and then loaded into Hive. If the same data already exists in the staging area, the job will fail.
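A minimal sketch of a Hive import, using the same placeholder database and table names as above:

# Import into Hive; the table is created automatically if it does not exist,
# and the files are staged on HDFS before being loaded into the Hive warehouse
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp \
  --hive-import \
  --hive-table default.emp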
Options
Location to save data
--target-dir <hdfs-dir>
Filter Data
--where "ename='David'"
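For example (placeholder connection details and table), importing only the rows that match the filter:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp \
  --where "ename='David'" \
  --target-dir /user/<username>/emp_david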
Incremental loading from RDB into HDFS
--incremental <append|lastmodified> --check-column <column-name> --last-value <value>
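A hedged example of an append-mode incremental load, assuming empno only ever increases and 100 was the last value imported previously (placeholder names):

# Only rows with empno > 100 are appended to the existing target directory
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp \
  --incremental append \
  --check-column empno \
  --last-value 100 \
  --target-dir /user/<username>/emp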
Instead of hardcoding the password, a prompt will be shown
-P
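For example (placeholder connection details):

# -P asks for the password at runtime instead of exposing it on the command line
sqoop import --connect jdbc:mysql://dbhost:3306/retail_db --username sqoop_user -P --table emp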
Creates a new managed table in Hive
--create-hive-table
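For example, combined with --hive-import (placeholder names); note that the job fails if the Hive table already exists:

# Creates default.emp as a managed Hive table; fails if it already exists
sqoop import \
  --connect jdbc:mysql://dbhost:3306/retail_db \
  --username sqoop_user -P \
  --table emp \
  --hive-import \
  --create-hive-table \
  --hive-table default.emp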