Auto Loader only supports loading data from directories (Block Storage)
Loading data from DB is not supported
Auto Loader uses Spark Structure Streaming under the hood and the command will run indefinitely
When reading data using Auto Loader we can even use the options that are specific to the data type being loaded
Checkpoint
When configuring Auto Loader to read data a checkpoint location needs to be specified this location houses a key value database (RockDB) which keeps track of all the files that are processed
This location also stores the schema of the data that is automatically inferred
Schema evolution is also performed automatically
The schema related metadata is stored in the _schemas
directory in the checkpoint location
Configure schema inference and evolution in Auto Loader | Databricks on AWS
Auto Loader when using JSON source loads all data as string
_rescued_data
column in the target table is automatically added by Auto Loader to capture records that are malformed and could not fit into the schema
For each batch of new files added into the target table a new entry is made in its history called STREAMING UPDATE
Tables/Views that are created based on the data that is loaded into a table by Auto Loader are auto updated
Unsupported Features
Sorting of data is not allowed on Streaming Data (To expensive and complex operation)
Multiple aggregation operations on the same dataset is not supported
Structured Streaming Programming Guide - Spark 3.5.0 Documentation
Triggers
Triggers are used to define the interval at which new data from the source will be persisted to the sink.
When a trigger is not specified then continuously new data will be saved into sink when the previous batch is completed
When the triggerNow
or trigger_once
triggers are used the resultant table created is static (i.e. not auto updated by upstream new data) and as such we can use any normal spark operations on this data
Structured Streaming Programming Guide - Spark 3.5.0 Documentation