Incremental Load
Using Holistics Import allows you to quickly ETL/sync data from different sources into your data-warehouse automatically.
As your source data grows everyday, incremental Import allows you to load data incrementally, loading only the data that's been changed. This reduces the need to load old/untouched data.
#
When Your Past Data Records Don't ChangeIf your past records/data don't change (example: weblogs, click streams, etc), on every run, Holistics simply figures out the new data that hasn't been imported over and performs the import.
#
Append ModeBased on the schedule you set, this will insert new records to the destination table, while keeping old records in the destination table. This works well for tables that only have new records or rows added to it without changes to old records, especially for tables without an incremental column.
#
Incremental ModeYou need to specify to Holistics what is your increment_column
. It will rely on this column to decide what data to extract from source to load into the data warehouse.
Examples of increment_column:
- ID (auto-increasing primary key)
- timestamp / created_at: timestamp at which the record is created
#
When Your Past Data Records ChangeThis is when your past records data can change over time (example would be bookings
with status
column).
In this case, Holistics would require you to specify an additional column called primary_key
(or primary key / unique column). This is needed to perform an upsert operation (or INSERT ON DUPLICATE OVERWRITE), where existing records in your data-warehouse will be overwritten.
In the above case, your table would need to contain 2 columns:
updated_at
: timestamp at which your record is last updated. Configured asincremental_column
id
: table's unique or primary key. Configured asprimary_key
#
When You Want To Import Records In A Custom RangeBy default, incremental import only imports the records whose incremental_column
value is greater than the current maximum value in the destination table.
In case you want to specify a range of values for the incremental_column
, you can use 'Custom Range' import execution mode.
#
NOTE- Currently, Custom Range import only supports importing data from MongoDB and some SQL databases.
- To use Custom Range import, you need to use Holistics CLI. See this doc to learn how to set up CLI and use it.
#
UsageFirst, you also need to create a Data Import in Incremental mode (similar to the above section).
Again, you can also the primary_key
column to overwrite existing records in the destination table similar to this section.
Then, execute your Data Import in Custom Range execution mode by entering the following command in Holistics CLI:
Arguments:
-j
: The id of your Data Import--custom-range
: Specify the execution mode to be Custom Range mode--range-start
: The starting value (inclusive) of theincremental_column
which will be chosen for import. Date formatYYYY-mm-dd
--range-end
: The ending value (exclusive) of theincremental_column
which will be chosen for import. Date formatYYYY-mm-dd
:::info Note
- The
incremental_column
's data type should bedate/datetime/timestamp
or any type that can be compared correctly using>=
and<
. - Destination Data Type of
incremental_column
should match the data type of sourceincremental_column
. :::
#
ExampleLet's say we have a Data Import id 1
, which is configured to import data from this table:
We execute this command:
As the result, all records whose updated_at
is from Day 2
upto (but not including) Day 4
will be imported to the destination table.
#
Handling Deleted RecordsIn either of the above case, if you have records in a source that get hard-deleted, doing incremental import will not be able to capture these deleted records and apply in the destination table. In this case, there are 2 ways to handle this.
#
Doing Regular Full ImportYou can schedule a full import every week to reload the entire table data.
#
Moving To Soft-DeleteThe recommended ways to handle this is to move your application logic to use soft-delete (i.e just mark the record as deleted instead of actually deleting it). Usually, this is implemented by:
- Adding a
deleted_at
column to your table, default toNULL
(not deleted). A record with non-NULL value is considered deleted. - When a record is deleted, set both
deleted_at
andupdated_at
to the current time.
With this setup, the Incremental approach as described above will work as normal.
#
A Note On Soft-DeleteNote that aside from serving the incremental import purpose above, soft-deletion is also our recommended way:
Operations: You don't lose track of old data - a hard-deleted record is gone forever and you would never be able to restore it. A user is accidentally deleted cannot be restored. A deleted order cannot be traced back.
Analytics: Your records don't just sit alone but it has relational with other data (a booking would have user_id). If we hard-delete the user, we won't be able to make sense of the booking data any longer.