Source freshness and maintenance tasks¶
Source freshness¶
If you have external data uploads, you may find it useful to check the freshness of the data. Check for freshness behaves the same way as wait of dependency. If the source is not fresh, the model will not be executed.
To enable source freshness check, you need to add freshness
section to your source's yaml file (see dbt docs).
sources:
- name: my_source
freshness:
error_after:
count: 15 # positive integer
period: hour # minute | hour | day
loaded_at_field: my_source_loaded_at # column or expression on which to check the freshness
Maintenance tasks¶
In each warehouse, there are some tasks to optimize the storage usage used by tables. In this tutorial, we will cover available tasks for Databricks.
Warning
Support of other warehouses will be added in the future.
Warning
Dbt macros will be moved to separate package in the future.
Tip
To use the maintenance tasks in your project, you need to copy files from macros directory to your dbt project.
Databricks maintenance tasks¶
Set TTL¶
To enable the automatic deletion of data, you can set the Time To Live (TTL) for a table. The TTL is the time in days that a record is kept in the table. After the TTL has expired, the record is deleted.
It's required to have a column with timestamp-like data type to use the TTL feature. Usually, it's a column that specifies the time when the record was created or updated. This column must monotonically increase with time.
To set the TTL for a table, use the maintenance
section in model's config:
config:
maintenance:
ttl:
key: # Table timestamp-like column name
expiration_timeout: # Table expiration timeout in days
additional_predicate: # Additional predicate for filtering expired data
force_predicate: # Force predicate for filtering expired data. It will override base predicate
Examples:
Records that are older than 10 days will be deleted:
Records that are older than 10 days and have status = 'accepted'
will be deleted:
config:
maintenance:
ttl:
key: etl_updated_dttm
expiration_timeout: 10
additional_predicate: "status = 'accepted'"
Records that have rejected
status will be deleted (note that expiration_timeout
is ignored here):
config:
maintenance:
ttl:
key: etl_updated_dttm # ignored
expiration_timeout: 10 # ignored
force_predicate: order_id in (select order_id from orders where status = 'rejected')
VACUUM (docs)¶
OPTIMIZE (docs)¶
Deduplicate Table¶
Deduplicates table using unique_key
and partition_by
columns.
Warning
This feature is not well documented. Please use it with caution.
Persist docs¶
Warning
This feature is not well documented. Please use it with caution.
List of Examples¶
- Basic Project: a single domain, small tests, and a single target.
- Advanced Project: several domains, medium and large tests, and different targets.
- Dependencies management: how to manage dependencies between models in different domains.
- Manual scheduling: domains with manual scheduling.
- Kubernetes tasks: how to run dbt models in Kubernetes.
- Integration with other tools: how to integrate dmp-af with other tools.
- [Preview] Extras and scripts: available extras and scripts.