
dbt Full Refresh: Usage and Examples

dbt has revolutionized the way we handle data transformations in modern data stacks. It allows data analysts and engineers to transform raw data in the warehouse into meaningful datasets. One crucial aspect of dbt is the full refresh capability. This tutorial delves deep into this feature, helping you understand and implement it effectively.

Why dbt Full Refresh?

Imagine a scenario where your company has a model containing all user actions that gets materialized into a table. There are millions of actions in that table, so you set it up to load data incrementally. That means each day, only the actions from the past day get added to the table.

After a few weeks, you notice that manual updates made directly to the source table have caused it to drift slightly out of sync with the dbt user actions table. To make your model a perfect copy of the source data again, you need a mechanism to rebuild it entirely. This is where dbt's full refresh comes into play.

Configuring Full Refresh in dbt

To initiate a full refresh in dbt, pass the --full-refresh flag to dbt run or dbt build.

dbt run --full-refresh --models user_actions
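The same flag works with dbt build, which also reruns the model's tests after rebuilding it (--select is the newer alias for --models):

dbt build --full-refresh --select user_actions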

If you want certain models to always (or never) undergo a full refresh regardless of the flag, you can set the full_refresh configuration, either in a config block at the top of your model file or in your dbt_project.yml file:

{{ config(full_refresh = true) }}
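To set it from dbt_project.yml instead, scope the config to a model path. The project and folder names below are just placeholders:

# dbt_project.yml
models:
  my_project:
    events:
      # never rebuild these models, even when --full-refresh is passed
      +full_refresh: false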

dbt Incremental Models vs. Full Refresh

While full refreshes are powerful, they might not always be the most efficient choice, especially for large datasets. This is where incremental models come in. They allow dbt to process only the new or updated rows, saving time and computational resources.
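As a sketch, the user actions model from the introduction might look like this (the source and column names are illustrative):

{{ config(materialized='incremental') }}

select
    action_id,
    user_id,
    action_type,
    created_at

from {{ source('app', 'user_actions') }}

{% if is_incremental() %}
-- on incremental runs, only pull actions newer than the latest one already loaded
where created_at > (select max(created_at) from {{ this }})
{% endif %}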

In some cases, like the example from the introduction about millions of user actions, you might want to trigger incremental refreshes daily, then a full refresh once a week or once a month to ensure all historical data is accurate.

Another scenario where you might want a full refresh is when the logic of your incremental model changes: rows built with the old logic stay in the table until they are rebuilt.

Trigger Full Refresh on Schema Changes

The on_schema_change configuration gives you granular control over how an incremental model reacts when its columns change. For instance, if you add a new column, you can decide whether to:

  • Ignore it (ignore).
  • Fail the run (fail).
  • Append the new column (append_new_columns).
  • Sync all columns, including adding new ones and removing missing ones (sync_all_columns).

Note that none of these options backfills values for existing rows when a column is added; to populate historical data in a new column, you still need a full refresh.

Example:

{{ config(
  materialized='incremental', 
  on_schema_change='append_new_columns'
) }}

Working with Full Refresh Models

Scenario: Your e-commerce company wants to maintain a model of customer profiles. Over time, customers update their profiles, and you want to ensure the model is always up-to-date.

Step 1: Set up your dbt environment and create a new model named customer_profiles.

Step 2: Define your model as an incremental model, so that routine runs only process new or changed rows:

-- customer_profiles.sql

{{ config(
    materialized='incremental',
    unique_key='customer_id'
) }}

select
    customer_id,
    first_name,
    last_name,
    email,
    last_updated_date

from {{ source('shopify','customers') }}

{% if is_incremental() %}
-- on incremental runs, only pull rows newer than the latest one already loaded
where last_updated_date > (select max(last_updated_date) from {{ this }})
{% endif %}

Step 3: Run your model:

dbt run --models customer_profiles

Step 4: To simulate the drift from the introduction, manually modify some records in your shopify.customers source without updating their last_updated_date. An incremental run will miss these changes, because the where clause only picks up rows with a newer last_updated_date.
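For example, if you can run SQL directly against the warehouse (the record and values here are hypothetical):

-- a manual fix that deliberately leaves last_updated_date untouched
update shopify.customers
set email = 'corrected@example.com'
where customer_id = 42;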

Step 5: Run a full refresh to rebuild the customer_profiles model from scratch and pick up the manual changes:

dbt run --full-refresh --models customer_profiles

Step 6: Validate that the customer_profiles model now reflects the manual changes.
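One way to check, using the same hypothetical record (replace analytics with whatever schema your profile targets):

select customer_id, email, last_updated_date
from analytics.customer_profiles
where customer_id = 42;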

Best Practices

  • Large Datasets: For models built on especially large datasets, consider setting full_refresh: false so that a blanket --full-refresh run doesn't drop and recreate them unnecessarily (see the sketch after this list).
  • Schema Changes: If your source data schema changes frequently, consider using the on_schema_change configuration to handle these changes gracefully.
  • Optimization: Regularly monitor the performance of your dbt runs and adjust configurations as needed.
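For the first point, a minimal sketch of pinning full refresh off for a single large model:

{{ config(
    materialized='incremental',
    full_refresh = false
) }}

With this in place, dbt run --full-refresh leaves the model alone and it continues to build incrementally; the model-level config takes precedence over the command-line flag.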

Conclusion

The ability to perform a full refresh is a powerful feature in dbt, ensuring data accuracy and consistency. By understanding when and how to use it, you can optimize your data transformation processes and maintain high-quality datasets.
