Leveraging DBT as a Data Modeling tool
By Julien Kervizic

DBT is a tool that aims to facilitate the work of analysts and data engineers in transforming and modeling data within a data warehouse. It provides a command-line interface as well as a documentation server and an RPC server.

After more than a year working with DBT, I thought it would be good to reflect on what it offers, what it is currently lacking, and what features it would be desirable to see incorporated into the tool.


Jinja capabilities

Jinja is a Python templating engine used in data tools such as Airflow and Superset, as well as in infrastructure-as-code tools such as Ansible.

DBT leverages Jinja both as a wrapper around its models and as a way to provide configuration objects and define macros, which models can then re-use. In addition, DBT uses Jinja templates within its core inner workings; as a result, Jinja ends up being a first-class citizen in the tool.
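
As a rough illustration (the model, column, and macro names below are made up), a typical DBT model mixes SQL with Jinja for its configuration object, for references to other models, and for re-using project macros:

    -- models/customer_orders.sql (illustrative sketch; names are assumptions)
    {{ config(materialized='table') }}    -- Jinja configuration object wrapping the model

    select
        customer_id,
        count(*) as order_count,
        -- re-using a hypothetical project macro defined under macros/
        sum({{ cents_to_dollars('amount_cents') }}) as total_amount
    from {{ ref('stg_orders') }}          -- ref() resolves to the stg_orders model's relation
    group by customer_id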

I have previously written on the use of templating SQL with Jinja for increased code re-use and legibility. Contrary to a typical Jinja implementation, DBT's does not make it as easy to enrich its functionality as other tools do. DBT does not currently offer template inheritance or a way to define custom functions through a context processor.
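
For comparison, template inheritance in plain Jinja looks roughly like the sketch below; this is standard Jinja syntax rather than something exposed to DBT models:

    {# base.sql.j2 — a plain Jinja parent template defining overridable blocks #}
    select *
    from {% block source_table %}raw.events{% endblock %}
    where {% block filter %}1 = 1{% endblock %}

    {# weekly_events.sql.j2 — a child template overriding only the blocks it needs #}
    {% extends "base.sql.j2" %}
    {% block filter %}event_date >= current_date - 7{% endblock %}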

Model Features

Snapshot: DBT offers a feature for "snapshotting" data, which creates a "proxy" SCD type 2 table out of an input dataset. Snapshots are particularly sensitive to granularity changes and can result in table explosion if the grain is not adequately defined; for this reason, snapshotting should only be used on already normalized data. Compared to an approach that leverages an orchestration tool such as Airflow to take periodic (hourly, daily) snapshots of the data, DBT snapshots result in a more compact dataset, closer to traditional data warehousing than to the functional data engineering approach typical of Airflow. We quickly moved away from this feature, given the nature of some of our datasets, e.g., event state changes and cumulative record file transfers.
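
For reference, a DBT snapshot is defined as a Jinja block; the sketch below assumes a hypothetical orders source with an updated_at column:

    {% snapshot orders_snapshot %}

    {{
        config(
            target_schema='snapshots',
            unique_key='order_id',
            strategy='timestamp',
            updated_at='updated_at'
        )
    }}

    -- The grain of this query defines the grain of the SCD2 table:
    -- if it is not unique per order_id, the snapshot table can explode in size.
    select * from {{ source('app', 'orders') }}

    {% endsnapshot %}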

Incremental: DBT supports the incremental loading of data. It requires some configuration to specify that the model should be treated as incremental, and some templating to make sure only new data gets inserted into the model.
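
A minimal sketch of such a model, assuming a hypothetical events source with an inserted_at timestamp:

    -- models/events_incremental.sql (illustrative sketch; table and column names are assumptions)
    {{ config(materialized='incremental', unique_key='event_id') }}

    select *
    from {{ source('app', 'events') }}

    {% if is_incremental() %}
      -- on incremental runs, only pick up rows newer than what is already in the target ({{ this }})
      where inserted_at > (select max(inserted_at) from {{ this }})
    {% endif %}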

The default setup, however, has some limitations. There is, for example, no automatic support for migrations: when a column in a DBT model is modified or added, the model needs to be fully refreshed, or it risks failing at the next DBT run. This creates an extra cognitive burden and maintenance need that is not easy to justify for smaller datasets; for larger datasets, on the other hand, it often proves to be a viable tradeoff.

On the DBT Discourse channels, there is quite a lot of information on the limitations of this approach, and the dbt-utils package does provide an additional incremental option through insert_by_period. While this is not a package I have personal experience with, from the documentation it seems like a solution I would consider for the initial building stages of a data platform.
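
Based on the dbt-utils documentation (I have not used it myself, and adapter support is limited), an insert_by_period model looks roughly like the sketch below; the __PERIOD_FILTER__ placeholder is replaced by the materialization with a filter for each period being loaded:

    -- models/events_by_period.sql (sketch based on the dbt-utils docs; names and dates are assumptions)
    {{
        config(
            materialized='insert_by_period',
            period='day',
            timestamp_field='created_at',
            start_date='2020-01-01',
            stop_date='2020-06-01'
        )
    }}

    with events as (
        select *
        from {{ ref('stg_events') }}
        -- replaced at run time with the filter for the period being processed
        where __PERIOD_FILTER__
    )

    select * from events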

Partitions and Transactions: DBT supports partitioning only for particular databases; it is supported for BigQuery, for instance, but not for Postgres. Postgres and Redshift instead have support for transactions, which sometimes has undesired effects given the way DBT runs specific queries. Wrapping operations such as adding indices or primary keys outside the transaction has provided a more consistent and predictable experience.
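
On Postgres/Redshift, one way to do this is DBT's after_commit helper, which runs a hook outside the transaction wrapping the model build; a hedged sketch, with an assumed index name and column:

    -- models/orders.sql (illustrative sketch; index and column names are assumptions)
    -- after_commit runs the hook statement outside the transaction wrapping the model build
    {{
        config(
            materialized='table',
            post_hook=[
                after_commit("create index if not exists idx_orders_customer on {{ this }} (customer_id)")
            ]
        )
    }}

    select * from {{ ref('stg_orders') }}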

Documentation

DBT can automatically generate a static website containing model definitions and lineage from the DBT source code. The tool also allows for rich text through markdown documentation blocks; model and field documentation can be populated by referencing these blocks in the model definition. Another approach is to provide field documentation in a consolidated YAML file within the DBT project.
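
A minimal sketch of both approaches, assuming a hypothetical orders model with a status field:

    {# models/docs.md — a markdown documentation block #}
    {% docs orders_status %}
    Status of the order: one of 'placed', 'shipped', 'completed' or 'returned'.
    {% enddocs %}

    # models/schema.yml — field documentation, referencing the block above
    version: 2
    models:
      - name: orders
        description: One row per order placed on the webshop.
        columns:
          - name: status
            description: "{{ doc('orders_status') }}"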

Both of these approaches have the benefit of tying the documentation to the changes being made. This makes it possible to enforce in code reviews that data engineers populate and update the documentation when they commit their changes, so that no code ends up in production without appropriate documentation.

Development & Testing Capabilities

Data Quality tests: DBT can run some tests, which it calls schema tests and data tests, and provides a CLI interface for running them. The built-in capabilities are pretty limited; however, some extensions exist, such as this port of Great Expectations.
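
Schema tests are declared in YAML next to the models (data tests are plain SQL queries under tests/ that must return zero rows), and both are executed with dbt test; a small sketch, with assumed model and column names:

    # models/schema.yml — built-in schema tests, run with: dbt test
    version: 2
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'returned']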

Overall, as a data quality testing tool, the native Great Expectations offers a much better developer experience, along with a good web dashboard to view the different data quality runs.

Unit testing: An area of testing where DBT is sorely lacking is unit testing. By default, DBT doesn't provide any built-in capabilities for unit testing models. It is, however, possible to implement unit testing on top of DBT; this is, for example, what Shopify did, building Python tests that interact with DBT models.

Another approach, which I have implemented, is to leverage DBT's data test functionality together with a custom model reference function written in Jinja. A sample implementation in a data test is provided below:
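
A minimal illustrative sketch of such a data test, assuming a hypothetical custom_ref macro that takes the name of the model under test and a mapping of the references to override:

    -- tests/test_customer_totals.sql (illustrative sketch; model, column, and macro names are assumptions)
    with test_payments as (
        select 1 as customer_id, 10.0 as amount
        union all
        select 1 as customer_id, 15.0 as amount
    ),

    actual as (
        -- custom_ref inlines the customer_totals model as a subquery and swaps its
        -- internal ref('stg_payments') for the test_payments CTE defined above
        select * from {{ custom_ref('customer_totals', overrides={'stg_payments': 'test_payments'}) }}
    ),

    expected as (
        select 1 as customer_id, 25.0 as total_amount
    )

    -- a data test fails when it returns rows: keep only the mismatches
    select * from actual
    except
    select * from expected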

The model reference function is built in such a way that it allows one to 1) inline the model being referred to into the data test as a subquery instead of a reference, and 2) override the references contained within the model being inlined. This approach makes it possible to test a model against an input dataset contained in a CTE, for instance.

The implementation of this custom_ref macro leverages DBT's graph context variable, Jinja's render() method, and string replacement calls.
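
A rough sketch of what such a macro can look like, following the description above; the project name and node attributes are assumptions, and the availability of render() and raw_sql on graph nodes depends on the DBT version:

    {% macro custom_ref(model_name, overrides={}) %}
        {#- Rough sketch only. Look the model up in DBT's graph context variable by its
            unique id; 'my_project' is an assumption (replace with your dbt project name). -#}
        {%- set node = graph.nodes['model.my_project.' ~ model_name] -%}

        {#- Render the model's raw SQL so nested ref()/source() calls resolve to relations;
            namespace() lets us reassign the value inside the loop despite Jinja scoping. -#}
        {%- set ns = namespace(sql=render(node.raw_sql)) -%}

        {#- Replace each resolved reference with the overriding relation/CTE name passed in. -#}
        {%- for ref_name, replacement in overrides.items() -%}
            {%- set ns.sql = ns.sql | replace(ref(ref_name) | string, replacement) -%}
        {%- endfor -%}

        ( {{ ns.sql }} )
    {% endmacro %}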

To run these tests, DBT still needs to execute the queries against a database; this can, however, be done as part of a docker-compose setup.
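
A minimal docker-compose sketch for a throwaway Postgres instance to run the tests against (image version and credentials here are placeholders):

    # docker-compose.yml — throwaway database for running dbt test locally
    version: "3.7"
    services:
      postgres:
        image: postgres:13
        environment:
          POSTGRES_USER: dbt
          POSTGRES_PASSWORD: dbt
          POSTGRES_DB: analytics
        ports:
          - "5432:5432"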

Test data: One of the challenges we have had with leveraging DBT is working locally and having a clean separation between which data to use as source data and where to write the output.

By default, the ref function works with whatever is defined in the DBT project as a prefix for the different schemas generated. The source function, on the other hand, allows for an independent project reference. During development, we faced the challenge of needing production data to test and reconcile our changes.

To do so, we work with schema separation so that data processed for development purposes does not get mixed in with the datasets used in production. Being able to develop our changes while leveraging production data in this context typically involves one of two actions: either manually setting the reference in the code for the particular datasets to be used as a source, or, if the input data is relatively static (e.g., a daily run), re-running the different DBT transformations in our development schemas, duplicating data and taking on the extra processing time.
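
The schema separation itself is handled through DBT targets; a sketch of a profiles.yml with a development and a production target writing to different schemas (project name, hosts, and credentials are placeholders):

    # ~/.dbt/profiles.yml
    my_project:
      target: dev
      outputs:
        dev:
          type: postgres
          host: localhost
          port: 5432
          user: dbt
          pass: dbt
          dbname: analytics
          schema: dbt_dev_julien     # development models land here
          threads: 4
        prod:
          type: postgres
          host: warehouse.internal
          port: 5432
          user: dbt_prod
          pass: "{{ env_var('DBT_PROD_PASSWORD') }}"
          dbname: analytics
          schema: analytics          # production models land here
          threads: 8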

The ability to differentiate and handle sources and sinks more clearly at the pipeline level (e.g., when you define multiple models in a run, being able to determine where to source the tables not contained in the run's models) would go a long way for developer productivity.

Local development: An extension of the problem faced with test data is developing locally. While it is possible to run DBT against a local or a remote database, the lack of distinction between source and target models at the pipeline level, and the inability to leverage some form of data federation, hinder what you can do for local development with DBT.

RPC

DBT offers the ability to run remote procedure calls. The interface supports the commands available within the CLI, plus some specific ones, such as a result-polling command.

This gives external applications the possibility to execute all or part of the DBT pipelines. In addition, it allows part of the orchestration of the process to be done outside of DBT, for example in Airflow or another application.

Another feature supported by the RPC API is running arbitrary queries on top of the database. This is something I haven't made extensive use of, but I could see it being useful in some contexts. The overall API for this is, however, a bit clunky and requires multiple steps: compile the SQL, run the SQL query, then poll for the results. I could see value being added in this area by exposing the computed DBT models as ORM models that can be queried in a more structured manner.
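
As a sketch of that multi-step flow (method and parameter names follow my reading of the DBT RPC documentation for the 0.x releases and may differ between versions; the default port is assumed), running a query means posting a JSON-RPC request with base64-encoded SQL and then polling with the returned request token:

    # POST http://localhost:8580/jsonrpc — submit the query (sql is base64-encoded)
    {
      "jsonrpc": "2.0",
      "method": "run_sql",
      "id": "query-1",
      "params": {
        "timeout": 60,
        "name": "adhoc_query",
        "sql": "<base64-encoded: select count(*) from analytics.orders>"
      }
    }

    # POST http://localhost:8580/jsonrpc — poll for the result using the request_token
    # returned by the call above
    {
      "jsonrpc": "2.0",
      "method": "poll",
      "id": "poll-1",
      "params": {
        "request_token": "<token returned by run_sql>"
      }
    }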

Summary

DBT provides some nice features to speed up the process of modeling; it is particularly helpful for companies in their early stages of maturity that have to deliver a lot of data models and products fast. The tool still has quite a lot of room for improvement in certain areas, such as customization through Python code, testing, and local development.

Some of the features offered, such as the documentation server and the RPC server, also seem to be lacking in capabilities, but to a certain extent they can be considered first steps towards more feature-complete systems. The documentation generated by DBT could, for instance, be integrated into other tools, in the same way that a more fully-fledged API could be built to retrieve records from the data store. These features seem first and foremost to offer a quicker time to market in these areas and to allow, initially, for a less complex system landscape.

Depending on the particular datastore being used, it might also be a bit more challenging to scale than some alternative tooling. The tool is also very much limited to SQL and only available for a limited number of "adapters".

Overall, DBT is a good product that is particularly fit for early-stage companies, or for companies that are primarily trying to provide tools for automating part of their analysts' workflows. The barrier to entry is fairly low, but the learning curve can be quite steep.