Data Diffs with Meltano and dbt-osmosis

Introducing the Problem

Why are “data” and “software” fundamentally referred to as two distinct things? In certain situations, the answer is obvious. Data living in a database behind a commerce platform is distinctly different from both the code which turns the cogs of the front end and the API, which interfaces between the layer of persistence and the application. 

The situation where this line is blurred, clearly, is when your software code declaratively expresses the desired “shape” of its output data based on an immutable, or at least semi-consistent, set of input shapes. This ambiguous divide exists with dbt. A dbt project itself has little knowledge of the outside world, but once executed on real data, the artifacts (tables in this case) are valuable point-in-time references. 

What if we could compare those artifacts across changes in the code that declare the intended outcome? The project dbt-osmosis set out to solve for the biggest pain points in continuously developing a dbt project on a day-to-day basis based on real-world dbt iteration in a highly agile environment. 

One of those pain points is what we will call “data diffs.” This is different from data-diff, which was open sourced by Datafold, and instead focuses on changes in dbt code that result in changes in outcome. It’s easy to see if dbt code runs. It’s easy to see if dbt code passes basic assertions around its properties, such as uniqueness and nullness, or even table shape assertions. If you ask anyone today though, it is much more difficult to see exactly how many rows were altered, added, or removed by a change in a line of a dbt model. 

So you are probably wondering, what is dbt-osmosis? dbt-osmosis is a developer productivity tool focused on obviating manual tasks dbt developers engage in by implementing new functionality via a central architecture which wraps dbt-core. These functions include data diffs, an interactive Streamlit workbench, a FastAPI server which integrates with VS code, and automated database-aware yaml management. Meltano augments dbt-osmosis by making it dead-simple across an entire team of developers by handling configuration in a unified, effective, environment-aware manner. It also allows teams to centrally define “commands” which are specific to our software project that allow us to quickly scaffold and centralize dbt-osmosis commands, such as the ability to diff models with a simple

 meltano invoke dbt-osmosis:diff -m stg_github__stargazers

versus

dbt-osmosis diff –project-dir $HOME/some/dir –profiles-dir some/other/dir –target dev –model stg_github__stargazers

There are more advances to make, such as longer-term caching of transformation code output (tables) for diffing purposes and time traveling by using specific git commit hashes to understand models diffs based on code changes over time or versus a pivotal point in time. These changes would allow us to more deeply enter the realm of continuous integration use cases, guaranteeing model parity with point-in-time model output.

Now lets move from the theoretical to what this looks like in practice.

Getting Practical with Data Diffs

Below is a simplified example of a modification to a model. We can see via the git extension in our IDE I flipped a boolean operation and I need to understand the impact quickly. What exists on disk is on the right, what exists in our git data is on the left. We need dbt to parse, compile, and run a query that uses the compiled SQL from both of these revisions.

A simple execution of

 meltano invoke dbt-osmosis:diff 

provides the answer with no context-switching. In real-life our model changes are more nuanced and even more in need of these kinds of checks. For example making a left join an inner join or adjusting a macro which feeds into an `in` clause. In a small amount of time, I have an answer to exactly the impact of my change.

Bar chart rendering is possible too via –output=bar



So if you use Meltano and dbt and want to dive in, it’s really easy. Paste the below YAML into your meltano.yml:

 utilities:
 - name: dbt-osmosis
   namespace: dbt_bigquery
   pip_url: dbt-osmosis dbt-bigquery~=1.1.0
   settings:
     - name: auth_method
     - name: project
     - name: dataset
     - name: keyfile
	commands:
  	workbench:
    		args: workbench --project-dir ${MELTANO_PROJECT_ROOT}/transform --profiles-dir ${MELTANO_PROJECT_ROOT}/transform/profiles/bigquery
    		description: Start workbench
  	diff:
    		args: diff --project-dir ${MELTANO_PROJECT_ROOT}/transform --profiles-dir ${MELTANO_PROJECT_ROOT}/transform/profiles/bigquery
    		description: Diff a query from git HEAD, requires -m [MODEL]
  	diff-cache:
    		args: diff --project-dir ${MELTANO_PROJECT_ROOT}/transform --profiles-dir ${MELTANO_PROJECT_ROOT}/transform/profiles/bigquery --temp-table
    		description: Diff a query from git HEAD caching model, requires -m [MODEL]
  	run:
    		args: run --project-dir ${MELTANO_PROJECT_ROOT}/transform --profiles-dir ${MELTANO_PROJECT_ROOT}/transform/profiles/bigquery
    		description: Organizes all dbt schema yaml files and cascading inherits documentation
      				to keep it DRY
  	compose:
    		args: compose --project-dir ${MELTANO_PROJECT_ROOT}/transform --profiles-dir ${MELTANO_PROJECT_ROOT}/transform/profiles/bigquery
    		description: A slimmer version of run used in precommit hooks
	config:
	# Copy config from dbt-<adapter> transformer YAML and paste below
  	auth_method: …
  	project: …

  
Once inside your Meltano project, the above example will require a small amount of additional configuration for your specific dbt adapter, but once set up, simply run

meltano install dbt-osmosis

After this, you are ready to go with

 meltano invoke dbt-osmosis:diff -m whatever_model_you_have_modified


To get started, have your Meltano installation ready and head over to dbt-osmosis.


Meet the Author

Alex Butler – Senior Analytics Engineer (Harness)

A short bio: I am a scholar and a practitioner of all things software related. I am incredibly excited about the future of technology and data and the ways we can use it to make the world a better place.
A nerd for Rust, Clojure, Go, and Python. Dabbling in ML and large-scale fast-moving data. Passionate about open source. When I am not coding I like to do BJJ, Krav Maga, ride motorcycles, design and theorycraft applications and data systems on whiteboard (old school), read books, and I have a really cute baby boy named Orion.” (I have a picture in my mail account)


Intrigued?

You haven’t seen nothing yet!