For data engineers, everything revolves around data pipelines, a series of data processing steps taken in a specific order. Because each step depends on the previous one, there are many single points of failure. That’s why maintaining data pipelines can be pretty challenging, especially within corporate IT architectures.
While many methods and tools exist, Meltano has taken a targeted approach to help data engineers develop, orchestrate, and monitor data pipelines in line with DataOps best practices. It does so by offering a plug-in-based architecture that packages best-in-class open source data tools, including Singer, dbt, and Airflow. By embracing these tools, Meltano has explicitly embraced the ELT paradigm.
Recently, Meltano added the long-anticipated run command, which enables a series of command blocks. In this article, you’ll learn why the run command is a game changer for teams maintaining complex pipelines, and you’ll explore some best practices for using it.
What Is a Composable Pipeline?
To derive insights and develop algorithms, data analysts and data scientists need access to a vast amount of data. However, they wouldn’t run their analysis on data that resides in production systems or third-party SaaS tools. It is the job of data engineers to consolidate this data inside a centralized repository optimized for storing and processing data. These systems are known as data lakes or data warehouses. But their job doesn’t end there. On the other side of the workflow, business people require (enriched) data to be distributed to their marketing and sales tools, like advertising platforms and CRM technology.
This data extraction from one system into another is known as a data pipeline. Data pipelines are generally referred to as either ETL or ELT, each following a specific flow. E stands for Extract, L stands for Load, and T stands for Transform. However, these flows are generally an oversimplification in a world with complex architectures and many dependencies. With data spread over various source systems in many organizations, linear data pipelines are an exception rather than the rule.
Imagine you are responsible for aggregating sales per product category for a large retailer. Every day, your pipeline has to spit out the data for the sales tracker. The product categories, which can change on a daily basis, are stored in a dedicated product catalog in a SaaS tool. On the other hand, your organization keeps ticket data per product SKU in a completely different system. That means your data pipeline will have multiple extracts and loads before the data can be joined, transformed, and analyzed.
That’s where composable pipelines come into the picture. Instead of forcing the data engineer to fit a pipeline into a specific straitjacket, they can break it up into multiple logical pieces.
Meltano’s new run command is the implementation of composable pipelines. It enables data engineers to set up data pipelines that don’t follow the simple pattern of extracting from one data source, loading it into another, and wrapping up with a single transformation.
Until now, users would typically invoke Meltano pipelines using the elt subcommand. This command supports a single tap and a single target, followed by a transform parameter. It is a very standardized way of setting up data pipelines. However, in situations where one needs to stray from this simple pattern, setting up complex pipelines with multiple inputs, outputs, or transformations using the elt subcommand requires some creativity.
meltano elt tap-salesforce target-bigquery --transform=run --job_id=customer_data_from_bq_to_sf
However, Meltano’s new run subcommand can be parameterized with a series of command blocks. These blocks are discrete units of composable work, described by their plug-in names. Blocks can be Singer taps and targets, dbt transformers, or any other plug-in.
The previous elt command can be translated to a run command as follows. Once again, you’ll use Meltano to extract data from Salesforce and load it into BigQuery before triggering a transformation via dbt’s run command:
meltano run tap-salesforce target-bigquery dbt:run
Now you know that data engineers can use both the elt and the run command for executing simple data pipelines. But let’s assume that your dbt transformation processes data from both Salesforce and an on-premise PostgreSQL database, which would require another extraction before the transformation is triggered. In that scenario, you’d add a PostgreSQL tap and a BigQuery target to your previous command, like this:
meltano run tap-postgres target-bigquery tap-salesforce target-bigquery dbt:run
The run command will execute all command blocks from left to right, much like a directed acyclic graph (DAG) in which each command block is a single point of failure: a failure in any block aborts the entire command. Using Meltano’s elt command, this pipeline would require multiple elt commands scheduled in sequence. And even then, it’s not guaranteed that a failure in one command will abort the whole pipeline.
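For comparison, here is a hedged sketch of what that same pipeline might look like as separate elt invocations chained in a shell script. The job IDs are hypothetical, and the && operators are what stop the chain on failure, which is a guarantee a scheduler running the commands independently would not necessarily give you:

```shell
# Run the two extractions, then the transformation, stopping at the first
# failure. Each && only continues when the previous command exits with 0.
meltano elt tap-postgres target-bigquery --job_id=postgres-to-bigquery \
  && meltano elt tap-salesforce target-bigquery --job_id=salesforce-to-bigquery \
  && meltano elt tap-salesforce target-bigquery --transform=only --job_id=bigquery-transform
```
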
Keep in mind that Meltano follows the ELT paradigm: it performs transformations on data at rest. Every extraction requires a destination. That’s why taps and targets still need to be chained together. For example, the following command will fail because there’s a dbt plug-in between a tap and a target:
meltano run tap-bigquery dbt:run target-salesforce
There is one exception to this rule: inline stream maps. Instead of using dbt to transform data that has landed in a data warehouse, stream maps apply transformations on in-flight data. This offers several possibilities:
- aliasing streams (Singer jargon for tables) or properties (columns) to provide custom naming downstream.
- filtering streams to exclude records or properties in downstream plug-ins.
- splitting or duplicating streams for downstream targets.
- transforming properties.
- creating new properties from user-defined expressions.
Keep in mind that inline stream maps work on a record-by-record basis. Consequently, some typical transformation capabilities are out of scope by design:
- The option to aggregate data remains available only after the data has landed at the target, for example, with a dbt transformation.
- The option to join data is also only available after the data has landed.
- External API lookups can be achieved with a custom mapper plug-in or when the data has landed.
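Conceptually, an inline stream map behaves like a pure function applied to each record in isolation, which is why aggregations and joins are off the table. The following is a minimal Python sketch of the idea only; the function and field names are hypothetical and not part of Meltano’s API:

```python
import hashlib


def pii_hash_map(record: dict, fields: tuple = ("email", "phone")) -> dict:
    """Transform a single record in isolation, as an inline stream map does.

    Because each record is processed independently, this function has no way
    to aggregate across records or join against another stream.
    """
    out = dict(record)  # leave the incoming record untouched
    for field in fields:
        if out.get(field) is not None:
            # Replace the sensitive value with its SHA-256 hash.
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()
    return out
```

Each record flows through the function and straight on to the target; no state is kept between records.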
Let’s say that you’d like to add a transformation to your previous command that hashes personal information on the fly. You can achieve this with the following:
meltano run tap-salesforce pii-hasher target-bigquery dbt:run
You can also invoke multiple transformations; for example, you may want to hash personal information and copy one field to another:
meltano run tap-salesforce pii-hasher close-date-as-win-date target-bigquery dbt:run
As demonstrated by all these examples, Meltano’s run command opens the door to many other plug-ins, all in a single command. Consequently, it enables tighter integration with dbt. By setting up custom plug-in commands, one could develop dbt tests and integrate them into the data pipeline. In the following command, the whole pipeline will fail if the test doesn’t return a successful exit code:
meltano run tap-salesforce target-bigquery dbt:test dbt:run
And it doesn’t stop with dbt. The team behind Meltano has implemented SQLFluff, a linting tool for evaluating SQL queries. It is a first step in growing the plug-in library to expand the run command’s capabilities. Users can also add any Python- or container-based plug-in they want to their project and execute it via the run command.
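Assuming SQLFluff has been added to the project as a plug-in exposing a lint command, it could be slotted into the pipeline as just another block. This is a sketch under that assumption, not a command verified against a live project:

```shell
# Lint the SQL models before running the dbt transformation; a lint failure
# aborts the pipeline just like any other failing block.
meltano run tap-salesforce target-bigquery sqlfluff:lint dbt:run
```
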
In this article, you’ve learned how Meltano’s run command changes the way data teams build complex pipelines using Meltano. While the elt command assumes that data pipelines all follow the same workflow, the run command permits complex pipelines with sequential dependencies. Because it doesn’t enforce a specific straitjacket, many plug-ins can be integrated in a single run command.