Our Next Steps for Building the Infrastructure for Your Modern Data Stack

The goal of this post is to contextualize our upcoming roadmap within the framing of our product strategy. In the previous posts in this series on our Mission, Vision, and Strategy, we told you there would be one more post diving into some of the specific tactics of our strategy. This is that post.

In the first post, we introduced our vision for Meltano as your DataOps platform infrastructure. This framing feels right to us because it focuses not just on the data, but also on the work being done with it. It brings DevOps to the table and clearly signals our focus on data stack developers. We’ve also seen others in the data profession independently arrive at the same conclusion that a DataOps platform infrastructure is needed. Benn Stancil explored this idea eloquently in a post of the same name, where he describes a tool that serves as the infrastructure for the modern data stack. We very much agree with the need for a tool that provides this foundation.

Given all this, what is some of the work that must happen for Meltano to become the DataOps platform infrastructure and the foundation of every team’s ideal data stack?

Useful Abstractions

As we build the DataOps platform infrastructure, all parts of our product strategy must consider the relevant concepts and abstractions inherent to data workflows and how they will work with DevOps best practices like version control, code review, automated end-to-end testing, and isolated development environments. Some of these data concepts and abstractions are tried and true, such as backfills, incremental loads, and full refreshes in the case of data replication. Other abstractions are actively being defined via projects like OpenMetadata or OpenLineage and in new areas of the stack like the metrics layer.

These concepts are familiar to data professionals, but many data tools don’t understand them and therefore can’t communicate with each other effectively. The current state of the art is a one-way flow of metadata into a central repository for observability and governance. This creates a disjointed data stack that is hard to manage and reason about. The DataOps platform infrastructure must understand the underlying data concepts and abstractions to facilitate communication and integration between and among different tools.

Within Meltano today, many of the data abstractions are tightly coupled with the specific technology used. For example, data integration pipelines are tied to Singer, while transformation is done with dbt, and orchestration via Airflow. We must decouple the tool-specific parts of the code from the parts that can represent well-worn data abstractions. These include concepts like backfills, incremental loads, full refreshes, configuration, state, DAGs, tests, models, and more. 

This process of better defining abstractions for each part of the data lifecycle will extend to transformation, validation, orchestration, visualization, and more. By making these concepts available across many different solutions, we’ll be able to run stand-alone open source tools on top of Meltano, as well as integrate with vendor APIs, enabling a consistent way of using any tool in data workflows, no matter how different their implementations. Bringing all of these parts of the lifecycle together will be your Meltano Project.

A Meltano Project is the declarative representation of your modern data stack: it defines which components run on top of the platform infrastructure. With your stack defined as a Meltano Project, you unlock the power of DataOps with features like environments, cross-plugin workflows, and centralized configuration management, all in a format that can be versioned and tested automatically.
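To make that concrete, here is a minimal sketch of how a project might come together from the command line; the plugin names are just examples, and each command records its result in the project’s `meltano.yml` file, the declarative definition that gets versioned, reviewed, and tested.

```bash
# Scaffold a new Meltano project (creates the project directory and meltano.yml)
meltano init my-data-stack
cd my-data-stack

# Declare the components of your stack; each command adds the plugin to meltano.yml
meltano add extractor tap-gitlab      # data integration: Singer tap
meltano add loader target-postgres    # data integration: Singer target
meltano add transformer dbt           # transformation

# meltano.yml now describes the stack and can be committed and reviewed like any code
git add meltano.yml && git commit -m "Declare initial data stack"
```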

We will define these abstractions as we add plugins and new features for specific areas of the infrastructure for the modern data stack. We won’t focus solely on generating the abstractions for their own sake, but will build them out as we aim to continually deliver value to our users. Now it’s time to share what our roadmap looks like.

Near-term Work

In January we will deliver several enhancements to Meltano that will continue to advance us towards our DataOps platform infrastructure vision.

Composable pipelines is an effort to bring better abstractions to the command line interface. With the new `meltano run` command, users will be able to run cross-plugin workflows more flexibly, rather than being limited to the `elt` syntax, which only supports Singer and dbt.
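As a rough sketch of the difference (the exact syntax may still evolve as the feature lands, and the plugin names are illustrative):

```bash
# Today: elt is limited to one Singer tap, one Singer target, and optional dbt transforms
meltano elt tap-gitlab target-postgres --transform=run

# With composable pipelines: chain any sequence of plugins and plugin commands in one run
meltano run tap-gitlab target-postgres dbt:test dbt:run
```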

This work unlocks the ability to add Stream Map plugins between taps and targets that can transform data on the fly for common tasks like filtering or pseudonymization. This capability already exists in taps built with the Meltano SDK, but it will soon be available to any Singer tap or target. Along with the addition of a generic testing interface for any plugin, users will be able to extract, load, transform, and test their data in ways that best suit their needs. This, coupled with Meltano Environments, enables powerful and customizable ways to run any kind of data workflow.
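A hypothetical invocation might look like the following, where `hide-pii` stands in for a stream map plugin that pseudonymizes columns between extraction and loading, and the testing command shows the general shape of the planned interface rather than a final design:

```bash
# Insert a mapper between the tap and target to transform records in flight
# ("hide-pii" is an illustrative plugin name, not a published plugin)
meltano run tap-gitlab hide-pii target-postgres

# Then exercise the tests a plugin exposes through the generic testing interface
meltano test dbt
```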

After those enhancements, we have several areas we aim to improve. First is the dbt integration. With the addition of environments and composable pipelines we’ve increased the usability of all plugins, but now we need to make the dbt experience better in particular. We aim to make Meltano the best way to run dbt Core, not only because it’s one of the best tools on the market, but also to build trust with users that Meltano can make running plugins other than Singer a great experience, and that it can make combinations of different plugins better than the sum of their parts. Enhancing the dbt experience means specific improvements such as supporting dbt v1.0, better snapshot support, autogeneration of sources.yml, easier installation of specific adapters, simplified documentation, and easier invocation.

Next is how Meltano works with discovery.yml and MeltanoHub. While MeltanoHub will index all plugins we support on the DataOps platform infrastructure, Singer connectors happen to make up the majority of plugins currently indexed. A subset of all available Singer taps and targets is discoverable within Meltano, meaning connector metadata is automatically populated and configured based on the data stored in discovery.yml. By reworking the file itself and how connector definitions are pulled from MeltanoHub, we expect to make it easier to use any tap and target from within Meltano while also increasing confidence in installed plugins through better guarantees around versioning.
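In practice, the difference shows up when adding plugins. Here is a sketch assuming a discoverable connector on one hand and a custom, not-yet-indexed one on the other (`tap-my-internal-api` is a made-up name):

```bash
# A discoverable connector: settings and metadata are filled in from discovery.yml,
# and increasingly from definitions pulled from MeltanoHub
meltano add extractor tap-gitlab

# A connector that isn't discoverable yet is added as a custom plugin,
# with its pip package and settings supplied by the user
meltano add --custom extractor tap-my-internal-api
```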

Finally, we aim to keep increasing the plugins available within Meltano. We expect to add Great Expectations as a discoverable plugin while also unlocking the ability to support non-Python, containerized plugins. This opens the door to supporting more tools and services beyond what exists in the Python ecosystem; imagine installing metriql, which is written in Kotlin, or RudderStack, which is written in Go.

Along with all of this, we will improve the user experience of the documentation site by updating its content and design. We will also improve our documentation and support for how Meltano can be deployed as we dogfood our own work.

From here, we shift towards a wider outlook of approximately 3 months. 

The Following Three Months

A big focus early in this period will be the enhancement of orchestration plugins. The surface area for improving users’ experience with Airflow is vast. The latest version of Airflow should be supported, and it needs to be clear to users how to work with the tool from both the command line and the UI. As we develop the relevant abstractions of what a data workflow in Meltano represents, it should be clear to users how schedules and composable pipelines map to Airflow DAGs. Concurrently with the Airflow work, we’ll also look to support Dagster as an alternative orchestration tool to continue providing maximum choice for your ideal data stack.
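To make that mapping concrete, the rough shape today looks like the following; the names are illustrative and the schedule syntax may change as the abstractions firm up:

```bash
# Define a schedule; it is recorded declaratively in meltano.yml
meltano schedule gitlab-to-postgres tap-gitlab target-postgres @daily

# The Airflow orchestrator plugin picks up schedules and generates DAGs from them
meltano add orchestrator airflow
meltano invoke airflow scheduler
meltano invoke airflow webserver   # inspect the generated DAGs in the Airflow UI
```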

We also aim to support both Lightdash and Superset as “analyzers”. For our own work internally we have several metrics we want to understand. We’re currently using Meltano to handle all of the data integration and we’ll continue to dogfood everything as we add these tools to our stack. 

We’ll continue to invest in improving Meltano Environments by enabling sensible defaults and making it easier to manage complex configurations. We’re also considering adding a shell capability to make it easier to work with tools in your data stack within a specific plugin environment and configuration context.
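As a small sketch of how environments already shape a run (the `dev` and `prod` names are just examples), the same declared pipeline can be pointed at different configuration contexts from the command line:

```bash
# Run the same pipeline against environment-specific configuration,
# e.g. a development database vs. the production warehouse
meltano --environment=dev run tap-gitlab target-postgres
meltano --environment=prod run tap-gitlab target-postgres

# Or set a default environment for the current shell session
export MELTANO_ENVIRONMENT=dev
meltano run tap-gitlab target-postgres
```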

During this time we expect to make great progress with the Singer Working Group. We’re particularly excited about bringing the BATCH message type to the Singer spec and releasing v1 of the Meltano SDK for Taps and Targets. These should bring faster data loads and more stability to any plugin built on the SDK.

We’ll also be improving the Meltano UI with a more scalable foundation for future enhancements. The UI has been stagnant for some time, and we aim to upgrade it with better design and usability, as well as increasing parity with the CLI, for example by enabling control of environments and plugin UIs.

Throughout all of this work, we’ll also be squashing bugs and working with you on your contributions to Meltano, MeltanoHub, Meltano SDK, and any projects hosted on MeltanoLabs.

Onward

Meltano product looking forward

Even after these four months, there will still be plenty to do as the infrastructure for the modern data stack evolves to cover more and more of the data lifecycle and support all the tools that make up your ideal data stack. And we want your help in building the future of data tooling. Meltano is a tool built by and for data professionals and we need your feedback and contributions to make it as great as it can be. We believe the best ideas and contributions come from the community and we continually adjust our roadmap based on your feedback.

Have something to share? Join us on Slack or open an issue! We’d love to hear from you.
