Now Available: Meltano v1.34.2

Today, we are excited to release Meltano version 1.34.2, which fixes a number of bugs related to Meltano’s built-in support for the Airflow orchestrator.

Excited to try it out?

To upgrade your local installation of Meltano, activate the appropriate Python virtual environment and run meltano upgrade from inside a Meltano project, or pip3 install --upgrade meltano from anywhere else. If you’re running Meltano inside Docker, run docker pull meltano/meltano.
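
In shell terms, that comes down to the following (a minimal sketch; the virtual environment path is an example and depends on how you installed Meltano):

```
# Inside a Meltano project, with the appropriate virtual environment activated:
source .venv/bin/activate   # example path, adjust to your setup
meltano upgrade

# From anywhere else:
pip3 install --upgrade meltano

# If you're running Meltano inside Docker, pull the latest image instead:
docker pull meltano/meltano
```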

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since Tuesday’s release of v1.34.0 and v1.34.1:


Fixes

  • #2076 Fix bug that caused Airflow to look for DAGs in plugins dir instead of dags dir
  • #2077 Fix potential dependency version conflicts by ensuring Meltano venv is not inherited by invoked plugins other than Airflow
  • #2075 Update Airflow config and run initdb every time it is invoked
  • #2078 Have Airflow DAG respect non-default system database URI set through MELTANO_DATABASE_URI env var or --database-uri option

Now Available: Meltano v1.34.0 and v1.34.1

Today, we are excited to release Meltano version 1.34.0, which (among other things) contains a number of changes to how plugin configuration is managed internally that make it easier to use Meltano with new extractors, loaders, and transformers.

Specifically, it introduces pipeline environment variables that allow loaders and transformers to adapt their configuration and behavior based on the extractor and loader they are run with as part of a meltano elt pipeline. This feature is used to dynamically configure the target-postgres and target-snowflake loaders and dbt transformer appropriately, independent of the specific extractor and loader used, while still allowing users to override any of these defaults as they see fit.
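
As a hedged illustration of how this plays out (the plugin names are examples, and the override shown is optional), a loader invoked as part of a pipeline can now see environment variables describing the extractor it is paired with:

```
# Example pipeline run:
meltano elt tap-gitlab target-postgres

# While target-postgres is being invoked, Meltano exposes (among others):
#   MELTANO_EXTRACTOR_NAME=tap-gitlab
#   MELTANO_EXTRACTOR_NAMESPACE=tap_gitlab
#   MELTANO_EXTRACT_<SETTING>=<value of that extractor setting>
# so target-postgres can default its schema to the extractor's namespace
# (tap_gitlab here), while you can still override it explicitly:
meltano config target-postgres set schema my_custom_schema
```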

Shortly after releasing v1.34.0, we realized that one of its changes accidentally introduced an install-time dependency on PostgreSQL. This has been fixed in Meltano version 1.34.1.

Excited to try it out?

To upgrade your local installation of Meltano, activate the appropriate Python virtual environment and run meltano upgrade from inside a Meltano project, or pip3 install --upgrade meltano from anywhere else.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since last week’s release of v1.33.0:

New

  • !1664 Automatically populate env properties on newly added custom plugin settings in meltano.yml
  • !1664 Have meltano config <plugin> list print default value along with setting name and env var
  • !1664 Pass configuration environment variables when invoking plugins
  • !1664 Set MELTANO_EXTRACTOR_NAME, MELTANO_EXTRACTOR_NAMESPACE, and MELTANO_EXTRACT_{SETTING...} environment variables when invoking loader or transformer
  • !1664 Set MELTANO_LOADER_NAME, MELTANO_LOADER_NAMESPACE, and MELTANO_LOAD_{SETTING...} environment variables when invoking transformer
  • !1664 Allow dbt project dir, profiles dir, target, source schema, target schema, and models to be configured like any other plugin, with defaults based on pipeline-specific environment variables
  • #2029 Allow target-postgres and target-snowflake schema to be overridden through config, with default based on pipeline’s extractor’s namespace
  • #2062 Support --database-uri option and MELTANO_DATABASE_URI env var on meltano init (see the sketch below)
  • #2062 Add support for PostgreSQL 12 as a system database by updating SQLAlchemy
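
To make the new system database options listed above a bit more concrete, here is a hedged sketch (the PostgreSQL URI is a made-up example; by default, Meltano uses a SQLite database inside the project):

```
# Point Meltano's system database at PostgreSQL, either at init time...
meltano init my-project --database-uri postgresql://user:pass@localhost:5432/meltano

# ...or through the environment variable, which other commands respect as well:
export MELTANO_DATABASE_URI=postgresql://user:pass@localhost:5432/meltano
meltano elt tap-gitlab target-postgres
```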

Changes

  • !1664 Infer compatibility between extractor and transform based on namespace rather than name
  • !1664 Determine transform dbt model name based on namespace instead of replacing - with _ in name
  • !1664 Don’t pass environment variables with “None” values to plugins if variables were unset
  • !1664 Determine Meltano Analyze schema based on transformer’s target_schema or loader’s schema instead of MELTANO_ANALYZE_SCHEMA env var
  • #2053 Bump dbt version to 0.16.1

Fixes

  • #2059 Properly handle errors in before/after install hooks

Now Available: Meltano v1.33.0

Today, we are excited to release Meltano version 1.33.0, which contains various improvements related to custom extractors:

  • the meltano add --custom prompts are now clearer and the default values more sensible, and
  • these extractors are now properly displayed in the Connections UI if the label or docs properties are not set or a logo image could not be found.

To upgrade your local installation of Meltano, activate the appropriate Python virtual environment and run meltano upgrade from inside a Meltano project, or pip3 install --upgrade meltano from anywhere else.


Specifically, version 1.33.0 contains the following changes relative to 1.32.1:

Changes

  • #2028 Improve descriptions and default values of meltano add --custom prompts

Fixes

  • #2042 Fix bug causing Connection Setup UI to fail when plugin docs URL is not set
  • #2045 Hide plugin logo in UI if image file could not be found
  • #2043 Use plugin name in UI when label is not set, instead of not showing anything
  • #2044 Don’t show button to open Log Modal on Pipelines page if pipeline has never run

Now Available: Meltano v1.32.1

Today, we are excited to release Meltano version 1.32.1, which fixes a bug that prevented some Singer taps and targets from working with Meltano because of conflicting Python dependencies.

Many thanks to Zafar Toshpulatov for reporting this bug on Slack. The loader he was trying to use, pipelinewise-target-postgres, should now work without issues. 🙂

To upgrade your local installation of Meltano, activate the appropriate Python virtual environment and run meltano upgrade from inside a Meltano project, or pip3 install --upgrade meltano from anywhere else.


Specifically, version 1.32.1 contains the following changes relative to 1.32.0:

Fixes

  • #2024 Have plugin venvs not inherit Meltano venv to prevent wrong versions of modules from being loaded

Why we are building an open source platform for ELT pipelines

This is part 2 of a 2-part series to announce and provide context on the new direction of Meltano.

If you’ve been following Meltano for a while or would like to have some historical context, start with part 1: Revisiting the Meltano strategy: a return to our roots.

If you’re new to Meltano or are mostly interested in what’s coming, feel free to skip part 1 and start here.

If you’re worried that reading this entire post will take a lot of time, feel free to jump right to the conclusion: Where Meltano fits in.

Introduction

If you’ve read part 1 of the series, you know that Meltano is now focused on building an open source platform for data integration and transformation (ELT) pipelines, and that we’re very excited about it.

But why are we even building this?

Isn’t data integration (getting data from sources, like SaaS tools, to destinations, like data warehouses) a solved problem by now, with modern off-the-shelf tools having taken the industry by storm over the past few years, making it so that many (smaller) companies and data teams don’t even need data engineers on staff anymore?

Off-the-shelf ELT tools are not that expensive, especially compared to other tools in the data stack, like Looker, and not having to worry about keeping your pipelines up and running or writing and maintaining data source connectors (extractors) is obviously extremely valuable to a business.

On top of that, writing and maintaining extractors can be tedious, thankless work, so why would anyone want to do this themselves instead of just paying a vendor to handle the burden?

Who would ever want to use a self-hosted ELT platform? And why would anyone think building this is a good use of time or money, especially if it’s going to be free and open source?


In part 1, I explained why we have concluded that in order to eventually realize our end-to-end vision for Meltano (a single tool for the entire data lifecycle, from data source to dashboard), we have to go all-in on positioning Meltano as an open source self-hosted platform for running data integration and transformation (ELT) pipelines, and will turn Meltano into a true open source alternative to existing proprietary hosted solutions like Alooma, Blendo, Hevo, Matillion, Pentaho, and Xplenty, in terms of ease of use, reliability, and quantity and quality of supported data sources.

However, the points and questions raised above are totally valid, and were in fact raised by actual data engineers I’ve talked to over the past few weeks. While Meltano (and GitLab, which sponsors its development) needs such a tool to exist, it’s a separate matter entirely whether there are any data engineers or data teams out there who share that need.

Would any data team actually consider joining the community, contributing to Meltano and its extractors and loaders, and eventually migrating to the open source tool, away from whatever proprietary solution they use today?

The problem: pay to play

The idea is that every data team in the world needs a data integration tool, because one way or another you have to get your data from your various sources into your data warehouse so that it can be analyzed. And since every company would be better off if they were analyzing their data and learning from their ups and downs, every company in the world needs a data integration tool whether they already realize it or not.

Since there is currently no true open source alternative to the popular proprietary tools, the data space has effectively become “pay to play”. There are many great open source analytics and business intelligence tools out there (Superset, Metabase, and Redash come to mind, and let’s not forget that Meltano comes with built-in analytics functionality as well), but all assume that your data will somehow have already found its way into a data warehouse.

If for any reason at all you cannot use one of the hosted platforms, you are essentially out of luck and will not get to compete on a level playing field with those companies that can afford to integrate their data and start learning from it. Even if you have everything else going for you, you are massively disadvantaged from day one.

Perhaps you do not think of these off-the-shelf tools as particularly expensive, you’re fine with your sensitive data flowing through a US company’s servers, and you would happily pay for professional services if you ever need to extract data from a source that isn’t supported already.

However, many around the world will find the prices US companies charge prohibitively expensive relative to their local income, may prefer (or be legally required) to have their data not leave their country or their servers, or may find that the locally grown SaaS services they use are often not supported by the existing US-centric vendors.

And to be clear, US companies are not immune to these issues, even if they may be somewhat less affected by the financial argument. Think of HIPAA compliance, for example, which many (most? all?) hosted tools don’t offer unless you sign up for one of their more expensive plans.

If you do not feel the pain of the current situation or see the need for change, recognize that your experience may not be representative.

Data integration as a commodity

This perspective leads me to an argument with an ideological angle, one that is particularly compelling to me because of the many parallels I see with the early days of GitLab: the open source project founded in Ukraine back in 2011 with the goal of building a self-hosted alternative to the likes of GitHub and Bitbucket, which a few years later became an open core product maintained primarily by the newly founded company that shares its name. To this day, GitLab comes in open source and proprietary flavors, and the functionality included in the Community Edition continues to be sufficient for hundreds of thousands of organizations around the world that would otherwise have needed to opt for a paid, proprietary alternative. As GitLab is sponsoring the development of Meltano, these parallels are not a coincidence.

Since an ELT platform is a tool every data engineer and every company needs if they want to have the best chance of survival and success, I would argue that it should be a commodity and should be available at a reasonable cost to everyone who wants or needs it. Anything less than that hurts a significant number of companies in their ability to reach their true potential and serve their users and customers as well as they would want to, thereby stifling innovation and competition, and we all end up paying the price because we have to deal with companies and products that are less optimized and improved than they could be.

The obvious question: if this is apparently such a problem, why haven’t tons of competitors popped up already to serve these local markets or inject some competition into the US market? Orchestrating reliable data pipelines is a solved problem, even in the open source space, where great tools like Airflow and Luigi exist and are running in production at thousands of organizations. That’s not to say they’re as easy to configure and get started with as the hosted platforms we’re talking about, but the technology is there, assuming you have an extractor and loader to plug in.

And I think that assumption is at the core of the issue, and at the core of the economic moat that the existing vendors have created around themselves, which makes it hard for new parties to enter the market and compete: the impressive number of data sources they support out of the box, and their massive (in-house or outsourced) teams that have spent and continue to spend thousands of hours developing and maintaining these extractors and loaders.

If you’ve read part 1 of this 2-part series, you’ll remember that we ran into this ourselves when we offered a hosted version of Meltano’s data connection and analytics interface to non-technical end-users. They could go straight from connecting their data source to viewing a dashboard, but only if we had written the extractor, loader, transformations, and models for that data source beforehand, and if we would continue to maintain these forever. We realized that this wasn’t going to scale, and neither would it for most companies that decided to just write and maintain their own extractors instead of paying someone else to do it: it’s a lot of work, and it never ends.

The solution: open source

Ultimately, though, the size of the economic moat that exists around these vendors can be measured in terms of developer hours, and there’s no secret sauce or intellectual property that separates the current major players from anyone else out there who has their own hours to bring to the table.

By yourself, as a single company or data engineer, implementing and maintaining extractors for all of the data sources you need to integrate is not feasible, which is why most don’t.

Together, though, that changes. With a big enough group of people capable of programming and motivated to collaborate on the development and maintenance of extractors and loaders, it’s just a matter of time (and continued investment of time by a subset of the community) before every proprietary extractor or loader has an open source equivalent. The maintenance burden of keeping up with API and schema changes is not insignificant, but if open source communities can manage to maintain language-specific API client libraries for most SaaS APIs out there, there’s no reason to think we’d have a harder time maintaining these extractors.

Assuming there is no secret sauce or key intellectual property involved, a sufficiently large and motivated group of people capable of programming can effectively will any new tool into existence: that is the power of open source.

The more common the data source, the more people will want it, the faster it’ll be implemented, the more heavily it’ll be tested, and the more actively it’ll be maintained. It doesn’t need to take long before the segment of the market that only uses these common data sources will be able to swap out their current data integration solution for this open source alternative. It’s not an all-or-nothing matter either: data teams can move their data pipelines over on a pipeline-by-pipeline basis, as extractors become available and reach the required level of quality.

Of course, a self-hosted platform for running data integration pipelines wouldn’t just need to support a ton of extractors and loaders. You would also want to be confident that you can run it in production and get the same reliability and monitoring capabilities you get with the hosted vendors. Fortunately, this is where we can leverage an existing open source tool like Airflow or Luigi, which this hypothetical self-hosted platform could be built around.

Everyone wins

Even if you’re not personally interested in ever using a self-hosted data integration platform, you may benefit from us building one anyway.

Open source is the most promising strategy available today to increase competition in the data integration and data pipeline space. Even if the specific tool we’re building doesn’t actually become the Next Big Thing, the market will benefit from that increased competition.

Developers of new SaaS tools and data warehouse technology would also benefit from an open source standard for extractors and loaders. Rather than wait (or pay) for data integration vendors to eventually implement support for their tool once it reaches a high enough profile or once its users start begging (or paying) the vendor loudly enough, new tools could hit the ground running by writing their own integrations. Today, many companies wouldn’t consider switching to a new SaaS tool that isn’t supported by their data integration vendor at all, putting these tools at a significant competitive disadvantage against their more mature and well-connected competitors.

The only ones who have something to lose here are the current reigning champions. For everyone else it’s a win-win, whether you actually contribute to or use Meltano, or not. If you don’t believe me, just look at the DevOps space and the impact that GitLab has had on the industry and the strategy and offering of the previously dominant players, GitHub and Bitbucket.

If an industry has effectively become “pay to play” because every software engineer in that industry needs to use one of a handful of paid tools in order to get anything done at all, there is a massive opportunity for an open source alternative “for the people, by the people” to level the playing field, and disrupt the established players from the bottom up.

Of course, GitLab is not just interested in sponsoring the development of such an open source project out of the goodness of its heart. The hope is that eventually, a business opportunity will arise out of this project and its community and ecosystem, because even if a truly competitive free and open source self-hosted option is available, there will always be companies that would still prefer a hosted version with great support and enterprise features, who won’t mind paying for it.

But for everyone else, there will always be a Community Edition, and data integration will forever be a commodity rather than pay to play.

The Singer specification

Of course, we are not the first to be intrigued by the concept of open source data integration. Most significantly, Stitch has developed the Singer specification, which they describe as follows:

Singer describes how data extraction scripts—called “taps”—and data loading scripts—called “targets”—should communicate, allowing them to be used in any combination to move data from any source to any destination. Send data between databases, web APIs, files, queues, and just about anything else you can think of.

There’s a Getting Started guide on how to develop and run taps and targets (extractors and loaders), many dozens of them have already been written for a wide range of data sources, data warehouses, and file formats, a good number of them are actively maintained and being used in production by various organizations, and the Singer community on Slack has over 2,100 members, with new people joining every day.

Once you’ve written (or installed) a tap and target, you can pipe them together on the command line (tap | target) and see your data flow from source to destination, which you can imagine is quite satisfying.
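
In its simplest form, that looks something like this (a hedged sketch; tap-gitlab and target-jsonl stand in for whichever tap and target you have installed, each with its own config file):

```
# Pipe a Singer tap's output straight into a Singer target:
tap-gitlab --config tap_config.json | target-jsonl --config target_config.json

# Many taps also accept a catalog (or --properties) for entity selection
# and a state file for incremental replication:
tap-gitlab --config tap_config.json --catalog catalog.json --state state.json \
  | target-jsonl --config target_config.json
```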

Once you’ve hit that milestone, though, the next step is not quite so obvious. How do I actually build a data pipeline out of this that I can run in production? Is there a recommended deployment or orchestration story? How do I manage my pipeline configuration and state? How do I keep track of the metrics some taps output, and how do I monitor the whole setup so that it doesn’t fall flat on its face while I’m not looking?

Unfortunately, the Singer specification and website don’t touch on this. A number of tools have come out of the Singer community that make it easier to run taps and targets together (PipelineWise, singer-runner, tapdance, and knots, to list a few), and some of these are successfully being used in production, but getting to that point still requires one to figure out and implement a deployment and orchestration strategy, and those who have managed to do so effectively have all needed to reinvent the wheel.

This means that while open source extractors and loaders do exist, as does a community dedicated to building and maintaining them, what’s missing is the open source tooling and documentation around actually deploying and using them in production.

The missing ingredients

If this tooling did exist and Singer-based data integration pipelines were truly easy to deploy onto any server or cloud, the Singer ecosystem would immediately become a lot more interesting. Anyone would be able to spin up their own Alooma/Blendo/Hevo/Matillion/Pentaho/Xplenty alternative, self-hosted and ready to go with a wide range of supported data sources and warehouses. Existing taps and targets would get more usage, more feedback, and more contributions, even if many prospective users may still end up opting for one of the proprietary alternatives in the end.

Many people who come across the Singer ecosystem today end up giving up because they can’t see a clear path towards actually using these tools in production, even if taps and targets already exist for all of the sources and destinations they’re interested in. You have to be particularly determined to see it through and not just opt for one of the hosted alternatives, so the majority of people developing taps and targets and running them in production today are those for whom a hosted solution was never really an option. Any amount of better tooling and documentation will help people take the Singer ecosystem more seriously as an open source data integration solution, and convince a few more people to give it a try who would otherwise have given up.

Developing taps and targets is also not as easy as it could be. The Getting Started guide and singer-tools toolset are a great start, and implementing a basic tap is pretty straightforward, but building one you would actually be comfortable running in production is still a daunting task. The existing taps can serve as examples, but they are not implemented consistently and don’t all implement the full range of Singer features. The singer-python library contains utility functions for some of the most common tasks, but taps end up reimplementing a lot of the same boilerplate behavior anyway. Moreover, a testing framework or recommended strategy does not exist, meaning that users may not find out that a small inconspicuous change broke their extractor or loader until they see their entire data pipeline fail.
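
That said, the singer-tools toolset mentioned above does at least make it easy to sanity-check a tap's output against the spec, roughly like so (a sketch, assuming the tap and the singer-tools package are installed in the same environment):

```
pip3 install singer-tools

# Pipe the tap's output into the checker instead of a target; it validates
# the stream of SCHEMA/RECORD/STATE messages and prints a summary:
tap-gitlab --config tap_config.json | singer-check-tap
```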

All in all, the Singer ecosystem has a ton of potential but suffers from a high barrier to entry that negatively affects the experience of those who want to use existing taps and targets, as well as those potentially interested in developing new ones.

Over the past few weeks, I’ve spent many hours talking to various members of the Singer community who have been able to get their Singer-based pipelines running in production, and the observations above are informed by their perspectives and experience. Unanimously, they agreed that the Singer ecosystem is not currently living up to its potential, that change is needed, and that better tooling and documentation around deployment and development would go a long way.

Where Meltano fits in

As I’m sure you’ve pieced together by now, Meltano intends to be that tooling and bring that change.

Our goal is to make the power of data integration available to all by turning Meltano into a true open source alternative to existing proprietary hosted ELT solutions, in terms of ease of use, reliability, and quantity and quality of supported data sources.

Luckily, we’re not starting from zero: Meltano already speaks the Singer language and uses taps and targets for its extractors and loaders. Its support goes beyond simply piping two commands together, as it also manages configuration, entity selection and extractor state for you. It also makes it super easy to set up pipeline schedules that can be run on top of a supported orchestrator like Airflow.

Additionally, Meltano supports dbt-based transformation as part of every ELT pipeline, and comes with a basic web interface for data source connection, pipeline management, point-and-click analytics, and report and dashboard creation, enabling you to go from data to dashboard using a single tool that you can run locally or host on any cloud.
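
To make the difference with bare tap | target piping concrete, here is a minimal sketch (the plugin names are examples, and the commands assume those plugins have been added to your project):

```
# Meltano manages each plugin's configuration, entity selection, and state,
# and can run the dbt transformations that come with the extractor:
meltano elt tap-gitlab target-postgres --transform run

# Scheduling the same pipeline to run on a supported orchestrator like Airflow:
meltano schedule gitlab-to-postgres tap-gitlab target-postgres @daily --transform run
```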

For the foreseeable future, though, our focus will primarily be on data integration, not transformation or analysis.

While we’ve come a long way already, there’s still plenty of work to be done on the fronts of ease of use, reliability, and quantity and quality of supported data sources, and we can’t afford to get distracted.

Let’s get to work!

If any of the above has resonated with you, or perhaps even inspired you, we’d love your help in realizing this vision for Meltano, the Singer ecosystem, and the data integration space in general. We literally won’t be able to do it without you.

Before anything else, you’ll want to see what Meltano can already do today by following the examples on the homepage. They can be copy-pasted right onto your command line, and in a matter of minutes will take you all the way through installation, integration, transformation, and orchestration with the tap-gitlab extractor and target-jsonl and target-postgres loaders.
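
If you would like a feel for the shape of those steps before clicking through, here is a compressed, hedged version (the project name is a placeholder, and tap-gitlab needs a little configuration before the pipeline run will succeed):

```
pip3 install meltano
meltano init my-meltano-project && cd my-meltano-project

meltano add extractor tap-gitlab
meltano add loader target-jsonl

# Configure the extractor (available settings: `meltano config tap-gitlab list`):
meltano config tap-gitlab set projects meltano/meltano
meltano config tap-gitlab set start_date 2020-05-01T00:00:00Z

# Run the pipeline:
meltano elt tap-gitlab target-jsonl
```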

Once you’ve got that working, you’ll probably want to try Meltano with a different, more realistic data source and destination combination, which will require you to add a new extractor (Singer tap) and/or loader (Singer target) to your Meltano project. To learn how to do this, the homepage once again has got you covered.
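
For a rough idea of what that involves (hedged; the plugin names are illustrative), Meltano can wrap any Singer tap or target as a custom plugin and will prompt you for the details it needs:

```
# Add a Singer tap Meltano doesn't know about yet as a custom extractor;
# you'll be prompted for its namespace, pip_url, executable, capabilities, and settings:
meltano add --custom extractor tap-covid-19

# Custom loaders work the same way:
meltano add --custom loader target-bigquery
```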

And that’s about as far as you’ll be able to get right now, with Meltano’s existing tooling and documentation. Running a Meltano pipeline locally (with or without Airflow) is one thing, but actually deploying one to production is another. As we’ve identified, this is one of the places where the Singer ecosystem and documentation currently fall short, and for the moment, Meltano is no different.

For this reason, the first people we would love to get involved with the Meltano project are those who are already part of the Singer community, and in particular those who have already managed to get Singer-based ELT pipelines running in production. We want to make it so that all future Singer community members and Meltano users will be able to accomplish what they did, and no one knows better what that will take (and how close or far off Meltano currently is) than they do.

If you’re one of these people, or simply anyone with similarly relevant feedback, ideas, or experience, we’d love it if you would reach out and get involved.

I can’t wait to see what we’ll be able to accomplish together.

See you soon on Slack or GitLab.com!

Revisiting the Meltano strategy: a return to our roots

This is part 1 of a 2-part series to announce and provide context on the new direction of Meltano.

If you’ve been following Meltano for a while or would like to have some historical context, start here.

If you’re new to Meltano or are mostly interested in what’s coming, feel free to skip this post and start with part 2: Why we are building an open source platform for ELT pipelines.

If you’re worried that reading these entire posts will take a lot of time, feel free to jump right to the conclusion of part 2: Where Meltano fits in.

Background

Meltano, originally called BizOps, was founded inside GitLab about 2 years ago to serve the GitLab Data Team. The goal was to build a complete open source solution for the entire data lifecycle from extraction to dashboarding that would allow data engineers, analytics engineers, analysts, data scientists, and business end-users looking for insights to come together and collaborate within the context of a single version controlled Meltano project.

At its core, the project would specify a collection of plugins, one or more for every stage of the data lifecycle, that would describe how data should be extracted, loaded, and transformed, how pipelines should be orchestrated, and how the data should ultimately be modeled for analysis. Meltano would be used in the context of such a project, install and manage the plugins, take care of tying it all together, and offer a visual interface for point-and-click analytics and report and dashboard creation.

Plugins for common data sources and use cases could be shared using Git and collaborated on by the community, to be added to any team’s Meltano project with just a few keystrokes, while team-specific plugins could be implemented and stored right inside the Meltano project repository. As a result, anyone with access to the project (or a set of community-built plugins) would be able to go from data to dashboard in a matter of minutes. Pretty intriguing, right?

For a while, the Meltano team was making great progress through its close collaboration with the GitLab Data Team, until it became clear that GitLab’s and the Data Team’s needs were growing at a pace that the Meltano team would not be able to keep up with, at which point the decision was made for the Data Team to switch to a more traditional stack, so that its results would not be adversely affected by its dependence on a tool that was only just getting off the ground.

That didn’t deter the Meltano team, however, who continued to work to realize the end-to-end vision. Because the different users involved in the different stages of the journey—from extraction to analysis—have different needs and skills, and since we could see Meltano bringing value to data teams with engineers on staff, as well as those without, the decision was made to develop the Meltano CLI and UI in parallel, so that Meltano could serve both technical and non-technical users.

Since every next stage of the data lifecycle depends on the result of the one that came before it (we’re talking about data pipelines after all), our prioritization process had a cyclical nature. We would start with the first stage (extraction and loading), and keep iterating on Meltano’s capabilities in that area right up to the point where they would unlock further improvements in the next stage (transformation). Then we’d move on and direct our focus there, until it once again became time to move on to the next stage (orchestration), and so on. Once the final stage (analysis) was reached, we would go all the way back to the first stage and the journey through the stages would start again.

Version 1 and the startup founder persona

In October 2019, we released Meltano Version 1, which marked the completion of one such journey through the stages, and the realization of the end-to-end vision in principle. Assuming a data warehouse or simple PostgreSQL database had already been set up, you could now install Meltano on your local machine, initialize a project, spin up the web-based UI, install a data source (which adds the relevant extractor, loader, transformation, and model plugins to your project), enter your connection details, hit the Connect button, and find yourself looking at reports and dashboards visualizing your data a few minutes later. A most impressive demo, if you ask me!

Some early feedback on Meltano had challenged its very premise, suggesting that data was simply too company-specific and that a one-size-fits-most solution would not be feasible. What we accomplished, though, was exactly that: we had managed to build an actual end-to-end tool for the data lifecycle that could benefit any team or individual using one or more of the supported data sources.

In spite of our aspirations, though, we had not yet managed to attract an open source community to build this tool (and its plugins!) with us, and while there had been a lot of interest in the project from the beginning, no data teams were jumping at the chance to replace their existing stack with Meltano yet. While we had certainly proven the concept, we had not yet gotten it to a place where the value it could bring was significant or obvious enough that new or existing data teams would actually consider it an alternative to more traditional stacks.

To address that, we decided to focus on the user who ultimately gets the most value out of any data pipeline: the end-user who consumes the reports, has an insight, and then uses it to improve the way their business is run. Specifically, we picked a persona for whom the no-coding-needed, batteries-included, end-to-end nature of Meltano would be a significant selling point: the non-technical startup founder. They may not have a data stack at all yet, but are very likely to be using a bunch of common SaaS tools to run their business, which means a lot of data is being collected that they could (and should!) be learning from.

What this meant was going all-in on turning Meltano into a UI-first analytics and dashboarding tool with built-in support for the data sources most commonly used at early-stage startups, which could be connected with a click and would come with a set of default dashboards and reports to save the user from having to build these all from scratch.

Since this user could not be expected to be comfortable installing a Python package or setting up a PostgreSQL database, we decided to make installation as easy as clicking a button, by offering hosted Meltano instances.

Similarly, these users could not be expected to implement extractors, transformations, or models, so we took on this responsibility ourselves. We were, after all, attempting to prove the value that Meltano could bring by showing that, assuming the plugins exist, it is really powerful to have a single tool take care of everything from extraction to analysis, so that an end-user can forget about the nitty-gritty details and simply go straight from connecting their data source to viewing a dashboard.

In doing so, we effectively put ourselves into the shoes of the data and analytics engineers on a data team who would be tasked with maintaining and deploying a Meltano project and writing Meltano plugins (extractors, loaders, transformations, and models), while the end-user took the role of, well, the end-user looking for insights. This directly exposed us to all of the challenges one might face when using Meltano in one of these roles, and we were able to make many improvements to the tooling and user experience because of it.

Working closely with users of Meltano’s analytics functionality also let us iterate heavily on the UI and better tailor it to people not already intimately familiar with data pipelines. We wrapped the Extract, Load, and Transform concepts up into “Connections”, and introduced a new “dashboard” plugin type that allowed us to build default reports and dashboards that would be installed automatically along with a data source’s extractor, transformations, and models.

This past March, about 6 months into this adventure, we found ourselves in a much better place with Meltano’s analytics UI, and were about to embark on a new effort to not just ship transformations, models and default dashboards and reports for individual data sources, but for combinations of data sources as well, since it’s from the combinations of different but related data sets that the most interesting insights will often arise. Meltano already supported this in principle, meaning that more technical users setting up Meltano and building plugins themselves could figure it out, but the challenge would be in allowing these connections to be made through the UI, and offering our non-technical end-user out-of-the-box support for all reasonable combinations of data sources that they had connected.

Since the value that hosted Meltano could bring was a function of both the actual functionality the Meltano tool offered, and the extent of its out-of-the-box data source support, improving the experience for our non-technical end-users would come down to us taking on the significant burden of developing and maintaining all of the plugins they would ever want to use, ourselves.

By then, the 6-month experiment had certainly proven the point that “Meltano can bring significant value to non-technical end-users as an integrated tool to go from data to dashboard”, and from that perspective should be seen as a great success. It had also made us realize, however, that as a way of demonstrating the value of Meltano to the data teams that might one day adopt it and offer its analytics interface to their own non-technical end-users, the approach of simply compensating for its current lack of data source support by trying to implement all of the plugins ourselves was not going to scale.

We had built a pretty useful hosted analytics tool for startup founders looking to learn from their sales funnel performance, and we had learned a lot along the way, but continuing further on this path would not bring us closer to our overarching goal of building an end-to-end tool for all data teams and all data-related use cases.

So what’s next, then?

If the non-technical user experience of Meltano will only ever be as good as the quantity and quality of its plugins, we need to get more people involved in writing them, so it’s time to pivot back and return our focus to open source, the self-hosted experience, and the technical end-user!

But wasn’t that exactly what we had been doing for the first year and a half of Meltano’s life, while we were working towards Version 1, before the pivot to the startup founder persona?

What we had been doing then had worked in at least one significant way: we managed to build an actual end-to-end tool for the data lifecycle! But while that end-to-end vision had gotten many excited, that by itself did not ultimately turn out to be enough to attract a community of active users and contributors to Meltano and its plugins.

If we wanted to do better on that front this time around, we would have to come up with a new strategy.

To figure out what that could look like, and where Meltano might go from here, I’ve spent much of the past few weeks diving deep into and talking with various people (inside and outside of GitLab) about:

  1. Meltano’s history: What was the original vision? How was it presented? What resonated with people? What got some of them to contribute? What led to most of them ultimately losing interest?
  2. The state of the industry: How do data teams currently solve these problems? What tools do they use for data integration, analysis and everything in between? How do these tools stack up against each other?
  3. Meltano’s opportunity: Where does it fit in the space? Where can it bring the most value, both in terms of what it can actually already do today, and what it might be able to do tomorrow?

The most significant conclusion came pretty quickly: As an open source project, Meltano’s scope is simply too broad and ambitious.

Open source and scope

At first glance, it may seem like a broader scope would be a good thing: if the project intends to solve more different problems for more different people, it’ll get more people excited, and therefore more people will contribute, right?

If only it were that simple. Excited people do not always convert into users and contributors, and there’s a difference between being excited about a vision and the prospect of one day getting to use the tool, and being excited about contributing to actually make that vision a reality.

To get an individual to invest their time and energy in anything, including an open source project, they need to feel like they’ll get something out of it. The reward doesn’t need to be monetary or even tangible, and it doesn’t need to arrive right away, but no one does anything for no reason at all. In open source, the reward is typically a solution to a problem an individual is facing, and the investment is them making changes to a project that almost solves their problem, to get it closer to actually solving their problem, even if it may still require additional changes after that. Many people contribute to open source projects for ideological reasons or simply because they enjoy it, not because they’re facing a specific problem right at that moment, but they still wouldn’t pick a project that they could never see themselves benefiting from in any way.

Of course, how much you’d be willing to invest depends on the size of the reward (how much you care about solving the problem), the size of the investment (the amount of changes you’d need to make to the project to have a meaningful impact), your confidence that the investment would actually eventually lead to the reward (how close the project already is to solving the problem, and whether your changes would get it all the way there or if future contributions would still be necessary), and how long you could expect it to take for the reward to actually arrive once you’ve put in your investment.

And that’s where a broad and ambitious scope can hurt: if you asked an entire data team to evaluate whether contributing to Meltano would be a good investment, they might say yes (as the GitLab Data Team did!), because they can look at the vision and see how impactful Meltano could be to their team’s productivity and effectiveness in the future, even if it would still take a while for that vision to actually be realized and for the investment to pay off.

If you had separated the team and asked the data engineer(s), analytics engineer(s), and analyst(s) individually, though, their evaluation of the situation would have looked quite different, and it’s not so clear any of them would have ended up motivated to start contributing at all. The project doesn’t seem to be focused on solving a problem they are personally facing, any hypothetical value they would get out of the end-to-end vision seems very far away and dependent on unclear external factors like “would my company ever actually consider migrating to this tool?”, and they would need to contribute a significant amount of changes before they would feel like they are actually having a meaningful impact on moving the needle in the direction of that final reward.

Hence, Meltano’s scope is simply too broad and ambitious to attract open source contributors in any significant numbers at this relatively early stage. So, if we can’t go straight for end-to-end, where do we start?

Meltano’s open source opportunity

The plugins that will make Meltano’s end-to-end vision come true for any given data source will have to be written in order, by people with different roles and skillsets. So if we want to make that vision a reality eventually, we have to start with data engineers writing extractors and loaders, so that analytics engineers can later write transformations and models, so that data analysts and business end-users can then use the models to create reports and dashboards and gain better insights.

The hope is that if we build a community around open source data integration and get data engineers to collaborate on extractors and loaders, analytics engineers who come across the project will be empowered to also write transformations because dbt is supported out of the box. Once Meltano ships with extractors and transformations for various data sources, analytics engineers and analysts would also be motivated to give its built-in analytics functionality a try and write some models, and once those are done for a handful of data sources, we’re ready for the non-technical end-user who won’t need to contribute at all, because the plugins they need will have been written already.

All of this is to say that while the end-to-end vision of Meltano has been realized in principle and is still our long-term aspiration, getting there requires us to not be distracted by it in the short-to-medium term. Instead, we will go all-in on positioning Meltano as an open source self-hosted platform for running data integration and transformation (ELT) pipelines, trusting that if it gains traction in this area, the rest will follow in due time.

The goal is to turn Meltano into a true open source alternative to existing proprietary hosted ELT solutions like Alooma, Blendo, Hevo, Matillion, Pentaho, and Xplenty, in terms of ease of use, reliability, and quantity and quality of supported data sources, since without it, an open source tool for the entire data lifecycle can never exist.

And everything that’s old is new again, since GitLab Staff Data Engineer Taylor Murphy suggested pretty much exactly this not too long after the project was originally founded:

From my perspective, given what Meltano can be, the real opportunities for building a community around Meltano are initially around Extractors and Loaders and eventually Meltano Analyze. dbt already has quite a community and working well with them would be a quick way to get more people excited.

With all of the benefits of hindsight, I couldn’t agree more.

To learn why the opportunity to build an open source platform for ELT pipelines gets us just as excited as we have been about the end-to-end vision up to this point, check out part 2:

Why we are building an open source platform for ELT pipelines

Now Available: Meltano v1.32.0

Today, we are excited to release Meltano version 1.32.0, which contains a couple of bug fixes and other improvements related to the Meltano CLI.

We ran into some of these ourselves while working on turning the homepage into a code sample-rich getting started guide for the CLI. Like most of the changes made over the past few weeks, this is related to our recent decision to go all-in on Meltano as an open source self-hosted data pipeline platform. Stay tuned for further details and context in a pair of blog posts later this week!

Specifically, version 1.32.0 contains the following changes relative to 1.31.0:

New

  • #2019 Ask for setting names when adding a new custom plugin

Changes

  • #2011 Make tap-gitlab private_token setting optional for easier extraction of public groups and projects
  • #2012 Add target-jsonl loader

Fixes

  • #2010 Fix bug causing dot-separated config keys to not be nested in generated tap or target config
  • #2020 Fix bug that caused meltano select to add select option to every plugin in meltano.yml instead of just the specified one
  • #2021 Only ask for capabilities when adding a custom extractor, not a loader or other plugin

Now Available: Meltano v1.31.0

Today, we are excited to release Meltano version 1.31.0, which contains the first set of changes related to our recent decision to go all-in on Meltano as an open source self-hosted data integration pipeline platform and direct our focus away from the non-technical startup founder persona, the sales analytics use case, and the UI as the primary means of interaction with Meltano. Expect further details and context in a blog post later this week!

Specifically, version 1.31.0 contains the following changes relative to 1.30.1:

Changes

  • #1987 Restore GitLab and Zendesk data sources in UI
  • #2005 Add “Don’t see your data source here?” option in UI
  • #2008 Clarify that pipelines UI only supports target-postgres
  • #2007 Don’t install airflow, dbt and target-postgres by default as part of ‘meltano init’
  • #2007 Only run ‘airflow scheduler’ as part of ‘meltano ui’ when airflow is installed
  • #2007 Install airflow, dbt, and target-postgres on DigitalOcean images

Now Available: Meltano v1.30.1

Every day, users like yourself share with us how Meltano could be more useful to them and more valuable to their business, and every day, we work hard to address this feedback and improve the Meltano experience. Twice a week, on Monday and on Thursday, we then bundle these improvements up and release them to all of our users in a new version of the Meltano application.


Today, we are excited to release Meltano version 1.30.1, which fixes a bug that prevented `meltano init` from successfully installing the `airflow` plugin, because it is incompatible with the latest version of one of its dependencies.

For a complete list of changes, scroll down!

Excited to try it out?

If you’re using a hosted Meltano instance, the new version and improvements will be available automatically. If you’re self-hosting, you can upgrade manually.

If you’re not using Meltano yet, sign up for a hosted instance now, and go from data to dashboard in minutes!

What’s new?

The list below (copied from the changelog) covers all of the changes made to Meltano since this Monday’s v1.30.0 release:

Fixes

  • #1985 Fix bug causing Airflow to install WTForms 2.3.0 instead of 2.2.1, which it is incompatible with

Now Available: Meltano v1.30.0

Every day, users like yourself share with us how Meltano could be more useful to them and more valuable to their business, and every day, we work hard to address this feedback and improve the Meltano experience. Twice a week, on Monday and on Thursday, we then bundle these improvements up and release them to all of our users in a new version of the Meltano application.


This week is no different, and today marks the release of Meltano version 1.30.0!

Check out the video below to learn more about the most exciting features added since last week’s v1.29.0 release:

Excited to try it out?

If you’re using a hosted Meltano instance, the new version and improvements will be available automatically. If you’re self-hosting, you can upgrade manually.

If you’re not using Meltano yet, sign up for a hosted instance now, and go from data to dashboard in minutes!

But wait, there’s more!

The list below (copied from the changelog) covers all of the changes that were made to Meltano since last week’s v1.29.0 release:

New

  • #1953 Show design attribute descriptions in tooltips in report builder
  • #1787 Show Shopify extractor in UI
  • #1948 Show Intercom button in bottom right on MeltanoData.com instances
  • #1930 Add button to remove report from dashboard when editing dashboard
  • #1845 Add button to delete report to report builder interface
  • #1849 Add button to rename report to report builder interface
  • #1951 Add button to edit dashboard name and description to dashboard page

Changes

  • !1611 Only show design description if it is different from design label
  • !1607 Move date range picker into results area of report builder interface
  • !1608 Make report title more prominent in report builder