Joining the team: Taylor Murphy shares why he’s excited

Today is my first official day working on the Meltano team! In early February 2021 I announced on Twitter that I would be leaving the GitLab Data Team, where I’ve worked for the past 3 years, and joining Meltano. A short thread accompanied the announcement, and I promised to expand on it. I want to dive deeper into a few points from that thread and share more about why I’m excited about the project and the role.

In May of 2020 Douwe shared the new focus of Meltano: make the power of data integration available to all by turning Meltano into a true open source alternative to existing proprietary hosted ELT solutions. He directly cited my comment from 2018 where I indicated this was a potential avenue Meltano could take to build a real community. (Let’s ignore the part where I said we should create our own Meltano spec! I’m now fully aligned with our embrace of the Singer Spec.)

This renewed focus on being an excellent way to run open source extractors and loaders is part of why I’m so excited. Much of what makes communities like dbt and Airflow great is that they are open source tools that people are using to get real work done. For dbt in particular, the ability to network within the community, achieve real results, and contribute back to the project are part of why it has seen such success.

Meltano can be an excellent tool within the data space for many years to come and I want to share why I believe that.

Open Source, Always

GitLab is an open core company that has a free open source (FOSS) version of its product and is committed to keeping it free forever (see the stewardship section of the handbook for more information). This model has served it well and is something we will emulate with Meltano. As Douwe stated in part 2 of the pivot blog posts: “there will always be a Community Edition, and data integration will forever be a commodity rather than pay to play.” This is a critical aspect of our mission and strategy, and it makes Meltano an exciting place to work.

I also believe that open source is essential to actually achieving our vision since one company can’t possibly manage and maintain the entire world of extractor and loader combinations. This reality leads to the next reason why I’m excited about the future of Meltano.

Facing the Challenge

Meltano is also an exciting project because it’s tackling a hard problem. As I said in a tweet a while ago, much of data engineering is digital plumbing – it’s critical work that often goes unnoticed and underappreciated until something breaks. It seems like it shouldn’t be as difficult as it is, but there is a long tail of data sources and data problems that are each slightly different.

Solving hard problems that not a lot of people want to touch has been a through-line in my career and it’s something that excites me. The process can be arduous but the feeling of accomplishment from crafting great solutions to challenging problems that help real people is immensely fulfilling. And as Meltano grows I think the scope of data problems it solves can continue to grow as well (keep in mind Meltano does a lot more out of the box than just extraction and loading even if we aren’t focused on that from a development perspective right now!).

But if we had to solve these problems alone then I don’t think I would be as excited. We can’t meet the needs of everyone if everything is closed source and internal only. Open source is part of the solution to the challenge, but so is collaborating with a large network of data professionals.

Building a Community

The data community has been wonderful for me personally and professionally. I’ve met and connected with so many wonderful people via the dbt and Locally Optimistic communities (and Twitter!) that I never would have known had I just focused on my local Nashville community. It’s also enabled me to share my work from the GitLab Data Team in a transparent way that has helped numerous people in the community. There’s no better feeling than being able to answer someone’s question in Slack with the answer “here’s how we did it and here’s a link to our code – let me know if you have more questions!”. 

I want Meltano to build on these communities with a focus on open source data integration. We see the project as working very well with dbt and other open source tools and I know it can be a great place for data engineers and other data professionals to collaborate on solving real problems. And it will help address that long (infinite) tail of data integration challenges by crowdsourcing the efforts via open source software.

I also share Douwe’s belief that Meltano can make world-class data engineering more inclusive of a broader audience. The “standard” data stack of Fivetran, Airflow, Snowflake, Census, and Looker can be expensive and much of it is pay to play. Much like how GitLab enables people to have a world-class DevOps experience with a free product, Meltano should do the same for DataOps.

Elevation of the Data Profession

I recently gave a talk with the inimitable Emilie Schario at Coalesce 2020. The gist of it was that Data Teams should view themselves as building a Data Product that serves their entire organization. This view is born out of the deep belief that most organizations aren’t realizing the full value of their data and that to close that gap a rethinking of how teams work is needed. 

I see Meltano as being able to help enable this vision. The data profession and lifecycle tool chain is still in its early days with companies and startups everywhere getting early funding (dbt, Census, Airbyte, Hightouch, Fivetran, etc.). Many of these companies are addressing real pain points in the modern tool chain and will probably be successful businesses. But a proliferation of closed-source point solutions won’t enable the next level of data awareness that is required to close the data utility gap. None of these tools have the community or broad vision that I believe Meltano can and will have.

I want Meltano to follow the GitLab model: be truly excellent at a core piece of technology and then broaden to include more of the workflow in a way that makes sense for the community (and eventually customers). Meltano will be excellent at running Singer extractors and loaders while also enabling people to easily build their own (see our SDK effort). This will be the core of what Meltano does well.

The next level though is having a meta understanding of how data is flowing throughout your entire Data Product. It’s great that you can connect Fivetran to Snowflake and Snowflake to Census (this blog post is a compelling discussion on how they close the loop). But there is a not-so-hidden challenge in that diagram: it is difficult to have a holistic understanding of what’s happening throughout the stack. A single, metadata-focused view of how data is flowing requires a lot of data engineering work not pictured. While Meltano doesn’t solve this now, we’re building the tool in such a way that we can address this challenge better without requiring you to check multiple tools or manage the metadata yourself.

While good tools don’t solve all of your organizational challenges with data, I believe great tools can go a long way towards improving the current state of data operations. 

Looking to the Future

I believe the next 10 years of the data profession have so much to offer and I want to be a part of building the tooling for that journey. Meltano can be a central player in the data community, helping people build their careers and get real work done. GitLab has shown the power of building a true community and enabling collaboration on a scale never before seen. It’s also built an amazing business that enables it to support the open source project and drive digital transformation in thousands of organizations.

Meltano will do the same, but first we want to focus on building a strong community. A strong community means:

  • It’s easy to get up and running with Meltano to get real work done
  • There are real people working on the project who are kind and helpful (Douwe, myself, and 1 more soon!)
  • There are tools that exist to make it easy to contribute to the project and to build new taps and targets (see our Singer SDK effort)

We’re making great progress on these fronts and will continue to expand our efforts as the community grows. 

There are also some additional questions that I’m excited to be thinking about:

  • How do we enable the metadata-first view on Meltano so that data about your data flow is easy to use? (See this issue if you have thoughts!)
  • How can we help build trust in community taps and targets with an open testing and validation framework, with the goal of having a central place to learn about the behavior, supported features, and maintenance status of all taps and targets in the ecosystem? 
    • Airbyte’s connector health page is an example of what I’d like to see, but we’d need to implement it in a way that’s open to the community. See this issue to add your thoughts!
  • How do we build Meltano as a business in a way that’s a win-win for everyone?

And more! I don’t have the full answers and we need the community to help us answer these questions. I’m excited to continue bringing the GitLab values to the project and build upon the great foundation that Douwe and many others have started. 

Questions or feedback? Come say hi to me on our Slack or on Twitter, or open an issue in the Meltano project. 

Now Available: Meltano v1.70.0

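Today, we are excited to release Meltano version 1.70.0, which adds hotgluexyz variants of tap-chargebee and tap-intacct, disallows two pipelines with the same job ID from running at the same time by default, and fixes a bug with finding a schedule based on namespace for a custom plugin.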

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.69.0 on February 16:

New

  • #2590 Add hotgluexyz variant of tap-chargebee
  • #2593 Add hotgluexyz variant of tap-intacct

Changes

  • #2356 Disallow two pipelines with the same job ID to run at the same time by default

Fixes

  • #2585 Fix bug with finding a schedule based on namespace for a custom plugin

Now Available: Meltano v1.69.0

Today, we are excited to release Meltano version 1.69.0, which adds out-of-the-box support for the Quickbooks source (thanks Hassan Syyid of Hotglue!) and adds support for Airflow 2 (thanks Michel Radosavljevic!).

You can add tap-quickbooks to your project using meltano add:

meltano add extractor tap-quickbooks

Airflow 2 is not the default yet, but you can use it in your project by adding the following to your meltano.yml project file (or modifying pip_url in your existing entry for airflow):

orchestrators:
- name: airflow
  pip_url: apache-airflow==2.0.1 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.0.1/constraints-3.6.txt

Change 3.6 to 3.7 or 3.8 in accordance with your Python version. Then run meltano install orchestrator airflow to install.
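If you maintain projects on several Python versions, the pip_url value can be derived instead of edited by hand. The following is a minimal sketch, not part of Meltano: the function name airflow_pip_url and the constants are hypothetical, and the supported versions are limited to the three named in this post.

```python
# Hypothetical helper, not part of Meltano: builds the pip_url string
# shown above for a given Python minor version.
AIRFLOW_VERSION = "2.0.1"
CONSTRAINTS_URL = (
    "https://raw.githubusercontent.com/apache/airflow/"
    "constraints-{airflow}/constraints-{python}.txt"
)
# Only the Python versions mentioned in this post are accepted.
SUPPORTED_PYTHONS = ("3.6", "3.7", "3.8")


def airflow_pip_url(python_version):
    """Return the pip_url for Airflow 2.0.1 pinned to the matching constraints file."""
    if python_version not in SUPPORTED_PYTHONS:
        raise ValueError(f"unsupported Python version: {python_version}")
    url = CONSTRAINTS_URL.format(airflow=AIRFLOW_VERSION, python=python_version)
    return f"apache-airflow=={AIRFLOW_VERSION} --constraint {url}"
```

Calling airflow_pip_url("3.8") yields the same string as the example above with 3.6 replaced by 3.8, which you can then paste into meltano.yml.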

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.68.0 on February 11:

New

  • #2558 Add support for Airflow 2.0
  • #2577 Add hotgluexyz variant of tap-quickbooks

Now Available: Meltano v1.68.0

Today, we are excited to release Meltano version 1.68.0, which adds support for entity/attribute selection to tap-gitlab (thanks Charles Julian Knight for contributing!) and bumps Airflow to version 1.10.14 (support for Airflow 2.0 is on the way!).

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.67.0 on January 26:

New

  • #2557 Add support for entity and attribute selection to tap-gitlab

Changes

  • #2559 Bump Airflow version to 1.10.14

Fixes

  • #2543 Fix package dependencies that claim Python 3.9 is supported when it actually isn’t.

Now Available: Meltano v1.67.0

Today, we are excited to release Meltano version 1.67.0, which fixes two bugs with meltano schedule run <name>: if the schedule’s meltano elt command fails with a nonzero exit code, it now does as well, and it no longer requires the meltano executable to be in the PATH.
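The exit code behavior can be illustrated with a small wrapper. This is a sketch of the general technique only, not Meltano’s implementation: run_wrapped is a hypothetical name, and the child command stands in for the schedule’s meltano elt invocation. Launching the child through an absolute interpreter path (sys.executable) is one way to avoid depending on PATH; the post does not say how Meltano itself solved this.

```python
import subprocess
import sys


def run_wrapped(command):
    """Run a child command and return its exit code unchanged.

    A wrapper that returned 0 regardless of the child's result would hide
    failures from whatever scheduler invoked it.
    """
    completed = subprocess.run(command)
    return completed.returncode


def failing_child(exit_code):
    """Build a stand-in child command (an assumption for illustration).

    Uses the absolute path of the current interpreter, so nothing has to
    be resolvable through PATH.
    """
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
```

A command-line entry point would then finish with sys.exit(run_wrapped(command)) so that the caller sees the same nonzero code as the wrapped process.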

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.66.0 on January 18:

Fixes

  • #2540 meltano schedule run exit code now matches exit code of wrapped meltano elt
  • #2525 meltano schedule run no longer requires meltano to be in the PATH

Building Meltano in Public: Bimonthly recap

Last week, it was once again my turn to host a GitLab Group Conversation (a publicly live streamed Q&A on the GitLab Unfiltered YouTube channel) on Meltano!

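I used the opportunity to share a recap of recent releases, community contributions, Slack activity, other recent and ongoing developments, and our current and upcoming priorities.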

If you’re curious, check out the presentation on Google Slides and the Q&A on YouTube. The presentation content is also reproduced below, as is an embedded video of the Q&A!

Group Conversation Presentation

7 releases since the last GC (2020-11-19)

  1. V1.59.0 makes sure that all meltano elt errors properly make it into the log file and UI and that meltano select --list prints entities and attributes in alphabetical order.
  2. V1.60.0 adopts Poetry for dependency and build management and hides settings of kind object or array in the UI.
  3. V1.61.0 adds out-of-the-box support for the BigQuery source using the tap-bigquery extractor, adds a new meltano schedule run <name> command to easily run a scheduled pipeline by name, shows array and object settings in the plugin configuration UI as unsupported, and fixes the “Run Now” button in the pipelines UI to take into account the schedule’s overridden environment variables.
  4. V1.62.0 introduces plugin inheritance to let you have multiple configurations of the same package in your project at the same time, in the form of separate plugins that inherit their base plugin (package) description from an existing plugin and can then override (parts of) the inherited configuration.
  5. V1.63.0 automatically retries failed connections to the system database, and lets you tweak this behavior using new database_max_retries (default: 3) and database_retry_timeout (default: 5 seconds) settings.
  6. V1.64.0 fixes runaway memory consumption when an extractor outputs records at a much higher rate than the loader can process them, by enabling flow control with a 64KB buffer size limit. 
  7. V1.65.0 lets you tweak the size of the buffer that stores records output by the extractor (tap) while they are waiting to be processed by the loader (target), using a new elt.buffer_size setting.

(Earlier today, we also released V1.66.0!)

17 recent contributions by 11 community members

Done

  1. Enable `pool_pre_ping` for the project’s DB engine by Suyash Behera (Goldman Sachs)
  2. Update the ELT runner to explicitly log any error messages by Allan Whatmough (Run with AI)
  3. Added sorting for `meltano select --list --all` by Nil
  4. Added missing `mysql-logo.png` by Nil
  5. Adding Poetry for better dependency and build management by Tobias Macey (MIT)
  6. Make tap bigquery known to meltano by Niall Woodward (Tails.com)
  7. Adding pre-commit and linting configurations by Tobias Macey (MIT)
  8. Tap-salesforce: Pass is_sandbox to authenticator by Kevin Ford
  9. Files-dbt: Add bigquery profile by Daniel Pettersson (Smartr)
  10. Linting fixes and contributor guide update by Tobias Macey (MIT)
  11. Target-postgres: Fix non-null falsey values by Charles Julian Knight (FIXD)

In development

  1. Search and replace “entities” >> “streams”, “attributes” >> “properties” by AJ Steers (Slalom)
  2. Add pipx-based install instructions by AJ Steers (Slalom)
  3. Singer SDK: Accelerated tap development framework (v0.0.1-alpha) by AJ Steers (Slalom)
  4. Singer SDK: Initial mock-up for target-base by AJ Steers (Slalom)
  5. Added libpq required target-postgres dependency install instructions by Geoff Langenderfer
  6. Target-snowflake: Dynamic precision fix by Bryan Wise (Halosight)

Recent weekly Slack activity

As expected, activity dropped significantly over the holidays, but it’s steadily climbing back to our previous records of 157 “weekly active members” (Dec 9) and 30 “members who posted” (Nov 18).

Join us on Slack!

Other exciting recent and ongoing developments

  • AJ Steers (Slalom) is working on the Singer SDK: a new framework and set of tools for building high-quality Singer taps and targets
  • This quarter, we intend to hire 2 active contributors from the community onto the Meltano team at GitLab as Backend Engineers to work full-time on Meltano and related projects like the Singer SDK! If you’re interested, please reach out to project lead Douwe Maan on Slack.

This week’s priorities

Milestone issue board

Upcoming priorities

Milestones issue board

Group Conversation Q&A

Now Available: Meltano v1.66.0

Today, we are excited to release Meltano version 1.66.0, which (among other things) prevents pipelines from getting stuck in the “running” state forever when their meltano elt process is killed unceremoniously by the operating system or some other mechanism.

This is realized by automatically detecting stale runs in the system database and marking them as “failed” before meltano elt runs a new pipeline with the same Job ID, and before meltano schedule list lists the scheduled pipelines and their status (which meltano invoke airflow scheduler does periodically).

As long as a pipeline is running, meltano elt now records a heartbeat timestamp on the pipeline run row in the system database every second. A run is considered stale when 5 minutes have elapsed since the last recorded heartbeat. Older runs without a heartbeat are considered stale if they are still in the “running” state 24 hours after starting.
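The staleness rule described above can be written down as a small predicate. This is a minimal sketch under stated assumptions: the function name is_stale, its parameters, and the constant names are hypothetical and do not reflect Meltano’s actual schema or code; only the 5-minute and 24-hour thresholds come from this post.

```python
from datetime import datetime, timedelta

# Thresholds taken from the release notes; the names are hypothetical.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)
NO_HEARTBEAT_TIMEOUT = timedelta(hours=24)


def is_stale(state, started_at, last_heartbeat_at, now):
    """Return True if a pipeline run should be marked as failed.

    state: the run's recorded state, e.g. "running"
    started_at: datetime the run started
    last_heartbeat_at: datetime of the last heartbeat, or None for runs
        created before heartbeats were recorded
    now: the current datetime
    """
    if state != "running":
        return False
    if last_heartbeat_at is not None:
        # Runs with a heartbeat: stale once 5 minutes have elapsed since it.
        return now - last_heartbeat_at >= HEARTBEAT_TIMEOUT
    # Older runs without a heartbeat: stale 24 hours after starting.
    return now - started_at >= NO_HEARTBEAT_TIMEOUT
```

Passing now explicitly keeps the rule easy to test with fixed timestamps.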

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.65.0 on January 12:

New

  • #2483 Every second, meltano elt records a heartbeat timestamp on the pipeline run row in the system database as long as the pipeline is running.
  • #2483 Before running the new pipeline, meltano elt automatically marks runs with the same Job ID that have become stale as failed. A run is considered stale when 5 minutes have elapsed since the last recorded heartbeat. Older runs without a heartbeat are considered stale if they are still in the running state 24 hours after starting.
  • #2483 meltano schedule list (which is run periodically by meltano invoke airflow scheduler) automatically marks any stale run as failed.
  • #2502 Add User-Agent header with Meltano version to request for remote discovery.yml manifest (typically https://www.meltano.com/discovery.yml)
  • #2503 Include project ID in X-Project-ID header and project_id query param in request for remote discovery.yml manifest when send_anonymous_usage_stats setting is enabled.

Now Available: Meltano v1.65.0

Today, we are excited to release Meltano version 1.65.0, which lets you tweak the size of the buffer that stores records (and other messages) output by the extractor (tap) while they are waiting to be processed by the loader (target), using a new elt.buffer_size setting with a default value of 10MiB.

The length of a single line of extractor output is limited to half the buffer size, making the default maximum message size 5MiB.
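As a quick worked example of that relationship, the sketch below computes the maximum message size from a buffer size. The helper name max_message_size is hypothetical and only restates the arithmetic from this post; it is not Meltano code.

```python
MIB = 1024 * 1024

# Default value of the elt.buffer_size setting, per this post.
DEFAULT_BUFFER_SIZE = 10 * MIB


def max_message_size(buffer_size=DEFAULT_BUFFER_SIZE):
    """Longest single line of extractor output: half the buffer size."""
    return buffer_size // 2


def fits(line, buffer_size=DEFAULT_BUFFER_SIZE):
    """Check whether one serialized message (bytes) is within the limit."""
    return len(line) <= max_message_size(buffer_size)
```

So if an extractor can emit single records larger than 5MiB, elt.buffer_size needs to be at least twice the size of the largest expected message.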

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.64.0 on January 7:

New

  • #2392 Add ‘elt.buffer_size’ setting with default value of 10MiB to let extractor output buffer size and line length limit (maximum message size) be configured as appropriate for the extractor and loader in question.

Fixes

  • #2501 Don’t lose version when caching discovery.yml.

Now Available: Meltano v1.64.0 and v1.64.1

Today, we are excited to release Meltano version 1.64.0, which fixes runaway memory consumption when an extractor outputs records at a much higher rate than the loader can process them, by enabling flow control with a 64KB buffer size limit.

As a result of this bug, meltano elt pipelines composed of fast extractors and slow loaders would sometimes be terminated by the operating system before completing, to prevent the system from running out of memory entirely.
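The idea behind flow control can be sketched with a bounded queue between a fast producer and a slow consumer. This illustrates the general technique only and is not Meltano’s implementation: the names extractor, loader, and run_pipeline are hypothetical, and the bound here counts records, whereas the limit described in this post is measured in bytes.

```python
import asyncio


async def extractor(queue, records):
    """Fast producer: put() suspends while the queue is full (backpressure)."""
    for record in records:
        await queue.put(record)
    await queue.put(None)  # sentinel: no more records


async def loader(queue, loaded, stats):
    """Slow consumer: drains the queue until it sees the sentinel."""
    while True:
        stats["peak"] = max(stats["peak"], queue.qsize())
        record = await queue.get()
        if record is None:
            break
        await asyncio.sleep(0)  # yield control, standing in for slow work
        loaded.append(record)


async def run_pipeline(records, max_buffered):
    """Run both sides and return (loaded records, peak queue size)."""
    queue = asyncio.Queue(maxsize=max_buffered)
    loaded = []
    stats = {"peak": 0}
    await asyncio.gather(
        extractor(queue, records),
        loader(queue, loaded, stats),
    )
    return loaded, stats["peak"]
```

With an unbounded queue the extractor would run ahead and memory use would grow with the backlog. With maxsize set, the extractor is paused whenever the buffer is full, and waiting for both tasks to finish means every record still reaches the loader even when the extractor is done first.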

Making the buffer size (and the related Singer message length limit) configurable is being tracked in a separate issue that is also being worked on this week.


Shortly after v1.64.0 was released, Yordan Ivanov reported a new critical bug introduced by this “fix”: when the extractor finishes before the loader, not all messages (records) would actually make it to the loader, but meltano elt would finish successfully anyway. This has been fixed in Meltano version 1.64.1, released on January 8.

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.63.0 on January 4:

Fixes

  • #2478 Fix runaway memory usage (and possible out-of-memory error) when extractor outputs messages at higher rate than loader can process them, by enabling flow control with a 64KB buffer size limit

Now Available: Meltano v1.63.0

Today, we are excited to release Meltano version 1.63.0, which automatically retries failed connections to the system database, and lets you tweak this behavior using new database_max_retries (default: 3) and database_retry_timeout (default: 5 seconds) settings!

Special thanks go out to Suyash Behera for contributing this functionality.
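The retry behavior can be pictured roughly as follows. This is a minimal sketch, not the contributed implementation: connect_with_retries and its parameters are hypothetical, and it assumes that database_max_retries counts retries after the first failed attempt and that database_retry_timeout is the number of seconds to wait between attempts.

```python
import time


def connect_with_retries(connect, max_retries=3, retry_timeout=5.0, sleep=time.sleep):
    """Call connect() until it succeeds or the retries are used up.

    connect: zero-argument callable that returns a connection or raises
        ConnectionError
    max_retries: retries allowed after the first failed attempt (assumption)
    retry_timeout: seconds to wait between attempts (assumption)
    sleep: injectable for testing; defaults to time.sleep
    """
    retries_used = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            if retries_used >= max_retries:
                raise  # out of retries: surface the last error
            retries_used += 1
            sleep(retry_timeout)
```

Under these assumptions the defaults allow up to four connection attempts in total, spread over roughly 15 seconds of waiting, before the error is raised.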

Excited to try it out?

To upgrade Meltano and your Meltano project to the latest version, navigate to your project directory, activate the appropriate virtual environment, and run meltano upgrade. This will upgrade the meltano package and apply any necessary changes to your project.

What else is new?

The list below (copied from the changelog) covers all of the changes made to Meltano since the release of v1.62.0 on December 23:

New

  • #2308 Verify that system database connection is still viable when checking it out of connection pool.
  • #2308 Add database_max_retries and database_retry_timeout settings to configure retry attempts when the first connection to the DB fails.

Fixes

  • #2486 Remove state capability from tap-google-analytics because it’s not currently supported