Kickstart Your 2023 with these 6 Articles – The Meltano Team's Favorite Data Articles of 2022
Data and Analytics


A curated list of the top 6 must-read blogs on data.

The data world is in constant motion, and exciting things happen every day, week, and year.

At Meltano, we're avid users of data ourselves: data engineers, data PMs, and data enthusiasts through and through.

At the end of 2022 we decided to collect the blogs we enjoyed the most over the year.

Happy reading!

Contracts in Data

The Contract-Powered Data Platform by Jake Thomas

Between 6 River Systems and CarGurus, a very significant amount of my time over the past five years has been dedicated to data platform automation, reducing cross-team friction, and improving data quality.

Schemas have played a critical role in the process; this post outlines the why and the how. […]

  • Schemas empower the “producer” <-> “consumer” relationship […]
  • Schemas are data discovery […]
  • Schemas power data validation in transit […]
  • Schemas help stop bad instrumentation from being implemented in the first place […]
  • Schemas improve code quality […]
  • Schemas power automation […]
  • Schemas as observability […]
  • Schemas power compliance-oriented requirements […]
  • Schemas are the foundation of higher-order data models […]
  • Schemas are the foundation of data products […]
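To make the "schemas power data validation in transit" point concrete, here is a minimal sketch of producer-declared schema checking. The schema format, field names, and `validate` helper are all hypothetical illustrations, not Jake Thomas's actual implementation:

```python
# Hypothetical event schema declared by the producer.
EVENT_SCHEMA = {
    "user_id": int,
    "action": str,
    "timestamp": float,
}

def validate(event: dict, schema: dict) -> list[str]:
    """Check an event against its schema; return a list of violations."""
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(event[field]).__name__}")
    for field in event:  # consumers never see fields they didn't agree to
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors

good = {"user_id": 1, "action": "click", "timestamp": 1672531200.0}
bad = {"user_id": "1", "action": "click"}  # wrong type, missing timestamp

print(validate(good, EVENT_SCHEMA))  # []
print(validate(bad, EVENT_SCHEMA))
```

The same declared schema then doubles as documentation (discovery), a CI gate against bad instrumentation, and an input to automation, which is exactly the leverage the list above describes.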

Unbreakable Data

The day you stopped breaking your data by Gleb Mezhanskiy

“In the recent edition of the Analytics Engineering Roundup, dbt Labs CEO Tristan Handy writes, “Rather than building systems that detect and alert on breakages, build systems that don’t break.”

Every vendor that has something to do with data quality will tell you that it’s precisely their solution that will finally deliver you this peace of mind. It won’t – at least not exclusively on its own and not entirely. Data quality is a complex problem that involves technologies, tools, processes, culture, and cosmic rays.

But we can certainly do better, especially in the area of preventing data quality issues from occurring in the first place.


Data breaks for two reasons

Either we break it

or they break it

It’s that simple.”

Related reading: Interfaces and breaking stuff – Analytics Engineering Round Up by Tristan Handy.

SQL Jinja isn’t enough

SQL + Jinja isn’t enough – why we need dataframes by Furcy Pin

“In October 2021, Max Beauchemin, creator of Apache Airflow, wrote an excellent article about 12 trends that are currently affecting data engineers. One of them was beautifully named “Mountains of Templated SQL and YAML”, and it really echoed with my own perception. He compared the SQL + Jinja approach to the early PHP era… […]

“If you take the dataframe-centric approach, you have much more “proper” objects, and programmatic abstractions and semantics around datasets, columns, and transformations.

This is very different from the SQL+jinja approach where we’re essentially juggling with pieces of SQL code as a collage of strings”

So I started an open-source POC in Python to illustrate that point, and in particular, to demonstrate how much further the dataframe-centric approach can bring us.”
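The contrast can be sketched in a few lines. Below, `string.Template` stands in for Jinja and a plain list of dicts stands in for a real dataframe (Spark, pandas, or Furcy Pin's POC), just to keep the sketch dependency-free; the `select`/`where` helpers are hypothetical:

```python
# SQL + Jinja style: the transformation is a collage of strings.
from string import Template  # standing in for Jinja

sql_template = Template("SELECT $columns FROM events WHERE action = '$action'")
query = sql_template.substitute(columns="user_id, ts", action="click")
# The column list is just text: a typo only surfaces when the warehouse runs it.

# Dataframe-centric style: columns and transformations are real objects.
events = [
    {"user_id": 1, "ts": 10, "action": "click"},
    {"user_id": 2, "ts": 20, "action": "view"},
]

def where(rows, predicate):
    return [r for r in rows if predicate(r)]

def select(rows, *columns):
    return [{c: r[c] for c in columns} for r in rows]  # KeyError on a typo, immediately

clicks = select(where(events, lambda r: r["action"] == "click"), "user_id", "ts")
print(clicks)  # [{'user_id': 1, 'ts': 10}]
```

Because `where` and `select` are functions rather than string fragments, they can be composed, unit-tested, and type-checked, which is the "proper objects and programmatic abstractions" point Beauchemin makes.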

Related reading: How The Modern Data Stack is Reshaping Data Engineering by Maxime Beauchemin.

Data Engineers & SaaS

The future history of Data Engineering by Matt Arderne

“The core premise of this post is:

Most businesses’ data engineering needs have been solved or will shortly be solved by managed services that 10 years ago would require endless and extensive self-built ETL pipelines, databases and tools.

For the exceeding majority of businesses, this means they can and should focus on building capacity for business logic, analysis and predictions instead of data engineering.”

Completely Versioned Data Stacks

Modern Data Stacks in a Box with DuckDB by Jacob Matson

“Why build a bundled Modern Data Stack on a single machine, rather than on multiple machines and on a data warehouse? There are many advantages!

  • Simplify for higher developer productivity
  • Reduce costs by removing the data warehouse
  • Deploy with ease either locally, on-premise, in the cloud, or all 3
  • Eliminate software expenses with a fully free and open-source stack
  • Maintain high performance with modern software like DuckDB and increasingly powerful single-node compute instances
  • Achieve self-sufficiency by completing an end-to-end proof of concept on your laptop
  • Enable development best practices by integrating with GitHub
  • Enhance security by (optionally) running entirely locally or on-premise


this approach is more of an “Open Source Analytics Stack in a box” than a traditional MDS. It sacrifices infinite scale for significant simplification and the other benefits above.”
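The whole pattern fits in one script: extract, load, and transform against an embedded database on a single machine. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs with no external dependencies (with DuckDB installed, `duckdb.connect()` works much the same way); the table and view names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the entire "warehouse" lives locally

# Extract & load: raw source data lands in a staging table.
# Amounts are in cents to keep the arithmetic exact.
con.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 999, "paid"), (2, 500, "refunded"), (3, 2000, "paid")],
)

# Transform: a model defined as a view, dbt-style.
con.execute("""
    CREATE VIEW revenue AS
    SELECT SUM(amount_cents) AS total_cents
    FROM raw_orders
    WHERE status = 'paid'
""")

total = con.execute("SELECT total_cents FROM revenue").fetchone()[0]
print(total)  # 2999
```

No cluster, no warehouse bill, and the whole thing versions cleanly in git, which is the simplification the list above is trading infinite scale for.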

Want to keep reading more Meltano team picks? We've got a few more.

Bonus picks

Just picking 5 is so cruel for voracious readers! We had so many more fun reads this year; if you want more, here are a few the Meltano team also enjoyed. Taylor, our Head of Product, threw in a bunch that aren't directly data-related, but their implications for the data world are still big:

