Why Data Engineers Need DataOps
The last few years have brought about significant change in the way industries and organizations look at their tech stack. While most organizations utilize a core of legacy software, they are consistently adding smart data-based capabilities to their stack. After all, why not? It’s never been easier to turn data into value.
Yet, many organizations still struggle to industrialize their data assets in a timely manner and at a high level of quality. Borrowing methods, concepts, and tools from DevOps, DataOps is the data industry’s answer to failures across the data lifecycle.
In this article, you’ll learn what DataOps is, its history, its advantages, and its limitations. By the conclusion, you’ll be able to make an informed decision on whether DataOps is the best fit for your team.
What Is DataOps?
To understand DataOps, it’s worthwhile to first look at its somewhat more mature counterpart: DevOps.
The term DevOps was coined around 2009 as a blend of “development” and “operations.” Some go so far as to call it a culture. However, for most people, it’s a set of practices and tools that integrates the process of developing and deploying software.
The advent of DevOps was facilitated by several organizational and technological evolutions:
- Agile project methodology and scrum break up work into short iterations (sprints) where all stakeholders work together on the highest priorities.
- Cloud computing and on-demand deployment of computing services enable software developers to deploy their own infrastructure.
- Infrastructure-as-code (IaC) and the provisioning of identical environments make collaborating on the same project within the same conditions less burdensome.
DevOps aims to integrate the processes between software development and IT teams with the two-pronged goal of increasing an organization’s ability to deliver applications at a higher quality and velocity. These goals are achieved through three core principles:
- Collaboration between development and operations
- Automation of the software development lifecycle
- Continuous improvement
Extrapolate the same goals, principles, and evolutions to data products and you arrive at DataOps: a method to deliver high-quality data products at high velocity. These goals are achieved through:
- Close collaboration of data scientists, analysts, data engineers, IT, and quality assurance
- Automation throughout the end-to-end analytical process
- Continuous improvement and integration of new features into data products
Let’s make it more tangible with an example. Historically, within most organizations, it was hard to set up an automated report like a quarterly sales tracker. And that wasn’t always due to technological constraints. Data needed to be extracted from source systems, transformed, aggregated, made available in an analytical database, and visualized in a business tool. It needed to pass through numerous hands, across several departments. It used to take months or even years before raw sales figures could be turned into a bar chart. By applying DataOps, organizations can drastically shorten that time to delivery.
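To picture the steps a sales tracker goes through, here is a minimal sketch in Python. It is an illustration, not a production pipeline: the records in `RAW_SALES` and the function names `extract`, `transform`, and `aggregate` are made up for this example, and a real pipeline would read from a source system and load into an analytical database.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records, standing in for rows pulled from a source system.
RAW_SALES = [
    {"sold_on": "2023-01-15", "amount": "120.50"},
    {"sold_on": "2023-02-03", "amount": "80.00"},
    {"sold_on": "2023-04-21", "amount": "200.00"},
    {"sold_on": "2023-07-09", "amount": "50.25"},
]


def extract():
    """Return raw rows; a real pipeline would query a source system here."""
    return list(RAW_SALES)


def transform(rows):
    """Parse strings into typed values and derive a quarter label per row."""
    typed = []
    for row in rows:
        sold_on = date.fromisoformat(row["sold_on"])
        quarter = f"{sold_on.year}-Q{(sold_on.month - 1) // 3 + 1}"
        typed.append({"quarter": quarter, "amount": float(row["amount"])})
    return typed


def aggregate(rows):
    """Sum sales per quarter, ready to be loaded into a reporting table."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["quarter"]] += row["amount"]
    return dict(sorted(totals.items()))


def run_pipeline():
    """Chain the steps so the whole flow runs without manual handovers."""
    return aggregate(transform(extract()))


quarterly_totals = run_pipeline()
for quarter, total in quarterly_totals.items():
    print(quarter, total)
```

Because every step is a plain function, each one can be tested, versioned, and scheduled on its own, which is the kind of handover-free flow DataOps aims for.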
DataOps vs DevOps
Now you’ve seen that DataOps is the result of applying DevOps principles to the data lifecycle. However, there are considerable differences between the two.
Within IT and software development, individuals are usually highly technical and they generally have a common set of skills and tools. However, within the data lifecycle, this is rarely the case.
Data analysts and BI engineers work with drag-and-drop dashboarding tools. Even though data scientists are avid coders in Python or R, taking their analysis and resulting algorithms out of a notebook and into deployment is uncommon. For this reason, DataOps tools need to abstract away a lot of the underlying complexity that characterizes a data pipeline.
Unlike the software development lifecycle, the data lifecycle isn’t a loop. Creating value from data involves a lot of exploration, and not all ideas turn out to be the killer features business users expect them to be. Consequently, lots of exploratory work never reaches the deployment phase.
On the other hand, some data might not exist yet, and setting up the proper data collection or ingestion could delay deployment. Planning resources for DataOps takes a lot more flexibility than planning resources for DevOps.
Usually, when software engineers join a team, they start by setting up their sandbox. In this isolated development environment, they’ll be able to write and test new features. This sandbox environment is often created in a couple of hours or days via specialized software or by running a couple of scripts.
Sandboxes for data teams are often spread across multiple tools and technologies that can’t be set up at the press of a button. Furthermore, depending on its volume, copying an organization’s data assets can quickly become very expensive. Often, smaller representative data sets need to be created instead.
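One possible way to build such a representative data set is stratified sampling: take the same fraction from every group, so that small groups don’t vanish from the sandbox. The sketch below assumes rows are plain dictionaries; the function name `stratified_sample`, the `region` column, and the 10% fraction are illustrative choices, not a prescribed method.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=42):
    """Sample `fraction` of the rows in each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the sandbox data reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for _, members in sorted(groups.items()):
        # Keep at least one row per group, even for very small groups.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical production-like rows: 90 from "EU" and 10 from "US".
rows = [{"id": i, "region": "EU" if i < 90 else "US"} for i in range(100)]
subset = stratified_sample(rows, key="region", fraction=0.1)
print(len(subset))
```

A plain random sample of ten rows could easily contain no "US" rows at all, which is why the sketch samples per group.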
Software engineers are often clustered together in the same team or department. However, DataOps needs to account for the fact that analytical roles usually exist in every corner of the organization. It might be adequate to centralize several constituents of the data pipeline, like extracting data from source systems and making them available in a data lake or warehouse. Other parts, like reporting and business intelligence, are often performed in a more efficient manner when they are decentralized.
Why Data Engineers Need DataOps
Although DataOps has borrowed DevOps practices, it has to cope with a different reality. Depending on the organizational model, value from data isn’t always delivered to the end-user of the data lifecycle, but rather to the intermediate self-service analyst or data scientist. In that regard, DataOps is very flexible.
Its principles are often restricted to the more technical parts of the data lifecycle, the data, analytics, and ML engineering work, while excluding others like exploratory analysis, reporting, and business intelligence.
For data engineers, DataOps matters in several ways: its two-pronged goal of speed and quality, but also data security, a sustainable work culture, and global operation.
Central to DataOps is its promise to increase the speed at which data products are delivered. It does this through various mechanisms:
- Aligning human resources towards the same high-value priorities.
- Knowledge sharing with project management tools like Jira and Confluence and a shared repository where code and queries are available for reuse.
- Automation of the deployment workflow, for example via GitHub Actions.
- Automated testing and monitoring of new code and data.
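As an illustration of automated data testing, the sketch below checks a batch of rows before it moves further down the pipeline. The function name `check_rows`, the rules, and the sample batch are assumptions made for this example; dedicated tools offer far richer checks.

```python
def check_rows(rows, required, non_negative):
    """Return a list of readable failures; an empty list means the batch passes."""
    failures = []
    seen_ids = set()
    for index, row in enumerate(rows):
        # Rule 1: required columns must be present and not null.
        for column in required:
            if row.get(column) is None:
                failures.append(f"row {index}: missing {column}")
        # Rule 2: selected numeric columns must not be negative.
        for column in non_negative:
            value = row.get(column)
            if value is not None and value < 0:
                failures.append(f"row {index}: negative {column}")
        # Rule 3: the "id" column must be unique within the batch.
        row_id = row.get("id")
        if row_id is not None and row_id in seen_ids:
            failures.append(f"row {index}: duplicate id {row_id}")
        seen_ids.add(row_id)
    return failures


# Hypothetical batch with two bad rows.
batch = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},
    {"id": 2, "amount": None},
]
problems = check_rows(batch, required=["id", "amount"], non_negative=["amount"])
for problem in problems:
    print(problem)
```

In a deployment workflow, a non-empty `problems` list would typically fail the job, so bad data never reaches a dashboard.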
Data products require statistical rigor. So what good is speed without quality? Quality control isn’t only about the data products themselves, but also about the data behind them. It’s not enough for a dashboard to have the right dimensions and metrics, the requested widgets, interactivity, alignment, and color scheme. The data underneath it has to be correct as well.
Data is a dynamic asset that continually changes state. It might be at rest in a source system, then in flight when extracted by a tool (like Fivetran, Airbyte, or a Singer tap) before landing in a sink like a data warehouse (Snowflake, BigQuery, etc.) or data lake.
Throughout the data lifecycle, data is processed by a variety of tools. DataOps streamlines how sensitive or personal data is processed and accessed across those tools.
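One common technique for handling personal data in a pipeline is pseudonymization: replacing a sensitive value with a stable token before the data reaches analysts. The sketch below is only an illustration under assumptions. The helper names `pseudonymize` and `mask_rows`, the `email` column, and the hard-coded secret are hypothetical, and a real setup would load the key from a secret manager.

```python
import hashlib
import hmac


def pseudonymize(value, secret):
    """Map a sensitive value to a keyed hash; equal inputs give equal tokens."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_rows(rows, sensitive_columns, secret):
    """Return copies of the rows with the sensitive columns pseudonymized."""
    masked = []
    for row in rows:
        clean = dict(row)  # copy, so the input rows stay untouched
        for column in sensitive_columns:
            if clean.get(column) is not None:
                clean[column] = pseudonymize(str(clean[column]), secret)
        masked.append(clean)
    return masked


# Hypothetical key and rows, for illustration only.
SECRET = b"replace-me-with-a-managed-secret"
customers = [
    {"email": "ann@example.com", "amount": 120.5},
    {"email": "ann@example.com", "amount": 80.0},
]
masked_customers = mask_rows(customers, ["email"], SECRET)
print(masked_customers[0]["email"] == masked_customers[1]["email"])
```

Because the same input always yields the same token, analysts can still join and count by customer without seeing the underlying email address.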
Sustainable Work Culture
DataOps fosters a sustainable work culture in many ways:
- By automating repetitive parts of the workflow, developing data products is a lot less burdensome.
- Instead of burying people under loads of work, clear prioritization helps focus on planned work.
- DataOps enables constant feedback between data developers and the users of their products.
DataOps can also support a globally distributed data team, especially when it comes to remote work. DataOps streamlines the infrastructure and tooling and should allow a global workforce to plug into it easily in a cost-efficient and secure manner. By adopting popular tools and best practices, onboarding a team member becomes a breeze, and time-to-productivity is drastically shortened.
How You Can Get Started With DataOps
If you’re looking to get started with DataOps, there are several aspects to consider and focus on. This article focuses on three: organizational requirements, collaboration, and automation.
In many organizations, data initiatives are scattered across departments. They differ in maturity and use a mishmash of tools. Adopting DataOps requires proper ownership of an organization’s data assets. While centralized models are easy to set up, federated decentralized models (like the data mesh) are becoming more popular.
Too often, data engineers and IT are in a perpetual war with each other. While data engineers want to release data that is locked up in source systems, IT is concerned about the security and integrity of their systems. Aligning Data and Operations and fostering collaboration between them will reduce the friction when setting up data pipelines.
At the core of DataOps is collaboration within data teams. Not only is this a matter of a happy and competent workforce, but it’s also about the processes, work environment, and tools that allow them to work towards the same goals. Collaborating in a digital (and often remote) world requires proper tooling:
- knowledge sharing: Notion, Confluence
- versioning and shared code repository: Nexus, JFrog, GitHub
- project management: Basecamp, Jira
Automation is an essential part of DataOps and is mainly driven by tools and technology. Here are several parts of the data lifecycle that are worth automating and the tools enabling it – the list is non-exhaustive.
- orchestration: Airflow, Dagster
- data extraction: Airbyte, Fivetran, Stitch, Singer
- data transformation: Alteryx, dbt, Matillion
- deployment: GitLab CI, GitHub Actions, GCP Cloud Build
- data monitoring: Great Expectations, Monte Carlo
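To give a feel for what orchestration means in the list above, here is a toy sketch that runs tasks in dependency order using Python’s standard library. It is not how Airflow or Dagster work internally; the task names and the `run_tasks` helper are invented for this example, and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9


def run_tasks(tasks, dependencies):
    """Run each task after its upstream tasks and return the execution order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical tasks that only record that they ran.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "test": lambda: log.append("test"),
    "publish": lambda: log.append("publish"),
}

# Each key lists the tasks that must finish before it can start.
dependencies = {
    "transform": {"extract"},
    "test": {"transform"},
    "publish": {"test"},
}

order = run_tasks(tasks, dependencies)
print(order)
```

Declaring dependencies as data, rather than hard-coding the call order, is what lets an orchestrator decide what to run, rerun, or skip.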
To laymen, DataOps can look like technical mumbo jumbo, covered in corporate sauce with some marketing newspeak added in. In this article, you learned what DataOps is, what its benefits are, and why data engineers should adopt its practices. You also learned about some high-level requirements for getting started with DataOps.
Nevertheless, rolling out DataOps remains a time-consuming task. Several tools facilitate DataOps practices across the data lifecycle. Try Meltano, an open-source DataOps platform, to help you manage all the data tools in your stack.
Guest written by Roel Peters. Thanks, Roel!