Why Every Data Team Should Incorporate Code Reviews into their Workflow

Photo by Andrej Lišakov on Unsplash

Code review is an integral part of any modern development cycle. In fact, teams usually have automated processes to check whether specific commits or pull requests have been approved by at least one or two reviewers. Any code that doesn’t meet this high bar doesn’t get pushed forward to production. By putting every line of code under careful scrutiny, software engineers can help each other identify bugs and opportunities to improve overall code quality.

Sometimes, code review can seem like a chore, especially when engineers are constantly being pestered for reviews. Thus, as a manager or engineer of any DataOps team, it’s important for you to foster a positive culture around code review. At the same time, make sure your team covers all core aspects of code review, such as code design, scoping, functionality, complexity, testing, and style. In this article, you’ll explore all of these aspects in depth and learn how your team can benefit from focusing on code reviews.

What is a Code Review?

Say that you’ve just completed the implementation for an API that you’ve been working on. Your next step is to send your code out for review. To do this, you might create a merge request using GitLab, or a pull request using GitHub. These tools offer built-in functionality for viewing file diffs, and give reviewers the ability to comment on specific lines in your code. Code reviews typically need at least one or two approvers. In this case, say you request feedback from Jane and John, two fellow engineers on your team.

In their spare time, Jane and John each take a look at your code. If everything looks good, they can approve the code review. Both Jane and John must give approvals before you can merge your code into the main branch.

However, Jane and John might also leave comments regarding the correctness, efficiency, or style of the code. In addressing their comments, you may have to make additional revisions to your code until you get the necessary approvals.

Anything that will eventually serve users should undergo code review. This includes files that might not be strictly considered “code,” like configuration files or Meltano project files.

What are the Main Concerns of Code Review?

Let’s talk a bit more about the types of concerns reviewers might focus on during code review. This article will discuss nine of them: design, scoping, functionality, complexity, testing, naming, comments, style, and documentation.

Design

Before jumping into individual lines of code, start by looking holistically at the code’s design. This can include questions like:

  • Is this change necessary right now?
  • Does the new code integrate seamlessly with existing parts of the system?
  • Did the author add or modify the right files, or should this functionality be implemented elsewhere?

Scoping

Properly scoped code means that it should address the appropriate requirement, API, or functionality—no more, no less. Questions on scope may include:

  • What were the requirements for this change? Does the author’s code satisfy those requirements?
  • Does the code cover more scope than necessary? If so, would this add unnecessary complexities to the codebase?

Functionality

The code should do what the author intended. When reviewing functionality, reviewers may ask questions like:

  • Is the code correct? Are there any edge cases where the code may break?
  • Does the code introduce issues related to backwards-incompatibility?

Complexity

K.I.S.S. has been a mantra in software development for good reason. Address issues related to complexity with questions like:

  • Consider the experience of someone using this code. Are there any areas that are unnecessarily complex?
  • Consider the readability of the code. Are there obscure libraries or constructions used where simpler ones would suffice?

Testing

Developers should always include their test code as part of the review. Don’t just gloss over it! Ask the following questions:

  • Are there sufficient unit and integration tests? Do they cover all relevant edge cases?
  • How else did the author test these changes (i.e. end-to-end tests)?

Naming

Properly naming variables is a crucial part of maintaining clean, readable code. Questions may include:

  • Do all the variable names make sense? Are any of them not descriptive enough?
  • Do the variable names conform to other files in the codebase?

Comments

Different teams have different opinions on leaving comments in code. You might ask:

  • Would it be beneficial to leave a comment explaining a piece of code?
  • Are there perhaps too many comments in the code? Do they hinder overall code readability?

Style

If the team has preferences on overall coding style, reviewers can enforce them with questions like:

  • Are the import statements organized and formatted correctly?
  • Are tabs 2-spaces or 4-spaces?
  • Does the code adhere to any line-length restrictions (i.e. column limit of 100)?

Documentation

Sometimes, modifying a key piece of code also requires updating a README file. Make sure your code documentation is up to date by asking the following:

  • Does the code change the overall user experience? If so, are the relevant documentation files up to date?
  • If the author is deprecating old code, is it clear to other users that the code is deprecated?

Code Reviews in Data Engineering

In the world of DataOps, one of the main goals is to rapidly innovate and deliver meaningful data insights to customers. Code reviews are an important mechanism in being able to fulfill that. In fact, if you take a look at the DataOps manifesto, it’s nearly impossible to adhere to some of the principles without code reviews. Here are some notable examples:

  • #8: Reflect. The manifesto calls for data analytics teams to constantly reflect on their operations based on both external and internal feedback. Code reviews are an excellent opportunity for internal reflection.
  • #13: Simplicity. Recall that code reviewers should always be on the lookout for over-engineering (complexity). The manifesto reminds teams to strive for simplicity in their tools, processes, and code design.
  • #15 Quality is paramount. The manifesto focuses on ensuring that analytics pipelines have built-in mechanisms for detecting issues. However, reviewers have a similar responsibility in ensuring that no code goes live without meeting quality checks.

Meltano can help you achieve many of the items on this list. Since Meltano’s project files are just text-based files within a directory, you can easily integrate Meltano with your team’s existing best practices surrounding code review.

Biggest Concerns of Code Review in Data Engineering

Let’s extend the earlier discussion of code review concerns and talk about them in the context of DataOps. Suppose that you want to take data from a CSV file and import it into your main PostgreSQL database. To achieve this, you use the psycopg library to produce the following code snippet:

import psycopg2

HOST = 'HOSTNAME'
DB = 'DBNAME'
USER = 'USERNAME'
PASSWORD = 'PASSWORD'
PORT = 'PORT'

conn = psycopg2.connect(host=HOST,database=DB,user=USER,password=PASSWORD,port=PORT)

# Copies data from a CSV file into a PostgreSQL table.
def copy_data(schema, table, csv_file, delimiter=',', null='NULL', columns=[]):
    cursor = conn.cursor()
    filepath = open(csv_file,'r')

    cols_str = ','.join(columns)
    copy_sql = f'''COPY {schema}.{table} ({cols_str}) FROM STDIN WITH CSV DELIMITER E'{delimiter}' QUOTE '"' NULL '{null}';'''

    print(copy_sql)
    cursor.copy_expert(copy_sql, filepath)
    cursor.execute('COMMIT;')

If you were to send this code snippet for review (hint: there are quite a few issues with it), here are some things reviewers might consider, categorized by the nine types of concerns outlined earlier.

Design

Design refers to the overall “fit” of the new code within the existing codebase or system. With that in mind, questions related to design will depend on the intent and usage of the code. For example, this code might be one out of many files in a scripts package that helps data engineers perform common operations. Given that, reviewers may ask things like:

  • Has this functionality already been implemented (completely or partially) somewhere else?
  • Is there a need to dedicate a new file for this functionality? Can it be added to an existing file instead?

Scoping

The scope of requirements for this change is pretty well-defined: a user should be able to use this code to take data from a CSV file and import it into a PostgreSQL database. The code does just that, so there are no issues with scoping. However, it may be worth asking whether related metadata or project files, such as Meltano files, need to be updated along with this change.

Functionality

Functionality refers to both the overall correctness of the code, as well as how easy the code is to use from a UX perspective. Reviewers may have a couple of concerns with the given sample:

  • Credentials are hard-coded at the top of the file. A reviewer may suggest functionality to supply the password with a file that’s read via standard input. Alternatively, they may suggest a service like AWS Secrets Manager.
  • There’s actually a bug in the code. The copy_data method contains an input columns, which defines the column names for the table. It’s possible for columns to be an empty list. In fact, that’s the default value you gave it. A reviewer may suggest the following fix with a simple if-else control:
    if len(columns) > 0:
        cols_str = ','.join(columns)
        copy_sql = f'''COPY {schema}.{table} ({cols_str}) FROM STDIN WITH CSV DELIMITER E'{delimiter}' QUOTE '"' NULL '{null}';'''
    else:
        copy_sql = f'''COPY {schema}.{table} FROM STDIN WITH CSV DELIMITER E'{delimiter}' QUOTE '"' NULL '{null}';'''

Complexity

Since the code here is relatively simple, there aren’t really any major things to point out regarding complexity. A reviewer might raise some minor issues, such as:

  • Is the QUOTE clause necessary? The default value is already double-quoted.
  • Why include null as an input value?

Testing

The code snippet didn’t come with test code, so that should set off some alarm bells. Reviewers should be quick to notice when code hasn’t been tested, and not offer approvals until they’re confident that the changes work as intended. On top of requesting unit and integration tests, reviewers might ask if these tests meet the team’s bar for code coverage.

Naming

As a whole, the naming in this code is okay, but reviewers may have some suggestions for improvement. For example, the function name copy_data could be more descriptive. Something like copy_csv_data_to_table might work better, and it allows you to get rid of the comment directly above it.

Comments

There is only a single comment in this file, and if the author takes the naming suggestion, you can get rid of it. There aren’t really any other areas where including a comment is necessary in this sample.

Style

Style refers to coding best practices shared across a team. Reviewers might point out the following:

  • The copy_sql command is one long line of code that may exceed the codebase’s maximum line length. Consider breaking it up into multiple lines and formatting it so that it’s easier to read.
  • The team might consider it a best practice to close and delete the cursor once you’re done with it. This means adding the following two lines to the end of the function implementation:
    cursor.close()
    del cursor

Documentation

This code snippet didn’t come with supporting documentation on how to use it. If this code is part of a larger scripts package, the author should document how to use it in a README file.

Benefits of Code Review in Data Engineering and the Value-Add for Data Teams

As you saw from the previous section, even with a relatively small code change, code reviews can quickly uncover issues that the author wasn’t originally aware of. If you’re not already convinced, code review helps create a healthy DataOps environment through the following main benefits:

Code Quality Improvements

In the example above, code review not only revealed a bug in the implementation, but also encouraged the author to use team best practices to maintain high code quality. Having one or two experts on the team review code helps strike the best balance between staying agile as a team and enforcing a high bar. Overall, your code will have fewer errors and better style.

Knowledge Sharing

Code review can help junior and senior developers alike learn and grow through knowledge sharing. The example highlighted the section above where the author hard-coded user credentials into the program. By looking into the suggestions, the author is better equipped to handle similar problems in the future.

Improved Documentation

Unfortunately, documentation is often seen as an afterthought to many developers. By focusing on documentation during code review, teams can ensure that it doesn’t get cast aside during the development cycle.

Conclusion

Start improving your code review processes in your DataOps team! In this article, you reviewed nine types of concerns that are frequently brought up during code review: design, scoping, functionality, complexity, testing, naming, comments, style, and documentation. To illustrate this, you performed an analysis on a data engineering code sample that copies data from a CSV file into a PostgreSQL database, and saw how doing so could significantly improve the original author’s code. To maintain high code quality moving forward, you might consider creating a checklist with all nine of these concerns to make sure they’re all addressed during each review. This helps keep your team’s best practices top-of-mind at all times.

Guest post written by Alexander Yu, thank you Alex!

Intrigued?

You haven’t seen nothing yet!