Optimizing MySQL Backfills: How We Made The Wise Connector Backfills Fast Enough to Rely On At Scale

by Aaron Phethean

on June 29 2026

Customers who use the Pipelinewise MySQL connector, originally developed by Wise (formerly TransferWise), commonly run into trouble with large MySQL resyncs and historic backfills. These backfills have a hard deadline: finish within the binlog retention window or start over. We dug into every layer of the pipeline to find out why backfills were slow, then shipped a set of concrete improvements to the tap, the target and the way runs are scheduled.

Full resyncs are the kind of pipeline operation that stays invisible until something goes wrong. A lost binlog position, a failed replica, a required clean-slate reload. Suddenly the question is not whether you can backfill, but whether you can do it fast enough.

MySQL retains binary logs for a fixed window. The exact duration is set by your read replica configuration and is typically seven days. If a full resync takes longer than that window to complete, the CDC pipeline has no valid start point when it finishes. The only option is to restart, and the clock resets. For large databases, getting this comfortably inside the window is not an optimisation exercise. It is a reliability requirement.

1B+ rows targeted for sub-48hr resync
24h+ previous runtime for medium schemas
Small schemas now complete in under 1 hour
Binlog retention window set by your read replica config

Why the Connector Slows Down at Scale

The first step was measurement, not guessing. We ran a full pipeline against a large MySQL source, collected the logs and looked at what was actually happening at the OS level, specifically how CPU was being distributed between the tap process and the target process.

Network is always the first thing to check. Chatty connections, where a connector makes a large number of small round-trips instead of retrieving data in bulk, look like slow extraction but are really a network efficiency problem. Ruling that out early matters.

What we found was a two-stage problem. The tap had genuine inefficiencies that were capping extraction speed. But even after pushing the tap to its practical limit, the pipeline was still slow because the target was the bottleneck. CPU utilisation on the target process was consistently high, a clear sign of back pressure where the target could not process records as fast as the tap was producing them. Records were building up in memory and the whole pipeline slowed to match the target’s pace.

The pipeline was not slow because MySQL was slow. It was slow because of inefficiencies in how the tap fetched data, how it serialised records and how the target processed them. The connector was not designed for billion-row resyncs against a hard deadline. That is not a criticism, it is just where the limits showed up when we stress-tested it. Each layer needed attention separately.

Optimizing MySQL Backfills: Five Improvements to tap-mysql That Move the Needle

1. Batch fetching with fetchmany() and the native MySQL client

The connector was fetching one row at a time using fetchone(). The underlying Python client had implemented fetchmany() as a loop over fetchone() internally so calling it offered no real benefit. Switching to mysql-connector-python gave us real batch fetching: rows retrieved in bulk, far fewer round-trips and a significant reduction in per-row network overhead. For large tables, this is the single biggest throughput improvement.
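As a rough illustration of the batch-fetch pattern, here is a minimal sketch against a generic DB-API cursor. It is not the tap's actual code. The `iter_rows` helper and the `orders` table are hypothetical, and sqlite3 stands in for MySQL purely so the example runs without a server. With a client that implements fetchmany() natively, each call retrieves a whole batch rather than looping over single-row fetches.

```python
import sqlite3
from typing import Any, Iterator, Tuple


def iter_rows(cursor, batch_size: int = 10_000) -> Iterator[Tuple[Any, ...]]:
    """Yield rows from an already-executed DB-API cursor, one batch at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty list means the result set is exhausted
            break
        yield from batch


# sqlite3 is only a stand-in here; the same loop works with any DB-API cursor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 26)],
)

cur = conn.cursor()
cur.execute("SELECT id, amount FROM orders ORDER BY id")
rows = list(iter_rows(cur, batch_size=10))
print(len(rows))
```

The batch size is a tuning knob: larger batches mean fewer calls but more memory held per batch.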

2. Faster serialisation with orjson

Every record written to stdout goes through JSON serialisation. The original connector used Python’s stdlib json module (or simplejson), both of which become a meaningful bottleneck at millions of records per run. Replacing them with orjson, a Rust-backed library, cuts serialisation time significantly. At scale, the saving compounds across every row in every table.
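To make the swap concrete, here is a small sketch of serialising a Singer RECORD message with orjson, falling back to the stdlib json module when orjson is not installed. The `format_record_message` helper is hypothetical and is not the tap's real function. Note that orjson.dumps returns bytes rather than str, so the output is written as bytes.

```python
import json

try:
    import orjson  # third-party, Rust-backed serialiser

    def dumps_bytes(obj) -> bytes:
        return orjson.dumps(obj)

except ImportError:  # fallback so the sketch still runs without orjson

    def dumps_bytes(obj) -> bytes:
        return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def format_record_message(stream: str, record: dict) -> bytes:
    """Build one newline-terminated Singer RECORD message as bytes."""
    message = {"type": "RECORD", "stream": stream, "record": record}
    return dumps_bytes(message) + b"\n"


line = format_record_message("orders", {"id": 1, "amount": 12.5})
print(line)
```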

3. Eliminating redundant computation inside the record loop

The tap was computing timezone conversions for the time_extracted field on every single record. Since all records in a run share the same extraction timestamp, we moved the conversion outside the loop so it is computed once and applied to every record. Small per-record savings add up quickly when you are processing tens of millions of rows.
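The change is a classic loop-invariant hoist. A minimal sketch, assuming a hypothetical `build_messages` helper rather than the tap's real code:

```python
from datetime import datetime, timezone


def build_messages(stream: str, rows: list) -> list:
    """Attach a single, pre-computed extraction timestamp to every record."""
    # Computed once per run instead of once per row.
    time_extracted = datetime.now(timezone.utc).isoformat()
    return [
        {
            "type": "RECORD",
            "stream": stream,
            "record": row,
            "time_extracted": time_extracted,
        }
        for row in rows
    ]


messages = build_messages("orders", [{"id": 1}, {"id": 2}, {"id": 3}])
print(messages[0]["time_extracted"])
```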

4. Opt-in BATCH message support

Beyond record-by-record streaming, tap-mysql now supports opt-in BATCH messages. Rather than emitting one Singer RECORD message per row, BATCH mode writes data in bulk files, significantly reducing the overhead of the tap-to-target handoff for large tables. This is particularly impactful for full historical resyncs where the volume of individual record messages would otherwise create meaningful serialisation and I/O pressure. BATCH support is opt-in so existing pipelines are unaffected until you choose to enable it. Full handling for edge cases such as Decimal column types is included from v1.4.1 onwards.
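For readers who have not seen a BATCH message, the sketch below writes rows to a gzipped JSON Lines file and builds a message that points at it. The message shape follows the Singer SDK's BATCH convention as we understand it (an encoding block plus a manifest of file URIs). The `write_batch` helper and the file naming are assumptions for illustration, and the tap's own implementation may differ.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_batch(stream: str, rows: list, directory: Path) -> dict:
    """Write rows to one gzipped JSONL file and return a BATCH-style message."""
    path = directory / f"{stream}-batch-0001.jsonl.gz"  # hypothetical naming
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return {
        "type": "BATCH",
        "stream": stream,
        "encoding": {"format": "jsonl", "compression": "gzip"},
        "manifest": [path.as_uri()],
    }


batch_dir = Path(tempfile.mkdtemp())
sample_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
message = write_batch("orders", sample_rows, batch_dir)
print(json.dumps(message))
```

The target receives one small message and reads the file in bulk, instead of parsing one message per row from stdin.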

5. Handling mid-query disconnections

For very large tables, long-running extraction queries can occasionally be interrupted by transient network or server-side disconnections. v1.3.2 adds graceful handling for mid-query disconnections, making large backfills more resilient without requiring a manual restart from the beginning of the affected table, something that matters a great deal when you are hours into scanning a very large table.
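One common way to survive a dropped connection is to page through the table by primary key and retry from the last key seen. The sketch below shows that general pattern only. We are not claiming it is how v1.3.2 implements the fix, and `extract_with_resume`, `flaky_fetch` and the in-memory `TABLE` are all hypothetical stand-ins.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[int, str]


def extract_with_resume(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int = 1000,
    max_retries: int = 3,
) -> Iterator[Row]:
    """Page through rows by id, resuming after the last id on a disconnect."""
    last_id = 0
    retries = 0
    while True:
        try:
            page = fetch_page(last_id, page_size)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue  # a real tap would reconnect here, then retry this page
        retries = 0
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # resume point for the next page


# Simulated source that drops the connection on its second call.
TABLE = [(i, f"row-{i}") for i in range(1, 11)]
calls = {"n": 0}


def flaky_fetch(after_id: int, limit: int) -> List[Row]:
    calls["n"] += 1
    if calls["n"] == 2:
        raise ConnectionError("simulated disconnect")
    return [r for r in TABLE if r[0] > after_id][:limit]


rows = list(extract_with_resume(flaky_fetch, page_size=4))
print(len(rows), calls["n"])
```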

The Hidden Snowflake Bottleneck Slowing Down Your Loads

Once the tap was optimised, the target became the visible constraint. Profiling the Snowflake target revealed a query pattern that could explode at scale: for every column of every table being loaded, the target was issuing a separate INFORMATION_SCHEMA query to inspect the schema. With many tables and many columns, that adds up to a very large number of network calls to Snowflake, each carrying its own latency.

This issue had been patched in the meltanolabs variant of the target. We applied the same fix to the Matatika fork to bring both into line so schema inspection is handled efficiently regardless of which target variant is in use.
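The general shape of this kind of fix is to fetch column metadata for a whole schema in one query and look it up locally afterwards. The sketch below illustrates that idea only and is not the actual patch. `COLUMNS_QUERY`, `group_columns` and the sample rows are hypothetical, and the query is shown as a string rather than executed.

```python
from collections import defaultdict

# One round-trip per schema instead of one per column (illustrative SQL).
COLUMNS_QUERY = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = %(schema)s
"""


def group_columns(rows) -> dict:
    """Turn (table, column, type) rows into {table: {column: type}}."""
    schema = defaultdict(dict)
    for table_name, column_name, data_type in rows:
        schema[table_name][column_name] = data_type
    return dict(schema)


# Rows as they might come back from a single execution of COLUMNS_QUERY.
sample_rows = [
    ("ORDERS", "ID", "NUMBER"),
    ("ORDERS", "AMOUNT", "FLOAT"),
    ("USERS", "ID", "NUMBER"),
]
schema_cache = group_columns(sample_rows)
print(schema_cache["ORDERS"])
```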

Cut Backfill Time Further by Running Tables in Parallel

Even with a well-tuned tap and target, sequential table extraction has a ceiling. Scanning one large table is inherently serial but there is no reason different tables need to wait for each other.

By splitting a backfill into separate pipeline runs, same configuration with different stream selections, and running those in parallel, it is possible to make full use of available resources. For a flat database source where tables have no parent-child dependency, there is minimal contention as long as the number of parallel connections stays within what the database handles comfortably.

In practice this approach delivered faster end-to-end backfill times than equivalent sequential runs and faster than what the same source had taken on a comparable platform. The principle is straightforward: parallelise where the data model allows it and let the infrastructure do the work.
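A simple way to decide which tables go into which run is to balance the groups by row count, largest table first. The sketch below does that with a greedy heap and then fans the groups out to a thread pool. The row counts are made up, and `run_group` is a hypothetical placeholder for however you launch one pipeline run with a given stream selection.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def partition_streams(row_counts: dict, workers: int) -> list:
    """Greedily assign each stream to the currently lightest group."""
    heap = [(0, i, []) for i in range(workers)]  # (total rows, index, streams)
    heapq.heapify(heap)
    by_size = sorted(row_counts.items(), key=lambda kv: kv[1], reverse=True)
    for stream, rows in by_size:
        total, idx, group = heapq.heappop(heap)
        group.append(stream)
        heapq.heappush(heap, (total + rows, idx, group))
    return [group for _, _, group in sorted(heap, key=lambda g: g[1])]


def run_group(streams: list) -> str:
    # Hypothetical placeholder: start one pipeline run for these streams here.
    return f"ran {len(streams)} streams"


# Illustrative row counts in millions.
row_counts = {"orders": 350, "payments": 200, "users": 120, "events": 90, "refunds": 20}
groups = partition_streams(row_counts, workers=2)

with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    results = list(pool.map(run_group, groups))
print(groups, results)
```

The number of workers is the number of concurrent connections the source will see, so it should stay within what the database handles comfortably.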

Before and After: What a Backfill Actually Looks Like Now

Before (medium schemas, 20M to 350M rows): Exceeding 24 hours with no clear end point. Impractical for disaster recovery planning against any realistic binlog retention window.

After (medium schemas, 20M to 350M rows): Completing well within the target window. Small schemas under 20M rows now finish in under an hour.

Full Pipeline Visibility: How We Diagnosed This and How You Can Too

The ability to profile this pipeline at the OS level, checking process CPU, inspecting logs and measuring per-stage throughput, is only possible because Meltano runs fully open. You can run it standalone, examine exactly what is happening under the covers and make decisions based on real data rather than opaque platform metrics.

Meltano Cloud now also emits structured logs from Singer SDK-based plugins. If an error occurs, it surfaces as a structured exception that is searchable by type and filterable by stage. Teams using Datadog or similar observability platforms can route these logs there and track performance trends over time, not just in the moment.

How to Get the Most Out of Your Next MySQL Backfill

Update to the latest tap-mysql. The fetchmany() batch reads, mysql-connector-python client, orjson serialiser, BATCH message support and mid-query disconnection handling are all in the current release (v1.4.1+). Upgrade before starting any large resync.

Co-locate your runner with your database. Deploy the Meltano pipeline runner in the same cloud region as your MySQL instance. Reduced per-row network latency matters significantly when pulling millions of rows per table.

Consider enabling BATCH mode for large backfills. For full historical resyncs, opt-in BATCH message support can significantly reduce tap-to-target handoff overhead. Check the tap-mysql documentation for how to enable it and confirm your target version supports BATCH ingestion.

Parallelise across tables for large backfills. Split the pipeline by stream selection and run separate jobs in parallel. Keep the number of concurrent connections within a reasonable limit for your database and you will see meaningful end-to-end speed gains.

Know your binlog retention window. Confirm how long binary logs are retained in your read replica configuration. If your largest source would take close to the retention limit to backfill sequentially, that is a risk worth addressing before you need to recover.
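A back-of-the-envelope check makes this risk visible. The sketch below estimates backfill duration from a row count and a measured throughput, then compares it with a fraction of the retention window. The `fits_retention` helper, the 10,000 rows per second figure and the 50% safety margin are all assumptions for illustration, so substitute numbers measured on your own pipeline.

```python
def fits_retention(
    total_rows: int,
    rows_per_second: float,
    retention_hours: float,
    safety_factor: float = 0.5,
):
    """Return (estimated hours, whether it fits inside the safety budget)."""
    estimated_hours = total_rows / rows_per_second / 3600
    budget_hours = retention_hours * safety_factor
    return estimated_hours, estimated_hours <= budget_hours


# Hypothetical inputs: 1B rows, 10k rows/s, seven-day (168h) retention.
hours, ok = fits_retention(1_000_000_000, 10_000, retention_hours=168)
print(f"{hours:.1f}h, fits: {ok}")
```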

Running MySQL on Meltano Cloud?
Get in touch with the team here to review your backfill setup and confirm you are on the latest optimised version of tap-mysql.
