Iceberg vs DuckLake: 100x More Queries Without 100x Cost

What an open table format actually is

Alex started with the basics. Before table formats existed, teams had two options. One was to move every dataset into a warehouse, which worked but locked the data into one engine and made every change a migration. The other was to sit on a pile of Parquet files in object storage with no consistency guarantees and no easy way to evolve a schema.

Open table formats sit between the query and the files. As Alex put it:

“I’m Alex Merced. But really, I’m just a bunch of cells. I don’t think of myself as cell one, cell two, cell three. I think of myself as Alex Merced. What Iceberg, Delta, and DuckLake do as a metadata layer is they allow you to think of the table not as Parquet file one, two, three, or four, but as the sales table, as the orders table.”

That metadata layer is what gives you ACID guarantees on top of Parquet, and what lets multiple engines read the same dataset without copying it. Iceberg, originally out of Netflix, took a file-based approach to that metadata. DuckLake takes a different one.
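
To make the "table, not files" idea concrete, here is a minimal toy sketch in Python. It is not Iceberg's or DuckLake's actual metadata model; the `Snapshot` and `Table` classes and the file paths are hypothetical names chosen for illustration. The point is only that a table is a name plus a pointer to an immutable list of data files, and that a commit publishes a new list in one step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of data files that together form one table version."""
    snapshot_id: int
    files: tuple[str, ...]


@dataclass
class Table:
    """Toy metadata layer: a table name plus its history of snapshots."""
    name: str
    snapshots: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    @property
    def current(self) -> Snapshot:
        # Readers resolve the table name to the latest published snapshot.
        return self.snapshots[-1]

    def commit_append(self, new_files: list[str]) -> Snapshot:
        # Build the next snapshot off to the side, then publish it in one step.
        # A reader holding the previous snapshot still sees a consistent file list.
        nxt = Snapshot(self.current.snapshot_id + 1,
                       self.current.files + tuple(new_files))
        self.snapshots.append(nxt)
        return nxt


sales = Table("sales")
before = sales.current  # a reader that started before the write
sales.commit_append(["sales/part-0001.parquet", "sales/part-0002.parquet"])
after = sales.current   # a reader that started after the commit
```

A query engine asks for `sales`, gets back `after.files`, and never has to reason about individual Parquet files. Where that snapshot list lives (files in object storage or rows in a database) is the design choice the two formats make differently.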


Why DuckLake exists

Mike framed DuckLake as “Iceberg with hindsight.” The DuckDB team had the benefit of watching almost a decade of Iceberg’s problem-solution-problem-solution cycle, and made a different design choice: put the catalog in a database from day one.

“The DuckDB Labs folks, having a pretty hard bias towards databases, looked at, hey, what if we solve all these catalog issues with a database from day one and just required that you have this one other piece of infrastructure?”

The first DuckLake blog post, Mike noted, was titled “Is your lakehouse on acid?” Putting the catalog in a database gives you transactional guarantees for free, instead of having to design them around a file system.

A practical knock-on effect is something DuckLake calls data inlining. Small writes go straight into the catalog database (Postgres, DuckDB, or another SQL store) instead of producing a flurry of tiny Parquet files. When you query, the engine reads those small writes directly from the catalog. For workloads with lots of small commits, that is much friendlier than reading a hundred single-row Parquet files.
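
The inlining idea can be sketched in a few lines of Python. This is a toy model, not DuckLake's implementation: `sqlite3` stands in for the catalog database, a dictionary stands in for Parquet files in object storage, and `INLINE_ROW_LIMIT`, `write`, and `scan` are hypothetical names and values chosen for illustration.

```python
import sqlite3

INLINE_ROW_LIMIT = 10  # hypothetical threshold, not a DuckLake default

catalog = sqlite3.connect(":memory:")  # stand-in for the catalog database
catalog.execute("CREATE TABLE inlined_rows (order_id INTEGER, amount REAL)")
catalog.execute("CREATE TABLE data_files (path TEXT)")

# Stand-in for Parquet files in object storage: path -> rows.
file_store: dict[str, list[tuple[int, float]]] = {}


def write(rows):
    """Store small batches in the catalog; turn large ones into a data file."""
    with catalog:  # one transaction per commit
        if len(rows) <= INLINE_ROW_LIMIT:
            catalog.executemany("INSERT INTO inlined_rows VALUES (?, ?)", rows)
        else:
            path = f"orders/part-{len(file_store):04d}.parquet"
            file_store[path] = list(rows)
            catalog.execute("INSERT INTO data_files VALUES (?)", (path,))


def scan():
    """Combine rows from registered files with rows inlined in the catalog."""
    rows = []
    for (path,) in catalog.execute("SELECT path FROM data_files"):
        rows.extend(file_store[path])
    rows.extend(
        catalog.execute("SELECT order_id, amount FROM inlined_rows").fetchall()
    )
    return rows


write([(1, 9.5)])                            # single-row commit: inlined
write([(2, 3.0), (3, 4.25)])                 # still small: inlined
write([(i, 1.0) for i in range(100, 120)])   # 20 rows: written as a file
```

After these three commits the toy store holds one data file and three inlined rows, rather than three files. A real system would also need a way to flush inlined rows out to Parquet over time, which this sketch leaves out.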


Why analytics teams should care about openness

Both speakers landed in the same place on the question of why open formats matter, with different emphases.

For Mike, it is about control over the compute layer and giving customers an escape hatch:

“You want an easy story to tell around, if I ever want to leave or I ever want to use another tool on top of this data, how is that done? Open formats give you that story. And it’s not just a story. It’s insanely easy. You can just see, hey, it’s basically a list of Parquet files and I know how to operate on Parquet.”

For Alex, who works with much larger organisations, it is the cost of migrations:

“Every time you do a migration, it takes 18 months and costs a lot of money. The idea is that if people are going to do another migration, they want it to be that last migration. Having table formats gives you a lot of flexibility. If you need to migrate from tool X to tool Y, it’s just a matter of pointing tool Y to your catalog.”

Aaron flagged a moment that captures the shift well. At Snowflake Build in London, Snowflake spent its own keynote demonstrating Iceberg interoperability with Databricks. A few years ago that would have been unthinkable. The lock-in story has moved up the stack.


Performance is an engine question, not a format question

This was a useful nuance Alex returned to several times. Benchmarks that compare Iceberg to Delta to DuckLake usually end up measuring the engine, not the format.

“Different engines have optimised for different formats from the get-go. How you cache that metadata, use that metadata. So someone will say, hey, this format is faster, but oftentimes it’s really an engine-dependent thing.”

The practical advice for an analytics engineer choosing a format is to start from the engines you already run and the team you already have. Mike reinforced the point. Open formats are not a product. You don’t get a vendor to call when something breaks. If you spin one up yourself, you own it.


The AI query problem

This is where the conversation got interesting for analytics teams that are about to hand their warehouse over to agents.

Mike described a use case Definite has started to find compelling: pulling segments of data out of the lake and into a local DuckDB process for agents to operate on.

“If I were Capital One, already burning millions of dollars a year on Snowflake, and then adding agents on top of that, that’s a hard no for me as the CTO. When you watch Claude Code run locally, you can see the tool calls it’s making. If you’re running that against files, it’s zero cost. If you ask it a really hard question against your data warehouse and you start seeing it run 15 queries in parallel, all of those things are racking up.”

The pattern is to keep the warehouse as the source of truth, but let agents do their exploratory thrashing locally, against a hundred gigabytes pulled into DuckDB, where queries take ten or fifteen milliseconds and cost nothing. Both Iceberg and DuckLake make that pull-down cheap because you are reading a list of Parquet files.
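
A back-of-the-envelope sketch shows why the pattern matters once query volume grows by 100x. Every number below is a hypothetical placeholder, not a figure from the session or from any vendor's pricing: 5 cents per metered query, 200 cents for a one-time pull of the data, and a baseline of 1,000 queries. Costs are kept in integer cents to avoid floating-point noise.

```python
def metered_cost_cents(queries: int, cents_per_query: int) -> int:
    """Every query on metered warehouse compute adds to the bill."""
    return queries * cents_per_query


def local_cost_cents(queries: int, pull_cents: int) -> int:
    """Pay once to pull the data down; local queries add nothing at the margin."""
    return pull_cents


def break_even_queries(pull_cents: int, cents_per_query: int) -> int:
    """Smallest query count at which the one-time pull is no more expensive."""
    return -(-pull_cents // cents_per_query)  # integer ceiling division


# Hypothetical inputs, for illustration only.
CENTS_PER_QUERY = 5
PULL_CENTS = 200
human_queries = 1_000
agent_queries = human_queries * 100  # the "100x more queries" scenario

warehouse_bill = metered_cost_cents(agent_queries, CENTS_PER_QUERY)
local_bill = local_cost_cents(agent_queries, PULL_CENTS)
threshold = break_even_queries(PULL_CENTS, CENTS_PER_QUERY)
```

Under these assumptions the metered bill scales linearly with the agent's query count while the local bill stays flat, and the pull pays for itself after 40 queries. The model ignores the cost of the local machine and of keeping the pulled data fresh, so treat it as a way to frame the trade-off rather than an estimate.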

Alex’s view, from Dremio’s side, was that the answer is also semantic. Dremio is investing in autonomous reflections (algorithms that watch query patterns and create and prune materialised views without a data engineer in the loop) and in semantic layers that get the agent to the right query faster, so it stops doing exploratory probing in the first place.

“The semantic layer means it doesn’t send as many queries as you’d expect, because it’s getting to the right query faster, versus having to do a lot of exploratory queries beforehand.”

Two different bets. Mike’s is that the right place to run agent queries is often outside the warehouse. Alex’s is that the warehouse can get smart enough to handle them. Both are betting against today’s default, which is letting agents run unbounded queries on metered compute.


Where the formats still struggle

A question came in from the audience about the limits of these architectures. Both speakers pointed to the same place: streaming and very low latency writes.

Alex:

“Generally the way these formats work is that the commit happens after you write the files to Parquet. So there’s always that latency. I have to write all this data to Parquet, then make a commit before the next query can actually see the data. For most use cases you can get pretty good latency. But when you’re talking about millisecond latency, it’s still a difficult story.”

DuckLake’s data inlining helps with small commits, but Mike was clear it is not the answer for real-time analytics either. For sub-second freshness, both speakers pointed elsewhere, towards ClickHouse and dedicated streaming engines. For the analytics workloads most teams actually run, where one or two seconds of latency is fine, both formats are in good shape.


Governance is the next frontier

Alex called out the part of the Iceberg ecosystem he is watching most closely: catalog-level governance.

Today, role-based access controls are interoperable across engines. If you set a rule in Polaris, Dremio honours it, Snowflake honours it. Row-level and column-level controls are not yet portable. The direction of travel is for tables to declare a trusted engine that enforces the finer-grained rules, with other engines deferring to it.

“Once you have really full governance, that’ll be the same across all tooling. That’s a pretty big difference than the world we’ve been living in, where you’re creating a lot of baroque governance rules across different tools.”

For analytics engineers who have been the unofficial owners of “who can see what” across two or three platforms, that change matters.


What to watch next

Closing out, Aaron asked where the next wave of innovation lands. Both speakers agreed that natural language analytics is the obvious near-term focus, but the harder problem sits one layer down.

Mike’s framing was the autonomous data team:

“How do you build this autonomous data team that’s constantly looking at your data without blowing up your cloud bill? You don’t want an agent just running 24/7, doing exploratory data analysis from scratch every 24 hours.”

Alex’s framing was efficient performance under heavy AI load:

“Now that you’re not going to see as many subsidised tokens, you’re going to want to be efficient with those tokens. So how do you get performance with efficient token usage?”

The underlying concern is the same. AI makes the first 80% of any data project easy. The last 20%, including governance, cost control, and avoiding false positives in agent output, is the part that still needs human judgement.


Where Meltano sits

Meltano’s job in this picture is the simple one. Get data into your lake or warehouse cleanly, with no lock-in, no surprise pricing, and no operational firefighting. The DuckLake target was published early on and is the primary one in use today. Whether you choose Iceberg or DuckLake on the storage side, the loading layer should be the boring part of the stack.

👉 Watch the full LinkedIn Live on YouTube:
https://www.youtube.com/watch?v=pBu0FEHKM_w&t=1650s
