Why is traditional batch compaction risky for streaming data workloads?

Traditional full-table compaction jobs risk creating commit contention with frequent streaming writes, increasing query latency, and serving stale data because they block snapshot availability for long periods.

What is an effective strategy for compacting streaming Iceberg tables?

An effective strategy is to perform incremental compaction targeting only 'cold' partitions (e.g., data older than an hour) that are no longer being actively written to, completely avoiding conflicts with ongoing streaming ingestion.

How can metadata-driven triggers improve streaming compaction?

Instead of fixed schedules, metadata-driven triggers use metrics from Iceberg's metadata tables: such as file age, average file size, or file count per partition, to dynamically and incrementally orchestrate compaction precisely when and where it is needed most.

Optimizing Compaction for Streaming Workloads in Apache Iceberg

In traditional batch pipelines, compaction jobs can run in large windows during idle periods. But in streaming workloads, data is written continuously: often in small increments, leading to rapid small file accumulation and tight freshness requirements.

So how do we compact Iceberg tables without interfering with ingestion and latency-sensitive reads? This post explores how to design efficient, incremental compaction jobs that preserve performance without disrupting your streaming pipelines.

The Challenge with Streaming + Compaction

Streaming ingestion into Apache Iceberg often uses micro-batches or event-driven triggers that:

Generate many small files per partition
Write new snapshots frequently
Introduce high metadata churn

A naive compaction job that rewrites entire partitions or the whole table risks:

Commit contention with streaming jobs
Stale data in read replicas or downstream queries
Latency spikes if compaction blocks snapshot availability

The key is to optimize incrementally and intelligently.

Techniques for Streaming-Safe Compaction

1. Compact Only Cold Partitions

Don’t rewrite partitions actively being written to. Instead:

Identify “cold” partitions (e.g., older than 1 hour if partioned by hour)
Compact only those to avoid conflicts with streaming writes

Example query using metadata table:

SELECT partition, COUNT(*) AS file_count
FROM my_table.files
WHERE last_modified < current_timestamp() - INTERVAL '1 hour'
GROUP BY partition
HAVING COUNT(*) > 10;

This can drive dynamic, safe compaction logic in orchestration tools.

2. Use Incremental Compaction Windows

Instead of full rewrites:

Compact only a subset of files at a time (e.g., oldest or smallest)
Avoid reprocessing already optimized files
Reduce job run time to minutes instead of hours

Spark’s RewriteDataFiles and Dremio’s OPTIMIZE features both support targeted rewrites.

3. Trigger Based on Metadata Metrics

Rather than scheduling compaction at fixed intervals, use metadata-driven triggers like:

Number of files per partition > threshold
Average file size < target
File age > threshold

You can track these via files and manifests metadata tables and use orchestration tools (e.g., Airflow, Dagster, dbt Cloud) to trigger compaction.

Example: Time-Based Compaction Script (Pseudo-code)

# For each partition older than 1 hour with many small files
for partition in get_partitions_older_than(hours=1):
    if count_small_files(partition) > threshold:
        run_compaction(partition)

This pattern allows incremental, scoped jobs that don’t touch fresh data.

Tuning for Performance

Parallelism: Use high parallelism for wide tables to speed up job runtime

Target file size: Stick to 128MB–256MB range unless your queries benefit from larger files

Retries and check-pointing: Make sure jobs are fault-tolerant in production

Summary

To maintain performance in streaming Iceberg pipelines:

Compact frequently, but narrowly
Use metadata to guide scope
Avoid active partitions and large rewrites
Leverage orchestration and branching when available

With the right setup, you can keep query performance and data freshness high - without sacrificing one for the other.