Skip to content

Optimizing Compaction for Streaming Workloads in Apache Iceberg

Published: at 09:00 AM

Optimizing Compaction for Streaming Workloads in Apache Iceberg

In traditional batch pipelines, compaction jobs can run in large windows during idle periods. But in streaming workloads, data is written continuously: often in small increments, leading to rapid small file accumulation and tight freshness requirements.

So how do we compact Iceberg tables without interfering with ingestion and latency-sensitive reads? This post explores how to design efficient, incremental compaction jobs that preserve performance without disrupting your streaming pipelines.

The Challenge with Streaming + Compaction

Streaming ingestion into Apache Iceberg often uses micro-batches or event-driven triggers that:

A naive compaction job that rewrites entire partitions or the whole table risks:

The key is to optimize incrementally and intelligently.

Techniques for Streaming-Safe Compaction

1. Compact Only Cold Partitions

Don’t rewrite partitions actively being written to. Instead:

Example query using metadata table:

SELECT partition, COUNT(*) AS file_count
FROM my_table.files
WHERE last_modified < current_timestamp() - INTERVAL '1 hour'
GROUP BY partition
HAVING COUNT(*) > 10;

This can drive dynamic, safe compaction logic in orchestration tools.

2. Use Incremental Compaction Windows

Instead of full rewrites:

Spark’s RewriteDataFiles and Dremio’s OPTIMIZE features both support targeted rewrites.

3. Trigger Based on Metadata Metrics

Rather than scheduling compaction at fixed intervals, use metadata-driven triggers like:

You can track these via files and manifests metadata tables and use orchestration tools (e.g., Airflow, Dagster, dbt Cloud) to trigger compaction.

Example: Time-Based Compaction Script (Pseudo-code)

# For each partition older than 1 hour with many small files
for partition in get_partitions_older_than(hours=1):
    if count_small_files(partition) > threshold:
        run_compaction(partition)

This pattern allows incremental, scoped jobs that don’t touch fresh data.

Tuning for Performance

Parallelism: Use high parallelism for wide tables to speed up job runtime

Target file size: Stick to 128MB–256MB range unless your queries benefit from larger files

Retries and check-pointing: Make sure jobs are fault-tolerant in production

Summary

To maintain performance in streaming Iceberg pipelines:

With the right setup, you can keep query performance and data freshness high - without sacrificing one for the other.