Bufstream Apache Iceberg integration#

Bufstream streams data directly from any topic to Apache Iceberg™ tables, eliminating the need for complex and costly ETL pipelines. As data flows through the broker, Bufstream transforms it to Parquet format, layering Iceberg metadata on top. The transformed data is stored in object storage as a single source of truth for your table and Kafka topic, immediately queryable by analytics tools like Apache Spark™, Amazon Athena™, Dremio®, Trino™, and Starburst™ without duplicating storage or copying data into separate lakehouse tables.
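
For example, once a topic is materialized as an Iceberg table, an analytics engine can query it in place. The sketch below is illustrative only: it assumes a Spark session whose Iceberg catalog (here named `lakehouse`) is already configured, and the namespace and table names (`kafka.orders`) are placeholders for whatever your deployment exposes.

```python
# Minimal sketch: querying a topic's Iceberg table with PySpark. Assumes a
# Spark session with an Iceberg catalog named "lakehouse" already configured
# (see the REST catalog example under "Supported Catalogs"); the namespace and
# table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bufstream-iceberg-query").getOrCreate()

# The topic's records already rest in object storage as Parquet with Iceberg
# metadata, so a plain SQL query works -- no consumer, connector, or ETL job.
df = spark.sql("SELECT * FROM lakehouse.kafka.orders LIMIT 10")
df.show()
```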

Bufstream uses a zero-copy architecture and can be configured to update your existing Iceberg catalog, reducing the time required to make streaming data available for analytics workflows.

Supported Catalogs#

Bufstream integrates with your existing data workflows through native support for:

  • REST Catalogs (including Apache Polaris™)
  • AWS Glue™ Data Catalog
  • Google BigQuery™ Metastore

Bufstream's REST catalog support allows you to deploy a REST adapter in front of Iceberg catalogs that don't have direct integration, enabling compatibility with any existing catalog as well as bespoke implementations.
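
To illustrate the query-engine side of this arrangement, the sketch below registers a hypothetical Iceberg REST catalog with Spark using the standard Iceberg Spark runtime properties. The catalog name, namespace, and endpoint URL are placeholders for your environment, and the Iceberg Spark runtime jar must be on Spark's classpath.

```python
# Minimal sketch: pointing Spark at an Iceberg REST catalog. The catalog name
# ("lakehouse") and endpoint URL are placeholders; use the REST catalog or
# REST adapter endpoint in your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rest-catalog-demo")
    # Register an Iceberg catalog backed by the REST protocol.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://rest-catalog:8181")
    .getOrCreate()
)

# Tables registered in the catalog are now queryable by name.
spark.sql("SHOW TABLES IN lakehouse.kafka").show()
```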

Why Iceberg?#

The existing processes for getting data out of Kafka and into a data lake, warehouse, or analytics tool are prone to human error, introduce duplicative storage, and increase operational expenses. Bufstream eliminates these complex and brittle processes, delegates data transformation to the broker, and makes object storage the single source of truth for ready-to-query data, getting your team to critical insights faster.

Over the last few years, Iceberg has become the leading standard for storing large data sets in the data lakehouse, and its ecosystem has grown to unite data platform and analytics teams around tools like Apache Spark, Amazon Athena, Dremio, Trino, and Starburst. These conditions made Iceberg a great fit for Bufstream.

Today, to shift data out of Kafka and into a data warehouse, teams must do some or all of the following:

  • Set up a consumption workflow that requires additional compute and storage, utilizing Kafka Connect or bespoke direct-to-data-lakehouse engines.
  • Create and maintain a complex pipeline of operations that transform the data to a columnar format (like Parquet), materialize the data, and address any schema changes or evolution.
  • Guard against degraded performance by manually cleaning up the small files that pile up in object storage as a result of continuous transformations of streaming data.

As a result, teams spend twice the time and expense to use the same data in a downstream system.

What does Bufstream do differently?#

Bufstream shifts all of the work to materialize, transform, and validate data into the streaming data pipeline, reducing the maintenance, cost, and operations burden for data platform and analytics teams. Bufstream's brokers are schema-aware, semantically intelligent, and able to transform data in transit, so a single tool and process streams your data and prepares it for the lakehouse and analytics engines.
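
On the producing side, nothing changes for applications: Bufstream speaks the Kafka protocol, so a standard client keeps working while the broker handles the lakehouse preparation. A minimal sketch with the `confluent-kafka` Python client, with placeholder broker address and topic name:

```python
# Minimal sketch: producing to a Bufstream topic with a standard Kafka client.
# Bufstream speaks the Kafka protocol, so the application side is an ordinary
# producer; the broker handles the Parquet/Iceberg work downstream. The
# bootstrap address and topic name are placeholders.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "bufstream.example.com:9092"})

# Publish a record exactly as you would to any Kafka-compatible broker.
producer.produce("orders", key=b"order-123", value=b'{"item": "book", "qty": 1}')
producer.flush()
```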

Bufstream's broker-side data quality enforcement ensures that data entering your lakehouse or used by query engines conforms to a known schema from your schema registry and meets per-field validation requirements, keeping malformed or invalid data out of your lakehouse. Once the data has passed these quality checks, Bufstream transforms it into Parquet and materializes the Iceberg metadata from the approved schema, eliminating manual transformation tools that need routine maintenance for every change made to application data. As a result, lakehouse-compatible data rests in object storage as a source of truth, without transforming, materializing, and persisting a new copy of the raw data just for analytics use cases.
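
Conceptually, the transformation step turns row-oriented records into a columnar Parquet file that the Iceberg metadata then references. The sketch below (using `pyarrow`, with illustrative field names) only shows the shape of that work; it is not Bufstream's implementation.

```python
# Conceptual sketch only -- not Bufstream's internal implementation. It shows
# the kind of transformation the broker performs: row-oriented records become
# a columnar Parquet file that Iceberg metadata can then reference.
import pyarrow as pa
import pyarrow.parquet as pq

# Records as they might arrive on a topic (field names are illustrative).
records = [
    {"order_id": "order-123", "item": "book", "qty": 1},
    {"order_id": "order-124", "item": "lamp", "qty": 2},
]

# Convert to a columnar Arrow table and persist it as a Parquet file.
table = pa.Table.from_pylist(records)
pq.write_table(table, "orders-000001.parquet")
```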

Bufstream supports the Iceberg REST catalog, allowing you to deploy a REST adapter in front of Iceberg catalogs that don't have an explicit Bufstream integration. This lets Bufstream work with any existing catalog as well as bespoke implementations, making it easy to get high-quality data from your application events into your lakehouse without extra time or effort on your part.

What's next?#
