Sifting Through Data in Kafka
Apache Kafka is a powerful distributed data streaming platform. It allows you to build an event-based architecture where the consumers of data are decoupled from its production. An event source could be your production database, containing all the events for your e-commerce site, or clickstream data from your web app. If required, you can even maintain the order of events on a per-entity basis. This is a huge advantage over similar solutions, such as Amazon SQS. But how can we analyze and make sense of the data in a topic of interest? Let’s explore a few solutions, but first, we must understand a few core concepts.
Core Concepts
Kafka is a distributed system that runs on a set of nodes (brokers) in a cluster. Data is replicated across multiple instances to ensure fault tolerance. There’s plenty of distributed systems theory that we could delve into here, but that’s not the intent of this article.
Data makes its way onto the cluster when Producers publish to a Topic. A Topic is a named set of related events or messages, and it is subdivided into Partitions. A Topic can have any number of Partitions, and that number is chosen when the Topic is created. When a Producer writes a message, it specifies the Topic and the Message itself, and optionally a Key and a Partition. Keys are usually dictated by a “natural key,” such as a user ID, and when no Partition is given explicitly, one is selected by hashing the Key. This ensures that all messages with the same Key are always produced to the same Partition, and Kafka guarantees message order within each Partition. The Message is simply a set of bytes, usually formatted as JSON, Avro, or Parquet.
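To make that concrete, here is a minimal producer sketch using the confluent-kafka Python client. The broker address, topic name, and payload are placeholders; the point is that every message carrying the same key hashes to the same partition.

```python
import json

from confluent_kafka import Producer

# Minimal sketch: point the producer at a broker (placeholder address).
producer = Producer({"bootstrap.servers": "localhost:9092"})

# A hypothetical e-commerce event, keyed by the user it belongs to.
event = {"user_id": "user-42", "action": "add_to_cart", "sku": "ABC-123"}

# Every message keyed with "user-42" hashes to the same partition,
# so Kafka preserves the relative order of that user's events.
producer.produce(
    "ecommerce-events",                       # hypothetical topic name
    key="user-42",
    value=json.dumps(event).encode("utf-8"),
)

# Block until outstanding messages are delivered (or fail).
producer.flush()
```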
Any number of Consumers can read from a given Topic, and they can do so independently of one another and at any time. A Consumer Group is a pool of processes that cooperate to consume messages from a Topic; while a group is active, Kafka assigns each Partition to exactly one process in the group. Every message within a Partition has an Offset, a monotonically increasing number that is unique to that Partition. Kafka bookmarks the Offset of the last message processed by each Consumer Group, so once a message has been read and its Offset committed by a process in the group, it is treated as processed by that group.
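And here is the matching consumer sketch, again with the confluent-kafka Python client and placeholder broker, topic, and group names. Run two copies of it and Kafka will split the topic’s partitions between them; each message exposes the partition it came from and its offset within that partition.

```python
from confluent_kafka import Consumer

# Minimal sketch: join (or create) a Consumer Group named "analytics".
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "analytics",                # the Consumer Group
    "auto.offset.reset": "earliest",        # start from the beginning if the group has no committed offset
})
consumer.subscribe(["ecommerce-events"])    # hypothetical topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        # Each message knows which partition it lives on and its offset
        # within that partition; by default the client periodically
        # commits offsets back to Kafka on the group's behalf.
        print(msg.partition(), msg.offset(), msg.value())
finally:
    consumer.close()
```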
This description is very brief but should suffice for this blog post.
How Do I Pull and Analyze this Data?
Any process that wants to pull data from a Kafka topic must iterate over each message, as it is not a relational store. A process that consumes the data could be a microservice that wants to update its state based on the messages in a topic. We won’t touch on this use case. Instead, we will focus on ad-hoc analysis and stateless reformatting of messages. A few tools for this are KSQL, Kcat and DuckDB, and Benthos.
KSQL
KSQL is built on top of Kafka Streams. It’s a SQL-like language that lets you process data as streams and tables, and streams can be re-published to other topics. Once a stream or table is created, it is continuously updated as new messages arrive. The downside is that KSQL needs its own setup, either alongside the existing Kafka installation or as a separate cluster, which means yet another tool to pay for or have your infra team manage.
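To give a feel for what that looks like, here is a rough sketch that submits two statements to a KSQL (ksqlDB) server over its REST API. It assumes a server listening on localhost:8088 (the default REST port), and the stream, topic, and column names are hypothetical. The first statement declares a stream over an existing topic; the second continuously filters it and re-publishes the result to a new topic.

```python
import requests

# Assumes a KSQL/ksqlDB server is reachable at the default REST port.
KSQL_URL = "http://localhost:8088/ksql"

# Hypothetical stream names, topic names, and columns.
statements = """
    CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
        WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

    CREATE STREAM checkout_views
        WITH (KAFKA_TOPIC='checkout_views', VALUE_FORMAT='JSON') AS
        SELECT * FROM pageviews
        WHERE url LIKE '%/checkout%';
"""

response = requests.post(KSQL_URL, json={"ksql": statements, "streamsProperties": {}})
response.raise_for_status()
print(response.json())
```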
Kcat and DuckDB
Kcat (formerly known as kafkacat) is a CLI tool that lets you consume from and produce to Kafka topics. On its own, it has no ability to sift through and analyze data, but since it’s a CLI app, it can be combined with other specialized tools. This is where DuckDB comes in. DuckDB is an embedded, columnar database that can query flat files (CSV, JSONL, and Parquet) as if they were SQL tables. Often, you don’t even need to define a schema: DuckDB samples the first few lines of the file and infers its structure.
You can combine these tools by piping the output of Kcat to a file and then querying that file with DuckDB.
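As a rough sketch (broker address, topic name, and file name are all placeholders), you could dump a topic to newline-delimited JSON with something like `kcat -b localhost:9092 -t ecommerce-events -C -o beginning -e -J > events.jsonl`, where `-J` wraps each message in a JSON envelope containing its topic, partition, offset, key, and payload. DuckDB can then query that file directly, shown here via its Python package:

```python
import duckdb

# DuckDB infers the schema by sampling the file; no CREATE TABLE needed.
con = duckdb.connect()

# Count messages per partition using the envelope fields kcat emitted.
# ("partition" is quoted because it is also a SQL keyword.)
print(
    con.sql(
        """
        SELECT "partition", count(*) AS message_count
        FROM read_json_auto('events.jsonl')
        GROUP BY "partition"
        ORDER BY "partition"
        """
    ).fetchall()
)
```

The same query works in the DuckDB CLI; the Python package is just convenient if you want to post-process the results.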
This combo is best when you need to quickly sift through data that fits on your local machine.
Benthos
Benthos is a stream processing application that lets you perform stateless transformations on streams of data. It has tons of inputs and outputs: you can consume messages from Kafka, call an API to enrich each message, and then write the result to a Redis cluster. It has its own simple language, Bloblang, for manipulating data. It’s configured through YAML files and can run as a command-line tool or as a Docker container in the cloud. Bloblang, while simple, is not the most intuitive language, and the configs can be a bit clunky to set up. For these reasons, I think this tool is best suited to long-running stream processing, especially if KSQL isn’t an option at your company.