Enriching Streaming Data from Redpanda
Data enrichment is a popular way to make data more suitable or useful for a specific purpose. It involves augmenting existing data with additional information from a third-party or internal data source. In this tutorial, you will learn how to write a Python dataflow for inline data enrichment using Redpanda, a streaming platform with a Kafka-compatible API (it can easily be swapped for Kafka). We will show you how to consume a stream of IP addresses from a Redpanda topic, enrich them with third-party data to determine the location of each IP address, and produce the enriched records to a downstream topic. This example leverages the built-in Kafka input and Kafka output to do so.
Kafka and Redpanda can be used interchangeably, but we will use Redpanda in this demo for the ease of use it provides. We will use docker compose to start a Redpanda cluster.
The Python dependencies for this tutorial are:

```
bytewax==0.15.0
requests==2.28.0
kafka-python==2.0.2
```
The data source for this example is under the data directory. It will be loaded into the Redpanda cluster via the utility script in the repository.
You'll build a data enrichment pipeline that consumes a stream of data from Redpanda or Kafka and enriches it with third-party data. You will also run the pipeline locally or deploy it to the cloud (bonus: an AWS example).
Step 1. Redpanda Overview
Redpanda is a pub/sub style system that allows many producers to write to one topic and many consumers to read and receive the same data. It also features topic partitions that increase parallelism and throughput, making it possible to scale your processing as your data grows. Like Kafka, Redpanda has a concept of consumer groups that enables multiple consumers to read from a topic in a coordinated manner without receiving the same data. This further increases throughput and scalability.
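To make the consumer-group idea concrete, here is a minimal sketch using kafka-python (one of this tutorial's dependencies). The broker address, topic, and group name are placeholders: start two copies of this script with the same group_id and the topic's partitions will be split between them, so each record is delivered to only one consumer.

```python
from kafka import KafkaConsumer

# Consumers that share a group_id divide the topic's partitions among
# themselves, so each record is processed by exactly one group member.
consumer = KafkaConsumer(
    "ip_addresses",                       # placeholder topic name
    bootstrap_servers="localhost:9092",   # placeholder broker address
    group_id="enrichment-group",          # shared id = shared workload
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```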
By leveraging the Kafka-compatible API of Redpanda, Bytewax can consume from a Redpanda cluster in a similar way to Kafka. The code we will write in the following sections will be agnostic to the underlying streaming platform, so the tutorial can be adapted to work with Kafka as well.
Step 2. Constructing the Dataflow
Every dataflow will contain, at the very least, an input and an output. In this example, the input source will be a Redpanda topic and the output sink will be another Redpanda topic. Between the input and output lies the code used to transform the data, as illustrated by the diagram below.
Let's walk through constructing the input, the transformation code and the output.
The input is a crucial component of our dataflow, as it allows us to consume data from a Redpanda topic. Bytewax provides a built-in input source called KafkaInputConfig that is configurable and can be used as the input source for our dataflow (thanks to Redpanda's Kafka-compatible API).
KafkaInputConfig provides several advantages, such as allowing the code to be parallelized across input partitions by establishing a connection on each worker. It also handles offset and consumer group configuration for recovery purposes, making it easy to work with data in Kafka.
To define our input in the dataflow, we use the input method on a Dataflow object (see the code snippet). This method takes two arguments: step_id, which is used for recovery purposes, and input_config, where we use KafkaInputConfig to set up our dataflow to consume from Redpanda.

By configuring KafkaInputConfig, we can ensure that our dataflow consumes data from the correct Redpanda/Kafka topic and can handle any failures that may occur while processing the data.
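A minimal sketch of what this looks like with bytewax 0.15; the broker address and topic name are placeholders for your own cluster:

```python
from bytewax.dataflow import Dataflow
from bytewax.inputs import KafkaInputConfig

flow = Dataflow()

# "input" is the step_id used for recovery; KafkaInputConfig points the
# dataflow at the Redpanda/Kafka cluster and topic to consume from.
flow.input(
    "input",
    KafkaInputConfig(brokers=["localhost:9092"], topic="ip_addresses"),
)
```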
A Quick Aside on Recovery: With Bytewax, you can persist state in more durable formats so that if the dataflow fails, it can recover its state instead of recomputing everything from the beginning. Other processing frameworks often refer to this as checkpointing. With KafkaInputConfig, configuring recovery handles the offsets and consumer groups for you, which makes it easy to get started working with data in Redpanda or Kafka.
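As an illustration only, enabling recovery in bytewax 0.15 could look roughly like the following; SqliteRecoveryConfig and the state directory shown here are assumptions about your setup, not something this tutorial requires:

```python
from bytewax.recovery import SqliteRecoveryConfig

# Persist dataflow state to local SQLite files so that a restarted
# dataflow resumes from its last checkpoint instead of reprocessing
# the whole topic. The directory is a placeholder; pass this config
# to the execution entry point (e.g. run_main) when starting the flow.
recovery_config = SqliteRecoveryConfig("/tmp/bytewax-recovery")
```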
In our dataflow, the data transformation is a critical step, as it enables us to modify and enrich the incoming data. Bytewax provides several built-in operators, including the map operator, which we use in this example. The map operator takes a Python function as an argument, which contains the code to modify the data payload. In our example, we use this function to make an HTTP request to an external service to enrich the data with third-party information.
While this is just a demonstration, it is important to note that the function we provide should be optimized to reduce latency and minimize the risk of network errors or bottlenecks.
By leveraging the map operator, we can efficiently transform and enrich our streaming data inline using our favorite Python libraries.
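A sketch of what the enrichment step might look like. The payload schema and the geolocation endpoint below are assumptions for illustration; any HTTP lookup service would follow the same shape:

```python
import json

import requests

def enrich_with_location(key_payload):
    # KafkaInputConfig emits (key, value) pairs of raw bytes.
    key, payload = key_payload
    event = json.loads(payload)
    # Hypothetical geolocation endpoint; keep the timeout tight to
    # bound per-record latency and surface network problems early.
    resp = requests.get(f"https://geo.example.com/{event['ip']}", timeout=1.0)
    event["location"] = resp.json().get("country")
    # Re-serialize to bytes so the downstream Kafka output can write it.
    return key, json.dumps(event).encode("utf-8")

flow.map(enrich_with_location)
```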
The output component of our dataflow is responsible for capturing the data that has been transformed and enriched, so that it can be used for further analysis or processing. Bytewax provides a built-in output configuration called KafkaOutputConfig that can be used to write the enriched data to a new Redpanda or Kafka topic.
To define our output in the dataflow, we use the capture method on a Dataflow object. This method takes a configuration as its argument, and we use KafkaOutputConfig to configure our dataflow to write the enriched data to a new Kafka topic. Now we can capture the results of our pipeline!
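Sketched with bytewax 0.15, again with a placeholder broker address and topic name:

```python
from bytewax.outputs import KafkaOutputConfig

# Write the enriched (key, payload) pairs out to a downstream topic.
flow.capture(
    KafkaOutputConfig(brokers=["localhost:9092"], topic="enriched_ip_addresses"),
)
```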
Kicking off execution
With the dataflow code written, the final step is to determine how the pipeline will be executed. Bytewax provides methods in the execution module that can be used to define the execution method, whether as a single-threaded process on a local machine, or as a scalable process across a Kubernetes cluster. You can find detailed information on how to use these methods in the API documentation.
Bytewax offers two types of workers: worker threads and worker processes. In most cases, it is recommended to use worker processes, as this approach can maximize the efficiency of data processing.
With Bytewax you can easily optimize your workflow for your data streaming needs, whether working on a small project or processing large volumes of data.
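For example, with bytewax 0.15's execution module (the process and worker counts here are illustrative):

```python
from bytewax.execution import run_main, spawn_cluster

if __name__ == "__main__":
    # Single worker in the current process: handy for local development.
    run_main(flow)

    # Or scale out on one machine with multiple worker processes:
    # spawn_cluster(flow, proc_count=2, worker_count_per_proc=1)
```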
Step 3. Deploying the Dataflow
Deploying your dataflow can be done in several ways, depending on your needs. While you can run dataflows as a regular Python script for local testing:

> python dataflow.py

the recommended way to work with dataflows in production is to use the waxctl command line tool, which makes it easy to run workloads on your cloud infrastructure or on the Bytewax platform.
To run this tutorial, you can clone the repository to your machine and run the run.sh script, which will start a container running Redpanda, load it with sample data, and build the Docker image. From there, it will run the dataflow from the tutorial and enrich the data.
Deploying to AWS
If you're looking to deploy your Bytewax dataflow to a public cloud like AWS, you can do so easily with the waxctl command line tool and minimal configuration.
To get started, you'll need to have the AWS CLI installed and configured. Additionally, you'll need to ensure that your streaming platform (whether it's Redpanda or Kafka) is accessible from the instance.
> waxctl aws deploy kafka-enrichment.py --name kafka-enrichment \
    --requirements-file-name requirements-ke.txt
waxctl will configure and start an AWS EC2 instance and run your dataflow on the instance.
To see the default parameters, you can run the help command and see them in the command line (see the output snippet).