Extend The Open Source Bytewax Library with Modules

By Oli Makhasoeva

Bytewax is known as a complete data processing solution, combining our core open-source library with a robust platform for orchestrating and governing your data flows. Today, we’re excited to introduce the Module Hub, a powerful expansion of the Bytewax ecosystem!

The Module Hub brings pre-built connectors and advanced operators to our open-source dataflow framework, designed to save your team time and let data engineers focus directly on high-impact projects. As highlighted in the State of Data Management report, up to 44% of data engineering time is often spent building in-house connectors. With our modules, you can reclaim that time!

Bytewax’s traction, with 500,000 downloads and a rapidly growing user base across industries worldwide, speaks to its value, but we would like to give our users even more. Now you can focus on what matters most: building real-time streaming pipelines 5x faster with 80% lower TCO, connecting to all your sources and sinks effortlessly, and enabling cutting-edge AI use cases from edge to cloud.


If you are already convinced, please go ahead and see the Module Hub. If you need a little more explanation and details, you'll find them in this blog.

Introduction to Bytewax Modules

To understand what a module is, we must understand the steps in data flows and stream processing. In stream processing, you often come across the terms source and sink. The data flow typically involves these steps:

  • Source: Where the data originates
  • Transformation: The data is processed, filtered, or enriched
  • Sink: Where the transformed data is stored
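The three steps above can be pictured with a minimal plain-Python sketch. It does not use the Bytewax API; the function names (`source`, `transform`, `sink`) and the sample readings are hypothetical and only illustrate the shape of the flow.

```python
from typing import Iterable, Iterator, List


def source() -> Iterator[dict]:
    """Source: where the data originates (here, hard-coded sample readings)."""
    yield {"sensor": "a", "temp_c": 20.0}
    yield {"sensor": "b", "temp_c": -999.0}  # an obviously invalid reading
    yield {"sensor": "a", "temp_c": 25.0}


def transform(events: Iterable[dict]) -> Iterator[dict]:
    """Transformation: filter out invalid readings and enrich the rest."""
    for event in events:
        if event["temp_c"] < -100:
            continue  # drop readings that cannot be real temperatures
        yield {**event, "temp_f": event["temp_c"] * 9 / 5 + 32}


def sink(events: Iterable[dict], store: List[dict]) -> None:
    """Sink: where the transformed data is stored (here, an in-memory list)."""
    store.extend(events)


stored: List[dict] = []
sink(transform(source()), stored)
print(stored)
```

In a real dataflow, the source and sink would be connectors and the transformation would be one or more operator steps.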

Connectors

Each data source or sink in Bytewax is called a connector. A source refers to any API, file, database, or data warehouse from which you want to ingest data, while a sink defines where you want to send your processed data, be it a data lake, database, data warehouse, or analytics tool. Each connector module falls into one or both categories. For example, the DeltaLake module is a sink connector, the AWS IoT Gateway module is a source connector, and the Apache Kafka module provides both.

Operators

Operators are the transformation building blocks of Bytewax. Each operator provides a specific “shape” of data transformation, and you supply a logic function to customize it for your specific task. Together, an operator and its custom logic function form a dataflow step. By chaining these steps in a dataflow, you can address your high-level data processing challenges.

If you’ve used Python's built-in functions like map, filter, or functools.reduce (or similar functions in other languages), you’re already familiar with this concept. If not, no worries—our documentation includes examples for each operator in bytewax.operators to help you get started.
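As a reminder of the idea, here is a small sketch using only Python's built-ins, not Bytewax operators. Each built-in fixes the "shape" of the transformation, and the lambda you pass in is the custom logic. The sample `readings` list is made up for illustration.

```python
from functools import reduce

readings = [3, 8, 15, 4, 23]

# map: the shape is one item in, one item out; the lambda is the custom logic.
doubled = list(map(lambda x: x * 2, readings))

# filter: the shape is keep-or-drop; the lambda decides which items stay.
large = list(filter(lambda x: x > 10, doubled))

# reduce: the shape is fold-into-one-value; the lambda says how to combine items.
total = reduce(lambda acc, x: acc + x, large, 0)

print(doubled, large, total)
```

Chaining the three calls mirrors how dataflow steps are chained: the output of one step becomes the input of the next.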

Open-source Bytewax comes equipped with a range of operators fundamental to building flexible dataflows. Plus, you can add custom operators to handle specific, complex semantics or tackle particularly tricky data transformations. While working with the community, we noticed some advanced operators in high demand and we are happy to present them today as modules!

End-to-end dataflows

This one’s in the “coming soon” category. Imagine setting up a real-time vector embedding pipeline that captures changes in your S3 document store and streams embeddings directly into Pinecone (or any vector database) using Bytewax and the embedding model of your choice. How useful would it be to have a tried-and-tested dataflow module that you can install with ease? As open-source adoption grows, we’re seeing patterns emerge where dedicated end-to-end dataflow modules could greatly accelerate development work. Stay tuned!

We have changed our offerings to adapt to user requests as they grow with Bytewax. I am excited to show our take on how we have added pre-built extensions to the open source framework in Bytewax modules. Modules are standalone Python packages that contain connectors, operators, or complete dataflow code to speed up development and increase capabilities. We endeavor for the software we build to align with the principle of making one developer go faster and further and we think modules do just that! Modules are commercially licensed and source available so you can give them a spin locally before you push to production with a license.

Zander Matheson, CEO, Founder at Bytewax

List of modules

The modules below are available either under the Apache 2.0 open-source license ("open source") or as Premium connectors that you can purchase in our store (also available as part of our platform). Every connector is well tested and in production today. To make sure you get the best guidance from the Bytewax team, we invite you to join our Slack.

| ID | Name | License | Module Type |
|----|------|---------|-------------|
| 1 | Apache Kafka | open source | Sink, Source |
| 2 | Google BigQuery | premium | Sink |
| 3 | Hopsworks FS | premium | Sink |
| 4 | AWS Kinesis Streams | premium | Sink, Source |
| 5 | Clickhouse | premium | Sink |
| 6 | MongoDB | premium | Sink |
| 7 | MQTT | premium | Sink, Source |
| 8 | DeltaLake | premium | Sink |
| 9 | Amazon S3 | premium | Sink |
| 10 | Azure EventHub | premium | Sink, Source |
| 11 | RabbitMQ | premium | Sink, Source |
| 12 | AWS IoT Gateway | premium | Source |
| 13 | Azure IoT Data Hub | premium | Sink |
| 14 | Redpanda | open source | Sink, Source |
| 15 | Amazon MSK | open source | Sink, Source |
| 16 | Confluent | open source | Sink, Source |
| 17 | Redis | premium | Sink, Source |
| 18 | Websockets | premium | Sink, Source |
| 19 | Snowflake | premium | Sink, Source |
| 20 | Qdrant | premium | Sink |
| 21 | Milvus | premium | Sink |
| 22 | Pinecone | premium | Sink |
| 23 | MySQL | premium | Sink, Source |
| 24 | Google Vertex AI | premium | Source |
| 25 | Amazon SageMaker | premium | Sink |
| 26 | Feast | premium | Sink |
| 27 | Weaviate | premium | Sink |
| 34 | InfluxDB | premium | Sink, Source |
| 35 | Azure AI Search | premium | Sink |
| 36 | SingleStore* | open source | Sink |
| 37 | Interval Join | premium | Operator |
| 38 | Ordering | premium | Operator |
| 39 | Stateful timeout | premium | Operator |
| 40 | Timers | premium | Operator |
| 41 | Select Timerange | premium | Operator |

*This is a community-contributed connector by Tom Kühl; Bytewax was not directly involved in its creation. The connector is open-source—please refer to the project’s repository for licensing details and credits.

Getting Started: A Quick Example

To illustrate how easy it is to get started, let's walk through an example using the InfluxDB module.

Prerequisites
Make sure your InfluxDB instance is up and running.

Installation

pip install bytewax-influxdb

Configuration

We begin by setting up our InfluxDB credentials and details:

import os

TOKEN = os.getenv(
    "INFLUXDB_TOKEN",
    "my-token",
)
DATABASE = os.getenv("INFLUXDB_DATABASE", "testing")
ORG = os.getenv("INFLUXDB_ORG", "dev")

Dataflow

Next, define a dataflow:

    from bytewax.dataflow import Dataflow

    flow = Dataflow("a_simple_example")

Set up the InfluxDB source:

    from datetime import datetime, timedelta, timezone

    import bytewax.operators as op
    from bytewax.influxdb import InfluxDBSource

    inp = op.input(
        "inp",
        flow,
        InfluxDBSource(
            timedelta(minutes=30),  # interval between reads
            "https://us-east-1-1.aws.cloud2.influxdata.com",  # your InfluxDB instance URL
            DATABASE,
            TOKEN,
            "home",  # measurement to read from
            ORG,
            datetime.fromtimestamp(1724258000, tz=timezone.utc),  # start timestamp
        ),
    )

InfluxDBSource is a source connector that pulls data from an InfluxDB instance. In this example, it reads data from the home measurement at 30-minute intervals, starting from the specified timestamp. The data is streamed from https://us-east-1-1.aws.cloud2.influxdata.com, which you should replace with the URL of your own InfluxDB instance.
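To make the interval and start-time arguments concrete, the sketch below decodes the example timestamp and lists a few consecutive 30-minute ranges that follow it. This only illustrates the date arithmetic. How the module actually schedules its reads is not described in this post, and `interval_ranges` is a hypothetical helper, not part of bytewax-influxdb.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def interval_ranges(
    start: datetime, interval: timedelta, count: int
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield `count` consecutive (range_start, range_end) pairs of length `interval`."""
    for i in range(count):
        range_start = start + i * interval
        yield range_start, range_start + interval


# The same start value that is passed to InfluxDBSource in the example above.
start = datetime.fromtimestamp(1724258000, tz=timezone.utc)
print(start.isoformat())  # 2024-08-21T16:33:20+00:00

# Three back-to-back 30-minute ranges beginning at the start timestamp.
for range_start, range_end in interval_ranges(start, timedelta(minutes=30), 3):
    print(range_start.time(), "->", range_end.time())
```

Passing a timezone-aware datetime, as the example does with `tz=timezone.utc`, avoids ambiguity about which local time the epoch value refers to.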

That's it! Now you can access your data, transform it, and pass it downstream.

Running

To run the dataflow, simply execute the following command:

python -m bytewax.run path.to.this.file:flow

For more details on the sink part, please refer to the module. Enjoy your dataflow with InfluxDB, and don't forget to purchase the license for production use!

To stay updated on more examples and use cases for other Bytewax modules, be sure to follow our blog and subscribe to our newsletter. We regularly share insights, tutorials, and advanced use cases to help you get the most out of Bytewax!

Get in touch

Interested in seeing a new Bytewax module? We'd love to hear from you! At Bytewax, it's our privilege to create tools that fuel innovation, and partnering with other companies and projects to bring new ideas to life is an honor. If there's a specific connector or operator you'd love to see, reach out and let's make it happen together! Because Bytewax is an open-source project, its module catalog is continuously expanding, with contributions from the community and the Bytewax team. We encourage you to contribute enhancements, bug fixes, or entirely new modules for inclusion in the catalog. Learn more about how you can contribute to Bytewax modules here.


Oli Makhasoeva


Director of Developer Relations and Operations
Oli is a passionate technologist with a background in engineering, consulting, and community building. On a break from creating content, she loves to network online & in person at meetups, conferences, and forums.