A Closer Look at Arroyo and Bytewax

This article was first published on Dec 13th, 2023 on Medium.

At Bytewax, we're always on the lookout for technologies that are shaping data engineering. That's why we're excited to share this article by Edmondo Porcu, which delves into two emerging solutions in the field: Arroyo and Bytewax.

The field of data engineering is witnessing an exciting phase of innovation and growth. While established technologies like Apache Flink, known for its streaming capabilities, and Spark Streaming, popular for micro-batching, have been at the forefront, emerging solutions like Arroyo and Bytewax are starting to make their mark.

This blog post aims to shed light on these two newer projects, examining how they compare, their distinct features, and their potential impact on the data engineering landscape. As we discuss Arroyo and Bytewax, it’s important to note that the world of open-source technology is dynamic. What we cover today might evolve tomorrow, reflecting the continuous contributions and improvements from their vibrant communities.

Navigating the Stream

Drawing from the founders’ experiences of operating Flink at Lyft and Splunk, Arroyo addresses the intricacies and challenges encountered in such environments. It champions a “serverless operations” philosophy, offering a SQL-first system that minimizes operational complexity. Arroyo uses Arrow DataFusion, a popular Rust library, for parsing SQL queries and generating plans.

Bytewax lets users draw on the full Python ecosystem. It builds on the Timely Dataflow framework, adding essential capabilities on top, and targets those who seek a detailed and customizable framework within the Python environment.

Both Arroyo and Bytewax prioritize performance and scalability by leveraging Rust, while offering distinct experiences tailored to different user needs in the stream processing domain. Both also take care of known hard problems such as state persistence and recovery.

Looking more in depth, some important differences emerge:

Several execution modes: Bytewax supports a variety of execution modes, including single worker (thread), local cluster, and manually managed cluster, as well as Kubernetes. Arroyo can also be run on Kubernetes via its Helm chart and has several schedulers: a process scheduler, a node scheduler (similar to the local cluster), and a Nomad scheduler. However, at the time of writing, Arroyo requires multiple services to be started even for local development.
Programmatic dataflow management: Bytewax introduces waxctl for managing dataflow operations when running a cluster on Kubernetes.
Managed Service Offering: Arroyo offers Arroyo Cloud, a ready-to-use managed service, providing an accessible platform for users who prefer a hands-off approach to infrastructure management.

Where do these projects excel?

Bytewax has numerous use cases, since you can bring your favorite Python libraries:

Real-time embedding pipelines for retrieval augmented generation (RAG) powered AI
Event processing and real-time analysis for IoT systems
Online machine learning — Anomaly detection and classification
Feature extraction pipelines to support machine learning (real-time and batch)
Quantitative and algorithmic trading
Real-time analysis for cybersecurity systems
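
To make the online anomaly detection use case more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Bytewax API; it only illustrates the kind of per-key stateful logic a Python streaming framework lets you write. The names RunningStats and detect_anomalies, the threshold, and the warm-up length are all hypothetical choices made for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Per-key running mean and variance, updated with Welford's algorithm."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self) -> float:
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


def detect_anomalies(events, threshold=3.0, warmup=5):
    """Yield (key, value) pairs that sit far from that key's running mean.

    `events` is any iterable of (key, value) pairs. A value is flagged when
    it is more than `threshold` standard deviations from the mean of the
    values seen so far for the same key, once `warmup` values have arrived.
    """
    state = {}  # key -> RunningStats (hypothetical in-memory state store)
    for key, value in events:
        stats = state.setdefault(key, RunningStats())
        std = stats.stddev()
        if stats.count >= warmup and std > 0 and abs(value - stats.mean) > threshold * std:
            yield key, value
        # Simplification: flagged values still update the statistics.
        stats.update(value)


readings = [("sensor-a", v) for v in [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 50.0, 10.1]]
anomalies = list(detect_anomalies(readings))
print(anomalies)  # [('sensor-a', 50.0)]
```

In a real deployment the per-key state would live inside the framework, so that it can be persisted and recovered after a failure, rather than in a local dictionary.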

Arroyo use cases are also numerous, given the large set of problems that can be expressed through SQL transformations:

Detecting fraud and security incidents
Real-time product and business analytics
Real-time ingestion into your data warehouse or data lake
Real-time ML feature generation
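
As a rough illustration of the kind of computation behind a fraud detection query, the sketch below counts events per key in fixed, non-overlapping (tumbling) time windows, which is what a windowed GROUP BY would typically express in SQL. It is plain Python rather than Arroyo SQL, it processes a finite list instead of emitting results incrementally, and it ignores late or out-of-order data. The function names, the 60-second window, and the burst limit are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (key, window_start).

    `events` is an iterable of (timestamp_seconds, key) pairs. Each event is
    assigned to the window that contains its timestamp, so arrival order
    does not matter in this simplified version.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(key, window_start)] += 1
    return dict(counts)


def flag_bursts(counts, max_per_window=3):
    """Return the (key, window_start) pairs whose count exceeds the limit."""
    return sorted(kw for kw, n in counts.items() if n > max_per_window)


# Hypothetical transactions: (timestamp in seconds, user id).
transactions = [
    (0, "user-1"), (10, "user-1"), (20, "user-1"), (30, "user-1"),
    (15, "user-2"), (65, "user-1"), (70, "user-2"),
]
counts = tumbling_window_counts(transactions)
suspicious = flag_bursts(counts)
print(suspicious)  # [('user-1', 0)]
```

A streaming engine adds what this sketch leaves out: deciding when a window is complete, handling late events, and keeping the counts durable across restarts.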

Wrapping Up: Arroyo and Bytewax in Perspective

Arroyo and Bytewax mark significant strides in stream processing, each tailored to specific user needs while addressing core challenges in data engineering. Arroyo simplifies complex processes with its SQL-centric approach, and Bytewax offers flexibility through Python. Both frameworks excel in performance and scalability, and both tackle the critical issue of state persistence and recovery. As they evolve, Arroyo and Bytewax will continue to shape the future of real-time data processing, reflecting the dynamic and innovative spirit of the open-source community.

We would like to express our gratitude to Edmondo Porcu for mentioning us.

For further reading, please refer to the original article or check out our articles about the reasoning behind Bytewax: Whywax and Reasoning about Streaming vs Batch with a Case Study from GitHub.