Like our project? Star us on Github!
Star

Comparison of Stream Processing Frameworks Part 2

Stream Processing

Stream processing lets you query continuous data streams to analyze and make decisions about the data within a short period of receiving it. The time can vary based on the use case. For example, an app producing weather reports may tolerate a delay of five minutes when analyzing data, whereas a stock market app would need to analyze data within milliseconds in order to make informed decisions about the stock.

Stream processing has two subsets:

  • Stateful stream processing
  • Stateless stream processing

Computations on streaming data maintain a contextual state with stateful stream processing, which is not maintained in its stateless version. Products that support stateful stream processing use different data stores like RocksDB, HDFS, S3, and Cloud Datastore.

Why Do You Need Stream Processing?

Stream processing allows organizations to combine high-speed data feeds from various sources and analyze them in real time. This is critical for data-intensive organizations like financial institutions and cybersecurity-monitoring centers that rely on time-sensitive data for decision-making.

Stream processing also helps develop adaptive and responsive applications and enables improved decision-making with increased context. This enhances the customer experience.

This article will compare the following stream processing products: Structured Streaming (and its new initiative, Lightspeed), Faust, Streamz, Google Cloud Dataflow, Benthos, Quix, and Bytewax.

We consider these frameworks as valuable and interesting to compare for a few reasons:

  • They support both stream and batch processing, which allows optimal value for investment for enterprises for their data processing workloads.
  • They’re able to parallel processes in complex applications. This, in turn, can mean faster data processing and more throughput.
  • They can be used to create data pipelines for modern applications. Popular examples of such workloads include machine learning for e-commerce websites, real-time bidding, and mobile gaming.

As applications become more distributed and gather more data, they need robust frameworks for supporting the pipelines that process this data, and the frameworks described in this article provide such facilities.

Comparing the Stream Processing Frameworks

In the sections below, you’ll be presented with a breakdown and analysis of the popular stream processing frameworks outlined above.

Structured Streaming / Project Lightspeed

The Spark Structured Streaming engine is built on the Apache Spark SQL. It natively supports Scala and Java via an SDK and UDFs, Python and R. It’s built around a microbatch processing engine that queries data streams in small batch jobs. The output is written to external storage media like HDFS or S3. Currently, Structured Streaming supports the following features:

  • Streaming aggregations
  • Join operations (such as stream-to-batch joins, stream-stream joins, inner and outer joins, etc.)
  • End-to-end, exactly-once stream processing
  • Persistent state
  • Batch and stream processing

Due to its ability to process time-stamped event data and its microbatch functionality, Structured Streaming is best suited for applications requiring late data handling. Late data is information that arrives at the Structured Streaming cluster later than expected. An example of such workloads is IoT applications that have to process and aggregate data arriving at different intervals and is often late.

Project Lightspeed was recently introduced as an initiative to take the Spark Structured Streaming framework to the next level of processing efficiency.

Project Lightspeed enhances Spark Structured Streaming in the following areas:

  • Predictable low latency
  • New operators for data/event processing
  • New connectors
  • Observability

To ensure low latency, Lightspeed introduces asynchronous and configurable offset management, asynchronous checkpointing, and configurable frequency for state checkpoints. Individually, each of these features can increase processing speeds up to thirty percent.

Project Lightspeed also aims to introduce more connectors to data sources and add extra features to existing ones. It offers new processing functionality, like advanced windowing functions, multiple stateful operators for stream processing, state management, and a new API for managing connections to external data sources and asynchronous data processing. Finally, Lightspeed unifies metric collection from stream processing pipelines—something that previously required coding for collection and visualization.

Lightspeed can collect these metrics and export those to different observability systems and formats. Developers will be able to visualize Lightspeed pipelines and see the operators, tasks, and executors, as well as drill down to executors to view their logs and metrics.

Like Structured Streaming, Lightspeed is licensed under the Apache License 2.0, and can thus be a great stream processing solution for its speed, ease-of-use, and rich features.

Faust (a Robinhood Project)

Faust is an open source library for building streaming applications in Python. It’s included in our list because it is fast, distributed, and offers high availability—it can survive network problems and server crashes.

Faust provides an extension to RocksDB for storage. It works with Python libraries like Django, Flask, SQLAlchemy, NLTK, NumPy, SciPy, and TensorFlow. Faust has a producer/consumer architecture that allows Apache Kafka streams to be fed into the Faust agent as the stream processor.

Faust is licensed under the new BSD license and supports the following features:

  • Persistent state
  • Batch and stream processing
  • Distributed aggregations and joins

With these features and its support for popular Python data processing packages, Faust is best suited for applications like machine learning, asynchronous tasks, data denormalization, intrusion detection, and real-time WebSockets. As it’s Python native however, Faust can’t be used in other data processing languages like Scala or R. Faust was left unmaintained and has been forked to a new repo.

Streamz

Streamz is an open source product for building simple pipelines for streaming data. We included it to our list for its support of both Pandas and cuDF—allowing it to operate on streaming tabular data.

The Streamz architecture is based on the Streamz core, which is responsible for data preprocessing and data transformation. Streamz includes the following features:

  • Python support
  • Batch and streams support
  • Aggregation and join support

Streamz is licensed under the new BSD license. It’s best suited for processing any continuous but structured data (e.g., high-speed CSV logs, stock trading information, etc.) due to its support for Panda and cuDF. Streamz can be used by data-orchestration workflows that use branching, joining, flow control, feedback, back pressure, etc. Under such a mechanism, both data preparation and processing can happen in parallel using the Streamz framework, increasing application throughput. Like Faust, Streamz is also Python native and only suitable for data processing applications written in Python.

Google Cloud’s Dataflow

Dataflow by Google Cloud is a serverless, fast, and efficient streaming analytics service. It’s included in this list because of its features like being fully managed, horizontal autoscaling, reliability, consistency, and exactly-once processing.

It’s built on top of the Google Compute Engine (GCE), and a Dataflow job is executed on GCE instances. The Apache Beam SDK is installed when the job is launched on each worker. Google Cloud’s Dataflow supports any language via the Dataflow Runner v2, which lets you build multilanguage pipelines and uses cloud data stores as storage.

Google Cloud Dataflow has support for the following features:

  • Batch and stream processing
  • Joins and aggregations (supported via Dataflow SQL)
  • Persistent state

Google Cloud Dataflow falls under Google Cloud Platform’s standard licensing scheme. Being GCP native, it’s best suited for ETL applications that use Google Cloud data platforms like BigQuery. For example, Cloud Dataflow can be configured to process data files arriving at a fast rate and load data from those files into BigQuery tables. However, this also means you can’t use it for applications outside GCP.

Benthos

Benthos is an open source product chosen for this list because of its powerful mapping language and easy deployment and monitoring facility. It also has a web UI that allows for easy configuration management.

Benthos implements transaction-based resiliency that guarantees at-least-once delivery without needing to persist messages during transit. It uses HDFS, DynamoDB, S3, Cassandra, and MongoDB for storage. Benthos supports the following features:

  • Joins
  • Aggregations
  • Batch and stream processing

Benthos does not support a persistent state. It uses Bloblang for mapping data to a wide variety of forms. Due to its support for different sources and sinks like Blob storage, S3, and Cassandra, Benthos is good for complex ETL-style applications where multiple data sources and destinations are involved (e.g., log aggregations).

Benthos is licensed under the MIT license.

Quix

Quix is another open source product: it’s a data and workflow engine designed around a message broker like Kafka rather than a database. This allows developers to work with live data in memory—one of the key reasons it’s included here. The Quix stack has a web UI, APIs, and an SDK. It includes fully managed Kafka topics, a serverless environment, and a metadata-driven data catalog. It uses in-memory data processing as storage. Quix provides Python and C# SDKs. Other features include the following:

  • Batch and stream processing
  • Persistent state (via built-in state management)
  • Aggregations (by providing the aggregationType), although it does not support joins

Quix’s primary use cases include the following:

  • E-commerce applications where purchasing history can be combined with real-time browsing data to offer a more personalized experience
  • Online multiplayer gaming, where stream processing can provide a better player experience
  • Fraud detection in high-volume, high-speed financial transactions — due to its in-memory processing nature — like those used in the financial markets.
  • Live news feeds in media applications

It’s licensed under the Apache License 2.0.

Bytewax

Bytewax is an open source Python framework for building highly scalable data flows. It’s included in this list as it’s cloud native, it offers a simple scaling mechanism, and there’s no need to tune JVMs. Bytewax has operations such as capture, filter, map, and reduce.

Bytewax’s fault tolerance allows you to recover stateful data flows. For storing state, Bytewax uses storage systems like SQLite or Kafka. Its architecture includes a Python-native binding to the Timely Dataflow library responsible for the processing. Other features include the following:

  • Python support
  • Joins and aggregation support
  • Batch and stream processing
  • Persistent state

Bytewax is especially suited for applications written in Python that require parallel stream processing. The operators in Bytewax allow it to work on individual parts of a dataflow concurrently. This can be helpful in compute-intensive operations like streaming audio or video data processing. It also means better CPU utilization as all the cores are utilized, ultimately speeding up the process.

Bytewax is licensed under the Apache License 2.0.

Conclusion

Stream processing allows you to react to real-time information within a short period of receiving the data. The streaming frameworks covered in this article—Structured Streaming (and Lightspeed), Faust, Streamz, Google Cloud Dataflow, Benthos, Quix, and Bytewax—allow you to conduct real-time data analysis through stream processing.

With Bytewax, an open source Python framework, you can build highly scalable data flows for streams or batches. You can also develop your code locally and easily scale it for production purposes. Learn more about Bytewax and get started building streaming applications today.

Star us on Github
Join our Slack community! Join our Slack!
Any questions? Join our community! Join our Slack!