New inputs/outputs, new windowing, and recent improvements in Bytewax v 0.16.0

By Oli Makhasoeva
bytewax

We are excited to announce the release of Bytewax v0.16, which brings major improvements to the way custom Bytewax inputs are created, as well as enhancements in windowing and execution. This update aims to make your experience with Bytewax smoother and more efficient. To help you navigate the new features and breaking changes, we have prepared a migration guide (with code examples!) that covers the updates made to our API for upgrading to v0.16. Please find it here. Jump on and see the release on our GitHub!

What's Changed

Features:

  1. BREAKING Multiprocessing: Reworkes the execution model. run_main and cluster_main have been moved to bytewax.testing as they are only supposed to be used when testing or prototyping. Production dataflows should be ran by calling the bytewax.run module with python -m bytewax.run <dataflow-path>:<dataflow-name>. See python -m bytewax.run -h for all the possible options. The functionality offered by spawn_cluster are now only offered by the bytewax.run script, so spawn_cluster was removed.

  2. BREAKING Partitioned input: Changes the input API to be partition based, requiring users to subclass bytewax.inputs.CustomPartInput to create their own input. This update involves direct use of Python objects in Rust, re-implementation of the Kafka connector in Python and state management, and the addition of a bytewax.connectors.files module for simpler file-based inputs.

  3. BREAKING Partitioned output: Refactors the output system to resemble the partitioned input system, introducing two base output types, PartOutput and DynamicOutput, and their corresponding operators,part_output and dynamic_output. Partitioned output ensures consistent data routing and supports stateful recovery, while DynamicOutput is a stateless output that runs on each worker with less strict delivery guarantees. The Python output definitions are directly integrated into Timely operators for optimized dataflow semantics.

  4. BREAKING Source and sink classes: restructures the input/output system by subclassing {Stateful,Stateless}{Source,Sink} instead of using specialized closures or generators, improving the overall stability and parallel construction between input and output. Additionally, the Rust-side input module is refactored to match the output module, and the concept of EpochConfigs is removed in favor of a top-level epoch_interval argument.

  5. filter_map operator: Introduces the filter_map operator for data transformation that behaves like a map, but filters out any item that is None.

  6. Hopping window: Implements hopping windows of fixed duraction. The windows can be configured in three ways: if the offset is equal to the length, tumbling windows are created; if the offset is less than the length, windows overlap; and if the offset is greater than the length, there will be gaps between the windows. Users can customize the window length, offset, and starting time using the length, offset, and start_at parameters, respectively. The starting time allows to align all windows to specific timeframes, such as hours. Closes #172

  7. Session window: Implements session windows for data grouping based on user sessions by creating a new session every time an event comes, then immediately merge all compatible sessions, and only after that assign the item to the appropriate session (similar to Flink). Closes #174

  8. KafkaInput with multiple topics : Enables KafkaInput to accept multiple topics instead of a single topic.

Bug Fixes:

  1. Fix "interleaved executions": - Fixes an issue with SQLite-based recovery. Previously you'd always get an "interleaved executions" panic whenever you resumed a cluster after the first time.

  2. Fixes system time windows: Fixes 221 related to system time windows closing late.

  3. Avoid built-in hash: Defaults to calling zlib.adler32 as a simple globally-consistent hash. Related: Bugfix: adler32: ensures adler32 accepts only bytes-like objects.

Code Improvements:

  1. add_pymethods macro: Replace manual implementations with the add_pymethods macro for better code consistency.

  2. Clippy fixes: Apply suggestions from the Clippy linting tool for code improvements.

  3. BREAKING Windows renaming: Refactors window naming for better clarity and consistency. Renames TumblingWindowConfig to TumblingWindow

  4. BREAKING Part{Input,Output} to Partitioned renaming: Removes ManualInputConfig and ManualOutputConfig. See bytewax.inputs and bytewax.outputs for more info. Dataflow.capture operator is renamed to Dataflow.output. KafkaInputConfigandKafkaOutputConfighave been moved tobytewax.connectors.kafka.KafkaInputandbytewax.connectors.kafka.KafkaOutput`.

  5. flake8: Enables flake8 on examples and documentation.

  6. Sqlx: Updates the project to use sqlx for managing schema evolution.

  7. Error handling: introduces a trait to easily create a chain of errors with location tracking (using the #[track_caller] attribute) that can be bubbled up directly to the Python interpreter.

  8. Glossary: adds a glossary to the documentation to clarify terms and concepts.

All in all, the recent improvements in Bytewax highlight it as an active open source project focusing not only on features and bugs but also on the quality of life improvements! Enjoy! As an open-source project, we understand the importance of community involvement in driving innovation and improvements. We invite you to join our Slack, where you can engage with other users, ask questions, and share your experiences.

Don't forget to star our GitHub repository. This helps us gain visibility within the open-source community and encourages more people to contribute. Remember, Bytewax is only as strong as the community behind it, so let's work together to make it the best it can be!

Stay updated with our newsletter

Subscribe and never miss another blog post, announcement, or community event.

Previous post
Oli Makhasoeva

Oli Makhasoeva

Director of Developer Relations and Operations
Oli is a passionate technologist with a background in engineering, consulting, and community building. On a break from creating content, she loves to network online & in person at meetups, conferences, and forums.
Next post