New inputs/outputs, new windowing, and recent improvements in Bytewax v 0.16.0
We are excited to announce the release of Bytewax v0.16, which brings major improvements to the way custom Bytewax inputs are created, as well as enhancements in windowing and execution. This update aims to make your experience with Bytewax smoother and more efficient. To help you navigate the new features and breaking changes, we have prepared a migration guide (with code examples!) that covers the updates made to our API for upgrading to v0.16. Please find it here. Jump on and see the release on our GitHub!
What's Changed
Features:
BREAKING Multiprocessing: Reworkes the execution model.
run_main
andcluster_main
have been moved tobytewax.testing
as they are only supposed to be used when testing or prototyping. Production dataflows should be ran by calling thebytewax.run
module withpython -m bytewax.run <dataflow-path>:<dataflow-name>
. See python-m bytewax.run -h
for all the possible options. The functionality offered byspawn_cluster
are now only offered by thebytewax.run
script, sospawn_cluster
was removed.BREAKING Partitioned input: Changes the input API to be partition based, requiring users to subclass
bytewax.inputs.CustomPartInput
to create their own input. This update involves direct use of Python objects in Rust, re-implementation of the Kafka connector in Python and state management, and the addition of abytewax.connectors.files
module for simpler file-based inputs.BREAKING Partitioned output: Refactors the output system to resemble the partitioned input system, introducing two base output types,
PartOutput
andDynamicOutput
, and their corresponding operators,part_output
anddynamic_output
. Partitioned output ensures consistent data routing and supports stateful recovery, while DynamicOutput is a stateless output that runs on each worker with less strict delivery guarantees. The Python output definitions are directly integrated into Timely operators for optimized dataflow semantics.BREAKING Source and sink classes: restructures the input/output system by subclassing
{Stateful,Stateless}{Source,Sink}
instead of using specialized closures or generators, improving the overall stability and parallel construction between input and output. Additionally, the Rust-side input module is refactored to match the output module, and the concept ofEpochConfigs
is removed in favor of a top-levelepoch_interval
argument.filter_map
operator: Introduces thefilter_map
operator for data transformation that behaves like amap
, but filters out any item that isNone
.Hopping window: Implements hopping windows of fixed duraction. The windows can be configured in three ways: if the offset is equal to the length, tumbling windows are created; if the offset is less than the length, windows overlap; and if the offset is greater than the length, there will be gaps between the windows. Users can customize the window length, offset, and starting time using the
length
,offset
, andstart_at
parameters, respectively. The starting time allows to align all windows to specific timeframes, such as hours. Closes #172Session window: Implements session windows for data grouping based on user sessions by creating a new session every time an event comes, then immediately merge all compatible sessions, and only after that assign the item to the appropriate session (similar to Flink). Closes #174
KafkaInput with multiple topics : Enables
KafkaInput
to accept multiple topics instead of a single topic.
Bug Fixes:
Fix "interleaved executions": - Fixes an issue with SQLite-based recovery. Previously you'd always get an "interleaved executions" panic whenever you resumed a cluster after the first time.
Fixes system time windows: Fixes 221 related to system time windows closing late.
Avoid built-in
hash
: Defaults to callingzlib.adler32
as a simple globally-consistent hash. Related: Bugfix: adler32: ensuresadler32
accepts only bytes-like objects.
Code Improvements:
add_pymethods
macro: Replace manual implementations with theadd_pymethods
macro for better code consistency.Clippy fixes: Apply suggestions from the Clippy linting tool for code improvements.
BREAKING Windows renaming: Refactors window naming for better clarity and consistency. Renames
TumblingWindowConfig
toTumblingWindow
BREAKING Part{Input,Output} to Partitioned renaming: Removes
ManualInputConfig
andManualOutputConfig
. Seebytewax.inputs
andbytewax.outputs
for more info.Dataflow.capture
operator is renamed toDataflow.output
. KafkaInputConfigand
KafkaOutputConfighave been moved to
bytewax.connectors.kafka.KafkaInputand
bytewax.connectors.kafka.KafkaOutput`.flake8: Enables flake8 on examples and documentation.
Sqlx: Updates the project to use sqlx for managing schema evolution.
Error handling: introduces a trait to easily create a chain of errors with location tracking (using the #[track_caller] attribute) that can be bubbled up directly to the Python interpreter.
Glossary: adds a glossary to the documentation to clarify terms and concepts.
All in all, the recent improvements in Bytewax highlight it as an active open source project focusing not only on features and bugs but also on the quality of life improvements! Enjoy! As an open-source project, we understand the importance of community involvement in driving innovation and improvements. We invite you to join our Slack, where you can engage with other users, ask questions, and share your experiences.
Don't forget to star our GitHub repository. This helps us gain visibility within the open-source community and encourages more people to contribute. Remember, Bytewax is only as strong as the community behind it, so let's work together to make it the best it can be!
Stay updated with our newsletter
Subscribe and never miss another blog post, announcement, or community event.