Inputs and Outputs
Every bytewax dataflow needs to be able to connect to and receive or retrieve data and also requires instruction on where to send and what to do with processed data.
There are two operators that allow you to configure this: Dataflow.input
and Dataflow.output
.
Input
The first thing we need to define on a Dataflow is the input
.
We do this by passing a specific input configuration to the input
operator.
Bytewax offers three input connectors (more to come in the future):
- KafkaInput
- FileInput
- DirInput
KafkaInput is a
specific input configuration tailored for Apache Kafka (and kafka-api compatible
platforms, like Redpanda).
This input generator uses the Python confluent-kafka library and automatically handles
recovery.
KafkaInput
accepts four parameters: a list of brokers, a list of topics, a tail
boolean and a starting_offset.
Additional configuration can be passed as a dictionary with the keyword argument
add_conf
.
See our API docs for bytewax.connectors.kafka
for more on Kafka configuration.
FileInput is an input source that emits lines from a file as input elements. It also handles recovery by keeping track of the last line that was read.
DirInput is an input source that emits lines from all the files in the given directory. It also handles recovery by keeping track of the last line that was read.
Bytewax also exposes the DynamicInput and PartitionedInput base classes so that you can build your own input connectors.
Output
Output, similarly to input, is configurable and this is accomplished with the output
operator.
Bytewax offers four output configurations:
- StdOutput
- KafkaOutput
- FileOutput
- DirOutput
StdOutput simply prints all of the output to standard out, and it does not require any other configuration.
KafkaOutput is a specific output configuration tailored for Apache Kafka (and
kafka-api compatible platforms, like Redpanda).
This output connector is provided by the Python library confluent-kafka.
KafkaOutput
expects only two parameters: a list of brokers and a topic.
Its input must take the form of a two-tuple of bytes, where the second element
is the payload and the first is an optional key.
Additional configuration can be passed as a dictionary with the keyword arg add_config
.
See our API docs for bytewax.connectors.kafka
for more on Kafka configuration.
FileOutput is an output source that writes emitted items as lines on a file.
DirOutput is an output source that writes to a set of files in a directory line-by-line.
As with inputs, Bytewax exposes DynamicOutput and PartitionedOutput base classes to can build your own output connectors.