Custom input connectors/connector hub

Connectors play a crucial role for Bytewax workloads as they define how data flows into and out of your data processing pipelines. Connectors are categorized into data sources and data sinks. This section provides a comprehensive guide on creating custom connectors to share with the community and understanding the underlying concepts in Bytewax.

Overview of Connectors

Bytewax offers a flexible framework for connecting to various data sources, transforming data, and then directing it to appropriate data sinks. The emphasis on extensibility within Bytewax stems from the understanding that data systems often require tailored solutions. The ability to customize a data source or sink is a fundamental aspect of the framework, ensuring that it can adapt to a wide range of data processing requirements.

The Bytewax Connector Hub

The Bytewax Connector Hub is a registry of Python packages that includes connectors contributed by partners, the community, and Bytewax itself. Connectors are self-contained Python packages that can be downloaded independently. The Connector hub is continuously expanding, offering a growing list of connectors for various data systems. If you have specific needs for a connector not currently available, we encourage you to contact us.

Building a Custom Connector

Key Concepts

When designing a custom connector, consider the following key concepts:

  1. Flexibility: Design connectors to handle various data formats and sources.
  2. Efficiency: Optimize for speed and minimal resource consumption.
  3. Reliability: Ensure connectors are robust and can handle errors gracefully.

Steps to Build a Connector

  1. Define the Interface: Start by outlining the interface your connector will use to interact with the data source or sink. This should follow the Bytewax connector API describing partitions, snapshots and how to iterate.
  2. Implement Data Handling: Code the logic for data ingestion or emission, ensuring compatibility with Bytewax's dataflow model.
  3. Error Handling and Logging: Include comprehensive error handling and logging mechanisms for troubleshooting and reliability.
  4. Testing: Rigorously test your connector to ensure stability and performance under various conditions.

Advanced Connector Concepts

For more technical details on how to work with custom IO for Bytewax, please refer to the custom connector documentation

Partitions

In Bytewax, partitioning refers to the process of dividing a data source into smaller, manageable segments, each of which can be processed independently. This enhances parallel processing and scalability. Connectors should try to leverage partitions where possible.

  • Partition Identification: Assign unique identifiers to each partition.
  • Data Distribution Logic: Implement logic to distribute data evenly across partitions.
  • State Management: Ensure that each partition maintains its state independently.

For more detail on input and output API design and how partitions work, check the detailed documentation.

Recovery

Recovery mechanisms are vital for connectors to handle failures gracefully.

  • Checkpointing: Implement checkpointing to save the state of the data processing at regular intervals.
  • Retry Logic: Include retry mechanisms to handle temporary failures in data sources or sinks if required.
  • State Restoration: Ensure the connector can restore its state from the last checkpoint after a failure.

For more detail on how recovery works, check the detailed documentation.

Contributing

To contribute to the Bytewax connector hub, you can use the Bytewax Connector Template as a starting point. This template provides a framework for creating new input connectors for Bytewax. To begin, use the template to create a new repository and customize the pyproject.toml file with necessary dependencies. The template includes optional development dependencies, pre-commit hook configurations, and testing setups using pytest. For publishing, GitHub actions are configured to build and publish the package to PyPI. Documentation can be built and published to GitHub Pages using Sphinx, triggered upon creating a GitHub release. The template offers a basic implementation of PartitionedInput and DynamicInput. For more detailed instructions and guidelines, please refer to the Bytewax Connector Template on GitHub.

Contribution Process

  1. Documentation: Provide detailed documentation for your connector, including setup, configuration, and usage instructions.
  2. Code Standards: Ensure your code adheres to Bytewax's coding standards and best practices.
  3. Publish the Python Package: Publish your connector on pypi using the template github actions.
  4. Publish to the Connector Hub: Submit your connector via a pull request to the Bytewax Connector Hub repository.
  5. Community Review: The contribution will go through a final review process by the Bytewax community and team.
In this article