Writing Tests for Bytewax Dataflows

Uses Bytewax Version 0.9.0

Data processing frameworks are often difficult to write tests for because they generally require a separate set of infrastructure for the tests to run on. One of the benefits of Bytewax’s architecture is that a dataflow runs as an ordinary Python script, which makes it easy to incorporate tests into your development workflow. Tests are important across all software applications, from data pipelines to web applications. Every code change in your software project can introduce a new bug, and manual checks can still miss minor issues. That’s why it’s important to write automated tests.

There are different types of tests you can write to test your new features. The most basic and most important type is the unit test. Unit tests validate each component in isolation, so a failure points directly at the component that broke. Any time you add a new feature to your code, write a unit test to check that it functions as expected. A good unit test exercises the feature with both valid and invalid inputs.

Unit tests also help improve code quality and detect potential issues. In this article, you’ll learn about Bytewax dataflows and how to write tests for them that run on GitHub Actions.

You can find all the files used in this tutorial in this GitHub repository.

Introduction to Unit Testing

Unit testing is the initial phase of software testing in which individual blocks of code are separately tested. To illustrate some use cases of unit testing, consider the following code block:

def test_function():
    x = 10
    z = 5
    for index in range(y):
        print(index + x)
        x += 1
        # do more with 'index'
    del(x)
    del(y)

    print("Final value of x:")
    print(x)
    del(x)

This is a Python script that calculates a numeric value within a loop by incrementing it with a seed value and a counter. It then prints the value. However, this code has five things wrong:

Declaring a variable z that’s never used (line 3)
Using a variable y that hasn’t been declared (line 4)
Deleting the variable y when it was never declared (line 9)
Printing the variable x after deleting it (line 12)
Deleting the variable x even though it was already deleted (line 13)

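Using the wrong variable in code is an easy way to create a bug. Although these errors may seem obvious, it’s not uncommon to see them in code. An automated test that calls the function exposes the undeclared and deleted variables right away, because Python raises an error as soon as it reaches them.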
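To make the link to unit testing concrete, here is a sketch of one possible corrected version of the function together with a test for it. The name running_totals, its parameters, and the choice to return the values instead of printing them are assumptions made for illustration; they do not come from the original snippet. A second function deliberately keeps the undeclared y, and its test shows that calling it fails with a NameError.

```python
def running_totals(seed, count):
    """Hypothetical corrected version: collect index + x instead of printing it."""
    x = seed
    values = []
    for index in range(count):
        values.append(index + x)
        x += 1
    return values, x


def uses_undeclared_variable():
    """Keeps one bug from the original on purpose: y is never defined."""
    x = 10
    for index in range(y):  # noqa: F821 (raises NameError when called)
        x += 1
    return x


def test_running_totals():
    values, final_x = running_totals(seed=10, count=5)
    assert values == [10, 12, 14, 16, 18]
    assert final_x == 15


def test_undeclared_variable_is_caught():
    try:
        uses_undeclared_variable()
    except NameError:
        return  # expected: the bug surfaces as soon as the function runs
    raise AssertionError("expected a NameError")


test_running_totals()
test_undeclared_variable_is_caught()
```

Note that a test like this catches the errors that stop the program, but not the unused variable z; a linter is the better tool for that kind of issue.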

Implementing Tests for Bytewax Dataflows with GitHub Actions

Now that you understand the basics of unit testing, you can take a look at how to test Bytewax dataflows.

Setting Up

There are some prerequisites to use Bytewax. You’ll need the following:

A GitHub account
Git installed on your computer
A Python IDE to write and run your code, such as PyCharm

Once you meet all the prerequisites, you can install Bytewax using pip:

pip install bytewax==0.9.0

Running a Simple Bytewax Dataflow

The code snippet below shows a simple dataflow program using Bytewax:

from bytewax import Dataflow, run


def is_odd_number(x):
    return x % 2 == 1


# Bytewax dataflow: keep the odd numbers and capture them as output
flow = Dataflow()
flow.filter(is_odd_number)
flow.capture()

# Each input item is an (epoch, value) pair
inputData = [(0, 11), (0, 73), (0, 50)]

if __name__ == "__main__":
    for epoch, x in sorted(run(flow, inputData)):
        print(x)

Here, the dataflow receives each entry of the inputData list as an (epoch, value) pair. Every value passes through is_odd_number(), which keeps only the odd numbers, and capture() marks the filtered values as the output of the dataflow. The run() call returns those values as (epoch, value) pairs, which the script prints one by one.

When you run the code, you should see the following output:

================ RESTART: BytewaxSimple.py ==========
11
73
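To show what the filter step does to this input without installing Bytewax, here is a plain-Python stand-in. The helper filter_epoch_items is hypothetical and is not part of the Bytewax API; it only mimics the result of this particular dataflow, not how Bytewax executes it.

```python
def is_odd_number(x):
    return x % 2 == 1


def filter_epoch_items(predicate, epoch_items):
    """Hypothetical stand-in: keep (epoch, value) pairs whose value passes predicate."""
    return [(epoch, value) for epoch, value in epoch_items if predicate(value)]


inputData = [(0, 11), (0, 73), (0, 50)]

# Same loop as the Bytewax script, with the stand-in replacing run(flow, inputData)
for epoch, x in sorted(filter_epoch_items(is_odd_number, inputData)):
    print(x)
```

The predicate only ever sees the value, never the epoch, which is why is_odd_number() can stay a simple function of one number.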

Testing a Bytewax Dataflow

To make sure that your data pipeline always behaves as it should, it’s good practice to validate its behavior with automated tests. Writing proper automated functional tests is the best way to ensure that no one mistakenly changes the behavior of your pipeline when many developers are working on and pushing to the same code. You can run these tests whenever someone pushes a code change to the GitHub repository, and you can add a check so that changes are not deployed to production if the tests fail.

pytest is a widely used Python test framework that allows developers to write different types of software tests. To get started, install it for your project:

pip install pytest

On successful installation, you should be able to start writing tests with it. In order to write a good test, you should follow the arrange, act, assert (AAA) pattern:

Arrange: first, initialize objects and set variable values to run the test
Act: next, invoke the function being tested
Assert: lastly, validate the test result

The code snippet below shows how arrange, act, and assert are used in a basic Python test:

def test_squares():
    # arrange
    number = 8

    # act
    squared = number ** 2

    # assert
    assert squared == 64

Now, you can use the AAA pattern to test the Bytewax dataflow:

from bytewax import Dataflow, run


def is_odd_number(x):
    return x % 2 == 1


# Bytewax dataflow
flow = Dataflow()
flow.filter(is_odd_number)
flow.capture()


def test_run1():
    # arrange
    inputData = [(0, 11), (0, 73), (0, 50)]
    # act by running the Bytewax flow
    result = run(flow, inputData)
    # assert
    assert sorted(result) == sorted([(0, 11), (0, 73)])


def test_run2():
    # arrange
    inputData = [(0, 1), (0, 3), (0, 50)]
    # act by running the Bytewax flow
    result = run(flow, inputData)
    # assert
    assert sorted(result) == sorted([(0, 1), (0, 3)])


def test_run3():
    # arrange
    inputData = [(0, 12), (0, 72), (0, 5)]
    # act by running the Bytewax flow
    result = run(flow, inputData)
    # assert
    assert sorted(result) == sorted([(0, 5)])

The above code tests the odd number Bytewax dataflow. The arrange section defines the input data that the dataflow will filter. The act section calls run(), which runs the Bytewax data pipeline and returns a result. Finally, the assert section validates that the returned data is correct. The file defines three tests, each with a different inputData value to run through the Bytewax dataflow.
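Because is_odd_number() is an ordinary Python function, it can also be unit tested on its own, without running a dataflow. The sketch below is an optional addition and not part of the original test file; the test names and the chosen cases are assumptions. It covers zero and negative numbers as well as one invalid input, and it uses plain assert statements with try/except so that it runs with or without pytest.

```python
def is_odd_number(x):
    return x % 2 == 1


def test_is_odd_number_valid_inputs():
    # arrange: (input, expected) pairs, including zero and negative numbers
    cases = [(11, True), (50, False), (0, False), (-3, True), (-4, False)]
    for value, expected in cases:
        # act
        result = is_odd_number(value)
        # assert
        assert result == expected, f"is_odd_number({value}) returned {result}"


def test_is_odd_number_rejects_strings():
    # arrange: a string is not a valid number for this predicate
    value = "11"
    # act / assert: % on a str attempts string formatting and raises TypeError
    try:
        is_odd_number(value)
    except TypeError:
        return
    raise AssertionError("expected a TypeError for a string input")


if __name__ == "__main__":
    test_is_odd_number_valid_inputs()
    test_is_odd_number_rejects_strings()
```

Testing the predicate separately keeps failures easy to read: if these tests pass but a dataflow test fails, the problem is in how the flow is wired rather than in the filtering logic.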

To run these tests in Python, you can run the following command from the shell prompt, where main.py is the name of the file you want to test:

pytest -v main.py

Running a Bytewax Dataflow in GitHub

GitHub Actions allows developers to automate building, testing, and deploying code releases from GitHub. In this section, you’ll take a look at how to use GitHub Actions to automate tests for Bytewax. Before diving in, it helps to understand two terms:

Workflow: A YAML file that defines an automated process as one or more jobs, each made up of steps. Workflow files live in the project’s .github/workflows/ directory.
Runner: The machine that executes a workflow’s jobs. The runs-on key specifies the type of machine you want a job to run on.

To start running tests with GitHub Actions, begin by creating a GitHub repository (if you don’t already have one) and commit all your project changes there.

Next, click the Actions tab. You should see a list of GitHub Actions suggested for this project. Click Python application:

Github Action

This will create a new file that you can edit. Once you save the file, it will be stored as .github/workflows/python-app.yml.

Delete the template contents of the file and paste the following block of code into it:

name: Bytewax sample test

on:
  push:
    branches: [ "master", "main" ]

permissions:
  contents: read

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python 3.10
      uses: actions/setup-python@v3
      with:
        python-version: "3.10"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install bytewax==0.9.0
        pip install pytest
    - name: Test with pytest
      run: |
        pytest -v main.py

Once you save the updated file and commit it to the master or main branch, the workflow runs automatically, and you can see the results in the Actions tab:

Workflow run results

In the image above, the Test with pytest tab is expanded, and you can see that the Bytewax tests have passed.

A more detailed look at the workflow can help you understand what happened here.

The first line states the name of the workflow:

name: Bytewax sample test

Then, the on statement specifies the event that triggers this workflow. In this case, the workflow runs whenever commits are pushed to the master or main branch, which includes merging a pull request into either branch:

on:
  push:
    branches: [ "master", "main" ]

The workflow is also granted read-only permission to the repository contents:

permissions:
  contents: read

Next is the jobs section. This workflow contains only one job, called build. Workflows can contain multiple jobs, which run in parallel by default unless one job declares a dependency on another with the needs keyword. Before listing the steps of the job, you have to specify the machine, or runner, that the job should run on. This particular workflow specifies that the job runs on the latest Ubuntu runner image:

runs-on: ubuntu-latest

The next bit of code uses the actions/checkout@v3 action, which checks out your repository under $GITHUB_WORKSPACE so that your workflow can access it. The following step then sets up Python 3.10:

- name: Set up Python 3.10
  uses: actions/setup-python@v3
  with:
    python-version: "3.10"

The pip, bytewax, and pytest dependencies are also installed:

- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install bytewax==0.9.0
    pip install pytest

Finally, once everything is set up, the test is run using pytest:

- name: Test with pytest
  run: |
    pytest -v main.py
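The step above gates the workflow through exit statuses: GitHub Actions marks a run step as failed when its command exits with a non-zero status, and pytest exits with a non-zero status when a test fails. The following sketch imitates that contract in plain Python. The step_passes helper is hypothetical, and the two child processes are stand-ins for a passing and a failing test run rather than real pytest invocations.

```python
import subprocess
import sys


def step_passes(command):
    """Hypothetical helper: True when the command exits with status 0."""
    completed = subprocess.run(command, stderr=subprocess.DEVNULL)
    return completed.returncode == 0


# Stand-ins for a test run that passes and one that fails
passing = step_passes([sys.executable, "-c", "assert 1 + 1 == 2"])
failing = step_passes([sys.executable, "-c", "assert 1 + 1 == 3"])

print("passing run accepted:", passing)
print("failing run accepted:", failing)
```

This is why no extra configuration is needed to block a broken change: a failing test makes the pytest command fail, which fails the step and the job.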

Conclusion

As you build your software, it’s important to incorporate unit testing, which enables you to write and ship code with fewer bugs. Your source code creates and uses variables and many kinds of data throughout the lifecycle of a process, and tests verify that they are handled correctly. One of the nice things about building data pipelines with Bytewax is that it is easy to write tests and to follow the same best practices you would use for any other software. Bytewax is an open source Python framework for building scalable dataflows to process any data stream. Learn more about getting started with Bytewax today.


Sadequl Hussain
