Building Sessions from Search Logs

Here is a basic example of using Bytewax to turn an incoming stream of event logs from a hypothetical search engine into metrics over search sessions. In this example, we're going to focus on the dataflow itself and aggregating state, and gloss over details of building this processing into a larger system.

Skill level: Easy
Time to complete: 15-30 min
Required version: 0.16.*

Prerequisites

Python modules: bytewax==0.16.*

Your takeaway

This guide will teach you how to use Bytewax to aggregate streaming data into custom session windows using fold_window and then calculate metrics downstream.

Data Model

Let's start by defining a data model / schema for our incoming events. We'll make a little model class for all the relevant events we'd want to monitor.

In a more mature system, these might come from external schema or be auto generated.
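A minimal sketch of such a model, using dataclasses. The class and field names here are illustrative stand-ins for whatever your real event schema contains; every event carries the user ID and a timestamp, which the windowing step will need later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Event:
    user: int
    # Timezone-aware timestamp; event-time windowing needs one per event.
    dt: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Search(Event):
    query: str = ""


@dataclass
class Results(Event):
    items: List[str] = field(default_factory=list)


@dataclass
class ClickResult(Event):
    item: str = ""
```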

The Dataflow

A dataflow is the unit of work in Bytewax. Dataflows are data-parallel directed acyclic graphs that are made up of processing steps.

Let's start by creating an empty dataflow.
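With Bytewax 0.16 installed, that is just a constructor call:

```python
from bytewax.dataflow import Dataflow

flow = Dataflow()
```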

Generating Input Data

Bytewax has a TestingInput class that takes an iterable of events and emits them, one at a time, into our dataflow.

Note: TestingInput shouldn't be used when writing your production Dataflow. See the documentation for bytewax.inputs to see which input class will work for your use case.
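Wiring the input in might look like the following. The sample events and step name are illustrative, and the event classes are assumed to come from the data model above:

```python
from bytewax.testing import TestingInput

# Illustrative sample stream; a real system would read from Kafka, logs, etc.
client_events = [
    Search(user=1, query="dogs"),
    Results(user=1, items=["fido", "rex", "buddy"]),
    ClickResult(user=1, item="rex"),
    Search(user=2, query="cats"),
    Results(user=2, items=["fluffy", "tom", "jerry"]),
    ClickResult(user=2, item="fluffy"),
]

# `flow` is the Dataflow created above.
flow.input("input", TestingInput(client_events))
```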

Now that we have a Dataflow, and some input, we can add a series of steps to the dataflow. Steps are made up of operators, which provide a "shape" of transformation, and logic functions, which you supply to do your specific transformation. You can read more about all the operators in our documentation.

Our first task is to group incoming events by user, since a search session never involves more than one user.

All Bytewax operators that perform grouping require that their input be in the form of a (key, value) tuple, where key is the string the dataflow will group by before passing to the operator logic.

The map operator transforms each item in the stream, one at a time. It takes a Python function as an argument, and that function is applied to each item.

Here we use the map operator with a user_event function that will pull each event's user ID as a string into that key position.
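A minimal sketch of that keying function. The only requirement is that the key is a string; any event object with a `user` attribute works, so SimpleNamespace stands in for the model classes here:

```python
from types import SimpleNamespace


def user_event(event):
    # Grouping operators expect (key, value) tuples with a string key.
    return str(event.user), event


# Usage: the user ID lands in the key position, the event in the value.
key, event = user_event(SimpleNamespace(user=1, query="dogs"))
```

It would be attached to the flow with `flow.map(user_event)`.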

For the value, we're planning ahead to our next task: windowing. We will construct a SessionWindow. A SessionWindow groups events together by key until no events occur within a gap period. In our example, we want to capture a window of events that approximate an individual search session.
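A sketch of the window configuration, assuming the events carry a `dt` timestamp as in the data model above; the gap length and variable names are illustrative:

```python
from datetime import timedelta

from bytewax.window import EventClockConfig, SessionWindow

# Tell Bytewax where to find each event's (timezone-aware) timestamp.
clock_config = EventClockConfig(
    lambda e: e.dt, wait_for_system_duration=timedelta(seconds=0)
)
# Close a user's session after 5 seconds with no events from them.
window_config = SessionWindow(gap=timedelta(seconds=5))
```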

Our aggregation over these events will use the fold_window operator, which takes a builder and a folder function. Our builder function will be the built-in list constructor, which is called to create a new, empty list for each window. Our folder function, add_event, will append each new event in a session to that list.
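Outside of a running dataflow, a builder/folder pair behaves like an ordinary fold, so we can sketch the semantics with plain functools.reduce:

```python
from functools import reduce


def add_event(session, event):
    # Folder: append the next event in the window to the running session.
    session.append(event)
    return session


# fold_window starts each window from builder() -- here an empty list --
# and folds events in one at a time:
session = reduce(add_event, ["search", "results", "click"], list())

# In the dataflow this pair is wired up roughly as (names illustrative):
# flow.fold_window("sessionizer", clock_config, window_config, list, add_event)
```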

We can now move on to our final task: calculating metrics per search session in a map operator.

If there's a click during a search, the CTR is 1.0 for that search, 0.0 otherwise. Given those two extreme values, we can compute further statistics downstream, such as CTR per day.
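A sketch of that metric step. For brevity the events here are plain dicts with a "type" key; with the model classes above you would check `isinstance(e, ClickResult)` instead:

```python
def calc_ctr(user_session):
    """Return (user, ctr): 1.0 if the session contains any click, else 0.0."""
    user, session = user_session
    ctr = 1.0 if any(e["type"] == "click" for e in session) else 0.0
    return user, ctr
```

It would be attached with `flow.map(calc_ctr)`, receiving the (user, session) pairs emitted by the windowing step.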

Now that our dataflow is done, we can define an output. In this example, we're just sending the output to Standard Output.
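In Bytewax 0.16 the standard-output connector lives in bytewax.connectors.stdio; the step name is illustrative:

```python
from bytewax.connectors.stdio import StdOutput

# `flow` is the dataflow we've been building.
flow.output("out", StdOutput())
```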

Now we're done with defining the dataflow. Let's run it!

Execution

Bytewax provides a few different entry points for executing your dataflow, but because we're focusing on the dataflow itself in this example, we're going to use the most basic execution mode: running a single worker in the main process.
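In 0.16 that entry point is run_main, which (despite its home in bytewax.testing) is the standard way to run a single worker in the current process:

```python
from bytewax.testing import run_main

# Run the whole dataflow in one worker in the main process;
# fine for development and small examples like this one.
run_main(flow)
```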

Let's inspect the output and see if it makes sense.

Since the output step is immediately after calculating CTR, we should see one output item for each search session. That checks out! There were three searches in the input: "dogs", "cats", and "fruit". Only the first two resulted in a click, so they contributed 1.0 to the CTR, while the no-click search contributed 0.0.