

In this guide, we will show you how you can combine bytewax with ydata-profiling to profile and understand the quality of your streaming data!
Python modules bytewax==0.16.2 ydata-profiling==4.3.1
You'll be able to handle and structure data streams into snapshots using Bytewax, and then analyze them with ydata-profiling to create a comprehensive report of data characteristics for each device at each time interval.
Instead of the usual approach, where data quality is assessed during the creation of the data warehouse or dashboard solution, it is a cheaper, more effective and ultimately more robust approach to monitor the quality closer to the source, which is a great fit for stream processing, since most data is created in real-time. This will prevent any data quality issues from multiplying in downstream tables and ending up in customer-facing services.
In what concerns data profiling, ydata-profiling has consistently been a crowd favorite, either for tabular or time-series data. And no wonder why — it’s one line of code for an extensive set of analysis and insights.
Let's see it in action!
Let's download a subset of the Environmental Sensor Telemetry Dataset (License — CC0: Public Domain), which contains several measurements of temperature, humidity, carbon monoxide liquid petroleum gas, smoke, light, and motion from different IoT devices.
In a production environment, these measurements would be continuously generated by each device, and the input would look like what we expect in a streaming platform such as Kafka.
To simulate a stream of data, we will use the Bytewax CSVInput connector to read the CSV file we downloaded one line at a time. In a production use case, you could easily swap this out with the KafkaInput connector.
First, let’s make some necessary imports
Then, we define our dataflow object and add our CSV input.
Afterward, we will use a stateless map
method where we pass in a function to convert the string to a datetime
object and restructure the data to the format (device_id, data).
The map
method will make the change to each data point in a stateless way. The reason we have modified the shape of our data is so that we can easily group the data in the next steps to profile data for each device separately rather than for all of the devices simultaneously.
Now we will take advantage of the stateful capabilities of bytewax to gather data for each device over a duration of time that we have defined. ydata-profiling expects a snapshot of the data over time, therefore the window
operator is the perfect method to use to do this.
In ydata-profiling, we are able to produce summary statistics for a Pandas DataFrame which is specified for a particular context. For instance, in our example, we can produce snapshots of data referring to each IoT device or to particular time frames.
This is the accumulator function, and outputs a list of readings
get_time
function instructs the event clock on how to retrieve the event's datetime from the input.
Configure the fold_window
operator to use the event time and a 5 seconds tumbling window.
After the snapshots are defined, leveraging ydata-profiling is as simple as calling the PorfileReport
method for each of the dataframes we would like to analyze.
Once the profile is complete, the dataflow expects some output, so we can use the built-in StdOutput
to print the device that was profiled and the time it was profiled at that was returned by the profile function in the map
step.
And we are ready to run our program! You can run it on your machine with the following command.