Why data prep is hard

Many data scientists and machine learning teams report that they spend about 80% of their time preparing, managing, or curating their datasets.

There are three things that have enabled the ML revival over the last 5–10 years: breakthroughs in algorithms, fast and scalable hardware, and large curated datasets. Data is a crucial pillar at the foundation of AI, and it takes a lot of effort to obtain it.

Unfortunately, it’s very difficult to know exactly what effect a specific label will have on your model, and you can’t just wait for your data to be ideal. No software company delays a release until its app is perfect; it would never release anything if that were the case. It’s the same with data. You have to start somewhere, and this is where the machine learning lifecycle comes in.


The machine learning lifecycle is all about iteration. What often eludes people is that machine learning actually has two lifecycles in a symbiotic relationship: code and data.

We’re constantly iterating on both of them, providing our human understanding of the problem to improve our code as well as our data.

This means iteration must be a fundamental part of our ML processes. The more we incorporate the right tooling, the better we can leverage our AI team’s insights to solve real-world problems.

ML production pipeline

Here is a simplified version of the machine learning production pipeline, starting with the training phase followed by a validation step.


Normally we train and validate several models, and as soon as we find one that performs well enough, we push it to production. Once the model is in production, we monitor its performance to catch any accuracy deviations. If accuracy drops, we retrain the model, which means we cycle back to the initial training step.

Often, this process is a continuous loop that is executed many times over the product’s life span.

Human in the loop

But is this really all there is to the pipeline? In short, the answer is no. The cycle shown above is typical for machine learning projects, but it is missing one important component: human involvement. Most AI products require a lot of human effort, especially in the data labeling step.


When we incorporate human-in-the-loop principles, the involvement can happen at many different stages:

  • At the training stage, we need humans to annotate the data for supervised learning.
  • For monitoring the model in production, we ideally need human-labeled data to periodically check if our predictions are deviating.
  • When deviation occurs, we need human-labeled examples for retraining.
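
The monitoring and retraining checks in the list above can be sketched roughly as follows. This is a minimal illustration, not code from the project: the function names `audit_accuracy` and `needs_retraining`, the baseline accuracy of 0.90, and the tolerance of 0.05 are all assumptions chosen for the example.

```python
from typing import Sequence


def audit_accuracy(predictions: Sequence[str], human_labels: Sequence[str]) -> float:
    """Share of model predictions that agree with human labels on an audit sample."""
    if len(predictions) != len(human_labels):
        raise ValueError("predictions and human_labels must have the same length")
    if not predictions:
        raise ValueError("audit sample is empty")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(predictions)


def needs_retraining(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a deviation when accuracy falls more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance


# Hypothetical audit sample: production predictions vs. labels from human annotators.
predictions = ["clickbait", "news", "news", "clickbait", "news"]
human_labels = ["clickbait", "news", "clickbait", "clickbait", "clickbait"]

accuracy = audit_accuracy(predictions, human_labels)
if needs_retraining(accuracy, baseline=0.90):
    print(f"Accuracy dropped to {accuracy:.2f}; send the audit sample to retraining.")
```

In practice the audit sample would be drawn periodically from production traffic, and the human labels collected for the check can be reused as retraining examples.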

Throughout this process it’s clear we cannot rely solely on automation. We need to include humans at every stage of the machine learning lifecycle.


Additionally, there is another ML pipeline that needs constant human annotation. This is a human-in-the-loop workflow in its traditional sense, as shown in the diagram above.

Here humans assist an actual ML algorithm with difficult cases. Many ML algorithms give us confidence levels connected to the predictions they make. We can choose a threshold probability to filter the cases the algorithm is finding difficult and send them for human judgment. The human prediction is sent to the end-user and also sent back to the algorithm to help improve performance with retraining.
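
A minimal sketch of this confidence-based routing might look like the following. The `route_by_confidence` helper, the tuple layout, and the 0.8 threshold are assumptions made for illustration; a real system would tune the threshold against the cost of human review.

```python
from typing import Iterable, List, Tuple

# (item_id, predicted_label, confidence) -- a hypothetical layout for this example.
Prediction = Tuple[str, str, float]


def route_by_confidence(
    predictions: Iterable[Prediction], threshold: float = 0.8
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into those accepted automatically and those sent to humans."""
    automatic: List[Prediction] = []
    for_review: List[Prediction] = []
    for prediction in predictions:
        confidence = prediction[2]
        if confidence >= threshold:
            automatic.append(prediction)
        else:
            # Low confidence: the algorithm finds this case difficult.
            for_review.append(prediction)
    return automatic, for_review


model_output = [
    ("a", "clickbait", 0.95),
    ("b", "news", 0.55),
    ("c", "clickbait", 0.80),
]
automatic, for_review = route_by_confidence(model_output)
print("accepted automatically:", [item_id for item_id, _, _ in automatic])
print("sent to human review:", [item_id for item_id, _, _ in for_review])
```

The items in `for_review` are the ones whose human answers would go both to the end user and back into the training set.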

In the following section, we’ll introduce Toloka and Pachyderm and look at how these two tools can help you build resilient human-in-the-loop pipelines.


Most AI products need a steady supply of labeled data, and more high-quality data generally leads to better models. Speed is of the essence: the faster you label your data, the sooner you can iterate, which speeds up developing and evaluating your models.

In many ML projects, employees are hired explicitly to do data labeling, but this is not always the best solution. Toloka allows you to use the power of the crowd to scale the labeling process up or down on demand.

The availability and diversity of the crowd allow Toloka to treat the labeling tasks like engineering tasks. The process can be automated using the open API with Python and Java clients. This means that data labeling can be integrated into ML pipelines easily and the flow of new data can automatically trigger the labeling process when needed.


Because Toloka can be used like a labeling cluster, it integrates with other ML tools, such as Pachyderm.


Pachyderm is the leader in data versioning and data pipelines for MLOps. It is the GitHub for your data-driven applications.

With Pachyderm’s data versioning you can organize and iterate on your data with repos and commits. Pachyderm allows you to version any type of data, be it images, audio, video, text, or anything else. The versioning system is optimized to scale to large datasets of any type, which makes it a perfect pairing for Toloka, giving you cohesive reproducibility.

In addition to versioning, Pachyderm’s pipelines allow you to connect your code to your data repositories. They can automate many components of the machine learning lifecycle (such as data preparation, testing, and model training) by re-running pipelines when new data is committed. Together, Pachyderm pipelines and versioning give you end-to-end lineage for your machine learning workflows.
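
To make the link between a pipeline and a data repo more concrete, here is a hedged sketch that builds a pipeline specification as a Python dictionary and prints it as JSON. The overall shape (`pipeline`, `input.pfs`, `transform`) follows the Pachyderm pipeline spec as we understand it, so check the documentation for your version. The repo name `clickbait_data`, the image `example/label-prep:latest`, and the script path are hypothetical placeholders, not names from the actual project.

```python
import json
from typing import Dict, List


def build_pipeline_spec(
    name: str, input_repo: str, image: str, cmd: List[str], glob: str = "/*"
) -> Dict:
    """Assemble a minimal pipeline spec that re-runs when `input_repo` gets a new commit."""
    return {
        "pipeline": {"name": name},
        # The glob pattern controls how input files are split into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The container image and command that process the input data.
        "transform": {"image": image, "cmd": cmd},
    }


spec = build_pipeline_spec(
    name="prepare-labeling-tasks",      # hypothetical pipeline name
    input_repo="clickbait_data",        # hypothetical repo name
    image="example/label-prep:latest",  # hypothetical container image
    cmd=["python", "/app/prepare.py"],  # hypothetical script
)
print(json.dumps(spec, indent=2))
```

A spec like this would typically be saved to a file and submitted with `pachctl create pipeline -f <file>`. Inside the container, the input repo is mounted under `/pfs/<repo>` and results are written to `/pfs/out`.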

Pachyderm and Toloka in action: Clickbait data

Toloka and Pachyderm have been working together to bring you an example of the integration of these two tools. We have created a pipeline that can be used to annotate clickbait data as shown below.


In order to do this, we have set up Pachyderm pipelines that orchestrate the labeling flow of Toloka. The diagram below shows the pipeline flow that we have created for this project.


Initially, we have a repo called “clickbait data” that holds the data that can be used to train the ML model. We then have several pipelines that manage Toloka to enrich this dataset with new examples.

Next, a Pachyderm pipeline creates a basic project in Toloka. Downstream pipelines then take CSV files with text, add tasks to Toloka, and create so-called honeypots (tasks with known correct answers) that help us with quality control of the annotation. Another pipeline launches the annotation process in Toloka, followed by the aggregation pipeline. Once we have the annotation results, a dataset concatenation pipeline merges the old data with the newly annotated examples.
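
The quality-control and aggregation steps can be illustrated with a small, self-contained sketch. It is not the aggregation method used in the project, which the article does not specify: it swaps in a plain majority vote over annotators who pass a honeypot accuracy check. The function names, the 0.5 accuracy cutoff, and the sample answers are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Answer = Tuple[str, str, str]  # (annotator_id, task_id, label)


def annotator_accuracy(answers: Iterable[Answer], honeypots: Dict[str, str]) -> Dict[str, float]:
    """Score each annotator on honeypot tasks, whose correct labels are known."""
    seen: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for annotator, task, label in answers:
        if task in honeypots:
            seen[annotator] += 1
            correct[annotator] += int(label == honeypots[task])
    return {annotator: correct[annotator] / seen[annotator] for annotator in seen}


def majority_vote(answers: Iterable[Answer], trusted: Set[str], honeypots: Dict[str, str]) -> Dict[str, str]:
    """Aggregate labels for regular tasks, counting only trusted annotators."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for annotator, task, label in answers:
        if annotator in trusted and task not in honeypots:
            votes[task][label] += 1
    # On a tie, Counter.most_common keeps the label that was counted first.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


honeypots = {"hp1": "clickbait"}
answers = [
    ("a1", "hp1", "clickbait"), ("a2", "hp1", "clickbait"), ("a3", "hp1", "not_clickbait"),
    ("a1", "t1", "clickbait"), ("a2", "t1", "clickbait"), ("a3", "t1", "not_clickbait"),
    ("a1", "t2", "not_clickbait"), ("a2", "t2", "not_clickbait"), ("a3", "t2", "clickbait"),
]

scores = annotator_accuracy(answers, honeypots)
trusted = {annotator for annotator, score in scores.items() if score >= 0.5}
labels = majority_vote(answers, trusted, honeypots)
print(labels)
```

Filtering annotators before voting keeps unreliable answers from outvoting careful ones. The aggregated labels are what a concatenation step would then append to the existing dataset.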

Perhaps the most exciting aspect of this setup is that every time we add new data without annotations to our repo in Pachyderm, the labeling process is automatically triggered. This means that the proposed pipeline is reusable if you need to continually annotate more data and retrain machine learning models.

In conclusion

In order to build machine learning models that work in the real world, we need to set ourselves up for iteration. This means incorporating the right tooling for data development and human-in-the-loop interaction. With Pachyderm and Toloka, data curation and management have never been more powerful or more flexible. It’s a whole new way to scale out your labeling tasks, while also versioning, managing, and automating your data transformations.

If you would like to check out the details of this integration project, you can see the code in this GitHub repo.

For details and examples, watch the video of the webinar with the full pipeline demo.

This blog has been republished by AIIA. To view the original article, please click HERE.