Test your data quality in minutes with PipeRider

tl;dr If you missed out on PipeRider’s initial release, then now is a great time to take it for a spin. Data reliability just got even more reliable with better dbt integration, data assertion recommendations, and reporting enhancements. PipeRider is open-source and...
Improving Your ML Datasets, Part 2: NER

In our first post, we dug into 20 Newsgroups, a standard dataset for text classification. We uncovered numerous errors and garbage samples, cleaned about 6.5% of the dataset, and improved the validation F1-score by 7.24 points. In this post, we look at a new task: Named...
Scaling Breast Cancer Detection with Pachyderm

Introduction: Breast cancer is a horrible disease that affects millions worldwide. In the US and other high-income countries, advances in medicine and increased awareness have significantly improved the survival rate of breast cancer to 80% or higher. However, in many...
Concept drift in machine learning 101

As machine learning models become increasingly popular solutions for automation and prediction tasks, many tech companies and data scientists have adopted the following working paradigm: the data scientist is tasked with a specific problem to solve, they receive a...