An odyssey on improving data quality with synthetic data and model delivery with MLOps

What is synthetic data?

The data pyramid, better known as DIKW (data, information, knowledge, wisdom).
  1. Prototype development: Collecting and modeling large amounts of real data is complicated and tedious. Generating synthetic data makes data available sooner and allows faster iteration on data collection and development for ML initiatives.
  2. Edge-case simulation: Collected data often does not cover every possible scenario, which hurts model performance. In such cases, we can add those rare scenarios by generating them artificially.
  3. Dataset augmentation, bias & fairness: We do not always have the required amount of data, and some classes may be under-represented. Automated decision-making can make these issues even worse, but synthetic data is a good option for mitigating them.
  4. Data privacy: Synthetic data is a way to protect privacy while still sharing microdata, allowing organizations to share synthetic versions of sensitive and personal data with fewer concerns about privacy regulations.
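
To make the dataset augmentation point concrete, here is a minimal sketch of creating extra rows for an under-represented class. It uses plain pairwise interpolation between existing minority rows (similar in spirit to SMOTE, but without the nearest-neighbour step), not the GAN-based approach used later in this article. The function name `interpolate_minority` and the toy values are assumptions made for illustration.

```python
import random


def interpolate_minority(samples, n_new, seed=0):
    """Create n_new synthetic rows by interpolating between random pairs of rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct minority rows
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
new_rows = interpolate_minority(minority, n_new=5)
```

Each synthetic row lies on the segment between two real rows, so it stays inside the range already covered by the minority class. That is also the main limitation: interpolation cannot produce scenarios outside what was observed.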

What is Machine Learning Operations?

  1. Scalable serving: Algorithms should be scalable. Demand for your algorithms is often not constant; with scalability, you always have the right amount of resources, which decreases costs and maintains speed.
  2. Standardization and portability: Having repeatable and expected results is key in any professional environment. In MLOps, this boils down to having models in a standard format that can run everywhere and, most importantly, run the same everywhere.
  3. Version control: Reliability has a lot to do with understanding and tracking changes in your project. This helps, for example, with quickly reverting to an older version when a problem occurs. It applies not only to the source code, for example with Git, but also to the serving level, with different versions ready to go in case a new version has unexpected problems.
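
The serving-level version control described in the last item can be sketched as a tiny in-memory registry. This is only an illustration under simplifying assumptions: `ModelRegistry` and its methods are hypothetical names, the "models" are plain Python callables, and a real MLOps platform would persist versions and route traffic for you.

```python
class ModelRegistry:
    """Hypothetical registry that keeps every deployed version so serving can roll back."""

    def __init__(self):
        self._versions = []  # (tag, model) pairs in deployment order
        self._active = None  # index of the version currently served

    def deploy(self, tag, model):
        """Register a new version and make it the active one."""
        self._versions.append((tag, model))
        self._active = len(self._versions) - 1

    def rollback(self):
        """Switch serving back to the previously deployed version."""
        if self._active is None or self._active == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1

    @property
    def active_tag(self):
        if self._active is None:
            raise RuntimeError("no version deployed yet")
        return self._versions[self._active][0]

    def predict(self, x):
        """Delegate the prediction to the active version."""
        if self._active is None:
            raise RuntimeError("no version deployed yet")
        return self._versions[self._active][1](x)


registry = ModelRegistry()
registry.deploy("v1", lambda score: score > 0.5)  # toy threshold "model"
registry.deploy("v2", lambda score: score > 0.9)  # newer version
registry.rollback()                               # v2 misbehaves, serve v1 again
```

Because older versions stay registered, reverting is a pointer change rather than a redeployment, which is what keeps the rollback fast.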

Putting it all together: from raw data to production

Highly imbalanced data. Data source: Credit Card Fraud from Kaggle
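
A quick way to quantify this kind of imbalance is to compute the share of each class in the label column. The sketch below uses a made-up label list (not the Kaggle figures), and `class_shares` is a hypothetical helper name.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows that belongs to each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Toy labels: 0 = legitimate, 1 = fraud (illustrative values only).
labels = [0] * 998 + [1] * 2
shares = class_shares(labels)
```

With shares this skewed, a model that always predicts the majority class already scores very high accuracy, which is why the minority class is the one worth augmenting.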
Generating synthetic data with WGAN-GP
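
For reference, the critic objective that gives WGAN-GP its name is usually written as below. The notation is an assumption taken from the common formulation, not from this article: $D$ is the critic, $P_r$ and $P_g$ are the real and generated data distributions, and $\lambda$ is the gradient penalty weight.

```latex
L_{\text{critic}} =
  \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
  - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms are the Wasserstein critic loss. The third term penalizes the critic when the norm of its gradient, measured at points interpolated between real and generated samples, moves away from 1.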
Different parameter choices produce different results.