Evaluation Stores: Closing the ML Data Flywheel?

Model Stores, Feature Stores, and now Evaluation Stores? Is the MLOps space going crazy, or is there a real need for these tools?

10 min read · Nov 29, 2021

Kandinsky, *Squares with Concentric Circles*, 1913 [9]

In this post, we take a look at the Evaluation Store concept as it stands in 2021.

Example Story

It all started with that ML system alert. 📉🔥🔥

The on-call MLEs pick up their pagers. They see that the accuracy on a hyper-important model has dropped for 2 days. The drop in accuracy triggers a rolling window alert.
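For readers who have not set one up, a rolling window alert of this kind can be sketched in a few lines. This is only an illustration: the function name `rolling_accuracy_alerts`, the 2-day window, the 0.90 threshold, and the accuracy values are all made up for the example and are not taken from the system in the story.

```python
from collections import deque


def rolling_accuracy_alerts(daily_accuracy, window=2, threshold=0.90):
    """Return (day_index, rolling_mean) pairs for every day on which the
    mean accuracy over the last `window` days is below `threshold`.

    Hypothetical helper: window size and threshold are illustrative.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` values
    alerts = []
    for day, accuracy in enumerate(daily_accuracy):
        recent.append(accuracy)
        if len(recent) < window:
            continue  # not enough history yet to fill the window
        rolling_mean = sum(recent) / window
        if rolling_mean < threshold:
            alerts.append((day, rolling_mean))
    return alerts


# Made-up daily accuracy values: stable for three days, then a drop.
daily_accuracy = [0.95, 0.94, 0.95, 0.88, 0.86]
alerts = rolling_accuracy_alerts(daily_accuracy)
```

Averaging over a window rather than alerting on a single bad day trades some detection latency for fewer false pages, which is one plausible reason the drop in the story is only noticed after 2 days.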

The MLEs check the system metrics of the serving infra, but nothing pops out. ✅

Then they make sure that the model that is currently deployed is the expected one. “model_10_01_2021__10_31_2021.pkl” ✅

Then they check that the volume of the prediction requests is not too different from the trend. ✅

Then they attempt to retrieve the offline model metrics from the model store. They check whether there is a way to compare this model to the previous version, deployed last week. ✅

The offline metrics do not look too different on the experimentation tracking dashboards. ✅

Written by Moussa Taifi PhD