ML Feature Stores: A Casual Tour 1/3

Moussa Taifi PhD
Apr 13, 2020

The Missing Link of ML Infrastructures. Part 1/3

This is part 1 of a series of articles: “ML Feature Stores: A Casual Tour”. Part 2 is here and Part 3 is here.

The combined pressures of data monetization and privacy compliance are escalating at a dramatic pace. Machine learning (ML) and data science (DS) teams are asked to ship autonomous and intelligent products at a faster rate. This comes with many hurdles. A central challenge for ML practitioners looking to scale their work is feature management. In the past few years, ML workflows have been improving rapidly. However, many ML engineering teams ignore this crucial piece of ML infrastructure until it is too late and their product velocity starts to suffer.

In this series of posts, we clear the fog around how the need for a feature store manifests itself. We then explore the main goals of feature stores. Finally, we examine the organizational/product drivers that need to be in place to warrant such an investment in machine learning infrastructure.

The Story

If you have ever been part of a growing team of data scientists that needs to collaborate, deploy, and maintain models in production, then you have probably seen the following scenario:

Say you are an ML engineer who is working on predicting the best place to plant trees for carbon capture (not a totally made up use case). You are tasked with increasing the accuracy of the model by including new data you need from your satellite imagery team. You are excited about the prospect of doing something meaningful with the latest technology. You will finally be able to use the latest tensorflow, pytorch, fastai, [Insert your favorite ML ] processing framework to do your work.

Unfortunately, you find out that the satellite imagery team is streaming images in Flink from Kafka, transforming them in Spark, and storing them in a distributed store on a cloud provider. Further, they are enriching the images with socio-economic metadata that relates to the locations at hand in Postgres. Then they are storing the results in a document store, MongoDB, that you didn’t know existed until today, for a different product your company is working on.

After following many dead-ends, you finally find the dataset you are looking for. But oh no! Something is wrong with the resolution of the images, which makes them incompatible with the embeddings strategy for your initial prototype. That will require a production pipeline update and a satellite firmware fix. See you in 4 to 8 weeks!

You get the idea: getting onboarded onto such a system can take weeks or months. While all this “fictitious” ML engineer wanted was a set of features to include in their model, they instead end up dealing with a complex data engineering stack.

Before the Feature Store

In figure 1, adapted from [1], we see how confusing it can be for ML modelers to reason about the source of their features. Multiple copies of the same feature are built under the same name. You might not have noticed, but models 2, 3, and 4 all use feature y, yet their data sources are not consistent. In this case, model 2 uses columns 1 and 2 from the data table, while models 3 and 4 use columns 1 and 2 as well as reference data from a CSV file. In the ML code, feature y might be called “num_of_purchases_past_month”, but it is calculated in two different ways by models 2, 3, and 4.

Figure 1: Complex feature sharing without a feature store
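To make the inconsistency concrete, here is a minimal Python sketch of two teams computing a feature named “num_of_purchases_past_month” in two different ways. All table contents, user IDs, and function names are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical raw sources: a purchases table (columns 1 and 2) and a
# reference CSV of offline orders that only some teams know about.
purchases_table = [
    ("user_1", date(2020, 3, 20)),
    ("user_1", date(2020, 3, 28)),
    ("user_1", date(2020, 2, 10)),  # outside the 30-day window
]
reference_csv_orders = [("user_1", date(2020, 3, 25))]

def num_of_purchases_past_month_v2(user, today):
    """Model 2's definition: table columns only."""
    cutoff = today - timedelta(days=30)
    return sum(1 for u, d in purchases_table if u == user and d >= cutoff)

def num_of_purchases_past_month_v3(user, today):
    """Models 3 and 4's definition: table columns plus the reference CSV."""
    cutoff = today - timedelta(days=30)
    rows = purchases_table + reference_csv_orders
    return sum(1 for u, d in rows if u == user and d >= cutoff)

today = date(2020, 4, 13)
# Same feature name, two different values -- a silent inconsistency.
print(num_of_purchases_past_month_v2("user_1", today))  # 2
print(num_of_purchases_past_month_v3("user_1", today))  # 3
```

Nothing in the code stops both definitions from shipping under the same name, which is exactly the failure mode figure 1 illustrates.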

There must be a better way….

What is a Feature?

A feature is a measurable property of a phenomenon under observation [11]. This can be a pixel, a word, a table column, an aggregate value (sum, avg, median, max, …), or another representation such as an embedding of an n-gram.

Static vs Dynamic Features

Features are usually thought of as static elements: they are calculated once and stored for later use. However, the process that generates materialized features can be a model of its own. For example, take word2vec and GloVe word embeddings. These are large models that are trained on hyper-scale text corpora. The final result is a materialized feature: a vector of numbers representing each word in the vocabulary. The feature concept is thus more of a process than a static value. This process requires tuning and fitting of models such as LDA, topic extraction, TF-IDF, and the like.
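As a toy illustration of “feature as a process”, the sketch below fits a tiny TF-IDF model from scratch (standard library only, with a simplified idf formula) and materializes the resulting vectors as static features. The corpus and helper names are invented for this example:

```python
import math
from collections import Counter

# A tiny corpus standing in for a hyper-scale one.
corpus = [
    "plant trees for carbon capture",
    "satellite images of trees",
    "carbon markets and pricing",
]

def fit_idf(docs):
    """The 'training' step of the feature-generating process."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    return {t: math.log(n / c) + 1.0 for t, c in df.items()}

def materialize(doc, idf):
    """The materialized feature: a static vector derived from the fitted model."""
    tf = Counter(doc.split())
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

idf = fit_idf(corpus)  # the dynamic part: fit once, possibly refit later
features = {i: materialize(doc, idf) for i, doc in enumerate(corpus)}
# 'trees' appears in two docs, so its weight is lower than 'capture'.
```

The stored vectors look static, but refitting `fit_idf` on new text changes every downstream value, which is why the process itself needs versioning.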

Redundancy and Discoverability

Features are a vital part of the ML process, but they do not usually have dedicated infrastructure to support them. ML engineers end up building tables and storage locations for specific product pipelines. This leads to redundancy: the same feature gets built multiple times, wasting resources. Even when the feature owners make the features available, other potential users lack discoverability, e.g. “I wonder if someone already calculates the embeddings of these tweets”. This is usually coupled with a lack of trust in the process that generates each feature and in the documentation that accompanies it.
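A feature registry is one common answer to the discoverability problem. The sketch below is a hypothetical in-memory registry; the class, methods, and entries are invented for illustration and do not correspond to any real feature store API:

```python
class FeatureRegistry:
    """Hypothetical catalog mapping feature names to their metadata."""

    def __init__(self):
        self._features = {}

    def register(self, name, owner, description, source):
        self._features[name] = {
            "owner": owner, "description": description, "source": source,
        }

    def search(self, keyword):
        """Answer questions like 'does anyone already embed these tweets?'"""
        return [
            name for name, meta in self._features.items()
            if keyword in name or keyword in meta["description"]
        ]

registry = FeatureRegistry()
registry.register(
    "tweet_embeddings_v1",
    owner="nlp-team",
    description="sentence embeddings of tweets, refreshed daily",
    source="s3://features/tweet_embeddings/",  # made-up path
)
print(registry.search("embeddings"))  # ['tweet_embeddings_v1']
```

Even this trivial lookup replaces the Slack-and-folklore search the story above describes, and the recorded owner and source are the start of the trust that engineers currently lack.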

Feature Maturity and Back-fills Integration

Features have a slow maturity process. For example, say we start logging a new feature in production; it may take weeks for the forward-filled data to reach a usable sufficiency threshold. In most cases, what ML engineers really want is a common way to perform backfills on features, as well as to compute features on demand. The feature store is usually one piece in the overall feature engineering framework. It adds a robust way of collecting the metadata around any feature’s creation, usage, and maintenance. This can help organize external backfilling systems that use the feature store as the source of truth for various datasets.
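For instance, a backfilling job could treat feature-store metadata as the source of truth for which daily partitions still need computing. The metadata layout below is an assumption made for this sketch, not a real feature store schema:

```python
from datetime import date, timedelta

# Assumed metadata record for one feature: when production logging began
# and which daily partitions have already been materialized.
feature_metadata = {
    "name": "num_of_purchases_past_month",
    "first_logged": date(2020, 4, 1),
    "materialized_days": {date(2020, 4, 1), date(2020, 4, 2)},
}

def missing_partitions(meta, through):
    """Days between first_logged and `through` with no materialized data."""
    day, missing = meta["first_logged"], []
    while day <= through:
        if day not in meta["materialized_days"]:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# An external backfill job would compute exactly these partitions,
# instead of guessing or recomputing everything from scratch.
print(missing_partitions(feature_metadata, date(2020, 4, 5)))
```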

Train-Serve Skew

Finally, the train-serve skew is real. Custom-built ML pipelines are easy to mess up. For example, say an ML engineering team builds a model offline on a static dataset. Once satisfied, they move their ML pipeline to production. Next, surprise surprise, they find out that the predictive performance of the production model is much lower than the offline evaluation suggested. They take a closer look and find that the distribution of the features passed to the live models is significantly different from the features they used offline. What they want here is a standardization of data characteristics between training and serving/scoring.
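A minimal skew check might compare summary statistics of the same feature as seen offline (training) and online (serving). The values and the alert threshold below are purely illustrative:

```python
import statistics

train_values = [10.0, 12.0, 11.5, 9.8, 10.7]   # offline feature values
serve_values = [21.0, 19.5, 22.3, 20.1, 18.9]  # same feature, live traffic

def mean_shift(train, serve):
    """Relative shift of the serving mean from the training mean."""
    mu_train = statistics.mean(train)
    return abs(statistics.mean(serve) - mu_train) / mu_train

shift = mean_shift(train_values, serve_values)
if shift > 0.25:  # illustrative alert threshold
    print(f"train-serve skew detected: mean shifted by {shift:.0%}")
```

Real monitoring compares full distributions rather than means alone, but the point stands: with feature statistics recorded at training time, a check like this can run automatically instead of waiting for production metrics to crater.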

The Feature Store to the Rescue

After the Feature Store

Enter the Feature Store. With this specialized store, the DS/ML team can search for features that are already available. They can reuse existing features without redoing the heavy data engineering required to generate them. This comes with noticeable side benefits: it can reduce infrastructure costs and improve the access control and governance of features.

Figure 2: Simplified feature management with the introduction of a feature store

In the picture above, we have a metadata layer that coordinates data access between the applications that need the features and the data sources. With this feature management layer comes a set of promised benefits:

  • Discoverability and search for features.
  • Reuse of existing features.
  • Metadata tracking for feature backfilling and caching.
  • Computed statistics on the features for distribution validation and anomaly detection.
  • Documentation of feature sources, generators, and consumers.
  • Versioning of features.
  • Cost reduction due to data reuse.
  • Data format flexibility to serve multiple types of consumers and ML libraries.
  • Point-in-time accuracy where the metadata store can help construct time-travel queries efficiently.
  • Train/Serve Consistency both in terms of naming, data transformation, and actual data returned to the data consumers.
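Point-in-time accuracy, for example, boils down to answering “what was this feature’s value as of timestamp t?” without leaking future data into training rows. A minimal sketch, assuming the store keeps a sorted history of (effective_time, value) pairs:

```python
from datetime import datetime

# Assumed storage layout: feature values with the time they became effective.
feature_history = [  # sorted ascending by effective time
    (datetime(2020, 4, 1), 3),
    (datetime(2020, 4, 5), 7),
    (datetime(2020, 4, 9), 9),
]

def value_as_of(history, ts):
    """Latest feature value whose effective time is <= ts (no future leakage)."""
    value = None
    for effective, v in history:
        if effective <= ts:
            value = v
        else:
            break
    return value

# A training label observed on Apr 6 must see 7, never the future value 9.
print(value_as_of(feature_history, datetime(2020, 4, 6)))  # 7
```

The metadata store’s job is to make this lookup efficient at scale (time-travel queries over many features and entities), but the correctness contract is exactly this simple rule.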

When does a feature store make sense for an organization?

Building and managing a feature store is no small undertaking. Your current understaffed ML infrastructure team has more urgent things to do than to include an additional dependency in their stack. This sort of investment starts making sense when a combination of elements starts creating chaos in your ML pipelines:

  • Your organization starts having to maintain and evolve at least a dozen models that run in production.
  • These applications also need to demand online serving before the benefits of removing the train/serve skew start paying off.
  • If the apps do not have an online serving component to them, then at least they need to demand frequent model retraining and scoring from recent interaction data.
  • The apps that benefit the most from a feature store are the ones with a high number of structured data sources.

Figure 3 shows an approximate spread of applications depending on their data volume and structure needs.

Figure 3: Approximate spread of ML applications over data structure and volume needs

The rationale for figure 3 is that for applications with few data sources, the data is available all at once: a whole image or document to train on. Features, in this case, are extracted using the pixel values of an image or the words of a document. While very unstructured, these sorts of applications can be deployed while depending on only one or two data sources.

The situation is different for applications with a large number of data sources, such as a personalized ranking of products for an e-commerce application. In this case, the feature search space is composed of multiple tables, files, streams, and other enterprise APIs. The data is collected on the fly as consumers interact with the predictive service or the products they are interested in. For example, this can include changing product attributes/availability, past purchases, user attributes, user social interactions, current cart items, current search words, current session paths, and past impressions and clicks by a user.

In the case of a high number of data sources, the ML engineering team will be spending a lot of time choosing and iteratively curating the data sets used for their ML applications. This is precisely the type of use cases that will benefit from a feature store to automate, record, and expose feature engineering pipelines.

That’s it for today. In future parts of this tour, we will look at a set of public/private feature stores and feature engineering frameworks, and examine how various companies implemented such an abstraction in their ML pipelines:

  • Hopsworks Feature Store
  • Go-Jek Feast Feature Store
  • Twitter Feature Libraries/Store
  • Uber Michelangelo-Palette Feature Store
  • Airbnb Zipline Feature Store / feature engineering framework
  • Netflix Fact/Feature Store
  • LinkedIn Modoop/Frame Feature management

Until then here are some references if you are interested in learning more about this exciting area of machine learning infrastructure.

References:

[11] Bishop, Christopher (2006). Pattern Recognition and Machine Learning. Berlin: Springer.

Blog picture from https://pixabay.com/users/tama66-1032521/

Disclaimer: The views expressed on this post are mine and do not necessarily reflect the views of my current or past employers.


Moussa Taifi PhD

Senior Data Science Platform Engineer — CS PhD— Cloudamize-Appnexus-Xandr-AT&T-Microsoft — Books: www.moussataifi.com/books