I’ve spent the last few months thinking heavily about feature stores. It’s the hottest new buzzword in the ML space, and everyone has a distinct implementation laser-focused on their own use cases.
A recent article¹ that I read covered this exact topic and did a great job summarizing the fundamental problem: these implementations don’t create a general-purpose conceptual framework for what a feature store is, focusing instead on the outcomes of their particular use cases. If we set aside what we’ve read about these implementations and rethink this from the ground up, we may be able to design a general-purpose feature store that works for any use case.
What is a feature store? A feature store is a shareable repository of predictive features, both complex and simple, for use in near real-time machine learning and business intelligence. As you can see, this is a broad and general definition. That’s because a feature store has a plethora of use cases, all of which should be possible if the store is architected correctly. So let’s start from the ground up and list some of the minimal requirements.
A fundamental premise of the feature store is that it must be shareable across an organization; features must be accessible by any team that needs them. This is what allows for the reuse of complex features that may take weeks or months to develop. The idea of shared features is so prevalent that Twitter coined the metric “Sharing Adoption: The number of teams who use another team’s features in production.” Having a single repository of features that data scientists can search and reuse to help solve their problems is crucial to their productivity.
If you have a fairly mature data science organization, you will likely have hundreds or thousands of features, with potentially millions of records. Easily searching through those features, whether that be through SQL or a dataframe-like API, is a must-have for data scientists to be successful.
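As a rough illustration of what SQL-based feature search could look like, here is a minimal sketch using Python's built-in sqlite3 module. The `feature_registry` table, its columns, the sample rows, and the `search_features` helper are all hypothetical names invented for this example; a real feature store would expose its own catalog.

```python
import sqlite3

# Hypothetical feature registry: one row of metadata per feature.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE feature_registry (
           name TEXT PRIMARY KEY,
           owner_team TEXT,
           entity TEXT,
           description TEXT
       )"""
)
conn.executemany(
    "INSERT INTO feature_registry VALUES (?, ?, ?, ?)",
    [
        ("daily_spend_total", "payments", "user",
         "Sum of card spend per user per day"),
        ("last_transaction_amount", "payments", "user",
         "Amount of the most recent transaction"),
        ("spend_propensity_cluster", "marketing", "user",
         "Cluster label for propensity to spend"),
        ("store_weekly_footfall", "retail", "store",
         "Visits per store per week"),
    ],
)


def search_features(conn, keyword, entity=None):
    """Return (name, owner_team) for features whose name or description
    contains the keyword, optionally restricted to one entity type."""
    sql = (
        "SELECT name, owner_team FROM feature_registry "
        "WHERE (name LIKE ? OR description LIKE ?)"
    )
    params = [f"%{keyword}%", f"%{keyword}%"]
    if entity is not None:
        sql += " AND entity = ?"
        params.append(entity)
    sql += " ORDER BY name"
    return conn.execute(sql, params).fetchall()


# Find user-level features related to spending.
results = search_features(conn, "spend", entity="user")
```

The same metadata could just as easily be loaded into a dataframe and filtered there; the point is only that feature discovery is a query over metadata, not a hunt through notebooks.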
For a feature store to be trusted, the origin and implementation of each feature must be available for investigation. To achieve this, the feature generation process must be well documented, with its lineage available for anyone to review.
A central theme of the feature store is its collaborative nature. Along with transparency and sharing, versioning ensures trust: what you see is what you get. When you put a feature in your model, especially one you didn’t create yourself, it’s crucial you know it’s up to date. This comes in two forms:
- Versioning how a feature is calculated and if that calculation changes
- Versioning when the feature was last updated
Versioning the feature generation ensures that if the implementation of a specific feature changes over time, the change is tracked: who built it, how it changed, and potentially why it changed. Proper versioning and lineage information make features auditable.
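One way to picture versioning of the calculation itself is an append-only log of definitions. The sketch below is an assumption-laden illustration, not a reference design: `FeatureVersion`, `DefinitionRegistry`, and the sample SQL strings are hypothetical, and the definition is treated as plain text that gets hashed to detect changes.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FeatureVersion:
    version: int
    definition: str      # e.g. the SQL or code that computes the feature
    author: str          # who built or changed it
    reason: str          # why it changed
    created_at: datetime
    checksum: str        # hash of the definition text


class DefinitionRegistry:
    """Append-only history of how each feature is calculated (hypothetical)."""

    def __init__(self):
        self._versions = {}  # feature name -> list of FeatureVersion

    def register(self, name, definition, author, reason):
        history = self._versions.setdefault(name, [])
        checksum = hashlib.sha256(definition.encode("utf-8")).hexdigest()
        # An unchanged definition does not create a new version.
        if history and history[-1].checksum == checksum:
            return history[-1]
        new_version = FeatureVersion(
            version=len(history) + 1,
            definition=definition,
            author=author,
            reason=reason,
            created_at=datetime.now(timezone.utc),
            checksum=checksum,
        )
        history.append(new_version)
        return new_version

    def lineage(self, name):
        """Full change history for one feature, oldest first."""
        return list(self._versions.get(name, []))


registry = DefinitionRegistry()
v1 = registry.register(
    "daily_spend_total",
    "SELECT user_id, SUM(amount) FROM transactions GROUP BY user_id",
    author="ana",
    reason="initial definition",
)
v2 = registry.register(
    "daily_spend_total",
    "SELECT user_id, SUM(amount) FROM transactions "
    "WHERE is_refund = 0 GROUP BY user_id",
    author="ben",
    reason="exclude refunds",
)
```

Calling `registry.lineage("daily_spend_total")` then answers the audit questions directly: who changed the feature, in what order, and why.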
Versioning the feature itself saves you from accidentally making decisions on stale data. Some features may be updated every minute, every 24 hours, every 2 weeks, or even once a month (think RFM: recency, frequency, monetary value). Others may change every time a user logs into an account or swipes a credit card. It’s crucial not only to track when these features were updated, but also to keep a full history of those changes.
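To make the second kind of versioning concrete, here is a small sketch of a per-feature value history that supports "as of" lookups and a staleness check. `FeatureHistory`, `max_age`, and the sample timestamps are hypothetical; a production store would persist this history rather than hold it in memory.

```python
import bisect
from datetime import datetime, timedelta


class FeatureHistory:
    """Timestamped history of one feature value for one entity (hypothetical)."""

    def __init__(self, max_age):
        self.max_age = max_age  # how old the latest value may be before it is stale
        self._times = []        # sorted update timestamps
        self._values = []       # values aligned with _times

    def write(self, ts, value):
        # Insert while keeping timestamps sorted, so late writes still work.
        i = bisect.bisect_right(self._times, ts)
        self._times.insert(i, ts)
        self._values.insert(i, value)

    def as_of(self, ts):
        """Return (updated_at, value) for the latest write at or before ts."""
        i = bisect.bisect_right(self._times, ts)
        if i == 0:
            return None
        return self._times[i - 1], self._values[i - 1]

    def is_stale(self, now):
        latest = self.as_of(now)
        return latest is None or now - latest[0] > self.max_age


# A feature expected to refresh daily.
history = FeatureHistory(max_age=timedelta(hours=24))
history.write(datetime(2024, 1, 1, 9, 0), 12.5)
history.write(datetime(2024, 1, 2, 9, 0), 30.0)

snapshot = history.as_of(datetime(2024, 1, 1, 12, 0))
stale_later = history.is_stale(datetime(2024, 1, 4, 0, 0))
```

Keeping the full history rather than only the latest value is what lets you rebuild the exact feature values a model saw at training time.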
This may seem obvious, but it’s just as important: strict control over who can query the feature store, and a record of who has queried it historically, are critical. A manager needs to easily restrict access to features (even at a column level) and see who in the organization has requested certain features, and when.
Along the same lines, insert/update privileges should be tightly controlled as well.
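The sketch below shows one possible shape for feature-level (column-level) read grants combined with an audit trail. `AccessController` and its methods are hypothetical names for illustration; real deployments would typically lean on the database's own grants and logging.

```python
from datetime import datetime, timezone


class AccessController:
    """Per-feature read grants plus an audit log of every request (hypothetical)."""

    def __init__(self):
        self._read_grants = {}  # user -> set of feature names they may read
        self.audit_log = []     # one entry per request, allowed or not

    def grant_read(self, user, features):
        self._read_grants.setdefault(user, set()).update(features)

    def revoke_read(self, user, features):
        self._read_grants.get(user, set()).difference_update(features)

    def authorize(self, user, requested):
        """Record the request, then raise if any requested feature is not granted."""
        allowed = self._read_grants.get(user, set())
        denied = sorted(set(requested) - allowed)
        self.audit_log.append(
            {
                "user": user,
                "requested": list(requested),
                "denied": denied,
                "at": datetime.now(timezone.utc),
            }
        )
        if denied:
            raise PermissionError(f"{user} may not read: {denied}")
        return list(requested)

    def who_requested(self, feature):
        """Everyone who asked for a feature, and when, including denied requests."""
        return [
            (entry["user"], entry["at"])
            for entry in self.audit_log
            if feature in entry["requested"]
        ]


acl = AccessController()
acl.grant_read("alice", {"daily_spend_total", "last_transaction_amount"})
granted = acl.authorize("alice", ["daily_spend_total"])
```

Denied requests are logged before the error is raised, so the audit trail shows attempted access as well as successful reads. Write privileges could be handled with a parallel set of grants.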
Online & Offline (Batch and Realtime)
This is an important concept, and there is a lot to think about here. I’ve seen some confusion in the MLOps space surrounding online/offline feature stores and differing definitions. I’m going to attempt a definition and provide an alternative approach.
A feature store has typically been segmented into two distinct workloads, offline and online.
- Offline Workload: Sometimes, people refer to offline as a process that creates features in batch at set intervals. Other times, people refer to offline as the process of training models “offline” using larger, batch-based queries.
- Online Workload: Sometimes, this is thought of as a process that creates features in the moment from new data. Other times, people refer to online as the serving of models, making real-time inferences on new data and returning the predictions.
These two workloads are often handled by two separate feature stores, each with its own dedicated compute engine, that communicate with each other as necessary. That is problematic. Ideally you would have one engine that is both your “offline” and your “online” feature store: one repository for all of your features, able to handle both batch analytical queries and low-latency lookups. This reduces operational complexity and data movement within your system, lowering the potential for data corruption.
A potentially simpler paradigm is to separate the concept of feature calculation from that of feature utilization, with the understanding that all features must support low-latency lookups for utilization.
Feature calculation has two dimensions: frequency and complexity. Frequency determines when a given feature is updated: either in real time (event driven) or in batch (scheduled). Complexity, on the other hand, describes the computational scope required to generate the feature; think simple arithmetic functions on a record versus entire data pipelines. Features can be both real time and complex, such as updating the analytical aggregation “average monthly food spend” upon an individual payment at a restaurant.
These features, whether calculated in batch or in real time, must be readily available with millisecond latency for business logic and machine learning models at inference time.
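As a sketch of a feature that is both real time and analytical, the “average monthly food spend” aggregate can be maintained incrementally on each payment event instead of being recomputed from scratch. The class below is hypothetical, and it assumes the average is defined as total food spend divided by the number of calendar months with at least one food payment; other definitions are equally plausible.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AvgMonthlyFoodSpend:
    """Event-driven running aggregate for one user (hypothetical definition:
    total food spend / number of months with at least one food payment)."""

    total: float = 0.0
    months: set = field(default_factory=set)  # (year, month) pairs seen so far

    def on_payment(self, paid_on, amount, category):
        # Only food payments affect this feature; other events are ignored.
        if category == "food":
            self.total += amount
            self.months.add((paid_on.year, paid_on.month))
        return self.value

    @property
    def value(self):
        return self.total / len(self.months) if self.months else 0.0


feature = AvgMonthlyFoodSpend()
feature.on_payment(date(2024, 1, 5), 30.0, "food")
feature.on_payment(date(2024, 1, 20), 50.0, "food")
feature.on_payment(date(2024, 2, 3), 200.0, "travel")  # ignored
latest = feature.on_payment(date(2024, 2, 10), 40.0, "food")
```

Each event costs constant work, which is what makes it feasible to update an analytical aggregate at swipe time while still serving it with a simple lookup.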
Using two disparate engines to manage your feature store is difficult and increases your architectural complexity. Think about the lifecycle of a machine learning model. The training of that model is done with the “offline” feature store, but its inference may occur in real time (using the “online” feature store). Having two different engines, with potentially inconsistent or stale data, can lead to poor models and bad business outcomes.
Consider this real-world example. Imagine the following three features: a daily spending aggregation per user, the last transaction amount made by a user, and the output of a clustering algorithm that groups customers by propensity to spend. The daily spending aggregation is scheduled to run once a day, whereas the last transaction amount and the clustering output are updated in real time as their inputs change. The clustering feature is then used in other online applications such as email marketing campaigns and special offerings.
Every time a user swipes a credit card, the feature store is updated with the new “last transaction amount” feature. This new value, together with the aggregation feature that was updated at the end of the day, is fed into your clustering algorithm. Ideally, your feature store would propagate those changes downstream automatically, triggering the evaluation of the clustering algorithm, which may (if its output changes) trigger further changes, such as a flag on a user’s account for a new offer.
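The propagation described above can be sketched as a small dependency graph in which writing a feature re-evaluates whatever depends on it, and stops as soon as a value does not change. Everything here is hypothetical: `ReactiveFeatureStore` is an invented name, the threshold rule is only a stand-in for a real clustering model, and the sketch assumes the dependency graph has no cycles.

```python
class ReactiveFeatureStore:
    """In-memory store that pushes changes to dependent features (hypothetical)."""

    def __init__(self):
        self.values = {}
        self._derived = {}     # derived name -> (input names, function)
        self._dependents = {}  # input name -> list of derived names
        self.recomputed = []   # trace of re-evaluations, for inspection

    def define_derived(self, name, inputs, fn):
        self._derived[name] = (tuple(inputs), fn)
        for input_name in inputs:
            self._dependents.setdefault(input_name, []).append(name)

    def put(self, name, value):
        # Unchanged values do not propagate any further.
        if name in self.values and self.values[name] == value:
            return
        self.values[name] = value
        for dependent in self._dependents.get(name, []):
            inputs, fn = self._derived[dependent]
            # Wait until every input has a value before evaluating.
            if all(i in self.values for i in inputs):
                self.recomputed.append(dependent)
                self.put(dependent, fn(*(self.values[i] for i in inputs)))


store = ReactiveFeatureStore()

# Stand-in for the clustering algorithm: a simple threshold rule.
store.define_derived(
    "spend_cluster",
    ["daily_spend_total", "last_transaction_amount"],
    lambda daily, last: "high" if daily + last >= 100 else "low",
)
# Downstream business logic: flag the account when the cluster is "high".
store.define_derived(
    "new_offer_flag", ["spend_cluster"], lambda cluster: cluster == "high"
)

store.put("daily_spend_total", 40.0)        # batch job, end of day
store.put("last_transaction_amount", 15.0)  # card swipe: cluster is "low"
store.put("last_transaction_amount", 20.0)  # cluster unchanged, flag not re-run
store.put("last_transaction_amount", 80.0)  # cluster flips to "high", flag set
```

In this sketch the second swipe re-evaluates the cluster but not the offer flag, because the cluster label did not change. That short-circuit is what keeps propagation cheap when most events leave downstream features untouched.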
Having a single engine that can handle both workloads simultaneously creates a single source of truth, a single component to manage frequency and complexity.
We’ve discussed a set of minimal requirements for an enterprise-ready feature store; however, this is likely not a comprehensive list. As we learn more about the use cases of machine learning in industry, this list will mature and grow. If you think I missed any must-have requirements, let me know in the comments. And if you’ve implemented a feature store of your own that you think is great, we’d love to hear about that too.
This blog has been republished by AIIA.