Friday, May 21, 2021

Scaling Data Science using Feature Stores

Photo by Kevin Ku from Pexels

Feature engineering, the process of creating features, is a complex but essential part of any machine learning workflow. Better features produce better models, which in turn lead to better business outcomes.

Generating a new feature takes an enormous amount of work, and building the feature pipeline is only one part of it. You have probably gone through a long trial-and-error process, testing a large number of candidate features, before arriving at one you are happy with. After that, operational pipelines are needed to compute and store the feature, and these differ depending on whether the feature is online or offline.

In addition, every data science project begins with a search for the right features. The problem is that there is usually no single, centralized place to search; features are scattered everywhere.

The Feature Store is more than a data layer: it also lets users transform raw data and store the results as features that are ready for use in any type of machine learning model.
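To make the idea concrete, here is a minimal sketch of a feature store as "transformations plus stored values". Everything in it (the `InMemoryFeatureStore` class, its method names, and the sample features) is hypothetical and invented for illustration; it is not the API of any real feature store product.

```python
class InMemoryFeatureStore:
    """Toy feature store: named transformations plus a table of computed values."""

    def __init__(self):
        self._transforms = {}  # feature name -> function(raw_record) -> value
        self._values = {}      # entity id -> {feature name: value}

    def register(self, name, transform):
        """Register a transformation that turns a raw record into one feature."""
        self._transforms[name] = transform

    def ingest(self, entity_id, raw_record):
        """Apply every registered transformation and store the results."""
        row = self._values.setdefault(entity_id, {})
        for name, transform in self._transforms.items():
            row[name] = transform(raw_record)

    def get_features(self, entity_id, names):
        """Return the stored feature values in the requested order."""
        row = self._values[entity_id]
        return [row[name] for name in names]


store = InMemoryFeatureStore()
store.register("order_count", lambda record: len(record["orders"]))
store.register("total_spend", lambda record: sum(record["orders"]))

# Raw data goes in once; any model can later read the ready-made features.
store.ingest("customer_1", {"orders": [20.0, 35.0, 5.0]})
vector = store.get_features("customer_1", ["order_count", "total_spend"])
```

A real feature store would add persistence, versioning, and search on top of this, but the core contract is the same: define a feature once, then reuse it by name.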

There are two types of features, offline and online:

Offline Features: Many features are calculated offline as part of a batch job; a customer's average monthly spend is one example. They are mostly consumed by offline processes. Because these computations can take a long time, they are typically run on frameworks such as Spark, or simply as complex SQL queries against a set of databases, and the results are then used by a batch inference process.
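As a rough illustration of the average monthly spend example, the batch computation could look like the sketch below. It uses plain Python instead of Spark or SQL to stay self-contained. The function name, the transaction layout, and the definition used (total spend divided by the number of months in which the customer had at least one transaction) are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def average_monthly_spend(transactions):
    """Compute each customer's average spend per active month.

    transactions: iterable of (customer_id, date, amount) tuples.
    """
    # Step 1: total spend per customer per calendar month.
    monthly = defaultdict(float)
    for customer_id, day, amount in transactions:
        monthly[(customer_id, day.year, day.month)] += amount

    # Step 2: average those monthly totals per customer.
    totals = defaultdict(float)
    months = defaultdict(int)
    for (customer_id, _year, _month), total in monthly.items():
        totals[customer_id] += total
        months[customer_id] += 1
    return {c: totals[c] / months[c] for c in totals}


transactions = [
    ("alice", date(2021, 3, 5), 40.0),
    ("alice", date(2021, 3, 20), 60.0),
    ("alice", date(2021, 4, 2), 50.0),
    ("bob", date(2021, 4, 11), 30.0),
]
features = average_monthly_spend(transactions)
```

In a real batch job the same two-step aggregation would typically be a grouped query scheduled to run periodically, with the output written to the feature store's offline tables.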

Data preparation pipelines push data into the Feature Store tables and training data repositories.

Online Features: These features are a little more complicated because they must be calculated quickly and are frequently served within milliseconds. One example is calculating a z-score for real-time fraud detection. In this case, the pipeline computes the mean and standard deviation over a sliding window in real time. These calculations are much more demanding, requiring both fast computation and fast access to the data. The data can be kept in memory or in a very fast key-value database. The process itself can be carried out on various cloud services or on a platform such as the Iguazio Data Science Platform, which includes all of these components as part of its core offering.
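The sliding-window z-score can be sketched as follows. This is only an illustration under stated assumptions: the window is count-based (the last N values) where a production system might use a time-based one, the state lives in process memory, and the `SlidingZScore` class is a hypothetical name. Each new value is scored against the mean and standard deviation of the values already in the window.

```python
from collections import deque
from statistics import fmean, pstdev


class SlidingZScore:
    """Score each new value against the last `window_size` values."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest value automatically.
        self.window = deque(maxlen=window_size)

    def update(self, value):
        """Return the z-score of `value`, then add it to the window.

        Returns None while the window holds fewer than two values or
        when the window has zero variance (z-score undefined).
        """
        score = None
        if len(self.window) >= 2:
            std = pstdev(self.window)
            if std > 0:
                score = (value - fmean(self.window)) / std
        self.window.append(value)
        return score


scorer = SlidingZScore(window_size=4)
amounts = [10.0, 12.0, 10.0, 12.0, 50.0]
scores = [scorer.update(amount) for amount in amounts]
# The last amount lies far outside the recent window and gets a large z-score.
```

For fraud detection one such window would be kept per entity (for example per card or account), which is why the text above points to in-memory storage or a fast key-value database.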

Model training jobs use Feature Store and training data repository data sets to train models and then push them to the model repository.

Advantages of Feature Store:

  • Faster development
  • Smooth model deployment in production
  • Increased model accuracy
  • Better collaboration
  • Track lineage and address regulatory compliance
