Faster Reading with DuckDB and Arrow Flight on Hopsworks: Benchmark and Performance Evaluation of Offline Feature Stores

This is a Master's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Abstract: Over the last few years, machine learning has grown into a huge field, with “Big Tech” companies sharing their experiences building machine learning infrastructure. Feature stores, used as centralized data repositories for machine learning features, are seen as a central component of operational and scalable machine learning. With the growth in machine learning there is, naturally, a tremendous growth in the data used for training. Most of this data sits in Parquet files in cloud object stores or data lakes and is consumed either directly from files or in memory, where it is used for exploratory data analysis and small-batch training. A majority of the data science involved in machine learning is done in Python, but the surrounding infrastructure is not always directly compatible with Python. Query processing engines and feature stores often have their own Domain Specific Language or require data scientists to write SQL, leading to some level of ‘transpilation’ overhead across the system. This overhead can not only introduce errors but also add up to significant time and productivity costs down the line.

In this thesis, we conduct systems research on the performance of offline feature stores and identify ways to read data out of them quickly and efficiently. We evaluate the systems with benchmark tests that address common exploratory data analysis and training use cases. We find that in the Hopsworks feature store, using a state-of-the-art query processing engine that is storage-optimized, format-aware, and based on vectorized execution, together with the Arrow protocol from start to finish, yields significant improvements both in creating batch training data (feature value reads) and in creating Point-In-Time (PIT) Correct training data.

For batch training data created in memory, Hopsworks shows an average speedup of 27x over Databricks (5M and 10M scale factors), 18x over Vertex, and 8x over Sagemaker across all scale factors. For batch training data written as Parquet files, Hopsworks shows a speedup of 5x over Databricks (5M, 10M, and 20M scale factors), 13x over Vertex, and 6x over Sagemaker across all scale factors. For creating in-memory PIT-Correct training data, Hopsworks shows an average speedup of 8x over Databricks, 6x over Vertex, and 3x over Sagemaker across all scale factors. Similarly, for PIT-Correct training data created as files, Hopsworks shows an average speedup of 9x over Databricks, 8x over Vertex, and 6x over Sagemaker across all scale factors.

Through the analysis of these experimental results and the underlying infrastructure, we identify the reasons for this performance gap and examine the strengths and limitations of the design.
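To make the Arrow-from-start-to-finish idea concrete, the sketch below shows the general pattern rather than Hopsworks' actual client code: DuckDB executes a query over Parquet and materializes the result directly as an Arrow table (no pandas detour), and pyarrow's Flight client streams Arrow record batches over the network. The file name, Flight endpoint, and ticket payload are illustrative assumptions.

```python
# A minimal sketch of an Arrow-native read path, assuming duckdb and
# pyarrow are installed. 'features.parquet', the Flight endpoint, and
# the ticket contents are hypothetical, not Hopsworks' actual API.
import duckdb
import pyarrow.flight as flight

# 1) Query Parquet with DuckDB's vectorized, format-aware engine and
#    get the result back as a pyarrow.Table, staying in Arrow format.
con = duckdb.connect()
batch_table = con.execute(
    """
    SELECT customer_id, feature_a, feature_b
    FROM read_parquet('features.parquet')
    WHERE event_time <= TIMESTAMP '2024-01-01'
    """
).arrow()  # returns a pyarrow.Table

# 2) Fetch Arrow record batches over the network with Arrow Flight,
#    so data remains Arrow-encoded from server to client.
client = flight.connect("grpc://feature-store-host:5005")
reader = client.do_get(flight.Ticket(b"training_dataset_v1"))
flight_table = reader.read_all()  # assembles streamed batches into a Table

print(batch_table.num_rows, flight_table.num_rows)
```

Because both steps exchange Arrow data, no row-by-row serialization or intermediate format conversion is needed, which is the mechanism behind the speedups reported above.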
