Data Build Tool (DBT) Jobs in Hopsworks

Detta är en Master-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: Feature engineering at scale is always critical and challenging in the machine learning pipeline. Modern data warehouses enable data analysts to do feature engineering by transforming, validating and aggregating data in Structured Query Language (SQL). To help data analysts do this work, Data Build Tool (DBT), an open-source tool, was proposed to build and orchestrate SQL pipelines. Hopsworks, an open-source scalable feature store, would like to add support for DBT so that data scientists can do feature engineering in Python, Spark, Flink, and SQL in a single platform. This project aims to create a concept about how to build this support and then implement it. The project checks the feasibility of the solution using a sample DBT project. According to measurements, this working solution needs around 800 MB of space in the server and it takes more time than executing DBT commands locally. However, it persistently stores the results of each execution in HopsFS, which are available to users. By adding this novel support for SQL using DBT, Hopsworks might be one of the completest platforms for feature engineering so far.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)