Hudi on Hops : Incremental Processing and Fast Data Ingestion for Hops

Detta är en Master-uppsats från KTH/Skolan för elektroteknik och datavetenskap (EECS)

Sammanfattning: In the era of big data, data is flooding from numerous data sources and many companies have been utilizing different types of tools to load and process data from various sources in a data lake. The major challenges where different companies are facing these days are how to update data into an existing dataset without having to read the entire dataset and overwriting it to accommodate the changes which have a negative impact on the performance. Besides this, finding a way to capture and track changed data in a big data lake as the system gets complex with large amounts of data to maintain and query is another challenge. Web platforms such as Hopsworks are also facing these problems without having an efficient mechanism to modify an existing processed results and pull out only changed data which could be useful to meet the processing needs of an organization. The challenge of accommodating row level changes in an efficient and effective manner is solved by integrating Hudi with Hops. This takes advantage of Hudi’s upsert mechanism which uses Bloom indexing to significantly speed up the ability of looking up records across partitions. Hudi indexing maps a record key into the file id without scanning over every record in the dataset. In addition, each successful data ingestion is stored in Apache Hudi format stamped with commit timeline. This commit timeline is needed for the incremental processing mainly to pull updated rows since a specified instant of time and obtain change logs from a dataset. Hence, incremental pulls are realized through the monotonically increasing commit time line. Similarly, incremental updates are realized over a time column (key expression) that allows Hudi to update rows based on this time column. HoodieDeltaStreamer utility and DataSource API are used for the integration of Hudi with Hops and Feature store. As a result, this provided a fabulous way of ingesting and extracting row level updates where its performance can further be enhanced by the configurations of the shuffle parallelism and other spark parameter configurations since Hudi is a spark based library.

  HÄR KAN DU HÄMTA UPPSATSEN I FULLTEXT. (följ länken till nästa sida)