A Comparative Study on Efficiency and Scalability of Integer and String Datasets in cuDF and pandas

This is a Bachelor's thesis from KTH/School of Electrical Engineering and Computer Science (EECS)

Authors: Anton Schulz; Emil Sjölander; [2023]


Abstract: This thesis presents a comparative analysis of cuDF and pandas, two Python data processing libraries, focusing on performance, limitations, and scalability when handling integer and string datasets. The study aims to assess the efficiency and suitability of cuDF as a potential alternative to pandas in scenarios that require high-performance data processing. The comparison was carried out by generating string and integer datasets of different scales and running a test suite of basic operations available in both pandas and cuDF. The results showed that cuDF performed better for almost all operations on both integers and strings, and especially on strings. For some operations, cuDF appeared to become faster once the datasets reached a certain scale, but these operations were very quick in general. However, cuDF was found to have limitations with user-defined functions and could not handle abstract Python objects as pandas could. The study concluded that cuDF could offer a significant increase in performance if the user is handling a dataset that consists of basic data types and uses fairly basic user-defined functions.
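The kind of comparison described in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual test suite: the dataset generators, the four operations, the dataset sizes, and every function name below are assumptions chosen for illustration. The sketch uses wall-clock timing only and runs whichever of pandas and cuDF can be imported, skipping the rest.

```python
import importlib
import random
import string
import time


def make_int_data(n, seed=0):
    """Generate n pseudo-random integers (hypothetical dataset generator)."""
    rng = random.Random(seed)
    return [rng.randint(0, 1_000_000) for _ in range(n)]


def make_str_data(n, length=8, seed=0):
    """Generate n pseudo-random lowercase strings of a fixed length."""
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_lowercase, k=length)) for _ in range(n)]


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over several runs of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def load_backends():
    """Import whichever of pandas and cuDF is usable on this machine."""
    backends = {}
    for name in ("pandas", "cudf"):
        try:
            backends[name] = importlib.import_module(name)
        except Exception:
            # Library missing, or (for cuDF) no usable GPU: skip it.
            pass
    return backends


# Illustrative operations that exist in both libraries (an assumed selection).
OPERATIONS = {
    "int_sum": lambda df: df["ints"].sum(),
    "int_sort": lambda df: df.sort_values("ints"),
    "str_upper": lambda df: df["strs"].str.upper(),
    "str_len": lambda df: df["strs"].str.len(),
}


def run_benchmark(sizes=(1_000, 100_000)):
    """Time every operation on every available backend at every size."""
    results = []
    for lib_name, lib in load_backends().items():
        for n in sizes:
            df = lib.DataFrame({"ints": make_int_data(n), "strs": make_str_data(n)})
            for op_name, op in OPERATIONS.items():
                seconds = time_call(lambda: op(df))
                results.append((lib_name, n, op_name, seconds))
    return results


if __name__ == "__main__":
    for lib_name, n, op_name, seconds in run_benchmark():
        print(f"{lib_name:>6} n={n:<8} {op_name:<10} {seconds:.6f}s")
```

Taking the best of several repeats reduces the influence of one-off costs such as GPU initialization on the first call. A more careful study would also separate data transfer time from compute time.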
