Bridging Language & Data : Optimizing Text-to-SQL Generation in Large Language Models

This is a Master's thesis from Linköping University / Artificial Intelligence and Integrated Computer Systems

Abstract: This thesis explores text-to-SQL generation using Large Language Models within a financial context, aiming to assess the efficacy of current benchmarks and techniques. The central investigation revolves around the accuracy of the BIRD-Bench benchmark and the applicability of text-to-SQL models in real-world scenarios. The research explores why models are not showing significant performance improvements on this benchmark. The methodology involved a thorough manual analysis of inputs and outputs from two distinct text-to-SQL models, a baseline zero-shot model and the DIN-SQL model, within the financial domain of the BIRD-Bench benchmark. The purpose was to identify and understand the limitations inherent in the dataset and the models themselves. Findings revealed that the best performing model on the original data was the DIN-SQL model, achieving an accuracy of 40.57%, a result whose limited efficacy raises questions about practical applicability. Upon manual analysis, various types of noise were identified in the dataset, including string capitalization errors, faulty gold SQL queries, grammatical errors, mismatches between questions and the database schema, and language ambiguities. This led to the curation of two new datasets: one with cleaned questions and SQL queries, and another with only cleaned SQL queries, correcting a total of 52 of 106 data points. Subsequent runs of the models on these new datasets showed that data quality significantly impacts model performance. The completely cleaned dataset nearly eliminated the performance gap between the two models, with both models showing a 10%-17% increase in accuracy. Interestingly, on the dataset with only cleaned SQL queries, the performance of the models flipped, with the baseline zero-shot model now outperforming the DIN-SQL model. Further analysis of BIRD-Bench's development set across different domains indicated the presence of noise in other areas of the benchmark as well. This suggests that while BIRD-Bench closely resembles real-world scenarios, it falls short in offering a detailed understanding of model performance against different types of noise. The thesis introduces the concept of classifying noise in natural language questions, aiming to prevent noisy questions from entering text-to-SQL models and to annotate noise in existing datasets. Experiments using GPT-3.5 and GPT-4 on a manually annotated dataset demonstrated the viability of this approach, with classifiers achieving up to 0.81 recall and 80% accuracy. Additionally, the thesis explored the use of LLMs for automatically correcting faulty SQL queries, which showed a 100% success rate for specific query corrections, highlighting the potential of LLMs for improving dataset quality. The implications of these findings are substantial, emphasizing the need for noise-specific benchmarks and enhanced annotations in datasets like BIRD-Bench. This research underscores the importance of addressing specific noise challenges and developing more sophisticated text-to-SQL models. In conclusion, the thesis offers significant insights into the performance and limitations of text-to-SQL models, setting the stage for future research. This includes creating specialized datasets, enhancing annotations, focusing on identified noise types, developing user-input guardrails, and improving text-to-SQL models overall. Such advancements are expected to significantly improve the functionality and practical application of text-to-SQL technologies across various industries.
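
To give a concrete sense of the baseline compared against DIN-SQL, the following is a minimal sketch of a zero-shot text-to-SQL call using the OpenAI API. The prompt wording, schema formatting, and model name are illustrative assumptions, not the exact setup used in the thesis.

```python
# Minimal sketch of a zero-shot text-to-SQL baseline.
# Prompt wording and schema formatting are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def zero_shot_sql(question: str, schema_ddl: str, model: str = "gpt-3.5-turbo") -> str:
    """Translate a natural language question into a single SQL query."""
    prompt = (
        "Given the following SQLite database schema:\n"
        f"{schema_ddl}\n\n"
        f"Write one SQL query that answers the question:\n{question}\n"
        "Return only the SQL query."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content.strip()
```

In such a setup, the generated query would then be executed against the BIRD-Bench database and compared with the gold query's result to compute execution accuracy.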
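The proposed user-input guardrail could similarly be sketched as an LLM-based classifier that labels a question with a noise category before it reaches the text-to-SQL model. The label set and prompt below are assumptions based on the noise types described in the abstract, not the thesis's exact annotation scheme.

```python
# Illustrative sketch of an LLM-based noise classifier used as an input guardrail.
# The label set and prompt are assumptions based on the noise types described above.
from openai import OpenAI

client = OpenAI()

NOISE_LABELS = [
    "clean",
    "capitalization_error",
    "grammatical_error",
    "schema_mismatch",
    "ambiguous",
]


def classify_question_noise(question: str, model: str = "gpt-4") -> str:
    """Label a natural language question with one noise category."""
    prompt = (
        "Classify the following database question into exactly one of these categories: "
        + ", ".join(NOISE_LABELS)
        + f".\n\nQuestion: {question}\nAnswer with the category name only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```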
