Using Elasticsearch for full-text searches on unstructured data

This is a Bachelor's thesis from Uppsala University, Department of Information Technology

Author: Jenny Olsson; [2019]

Abstract: Performing effective searches on large amounts of data by scanning through all of it is not viable. A well-established solution to this problem is to generate an index from the data. This report compares libraries for building such an index, namely Elasticsearch, Solr, Sphinx and Xapian, and describes a prototype that enables full-text searches on an existing database. The database in question consists of audit logs generated by software for online management of financial trade. The author implemented the prototype using the open-source search engine Elasticsearch. Besides performing searches in a reasonable time, the implementation allows documents in the index to be fully removed without causing notable disturbance to the overall structure. The author defined a pattern analyzer for Elasticsearch to support the Swedish alphabet and accented letters. Because the audit log database can contain personal information, the General Data Protection Regulation (GDPR), an EU regulation on personal data, was considered during the project. The implementation described in this report is meant to serve as a starting point for making the finding and retrieval of personal information run more smoothly. The author also made sure that deletions can be made final, in order to comply with the GDPR. The implementation was tested on a 708-megabyte database of unstructured data. Searching the generated index for two terms, a first name and a last name, gave an average return time of 11.5 ms for exact matches and 59.3 ms when a small degree of misspelling was allowed. The measurements suggest that a solution using Elasticsearch is suitable for the presented problem.
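
As an illustration of the pattern analyzer and the two search modes mentioned in the abstract, the following is a minimal sketch and not the thesis's actual configuration. The field name `message`, the analyzer name `swedish_pattern`, the split pattern and the fuzziness value are all assumptions; the abstract gives none of them. The functions only build the JSON bodies that would be sent to Elasticsearch, and `tokenize` approximates the analyzer locally with Python's Unicode-aware regular expressions.

```python
import re

# Hypothetical names: the abstract does not state the field or analyzer name.
FIELD = "message"
ANALYZER = "swedish_pattern"

# Assumed Java regex for the analyzer: split on runs of characters that are
# neither Unicode letters nor decimal digits, so å, ä, ö and accented
# letters stay inside tokens.
SPLIT_PATTERN = r"[^\p{L}\p{Nd}]+"


def index_settings():
    """Body for index creation using Elasticsearch's built-in 'pattern' analyzer type."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    ANALYZER: {
                        "type": "pattern",
                        "pattern": SPLIT_PATTERN,
                        "lowercase": True,
                    }
                }
            }
        },
        "mappings": {
            "properties": {FIELD: {"type": "text", "analyzer": ANALYZER}}
        },
    }


def name_query(first, last, fuzzy=False):
    """Match query requiring both name terms; optionally tolerate one edit per term."""
    match = {"query": f"{first} {last}", "operator": "and"}
    if fuzzy:
        match["fuzziness"] = 1  # assumed value for "a small degree of misspelling"
    return {"query": {"match": {FIELD: match}}}


def tokenize(text):
    """Local approximation of the analyzer: split on non-word characters, lowercase."""
    return [t.lower() for t in re.split(r"[\W_]+", text) if t]
```

For example, `tokenize("Åsa Lindström, id=42")` keeps the Swedish letters inside the tokens, and `name_query("Åsa", "Lindström", fuzzy=True)` builds the misspelling-tolerant variant of the two-term search.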
