Handling Big Data using a Distributed Search Engine: Preparing Log Data for On-Demand Analysis

This is a Master's thesis from KTH/School of Information and Communication Technology (ICT)

Abstract: Big data refers to datasets that are very large and computationally complex. As the volume of data increases, the time required for even a trivial processing task can become a challenge. Companies collect data at a fast rate, but knowing what to do with the data can be hard. A search engine is a system that indexes data, making it efficiently queryable by users. When a bug occurs in a computer system, log data is consulted in order to understand why, but processing big log data can take a long time. The purpose of this thesis is to investigate, compare and implement a distributed search engine that can prepare log data for analysis, making it easier for a developer to investigate bugs. There are three popular search engines: Apache Lucene, Elasticsearch and Apache Solr. Elasticsearch and Apache Solr are built as distributed systems, making them capable of handling big data. Requirements were established through interviews. A total of 40 GB of log data was provided to be indexed in the selected search engine. The log data was generated in a proprietary binary format and had to be decoded before indexing. The distributed search engines were evaluated on four criteria: distributed architecture, text analysis, indexing and querying. Elasticsearch was selected for implementation. A cluster was set up on Amazon Web Services, and tests were executed to determine how different configurations performed. Indexing software was written to transfer data to the cluster. Results were verified through a case study with participants from the stakeholder.
