Evaluating clustering techniques in financial time series

This is a thesis for a professional degree at the advanced level from Uppsala universitet/Avdelningen för systemteknik

Abstract: This degree project investigates evaluation strategies for clustering methods used to cluster multivariate financial time series. Clustering is a data mining technique that partitions a data set so that data points are similar to the other points in their own cluster and dissimilar to the points in other clusters. Clustering the time series of mutual fund returns can help individuals select funds that match their current goals and portfolio. It can also identify outliers. These outliers could be mutual funds that the fund manager has not classified accurately, or they could point to potentially fraudulent practices.

To determine which clustering method is most appropriate for a given data set, it is important to be able to evaluate different techniques. Robust evaluation methods can also assist in choosing parameters that give the best performance. The evaluation techniques investigated are conventional internal validation measures, stability measures, visualization methods, and evaluation using domain knowledge about the data. The conventional internal validation measures and the stability measures were used to perform model selection and find viable candidate clustering methods. These candidates were then evaluated using visualization techniques and a qualitative analysis of the results.

The conventional internal validation measures tested may not be appropriate for model selection among the clustering methods, distance metrics, and data sets tested. Their results often contradicted one another or suggested trivial clustering solutions, where the number of clusters is either 1 or equal to the number of data points in the data set. Similarly, a stability validation metric called the stability index typically favored clustering results containing as few clusters as possible. The only model selection method that consistently suggested clustering algorithms producing nontrivial solutions was the CLOSE score. The CLOSE score was developed specifically to evaluate clusters of time series by taking both stability over time and the quality of the clusters into account.

We use cluster visualizations to show the clusters. Scatter plots were produced by applying two dimension reduction methods to the data: Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Additionally, we use cluster evolution plots to display how the clusters evolve as different parts of the time series are used to perform the clustering, thus emphasizing the temporal aspect of time series clustering.

Finally, the results indicate that a manual qualitative analysis of the clustering results is necessary to fine-tune the candidate clustering methods. Performing this analysis highlights flaws in the other validation methods and allows the user to select the best method from a few candidates based on the use case and the reason for performing the clustering.
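
To make concrete what a conventional internal validation measure computes, the following is a minimal sketch of the silhouette coefficient, one widely used measure of this kind. The abstract does not state which measures the thesis tested, so the choice of the silhouette coefficient, the Euclidean distance, the function name `silhouette_score`, and the toy return figures are all assumptions made for illustration only.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient of a labelling, using Euclidean distance.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        # The measure is undefined when everything sits in one cluster.
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Made-up three-period return vectors for four funds (assumed data).
returns = [
    (0.010, 0.012, 0.009),    # fund A
    (0.011, 0.013, 0.008),    # fund B
    (-0.020, 0.030, -0.025),  # fund C
    (-0.022, 0.028, -0.024),  # fund D
]

good = silhouette_score(returns, [0, 0, 1, 1])        # A,B together; C,D together
mixed = silhouette_score(returns, [0, 1, 0, 1])       # dissimilar funds paired
singletons = silhouette_score(returns, [0, 1, 2, 3])  # one cluster per fund

print(good, mixed, singletons)
```

In this sketch, the labelling that groups similar funds scores close to 1, the labelling that pairs dissimilar funds scores below 0, and the one-cluster-per-fund labelling scores exactly 0 by convention. A single cluster cannot be scored at all, which is one reason such measures need care when the candidate solutions include trivial ones.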
