LSTM Feature Engineering Through Time Series Similarity Embedding

This is a Master's thesis from Linköpings universitet / Institutionen för datavetenskap (Linköping University, Department of Computer Science)

Abstract: Time series prediction has many applications. In cases with simultaneous series (such as weather measurements from multiple stations, or multiple stocks on the stock market), it is not unlikely that series from different measurement origins behave similarly, or respond to the same contextual signals. Training input to a prediction model could be constructed from all simultaneous measurements to try to capture the relations between the measurement origins. A generalized approach is instead to train a prediction model on samples from any individual measurement origin. The total amount of data is the same in both cases, but the first case uses fewer samples of a larger width, while the second uses a higher number of smaller samples. The first, high-width option risks over-fitting as a result of having fewer training samples per input variable. The second, general option has no way to learn relations between the measurement origins. Amending the general model with contextual information would allow keeping a high samples-per-variable ratio without losing the ability to take the origin of the measurements into account. This thesis presents a vector embedding method for measurement origins in an environment with a shared response to contextual signals. The embeddings are based on multi-variate time series from the origins. The embedding method is inspired by co-occurrence matrices commonly used in Natural Language Processing. The similarity measures used between the series are Dynamic Time Warping (DTW), Step-wise Euclidean Distance, and Pearson Correlation. The dimensionality of the resulting embeddings is reduced by Principal Component Analysis (PCA) to increase information density and effectively preserve variance in the similarity space. The created embedding system allows contextualization of samples, akin to the human intuition that comes from knowing where measurements were taken, such as what sort of company a stock ticker represents, or what environment a weather station is located in. In the embedded space, embeddings of series from fundamentally similar measurement origins are closely located, so that information about the behavior of one can be generalized to its neighbors. The resulting embeddings from this work agree well with existing clustering methods on a weather dataset, partially on a financial dataset, and provide a performance improvement for an LSTM network acting on the financial dataset. The similarity embeddings also outperform an embedding layer trained together with the LSTM.
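To make the described pipeline concrete, the sketch below shows one plausible reading of the embedding step: a pairwise-similarity matrix between measurement origins (here using only Pearson correlation, one of the three measures named above) is reduced with PCA so that each origin gets a dense embedding vector. This is a minimal illustration under assumptions, not the thesis implementation; the function name similarity_embeddings and the univariate toy data are invented for the example, and the thesis additionally uses DTW, step-wise Euclidean distance, and multi-variate series.

```python
import numpy as np
from sklearn.decomposition import PCA


def similarity_embeddings(series, n_components=4):
    """Sketch of origin embeddings from a pairwise-similarity matrix.

    series: array of shape (n_origins, n_timesteps), one series per
    measurement origin. Pearson correlation stands in for the full set
    of similarity measures used in the thesis.
    """
    # Pairwise Pearson correlation plays the role of the
    # co-occurrence-style similarity matrix between origins.
    sim = np.corrcoef(series)                 # (n_origins, n_origins)

    # PCA over the similarity rows compresses them into dense
    # embeddings while preserving variance in the similarity space.
    pca = PCA(n_components=n_components)
    return pca.fit_transform(sim)             # (n_origins, n_components)


# Toy usage: 10 origins, 500 time steps of synthetic random-walk data.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 500)).cumsum(axis=1)
emb = similarity_embeddings(X)
print(emb.shape)  # (10, 4)
```

Embeddings produced this way could then be concatenated onto each origin's input window as contextual features for the LSTM, which is the role the abstract describes for them.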
