Distillation or loss of information? : The effects of distillation on model redundancy

This is a Master's thesis from Uppsala universitet/Institutionen för lingvistik och filologi

Abstract:     The necessity of billions of parameters in large language models has lately been questioned, as there are still unanswered questions about how information is captured in these networks. It could be argued that, without this knowledge, there is a tendency to overparameterize the models. The investigation of model redundancy, and of methods that minimize it, is therefore important to both academic and commercial entities. The two main goals of this project were, firstly, to discover whether one such method, namely distillation, reduces the redundancy of language models without losing linguistic capabilities and, secondly, to determine whether the model architecture or multilingualism has a bigger effect on this reduction. To do so, ten models, comprising monolingual and multilingual models and their distilled counterparts, were evaluated at the layer and neuron level. At the layer level, we evaluated the layer correlation of all models by visualising heatmaps and calculating the average per-layer similarity. To establish neuron-level redundancy, a classifier probe was applied to the model neurons, both on the whole model and on a reduced set obtained with a clustering algorithm, and its performance was assessed on two tasks, Part-of-Speech (POS) and Dependency (DEP) tagging. To determine the effects of distillation on the multilingualism of the models, we investigated cross-lingual transfer for the same tasks and compared the results of the classifier applied to the multilingual models and one distilled variant across ten languages, nine Indo-European and one non-Indo-European. The results show that distillation reduces the number of redundant neurons at the cost of losing some linguistic knowledge. In addition, the redundancy in the distilled models is mainly attributed to the architecture on which they are based, with multilingualism having only a mild impact. Finally, the cross-lingual transfer experiments showed that after distillation the models lose the ability to capture some languages more than others. Overall, the outcome of the project suggests that distillation could be applied to reduce the size of billion-parameter models and is a promising method for reducing the redundancy of current language models.
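
To make the layer-level analysis concrete, the sketch below computes a layer-by-layer similarity heatmap and the average per-layer similarity for a single model. It is a minimal illustration only: the use of cosine similarity over mean-pooled sentence representations, the example sentences, and the model name `distilbert-base-uncased` are assumptions of this sketch, not necessarily the exact setup used in the thesis.

```python
# Minimal sketch: layer-correlation heatmap and average per-layer similarity.
import numpy as np
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

def layer_representations(model_name, sentences):
    """Mean-pooled sentence representations from every layer:
    returns an array of shape (num_layers + 1, num_sentences, hidden_size)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden_states = model(**batch).hidden_states     # tuple of per-layer tensors
        mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding when pooling
        pooled = [(h * mask).sum(dim=1) / mask.sum(dim=1) for h in hidden_states]
    return np.stack([p.numpy() for p in pooled])

def layer_similarity(reps):
    """Pairwise cosine similarity between layers, averaged over sentences."""
    unit = reps / np.linalg.norm(reps, axis=-1, keepdims=True)
    return np.einsum("isd,jsd->ijs", unit, unit).mean(axis=-1)

sentences = ["The cat sat on the mat.", "Distillation compresses large models."]
sim = layer_similarity(layer_representations("distilbert-base-uncased", sentences))
print("average per-layer similarity:", sim.mean(axis=1).round(3))

plt.imshow(sim, cmap="viridis")                          # layer-correlation heatmap
plt.colorbar(label="cosine similarity")
plt.xlabel("layer"); plt.ylabel("layer")
plt.show()
```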
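
Similarly, the neuron-level probing can be illustrated with a linear classifier probe trained on per-token neuron activations, once on all neurons and once on a reduced set obtained by clustering correlated neurons and keeping one representative per cluster. The logistic-regression probe, the agglomerative clustering with a correlation-based distance, and the toy data standing in for real activations and POS tags are illustrative assumptions; the abstract does not specify these implementation details.

```python
# Minimal sketch: probe accuracy on all neurons vs. a clustering-reduced neuron set.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def reduce_neurons(activations, n_clusters):
    """Cluster neurons by the correlation of their activations across tokens
    and keep one representative neuron per cluster."""
    corr = np.corrcoef(activations.T)                    # (neurons, neurons)
    distance = 1.0 - np.abs(corr)                        # highly correlated -> close
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    ).fit_predict(distance)                              # metric="precomputed" needs sklearn >= 1.2
    return np.array([np.where(labels == c)[0][0] for c in range(n_clusters)])

def probe_accuracy(activations, tags):
    """Train a linear probe on neuron activations and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, tags, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# Toy data standing in for real per-token activations and POS tags.
rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 768))                       # 500 tokens, 768 neurons
tags = rng.integers(0, 17, size=500)                     # 17 universal POS tags
full = probe_accuracy(acts, tags)
kept = reduce_neurons(acts, n_clusters=100)
reduced = probe_accuracy(acts[:, kept], tags)
print(f"full probe: {full:.2f}, reduced ({len(kept)} neurons): {reduced:.2f}")
```

Comparing the two accuracies indicates how much of the probe's performance survives when redundant neurons are removed, which is the kind of comparison the redundancy analysis relies on.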