2019:Research/Crosslingual Embedding via Generalized Eigenvalue Decomposition
This is an Accepted submission for the Research space at Wikimania 2019.
Abstract
Unsupervised pretraining of word embeddings has been a catalyst for recent advances in NLP. Simple matrix-factorization models such as word2vec or GloVe have been particularly widely adopted, given their relative computational efficiency compared to more complex earlier models based on neural networks. Recently, however, more complex models have again gained popularity. Their massive cost of training and inference has led to a quest for models that can be parallelized, shifting the focus from RNNs toward CNNs and eventually to attention-based architectures, in particular those based on Transformers. While these models have boosted performance on almost all NLP tasks, the improvements rely on powerful hardware that not everyone can afford, and inference is slowed down by complex computations and large models with up to billions of parameters [19], all of which hamper the new models' widespread adoption.
Another major recent trend in NLP has been toward crosslinguality: embedding texts in a language-invariant space such that similar texts have close-by vectors both within and across languages, with applications ranging from machine translation to crosslingual transfer learning to language-independent retrieval. This direction, too, has been driven by the need to learn from unlabeled data, epitomized by breakthroughs such as Facebook's MUSE embeddings, where adversarial methods are employed to leverage the structure of the monolingual spaces and gradually align them without any supervised signal. In recent work, we proposed an alternative method for obtaining crosslingual embeddings, called Cr5 (Crosslingual reduced-rank ridge regression). Cr5 is trained end-to-end at the document level, rather than the word level, and achieves large improvements on crosslingual document retrieval over the previous state of the art. A major focus in this line of work is the cost of supervision and, consequently, the amount of available training data. For training, our approach leverages the signal from Wikipedia as a multilingual corpus in which the same concept is covered in multiple languages, but not via exact translation.
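For concreteness, here is a schematic objective in the spirit of Cr5's reduced-rank ridge regression; the notation is illustrative rather than the paper's exact formulation. Let X stack the bag-of-words matrices of the documents in all languages, and let Y indicate which Wikipedia concept each document covers:

\[
\min_{W \,:\, \operatorname{rank}(W) \le r} \; \lVert X W - Y \rVert_F^2 \;+\; \lambda \lVert W \rVert_F^2
\]

The rank-r constraint forces all languages through a shared r-dimensional bottleneck, which serves as the crosslingual embedding space, and the whole problem can be attacked with standard matrix decompositions rather than gradient descent.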
Cr5 takes a radically different approach from competing methods, framing the problem purely in terms of linear algebra, which enables the use of highly efficient numerical routines and scales to massive problem sizes. Our work on Cr5 gave us important insights into how to generalize the method, giving rise to Cr6, a novel formulation of multitask crosslingual text embedding via the generalized eigenvalue decomposition, with an efficient and scalable implementation using highly optimized, yet affordable, linear algebra software and hardware. The multitask learning setup is of particular importance, as it enables information flow between the document-, sentence-, and word-level tasks while at the same time providing different linear maps, each optimal for its own level of granularity. Finally, the method should generalize to a broad range of embedding problems; the output of the project will therefore be a package with an easy-to-use interface, benefiting the community at large.
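To give a flavor of the computational core, the snippet below is a minimal sketch of solving a generalized eigenvalue problem A v = λ B v with off-the-shelf routines, the basic operation the Cr6 formulation reduces to. The matrices A and B here are random stand-ins (Cr6's actual matrices are built from the training corpus), and all names and sizes are purely illustrative.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative sizes; real problems would be far larger and sparser.
d, k = 100, 10
rng = np.random.default_rng(0)

# Stand-in matrices: A plays the role of the symmetric "signal" term,
# B the role of a symmetric positive-definite regularizer. Both are
# placeholders, not Cr6's actual matrices.
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
N = rng.standard_normal((d, d))
B = N @ N.T + d * np.eye(d)  # positive definite by construction

# Solve the generalized eigenvalue problem A v = lambda B v,
# keeping only the k eigenpairs with the largest eigenvalues.
eigvals, eigvecs = eigh(A, B, subset_by_index=[d - k, d - 1])
projection = eigvecs[:, ::-1]  # columns sorted by decreasing eigenvalue

print(eigvals[::-1])  # the k largest generalized eigenvalues
```

For the large, sparse matrices arising from real corpora, an iterative solver such as scipy.sparse.linalg.eigsh (which handles generalized problems via its M argument) would replace the dense routine.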
Authors
Martin Josifoski (École Polytechnique Fédérale de Lausanne), Bob West (École Polytechnique Fédérale de Lausanne)
Session type
22-min presentation.
Participants [subscribe here!]
# ...