Improving Word Embeddings Using Kernel PCA
Vishwani Gupta,Sven Giesselbach,Stefan Rüping,Christian Bauckhage +3 more
- 01 Aug 2019
- pp 200-208
TL;DR: This paper takes word embeddings generated by both word2vec and fastText models and enriches them with morphological information derived from kernel principal component analysis (KPCA) of word similarity matrices, in order to reduce training time and improve performance.
Abstract: Word-based embedding approaches such as Word2Vec capture the meaning of words and the relations between them particularly well when trained on large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed, but the joint task of learning meaningful morphological, syntactic, and semantic representations still requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information, in order to reduce training time and enhance their performance. We use word embeddings generated using both word2vec and fastText models and enrich them with morphological information of words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding the network morphological similarities and letting it learn semantic and syntactic similarities. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it is capable of generating high-quality representations of infrequent words, such as those found in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN model is initialized with our embeddings, and we find promising results.
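The core idea of deriving morphological vectors from KPCA of a word similarity matrix can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the similarity function, so the sketch assumes a Jaccard similarity over character trigrams, and the helper names (`char_ngrams`, `ngram_similarity`, `kpca_embeddings`) are hypothetical. Words are assumed to be non-empty strings.

```python
import numpy as np


def char_ngrams(word, n=3):
    """Set of character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams (an assumed stand-in for the
    paper's word similarity function)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def kpca_embeddings(words, dim=2, n=3):
    """Kernel PCA on a precomputed word similarity matrix.

    Returns an array of shape (len(words), dim) whose rows are the
    projections of the words onto the top `dim` kernel principal components.
    """
    m = len(words)
    # Similarity (kernel) matrix between all pairs of words.
    K = np.array([[ngram_similarity(a, b, n) for b in words] for a in words])
    # Center the kernel matrix in feature space.
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Symmetric eigendecomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:dim]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale eigenvectors by sqrt(eigenvalue); clip tiny negative values
    # that can arise from floating-point error.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))


words = ["play", "playing", "played", "walk", "walking"]
morph_vectors = kpca_embeddings(words, dim=2)
print(morph_vectors.shape)
```

In a warm-start setting of the kind the abstract describes, rows such as those in `morph_vectors` would serve as the initial values of the embedding matrix before skip-gram or fastText training begins; how exactly the paper wires this into training is not stated on this page.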
Citations
•Posted Content
WOVe: Incorporating Word Order in GloVe Word Embeddings.
TL;DR: The authors proposed Word Order Vector (WOVe) to incorporate word order in GloVe word embeddings, which achieved an average 36.34% improvement in accuracy on the word analogy task.
22
Measuring the innovation of method knowledge elements in scientific literature
TL;DR: The proposed approach can measure the innovation of MKEs in scientific literature effectively and is useful for both reviewers and funding agencies to assess the quality of academic papers.
10
Evaluating Word Embeddings on Low-Resource Languages
Nathan Stringham,Mike Izbicki +1 more
- 01 Nov 2020
TL;DR: This paper argues that the analogy task is unsuitable for low-resource languages for two reasons: it requires that word embeddings be trained on large amounts of text, and analogies may not be well-defined in some low-resource settings. It introduces the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting.
Detecting interdisciplinary semantic drift for knowledge organization based on normal cloud model
TL;DR: In this article, a framework for interdisciplinary semantic drift detection based on the normal cloud model (NCM) is proposed to reduce the conceptual ambiguity in interdisciplinary knowledge organization systems (KOSs) and enhance interdisciplinary KOS management.
5
References
•Proceedings Article
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov,Kai Chen,Greg S. Corrado,Jeffrey Dean +3 more
- 16 Jan 2013
TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
27.5K
•Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov,Ilya Sutskever,Kai Chen,Greg S. Corrado,Jeffrey Dean +4 more
- 05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
•Posted Content
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships and improve both the quality of the vectors and the training speed.
Convolutional Neural Networks for Sentence Classification
Yoon Kim
- 25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification; a simple modification to the architecture is also proposed to allow the use of both task-specific and static vectors.
Enriching Word Vectors with Subword Information
TL;DR: This paper proposes a new approach based on the skip-gram model in which each word is represented as a bag of character n-grams, with the word vector being the sum of these n-gram representations. This allows models to be trained quickly on large corpora and word representations to be computed for words that did not appear in the training data.