Improving Word Embeddings Using Kernel PCA

doi:10.18653/V1/W19-4323

Open Access•Proceedings Article•10.18653/V1/W19-4323

Vishwani Gupta, +3 more

- 01 Aug 2019

- pp 200-208

11

TL;DR: This paper takes word embeddings generated by both word2vec and fastText models and enriches them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices, in order to reduce training time and enhance performance.

Abstract: Word-based embedding approaches such as Word2Vec capture the meaning of words and the relations between them particularly well when trained on large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed, but the joint task of learning meaningful morphological, syntactic, and semantic representations still requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information, in order to reduce training time and enhance their performance. We use word embeddings generated using both word2vec and fastText models and enrich them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding the network morphological similarities and letting it learn semantic and syntactic similarities. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it is capable of generating high-quality representations of infrequent words, such as those found in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN model is initialized with our embeddings and find promising results.
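
The KPCA step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the character-trigram Jaccard similarity, the function names (`char_ngrams`, `similarity_matrix`, `kernel_pca`), the toy vocabulary, and the output dimension are all assumptions chosen for illustration. The sketch builds a word similarity matrix from surface forms, centers it, and keeps the leading eigen-directions as morphology-based vectors that could, in principle, be used to initialize an embedding model.

```python
import numpy as np


def char_ngrams(word, n=3):
    """Set of character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity_matrix(words, n=3):
    """Jaccard similarity of character n-gram sets (an assumed similarity)."""
    grams = [char_ngrams(w, n) for w in words]
    size = len(words)
    K = np.eye(size)
    for i in range(size):
        for j in range(i + 1, size):
            union = len(grams[i] | grams[j])
            sim = len(grams[i] & grams[j]) / union if union else 0.0
            K[i, j] = K[j, i] = sim
    return K


def kernel_pca(K, dim):
    """Project the training points onto the top `dim` kernel principal components."""
    size = K.shape[0]
    H = np.eye(size) - np.ones((size, size)) / size  # centering matrix
    Kc = H @ K @ H                                   # kernel centered in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]          # indices of the largest ones
    vals = np.clip(eigvals[order], 0.0, None)        # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(vals)


# Hypothetical toy vocabulary; a real vocabulary would be far larger.
words = ["walk", "walked", "walking", "talk", "talked", "sing"]
K = similarity_matrix(words)
init_vectors = kernel_pca(K, dim=3)  # one 3-dimensional vector per word
```

Each row of `init_vectors` is a candidate starting vector for the corresponding word. Words that share many character trigrams, such as "walk" and "walked", have a high similarity entry in `K`, whereas "walk" and "sing" share none.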

Citations

•Posted Content

WOVe: Incorporating Word Order in GloVe Word Embeddings.

Mohammed Ibrahim, +3 more

- 18 May 2021

- arXiv: Computation and Language

TL;DR: The authors proposed Word Order Vector (WOVe) to incorporate word order in GloVe word embeddings, which achieved an average 36.34% improvement in accuracy on the word analogy task.

22

•Journal Article•10.1007/s11192-022-04350-5

Measuring the innovation of method knowledge elements in scientific literature

Zhong-Yi Wang, +4 more

- 25 Mar 2022

- Scientometrics

TL;DR: The proposed approach can measure the innovation of MKEs in scientific literature effectively and is useful for both reviewers and funding agencies to assess the quality of academic papers.

10

•Proceedings Article•10.18653/V1/2020.EVAL4NLP-1.17

Evaluating Word Embeddings on Low-Resource Languages

Nathan Stringham, +1 more

- 01 Nov 2020

TL;DR: This paper argues that the analogy task is unsuitable for low-resource languages for two reasons: it requires that word embeddings be trained on large amounts of text, and analogies may not be well-defined in some low-resource settings. It introduces the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting.

7

•Journal Article•10.1016/j.jksuci.2023.101569

Detecting interdisciplinary semantic drift for knowledge organization based on normal cloud model

Zhong-Yi Wang, +4 more

- 01 Apr 2023

- Journal of King Saud University - Comput...

TL;DR: In this article, a framework for interdisciplinary semantic drift detection based on the normal cloud model (NCM) is proposed to reduce the conceptual ambiguity in interdisciplinary knowledge organization systems (KOSs) and enhance interdisciplinary KOS management.

5

References

•Proceedings Article

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, +3 more

- 16 Jan 2013

TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

27.5K

•Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

- 05 Dec 2013

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes negative sampling, a simple alternative to the hierarchical softmax.

24.1K

•Posted Content

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

- 16 Oct 2013

- arXiv: Computation and Language

TL;DR: In this paper, the Skip-gram model is used to learn high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships, and several extensions are presented that improve both the quality of the vectors and the training speed.

22.9K

•Proceedings Article•10.3115/V1/D14-1181

Convolutional Neural Networks for Sentence Classification

Yoon Kim

- 25 Aug 2014

TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification; a simple modification to the architecture is also proposed to allow for the use of both task-specific and static vectors.

16.1K

•Journal Article•10.1162/TACL_A_00051

Enriching Word Vectors with Subword Information

Piotr Bojanowski, +3 more

- 12 Jun 2017

- Transactions of the Association for Comp...

TL;DR: This paper proposes a new approach based on the skip-gram model in which each word is represented as a bag of character n-grams and a word's vector is the sum of these n-gram representations. The method trains quickly on large corpora and can compute representations for words that did not appear in the training data.

10.3K

Related Papers (5)

Semantic Information Extraction for Improved Word Embeddings

Co-learning of Word Representations and Morpheme Representations

Rehabilitation of Count-based Models for Word Vector Representations

Learning class-specific word embeddings

Learning context-sensitive word embeddings with neural tensor skip-gram model