TL;DR: This paper follows the framework of Skip-gram and presents three sememe-encoded models to learn representations of sememes, senses and words, where the attention scheme is applied to detect word senses in various contexts.
Abstract: Sememes are minimum semantic units of word meanings, and the meaning of each word sense is typically composed by several sememes. Since sememes are not explicit for each word, people manually annotate word sememes and form linguistic common-sense knowledge bases. In this paper, we present that, word sememe information can improve word representation learning (WRL), which maps words into a low-dimensional semantic space and serves as a fundamental step for many NLP tasks. The key idea is to utilize word sememes to capture exact meanings of a word within specific contexts accurately. More specifically, we follow the framework of Skip-gram and present three sememe-encoded models to learn representations of sememes, senses and words, where we apply the attention scheme to detect word senses in various contexts. We conduct experiments on two tasks including word similarity and word analogy, and our models significantly outperform baselines. The results indicate that WRL can benefit from sememes via the attention scheme, and also confirm our models being capable of correctly modeling sememe information.
TL;DR: mppSMT as mentioned in this paper is a divide-and-conquer technique for phrase-based SMT to model and translate source code with well-formed structures, but it does not support rule-based methods.
Abstract: Prior research shows that directly applying phrase-based SMT on lexical tokens to migrate Java to C# produces much semantically incorrect code. A key limitation is the use of sequences in phrase-based SMT to model and translate source code with well-formed structures. We propose mppSMT, a divide-and-conquer technique to address that with novel training and migration algorithms using phrase-based SMT in three phases. First, mppSMT treats a program as a sequence of syntactic units and maps/translates such sequences in two languages to one another. Second, with a syntax-directed fashion, it deals with the tokens within syntactic units by encoding them with semantic symbols to represent their data and token types. This encoding via semantic symbols helps better migration of API usages. Third, the lexical tokens corresponding to each sememe are mapped or migrated. The resulting sequences of tokens are merged together to form the final migrated code. Such divide-and-conquer and syntax-direction strategies enable phrase-based SMT to adapt well to syntactical structures in source code, thus, improving migration accuracy. Our empirical evaluation on several real-world systems shows that 84.8 -- 97.9% and 70 -- 83% of the migrated methods are syntactically and semantically correct, respectively. 26.3 -- 51.2% of total migrated methods are exactly matched to the human-written C# code in the oracle. Compared to Java2CSharp, a rule-based migration tool, it achieves higher semantic accuracy from 6.6 -- 57.7% relatively. Importantly, it does not require manual labeling for training data or manual definition of rules.
TL;DR: This paper for the first time explores to automatically predict lexical sememes based on semantic meanings of words encoded by word embeddings, and applies matrix factorization to learn semantic relations between sememe and words.
Abstract: Sememes are defined as the minimum semantic units of human languages. People have manually annotated lexical sememes for words and form linguistic knowledge bases. However, manual construction is time-consuming and labor-intensive, with significant annotation inconsistency and noise. In this paper, we for the first time explore to automatically predict lexical sememes based on semantic meanings of words encoded by word embeddings. Moreover, we apply matrix factorization to learn semantic relations between sememes and words. In experiments, we take a real-world sememe knowledge base HowNet for training and evaluation, and the results reveal the effectiveness of our method for lexical sememe prediction. Our method will be of great use for annotation verification of existing noisy sememe knowledge bases and annotation suggestion of new words and phrases.
TL;DR: This work considers the LIWC lexicon as a hierarchical classification problem and utilizes the Sequence-to-Sequence model to classify words in the lexicon, and uses the sememe information with the attention mechanism to capture the exact meanings of a word, so that it can expand a more precise and comprehensive lexicon.
Abstract: Linguistic Inquiry and Word Count (LIWC) is a word counting software tool which has been used for quantitative text analysis in many fields. Due to its success and popularity, the core lexicon has been translated into Chinese and many other languages. However, the lexicon only contains several thousand of words, which is deficient compared with the number of common words in Chinese. Current approaches often require manually expanding the lexicon, but it often takes too much time and requires linguistic experts to extend the lexicon. To address this issue, we propose to expand the LIWC lexicon automatically. Specifically, we consider it as a hierarchical classification problem and utilize the Sequence-to-Sequence model to classify words in the lexicon. Moreover, we use the sememe information with the attention mechanism to capture the exact meanings of a word, so that we can expand a more precise and comprehensive lexicon. The experimental results show that our model has a better understanding of word meanings with the help of sememes and achieves significant and consistent improvements compared with the state-of-the-art methods. The source code of this paper can be obtained from https://github.com/thunlp/Auto_CLIWC.
TL;DR: This article proposed a novel framework to take advantage of both internal character information and external context information of words to solve the problem of low-frequency and out-of-vocabulary words.
Abstract: Sememes are minimum semantic units of concepts in human languages, such that each word sense is composed of one or multiple sememes. Words are usually manually annotated with their sememes by linguists, and form linguistic common-sense knowledge bases widely used in various NLP tasks. Recently, the lexical sememe prediction task has been introduced. It consists of automatically recommending sememes for words, which is expected to improve annotation efficiency and consistency. However, existing methods of lexical sememe prediction typically rely on the external context of words to represent the meaning, which usually fails to deal with low-frequency and out-of-vocabulary words. To address this issue for Chinese, we propose a novel framework to take advantage of both internal character information and external context information of words. We experiment on HowNet, a Chinese sememe knowledge base, and demonstrate that our framework outperforms state-of-the-art baselines by a large margin, and maintains a robust performance even for low-frequency words.