Proceedings Article
Parameter-Efficient Korean Character-Level Language Modeling
Marco Cognetta, Sangwhan Moon, Lawrence Wolf-Sonkin, Naoaki Okazaki +3 more
pp 2342-2348
TL;DR: The authors exploit the decomposability of Korean characters to model at the syllable level while using only jamo-level representations, and find that their three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models.
Abstract: Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a “three-hot” scheme, where we exploit the decomposability of Korean characters to model at the syllable level but using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models. Namely, it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as with jamo models. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
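The factoring of a syllable into jamo that the abstract relies on can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3), where 19 initial consonants, 21 vowels, and 28 final slots (including "no final") give 19 × 21 × 28 = 11,172 syllables. The sketch below is only an illustration of that decomposition and of a three-hot index vector over 19 + 21 + 28 = 68 jamo slots. It is not the authors' implementation, the function names `decompose`, `compose`, and `three_hot` are hypothetical, and the paper's actual embedding and decoding layers are not described in this abstract.

```python
from typing import List, Tuple

# Unicode Hangul Syllables block starts at U+AC00 and holds 11,172 code points.
HANGUL_BASE = 0xAC00
NUM_LEAD, NUM_VOWEL, NUM_TAIL = 19, 21, 28  # tail index 0 means "no final consonant"
NUM_SYLLABLES = NUM_LEAD * NUM_VOWEL * NUM_TAIL  # 11,172


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (lead, vowel, tail) jamo indices for one precomposed syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(offset, NUM_VOWEL * NUM_TAIL)
    vowel, tail = divmod(rest, NUM_TAIL)
    return lead, vowel, tail


def compose(lead: int, vowel: int, tail: int) -> str:
    """Inverse of decompose: rebuild the syllable from its jamo indices."""
    return chr(HANGUL_BASE + (lead * NUM_VOWEL + vowel) * NUM_TAIL + tail)


def three_hot(syllable: str) -> List[int]:
    """Hypothetical helper: a length-68 vector with exactly three ones,
    one per jamo slot (lead, vowel, tail)."""
    lead, vowel, tail = decompose(syllable)
    vec = [0] * (NUM_LEAD + NUM_VOWEL + NUM_TAIL)
    vec[lead] = 1
    vec[NUM_LEAD + vowel] = 1
    vec[NUM_LEAD + NUM_VOWEL + tail] = 1
    return vec


if __name__ == "__main__":
    print(decompose("한"))       # (18, 0, 4)
    print(sum(three_hot("한")))  # 3
```

Each syllable still occupies a single sequence position here, whereas a jamo-level model would spend one position on each of the three jamo. The decomposition is also unique and invertible, so `compose(*decompose(s)) == s` holds for every syllable in the block.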