Parameter-Efficient Korean Character-Level Language Modeling

Proceedings Article

pp 2342-2348

TL;DR: The authors exploit the decomposability of Korean characters to model at the syllable level while using only jamo-level representations, and find that their three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models.

Abstract: Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a “three-hot” scheme, where we exploit the decomposability of Korean characters to model at the syllable level but using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models. Namely, it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as with jamo models. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
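The factoring of a syllable into jamo described in the abstract can be illustrated with the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3), which orders them as 19 initial consonants × 21 medial vowels × 28 finals (27 final consonants plus "no final"), giving 11,172 syllables. The sketch below is not the authors' code; the function names `three_hot_indices` and `compose` are hypothetical, and it shows only the index arithmetic, not the embedding or decoding layers of the paper.

```python
from typing import Tuple

# Standard Unicode block layout for precomposed Hangul syllables.
HANGUL_BASE = 0xAC00           # code point of the first syllable, "가"
NUM_INITIAL = 19               # initial consonants
NUM_MEDIAL = 21                # medial vowels
NUM_FINAL = 28                 # 27 final consonants + 1 slot for "no final"
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL  # 11,172


def three_hot_indices(syllable: str) -> Tuple[int, int, int]:
    """Map one precomposed syllable to (initial, medial, final) jamo indices.

    Hypothetical helper: these three indices are what a "three-hot"
    representation would switch on, one per jamo position.
    """
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse mapping: rebuild the syllable from its three jamo indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIAL + medial) * NUM_FINAL + final)


if __name__ == "__main__":
    idx = three_hot_indices("한")
    print(idx, compose(*idx))
    # Rows needed by a jamo-indexed table versus a syllable-indexed table.
    print(NUM_INITIAL + NUM_MEDIAL + NUM_FINAL, "vs", NUM_SYLLABLES)
```

Because the mapping is a bijection, a model can keep one position per syllable (no tripling of sequence length) while indexing only 19 + 21 + 28 = 68 jamo rows instead of 11,172 syllable rows. If both tables used the same embedding width, which is an assumption the abstract does not state, that is about 0.6% of the rows, in line with the "fewer than 1%" claim.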
