Substring-based machine translation

doi:10.1007/S10590-013-9136-6

Journal Article

Graham Neubig, +3 more

- 01 Jun 2013

- Machine Translation

- Vol. 27, Iss: 2, pp 139-166

13

TL;DR: This paper demonstrates that, with a many-to-many aligner, the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves results comparable to word-based translation for two distant language pairs.

Abstract: Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632-641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
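
To make the idea of "translation as a transduction between character strings" concrete, the following is a minimal sketch of how a sentence might be turned into a sequence of character tokens before alignment, and restored afterwards. The boundary marker and the function names are assumptions chosen for illustration; the abstract does not say how the authors represent word boundaries.

```python
from typing import List

# Hypothetical word-boundary marker (an assumption, not taken from the paper).
BOUNDARY = "\u2581"


def to_char_tokens(sentence: str) -> List[str]:
    """Split a sentence into single-character tokens.

    Whitespace between words is replaced by BOUNDARY so that the
    original word segmentation can be recovered later.
    """
    words = sentence.split()
    return list(BOUNDARY.join(words))


def from_char_tokens(tokens: List[str]) -> str:
    """Invert to_char_tokens: join characters and restore spaces."""
    return "".join(tokens).replace(BOUNDARY, " ")


if __name__ == "__main__":
    tokens = to_char_tokens("machine translation")
    # A standard phrase-based pipeline would treat each character as a "word".
    print(" ".join(tokens))
    print(from_char_tokens(tokens))
```

With input prepared this way, an aligner sees every character as a token, so any correspondence it learns between multi-character spans is a substring-to-substring alignment in the sense the abstract describes.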

Citations

•Posted Content

Character-based Neural Machine Translation

Wang Ling, +3 more

- 14 Nov 2015

- arXiv: Computation and Language

TL;DR: A neural machine translation model that views the input and output sentences as sequences of characters rather than words, which alleviates many of the challenges associated with preprocessing and tokenization of the source and target languages.

264

•Posted Content

A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation

Junyoung Chung, +2 more

- 19 Mar 2016

- arXiv: Computation and Language

TL;DR: This paper evaluates an attention-based encoder-decoder with a subword-level encoder and a character-level decoder on four language pairs using the parallel corpora from WMT'15 to ask a fundamental question: can neural machine translation generate a character sequence without any explicit segmentation?

180

•Posted Content

Character-based NMT with Transformer.

Rohit Gupta, +3 more

- 12 Nov 2019

- arXiv: Computation and Language

TL;DR: Experiments on EN-DE show that character-based Transformer models are more robust than their BPE counterparts, both when translating noisy text and when translating text from a different domain.

22

•Posted Content

Char2Subword: Extending the Subword Embedding Space from Pre-trained Models Using Robust Character Compositionality.

Gustavo Aguilar, +5 more

- 24 Oct 2020

TL;DR: A character-based subword transformer module (char2subword) that learns the subword embedding table in pre-trained models like BERT and is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation is proposed.

13

Unsupervised Learning of Lexical Information for Language Processing Systems

Graham Neubig

- 26 Mar 2012

TL;DR: This thesis attempts to answer the question of which lexical units should be used for these applications by acquiring them through unsupervised learning, and presents models for lexical learning for speech recognition and machine translation.

8

References

•Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

•Proceedings Article•10.3115/1557769.1557821

Moses: Open Source Toolkit for Statistical Machine Translation

Philipp Koehn, +13 more

- 25 Jun 2007

TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.

6.3K

•Journal Article

The mathematics of statistical machine translation: parameter estimation

Peter Fitzhugh Brown, +3 more

- 01 Jun 1993

- Computational Linguistics

TL;DR: The authors describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another.

4.9K

•Journal Article•10.1162/089120103321337421

A systematic comparison of various statistical alignment models

Franz Josef Och, +1 more

- 01 Mar 2003

- Computational Linguistics

TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.

4.6K

•Proceedings Article•10.3115/1073445.1073462

Statistical phrase-based translation

Philipp Koehn, +2 more

- 27 May 2003

TL;DR: The empirical results suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translation.

4.1K