Journal Article, DOI: 10.1007/S10590-013-9136-6
Substring-based machine translation
TL;DR: This paper demonstrates that, with a many-to-many aligner, the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves results comparable to word-based translation for two distant language pairs.
Abstract: Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632-641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.
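As a rough illustration of what "a single transduction between character strings" means at the input level, the sketch below turns a whitespace-delimited sentence into a sequence of single-character tokens and back. It is not the paper's preprocessing: the function names and the `_` boundary marker are assumptions made for this example, and the round trip only holds if the marker does not already occur in the text.

```python
from typing import List

# Hypothetical marker standing in for a word boundary; assumed not to
# appear in the input text itself.
BOUNDARY = "_"


def to_char_tokens(sentence: str, boundary: str = BOUNDARY) -> List[str]:
    """Split a sentence into one token per character.

    Runs of whitespace are collapsed into a single boundary token so that
    word boundaries are treated like any other symbol in the string.
    """
    words = sentence.split()
    return list(boundary.join(words))


def from_char_tokens(tokens: List[str], boundary: str = BOUNDARY) -> str:
    """Invert to_char_tokens: join the characters and restore spaces."""
    return "".join(tokens).replace(boundary, " ")


if __name__ == "__main__":
    tokens = to_char_tokens("the cat")
    print(" ".join(tokens))          # character tokens, space-separated
    print(from_char_tokens(tokens))  # original sentence restored
```

With input in this form, any substring (part of a word, a whole word, or a span crossing a boundary token) is a candidate unit for a many-to-many aligner.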
Citations
•Posted Content
Character-based Neural Machine Translation
TL;DR: A neural machine translation model that views the input and output sentences as sequences of characters rather than words, which alleviates many of the challenges associated with preprocessing/tokenization of the source and target languages.
•Posted Content
A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation
TL;DR: This paper evaluates an attention-based encoder-decoder with a subword-level encoder and a character-level decoder on four language pairs using the parallel corpora from WMT'15 to ask a fundamental question: can neural machine translation generate a character sequence without any explicit segmentation?
180
•Posted Content
Character-based NMT with Transformer.
TL;DR: Experiments on EN-DE show that character-based Transformer models are more robust than their BPE counterparts, both when translating noisy text and when translating text from a different domain.
22
•Posted Content
Char2Subword: Extending the Subword Embedding Space from Pre-trained Models Using Robust Character Compositionality.
Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Fatema Rajani, Nitish Shirish Keskar, Thamar Solorio
- 24 Oct 2020
TL;DR: A character-based subword transformer module (char2subword) that learns the subword embedding table in pre-trained models like BERT and is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation is proposed.
13
Unsupervised Learning of Lexical Information for Language Processing Systems
Graham Neubig
- 26 Mar 2012
TL;DR: This thesis attempts to answer the question of which lexical units should be used for these applications by acquiring them through unsupervised learning, and presents models for lexical learning for speech recognition and machine translation.
8
References
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Moses: Open Source Toolkit for Statistical Machine Translation
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, C. Corbett Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Elena Constantin, Evan Herbst
- 25 Jun 2007
TL;DR: An open-source toolkit for statistical machine translation whose novel contributions are support for linguistically motivated factors, confusion network decoding, and efficient data formats for translation models and language models.
•Journal Article
The mathematics of statistical machine translation: parameter estimation
TL;DR: The authors describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another.
A systematic comparison of various statistical alignment models
Franz Josef Och, Hermann Ney
TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
Statistical phrase-based translation
Philipp Koehn, Franz Josef Och, Daniel Marcu
- 27 May 2003
TL;DR: The empirical results suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translation.