Text normalization

Papers published on a yearly basis

Papers

Journal Article • 10.1109/TKDE.2006.180

Some Effective Techniques for Naive Bayes Text Classification

Sang-Bum Kim¹, Kyoung-Soo Han¹, Hae-Chang Rim¹, Sung-Hyon Myaeng²

Korea University¹, Information and Communications University²

01 Nov 2006, IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper proposes two empirical heuristics for naive Bayes text classification, per-document text normalization and a feature weighting method. The resulting classifier performs very well on standard benchmark collections, competing with state-of-the-art text classifiers based on highly complex learning methods such as SVM.

Abstract: While naive Bayes is quite effective in various data mining tasks, it shows a disappointing result in the automatic text classification problem. Based on the observation of naive Bayes for the natural language text, we found a serious problem in the parameter estimation process, which causes poor results in text classification domain. In this paper, we propose two empirical heuristics: per-document text normalization and feature weighting method. While these are somewhat ad hoc methods, our proposed naive Bayes text classifier performs very well in the standard benchmark collections, competing with state-of-the-art text classifiers based on a highly complex learning method such as SVM.

585 citations

Proceedings Article

Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

Bo Han¹, Timothy Baldwin¹

University of Melbourne¹

19 Jun 2011

TL;DR: This paper targets out-of-vocabulary words in short text messages and proposes a method for identifying and normalising ill-formed words, which achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.

Abstract: Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
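
The detect, generate, select pipeline described in this abstract can be illustrated with a deliberately simplified sketch. It is not the authors' method: a toy lexicon lookup stands in for their ill-formed-word classifier, plain string similarity from Python's `difflib` stands in for morphophonemic similarity, and the context-based selection step is omitted. The names `LEXICON`, `is_oov`, `candidates`, and `normalise` are hypothetical.

```python
import difflib

# Hypothetical toy lexicon; a real system would use a full dictionary.
LEXICON = {"making", "sense", "of", "twitter", "see", "you", "talk"}


def is_oov(token, lexicon=LEXICON):
    """Flag alphabetic tokens missing from the lexicon.

    Assumption: a plain lookup replaces the paper's classifier.
    """
    return token.isalpha() and token.lower() not in lexicon


def candidates(token, lexicon=LEXICON, n=3, cutoff=0.6):
    """Return up to n lexicon words ranked by surface string similarity.

    Assumption: difflib similarity replaces morphophonemic similarity.
    """
    return difflib.get_close_matches(token.lower(), sorted(lexicon), n=n, cutoff=cutoff)


def normalise(tokens, lexicon=LEXICON):
    """Replace each out-of-vocabulary token with its best candidate, if any."""
    result = []
    for token in tokens:
        if is_oov(token, lexicon):
            ranked = candidates(token, lexicon)
            # Keep the original token when no candidate is similar enough.
            result.append(ranked[0] if ranked else token)
        else:
            result.append(token)
    return result


print(normalise(["makn", "sens", "of", "#twitter"]))
```

Tokens that are not purely alphabetic, such as hashtags, are passed through unchanged in this sketch; ranking candidates by context as well as similarity would need a language model on top.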

572 citations

Proceedings Article

Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems

Michael Denkowski¹, Alon Lavie¹

Carnegie Mellon University¹

30 Jul 2011

TL;DR: Meteor 1.3, the authors' submission to the 2011 EMNLP Workshop on Statistical Machine Translation automatic evaluation metric tasks, adds improved text normalization, higher-precision paraphrase matching, and discrimination between content and function words.

Abstract: This paper describes Meteor 1.3, our submission to the 2011 EMNLP Workshop on Statistical Machine Translation automatic evaluation metric tasks. New metric features include improved text normalization, higher-precision paraphrase matching, and discrimination between content and function words. We include Ranking and Adequacy versions of the metric shown to have high correlation with human judgments of translation quality as well as a more balanced Tuning version shown to outperform BLEU in minimum error rate training for a phrase-based Urdu-English system.

489 citations

Journal Article • 10.1006/CSLA.2001.0169

Normalization of non-standard words

Richard Sproat¹, Alan W. Black², Stanley F. Chen³, Shankar Kumar⁴, Mari Ostendorf⁵, Christopher D. Richards⁶

AT&T Labs¹, Carnegie Mellon University², IBM³, Johns Hopkins University⁴, University of Washington⁵, Princeton University⁶

01 Jul 2001, Computer Speech & Language

TL;DR: A taxonomy of non-standard words (NSWs) was developed on the basis of four rather distinct text types, and several general techniques, including n-gram language models, decision trees, and weighted finite-state transducers, were investigated. The results demonstrate that a systematic treatment can lead to better results than the ad hoc treatments typically used in the past.

409 citations

Proceedings Article • 10.3115/1273073.1273078

A Phrase-Based Statistical Model for SMS Text Normalization

Aiti Aw, Min Zhang, Juan Xiao, Jian Su

17 Jul 2006

TL;DR: This paper views SMS normalization as a translation problem from the SMS language to the English language and adapts a phrase-based statistical MT model to the task. Using the normalization as MT preprocessing substantially boosts SMS translation performance.

Abstract: Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from a customization problem, as tremendous effort is required to adapt the language model of the existing translation system to handle SMS text style. We offer an alternative approach to resolve such irregularities by normalizing SMS texts before MT. In this paper, we view the task of SMS normalization as a translation problem from the SMS language to the English language and we propose to adapt a phrase-based statistical MT model for the task. Evaluation by 5-fold cross validation on a parallel SMS normalized corpus of 5000 sentences shows that our method can achieve 0.80702 in BLEU score against the baseline BLEU score 0.6958. Another experiment of translating SMS texts from English to Chinese on a separate SMS text corpus shows that using SMS normalization as MT preprocessing can largely boost SMS translation performance, from 0.1926 to 0.3770 in BLEU score.
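
To put the BLEU figures quoted in this abstract in perspective, the absolute and relative gains can be computed directly. This is only an arithmetic sketch: the helper name `gains` is hypothetical, and the four scores are the only values taken from the abstract.

```python
def gains(baseline, improved):
    """Return (absolute gain, relative gain) of `improved` over `baseline`."""
    absolute = improved - baseline
    return absolute, absolute / baseline


# BLEU scores as quoted in the abstract.
norm_abs, norm_rel = gains(0.6958, 0.80702)  # SMS normalization, 5-fold CV
mt_abs, mt_rel = gains(0.1926, 0.3770)       # English-to-Chinese translation

print(f"normalization: +{norm_abs:.4f} BLEU ({norm_rel:.1%} relative)")
print(f"translation:   +{mt_abs:.4f} BLEU ({mt_rel:.1%} relative)")
```

On these numbers the normalization gain is roughly 16% relative to the baseline, while the translation score nearly doubles.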

289 citations

Year	Papers
2022	2
2021	34
2020	43
2019	41
2018	28
2017	32

Related Topics (5)

Performance Metrics