Noisy text analytics

Papers published on a yearly basis

Papers

Book

Automatic text processing: the transformation, analysis, and retrieval of information by computer

Gerard Salton¹•Institutions (1)

Cornell University¹

3 Jan 1989

3,866 citations

Proceedings Article•10.1109/ICDAR.2013.221

ICDAR 2013 Robust Reading Competition

Dimosthenis Karatzas¹, Faisal Shafait², Seiichi Uchida³, Masakazu Iwamura⁴, Lluís Gómez i Bigorda¹, Sergi Robles Mestre¹, Joan Mas¹, David Fernandez Mota¹, Jon Almazan¹, Lluís-Pere de las Heras¹, +6 more•Institutions (4)

Autonomous University of Barcelona¹, University of Western Australia², Kyushu University³, Osaka Prefecture University⁴

25 Aug 2013

TL;DR: The datasets and ground truth specification are described, the performance evaluation protocols used are detailed, and the final results are presented along with a brief summary of the participating methods.

Abstract: This report presents the final results of the ICDAR 2013 Robust Reading Competition. The competition is structured in three Challenges addressing text extraction in different application domains, namely born-digital images, real scene images and real-scene videos. The Challenges are organised around specific tasks covering text localisation, text segmentation and word recognition. The competition took place in the first quarter of 2013, and received a total of 42 submissions over the different tasks offered. This report describes the datasets and ground truth specification, details the performance evaluation protocols used and presents the final results along with a brief summary of the participating methods.

1,689 citations

Journal Article•10.1007/S11263-015-0823-Z

Reading Text in the Wild with Convolutional Neural Networks

Max Jaderberg¹, Karen Simonyan¹, Andrea Vedaldi¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Jan 2016 - International Journal of Computer Vision

TL;DR: An end-to-end system for text spotting (localising and recognising text in natural scene images) and text-based image retrieval is presented, and a real-world application that makes thousands of hours of news footage instantly searchable via a text query is demonstrated.

Abstract: In this work we present an end-to-end system for text spotting--localising and recognising text in natural scene images--and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.

1,281 citations

Proceedings Article•10.3115/1034678.1034679

Untangling Text Data Mining

Marti A. Hearst¹•Institutions (1)

University of California, Berkeley¹

20 Jun 1999

TL;DR: Data mining, information access, and corpus-based computational linguistics are defined, and their relationship to text data mining is discussed; the intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists.

Abstract: The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information. In this paper I will first define data mining, information access, and corpus-based computational linguistics, and then discuss the relationship of these to text data mining. The intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists. I describe examples of what I consider to be real text data mining efforts and briefly outline recent ideas about how to pursue exploratory data analysis over text.

971 citations

Proceedings Article•10.1145/1835449.1835643

Short text classification in twitter to improve information filtering

Bharath Sriram¹, Dave Fuhry¹, Engin Demir¹, Hakan Ferhatosmanoglu¹, Murat Demirbas², +1 more•Institutions (2)

Ohio State University¹, University at Buffalo²

19 Jul 2010

TL;DR: A small set of domain-specific features extracted from the author's profile and text is proposed for classifying short text messages into a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.

Abstract: In microblogging services such as Twitter, the users may become overwhelmed by the raw data. One solution to this problem is the classification of short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such as "Bag-Of-Words" have limitations. To address this problem, we propose to use a small set of domain-specific features extracted from the author's profile and text. The proposed approach effectively classifies the text to a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.

867 citations

Year	Papers
2023	7
2022	8
2020	1
2019	1
2018	4
2017	23

Related Topics (5)

Performance Metrics