Top 84 papers presented at Document Analysis Systems in 2008

Proceedings Article•10.1109/DAS.2008.42•

A Robust System to Detect and Localize Texts in Natural Scene Images

Yi-Feng Pan¹, Xinwen Hou¹, Cheng-Lin Liu¹•Institutions (1)

16 Sep 2008

TL;DR: A region-based method utilizing multiple features and cascade AdaBoost classifier is adopted for text detection and a window grouping method integrating text line competition analysis is used to generate text lines.

Abstract: In this paper, we present a robust system to accurately detect and localize texts in natural scene images. For text detection, a region-based method utilizing multiple features and cascade AdaBoost classifier is adopted. For text localization, a window grouping method integrating text line competition analysis is used to generate text lines. Then within each text line, local binarization is used to extract candidate connected components (CCs) and non-text CCs are filtered out by Markov Random Fields (MRF) model, through which text line can be localized accurately. Experiments on the public benchmark ICDAR 2003 Robust Reading and Text Locating Dataset show that our system is comparable to the best existing methods both in accuracy and speed.

90 citations

Proceedings Article•10.1109/DAS.2008.41•

An Objective Evaluation Methodology for Document Image Binarization Techniques

Konstantinos Ntirogiannis¹, Basilis Gatos¹, Ioannis Pratikakis•Institutions (1)

National Centre of Scientific Research "Demokritos"¹

16 Sep 2008

TL;DR: An objective evaluation methodology that aims to reduce the human involvement in the ground truth construction and consecutive testing and a benchmarking of the six most promising state-of-the-art binarization algorithms based on the proposed methodology is presented.

Abstract: Evaluation of document image binarization techniques is a tedious task that is mainly performed by a human expert or by involving an OCR engine. This paper presents an objective evaluation methodology for document image binarization techniques that aims to reduce the human involvement in the ground truth construction and consecutive testing. A skeletonized ground truth image is produced by the user following a semi-automatic procedure. The estimated ground truth image can aid in evaluating the binarization result in terms of recall and precision as well as to further analyze the result by calculating broken and missing text, deformations and false alarms. A detailed description of the methodology along with a benchmarking of the six (6) most promising state-of-the-art binarization algorithms based on the proposed methodology is presented.
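
As a rough illustration of the recall and precision measures mentioned in the abstract, the following minimal sketch computes plain pixel-level precision and recall between a binarization result and a ground truth image. It assumes both are same-size grids of 0/1 values (1 = text). The function name `precision_recall` is hypothetical, and the paper's skeleton-based ground truth and its broken-text, missing-text, deformation and false-alarm measures are not reproduced here.

```python
def precision_recall(result, truth):
    """Pixel-level precision and recall for binary images (1 = text, 0 = background).

    Assumption: `result` and `truth` are lists of rows with identical shape.
    """
    tp = fp = fn = 0
    for r_row, t_row in zip(result, truth):
        for r, t in zip(r_row, t_row):
            if r and t:
                tp += 1          # text pixel correctly detected
            elif r and not t:
                fp += 1          # false alarm
            elif t and not r:
                fn += 1          # missed text pixel
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


result = [[1, 1, 0],
          [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
p, r = precision_recall(result, truth)
print(p, r)
```

A skeleton-based variant could plausibly reuse the same counting with a thinned ground truth for recall, but that is a guess about the design rather than something the abstract states.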

90 citations

Proceedings Article•10.1109/DAS.2008.73•

A Complete Optical Character Recognition Methodology for Historical Documents

Georgios Vamvakas, Basilis Gatos, Nikolaos Stamatopoulos, Stavros Perantonis

16 Sep 2008

TL;DR: In this paper a complete OCR methodology for recognizing historical documents, either printed or handwritten without any knowledge of the font, is presented.

Abstract: In this paper a complete OCR methodology for recognizing historical documents, either printed or handwritten without any knowledge of the font, is presented. This methodology consists of three steps: The first two steps refer to creating a database for training using a set of documents, while the third one refers to recognition of new document images. First, a pre-processing step that includes image binarization and enhancement takes place. At a second step a top-down segmentation approach is used in order to detect text lines, words and characters. A clustering scheme is then adopted in order to group characters of similar shape. This is a semi-automatic procedure since the user is able to interact at any time in order to correct possible errors of clustering and assign an ASCII label. After this step, a database is created in order to be used for recognition. Finally, in the third step, for every new document image the above segmentation approach takes place while the recognition is based on the character database that has been produced at the previous step.

89 citations

Proceedings Article•10.1109/DAS.2008.74•

New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets

Sami Gazzah, N.E. Ben Amara

16 Sep 2008

TL;DR: This work proposes oversampling the minority class using polynomial fitting functions to improve the True Positives rate (TPr) without much sacrifice in the True Negatives rate (TNr).

Abstract: In classification tasks, the class-modular strategy has been widely used. It has outperformed the classical strategy for pattern classification in many applications. However, in some modular architectures, such as one-against-all support vector machine classifiers, the training data for one class risks heavily outnumbering the other. In this challenging situation, the trained classifier will accurately classify the majority class but marginalize the minority class. As a result, the True Negatives rate (TNr) will be very high while the True Positives rate (TPr) will be low. The main goal of this work is to improve TPr without much sacrifice in TNr. In this paper, we propose oversampling the minority class using polynomial fitting functions. Four new approaches are proposed: star topology, bus topology, polynomial curve topology and mesh topology. The star and mesh topology approaches led to the best performance.
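
The abstract does not define the four topologies, so the following is only a generic sketch of the underlying idea: fit a polynomial to minority-class samples and draw synthetic samples along the fitted curve. It assumes 2-D feature vectors `(x, y)` and a least-squares fit of `y` as a polynomial in `x`. The helper names `polyfit` and `oversample_on_curve` are hypothetical, and the code should not be read as any of the paper's four approaches.

```python
def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[degree]*x**degree.

    Solves the normal equations with Gaussian elimination (partial pivoting).
    """
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y]
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for k in range(col, m + 1):
                a[r][k] -= f * a[col][k]
    # Back substitution
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = a[i][m] - sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = s / a[i][i]
    return coeffs


def oversample_on_curve(samples, degree, n_new):
    """Generate n_new synthetic minority samples on the fitted polynomial curve."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    coeffs = polyfit(xs, ys, degree)
    lo, hi = min(xs), max(xs)
    new_samples = []
    for k in range(n_new):
        # Evenly spaced x positions strictly inside the observed range
        x = lo + (hi - lo) * (k + 0.5) / n_new
        y = sum(c * x ** i for i, c in enumerate(coeffs))
        new_samples.append((x, y))
    return new_samples


minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
print(oversample_on_curve(minority, degree=2, n_new=3))
```

Keeping the new points inside the observed range avoids extrapolating the polynomial, which tends to behave badly outside the data.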

76 citations

Proceedings Article•10.1109/DAS.2008.40•

A Two-Step Dewarping of Camera Document Images

Nikolaos Stamatopoulos, Basilis Gatos, Ioannis Pratikakis, Stavros Perantonis

16 Sep 2008

TL;DR: A two-step approach for efficient dewarping of camera document images is presented and experimental results demonstrate the robustness and effectiveness of the proposed technique.

Abstract: Dewarping of camera document images has attracted a lot of interest over the last few years since warping not only reduces the document readability but also affects the accuracy of an OCR application. In this paper, a two-step approach for efficient dewarping of camera document images is presented. At a first step, a coarse dewarping is accomplished with the help of a transformation model which maps the projection of a curved surface to a 2D rectangular area. The projection of the curved surface is delimited by the two curved lines which fit the top and bottom text lines along with the two straight lines which fit to the left and right text boundaries. At a second step, fine dewarping is achieved based on words detection. All words are pose normalized guided by the lower and upper word baselines. Experimental results on several camera document images demonstrate the robustness and effectiveness of the proposed technique.

60 citations

Proceedings Article•10.1109/DAS.2008.21•

MathBrush: A System for Doing Math on Pen-Based Devices

George Labahn¹, Edward Lank¹, Scott MacLean¹, Mirette Marzouk¹, David Tausky¹•Institutions (1)

University of Waterloo¹

16 Sep 2008

TL;DR: This paper presents MathBrush, a system that allows users to draw math input using a pen-input device on a tablet computer, recognizes the math expression, and then supports mathematical transformation and problem solving using back-end Computer Algebra Systems (CAS).

Abstract: Many on-line (interactive) mathematics recognition systems allow the creation of typeset equations, normally in LaTeX, but they do not support mathematical problem solving. In this paper, we present MathBrush, a system that allows users to draw math input using a pen-input device on a tablet computer, recognizes the math expression, and then supports mathematical transformation and problem solving using back-end Computer Algebra Systems (CAS). We describe the architecture of the MathBrush system, which includes modules that support symbol recognition, semantic analysis, the transfer of recognized expressions to back-end CAS, and interface techniques for interacting with CAS output. We also identify unique challenges associated with recognition for math problem solving, such as the need for deeper semantic analysis than is required by LaTeX, and the need to deal with ambiguities in user input. Our experiences serve to inform researchers seeking to design interactive mathematics recognition systems geared toward mathematical problem solving.

54 citations

Proceedings Article•10.1109/DAS.2008.71•

Segmentation of Curled Textlines Using Active Contours

Syed Saqib Bukhari¹, Faisal Shafait¹, Thomas M. Breuel¹•Institutions (1)

Kaiserslautern University of Technology¹

16 Sep 2008

TL;DR: This work presents a new algorithm for curled textline segmentation that is robust to the degree of curl, direction of curl, and spacing between adjacent lines, at the expense of high execution time.

Abstract: Segmentation of curled textlines from warped document images is one of the major issues in document image dewarping. Most of the curled textline segmentation algorithms present in the literature today are sensitive to the degree of curl, direction of curl, and spacing between adjacent lines. We present a new algorithm for curled textline segmentation which is robust to the above-mentioned problems at the expense of high execution time. We will demonstrate this insensitivity in a performance evaluation section. Our approach is based on the state-of-the-art image segmentation technique: Active Contour Model (Snake) with the novel idea of several baby snakes and their convergence in a vertical direction only. An experiment on the publicly available CBDAR 2007 document image dewarping contest dataset shows that our text line segmentation algorithm achieves an accuracy of 97.96%.

43 citations

Proceedings Article•10.1109/DAS.2008.26•

Super-Resolution of Text Images Using Edge-Directed Tangent Field

Jyotirmoy Banerjee¹, C. V. Jawahar¹•Institutions (1)

International Institute of Information Technology, Hyderabad¹

16 Sep 2008

TL;DR: This paper presents an edge-directed super-resolution algorithm for document images without using any training set, which creates an image with smooth regions in both the foreground and the background, while allowing sharp discontinuities across and smoothness along the edges.

Abstract: This paper presents an edge-directed super-resolution algorithm for document images without using any training set. This technique creates an image with smooth regions in both the foreground and the background, while allowing sharp discontinuities across and smoothness along the edges. Our method preserves sharp corners in text images by using the local edge direction, which is computed first by evaluating the gradient field and then taking its tangent. Super-resolution of document images is characterized by bimodality, smoothness along the edges as well as subsampling consistency. These characteristics are enforced in a Markov random field (MRF) framework by defining an appropriate energy function. In our method, subsampling of the super-resolution image will return the original low-resolution one, proving the correctness of the method. The super-resolution image is generated by iteratively reducing this energy function. Experimental results on a variety of input images demonstrate the effectiveness of our method for document image super-resolution.
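
The subsampling consistency property described in the abstract can be illustrated with a small check: downsampling a candidate super-resolution image should reproduce the low-resolution input. The sketch below assumes block averaging as the subsampling operator and grayscale images stored as lists of rows. Both choices, and the names `downsample` and `is_subsampling_consistent`, are assumptions; the abstract does not say which operator the paper uses.

```python
def downsample(img, factor):
    """Block-average downsampling by an integer factor (assumed operator)."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]


def is_subsampling_consistent(high, low, factor, tol=1e-9):
    """True if downsampling `high` reproduces `low` within `tol`."""
    down = downsample(high, factor)
    if len(down) != len(low) or len(down[0]) != len(low[0]):
        return False
    return all(abs(d - l) <= tol
               for d_row, l_row in zip(down, low)
               for d, l in zip(d_row, l_row))


low = [[0, 1]]
high = [[0, 0, 1, 1],
        [0, 0, 1, 1]]
print(is_subsampling_consistent(high, low, factor=2))
```

In an energy-minimization setting, a violation of this check would typically be turned into a penalty term rather than a hard test.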

40 citations

Proceedings Article•10.1109/DAS.2008.24•

Word and Symbol Spotting Using Spatial Organization of Local Descriptors

Marçal Rusiñol, Josep Lladós

16 Sep 2008

TL;DR: This paper proposes a spotting architecture able to index both words and symbols, inspired by off-the-shelf object recognition architectures, and presents a method to spot both text and graphical symbols in a collection of images of wiring diagrams.

Abstract: In this paper we present a method to spot both text and graphical symbols in a collection of images of wiring diagrams. Word spotting and symbol spotting methods tend to use the most discriminative features to describe the objects to be located. As a result, textual and symbolic information cannot be handled at the same time. We propose a spotting architecture able to index both words and symbols, inspired by off-the-shelf object recognition architectures. Keypoints are extracted from a document image and a local descriptor is computed at each of these points of interest. The spatial organization of these descriptors validates the hypothesis that an object (text or symbol) is found in a certain location and under a certain pose.

35 citations

Proceedings Article•10.1109/DAS.2008.60•

Accurate Alignment of Double-Sided Manuscripts for Bleed-Through Removal

Jie Wang¹, Michael S. Brown¹, Chew Lim Tan¹•Institutions (1)

National University of Singapore¹

16 Sep 2008

TL;DR: A two-stage hierarchical alignment technique that can efficiently and accurately align the two sides of a document and build a classification and recovery system to remove bleed-through interference and restore historical manuscripts.

Abstract: Double-sided manuscripts are often degraded by bleed-through interference. Such degradation must be corrected to facilitate human perception and machine recognition. Most approaches to bleed-through removal rely on perfect alignment between the recto and verso images of a document. This paper presents a two-stage hierarchical alignment technique that can efficiently and accurately align the two sides of a document. Our approach first coarsely aligns the two images using a pair of anchors extracted from the recto and verso images respectively. The coarsely aligned images are then precisely aligned using block matching and radial basis function (RBF) based interpolation techniques. To evaluate the proposed alignment technique, we build a classification and recovery system to remove bleed-through interference and restore historical manuscripts. The accuracy of our alignment approach is then assessed with the accuracy of bleed-through correction.

35 citations

Proceedings Article•10.1109/DAS.2008.59•

Automated OCR Ground Truth Generation

Joost van Beusekom¹, Faisal Shafait, Thomas M. Breuel¹•Institutions (1)

Kaiserslautern University of Technology¹

16 Sep 2008

TL;DR: This paper finds a robust and pixel accurate scanner independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground truth character information.

Abstract: Most optical character recognition (OCR) systems need to be trained and tested on the symbols that are to be recognized. Therefore, ground truth data is needed. This data consists of character images together with their ASCII code. Among the approaches for generating ground truth of real world data, one promising technique is to use electronic version of the scanned documents. Using an alignment method, the character bounding boxes extracted from the electronic document are matched to the scanned image. Current alignment methods are not robust to different similarity transforms. They also need calibration to deal with non-linear local distortions introduced by the printing/scanning process. In this paper we present a significant improvement over existing methods, allowing to skip the calibration step and having a more accurate alignment, under all similarity transforms. Our method finds a robust and pixel accurate scanner independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground truth character information. The accuracy of the alignment is demonstrated using documents from the UW3 dataset. The results show that the mean distance between the estimated and the ground truth character bounding box position is less than one pixel.

Proceedings Article•10.1109/DAS.2008.58•

Symbol Descriptor Based on Shape Context and Vector Model of Information Retrieval

Thi-Oanh Nguyen, Salvatore Tabbone, Oriol Ramos Terrades

16 Sep 2008

TL;DR: An adaptive method for graphic symbol representation based on shape contexts that is invariant under classical geometric transforms and based on interest points; the vector model for information retrieval is used to reduce the complexity of matching a symbol to a large set of candidates.

Abstract: In this paper we present an adaptive method for graphic symbol representation based on shape contexts. The proposed descriptor is invariant under classical geometric transforms (rotation, scale) and based on interest points. To reduce the complexity of matching a symbol to a large set of candidates we use the popular vector model for information retrieval. In this way, on the set of shape descriptors we build a visual vocabulary where each symbol is retrieved on visual words. Experimental results on complex and occluded symbols show that the approach is very promising.

Proceedings Article•10.1109/DAS.2008.66•

Efficient Binarization of Historical and Degraded Document Images

B. Gatos, Ioannis Pratikakis, Stavros Perantonis

16 Sep 2008

TL;DR: A new adaptive approach for the binarization and enhancement of historical and degraded documents and demonstrated superior performance against six well-known techniques on numerous historical handwritten and machine-printed documents mainly from the Library of Congress of the United States archive.

Abstract: This paper presents a new adaptive approach for the binarization and enhancement of historical and degraded documents. The proposed method is based on (i) efficient pre-processing; (ii) the combination of the results of several state-of-the-art binarization methodologies; (iii) the incorporation of edge information and (iv) the application of efficient image post-processing based on mathematical morphology for the enhancement of the final result. The proposed method demonstrated superior performance against six well-known techniques on numerous historical handwritten and machine-printed documents mainly from the Library of Congress of the United States archive. The performance evaluation was based on a consistent and concrete methodology.

Proceedings Article•10.1109/DAS.2008.14•

Multi-oriented Text Line Extraction from Handwritten Arabic Documents

Nazih Ouwayed, Abdel Belaïd

16 Sep 2008

TL;DR: A novel approach for the multi-oriented text line extraction from handwritten Arabic documents by using the Wigner-Ville Distribution (WVD) to estimate the global orientation of each zone.

Abstract: In this paper, we present a novel approach for the multi-oriented text line extraction from handwritten Arabic documents. After image pre-processing, the local orientations are determined in small windows obtained by image paving. The orientation of the text within each window is estimated using the projection profile technique considering several projection angles. Then, the windows with close angles are gathered into larger zones. We use the Wigner-Ville Distribution (WVD) to estimate the global orientation of each zone. The WVD is more precise than the classical projection profile technique. Afterwards, the text lines are extracted in each zone by following the baselines and using the proximity of connected components. The experimental results demonstrate the efficiency of the proposed scheme. It has been evaluated on 50 documents, reaching an accuracy of about 97.6%.
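
The window-level step in the abstract, a projection profile evaluated at several angles, can be sketched as follows. This is a generic version of the classical technique, not the authors' implementation. It assumes foreground pixels are given as `(x, y)` coordinates, scores each candidate angle by the sum of squared profile bin counts (one common sharpness criterion, chosen here as an assumption), and uses the hypothetical names `profile_score` and `estimate_orientation`. The Wigner-Ville step is not covered.

```python
import math


def profile_score(points, angle_deg, bin_size=1.0):
    """Sharpness of the projection profile for text lines at angle_deg.

    Each point is projected onto the axis perpendicular to the candidate
    line direction; a peaky histogram (few, full bins) gives a high score.
    """
    theta = math.radians(angle_deg)
    s, c = math.sin(theta), math.cos(theta)
    bins = {}
    for x, y in points:
        d = -x * s + y * c  # signed distance from a line through the origin
        key = math.floor(d / bin_size)
        bins[key] = bins.get(key, 0) + 1
    return sum(n * n for n in bins.values())


def estimate_orientation(points, angles):
    """Return the candidate angle (degrees) with the sharpest profile."""
    return max(angles, key=lambda a: profile_score(points, a))


# Two synthetic "text lines" inclined at 10 degrees
direction = math.radians(10)
points = [(t * math.cos(direction), t * math.sin(direction) + offset)
          for offset in (5.5, 25.5) for t in range(200)]
print(estimate_orientation(points, range(-30, 31)))
```

The angular resolution is limited by the candidate list, which is one reason a finer estimator for whole zones, as the abstract describes, can be attractive.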

Proceedings Article•10.1109/DAS.2008.83•

Multi-Oriented English Text Line Extraction Using Background and Foreground Information

Partha Pratim Roy¹, Umapada Pal², Josep Lladós¹, Fumitaka Kimura³•Institutions (3)

Autonomous University of Barcelona¹, Indian Statistical Institute², Mie University³

16 Sep 2008

TL;DR: A novel method to extract individual text lines from multi-oriented and/or curved text document is proposed and the method is based on the foreground and background information of the characters of the text.

Abstract: In graphical documents (maps, engineering drawings), artistic documents, etc., there are many printed materials in which text lines are not parallel to each other and are multi-oriented and curved in nature. For the OCR of such documents, individual text lines need to be extracted. Extraction of individual text lines from multi-oriented and/or curved text documents is a difficult problem. In this paper, we propose a novel method to extract individual text lines from such document pages, based on the foreground and background information of the characters of the text. To take care of background information, the water reservoir concept is used. In the proposed scheme, individual components are first detected and grouped into 3-character clusters using their inter-component distance, size and positional information. Applying graph concepts, the initial 3-character clusters are merged into larger cluster groups. Using inter-character background information, the orientations of the extreme characters of a larger cluster are decided and, based on these orientations, two candidate regions are formed from the cluster. Finally, with the help of these candidate regions, individual lines are extracted. The experiments gave encouraging results.

Proceedings Article•10.1109/DAS.2008.43•

An End-to-End Administrative Document Analysis System

Hatem Hamza, Yolande Belaïd, Abdel Belaïd, Bidyut B. Chaudhuri

16 Sep 2008

TL;DR: An improved version of an existing neural network, Incremental Growing Neural Gas, is proposed; applied to document learning and classification, it reaches a recognition rate of 97.63%.

Abstract: This paper presents an end-to-end administrative document analysis system. This system uses case-based reasoning in order to process documents from known and unknown classes. For each document, the system retrieves the nearest processing experience in order to analyze and interpret the current document. When a complete analysis is done, this document needs to be added to the document database. This requires an incremental learning process in order to take into account every new piece of information without losing what was previously learnt. For this purpose, we propose an improved version of an existing neural network called Incremental Growing Neural Gas. Applied to document learning and classification, this neural network reaches a recognition rate of 97.63%.

Proceedings Article•10.1109/DAS.2008.22•

Skew Estimation by Instances

Seiichi Uchida¹, Megumi Sakai¹, Masakazu Iwamura², Shinichiro Omachi³, Koichi Kise²•Institutions (3)

Kyushu University¹, Osaka Prefecture University², Tohoku University³

16 Sep 2008

TL;DR: The proposed skew estimation method by instances will be applicable to various documents such as signboard images captured by a camera and revealed the expected robustness against various character layouts.

Abstract: This paper proposes a novel skew estimation method by instances. The instances to be learned (i.e., stored) are rotation invariants and a rotation variant for each character category. Using the instances, it is possible to estimate a skew angle of each individual character on a document. This fact implies that the proposed method can estimate the skew angle of a document where characters do not form long straight text lines. Thus, the proposed method will be applicable to various documents such as signboard images captured by a camera. Experimental evaluation using synthetic and real images revealed the expected robustness against various character layouts.

Proceedings Article•10.1109/DAS.2008.50•

Towards Whole-Book Recognition

Pingping Xiu¹, Henry S. Baird¹•Institutions (1)

Lehigh University¹

16 Sep 2008

TL;DR: Experimental results for unsupervised recognition of the textual contents of book-images using fully automatic mutual-entropy-based model adaptation are described and it is observed that error rates on long words fall monotonically with passage lengths.

Abstract: We describe experimental results for unsupervised recognition of the textual contents of book-images using fully automatic mutual-entropy-based model adaptation. Each experiment starts with approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally incomplete) dictionaries, and then runs a fully automatic adaptation algorithm which, guided entirely by evidence internal to the test set, attempts to correct the models for improved accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier. The linguistic model describes word-occurrence probabilities. Our adaptation algorithm detects disagreements between the models by analyzing mutual entropy between (1) the a posteriori probability distribution of character classes (the recognition results from image classification alone), and (2) the a posteriori probability distribution of word classes (the recognition results from image classification combined with linguistic constraints). Disagreements identify candidates for automatic model corrections. We report experiments on 40 textlines in which word error rates fall monotonically with passage lengths. We also report experiments on an enhanced algorithm which can cope with character-segmentation errors (a single split, or a single merge, per word). In order to scale up experiments, soon, to whole book images, we have revised data structures and implemented speed enhancements. For this algorithm, we report results on three increasingly long passage lengths: (a) one full page, (b) five pages, and (c) ten pages. We observe that error rates on long words fall monotonically with passage lengths.

Proceedings Article•10.1109/DAS.2008.12•

Difference of Boxes Filters Revisited: Shadow Suppression and Efficient Character Segmentation

Erik Rodner¹, Herbert Süsse¹, Wolfgang Ortmann¹, Joachim Denzler¹•Institutions (1)

University of Jena¹

16 Sep 2008

TL;DR: This work presents an efficient segmentation framework using a preprocessing step for shadow suppression combined with a local thresholding technique based on a combination of difference of boxes filters and a new ternary segmentation, which are both simple low-level image operations.

Abstract: A robust segmentation is the most important part of an automatic character recognition system (e.g. document processing, license plate recognition etc.). In our contribution we present an efficient segmentation framework using a preprocessing step for shadow suppression combined with a local thresholding technique. The method is based on a combination of difference of boxes filters and a new ternary segmentation, which are both simple low-level image operations. We also draw parallels to a recently published work on a ganglion cell model and show that our approach is theoretically more substantiated as well as more robust and more efficient in practice. Systematic evaluation of noisy input data as well as results on a large dataset of license plate images show the robustness and efficiency of our proposed method. Our results can be applied easily to any optical character recognition system resulting in an impressive gain of robustness against nonlinear illumination.
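
For readers unfamiliar with difference of boxes filters, the following is a minimal generic sketch: the response at each pixel is the mean over a small box minus the mean over a larger box, with both means computed in constant time from an integral image. The radii, the border handling (boxes clipped to the image) and the names `integral_image`, `box_mean` and `difference_of_boxes` are assumptions made for illustration. The paper's shadow suppression and ternary segmentation steps are not reproduced.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_mean(ii, y, x, r):
    """Mean of the (2r+1) x (2r+1) box centred at (y, x), clipped to the image."""
    h, w = len(ii) - 1, len(ii[0]) - 1
    y0, y1 = max(0, y - r), min(h, y + r + 1)
    x0, x1 = max(0, x - r), min(w, x + r + 1)
    total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
    return total / ((y1 - y0) * (x1 - x0))


def difference_of_boxes(img, r_small, r_large):
    """Small-box mean minus large-box mean at every pixel."""
    ii = integral_image(img)
    return [[box_mean(ii, y, x, r_small) - box_mean(ii, y, x, r_large)
             for x in range(len(img[0]))]
            for y in range(len(img))]


img = [[0] * 7 for _ in range(7)]
img[3][3] = 9  # a single bright pixel
print(difference_of_boxes(img, r_small=0, r_large=1)[3][3])
```

Because the large-box mean tracks slowly varying illumination, subtracting it acts as a local reference level, which is the usual intuition behind using such a filter before thresholding.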

Proceedings Article•10.1109/DAS.2008.61•

Structural Mixtures for Statistical Layout Analysis

Faisal Shafait, J. van Beusekom¹, Daniel Keysers, Thomas M. Breuel¹•Institutions (1)

Kaiserslautern University of Technology¹

16 Sep 2008

TL;DR: A probabilistic matching algorithm is presented that gives multiple interpretations of input layout with associated probabilities that aims at solving the above mentioned problems for Manhattan layouts.

Abstract: A key limitation of current layout analysis methods is that they rely on many hard-coded assumptions about document layouts and cannot adapt to new layouts for which the underlying assumptions are not satisfied. Another major drawback of these approaches is that they do not return confidence scores for their outputs. These problems pose major challenges in large scale digitization efforts where a large number of different layouts need to be handled and manual inspection of the results on each individual page is not feasible. This paper presents a novel statistical approach to layout analysis that aims at solving the above-mentioned problems for Manhattan layouts. The presented approach models known page layouts as a structural mixture model. A probabilistic matching algorithm is presented that gives multiple interpretations of the input layout with associated probabilities. First experiments on documents from the publicly available MARG dataset achieved below 5% error rate for geometric layout analysis.

Proceedings Article•10.1109/DAS.2008.81•

Writer Verification of Arabic Handwriting

Sargur N. Srihari¹, Gregory R. Ball¹•Institutions (1)

University at Buffalo¹

16 Sep 2008

TL;DR: This work extends on an earlier study to objectively validate the hypothesis that handwriting is individualistic to include handwriting in the Arabic script, and uses global attributes of handwriting to determine the writer with a high degree of confidence.

Abstract: Expanding on an earlier study to objectively validate the hypothesis that handwriting is individualistic, we extend the study to include handwriting in the Arabic script. Handwriting samples from twelve native speakers of Arabic were obtained. Analyzing differences in handwriting was done by using computer algorithms for extracting features from scanned images of handwriting. Attributes characteristic of the handwriting were obtained, e.g., line separation, slant, character shapes, etc. These attributes, which are a subset of attributes used by forensic document examiners (FDEs), were used to quantitatively establish individuality by using machine learning approaches. Using global attributes of handwriting, the ability to determine the writer with a high degree of confidence was established. The work is a step towards providing scientific support for admitting handwriting evidence in court.

Proceedings Article•10.1109/DAS.2008.28•

State: A Multimodal Assisted Text-Transcription System for Ancient Documents

Albert Gordo, D. Llorens, A. Marzal, F. Prat, J.M. Vilar

16 Sep 2008

TL;DR: Some preliminary experiments show the productivity gains obtained with the system when transcribing a document and the error rate of the current recognition engine.

Abstract: We present a complete assisted transcription system for ancient documents: State. The system consists of two applications: a pen-based, interactive application to assist humans in transcribing ancient documents and a recognition engine which offers automatic transcriptions via a web service. The interaction model and the recognition algorithm employed in the current version of State are presented. Some preliminary experiments show the productivity gains obtained with the system when transcribing a document and the error rate of the current recognition engine.

Proceedings Article•10.1109/DAS.2008.82•

An Empirical Measure on the Set of Symbols Occurring in Engineering Mathematics Texts

Stephen M. Watt¹•Institutions (1)

University of Western Ontario¹

16 Sep 2008

TL;DR: This work examines second year university engineering mathematics as taught in North America as the domain, and presents an empirical analysis of the symbols and $n$-grams occurring in these expressions.

Abstract: Certain forms of mathematical expression are used more often than others in practice. A quantitative understanding of actual usage can provide additional information to improve the accuracy of software for the input of mathematical expressions from scanned documents or handwriting and more natural forms of presentation of mathematical expressions by computer algebra systems. Earlier work has examined this question for the diverse set of articles from the mathematics preprint archive arXiv.org. That analysis showed the variance between mathematical areas. The present work analyzes a particular mathematical domain more deeply. We have chosen to examine second-year university engineering mathematics as taught in North America as the domain. We have analyzed the set of expressions occurring in the most popular textbooks, weighted by popularity. Assuming that early training influences later mathematical usage, we take this as a model of the set of mathematical expressions used by the population of North American engineers. We present an empirical analysis of the symbols and $n$-grams occurring in these expressions.
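
The popularity-weighted counting described in this abstract can be illustrated with a minimal sketch. It assumes each textbook has already been reduced to lists of symbol tokens and assigned a popularity weight; the tokenization, the weights, and the function names `ngrams` and `weighted_ngram_counts` are hypothetical and are not taken from the paper.

```python
from collections import Counter

def ngrams(symbols, n):
    """Return all contiguous n-grams of a symbol sequence as tuples."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

def weighted_ngram_counts(corpus, n):
    """Accumulate n-gram counts, each occurrence weighted by its source's popularity.

    corpus: iterable of (weight, expressions) pairs, where expressions is a
    list of symbol sequences (an assumed representation, not the paper's).
    """
    counts = Counter()
    for weight, expressions in corpus:
        for expr in expressions:
            for gram in ngrams(expr, n):
                counts[gram] += weight
    return counts

# Two hypothetical textbooks with popularity weights 0.75 and 0.25.
corpus = [
    (0.75, [["x", "^", "2"], ["x", "+", "y"]]),
    (0.25, [["x", "^", "2"]]),
]
bigram_counts = weighted_ngram_counts(corpus, 2)
```

With these toy inputs, a bigram that appears in both books accumulates both weights, while one that appears only in the first book carries that book's weight alone.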

Proceedings Article•10.1109/DAS.2008.51•

Text String Extraction from Scene Image Based on Edge Feature and Morphology

Yuming Wang¹, N. Tanaka¹•Institutions (1)

Kobe University¹

16 Sep 2008

TL;DR: An algorithm that uses mathematical morphology to extract text effectively and uses the edge border ratio, based on the edge contrast feature of text regions in real scenes, to differentiate text regions from noise regions.

Abstract: Extracting text from a scene image is much more difficult than extracting it from a simple document image. Many studies have succeeded in extracting a single text string from an image but cannot deal with images that include many text strings. Moreover, the result may be mixed with noise that is similar to text. This paper describes an algorithm that uses mathematical morphology to extract text effectively; the edge border ratio is used to differentiate text regions from noise regions, using the edge contrast feature of text regions in real scenes. This paper also describes a method that connects characters into text strings and distributes text strings to different subimages according to their stroke width. The algorithm is applied to scene images such as signs and indicators as well as magazine covers, and its robustness is demonstrated.

Proceedings Article•10.1109/DAS.2008.9•

Object Extraction from Colour Cadastral Maps

Romain Raveaux¹, Jean-Christophe Burie¹, J.-M. Ogier¹•Institutions (1)

University of La Rochelle¹

16 Sep 2008

TL;DR: An object extraction method from ancient colour maps is proposed that consists of the localization of quarters inside a given cadastral map using a peeling-the-onion method; the colour aspect is exploited thanks to a colour restoration algorithm and a relevant hybrid colour model.

Abstract: In this paper, an object extraction method from ancient colour maps is proposed. It consists of the localization of quarters inside a given cadastral map. The colour aspect is exploited thanks to a colour restoration algorithm and the selection of a relevant hybrid colour model. Objects composing the map are located using a multi-component gradient. To identify quarters, a peeling-the-onion method is adopted. This selective method starts by separating text and graphics. On the graphic layer, a connected component analysis is carried out through the use of a neighbourhood graph. This graph is smartly pruned to consider only significant areas. Consequently, the quarter boundaries are found using a snake, which is a computer-generated curve that moves within an image to fit a given object. The performance of our method is measured in two steps: first, the colour space selection is assessed according to its colour distinction capacity and its robustness to variations and noise; then, the automatic extraction approach is compared to the user ground truth. Results show the good behaviour of the whole system.

Proceedings Article•10.1109/DAS.2008.87•

On the Reading of Tables of Contents

Prateek Sarkar¹, E. Saund¹•Institutions (1)

PARC¹

16 Sep 2008

TL;DR: A universal logical structure representation in terms of a hierarchy of entries, each of which may contain a descriptor and a locator is proposed for tables of contents (TOC) of books, journals, and magazines.

Abstract: This paper presents a framework for understanding tables of contents (TOC) of books, journals, and magazines. We propose a universal logical structure representation in terms of a hierarchy of entries, each of which may contain a descriptor and a locator. We enumerate graphical and perceptual cues that guide the parsing of tables of contents in terms of this formalism. We make initial suggestions about the form of evaluation metrics for comparing ground-truthed tables of contents with the output of recognition algorithms. Typical and atypical tables of contents are used throughout to illustrate significant phenomena that must be dealt with in principled ways in any general TOC interpretation scheme. Finally, we discuss implications of our observations on the design of recognition algorithms.
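
One way to picture the logical structure described here, a hierarchy of entries that may each carry a descriptor and a locator, is the small sketch below. The class name `TocEntry`, the field types, and the `flatten` helper are illustrative assumptions and are not the representation defined in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TocEntry:
    """One TOC entry; descriptor and locator are both optional (assumed types)."""
    descriptor: Optional[str] = None   # e.g. a chapter or article title
    locator: Optional[str] = None      # e.g. a page number, kept as text
    children: List["TocEntry"] = field(default_factory=list)

def flatten(entry, depth=0):
    """Walk the hierarchy depth-first and return (depth, descriptor, locator) rows."""
    rows = [(depth, entry.descriptor, entry.locator)]
    for child in entry.children:
        rows.extend(flatten(child, depth + 1))
    return rows

# A toy table of contents: a part heading without a locator, and nested entries.
toc = TocEntry(
    descriptor="Contents",
    children=[
        TocEntry("Part I", None, [TocEntry("Chapter 1", "3")]),
        TocEntry("Index", "210"),
    ],
)
rows = flatten(toc)
```

Keeping both fields optional lets the same type cover headings that have no page reference as well as entries that do.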

Proceedings Article•10.1109/DAS.2008.8•

Writer-Dependent Recognition of Handwritten Whiteboard Notes in Smart Meeting Room Environments

Marcus Liwicki¹, Andreas Schlapbach¹, Horst Bunke¹•Institutions (1)

University of Bern¹

16 Sep 2008

TL;DR: A writer-dependent handwriting recognition system based on hidden Markov models (HMMs) that operates in two stages, where a Gaussian mixture model (GMM)-based writer identification system developed for smart meeting rooms identifies the person writing on the whiteboard.

Abstract: In this paper we present a writer-dependent handwriting recognition system based on hidden Markov models (HMMs). This system, which has been developed in the context of research on smart meeting rooms, operates in two stages. First, a Gaussian mixture model (GMM)-based writer identification system developed for smart meeting rooms identifies the person writing on the whiteboard. Then a recognition system adapted to the individual writer is applied. Two different methods for obtaining writer-dependent recognizers are proposed. The first method uses the available writer-specific data to train an individual recognition system for each writer from scratch, while the second method takes a writer-independent recognizer and adapts it with the data from the considered writer. The experiments have been performed on the IAM-OnDB. In the first stage, the writer identification system produces a perfect identification rate. In the second stage, the writer-specific recognition system achieves significantly better recognition results than the writer-independent recognizer. The final word recognition rate on the IAM-OnDB-t1 benchmark task is close to 80%.

Proceedings Article•10.1109/DAS.2008.25•

Exploring Evolutionary Technical Trends from Academic Research Papers

Teng Kai Fan¹, Chia-Hui Chang¹•Institutions (1)

National Central University¹

16 Sep 2008

TL;DR: This work uses focused technical terms from research papers to explore technical trends in the research literature and defines this new text mining issue and applies machine learning algorithms for solving this problem.

Abstract: Automatic Term Recognition (ATR) is concerned with discovering terminology in large volumes of text corpora. Technical terms are vital elements for understanding the techniques used in academic research papers, and in this paper, we use focused technical terms to explore technical trends in the research literature. The major purpose of this work is to understand the relationship between techniques and research topics to better explore technical trends. We define this new text mining issue and apply machine learning algorithms for solving this problem by (1) recognizing focused technical terms from research papers; (2) classifying these terms into predefined technology categories; (3) analyzing the evolution of technical trends. The dataset consists of 656 papers collected from well-known conferences on ACM. The experimental results indicate that our proposed methods can effectively explore interesting evolutionary technical trends in various research topics.

Proceedings Article•10.1109/DAS.2008.13•

Named Entity Recognition by Neural Sliding Window

Ignazio Gallo, Elisabetta Binaghi, Moreno Carullo, N. Lamberti

16 Sep 2008

TL;DR: A NER algorithm which uses a Multi-Layer Perceptron (MLP) to find and classify entities in natural language text and implements a new supervised context-based NER approach called Sliding Window Neural (SWiN).

Abstract: Named Entity Recognition (NER) is an important subtask of document processing such as Information Extraction. This paper describes a NER algorithm which uses a Multi-Layer Perceptron (MLP) to find and classify entities in natural language text. In particular we use the MLP to implement a new supervised context-based NER approach called Sliding Window Neural (SWiN). The SWiN method is a good solution for domains where the documents are grammatically ill-formed and it is difficult to exploit the features derived from linguistic analysis. Experiments indicate good accuracy compared with traditional approaches and demonstrate the system's portability.
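
The abstract does not spell out how SWiN builds its inputs, so the following is only a generic sketch of the sliding-window idea: each token is presented to a classifier together with a fixed number of neighbours on either side. The function name `sliding_windows`, the `radius` parameter, and the padding token are assumptions; the MLP itself is omitted.

```python
def sliding_windows(tokens, radius=2, pad="<PAD>"):
    """Return one context window per token: the token plus `radius` neighbours per side.

    Positions beyond the ends of the text are filled with a padding token
    (an assumed convention, not taken from the paper).
    """
    padded = [pad] * radius + list(tokens) + [pad] * radius
    width = 2 * radius + 1
    return [padded[i:i + width] for i in range(len(tokens))]

# Each window would be encoded as a feature vector and classified by the MLP.
windows = sliding_windows(["Mr", "Smith", "arrived"], radius=1)
```

Because the windows rely only on neighbouring tokens, this kind of input needs no parser or other linguistic analysis, which matches the motivation given for grammatically ill-formed documents.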

Proceedings Article•10.1109/DAS.2008.64•

Keyword Matching in Historical Machine-Printed Documents Using Synthetic Data, Word Portions and Dynamic Time Warping

Thomas Konidaris, B. Gatos, Stavros Perantonis, Anastasios L. Kesidis

16 Sep 2008

TL;DR: A novel and efficient technique for finding keywords typed by the user in digitised machine-printed historical documents using the dynamic time warping (DTW) algorithm, which significantly prunes the list of candidate words, thus speeding up the entire process.

Abstract: In this paper we propose a novel and efficient technique for finding keywords typed by the user in digitised machine-printed historical documents using the dynamic time warping (DTW) algorithm. The method uses word portions located at the beginning and end of each segmented word of the processed documents and tries to estimate the position of the first and last characters in order to reduce the list of candidate words. Since DTW can become computationally intensive on large datasets, the proposed method significantly prunes the list of candidate words, thus speeding up the entire process. Word length is also used as a means of further reducing the data to be processed. Results are improved in terms of time and efficiency compared to those produced if no pruning is done to the list of candidate words.
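
As a rough illustration of why pruning helps, the sketch below pairs a textbook DTW distance with a simple length-based filter applied before matching. It assumes words are represented as 1-D feature sequences (plain lists of numbers); the function names, the absolute-difference local cost, and the 25% length tolerance are hypothetical, and the paper's pruning based on word portions is not modelled here.

```python
from math import inf

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def prune_by_length(query, candidates, tolerance=0.25):
    """Keep candidates whose length is within a relative tolerance of the query's."""
    lo = len(query) * (1 - tolerance)
    hi = len(query) * (1 + tolerance)
    return [c for c in candidates if lo <= len(c) <= hi]

def rank_candidates(query, candidates, tolerance=0.25):
    """Prune first, then sort the survivors by DTW distance to the query."""
    kept = prune_by_length(query, candidates, tolerance)
    return sorted(kept, key=lambda c: dtw_distance(query, c))

ranked = rank_candidates([1, 2, 3, 4], [[1, 2, 3, 5], [1, 2, 3, 4], [9, 9]])
```

Each DTW comparison costs time proportional to the product of the two sequence lengths, so every candidate discarded by the cheap length check saves a full quadratic computation.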
