Enhancing document structure analysis using visual analytics
Andreas Stoffel,David Spretke,Henrik Kinnemann,Daniel A. Keim +3 more
- 22 Mar 2010
- pp 8-12
TL;DR: A new approach for analyzing the logical structure of text documents is presented, combining state-of-the-art machine learning with novel interactive visualization techniques, allowing a quick adaptation of the structure analysis process to unknown document classes and new tasks without requiring a predefined training set.
Abstract: During the last decade, national archives, libraries, museums and companies have started to make their records, books and files available electronically. To allow efficient access to this information, the content of the documents must be stored in database and information retrieval systems. State-of-the-art indexing techniques mostly rely on the information explicitly available in the text portions of documents. However, documents usually contain a significant amount of implicit information, such as their logical structure, which is not directly accessible (unless the documents are available as well-structured XML files) and is therefore not used in the search process. In this paper, a new approach for analyzing the logical structure of text documents is presented. The problem with state-of-the-art methods is that they have been developed for a particular type of document and can only handle documents of that type. In most cases, adaptation and re-training for a different document type is not possible. Our proposed method combines state-of-the-art machine learning with novel interactive visualization techniques, allowing an efficient and effective adaptation of the structure analysis process to unknown document classes and new tasks without requiring a predefined training set.
Figures

Figure 1: The different components of the system.
Figure 2: Visualization of the structure analysis results.
Figure 3: Average F-measure of the structure analysis in the different reference data iterations.
Table 1: Performance of different algorithms on INTEGERS articles.
Table 2: Performance of different algorithms on computer science publications.
Table 4: Performance of the proposed system on product manuals.
Citations
ICDAR 2013 Table Competition
Max Göbel,Tamir Hassan,Ermelinda Oro,Giorgio Orsi +3 more
- 25 Aug 2013
TL;DR: The Table Competition held in the context of ICDAR 2013 is the first attempt at objectively evaluating table recognition techniques against each other in a standardized way, across several input formats.
Using structural information and citation evidence to detect significant plagiarism cases in scientific publications
TL;DR: An empirical study on the system's response shows that structural information, unlike existing plagiarism detectors, helps to flag significant plagiarism cases, improve the similarity index, and provide human-like plagiarism screening results.
Visual readability analysis: How to make your writings easier to read
Daniela Oelke,David Spretke,Andreas Stoffel,Daniel A. Keim +3 more
- 10 Dec 2010
TL;DR: A semi-automatic feature selection approach is discussed that is used to choose appropriate measures from a collection of 141 candidate readability features and the visual analysis tool VisRA is presented, which allows the user to analyze the feature values across the text and within single sentences.
CanvasEmb: Learning Layout Representation with Large-scale Pre-training for Graphic Design
Yuxi Xie,Danqing Huang,Jinpeng Wang,Chin-Yew Lin +3 more
- 17 Oct 2021
TL;DR: CanvasEmb pre-trains deep representations from unlabeled graphic designs by jointly conditioning on all the context elements in a canvas, with a multi-dimensional feature encoder and a multi-task learning objective.