TL;DR: An intelligent document recognition-based document management system includes modules for image capture, image enhancement, image identification, optical character recognition (OCR), data extraction, and quality assurance.
Abstract: An intelligent document recognition-based document management system (Fig. 2) includes modules for image capture (32), image enhancement (32), image identification (34), optical character recognition (36), data extraction (37) and quality assurance (42). The system captures data from electronic documents as diverse as facsimile images, scanned images and images from document management systems. It processes these images and presents the data in, for example, a standard XML format. The document management system processes both structured document images (40) (ones which have a standard format) and unstructured document images (38) (ones which do not have a standard format). The system can extract images directly from a facsimile machine, a scanner or a document management system for processing.
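The final step described above, presenting extracted data in a standard XML format, might look roughly like the following minimal sketch. The element names (`document`, `field`), the `type` attribute, and the sample field values are assumptions made for illustration; the abstract does not specify a schema.

```python
import xml.etree.ElementTree as ET


def fields_to_xml(doc_type, fields):
    """Serialize extracted fields as a small XML document.

    The element and attribute names are illustrative assumptions,
    not the schema used by the system described in the abstract.
    """
    root = ET.Element("document", attrib={"type": doc_type})
    for name, value in fields.items():
        field = ET.SubElement(root, "field", attrib={"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical result of the OCR and data extraction steps for one image.
extracted = {"invoice_number": "INV-001", "total": "125.00"}
xml_text = fields_to_xml("structured", extracted)
print(xml_text)
```

A downstream consumer can parse the string back with `ET.fromstring` and look fields up by name, which is the main benefit of emitting a uniform format regardless of whether the image came from a fax, a scanner, or a document management system.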
TL;DR: This work proposes using WISDOM++, a document processing system that relies heavily on machine learning techniques, to automatically transform printed documents into a Web-accessible form such as XML, and reports promising results from preliminary experiments.
Abstract: One of the aims of the EU project COLLATE is to design and implement a Web-based collaboratory for archives, scientists and end-users working with digitized cultural material. Since the originals of such material are often unique and scattered across various archives, making them widely accessible poses severe problems. A solution would be to develop intelligent document processing tools that automatically transform printed documents into a Web-accessible form such as XML. Here, we propose the use of a document processing system, WISDOM++, which relies heavily on machine learning techniques to perform this task, and report promising results obtained in preliminary experiments.
TL;DR: The general framework, feature extraction modules, query capabilities, a graphical query interface, and the application interface are introduced; each component of the system is demonstrated, along with how the query mechanisms can be used to handle both content and structural queries effectively.
Abstract: Work has recently begun on a joint project between the Universities of Maryland and Oulu on the development of a system for Intelligent Document Image Retrieval (IDIR). The IDIR system will provide close connections with and utilization of document analysis and image processing techniques, advanced computing and networking, and modern approaches to database management. The system design consists of aggressively modularized components to enhance the development of individual parts which are used in the complete solution, including: interface specifications, multipurpose feature extraction, an integrated efficient query language, physical retrieval from an object-oriented database, and delivery of retrieved objects. In this paper, we introduce the general framework, feature extraction modules, query capabilities, a graphical query interface, and the application interface. We demonstrate each component of the system and how the query mechanisms can be used to handle both content and structural queries effectively.
TL;DR: This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals.
Abstract: A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents.
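To make the idea of learning a classifier for segmented blocks concrete, here is a deliberately simplified sketch. It learns a one-split decision tree (a stump) in batch mode, which is not the incremental decision tree learner the abstract describes. The feature names (`height`, `width`), the labels, and the training values are all hypothetical.

```python
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def learn_stump(samples, labels):
    """Learn a one-split decision tree over numeric block features.

    Tries every feature and every midpoint between consecutive distinct
    values, and keeps the split with the fewest training errors.
    """
    best = None
    for feature in samples[0]:
        values = sorted({s[feature] for s in samples})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [l for s, l in zip(samples, labels) if s[feature] <= threshold]
            right = [l for s, l in zip(samples, labels) if s[feature] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = sum(l != left_label for l in left)
            errors += sum(l != right_label for l in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no feature takes more than one distinct value")
    return best[1:]


def classify(stump, block):
    """Classify one block with a learned stump."""
    feature, threshold, left_label, right_label = stump
    return left_label if block[feature] <= threshold else right_label


# Hypothetical training data: geometric features of segmented blocks.
blocks = [
    {"height": 12, "width": 300},
    {"height": 14, "width": 280},
    {"height": 200, "width": 250},
    {"height": 180, "width": 310},
]
block_labels = ["text", "text", "image", "image"]

stump = learn_stump(blocks, block_labels)
print(classify(stump, {"height": 15, "width": 290}))
```

On this toy data the learner splits on `height`, since that feature separates the two classes without error. A full tree would apply the same split search recursively, and an incremental learner would revise the tree as new training documents arrive instead of retraining from scratch.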
TL;DR: A network device includes a content processing module configured to perform intelligent document content processing, such as confidential information processing, content optimization and workflow optimization, on electronic document data based upon the authenticated user's preference data.
Abstract: A network device includes a content processing module that is configured to perform intelligent document content processing, such as confidential information processing, content optimization and workflow optimization. The network device authenticates a user and determines electronic document data that is to be processed. The electronic document data may be created at the network device, e.g., by a scanning module on the network device, or at a client device, e.g., by a word processing application executing on the client device. The content processing module retrieves particular user preference data based upon the user authentication. The particular user preference data may specify confidential information preferences, content optimization preferences and/or workflow preferences. The content processing module performs intelligent document content processing on the electronic document data based upon the particular user preference data and generates processed electronic document data.
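The preference-driven flow described above can be sketched as follows. This is only an illustration under stated assumptions: the preference schema (`redact_patterns`, `collapse_whitespace`), the user names, and the choice of regular-expression masking as the form of confidential information processing are all hypothetical, and authentication is assumed to have already happened.

```python
import re

# Hypothetical per-user preference store; the abstract specifies no schema.
USER_PREFERENCES = {
    "alice": {
        "redact_patterns": [r"\b\d{3}-\d{2}-\d{4}\b"],
        "collapse_whitespace": True,
    },
    "bob": {"redact_patterns": [], "collapse_whitespace": False},
}


def process_document(user, text):
    """Apply an already-authenticated user's preferences to document text."""
    prefs = USER_PREFERENCES[user]
    # Confidential information processing: mask every match of each pattern.
    for pattern in prefs["redact_patterns"]:
        text = re.sub(pattern, "[REDACTED]", text)
    # Content optimization (illustrative only): collapse runs of whitespace.
    if prefs["collapse_whitespace"]:
        text = re.sub(r"\s+", " ", text).strip()
    return text


scanned = "Employee  ID:   123-45-6789\nStatus: active"
print(process_document("alice", scanned))
print(process_document("bob", scanned))
```

The same input yields different processed output for each user, which is the point of keying the processing on retrieved preference data instead of on fixed device settings.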