TL;DR: This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents and proposes a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied.
Abstract: This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
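The contrast drawn in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the saturation form tf(k1 + 1)/(tf + k1), the value k1 = 1.2, the example document, and all function names are assumptions made for illustration, and IDF and document-length normalization are left out to keep the comparison visible.

```python
from collections import Counter

K1 = 1.2  # assumed saturation parameter; not taken from the paper


def saturate(tf, k1=K1):
    # BM25-style saturation: grows with tf but never exceeds k1 + 1.
    return tf * (k1 + 1) / (tf + k1)


def flatten(doc, weights):
    # Map a structured document to an unstructured one by repeating
    # each field's tokens `weight` times (integer weights only).
    tokens = []
    for field, text in doc.items():
        tokens.extend(text.split() * weights[field])
    return tokens


def score_weighted_tf(doc, weights, term):
    # Weight term frequencies first, then saturate once.
    return saturate(Counter(flatten(doc, weights))[term])


def score_linear(doc, weights, term):
    # Per-field approach: saturate each field separately,
    # then combine the field scores linearly.
    return sum(weights[f] * saturate(Counter(text.split())[term])
               for f, text in doc.items())


# Hypothetical two-field document and field weights.
doc = {"title": "bm25 ranking",
       "body": "ranking structured documents with bm25"}
weights = {"title": 5, "body": 1}

print(score_weighted_tf(doc, weights, "bm25"))
print(score_linear(doc, weights, "bm25"))
```

Under these assumptions the weighted-frequency score stays below the saturation ceiling of k1 + 1, while the linear combination exceeds it, because a single title occurrence is multiplied by the field weight after saturation has already been applied.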
TL;DR: This work considers what information is needed to retrieve effectively and shows that knowledge of the structure of documents can lead to improved retrieval performance.
Abstract: Information systems usually retrieve whole documents as answers to queries. However, it may in some circumstances be more appropriate to retrieve parts of documents. We consider formulas for retrieving whole documents and parts of documents from a large structured document collection. We consider what information is needed to retrieve effectively and show that knowledge of the structure of documents can lead to improved retrieval performance.
TL;DR: A method that receives a first user action relating to a first topic, identifies the topic from that action, finds one or more second posts relating to the topic, and transmits those posts, or information associated with them, to the first user in a structured document with interactive elements.
Abstract: In one embodiment, a method includes receiving a first user action relating to a first topic from a first user, identifying the first topic based on the first user action, identifying one or more second posts that relate to the first topic, and transmitting to the first user one or more of the second posts or information associated with the second posts in a structured document for display to the first user, the structured document further comprising one or more interactive elements that enable the first user to interact with the one or more second posts or to respective second users that declared the second posts.
TL;DR: An intelligent document recognition-based document management system includes modules for image capture, image enhancement, image identification, optical character recognition (OCR), data extraction, and quality assurance.
Abstract: An intelligent document recognition-based document management system (Fig. 2) includes modules for image capture (32), image enhancement (32), image identification (34), optical character recognition (36), data extraction (37) and quality assurance (42). The system captures data from electronic documents as diverse as facsimile images, scanned images and images from document management systems. It processes these images and presents the data in, for example, a standard XML format. The document management system processes both structured document images (40) (ones which have a standard format) and unstructured document images (38) (ones which do not have a standard format). The system can extract images directly from a facsimile machine, a scanner or a document management system for processing.
TL;DR: This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems.
Abstract: Document image analysis is the automatic computer interpretation of images of printed and handwritten documents, including text, drawings, maps, music scores, etc. Research in this field supports a rapidly growing international industry. This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems. A unique feature is the extended section on music notation, an ideal vehicle for international sharing of basic research. Also, the collection includes important new work on line drawings, handwriting, character and symbol recognition, and basic methodological issues. The IAPR 1990 Workshop on Syntactic and Structural Pattern Recognition is summarized, including the reports of its expert working groups, whose debates provide a fascinating perspective on the field. The book is an excellent text for a first-year graduate seminar in document image analysis, and is likely to remain a standard reference in the field for years.