Structured document

Topic Tools

Papers published on a yearly basis

Papers

Proceedings Article • 10.1145/1031171.1031181

Simple BM25 extension to multiple weighted fields

Stephen Robertson¹, Hugo Zaragoza¹, Michael J. Taylor¹•Institutions (1)

Microsoft¹

13 Nov 2004

TL;DR: This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents and proposes a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied.

Abstract: This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
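
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard BM25 saturation form tf·(k1+1)/(tf+k1), omits document-length normalization (b = 0), and uses a made-up idf table, field weights, and whitespace tokenizer. The function names are hypothetical.

```python
from collections import Counter

K1 = 1.2  # assumed, commonly used saturation parameter; not taken from the paper


def saturate(tf, k1=K1):
    """BM25 term-frequency saturation with length normalization omitted (b = 0)."""
    return tf * (k1 + 1) / (tf + k1)


def score_linear_combination(query, fields, weights, idf):
    """Baseline: score each field on its own, then combine the scores linearly."""
    total = 0.0
    for name, text in fields.items():
        tf = Counter(text.lower().split())
        field_score = sum(idf.get(t, 0.0) * saturate(tf[t]) for t in query)
        total += weights[name] * field_score
    return total


def score_weighted_tf(query, fields, weights, idf):
    """Alternative: weight term frequencies per field, sum them, then saturate once."""
    combined = Counter()
    for name, text in fields.items():
        for term, count in Counter(text.lower().split()).items():
            combined[term] += weights[name] * count
    return sum(idf.get(t, 0.0) * saturate(combined[t]) for t in query)


# Hypothetical document, weights, and idf values, for illustration only.
doc = {"title": "bm25 ranking", "body": "bm25 bm25 fields"}
weights = {"title": 2, "body": 1}
idf = {"bm25": 1.0}
query = ["bm25"]

linear = score_linear_combination(query, doc, weights, idf)
weighted = score_weighted_tf(query, doc, weights, idf)

# A title weight of two is equivalent to repeating the title twice in a flat document.
flat = {"all": "bm25 ranking bm25 ranking bm25 bm25 fields"}
flat_score = score_weighted_tf(query, flat, {"all": 1}, idf)

print(f"linear combination: {linear:.4f}")
print(f"weighted tf:        {weighted:.4f}")
print(f"flattened document: {flat_score:.4f}")
```

With a single query term, the saturation function can never exceed k1 + 1. The linear combination can exceed that bound because each field saturates separately, which is the breakage the abstract describes. The weighted-frequency score stays within the bound and matches the score of the flattened document.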

860 citations

Proceedings Article • 10.5555/188490.188591

Effective retrieval of structured documents

Ross Wilkinson¹•Institutions (1)

RMIT University¹

1 Aug 1994

TL;DR: This work considers what information is needed to retrieve effectively and shows that knowledge of the structure of documents can lead to improved retrieval performance.

Abstract: Information systems usually retrieve whole documents as answers to queries. However, it may in some circumstances be more appropriate to retrieve parts of documents. We consider formulas for retrieving whole documents and parts of documents from a large structured document collection. We consider what information is needed to retrieve effectively and show that knowledge of the structure of documents can lead to improved retrieval performance.

305 citations

Patent

Facilitating interaction among users of a social network

Spencer Greg Ahrens¹, Cameron Marlow¹, Lars Backstrom¹, Chaitanya Mishra¹•Institutions (1)

Facebook¹

30 Jun 2011

TL;DR: This patent describes a method that receives a first user action relating to a first topic, identifies the topic from that action, finds one or more second posts that relate to the topic, and transmits those posts, or information associated with them, to the first user in a structured document.

Abstract: In one embodiment, a method includes receiving a first user action relating to a first topic from a first user, identifying the first topic based on the first user action, identifying one or more second posts that relate to the first topic, and transmitting to the first user one or more of the second posts or information associated with the second posts in a structured document for display to the first user, the structured document further comprising one or more interactive elements that enable the first user to interact with the one or more second posts or to respective second users that declared the second posts.

234 citations

Patent

Document management system with enhanced intelligent document recognition capabilities

Suresh S. Pandian, Thyagarajan Swaminathan, Subramaniyan Neelagandan, Krishna K. Srinivasan, Randal J. Martin (+1 more)

10 Jun 2005

TL;DR: An intelligent document recognition-based document management system that includes modules for image capture, image enhancement, image identification, optical character recognition (OCR), data extraction, and quality assurance.

Abstract: An intelligent document recognition-based document management system (Fig. 2) includes modules for image capture (32), image enhancement (32), image identification (34), optical character recognition (36), data extraction (37) and quality assurance (42). The system captures data from electronic documents as diverse as facsimile images, scanned images and images from document management systems. It processes these images and presents the data in, for example, a standard XML format. The document management system processes both structured document images (40) (ones which have a standard format) and unstructured document images (38) (ones which do not have a standard format). The system can extract images directly from a facsimile machine, a scanner or a document management system for processing.

233 citations

Book

Structured Document Image Analysis

Henry S. Baird, Horst Bunke, Kazuhiko Yamamoto

1 Nov 1992

TL;DR: This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems.

Abstract: Document image analysis is the automatic computer interpretation of images of printed and handwritten documents, including text, drawings, maps, music scores, etc. Research in this field supports a rapidly growing international industry. This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems. A unique feature is the extended section on music notation, an ideal vehicle for international sharing of basic research. Also, the collection includes important new work on line drawings, handwriting, character and symbol recognition, and basic methodological issues. The IAPR 1990 Workshop on Syntactic and Structural Pattern Recognition is summarized, including the reports of its expert working groups, whose debates provide a fascinating perspective on the field. The book is an excellent text for a first-year graduate seminar in document image analysis, and is likely to remain a standard reference in the field for years.

231 citations

Year	Papers
2021	11
2020	16
2019	21
2018	21
2017	15
2016	17
