A Versatile Hypergraph Model for Document Collections

doi:10.1145/3400903.3400919

Proceedings Article


- 07 Jul 2020


TL;DR: The paper introduces heterogeneous hypergraphs as a versatile model for representing annotated document collections. The model integrates external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint representation that bridges the gap between structured and unstructured data.

Abstract: Efficiently and effectively representing large collections of text is of central importance to information retrieval tasks such as summarization and search. Since models for these tasks frequently rely on an implicit graph structure of the documents or their contents, graph-based document representations are naturally appealing. For tasks that consider the joint occurrence of words or entities, however, existing document representations often fall short in capturing cooccurrences of higher order, higher multiplicity, or at varying proximity levels. Furthermore, while numerous applications benefit from structured knowledge sources, external data sources are rarely considered as integral parts of existing document models. To address these shortcomings, we introduce heterogeneous hypergraphs as a versatile model for representing annotated document collections. We integrate external metadata, document content, entity and term annotations, and document segmentation at different granularity levels in a joint model that bridges the gap between structured and unstructured data. We discuss selection and transformation operations on the set of hyperedges, which can be chained to support a wide range of query scenarios. To ensure compatibility with established information retrieval methods, we discuss projection operations that transform hyperedges to traditional dyadic cooccurrence graph representations. Using PostgreSQL and Neo4j, we investigate the suitability of existing database systems for implementing the hypergraph document model, and explore the impact of utilizing implicit and materialized hyperedge representations on storage space requirements and query performance.
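The selection, transformation, and projection operations mentioned in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation (the paper evaluates PostgreSQL and Neo4j), and every name here (`Hyperedge`, `select`, `restrict_nodes`, `project`) as well as the toy data is a hypothetical choice made for illustration. It assumes that a hyperedge is a set of typed nodes occurring together in one text segment, and that projection counts each unordered node pair once per hyperedge.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hyperedge:
    """Hypothetical hyperedge: typed nodes that co-occur in one segment."""
    doc_id: str
    level: str          # segmentation granularity, e.g. "sentence"
    nodes: frozenset    # set of (node_type, value) pairs


def select(edges, predicate):
    """Selection: keep only the hyperedges that satisfy the predicate."""
    return [e for e in edges if predicate(e)]


def restrict_nodes(edges, node_type):
    """Transformation: shrink each hyperedge to nodes of a single type."""
    return [
        Hyperedge(e.doc_id, e.level,
                  frozenset(n for n in e.nodes if n[0] == node_type))
        for e in edges
    ]


def project(edges):
    """Projection: expand hyperedges into a weighted dyadic graph.

    Each unordered pair of nodes inside a hyperedge adds 1 to the
    weight of the corresponding edge (an assumed weighting scheme).
    """
    weights = Counter()
    for e in edges:
        for u, v in combinations(sorted(e.nodes), 2):
            weights[(u, v)] += 1
    return weights


# Toy collection (invented data, for illustration only).
edges = [
    Hyperedge("d1", "sentence", frozenset({
        ("entity", "Berlin"), ("entity", "Germany"), ("term", "capital")})),
    Hyperedge("d1", "sentence", frozenset({
        ("entity", "Berlin"), ("term", "wall")})),
    Hyperedge("d2", "sentence", frozenset({
        ("entity", "Berlin"), ("entity", "Germany"), ("entity", "Paris")})),
]

# Chain the operations: select sentence-level hyperedges, keep entity
# nodes only, then project to an entity cooccurrence graph.
sentence_edges = select(edges, lambda e: e.level == "sentence")
entity_edges = restrict_nodes(sentence_edges, "entity")
graph = project(entity_edges)
```

Because each operation takes and returns a plain list of hyperedges, the calls compose in any order before the final projection, which mirrors the chaining idea described in the abstract. Other weighting schemes, such as proximity-based weights, would only change the body of `project`.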


