Domain-agnostic Document Representation Learning Using Latent Topics and Metadata

doi:10.32473/FLAIRS.V34I1.128388

Open AccessJournal Article10.32473/FLAIRS.V34I1.128388

Domain-agnostic Document Representation Learning Using Latent Topics and Metadata

Natraj Raman, +3 more

- 18 Apr 2021

- Vol. 34, Iss: 1

2

TL;DR: This work generates document representations that capture both text and metadata in a task agnostic manner and demonstrates through extensive evaluation that the proposed cross-model fusion solution outperforms several competitive baselines on multiple domains.

Abstract: Fine-tuning a pre-trained neural language model with a task specific output layer is the de facto approach of late when dealing with document classification. This technique is inadequate when labeled examples are unavailable at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata in a task agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft-partition of the input space when generating text embeddings by employing a pre-learned topic model distribution as surrogate labels. Our solution also incorporates metadata explicitly rather than just augmenting them with text. The generated document embeddings exhibit compositional characteristics and are directly used by downstream classification tasks to create decision boundaries from a small number of labels, thereby eschewing complicated recognition methods. We demonstrate through extensive evaluation that our proposed cross-model fusion solution outperforms several competitive baselines on multiple domains.

Chat with Paper

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Citations

•Proceedings Article•10.1109/icassp49357.2023.10095058

Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

04 Jun 2023

TL;DR: In this article , the authors investigate an approach that relies on contrastive learning and music metadata as a weak source of supervision to train music representation models and find that creating anchor and positive track pairs by relying on co-occurrences in playlists provides better music similarity and competitive classification results compared to choosing tracks from the same artist as in previous works.

...read moreread less

2

Proceedings Article•10.48550/arXiv.2304.12257

Pre-Training Strategies Using Contrastive Learning and Playlist Information for Music Classification and Similarity

Pablo Alonso-Jiménez, +5 more

- 24 Apr 2023

TL;DR: In this article , the authors investigate an approach that relies on contrastive learning and music metadata as a weak source of supervision to train music representation models and find that creating anchor and positive track pairs by relying on co-occurrences in playlists provides better music similarity and competitive classification results compared to choosing tracks from the same artist as in previous works.

...read moreread less

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

...read moreread less

94.2K

Proceedings Article•10.3115/V1/D14-1162

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.

...read moreread less

41.6K

•Journal Article•10.5555/944919.944937

Latent dirichlet allocation

David M. Blei, +2 more

- 01 Mar 2003

- Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

...read moreread less

36.2K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

...read moreread less

26.2K

•Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

- 03 Jan 2001

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hof-mann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

...read moreread less

25.5K