Open Access · Posted Content
Hierarchical Metadata-Aware Document Categorization under Weak Supervision
TL;DR: This paper proposes a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information and textual semantics, and introduces a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
Abstract: Categorizing documents into a given label hierarchy is intuitively appealing because hierarchical topic structures are ubiquitous in massive text corpora. Although related studies have achieved satisfactory performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and use only text information. However, in many domains, (1) annotations are so expensive that very few training samples can be acquired, and (2) documents are accompanied by metadata. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information, and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules.
Citations
Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification
Yu Yvette Zhang, Zhihong Shen, Chieh-Han Wu, Boya Xie, Junheng Hao, Yexin Wang, Kuansan Wang, Jiawei Han
- 11 Feb 2022
TL;DR: Experimental results show that MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines and is on par with the state-of-the-art supervised metadata-aware LMTC method trained on 10K–200K labeled documents. It also tends to predict more infrequent labels than supervised methods, thus alleviating the deteriorated performance on long-tailed labels.
dhCM: Dynamic and Hierarchical Event Categorization and Discovery for Social Media Stream
Guo Jinjin, Gong Zhiguo, Cao Longbing
TL;DR: Online event discovery in social media documents is useful, for example for disaster recognition and intervention, but the diversity of events incrementally identified from social media streams still needs to be addressed.
6
Who Should Review Your Proposal? Interdisciplinary Topic Path Detection for Research Proposals
TL;DR: A deep Hierarchical Interdisciplinary Research Proposal Classification Network (HIRPCN) is developed, which proposes a hierarchical transformer to extract the textual semantic information of proposals and designs a level-wise prediction component to fuse the two types of knowledge representations and detect interdisciplinary topic paths for each proposal.
3
Partial label learning for automated classification of single-cell transcriptomic profiles.
Malek Senoussi, Thierry Artières, Paul Villoutreix
TL;DR: Overall, the findings show how hierarchical and non-hierarchical partial label learning strategies can help solve the problem of automated classification of single-cell transcriptomic profiles. Interestingly, these methods rely on a much less stringent type of annotated dataset than fully supervised learning methods.
2
Adapting Pretrained Representations for Text Mining
Yu Meng, Jian Huang, Yu Yvette Zhang, Jiawei Han
- 14 Aug 2022
TL;DR: This tutorial introduces recent advances in pretrained text representations, as well as their applications to a wide range of text mining tasks, and focuses on minimally-supervised approaches that do not require massive human annotations.
1