Hierarchical Metadata-Aware Document Categorization under Weak Supervision
Yu Zhang, Xiusi Chen, Yu Meng, Jiawei Han
- 08 Mar 2021
- pp 770-778
19
TL;DR: This paper proposes a framework for document categorization under weak supervision that combines a joint representation learning module, which simultaneously models category dependencies, metadata information, and textual semantics, with a data augmentation module, which hierarchically synthesizes training documents to complement the original, small-scale training set.
Abstract: Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfactory performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are expensive, so very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information, and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules.
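To make the idea of hierarchically synthesizing training documents more concrete, here is a minimal toy sketch in Python. It is not HiMeCat's actual algorithm: the abstract does not give the generative procedure, so the label hierarchy (PARENT), the seed words (SEED_WORDS), the mixing weight (leaf_weight), and all function names below are hypothetical. The sketch only illustrates the general pattern of sampling words for a category while also drawing on its ancestors in the label hierarchy.

```python
import random

# Hypothetical toy label hierarchy: child -> parent (None marks a top-level category).
PARENT = {
    "cs": None,
    "machine_learning": "cs",
    "databases": "cs",
}

# Hypothetical seed words per category. A real system would more likely
# derive word distributions from learned embeddings than from fixed lists.
SEED_WORDS = {
    "cs": ["algorithm", "system", "computation"],
    "machine_learning": ["neural", "training", "classifier"],
    "databases": ["query", "index", "transaction"],
}


def path_to_root(label):
    """Return the label followed by its ancestors, leaf first."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def synthesize_document(label, length, rng, leaf_weight=0.7):
    """Sample a pseudo-document for `label`, mixing its seed words with its ancestors'."""
    ancestors = path_to_root(label)[1:]
    words = []
    for _ in range(length):
        # Draw from the category itself with probability leaf_weight,
        # otherwise from a uniformly chosen ancestor (if any exist).
        if not ancestors or rng.random() < leaf_weight:
            source = label
        else:
            source = rng.choice(ancestors)
        words.append(rng.choice(SEED_WORDS[source]))
    return words


def augment(labels, docs_per_label, length, seed=0):
    """Build a synthetic training set of (words, label) pairs."""
    rng = random.Random(seed)
    return [
        (synthesize_document(label, length, rng), label)
        for label in labels
        for _ in range(docs_per_label)
    ]
```

Calling augment(["machine_learning", "databases"], docs_per_label=3, length=8) yields six labeled pseudo-documents that could be appended to a small human-labeled training set. Fixing the seed keeps the synthetic set reproducible across runs.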
Citations
Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification
Yu Yvette Zhang, Zhihong Shen, Chieh-Han Wu, Boya Xie, Junheng Hao, Yexin Wang, Kuansan Wang, Jiawei Han
- 11 Feb 2022
TL;DR: Experimental results show that MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines and is on par with the state-of-the-art supervised metadata-aware LMTC method trained on 10K–200K labeled documents. MICoL also tends to predict more infrequent labels than supervised methods, thus alleviating the deteriorated performance on long-tailed labels.
•Posted Content
Coarse2Fine: Fine-grained Text Classification on Coarsely-grained Annotated Data
TL;DR: The authors propose a coarse-to-fine approach that performs fine-grained classification on coarsely annotated data, using label surface names as the only human guidance and incorporating pre-trained generative language models into an iterative weak supervision strategy.
16
MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information
11 Feb 2022
TL;DR: MotifClass uses a heterogeneous information network to capture higher-order structures, and uses motifs to describe metadata combinations, in order to help weakly supervised text classification.
dhCM: Dynamic and Hierarchical Event Categorization and Discovery for Social Media Stream
Guo Jinjin, Gong Zhiguo, Cao Longbing
TL;DR: Online event discovery in social media documents is useful, for example for disaster recognition and intervention, but the diverse events incrementally identified from social media streams need to be addressed.
6
Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers
Yu Zhang, Bowen Jin, Xiusi Chen, Yanzhen Shen, Yunyi Zhang, Meng Yu, Jiawei Han
- 04 Aug 2023
TL;DR: Weakly supervised multi-label classification of full-text scientific papers focuses on classifying papers into coarse-grained research topics and fine-grained themes using category descriptions and full text. The proposed framework, FUTEX, leverages the cross-paper network structure and the in-paper hierarchy structure to achieve competitive performance.
3
References
•Journal Article
Visualizing Data using t-SNE
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Hierarchical Attention Networks for Document Classification
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, Eduard Hovy
- 13 Jun 2016
TL;DR: Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin.
LINE: Large-scale Information Network Embedding
Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, Qiaozhu Mei
- 18 May 2015
TL;DR: A novel network embedding method called LINE, which is suitable for arbitrary types of information networks (undirected, directed, and/or weighted) and optimizes a carefully designed objective function that preserves both the local and global network structures.
4.2K
Knowledge Graph Embedding: A Survey of Approaches and Applications
TL;DR: This article provides a systematic review of existing knowledge graph embedding techniques, covering both state-of-the-art methods and recent trends, organized by the type of information used in the embedding task.
2.8K