DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM

doi:10.18653/v1/2023.findings-emnlp.606

Weijie Xu, +3 more

- 01 Jan 2023

TL;DR: DeTiME is a topic modeling framework built on an encoder-decoder LLM; it generates highly clusterable embeddings and uses diffusion to enhance its topic generation capabilities.

Figures

Figure 4: The histogram of survey scores from fluency, grammar, and redundancy perspectives.

Figure 1: A summary of a few of our findings: (1) Our embeddings outperform the best clusterable methods (selected from (Muennighoff et al., 2022)). (2) The same framework with a slightly different fine-tuning task (DeTiME Training) does not perform well. (3) When compressed, our embeddings excel in higher dimensions, making them ideal for topic modeling. Detailed settings are in Appendix E.

Table 1: The main results for all clusterability metrics, diversity, and coherence (Cv). The number of topics is 20. The best and second-best scores of each dataset are highlighted in boldface and with an underline, respectively. The result represents the average value obtained from three datasets, where each dataset was processed 10 times to compute the mean and standard deviation.
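The diversity column and the mean/standard-deviation aggregation mentioned in the Table 1 caption can be illustrated with a minimal sketch. The caption does not define diversity, so this assumes the common definition (the fraction of unique words among the top-k words of all topics); the function names, toy topics, and run scores below are hypothetical, and the use of the sample standard deviation is also an assumption.

```python
import statistics


def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    Assumed definition; the caption does not state the exact formula.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


def summarize_runs(scores):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical toy topics: "game" appears in two topics, which lowers diversity.
toy_topics = [
    ["game", "team", "season", "player"],
    ["space", "nasa", "orbit", "launch"],
    ["game", "graphics", "card", "driver"],
]
diversity = topic_diversity(toy_topics, top_k=4)

# Hypothetical diversity scores from repeated runs on one dataset.
run_scores = [0.90, 0.92, 0.91]
mean_score, std_score = summarize_runs(run_scores)
```

A score of 1.0 would mean that no word is shared between topics; repeated words such as "game" above pull the score below 1.0.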

Figure 3: The diffusion framework, built on the main framework in Figure 2. In the training component, a diffusor (a DDPM-scheduled autoencoder with residual connections) is trained on the embedding vectors obtained from enc2. In the generation component, the trained diffusor denoises the embedding vectors transformed from the topic vectors' hidden space before text generation. Note that we normalize the hidden space before passing it to dec2.
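To make the DDPM scheduling in the Figure 3 caption more concrete, here is a minimal sketch of the standard DDPM forward (noising) step applied to a single embedding vector. It is not the authors' implementation: the linear beta schedule, the 1000 steps, the function names, and the toy embedding are assumptions for illustration. A diffusor would typically be trained to predict the added noise from the noisy vector and the time step.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical stand-in for one embedding vector produced by enc2.
embedding = [0.5, -1.0, 0.25, 2.0]
noisy, noise = add_noise(embedding, t=500, alpha_bars=alpha_bars, rng=rng)

# Given the true noise, the clean vector can be recovered exactly; a trained
# denoiser replaces the true noise with its prediction.
a_t = alpha_bars[500]
recovered = [(n - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
             for n, e in zip(noisy, noise)]
```

At generation time the process runs in the opposite direction: starting from a noisy vector, the trained diffusor's noise predictions are used step by step to approach a clean embedding.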

Figure 2: DeTiME framework. We have 4 encoders and 4 decoders. enc1 and enc2 compress the input document to a lower dimension. enc3 constructs the topic distribution. dec1 reconstructs the bag-of-words representation. enc4 extracts the hidden dimension from the reconstructed bag of words. dec2, dec3, and dec4 reconstruct/rephrase the input document. In our method, the embedding dimension is D_token and the maximum sequence length is N_1. The dimension of the compressed vector is D_embed. The number of topics is T. The vocabulary size is N_BoW. The dimension of the topic embeddings is D_topic.
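The dimensions named in the Figure 2 caption can be read as a shape trace through the first half of the pipeline. The sketch below is only a reading aid: the concrete values are hypothetical (except T = 20, the topic count used in Table 1), the exact tensor layouts are assumptions, and the output shapes of enc4 and dec2 to dec4 are left out because the caption does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dims:
    n1: int       # maximum sequence length (N1 in the caption)
    d_token: int  # token embedding dimension
    d_embed: int  # dimension of the compressed document vector
    t: int        # number of topics (T)
    n_bow: int    # vocabulary size of the bag-of-words representation
    d_topic: int  # dimension of the topic embeddings


def shape_trace(d):
    """Assumed per-document shapes at the stages the caption names."""
    return [
        ("token embeddings (input to enc1)", (d.n1, d.d_token)),
        ("compressed vector (after enc1 and enc2)", (d.d_embed,)),
        ("topic distribution (after enc3)", (d.t,)),
        ("reconstructed bag of words (after dec1)", (d.n_bow,)),
        ("topic embeddings (assumed one row per topic)", (d.t, d.d_topic)),
    ]


# Hypothetical sizes chosen only to make the trace concrete.
dims = Dims(n1=512, d_token=768, d_embed=1024, t=20, n_bow=2000, d_topic=300)
trace = shape_trace(dims)
for stage, shape in trace:
    print(f"{stage}: {shape}")
```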

Table 2: The average readability scores at different time steps during the denoising process. A general increase in readability is observed.

Citations

Journal Article•10.2139/SSRN.3304756

Focused Concept Miner (FCM): Interpretable Deep Learning for Text Exploration

Dokyun Lee, +2 more

- 20 May 2018

- Social Science Research Network

TL;DR: Introduces the Focused Concept Miner (FCM), an interpretable deep learning text mining algorithm that automatically extracts coherent corpus-level concepts from text data, focuses concept discovery so that the concepts are highly correlated with a user-specified outcome, and quantifies each concept's correlational importance to that outcome.

Journal Article•10.2139/ssrn.4836558

Topic Modelling Case Law Using a Large Language Model and a New Taxonomy for UK Law: AI Insights into Summary Judgment

Holli Sargeant, +2 more

- 21 May 2024

- Social Science Research Network

TL;DR: A novel taxonomy for topic modelling summary judgment cases in UK law is developed and applied, revealing distinct patterns in the application of summary judgments across various legal domains.

Journal Article•10.1002/eng2.70365

Systematic Literature Review of LLM‐Large Language Model in Medical: Digital Health, Technology and Applications

Imrus Salehin, +5 more

- 01 Sep 2025

- Engineering reports

Abstract: Large language models (LLMs), like the GPT series, have recently emerged as transformative tools in the medical field due to their human‐like language generation and understanding. This systematic review examines the evolution, applications, and challenges of medical LLMs in digital health and clinical technology. A structured search was conducted across ScienceDirect, PubMed, Scopus, and manual sources from 2007 to 2025, following PRISMA 2020 guidelines. After applying inclusion and exclusion criteria, 185 studies were selected from an initial pool of 698 papers. Among these, 30 representative studies were analyzed in‐depth based on their relevance, methodological quality, and contribution to diverse LLM applications in health care. Most research centered on GPT‐based models, with over 81% demonstrating strong performance in language generation, diagnostic assistance, and clinical documentation, based on automated metrics and human feedback. Notably, some models achieved up to 90% satisfaction from healthcare professionals. The findings reveal LLMs' potential to enhance patient interaction, decision support, and overall healthcare efficiency. This review contributes by synthesizing key advancements, assessing model performance, and outlining ethical challenges such as trust, privacy, and safe deployment. It offers novel insights for researchers and practitioners seeking to adopt or improve LLM integration in health care. Future directions include improving transparency, developing domain‐specific models, and establishing regulatory frameworks for responsible use.

References

Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Preprint•10.48550/arxiv.1706.03762

Attention Is All You Need

Ashish Vaswani, +7 more

- 01 Jan 2017

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Proceedings Article•10.3115/V1/D14-1162

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

Proceedings Article

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, +3 more

- 16 Jan 2013

TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Proceedings Article•10.18653/V1/D19-1410

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers, +1 more

- 14 Aug 2019

TL;DR: Presents Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
