Journal Article, DOI: 10.18653/v1/2023.findings-emnlp.606
DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM
Weijie Xu,Wenxiang Hu,Fanyou Wu,Srinivasan H. Sengamedu +3 more
- 01 Jan 2023
TL;DR: DeTiME is a novel framework for Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM that generates highly clusterable embeddings and enhances topic generation capabilities.
Abstract: In the burgeoning field of natural language processing, Neural Topic Models (NTMs) and Large Language Models (LLMs) have emerged as areas of significant research interest. Despite this, NTMs primarily utilize contextual embeddings from LLMs, which are neither optimal for clustering nor capable of topic generation. Our study addresses this gap by introducing a novel framework named Diffusion-Enhanced Topic Modeling using Encoder-Decoder-based LLMs (DeTiME). DeTiME leverages Encoder-Decoder-based LLMs to produce highly clusterable embeddings that can generate topics exhibiting both superior clusterability and enhanced semantic coherence compared to existing methods. Additionally, by exploiting the power of diffusion, our framework also provides the capability to generate content relevant to the identified topics. This dual functionality allows users to efficiently produce highly clustered topics and related content simultaneously. DeTiME’s potential extends to generating clustered embeddings as well. Notably, our proposed framework proves to be efficient to train and exhibits high adaptability, demonstrating its potential for a wide array of applications.
Figures

Figure 4: The histogram of survey scores from fluency, grammar, and redundancy perspectives. 
Figure 1: A summary of a few of our findings: (1) Our embeddings outperform the best clusterable methods (selected from Muennighoff et al., 2022). (2) The same framework with a slightly different finetuned task (DeTiME Training) does not perform well. (3) When compressed, our embeddings excel in higher dimensions, making them ideal for topic modeling. Detailed settings are in Appendix E.
Table 1: The main results for all clusterability metrics, diversity, and coherence (Cv). The number of topics is 20. The best and second-best scores of each dataset are highlighted in boldface and with an underline, respectively. The result represents the average value obtained from three datasets, where each dataset was processed 10 times to compute the mean and standard deviation. 
Figure 3: The diffusion framework based on the main framework in Figure 2. In the training component, a DDPM-scheduled autoencoder diffusor with residual connections is trained using the embedding vectors obtained from enc2. In the generation part, the trained diffusor is used to denoise the embedding vectors transformed from the hidden space of the topic vectors before text generation. It’s important to note that we normalized the hidden space before passing it to dec2.
Figure 2: DeTiME framework. We have 4 encoders and 4 decoders. enc1 and enc2 compress the input document to a lower dimension. enc3 constructs the topic distribution. dec1 reconstructs the bag-of-words representations. enc4 extracts the hidden dimension from the reconstructed bag of words. dec2, dec3 and dec4 reconstruct/rephrase the input document. In our method, we denote the token embedding dimension by Dtoken and the maximum sequence length by N1. The dimension of the compressed vector is Dembed. The number of topics equals T. The dimension of the vocabulary is NBoW. The dimension of topic embeddings is Dtopic.
Table 2: The average readability scores at different time steps during the denoising process. A general increase in readability is observed.
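The Figure 3 caption mentions a DDPM-scheduled diffusor trained on embedding vectors. As a rough illustration only, the sketch below shows the standard DDPM forward (noising) step applied to a single embedding vector. It is not the authors' implementation: the linear schedule, the beta range, the number of steps, the function names, and the toy 4-dimensional embedding are all assumptions made for this example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed values, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    Returns the noisy vector and the Gaussian noise that was mixed in; in DDPM
    training a denoiser is typically fit to predict that noise from x_t and t.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


# Hypothetical usage: a toy 4-dimensional "embedding" noised at step 500 of 1000.
rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
embedding = [0.5, -1.0, 2.0, 0.0]
noisy, eps = add_noise(embedding, 500, alpha_bars, rng)
```

At generation time a trained denoiser would run this process in reverse, removing the predicted noise step by step; that network is omitted here because the page does not describe its architecture beyond the caption.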
Citations
Focused Concept Miner (FCM): Interpretable Deep Learning for Text Exploration
TL;DR: The Focused Concept Miner is introduced: an interpretable deep learning text mining algorithm that automatically extracts coherent corpus-level concepts from text data, focuses the discovery of concepts so that they are highly correlated with a user-specified outcome, and quantifies each concept's correlational importance to that outcome.
12
Topic Modelling Case Law Using a Large Language Model and a New Taxonomy for UK Law: AI Insights into Summary Judgment
Holli Sargeant,Ahmed Izzidien,Felix Steffek +2 more
TL;DR: A novel taxonomy for topic modelling summary judgment cases in UK law is developed and applied, revealing distinct patterns in the application of summary judgments across various legal domains.
Systematic Literature Review of LLM‐Large Language Model in Medical: Digital Health, Technology and Applications
Imrus Salehin,Md Tomal Ahmed Sajib,Nazmul Huda Badhon,Md Sakibul Hassan Rifat,Nazrul Amin,Nazmun Nessa Moon +5 more
Abstract: Large language models (LLMs), like the GPT series, have recently emerged as transformative tools in the medical field due to their human‐like language generation and understanding. This systematic review examines the evolution, applications, and challenges of medical LLMs in digital health and clinical technology. A structured search was conducted across ScienceDirect, PubMed, Scopus, and manual sources from 2007 to 2025, following PRISMA 2020 guidelines. After applying inclusion and exclusion criteria, 185 studies were selected from an initial pool of 698 papers. Among these, 30 representative studies were analyzed in‐depth based on their relevance, methodological quality, and contribution to diverse LLM applications in health care. Most research centered on GPT‐based models, with over 81% demonstrating strong performance in language generation, diagnostic assistance, and clinical documentation, based on automated metrics and human feedback. Notably, some models achieved up to 90% satisfaction from healthcare professionals. The findings reveal LLM's potential to enhance patient interaction, decision support, and overall healthcare efficiency. This review contributes by synthesizing key advancements, assessing model performance, and outlining ethical challenges such as trust, privacy, and safe deployment. It offers novel insights for researchers and practitioners seeking to adopt or improve LLM integration in health care. Future directions include improving transparency, developing domain‐specific models, and establishing regulatory frameworks for responsible use.
References
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
Attention Is All You Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Łukasz Kaiser,Illia Polosukhin +7 more
- 01 Jan 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
51.8K
GloVe: Global Vectors for Word Representation
Jeffrey Pennington,Richard Socher,Christopher D. Manning +2 more
- 01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
•Proceedings Article
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov,Kai Chen,Greg S. Corrado,Jeffrey Dean +3 more
- 16 Jan 2013
TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
27.5K
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers,Iryna Gurevych +1 more
- 14 Aug 2019
TL;DR: Sentence-BERT (SBERT) is presented, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.