Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling

doi:10.3390/computers12120258

Journal Article•10.3390/computers12120258

Sung-Hee Jun

- 10 Dec 2023

- Computers

4

TL;DR: The main finding of the study is how to replace zero values with very small numeric values containing random noise through a GAN, solving the zero-inflated problem with synthetic data generated from the original zero-inflated data.

Abstract: Zero-inflated problems arise frequently in big data analysis, and inflated zeros have a particularly strong effect on text big data analysis. Preprocessed text data generally form a matrix whose rows are documents and whose columns are terms; each element is the frequency with which a term occurs in a document. Most elements of this matrix are zeros, because the number of columns is much larger than the number of rows. This sparsity degrades model performance in text data analysis. To overcome this problem, we propose a method of zero-inflated text data analysis using generative adversarial networks (GAN) and statistical modeling. We address the zero-inflated problem with synthetic data generated from the original zero-inflated data. The main finding of our study is how to change zero values to very small numeric values with random noise through the GAN. The generator and discriminator of the GAN are trained together on the zero-inflated text data, yielding a model that generates synthetic data that can replace the zero-inflated data. We conducted experiments on real and simulated data sets to verify the improved performance of the proposed method. We used five quantitative measures (prediction sum of squares, R-squared, log-likelihood, Akaike information criterion, and Bayesian information criterion) to compare model performance on the original and synthetic data sets. On all five measures, the proposed method performed better than the traditional methods.
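The abstract names a sparse document-term matrix, a zero-replacement step, and five evaluation measures. The sketch below illustrates those quantities in Python with NumPy only. It is not the paper's method: `fill_zeros` is a hypothetical stand-in that draws small uniform noise directly instead of training a GAN, and `fit_metrics` assumes an ordinary least-squares model with Gaussian errors, which the abstract does not specify. All function names, the `eps` threshold, and the toy data are assumptions made for illustration.

```python
import numpy as np


def sparsity(dtm):
    """Fraction of entries in a document-term matrix that are zero."""
    return float(np.mean(np.asarray(dtm) == 0))


def fill_zeros(dtm, eps=1e-3, seed=0):
    """Replace zeros with small positive random values.

    Hypothetical stand-in for the GAN generator described in the abstract:
    noise is drawn uniformly from [eps/10, eps]; nothing is learned.
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(dtm, dtype=float).copy()
    mask = out == 0
    out[mask] = rng.uniform(eps / 10, eps, size=int(mask.sum()))
    return out


def fit_metrics(X, y):
    """Fit OLS with an intercept and return the five measures from the abstract.

    Assumes Gaussian errors; the parameter count k includes the error variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    leverage = np.diag(A @ np.linalg.pinv(A))      # diagonal of the hat matrix
    press = float(((resid / (1.0 - leverage)) ** 2).sum())
    sigma2 = rss / n                               # ML estimate of error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = A.shape[1] + 1
    return {
        "press": press,
        "r2": 1.0 - rss / tss,
        "loglik": float(loglik),
        "aic": float(2 * k - 2 * loglik),
        "bic": float(k * np.log(n) - 2 * loglik),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    dtm = rng.poisson(0.3, size=(30, 3))           # toy counts, mostly zeros
    y = 1.0 + dtm @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.1, 30)
    print("sparsity:", sparsity(dtm))
    print("original:", fit_metrics(dtm, y))
    print("filled:  ", fit_metrics(fill_zeros(dtm), y))
```

Lower prediction sum of squares, AIC, and BIC and higher R-squared and log-likelihood indicate a better fit, which is the direction of comparison the abstract reports between original and synthetic data.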

Citations

Journal Article•10.3390/electronics13040798

Keyword Data Analysis Using Generative Models Based on Statistics and Machine Learning Algorithms

Sunghae Jun

- 19 Feb 2024

- Electronics

TL;DR: Keyword data analysis using generative models based on statistics and machine learning algorithms is valid and contributes to the field of text big data analysis.

3

Journal Article•10.3390/electronics13214247

Patent Keyword Analysis Using Regression Modeling Based on Quantile Cumulative Distribution Function

Sangsung Park, +1 more

- 30 Oct 2024

- Electronics

TL;DR: This study proposes a regression model based on the quantile cumulative distribution function to analyze patent keywords, addressing the zero-inflated problem in patent data, and demonstrates its performance using blockchain technology patent documents.

Journal Article•10.3390/electronics13183670

Technology Keyword Analysis Using Graphical Causal Models

Sunghae Jun

- 15 Sep 2024

- Electronics

TL;DR: This paper proposes a technology keyword analysis method using graphical causal models to identify cause-effect relationships between technology keywords, enabling informed research and development planning in various technology management aspects.

Journal Article•10.3390/computers14100436

Sparse Keyword Data Analysis Using Bayesian Pattern Mining

Sunghae Jun

- 14 Oct 2025

- Computers

Abstract: Keyword data analysis aims to extract and interpret meaningful relationships from large collections of text documents. A major challenge in this process arises from the extreme sparsity of document–keyword matrices, where the majority of elements are zeros due to zero inflation. To address this issue, this study proposes a probabilistic framework called Bayesian Pattern Mining (BPM), which integrates Bayesian inference into association rule mining (ARM). The proposed method estimates both the expected values and credible intervals of interestingness measures such as confidence and lift, providing a probabilistic evaluation of keyword associations. Experiments conducted on 9436 quantum computing patent documents, from which 175 representative keywords were extracted, demonstrate that BPM yields more stable and interpretable associations than conventional ARM. By incorporating credible intervals, BPM reduces the risk of biased decisions under sparsity and enhances the reliability of keyword-based technology analysis, offering a rigorous approach for knowledge discovery in zero-inflated text data.
