Journal Article, DOI: 10.3390/computers12120258
Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling
4
TL;DR: The study uses a GAN to change zero values to very small numeric values with random noise, addressing the zero-inflated problem with synthetic data generated from the original zero-inflated data.
Abstract: Zero-inflated problems arise frequently in big data analysis, and inflated zeros have a particularly strong influence on text big data analysis. Preprocessed data from text documents generally form a matrix whose rows are documents and whose columns are terms; each element is the frequency with which a term occurs in a document. Most elements of this matrix are zeros because the number of columns is much larger than the number of rows. This sparsity degrades model performance in text data analysis. To overcome the problem, we propose a method of zero-inflated text data analysis using generative adversarial networks (GAN) and statistical modeling. We address the zero-inflated problem by using synthetic data generated from the original zero-inflated data. The main finding of our study is how to change zero values to very small numeric values with random noise through the GAN. The generator and discriminator of the GAN are trained together on the zero-inflated text data, yielding a model that generates synthetic data that can replace the zero-inflated data. We conducted experiments on real and simulated data sets to verify the improved performance of the proposed method. We used five quantitative measures to compare model performance on the original and synthetic data sets: prediction sum of squares, R-squared, log-likelihood, Akaike information criterion, and Bayesian information criterion. On all five measures, the proposed method performed better than the traditional methods.
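The document-term matrix and the five evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`document_term_matrix`, `zero_fraction`, `fit_measures`) are hypothetical, the simple one-predictor Gaussian linear regression stands in for whatever statistical models the authors fit, and the parameter count k = 3 (intercept, slope, error variance) is one common convention for AIC and BIC.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def zero_fraction(matrix):
    """Share of matrix cells that are exactly zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


def fit_measures(x, y):
    """Fit y = a + b*x by least squares and return the five measures.

    Assumes Gaussian errors, at least three observations, and x not constant.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rss = sum(e ** 2 for e in resid)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    # PRESS: leave-one-out residuals obtained from the leverages h_i.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    press = sum((e / (1 - h)) ** 2 for e, h in zip(resid, leverage))
    # Gaussian log-likelihood evaluated at the ML variance estimate rss / n.
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = 3  # assumed parameter count: intercept, slope, error variance
    return {
        "PRESS": press,
        "R2": 1 - rss / tss,
        "logLik": loglik,
        "AIC": 2 * k - 2 * loglik,
        "BIC": k * math.log(n) - 2 * loglik,
    }


# Toy corpus and toy regression data, invented for illustration only.
docs = ["gan generator discriminator", "zero inflated text data", "text data gan"]
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(zero_fraction(dtm))
print(fit_measures([0, 0, 1, 0, 2, 0, 3, 0], [1.0, 0.5, 2.1, 0.8, 3.2, 1.1, 3.9, 0.7]))
```

Lower PRESS, AIC, and BIC and higher R-squared and log-likelihood indicate a better fit, so a comparison between original and synthetic data sets would call `fit_measures` on each and compare the resulting dictionaries.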
Citations
Keyword Data Analysis Using Generative Models Based on Statistics and Machine Learning Algorithms
Sunghae Jun
TL;DR: Keyword data analysis using generative models based on statistics and machine learning algorithms is valid and contributes to the field of text big data analysis.
3
Patent Keyword Analysis Using Regression Modeling Based on Quantile Cumulative Distribution Function
Sangsung Park, Sunghae Jun +1 more
TL;DR: This study proposes a regression model based on the quantile cumulative distribution function to analyze patent keywords, addressing the zero-inflated problem in patent data, and demonstrates its performance using blockchain technology patent documents.
Technology Keyword Analysis Using Graphical Causal Models
Sunghae Jun
TL;DR: This paper proposes a technology keyword analysis method using graphical causal models to identify cause-effect relationships between technology keywords, enabling informed research and development planning in various technology management aspects.
Sparse Keyword Data Analysis Using Bayesian Pattern Mining
Abstract: Keyword data analysis aims to extract and interpret meaningful relationships from large collections of text documents. A major challenge in this process arises from the extreme sparsity of document–keyword matrices, where the majority of elements are zeros due to zero inflation. To address this issue, this study proposes a probabilistic framework called Bayesian Pattern Mining (BPM), which integrates Bayesian inference into association rule mining (ARM). The proposed method estimates both the expected values and credible intervals of interestingness measures such as confidence and lift, providing a probabilistic evaluation of keyword associations. Experiments conducted on 9436 quantum computing patent documents, from which 175 representative keywords were extracted, demonstrate that BPM yields more stable and interpretable associations than conventional ARM. By incorporating credible intervals, BPM reduces the risk of biased decisions under sparsity and enhances the reliability of keyword-based technology analysis, offering a rigorous approach for knowledge discovery in zero-inflated text data.
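The idea of attaching an expected value and a credible interval to a rule's confidence can be sketched as follows. This is not the BPM method itself, whose details the abstract does not give: the sketch assumes a Beta-Binomial model with a uniform Beta(1, 1) prior, approximates the interval by Monte Carlo sampling, and uses a hypothetical function name (`confidence_posterior`) and made-up counts.

```python
import random


def confidence_posterior(n_antecedent, n_both, prior_a=1.0, prior_b=1.0,
                         level=0.95, draws=20000, seed=0):
    """Posterior mean and equal-tailed credible interval for a rule's confidence.

    n_antecedent: documents containing keyword A
    n_both:       documents containing both A and B
    Assumed model: confidence P(B | A) ~ Beta(prior_a + n_both,
                                              prior_b + n_antecedent - n_both)
    """
    rng = random.Random(seed)
    a = prior_a + n_both
    b = prior_b + (n_antecedent - n_both)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return a / (a + b), (lower, upper)


# Hypothetical counts: keyword A appears in 40 documents, A and B together in 12.
mean, (lower, upper) = confidence_posterior(40, 12)
print(mean, lower, upper)
```

With sparse data the antecedent count is often small, which widens the interval; reporting the interval alongside the point estimate makes that uncertainty visible.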