TL;DR: Using a new corpus of 145 authors, this paper shows how a large number of authors affects feature selection and learning, and shows that a memory-based learning approach is robust in authorship attribution and verification with many authors and limited training data when compared to eager learning methods such as SVMs and maximum entropy learning.
Abstract: Most studies in statistical or machine learning based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use sizes of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem that we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, what the effect is of many authors on feature selection and learning, and show robustness of a memory-based learning approach in doing authorship attribution and verification with many authors and limited training data when compared to eager learning methods such as SVMs and maximum entropy learning.
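The pooling idea in the abstract above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the function name `build_verification_set`, the toy corpus, and the 1/0 label encoding are all assumptions introduced here.

```python
from typing import Dict, List, Tuple


def build_verification_set(
    texts_by_author: Dict[str, List[str]], target: str
) -> List[Tuple[str, int]]:
    """Label the target author's texts 1 and pool every other author's texts as 0."""
    examples: List[Tuple[str, int]] = []
    for author, texts in texts_by_author.items():
        label = 1 if author == target else 0
        examples.extend((text, label) for text in texts)
    return examples


# Hypothetical toy corpus; the real corpus in the abstract has 145 authors.
corpus = {
    "author_a": ["text a1", "text a2"],
    "author_b": ["text b1"],
    "author_c": ["text c1", "text c2"],
}

pairs = build_verification_set(corpus, "author_a")
positives = sum(label for _, label in pairs)
negatives = len(pairs) - positives
print(positives, negatives)
```

With many authors the negative class grows much faster than the positive class, so a verification set built this way is typically heavily imbalanced.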
TL;DR: IGTree is a useful algorithm for problems characterized by the availability of a large number of training instances described by symbolic features with sufficiently differing information gain values; on two complex linguistic tasks it obtains generalization accuracy similar to or better than alternative lazy learning and decision tree approaches.
Abstract: We describe the IGTree learning algorithm, which compresses an instance base into a tree structure. The concept of information gain is used as a heuristic function for performing this compression. IGTree produces trees that, compared to other lazy learning approaches, reduce storage requirements and the time required to compute classifications. Furthermore, we obtained similar or better generalization accuracy with IGTree when trained on two complex linguistic tasks, viz. letter–phoneme transliteration and part-of-speech tagging, when compared to alternative lazy learning and decision tree approaches (viz., IB1, information-gain-weighted IB1, and C4.5). A third experiment, with the task of word hyphenation, demonstrates that when the mutual differences in information gain of features are too small, both IGTree and information-gain-weighted IB1 perform worse than IB1. These results indicate that IGTree is a useful algorithm for problems characterized by the availability of a large number of training instances described by symbolic features with sufficiently differing information gain values.
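The information gain heuristic mentioned in the abstract above can be illustrated with a short sketch. It computes the standard entropy-based information gain of each symbolic feature and ranks the features by it. It does not build the IGTree structure itself, and the function names and toy data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(
    instances: Sequence[Tuple[str, ...]], labels: Sequence[str], feature_index: int
) -> float:
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    partitions = defaultdict(list)
    for instance, label in zip(instances, labels):
        partitions[instance[feature_index]].append(label)
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def rank_features(
    instances: Sequence[Tuple[str, ...]], labels: Sequence[str]
) -> List[int]:
    """Feature indices ordered from highest to lowest information gain."""
    n_features = len(instances[0])
    return sorted(
        range(n_features),
        key=lambda i: information_gain(instances, labels, i),
        reverse=True,
    )


# Hypothetical toy data: feature 0 determines the class, feature 1 is uninformative.
instances = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["p", "p", "q", "q"]

ig0 = information_gain(instances, labels, 0)
ig1 = information_gain(instances, labels, 1)
ranking = rank_features(instances, labels)
print(ig0, ig1, ranking)
```

In this toy case the two gains differ sharply, which is the situation the abstract describes as favorable; when the gains are nearly equal, the resulting ordering carries little information.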
TL;DR: In this paper, the authors discuss and evaluate thirty-two published automatic methods for scRNA-seq data analysis in terms of their prediction accuracy, F1-score, unlabeling rate and running time.
Abstract: Single-cell RNA sequencing (scRNA-seq) has become a powerful tool for scientists of many research disciplines due to its ability to elucidate the heterogeneous and complex cell-type compositions of different tissues and cell populations. Traditional cell-type identification methods for scRNA-seq data analysis rely on manual annotation, which is time-consuming and knowledge-dependent. By contrast, automatic cell-type identification methods may have the advantages of being fast, accurate, and more user friendly. Here, we discuss and evaluate thirty-two published automatic methods for scRNA-seq data analysis in terms of their prediction accuracy, F1-score, unlabeling rate and running time. We highlight the advantages and disadvantages of these methods and provide recommendations of method choice depending on the available information. The challenges and future applications of these automatic methods are further discussed. In addition, we provide a free scRNA-seq data analysis package encompassing the discussed automatic methods to facilitate their use in real-world applications.
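A rough sketch of how three of the evaluation criteria named above might be computed for one method's predictions. The abstract does not give the exact definitions, so everything here is an assumption: accuracy is taken over all cells, F1 is macro-averaged over the true cell types, and the unlabeling rate is taken to be the fraction of cells the method leaves unassigned, marked with a hypothetical `UNASSIGNED` label.

```python
from typing import Sequence

UNASSIGNED = "unassigned"  # hypothetical marker for cells a method declines to label


def accuracy(true: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of cells whose predicted type matches the true type."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def unlabeling_rate(pred: Sequence[str]) -> float:
    """Fraction of cells left unassigned (assumed definition)."""
    return sum(p == UNASSIGNED for p in pred) / len(pred)


def macro_f1(true: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the true cell types."""
    scores = []
    for cls in sorted(set(true)):
        tp = sum(t == cls and p == cls for t, p in zip(true, pred))
        fp = sum(t != cls and p == cls for t, p in zip(true, pred))
        fn = sum(t == cls and p != cls for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for four cells.
true_types = ["T", "T", "B", "B"]
pred_types = ["T", UNASSIGNED, "B", "T"]

acc = accuracy(true_types, pred_types)
unl = unlabeling_rate(pred_types)
f1 = macro_f1(true_types, pred_types)
print(acc, unl, f1)
```

Counting unassigned cells as errors in the accuracy, as done here, is one possible convention; excluding them and reporting the unlabeling rate alongside is another.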
TL;DR: Naive Bayes and k-nearest neighbor algorithms are used to detect emotion in Twitter messages, which provide a rich ensemble of human emotions, and to classify the messages into four emotional categories.
Abstract: The task of emotion detection usually involves the analysis of text. Humans show universal consistency in identifying emotions but a great deal of variation between individuals in their abilities. We detect emotion in Twitter messages, as they provide a rich ensemble of human emotions. We use two machine learning algorithms, namely Naive Bayes (NB) and k-nearest neighbor (KNN), to detect the emotion of a Twitter message and then classify the messages into four emotional categories. We also carried out a comparative study of these two supervised machine learning algorithms; the eager learning classifier (NB) performed well when compared with the lazy learning classifier (KNN).
TL;DR: In this article, the authors carried out a comprehensive analysis and study of seven machine learning algorithms for rent prediction, including Linear Regression, Multilayer Perceptron, Random Forest, KNN, Locally Weighted Learning, SMO, and KStar algorithms.
Abstract: Real-estate rent prediction in housing market analysis plays a key role in calculating the Rate of Return, a salient index used to evaluate real-estate investment options. Accurate rent prediction in real estate investment can help in generating capital gains and guaranteeing financial success. In this paper, we carry out a comprehensive analysis and study of seven machine learning algorithms for rent prediction, including Linear Regression, Multilayer Perceptron, Random Forest, KNN, Locally Weighted Learning, SMO, and KStar algorithms. We train a new model for the US territory, covering three house types: single-family, townhouse, and condo. Each data instance in the dataset has 21 internal attributes (e.g., area space, price, number of bedrooms/bathrooms, rent, school rating, and so forth). A subset of the collected features is selected by filter methods for the prediction models. We also employ a hierarchical clustering approach to cluster the data based on two factors: house type and the average rent estimate of zip codes. The empirical results suggest that the rent prediction models based on lazy learning algorithms lead to higher accuracy and lower prediction error compared to eager learning methods.