TL;DR: The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
Abstract: Text classification is a domain with a high-dimensional feature space. Extracting the keywords as the features can be extremely useful in text classification. The study presents an empirical analysis of five statistical keyword extraction methods and a comprehensive analysis of classifier and keyword extraction ensembles. For the ACM collection, a Bagging ensemble of Random Forest reaches a classification accuracy of 93.80%. Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification faces the challenge of a high-dimensional feature space. Hence, extracting the words most relevant to the content of the document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). The study also conducts a comprehensive comparison of base learning algorithms (Naive Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms.
The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most frequent based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained by combining the most frequent based keyword extraction method with the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
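As a rough illustration of the most frequent based keyword representation described above, the following Python sketch picks the top-k most frequent non-stopword tokens of a document and turns them into a binary feature vector. The stopword list, the function names and the tie-breaking rule are assumptions made for this example and are not taken from the study; the ensemble classifiers (for example Bagging of Random Forest) would be trained on such vectors and are not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to", "for", "on", "with"}


def most_frequent_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens of `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def keyword_features(text, vocabulary, k=3):
    """Binary vector: 1 if a vocabulary term is among the top-k keywords."""
    keywords = set(most_frequent_keywords(text, k))
    return [1 if term in keywords else 0 for term in vocabulary]


if __name__ == "__main__":
    doc = ("Graph indexing of documents: the graph models documents, "
           "and graph weights rank terms.")
    print(most_frequent_keywords(doc))
    print(keyword_features(doc, ["documents", "graph", "rank"]))
```

Restricting each document to a handful of keywords keeps the feature space small, which is the scalability argument the abstract makes.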
TL;DR: This paper presents a state of the art of automatic keyphrase extraction approaches to identify their strengths and weaknesses and discusses why some techniques perform better than others and how to improve the task of automatic keyphrase extraction.
Abstract: Keyphrases are useful for a variety of tasks in information retrieval systems and natural language processing, such as text summarization, automatic indexing, clustering/classification, ontology learning and building and conceptualizing particular knowledge domains, etc. However, assigning these keyphrases manually is time consuming and expensive in terms of human resources. Therefore, there is a need to automate the task of extracting keyphrases. A wide range of keyphrase extraction techniques have been proposed, but they still suffer from low accuracy and poor performance. This paper presents a state of the art of automatic keyphrase extraction approaches to identify their strengths and weaknesses. We also discuss why some techniques perform better than others and how we can improve the task of automatic keyphrase extraction.
TL;DR: Experimental results show that the graph based indexing method is a promising approach for semantic and contextual indexation, and outperforms statistical based method (TFIDF) by 12% in F-measure.
Abstract: Extracting knowledge from text data and taking full advantage of it has been an important way to reduce computation and accelerate processing, especially for large amounts of data. Thus, different approaches and methodologies for modeling and representing textual data have been proposed. In this paper, a graph-based approach for automatic indexing of unstructured data from an Arabic corpus is proposed. First, each document in the collection is represented by a graph. After the document graph is generated, term weights are computed to estimate the relevance of each term to the document. The graph representation allows for a much more expressive document model than the standard bag-of-words approach and consequently improves classification performance. Experimental results show that the graph-based indexing method is a promising approach for semantic and contextual indexation, and that it outperforms a statistical method (TFIDF) by 12% in F-measure.
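The two steps named in the abstract (build a document graph, then weight terms on it) can be sketched as below. This is only a minimal illustration under stated assumptions: the abstract does not specify how edges or weights are defined, so the sliding-window co-occurrence edges and the weighted-degree score used here are stand-ins, and all function names are hypothetical.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph whose edge weights count co-occurrences.

    Two distinct terms are linked each time one appears within
    `window` positions after the other (assumed edge definition).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                graph[left][right] += 1
                graph[right][left] += 1
    return graph


def weighted_degree(graph):
    """Score each term by the sum of its edge weights (assumed weighting)."""
    return {term: sum(nbrs.values()) for term, nbrs in graph.items()}


if __name__ == "__main__":
    tokens = ["graph", "index", "graph", "term"]
    scores = weighted_degree(cooccurrence_graph(tokens, window=1))
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(term, score)
```

Unlike a bag-of-words count, the score of a term here depends on which other terms surround it, which is the kind of contextual information the graph representation is meant to capture.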
TL;DR: The results show that the proposed word-matching-based segmentation method is promising in finding the word boundaries and in retrieving the words across documents.
Abstract: Motivated by the need for the automatic indexing and analysis of a huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve this heritage and keep it alive, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the word. In this study, using the idea that divans have multiple copies (versions) by different writers in different writing styles, and that word segmentation in some of those versions may be relatively easier to achieve than in other versions, segmentation of the versions (which are difficult, if not impossible, with traditional techniques) is performed using information carried from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for a novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words. We improve the performance of simple features by considering the words in a context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising in finding the word boundaries and in retrieving the words across documents.
TL;DR: A platform of Automatic Natural Language Processing (ANLP) tools is proposed that permits automatic indexing and research from a corpus of Arabic texts and allows contexts to be extracted from the entered corpus and meanings to be assigned by the user.
Abstract: This paper will help linguists find an easy method to facilitate the creation of a standard Arabic historical dictionary in order to save time and to be up to date with other languages. In this method, we propose a platform of Automatic Natural Language Processing (ANLP) tools which permits automatic indexing and research from a corpus of Arabic texts. Some pretreatments are done before the indexation process: segmentation, normalization, filtering, and morphological analysis. The prototype that we've developed for the generation of a standard Arabic historical dictionary allows contexts to be extracted from the entered corpus and meanings to be assigned by the user. The evaluation of our system shows that the results are reliable.
TL;DR: A platform of Automatic Natural Language Processing (ANLP) tools which permits automatic indexing and research from a corpus of Arabic texts and allows contexts to be extracted from the entered corpus and meanings to be assigned by the user.
Abstract: In this paper, we provide linguists with a method to facilitate the creation of a standard Arabic historical dictionary in order to save the lost period and to be up to date with other languages. In this method, we propose a platform of Automatic Natural Language Processing (ANLP) tools which permits automatic indexing and research from a corpus of Arabic texts. The indexation is applied after some pretreatments: segmentation, normalization, filtering, and morphological analysis. The prototype that we’ve developed for the generation of a standard Arabic historical dictionary allows contexts to be extracted from the entered corpus and meanings to be assigned by the user. The evaluation of our system shows that the results are reliable.
TL;DR: This chapter introduces HIVE automatic indexing capabilities and reports on training the HIVE-ES server in connection with the Wine Thesaurus, a SKOSified vocabulary in Spanish.
Abstract: Helping Interdisciplinary Vocabulary Engineering (HIVE) is an automatic indexing, machine learning technology that addresses cost, interoperability and usability challenges associated with traditional vocabulary frameworks. HIVE supports dynamic subject metadata generation using multiple SKOS encoded controlled vocabularies. Kea++/Maui algorithms are used for the machine learning activity. Professionally indexed documents representing the gold standard are processed via Kea++ or Maui, to train HIVE. The HIVE-ES (Espana) initiative extends HIVE to Spanish language vocabularies. This chapter introduces HIVE automatic indexing capabilities and reports on training the HIVE-ES server in connection with the Wine Thesaurus, a SKOSified vocabulary in Spanish. The results highlight the value of the HIVE approach for information management systems seeking to work with multiple vocabularies for metadata generation in an intelligent manner.
TL;DR: The automatic video analysis of the TIB AV Portal includes not only structural analysis (scene recognition), but also text, audio and image analysis, enabling pinpoint searches to be made within videos.
Abstract: With the AV Portal, the German National Library of Science and Technology (TIB), in collaboration with the Hasso Plattner Institute (HPI), has developed a user-oriented platform for scientific films. This portal offers free access to high-quality computer visualisations, simulations, experiments and interviews as well as recordings of lectures and conferences from the fields of science and technology. The automatic video analysis of the TIB AV Portal includes not only structural analysis (scene recognition), but also text, audio and image analysis. Automatic indexing by the AV Portal describes videos at the segment level, enabling pinpoint searches to be made within videos. Films are allocated a Digital Object Identifier (DOI), which means they can be referenced clearly. Individual film segments are allocated a Media Fragment Identifier (MFID), which enables the video to be referenced down to the second and cited. The creator of the audiovisual media segment can choose between an Open Access licence and a declaration of consent, enabling them to decide how they wish to permit TIB to utilise the material. TIB recommends the “CC-Namensnennung – Deutschland 3.0” licence, which ensures that the creator is acknowledged and permits the comprehensive use of audiovisual media in research and teaching.
TL;DR: The results show that the three types of approaches give globally comparable results for images, but they differ for specific kinds of pairs/triplets, and directly building bi-concept detectors remains the best alternative, especially if a well-annotated dataset is available.
Abstract: Automatic indexing of images and videos is a highly relevant and important research area in multimedia information retrieval. The difficulty of this task is no longer something to prove. Most efforts of the research community have been focusing, in the past, on the detection of single concepts in images/videos, which is already a hard task. With the evolution of information retrieval systems, users' needs become more abstract, and lead to a larger number of words composing the queries. It is important to think about indexing multimedia documents with more than just individual concepts, to help retrieval systems to answer such complex queries. Few studies addressed specifically the problem of detecting multiple concepts (multi-concept) in images and videos. Most of them concern the detection of concept pairs. These studies showed that such challenge is even greater than the one of single concept detection. In this work, we address the problem of multi-concept detection in images/videos by making a comparative and detailed study. Three types of approaches are considered: 1) building detectors for multi-concept, 2) fusing single concepts detectors and 3) exploiting detectors of a set of single concepts in a stacking scheme. We conducted our evaluations on PASCAL VOC'12 collection regarding the detection of pairs and triplets of concepts. We extended the evaluation process on TRECVid 2013 dataset for infrequent concept pairs' detection. Our results show that the three types of approaches give globally comparable results for images, but they differ for specific kinds of pairs/triplets. In the case of videos, late fusion of detectors seems to be more effective and efficient when single concept detectors have good performances. Otherwise, directly building bi-concept detectors remains the best alternative, especially if a well-annotated dataset is available. The third approach did not bring additional gain or efficiency.
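The second family of approaches, fusing single-concept detectors, can be sketched as follows. This is a minimal illustration, assuming each detector outputs a score in [0, 1] per image; the fusion rules (product, minimum, mean), the function names and the example scores are hypothetical and are not claimed to be the ones evaluated in the paper.

```python
import math


def fuse_scores(scores, method="product"):
    """Late fusion: combine single-concept scores into one multi-concept score."""
    if method == "product":
        return math.prod(scores)
    if method == "min":
        return min(scores)
    if method == "mean":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown fusion method: {method}")


def rank_by_concepts(detections, concepts, method="product"):
    """Rank image ids by the fused score of the requested concepts.

    `detections` maps image id -> {concept name: detector score}.
    """
    fused = {
        image_id: fuse_scores([scores[c] for c in concepts], method)
        for image_id, scores in detections.items()
    }
    return sorted(fused, key=fused.get, reverse=True)


if __name__ == "__main__":
    detections = {
        "img1": {"dog": 0.9, "car": 0.2},
        "img2": {"dog": 0.6, "car": 0.7},
    }
    print(rank_by_concepts(detections, ["dog", "car"]))
```

Because fusion only reuses existing single-concept outputs, no pair-level annotation is needed, which is consistent with the abstract's remark that this route depends on the single-concept detectors performing well.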
TL;DR: The goal of this thesis is to model the semantic and topical context of new proper names in order to retrieve those which are relevant to the spoken content in the audio document.
Abstract: The diachronic nature of broadcast news causes frequent variations in the linguistic content and vocabulary, leading to the problem of Out-Of-Vocabulary (OOV) words in automatic speech recognition. Most of the OOV words are found to be proper names, and proper names are important for automatic indexing of audio-video content as well as for obtaining reliable automatic transcriptions. The goal of this thesis is to model the semantic and topical context of new proper names in order to retrieve those which are relevant to the spoken content in the audio document. Training context models is a challenging problem in this task because several new names come with a low amount of data and the context model should be robust to errors in the automatic transcription. Probabilistic topic models and word embeddings from neural network models are explored for the task of retrieval of relevant proper names. A thorough evaluation of these contextual representations is performed. It is argued that these representations, which are learned in an unsupervised manner, are not the best for the given retrieval task. Neural network context models trained with an objective to maximise the retrieval performance are proposed. The proposed Neural Bag-of-Weighted-Words (NBOW2) model learns to assign a degree of importance to input words and has the ability to capture task-specific keywords. Experiments on automatic speech recognition on French broadcast news videos demonstrate the effectiveness of the proposed models. Evaluation of the NBOW2 model on standard text classification tasks shows that it learns interesting information and gives the best classification accuracies among the BOW models.
TL;DR: An outline of the use of automatic indexing is given by discussing various hashing techniques including fuzzy fingerprinting and locality-sensitive hashing.
Abstract: Indexing is one of the important tasks of Information Retrieval and can be applied to any form of data, whether generated from the web, databases, etc. As the size of corpora increases, manual indexing becomes too time consuming and labor intensive, hence the introduction of computer-aided indexers. This paper reviews indexing techniques, both human and automatic. It gives an outline of the use of automatic indexing by discussing various hashing techniques, including fuzzy fingerprinting and locality-sensitive hashing. Two different matching processes used in automatic subject indexing are also reviewed. Given the need for automatic indexing as a possible replacement for manual indexing, studies in the development of automatic indexing tools must continue.
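To make the idea of locality-sensitive hashing concrete, here is a small sketch using MinHash with banding, one common LSH scheme for set similarity. The choice of MinHash, the word 3-gram shingles, the number of hash functions and bands, and every function name are assumptions made for illustration; the reviewed paper may discuss different variants.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams of `text` (assumed document representation)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash64(seed, item):
    """Deterministic 64-bit hash of `item`, parameterised by `seed`."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash64(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pair(sig_a, sig_b, bands=8):
    """True if the two signatures agree on every row of at least one band."""
    rows = len(sig_a) // bands
    return any(
        sig_a[b * rows:(b + 1) * rows] == sig_b[b * rows:(b + 1) * rows]
        for b in range(bands)
    )


if __name__ == "__main__":
    a = minhash_signature(shingles("automatic indexing of large document collections"))
    b = minhash_signature(shingles("automatic indexing of large text collections"))
    print(estimated_similarity(a, b), candidate_pair(a, b))
```

Only documents that collide in at least one band need to be compared in full, which is what makes this kind of hashing attractive for large corpora.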
TL;DR: This paper investigated and compared factorization methods in utilising the semantics of concept re-occurrence and co-occurring patterns and the efficacies in enhancing recognition performances.
Abstract: The proliferation of portable and even wearable visual sensing devices e.g. SenseCam, Google Glass, etc. is creating opportunities for automatic indexing and management of digitally-recorded everyday behaviour. Although the detection of semantic concepts within narrow domains has now reached a satisfactory performance level based on automatic mapping from low-level features to higher level semantics, in wearable sensing and life-logging, a diversity of everyday concepts are captured by the images and this challenges the performance of automatic concept detection and activity indexing based on this. In this paper, we investigated and compared factorization methods in utilising the semantics of concept re-occurrence and co-occurrence patterns. The factorized results are then input to activity recognition to show the efficacies in enhancing recognition performances.
TL;DR: ACTIVE, a platform for indexing and cataloging audiovisual collections through the automatic recognition of faces and speakers, and chosen solutions for making the platform easily extensible (plug-ins) and for distributing CPU-intensive calculations across a network of computers are presented.
Abstract: The cost of manual metadata production is high, especially for audiovisual content, where a time-consuming inspection is usually required in order to identify the most appropriate annotations. There is a growing need from digital content industries for solutions capable of automating such a process. In this work we present ACTIVE, a platform for indexing and cataloging audiovisual collections through the automatic recognition of faces and speakers. Adopted algorithms are described and our main contributions on people clustering and caption-based people identification are presented. Results of experiments carried out on a set of TV shows and audio files are reported and analyzed. An overview of the whole architecture is presented as well, with a focus on chosen solutions for making the platform easily extensible (plug-ins) and for distributing CPU-intensive calculations across a network of computers.
TL;DR: Results showed that ExamParser extracts significantly more, and more diverse, concepts from exams, indicating that automatic concept extraction from exams is promising and could be a potential technological solution for providing personalized feedback in large programming classes.
Abstract: We design and study ExamParser, an innovative intelligent semantic automatic indexing method, for orchestrating today's programming classes. ExamParser automatically processes paper-based exams by associating sets of concepts with the exam questions, which provides graders with semantic grading guidelines and lets them leave personalized semantic feedback. Results showed that ExamParser extracts significantly more, and more diverse, concepts from exams. It also achieves high coherence within an exam, indicating that automatic concept extraction from exams is promising and could be a potential technological solution for providing personalized feedback in large programming classes.
TL;DR: In this article, the authors propose a comprehensive evaluation framework informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.
Abstract: Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. Although some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The article reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on evaluation approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard, evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance.
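One of the direct evaluation approaches mentioned above, comparison with a gold standard, is often reported as precision, recall and F1 over the assigned subject terms. The sketch below shows that computation under the assumption of exact term matching; the function name and the example terms are hypothetical, and, as the abstract stresses, a single gold standard is only one of several views a full evaluation would need.

```python
def precision_recall_f1(assigned, gold):
    """Compare automatically assigned subject terms with gold-standard terms.

    Exact string matching is assumed; partial or hierarchical matches
    are not credited in this simplified sketch.
    """
    assigned, gold = set(assigned), set(gold)
    true_positives = len(assigned & gold)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    assigned = ["wine", "grape", "soil"]
    gold = ["wine", "grape", "vineyard", "harvest"]
    print(precision_recall_f1(assigned, gold))
```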