TL;DR: In this article, speech-to-text transcription of multimedia files is performed automatically by means of an ASR process, and acoustic and language models adapted for the ASR process are selected at least before the latter processes the multimedia file, i.e. "a priori".
Abstract: The method comprises analyzing audio content of multimedia files and performing a speech-to-text transcription thereof automatically by means of an ASR process, and selecting acoustic and language models adapted for the ASR process at least before the latter processes the multimedia file, i.e. "a priori". The method is particularly applicable to the automatic indexing, aggregation and clustering of news from different sources and from different types of files, including text, audio and audiovisual documents, without any manual annotation.
TL;DR: A probabilistic Bayesian belief network (BBN) method is presented for automatic indexing of excitement clips of sports video sequences; it offers a general approach to the automatic tagging of large-scale multimedia content with rich semantics.
Abstract: This paper presents a probabilistic Bayesian belief network (BBN) method for automatic indexing of excitement clips of sports video sequences. The excitement clips from sports video sequences are extracted using audio features. The excitement clips comprise multiple subclips corresponding to events such as replay, field-view, close-ups of players, close-ups of referees/umpires, spectators, and players' gathering. The events are detected and classified using a hierarchical classification scheme. The BBN based on observed events is used to assign semantic concept-labels to the excitement clips, such as goals, saves, and cards in soccer video, and wickets and hits in cricket video sequences. The BBN-based indexing results are compared with those of our previously proposed event-association-based approach, and the BBN is found to perform better. The proposed scheme provides a generalizable method for linking low-level video features with high-level semantic concepts. The generic nature of the proposed approach in the sports domain is validated by demonstrating successful indexing of soccer and cricket video excitement clips. The proposed scheme offers a general approach to the automatic tagging of large-scale multimedia content with rich semantics. The collection of labeled excitement clips provides a video summary for highlight browsing, video skimming, indexing and retrieval.
TL;DR: The extracted keywords are helpful in automatic indexing, text summarization, information retrieval, classification, clustering, topic detection and tracking, and web searches.
Abstract: Automatic keyword extraction is the task of identifying a small set of words, key phrases, keywords, or key segments from a document that can describe the meaning of the document. Keywords are useful tools as they give the shortest summary of the document. This paper concentrates on automatic keyword extraction for Punjabi language text. It includes various phases: removing stop words, identification of Punjabi nouns and noun stemming, calculation of Term Frequency and Inverse Sentence Frequency (TF-ISF), selection of Punjabi keywords as nouns with high TF-ISF scores, and a title/headline feature for Punjabi text. The extracted keywords are helpful in automatic indexing, text summarization, information retrieval, classification, clustering, topic detection and tracking, and web searches.
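The TF-ISF scoring step named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the formula, so the sketch assumes the common form TF(w) × log(N / SF(w)), where N is the number of sentences and SF(w) is the number of sentences containing w. The function name `tf_isf`, the whitespace tokenizer, and the English toy sentences are hypothetical stand-ins for the Punjabi-specific stop-word removal and noun stemming described above.

```python
import math
from collections import Counter


def tf_isf(sentences, stop_words=frozenset()):
    """Score each word by term frequency times inverse sentence frequency.

    Assumed formulation (not taken from the paper):
        score(w) = TF(w) * log(N / SF(w))
    """
    # Naive whitespace tokenization; a real system would stem nouns first.
    tokenized = [
        [w for w in s.lower().split() if w not in stop_words]
        for s in sentences
    ]
    n_sentences = len(tokenized)
    # TF: total occurrences of each word in the document.
    tf = Counter(w for sent in tokenized for w in sent)
    # SF: number of sentences in which each word occurs at least once.
    sf = Counter(w for sent in tokenized for w in set(sent))
    return {w: tf[w] * math.log(n_sentences / sf[w]) for w in tf}


# Toy English input, for illustration only.
sentences = [
    "the farmer sells wheat",
    "the farmer buys seed",
    "wheat prices rise",
]
scores = tf_isf(sentences, stop_words={"the"})
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
```

A word that occurs in every sentence scores zero under this formulation, which is why it favors terms concentrated in a few sentences.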
TL;DR: It is suggested, with several examples, that ST's may convey a unique slant of a document's content not normally represented in standard indexing vocabularies.
Abstract: This paper first describes the methodology of journal descriptor (JD) indexing, based on human indexing at the journal level using only 127 descriptors, and applying statistical methods that associate this journal indexing with text words in a training set of MEDLINE® citations. These associations form the basis for automatic indexing of documents outside the training set. The paper then presents the new technique of semantic type (ST) indexing, based on JD indexing associated with each of 134 ST's, and applying the standard cosine coefficient measure to compare the similarity between the JD indexing of a document and the JD indexing of each ST. The ST indexing of the document is the list of ST's ranked in decreasing order of similarity between the JD indexing of the document and the JD indexing of the ST's. Discussion of the potential usefulness and application of the very general indexing provided by JD's and ST's comprises the remainder of the paper. JD's have been used for more than thirty years to search MEDLINE by discipline, and discipline-based indexing is in evidence on the Web. It is suggested, with several examples, that ST's may convey a unique slant of a document's content not normally represented in standard indexing vocabularies. Use of ST indexing to rank retrieved output is mentioned as a possible application. Notwithstanding the importance of methodology and performance issues, the intent of this paper is to explore questions of the potential utility and applicability of JD and ST indexing.
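The ranking step described in this abstract, which compares a document's JD indexing with each ST's JD indexing by the cosine coefficient and sorts ST's by decreasing similarity, might look roughly like the sketch below. The function names, the dict-based vector representation, and the toy weights are assumptions made for illustration; they are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_semantic_types(doc_jd, st_jd):
    """Return ST names ordered by decreasing similarity to the document."""
    return sorted(st_jd, key=lambda st: cosine(doc_jd, st_jd[st]), reverse=True)


# Toy JD weight vectors; descriptor names and weights are illustrative only.
doc_jd = {"Cardiology": 0.8, "Pharmacology": 0.2}
st_jd = {
    "Disease or Syndrome": {"Cardiology": 0.7, "Pharmacology": 0.1},
    "Pharmacologic Substance": {"Cardiology": 0.1, "Pharmacology": 0.9},
}
ranking = rank_semantic_types(doc_jd, st_jd)
```

In the paper's setting each vector would have one component per journal descriptor (127 in total) and there would be 134 ST vectors to rank.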
TL;DR: The complete indexing process of the registers of a French census dating back to more than a hundred years is described, from image analysis to the integration into the information system, in the context of probate genealogy.
Abstract: This paper describes the complete indexing process of the registers of a French census dating back more than a hundred years, from image analysis to the integration into the information system, in the context of probate genealogy. The documents of interest are composed of a table of personal information in which the cells containing the first name, the surname and the relation to head of household must be extracted and recognized. More than 30 million cells were processed and their content either directly integrated into the information system or sent to keyers for manual validation, allowing an automation rate of 80% while keeping the error rate below 15% on average. Based on this project, we have started the development of a generic platform for table-based historical document processing, including new functionalities and a more generic and user-friendly table model definition interface.
TL;DR: This work has tested the state-of-the-art automatic indexing tool Maui on Finnish texts using three stemming and lemmatization algorithms and tested it with documents and vocabularies of different domains.
Abstract: Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly inflected language requires word form normalization that goes beyond rule-based stemming algorithms. We have tested the state-of-the-art automatic indexing tool Maui on Finnish texts using three stemming and lemmatization algorithms and tested it with documents and vocabularies of different domains. Both of the lemmatization algorithms we tested performed significantly better than a rule-based stemmer, and the subject indexing quality was found to be comparable to that of human indexers.
TL;DR: An information-processing analysis of biologically inspired design, a scheme for representing knowledge of designs of biological systems, and a computational technique for automatic indexing and retrieval of biological analogues of engineering problems are provided.
Abstract: Biologically inspired design is an increasingly popular design paradigm. Biologically inspired design differs from many traditional case-based reasoning tasks because it employs cross-domain analogies. The wide differences in biological source cases and technological target problems present challenges for determining what would make good or useful schemes for case representation, indexing, and adaptation. In this paper, we provide an information-processing analysis of biologically inspired design, a scheme for representing knowledge of designs of biological systems, and a computational technique for automatic indexing and retrieval of biological analogues of engineering problems. Our results highlight some important issues that a case-based reasoning system must overcome to succeed in supporting biologically inspired design.
TL;DR: In this paper, the authors propose a method for extracting text block information from a page file, wherein the text block information comprises page text block information and manuscript text block information, and for judging whether the default page text block information in the text block information has been extracted.
Abstract: The embodiment of the invention discloses an information extraction method and an information extraction device, relating to the technical field of information extraction, and aiming to solve the problem that in the prior art, the default text block information cannot be extracted from the page information and manuscript information of a newspaper through automatic indexing. The information extraction method disclosed by the embodiment of the invention comprises the following steps: extracting text block information from a page file, wherein the text block information comprises page text block information and manuscript text block information; judging whether the default page text block information in the text block information has been extracted; if the default page text block information has not been extracted, extracting the default page text block information; and if the default page text block information has been extracted, extracting the default manuscript text block information. By using the method and device disclosed by the embodiment of the invention, the workload of the indexing personnel can be reduced, and the accuracy of indexing can be enhanced.
TL;DR: These applications demonstrate CUMLS's practicality and validity for developing knowledge organizations and services for medical information resources in network environments.
Abstract: The Chinese Unified Medical Language System (CUMLS), which consists of three components, namely a medical vocabulary, a semantic network and lexical tools, integrates more than ten biomedical sources such as biomedical thesauri, classifications, terminologies, and text words of biomedical literature. Based on CUMLS, applications including automatic indexing, knowledge navigation, intelligent retrieval, etc., are realized. These applications demonstrate CUMLS's practicality and validity for developing knowledge organizations and services for medical information resources in network environments.
TL;DR: The increasing availability of digital images, video, and audio has created exciting new research challenges on the organization of multimedia data for a variety of purposes, including the emerging MPEG-7 standard, which aims at standardizing tools for describing multimedia data.
Abstract: The increasing availability of digital images, video, and audio has created exciting new research challenges on the organization of multimedia data for a variety of purposes. While some of these challenges relate to computational techniques (e.g., automatic extraction of visual features for automatic indexing of visual data), others are conceptual in nature (e.g., design of templates for manual indexing of visual data). The key issues are what to index from the data, how to perform the indexing of the data, and how to organize the indices obtained. The indices used to describe content as well as the organization of those indices have a tremendous impact on applications, particularly on large digital libraries where different types of media need to be stored and accessed. Relevant efforts in this direction include the emerging MPEG-7 standard [5], which aims at standardizing tools for describing multimedia data.
TL;DR: A new approach to creating semantic skolem indexing for multiple documents is presented; it automatically indexes all the documents into a single knowledge representation used to retrieve answers to users' queries.
Abstract: The rapid growth of digital data and of users' information needs has made the demand for automatic indexing more important than before. Keyword-based indexing has proven unable to cater for current needs. Thus, this paper presents a new approach to creating semantic skolem indexing for multiple documents that automatically indexes all the documents into a single knowledge representation. The skolem indexing matrix is then incorporated into a question answering system to retrieve the answer to a user's query.
TL;DR: This chapter describes the architecture of a system designed to semi-automatically and intelligently index huge repositories of special effects video clips and uses a network of scalable ontologies to represent the semantic content to further enable intelligent retrieval.
Abstract: Automatic indexing and retrieval of digital data poses major challenges. The main problem arises from the ever increasing mass of digital media and the lack of efficient methods for indexing and retrieval of such data based on the semantic content rather than keywords. To enable intelligent web interactions, or even web filtering, we need to be capable of interpreting the information base in an intelligent manner. For a number of years research has been ongoing in the field of ontological engineering with the aim of using ontologies to add such (meta) knowledge to information. In this paper, we describe the architecture of a system (Dynamic REtrieval Analysis and semantic metadata Management (DREAM)) designed to automatically and intelligently index huge repositories of special effects video clips, based on their semantic content, using a network of scalable ontologies to enable intelligent retrieval. The DREAM Demonstrator has been evaluated as deployed in the film post-production phase to support the process of storage, indexing and retrieval of large data sets of special effects video clips as an exemplar application domain. This paper provides its performance and usability results and highlights the scope for future enhancements of the DREAM architecture which has proven successful in its first and possibly most challenging proving ground, namely film production, where it is already in routine use within our test bed Partners' creative processes.
TL;DR: A music genre classification system which was a winning solution in the Music Information Retrieval ISMIS 2011 contest is described, which consisted of a powerful ensemble classifier using the Error Correcting Output Coding coupled with an original, multi-resolution clustering and iterative relabelling scheme.
Abstract: With the large amounts of multimedia data produced, recorded and made available every day, there is a clear need for well-performing automatic indexing and search methods. This paper describes a music genre classification system, which was a winning solution in the Music Information Retrieval ISMIS 2011 contest. The system consisted of a powerful ensemble classifier using the Error Correcting Output Coding coupled with an original, multi-resolution clustering and iterative relabelling scheme. The two approaches used together outperformed other competing solutions by a large margin, reaching the final accuracy close to 88%.
TL;DR: This paper proposes a semiautomatic approach to indexing and searching images of old Arabic documents without recognizing their contents, in order to deal with the inability of recognition techniques to understand the contents of old documents.
Abstract: Indexing is a widely used technique in retrieval systems. Its goal is to extract and represent the meaning of a document so that it can be found by the user. We can cite two types of indexing: manual indexing and automatic indexing. Automatic indexing requires character and word recognition engines, which work only on the texts of contemporary documents. In this paper, we propose a semiautomatic approach to indexing and searching images of old Arabic documents without recognizing their contents, in order to deal with the inability of recognition techniques to understand the contents of old documents. The proposed approach relies on the representation of the documents according to the structural features of their indexes, chosen manually from each document by an expert. The approach is tested on a sample of approximately 1100 envelopes and shows good results. Keywords: indexing, old documents, structural features, document analysis
TL;DR: The ontology has been proposed to capitalize the concept of knowledge as NP and its semantic relations and indexing and information retrieval processes based on noun phrase (NP) and its semantics representation are developed.
Abstract: In the context of the valorization of Tunisian patrimony, we propose an approach to represent semantic properties on contents: heterogeneous information (multimedia) concerning patrimony objects. We develop indexing and information retrieval (IR) processes based on noun phrase (NP) and its semantic representation. These processes use natural language processing (NLP) to take into account the NPs structure organization. In view of this study, the ontology has been proposed to capitalize the concept of knowledge as NP and its semantic relations.
TL;DR: This doctoral research focuses on studying the semantic relations between social tags, items and content creators through co-occurrence analysis, social network analysis and information visualization, thus revealing the role played by social tags in representing and classifying contents and creators, and implications they might have for facilitating information seeking practice.
Abstract: As evidenced by the growing popularity of collaborative tagging sites like librarything, last.fm and del.icio.us, social tagging has provided a social and information organizing platform that warrants public attention and academic investigation alike. This doctoral research focuses on studying the semantic relations between social tags, items and content creators through co-occurrence analysis, social network analysis and information visualization, thus revealing the role played by social tags in representing and classifying contents and creators, and implications they might have for facilitating information seeking practice, particularly knowledge discovery and information summary, and as a result, helping the design of information retrieval and browsing interface. User-oriented studies are conducted to evaluate the advantage of visual and presentational features based on tagging analysis over existing constructs such as tag clouds in performing high-level information seeking tasks. The social tagging paradigm is widely considered an extension beyond keyword-based indexing and hierarchical classification schemes. The new massive manual indexing method characterized by social tagging differs from automatic indexing that lays the foundation of modern information retrieval in that its manual nature obviates the common pitfalls of computer-based automatic indexing. It also complements traditional manual indexing since tag word distribution reflects the opinions of a large number of people with various background and knowledge instead of a limited number of domain experts who are dominant in the classification and cataloging undertakings. Parallel to the observation that an individual’s social identity is defined by the collectivities to which the individual belongs, the topical, temporal, geographic, and stylistic features
TL;DR: This work proposes a straightforward linguistic indexer that can be used as a basis for users' own developments and for experiments and analyses to explore their own documents and KOSs; it uses state-of-the-art information retrieval techniques and hence forms a suitable baseline for evaluations.
Abstract: Automatic KOS-based indexing, i.e. indexing based on a restricted, controlled vocabulary, a thesaurus or a classification, can play an important role in closing the gap between the intellectually, high-quality indexed publications and the mass of unindexed publications. Especially for unknown, heterogeneous publications, like web publications, simple processes that do not rely on manually created training data are needed. With this contribution, we propose a straightforward linguistic indexer that can be used as a basis for users' own developments and for experiments and analyses to explore their own documents and KOSs; it uses state-of-the-art information retrieval techniques and hence forms a suitable baseline for evaluations. Finally, it is free and open source.
TL;DR: This paper presents a new method based on ontology formation and fuzzy recognition of digital pictures that enables automatic creation of the fish geometric ontology and automatic indexing into the existing Semantic Web.
Abstract: This paper presents a new method based on ontology formation and fuzzy recognition of digital pictures. Ontology creation and document indexing are well-known bottlenecks for integrating semantic services and for the Semantic Web, and thus the new method enables automatic creation of the fish geometric ontology and automatic indexing into the existing Semantic Web. Fuzzy sets and fuzzy recognition are used to decide whether a new fish picture belongs to an existing training set, here with the carp as an example. Training samples are used to set up the fuzzy set and membership functions. The existing way of fish ontology formation can be integrated with the new method, and the existing work for the fish web can be used.
TL;DR: A new weighting function is established in this article, comprehensively combining with four important factors such as the weight value of subject words, the classes, the specificity, and the cohesion relation to standardize the indexing of the subject words of the official document.
Abstract: Based on the traditional word frequency statistical function, a new weighting function is established in this article, comprehensively combining four important factors: the weight value of subject words, the classes, the specificity, and the cohesion relation. This new method could standardize the indexing of the subject words of official documents, enhance work efficiency, realize automatic indexing, and reduce mistakes caused by personal factors. In addition, the program design and the implementation of this method in a computer language are also introduced in this article.
TL;DR: Using the horizontal and vertical white-spaces present in any document, independent regions were identified, and inherent characteristic disparities were used to distinguish pictures from text and section-headings from the explanations that follow them; it was verified that the method implemented was language independent.
Abstract: Image processing techniques have been used over the years to convert printed material into electronic form. In our work we exploit the fact that some applications may find such conversions redundant and yet satisfactorily meet the demands of the end user. Using the horizontal and vertical white-spaces present in any document, independent regions of text, pictures, tables etc. could be identified. Inherent characteristic disparities were then used to distinguish pictures from text, and section-headings from the explanations that follow them. A table of contents, showing the heading and the associated page number, was generated and displayed on the browser. Each heading was hyperlinked to the corresponding page of the original document. HTML code was written dynamically, using file handling techniques in MATLAB to accommodate for variable number of headings obtained for different documents and also from different pages of a single document. The platform thus developed was tested on various languages and it was verified that the method implemented was language independent.
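The whitespace-based region detection described in this abstract can be illustrated with a simple row-projection sketch. The authors worked in MATLAB and do not spell out their algorithm, so the Python function below, its name `split_on_blank_rows`, the `min_gap` parameter, and the 0/1 list-of-lists page representation are all assumptions made for illustration.

```python
def split_on_blank_rows(page, min_gap=1):
    """Return (start, end) row ranges of content separated by blank rows.

    `page` is a list of rows; each row is a sequence of 0 (white) / 1 (ink).
    `end` is exclusive. A gap of at least `min_gap` blank rows ends a region.
    """
    regions = []
    start = None      # first row of the region currently being built
    blank_run = 0     # consecutive blank rows seen since the last ink row
    for i, row in enumerate(page):
        if any(row):
            if start is None:
                start = i
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                regions.append((start, i - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # Close the last region, excluding any trailing blank rows.
        regions.append((start, len(page) - blank_run))
    return regions


# Tiny binary "page": two ink bands separated by one blank row.
page = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
row_regions = split_on_blank_rows(page)
# Vertical white-space is handled the same way on the transposed page.
column_regions = split_on_blank_rows(list(zip(*page)))
```

Applying the function alternately to rows and columns of each detected region gives a recursive segmentation of the kind often used for page layout analysis.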
TL;DR: An efficient storage method for posting lists and a novel ranking technique for ordering relevant videos for efficient retrieval are proposed.
Abstract: With the increasing need for indexing of persons in large video databases, automatic indexing, which takes advantage of automatic tagging instead of time-consuming and costly manual tagging, has been attracting great interest. However, an automatic indexing approach should provide a degree of recognition proximity because it cannot identify persons with 100% accuracy. In this paper, we propose an efficient storage method for posting lists and a novel ranking technique for ordering relevant videos for efficient retrieval. Through experimental evaluations we have shown that our storage method exhibits good performance in compressing the posting list. We have also shown that the proposed ranking method is effective for finding relevant videos.
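The abstract does not say how the posting lists are compressed. As a point of reference only, the sketch below shows a standard, widely used technique, gap (delta) encoding followed by variable-byte coding; it is not the authors' storage method, and the function names and sample ids are hypothetical.

```python
def encode_postings(doc_ids):
    """Gap-encode ids, then pack each gap as a variable-length byte run.

    Assumes `doc_ids` is sorted in strictly increasing order with ids >= 1.
    """
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)  # final byte of this gap, continuation bit clear
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings: rebuild the sorted id list from the bytes."""
    doc_ids = []
    current = 0   # running sum of gaps, i.e. the last decoded id
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


# Hypothetical video ids in which one person was tagged.
video_ids = [3, 7, 130, 131, 100000]
packed = encode_postings(video_ids)
restored = decode_postings(packed)
```

Small gaps between neighbouring ids fit in one byte each, which is where the saving over fixed-width integers comes from; a scheme that also stores recognition confidence per posting would need an additional field.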
TL;DR: The present article uses L&E as the point of departure for demonstrating in what way more social and interpretative understandings may provide fruitful improvements for research in indexing, knowledge organization, and information retrieval.