TL;DR: The KIM platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases.
Abstract: The KIM platform provides a novel Knowledge and Information Management infrastructure and services for automatic semantic annotation, indexing, and retrieval of documents. It provides mature infrastructure for scalable and customizable information extraction (IE) as well as annotation and document management, based on GATE. In order to provide a basic level of performance and allow easy bootstrapping of applications, KIM is equipped with an upper-level ontology and a knowledge base providing extensive coverage of entities of general importance. The ontologies and knowledge bases involved are handled using cutting-edge Semantic Web technology and standards, including RDF(S) repositories, ontology middleware and reasoning.
From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases. This paper presents the KIM platform, with emphasis on its architecture, interfaces, tools, and other technical issues.
TL;DR: In this article, techniques for fast indexing of live video broadcasts are provided that incorporate both efficient manual processing and automatic indexing steps to generate a semantically meaningful and practically usable highlight hierarchy of broadcast television programs in real time.
Abstract: Techniques for fast indexing of live video broadcasts are provided which incorporate both efficient manual processing and automatic indexing steps to generate a semantically meaningful and practically usable highlight hierarchy of broadcast television programs in real time. In one technique, a list of predefined keywords is provided, describing the highlights, and the manual marking process can be implemented by just a few mouse clicks. A technique is provided for grouping highlights into a semantic hierarchy in real time. A technique is provided for efficiently generating highlight metadata on live broadcast programs, using a coarse-to-fine indexing methodology so that an operator can quickly generate highlight summaries of live broadcast programs.
TL;DR: A methodology is presented for evaluating, on real data, the performance of a system that automatically detects critical parts of underwater video, online or during post-mission tape analysis; an existing context-change detection algorithm is studied and benchmarked on real underwater data as an illustration.
Abstract: It is often the case that only a few sparse sequences of long videos from scientific underwater surveys actually contain important information for the expert. Locating such sequences is time consuming and tedious. A system that automatically detects those critical parts, online or during post-mission tape analysis, would alleviate the expert workload and improve data exploitation. In this paper, a methodology for evaluating the performance of such a system on real data is presented. Interesting sequences are started by changes of visual context. An algorithm to detect significant context changes in benthic videos in real time has been presented by Lebart et al. in 2000. It is used as an illustration for this methodology - its performance is studied and benchmarked on real underwater data, ground truthed by an expert biologist. Various issues relating to the complexity of the problems of automatically analyzing underwater video are also discussed.
TL;DR: A combination of speech recognition and automated indexing methods substitutes for current transcription and indexing practices, showing the potential benefits of combining speech recognition techniques and automatic indexing.
TL;DR: A method of improving the quality of automatically extracted noun phrases by employing prior knowledge during the HMM training procedure for the tagger can greatly improve the quality and relevance of the extracted phrases, thereby enabling greater accuracy in downstream literature mining tasks.
Abstract: Motivation: The recent explosion of interest in mining the biomedical literature for associations between defined entities such as genes, diseases and drugs has made apparent the need for robust methods of identifying occurrences of these entities in biomedical text. Such concept-based indexing is strongly dependent on the availability of a comprehensive ontology or lexicon of biomedical terms. However, such ontologies are very difficult and expensive to construct, and often require extensive manual curation to render them suitable for use by automatic indexing programs. Furthermore, the use of statistically salient noun phrases as surrogates for curated terminology is not without difficulties, due to the lack of high-quality part-of-speech taggers specific to medical nomenclature. Results: We describe a method of improving the quality of automatically extracted noun phrases by employing prior knowledge during the HMM training procedure for the tagger. This enhancement, when combined with appropriate training data, can greatly improve the quality and relevance of the extracted phrases, thereby enabling greater accuracy in downstream literature mining tasks.
TL;DR: The paper reviews natural language processing techniques for term acquisition and automatic indexing and presents in depth FASTR, a corpus processor for the recognition, normalization, and acquisition of multi-word terms.
Abstract: We discuss the nature and the scope of linguistic (morphological, syntactic and semantic) variation of terms and its impact on two information retrieval tasks: term acquisition and automatic indexing. A review of natural language processing techniques existing in these two areas is given, along with an in-depth presentation of FASTR, a corpus processor for the recognition, normalization, and acquisition of multi-word terms.
TL;DR: To check whether discourse variables have an impact on modern information retrieval and classification algorithms, a functional framework for information analysis in an automated environment has been proposed and results demonstrate that n-grams does not appear to have a clear dependence on discourse variables, though the k-means classification algorithm does, but only on domain terminology and document structure.
Abstract: Researchers in indexing and retrieval systems have been advocating the inclusion of more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to carry out text analysis by means of linguistic and extralinguistic knowledge. Since the mid 80s, research has tended to pay more attention to context, giving discourse analysis a more central role. The research presented in this paper aims to check whether discourse variables have an impact on modern information retrieval and classification algorithms. In order to evaluate this hypothesis, a functional framework for information analysis in an automated environment has been proposed, where the n-grams (filtering) and the k-means and Chen's classification algorithms have been tested against sub-collections of documents based on the following discourse variables: "Genre", "Register", "Domain terminology", and "Document structure". The results obtained with the algorithms for the different sub-collections were compared to the MeSH information structure. These demonstrate that n-grams does not appear to have a clear dependence on discourse variables, though the k-means classification algorithm does, but only on domain terminology and document structure, and finally Chen's algorithm has a clear dependence on all of the discourse variables. This information could be used to design better classification algorithms, where discourse variables should be taken into account. Other minor conclusions drawn from these results are also presented.
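The abstract does not say how its n-grams are built or compared. As a minimal sketch, assuming character trigrams and a simple overlap measure (both choices, and the function names, are illustrative and not taken from the paper), an n-gram profile of a text could be computed like this:

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def profile_overlap(a: Counter, b: Counter) -> float:
    """Share of n-gram occurrences common to both profiles (0.0 to 1.0).

    This overlap measure is an assumption made for illustration only.
    """
    shared = sum((a & b).values())  # multiset intersection
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 0.0
```

Comparing such profiles across sub-collections split by genre or register is one way to probe whether a representation depends on a discourse variable.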
TL;DR: This article focuses on speeding up the information retrieval process in an Arabic document base by using a root-based hierarchical indexing model, and results demonstrate that a speed gain in the range of 50-100 can be achieved for typical queries.
Abstract: The world has recently witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Text data mining, as a m...
TL;DR: The Scene of Crime Information System's automatic image-indexing prototype goes beyond extracting keywords and syntactic relations from captions and applies advanced natural language processing techniques to text-based image indexing and retrieval to tackle crime investigation needs effectively and efficiently.
Abstract: The Scene of Crime Information System's automatic image-indexing prototype goes beyond extracting keywords and syntactic relations from captions. The semantic information it gathers gives investigators an intuitive, accurate way to search a database of cases for specific photographic evidence. Intelligent, automatic indexing and retrieval of crime scene photographs is one of the main functions of SOCIS, our research prototype developed within the Scene of Crime Information System project. The prototype, now in its final development and evaluation phase, applies advanced natural language processing techniques to text-based image indexing and retrieval to tackle crime investigation needs effectively and efficiently.
TL;DR: The approach is based on a multiple strategy that combines knowledge about the geometry of multiple views of the same scene, the extraction of low-level features, the detection of objects using the VA and domain knowledge.
Abstract: We address the problem of automatic indexing and organization of visual information through user interaction at multiple levels. Our work focuses on the following three important areas: (1) understanding of visual content and the way users search and index it; (2) construction of flexible computational methods that learn how to automatically classify images and videos from user input at multiple levels; (3) integration of generic visual detectors in solving practical tasks in the specific domain of consumer photography.
In particular, we present the following: (1) novel conceptual structures for classifying visual attributes (the Multi-Level Indexing Pyramid); (2) a novel framework for learning structured visual detectors from user input (the Visual Apprentice); (3) a new study of human eye movements in observing images of different visual categories; (4) a new framework for the detection of non-identical duplicate consumer photographs in an interactive consumer image organization system; (5) a detailed study of duplicate consumer photographs.
In the Visual Apprentice (VA), first a user defines a model via a multiple-level definition hierarchy (a scene consists of objects, object-parts, etc.). Then, the user labels example images or videos based on the hierarchy (a handshake image contains two faces and a handshake) and visual features are extracted from each example. Finally, several machine learning algorithms are used to learn classifiers for different nodes of the hierarchy. The best classifiers and features are automatically selected to produce a Visual Detector (e.g., for a handshake), which is applied to new images or videos.
In the human eye tracking experiments we examine variations in the way people look at images within and across different visual categories and explore ways of integrating eye tracking analysis with the VA framework.
Finally, we present a novel framework for the detection of non-identical duplicate consumer images for systems that help users automatically organize their collections. Our approach is based on a multiple strategy that combines knowledge about the geometry of multiple views of the same scene, the extraction of low-level features, the detection of objects using the VA and domain knowledge.
TL;DR: The morphological segmentation and normalization procedures, as well as the mappings from subwords to MeSH terms, are described, and results from an evaluation carried out on a German-language corpus are discussed.
Abstract: We consider three alternative procedures for the automatic indexing of medical documents using MeSH thesaurus identifiers as target units (document descriptors). Rather than considering complete words as the starting point of the indexing procedure, we here propose morphologically plausible subwords as basic units from which MeSH terms are derived. We describe the morphological segmentation and normalization procedures, as well as the mappings from subwords to MeSH terms, and discuss results from an evaluation carried out on a German-language corpus.
TL;DR: The research prototype, SOCIS, goes beyond keyword-based approaches and methods that extract syntactic relations from captions; it relies on advanced Natural Language Processing techniques in order to extract relational facts.
Abstract: We present a text-based approach for the automatic indexing and retrieval of digital photographs taken at crime scenes. Our research prototype, SOCIS, goes beyond keyword-based approaches and methods that extract syntactic relations from captions; it relies on advanced Natural Language Processing techniques in order to extract relational facts. These relational facts consist of a "pragmatic relation" and the entities this relation connects (triples of the form ARG1-REL-ARG2). In SOCIS, the triples are used as complex image indexing terms; however, the extraction mechanism is used not only for indexing purposes but also for image retrieval using free text queries. The retrieval mechanism computes similarity scores between query-triples and indexing-triples, making use of a domain-specific ontology.
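The abstract does not give the scoring formula used to compare query-triples with indexing-triples. The sketch below shows one plausible shape for such a comparison, under stated assumptions: the toy ontology `PARENT`, the path-based `concept_similarity`, and the averaging rule in `triple_similarity` are all hypothetical and not taken from SOCIS.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    arg1: str
    rel: str
    arg2: str


# Hypothetical toy ontology (child concept -> parent concept), for illustration only.
PARENT = {
    "knife": "weapon",
    "gun": "weapon",
    "weapon": "object",
    "table": "furniture",
    "furniture": "object",
}


def ancestors(concept: str) -> list[str]:
    """Return the concept followed by its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def concept_similarity(a: str, b: str) -> float:
    """1.0 for identical concepts, decreasing with the path length through
    the nearest common ancestor, and 0.0 when there is no common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for depth_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1.0 / (1 + depth_a + chain_b.index(node))
    return 0.0


def triple_similarity(query: Triple, index: Triple) -> float:
    """Average the two argument similarities; score 0.0 if the relations differ."""
    if query.rel != index.rel:
        return 0.0
    return (concept_similarity(query.arg1, index.arg1)
            + concept_similarity(query.arg2, index.arg2)) / 2
```

With this toy ontology, a query for a weapon on a table still matches a photograph indexed as a knife on a table, only with a lower score than an exact match.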
TL;DR: An automatic indexing method named PAI (Priming Activation Indexing) that extracts keywords expressing the author’s main point from a document based on the priming effect without using corpus, thesaurus, syntactic analysis, dependency relations between terms or any other knowledge except for stop-word list is proposed.
Abstract: This paper proposes an automatic indexing method named PAI (Priming Activation Indexing) that extracts keywords expressing the author's main point from a document based on the priming effect. The basic idea is that since the author writes a document emphasizing his/her main point, impressive terms born in the mind of the reader could represent the asserted keywords. Our approach employs a spreading activation model without using corpus, thesaurus, syntactic analysis, dependency relations between terms or any other knowledge except for stop-word list. Experimental evaluations are reported by applying PAI to journal/conference papers.
TL;DR: Experimental results show that topic segmentation using word conceptual vectors is superior to the conventional method using local word co-occurrence frequencies, and that the integrated segmentation provides better news story structures than would be possible with any single type of information.
Abstract: This paper describes an automatic multimedia content indexing system that includes acoustic segmentation, automatic speech recognition, topic segmentation, and video indexing features. The system is intended for indexing of multimedia news programs. Speech segments extracted from news content are delivered to the speech recognition module. The speech recognition result is segmented into topics using a segmentation algorithm based on word conceptual vectors. The indexing results derived from audio and speech information are integrated with video indexing results to extract the story structure. Experimental results show that topic segmentation using word conceptual vectors is superior to the conventional method using local word co-occurrence frequencies, and that the integrated segmentation provides better news story structures than would be possible with any single type of information.
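As a rough illustration of topic segmentation with word vectors, the sketch below places a boundary wherever adjacent segments have dissimilar mean vectors. The two-dimensional `WORD_VECTORS`, the threshold, and the function names are assumptions made for the example; the paper's conceptual vectors and segmentation algorithm are not specified in the abstract.

```python
import math

# Hypothetical 2-d "conceptual vectors"; a real system would derive far
# richer vectors from a lexical resource or corpus.
WORD_VECTORS = {
    "election": (1.0, 0.0), "vote": (0.9, 0.1), "minister": (0.8, 0.2),
    "goal": (0.0, 1.0), "match": (0.1, 0.9), "team": (0.2, 0.8),
}


def mean_vector(words):
    """Average the vectors of the known words in one recognised speech segment."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_boundaries(segments, threshold=0.5):
    """Return each index i where a topic boundary falls just before segments[i]."""
    vectors = [mean_vector(segment) for segment in segments]
    return [i for i in range(1, len(vectors))
            if cosine(vectors[i - 1], vectors[i]) < threshold]
```

For example, four segments that move from politics to sport yield a single boundary before the third segment.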
TL;DR: This paper presents an overview and recent results of the RAIVES project, a French research project on audio indexing, and presents speech/music segmentation, speaker tracking, and keywords detection.
Abstract: The "Invisible Web" is composed of documents which can not be currently accessed by Web search engines, because they have a dynamic URL or are not textual, like video or audio documents. For audio documents, one solution is automatic indexing. It consists in finding good descriptors of audio documents which can be used as indexes for archiving and search. This paper presents an overview and recent results of the RAIVES project, a French research project on audio indexing. We present speech/music segmentation, speaker tracking, and keywords detection. We also give a few perspectives of the RAIVES project.
TL;DR: The authors survey a number of research efforts that all deal with football commentary but differ in technology focus: retrieval, interpretation or generation of commentary and related (but not necessarily language oriented) research.
Abstract: Our interest in the computer processing of football commentary was first prompted by the EU/IST-funded project MUMIS (Multimedia Indexing and Searching), which started in 2000 [11, 18, 19]. This project's objective is to develop technology for automatic indexing of multimedia programme material (texts, news streams, speech transcripts) and to develop a user interface that supports the conceptual querying and browsing of related video content over the internet. One of the innovative features of MUMIS is that it aims at the disclosure of video archives by applying information extraction techniques originally developed for the textual domain. Information extraction is a technique that is typically suited for content in specific domains. In MUMIS the extraction technology is applied to the domain of football. The project requires the integration of lexicons, ontology and information extraction tools for this domain, and the development of merging algorithms to integrate the (incomplete) information coming from different sources. For example, the information extraction components should be able to extract some thirty different event types, using methods such as part-of-speech tagging, syntactic parsing, semantic tagging, and discourse analysis. Typical football events to be detected are: kick-off, penalty, goal, halftime, free-kick, etc. The user interface should help users to formulate queries that can be matched on the annotations generated by the extraction component and linked to the time-codes of the corresponding video fragments. In parallel to the MUMIS project we surveyed a number of research efforts that all deal with football commentary but for which the technology focus differs: retrieval, interpretation or generation of commentary and related (but not necessarily language oriented) research. There were two reasons for this investigation. One obvious reason is to find out how the new technology and tools can be adapted to similar applications.
The other is to see how the domain knowledge obtained for extraction purposes can be employed for other intelligent applications in the same domain. Since at the same time several of our M.Sc. students got interested in learning multi-agent systems for developing teams for the RoboCup leagues, we decided to stick close to the football domain in this survey as well. In this paper we start with the survey of the domain-specific research. The aim of the survey was to obtain a comprehensive view of the field that could guide the selection of new research themes. In the second part of the paper we give a short introduction to the MUMIS project as it can be embedded in the general football-related language technology research.
TL;DR: It may be determined that digitisation and automatic indexing and retrieval techniques do not at present offer an alternative to the textual subject descriptive process necessary for access to information stored in the form of moving imagery.
Abstract: Moving image media record much of the history of the twentieth century, and as such form an
important aspect of our cultural heritage. Although potentially of great importance to both the
education and commercial sectors, much of this store of knowledge is not accessible, because its
content is not documented.
Digitisation is being considered as a means of making historic footage more accessible by
allowing moving imagery to be displayed via the Internet. Further, digitisation of still and moving
imagery opens the possibility of relieving the time-consuming and expensive process of
descriptive cataloguing, by using automated indexing and retrieval techniques, based on the
physical attributes present in the imagery, such as colour, texture, shapes, spatial and spatiotemporal
distribution. These techniques, developed by the computer science community, are
generically known as Content Based Image Retrieval (CBIR).
But will this type of image retrieval answer moving image archive users' information
requirements? A project is being undertaken which researches the information needs of users of
such archives; one of the objectives of this project is to determine whether CBIR techniques can be
used to answer these requirements.
An analysis of requests for moving image footage received by eleven representative film
collections determined that nearly 70% of the requests were for footage of a uniquely named
person, group, place, event or time, and in many cases a combination of several of these facets.
These are data that require to be documented in words.
From this and other analyses, it may be determined that digitisation and automatic indexing and
retrieval techniques do not at present offer an alternative to the textual subject descriptive process
necessary for access to information stored in the form of moving imagery.
TL;DR: An empirical investigation was conducted that aimed at assessing and evaluating the use of automatic indexing for the OPACs of the Austrian Library Network and the results include an increase of relevant hits at only moderately lower precision, the reduction of zero-hit results and insights into the role of existing subject headings.
Abstract: In the 1990s the German MILOS projects examined the suitability of an automatic linguistic indexing technique for library OPACs. Following this approach, an empirical investigation was conducted that aimed at assessing and evaluating the use of automatic indexing for the OPACs of the Austrian Library Network. As most users prefer to do their OPAC searches in the basic index the study focused on the effects of enriching this index with automatically generated terms. For this purpose an Aleph 500 OPAC consisting of a representative random sample of records drawn from the Austrian Union Catalogue was used for searching 100 queries in the basic index before and after adding the new index terms. The results include an increase of relevant hits at only moderately lower precision, the reduction of zero-hit results and insights into the role of existing subject headings.
TL;DR: The indexing device of an automatic indexing milling machine is described, consisting of a drive incomplete gear and a driven incomplete gear that engage with each other.
Abstract: The utility model discloses the indexing device of an automatic indexing milling machine, mainly consisting of a drive incomplete gear and a driven incomplete gear that engage with each other. A drive shaft and a driven shaft extend out of an indexing box of the milling machine; the drive incomplete gear is sleeved on the external end of the drive shaft, and the driven incomplete gear is sleeved on the external end of the driven shaft. Owing to the external mounting of the indexing device and the indexing implemented by the matching incomplete gears, the whole indexing device has a simple structure, indexes precisely and reliably, and is extremely easy to replace and maintain. The incomplete gears can be conveniently replaced with ones of corresponding specifications according to the types of screwdriver heads to be processed, so that one machine can process many types of screwdriver heads, which widens the applicable range and reduces manufacturing cost.
TL;DR: A statistical method for inserting periods into raw speech transcriptions for improving the readability and the effectiveness of the automatic extraction of key sentences from lecture audio archives is presented.
Abstract: Automatic extraction of key sentences from lecture audio archives is addressed. The method makes use of the characteristic expressions used in initial utterances of sections, which are defined as discourse markers and derived in an unsupervised manner based on word statistics. The statistics of the discourse markers is then used to define the importance of the sentences. It is also combined with the conventional tf-idf measure for content words. Experimental results confirm the effectiveness of the method using the discourse markers and its combination with the keyword-based method. We also present a statistical method for inserting periods into raw speech transcriptions for improving the readability.
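The combination of a discourse-marker score with tf-idf could look roughly like the sketch below. It is an illustration under assumptions: the `MARKER_WEIGHTS` table is hypothetical (the paper derives its markers from word statistics without supervision), and the linear mix controlled by `alpha` is an assumed combination rule, not the paper's.

```python
import math
from collections import Counter

# Hypothetical marker weights, hand-written here purely for illustration.
MARKER_WEIGHTS = {"next": 1.0, "now": 0.8, "first": 0.9}


def tfidf_scores(sentences):
    """Sum of tf-idf over the words of each sentence, treating each sentence as a document."""
    n = len(sentences)
    df = Counter(word for sentence in sentences for word in set(sentence))
    scores = []
    for sentence in sentences:
        tf = Counter(sentence)
        scores.append(sum(count * math.log(n / df[word]) for word, count in tf.items()))
    return scores


def marker_scores(sentences):
    """Sum of the weights of the discourse markers that occur in each sentence."""
    return [sum(MARKER_WEIGHTS.get(word, 0.0) for word in sentence) for sentence in sentences]


def rank_sentences(sentences, alpha=0.5):
    """Return sentence indices sorted by combined score, best first."""
    tfidf = tfidf_scores(sentences)
    top = max(tfidf, default=0.0) or 1.0  # avoid dividing by zero
    markers = marker_scores(sentences)
    combined = [alpha * m + (1 - alpha) * t / top for m, t in zip(markers, tfidf)]
    return sorted(range(len(sentences)), key=lambda i: combined[i], reverse=True)
```

Setting `alpha` to 0 or 1 recovers the keyword-only or marker-only ranking, which makes it easy to compare the two signals separately.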
TL;DR: This paper designs and implements a concept-based image retrieval system using feature information, more specifically, edge histogram description and demonstrates that this approach makes a favorable comparison with an approach based on color or edge features.
Abstract: In this paper, we design and implement a concept-based image retrieval system using feature information, more specifically, edge histogram description. The general edge histogram framework is a novel index mechanism which allows us to describe the content of images. However, the framework has a significant drawback: it cannot accommodate concept-based retrieval. Even if images are conceptually related to a user query, the framework may judge them irrelevant because their features differ from each other. Our system adapts an edge histogram descriptor and includes a knowledge base used for capturing concepts from images. In the knowledge base, a concept is expressed as a set of templates, each described by an edge histogram common to images that represent the concept well. The templates can be generated by clustering the training images related to a concept. Consequently, since an image can also be matched against the templates, our system is able to support an automatic mechanism for indexing the image with the concept. The indexing mechanism enables users to retrieve the images related to a query formulated with their intended concepts. In addition, we demonstrate that our concept-based approach compares favorably with an approach based on color or edge features.
TL;DR: An automatic Chinese information index system based on a network environment is composed of a subject table for the whole profession, a universal Chinese splitting rule library, special splitting rule libraries for each profession, a universal obsolete character library, and special obsolete character libraries for each profession.
Abstract: An automatic Chinese information index system based on a network environment is composed of a subject table for the whole profession, a universal Chinese splitting rule library, special splitting rule libraries for each profession, a universal obsolete character library, special obsolete character libraries for each profession, a Chinese geographic name library, a geographic name splitting rule library, an index inference rule library, and special index inference rule libraries for each profession.
TL;DR: An automatic mechanism for constructing the theme melody index from a large music database is suggested, and a prototype system shows how the theme melody index can be used for content-based music retrieval.
Abstract: In traditional content-based music information retrieval systems, users may face long response times, since these systems mostly perform syntactic processing to match the query melody against the whole melodies of the underlying music database. Hence, there has been a growing need for a theme melody index that supports quick retrieval of the music relevant to a user's query melody. In this paper, we suggest an automatic mechanism for constructing the theme melody index from a large music database and also show how the theme melody index can be used for content-based music retrieval by implementing a prototype system.
TL;DR: The article concludes by pointing out that the use of standards, like MPEG-7, can promote the development of new and richer applications based on video.
Abstract: After an initial discussion of the problems presented by image and video indexing as compared to text indexing, the authors describe some of the basic techniques for automatic video indexing. The content-based retrieval paradigm and some automatic methods for segmentation and key-frame identification are further described. Certain low-level parameters for identifying an image are also introduced. The authors discuss the drawbacks of such automatic methods based solely on the image and give examples from projects using accompanying information as well, such as audio and captions. The article concludes by pointing out that the use of standards, like MPEG-7, can promote the development of new and richer applications based on video.
TL;DR: Spectro-temporal sound representation is used for the purpose of automatic musical instrument recognition and Telescopic vector trees are used jointly with FS-trees to construct a new Query Answering System (QAS) for audio data.
Abstract: Automatic content extraction from multimedia files, based on both manual and automatic indexing, has been extensively explored. However, in the domain of musical data, automatic content description of musical sounds has not yet been broadly investigated and still needs intensive research. In this paper, spectro-temporal sound representation is used for the purpose of automatic musical instrument recognition. Assuming that musical instruments can be learned in terms of a group of features, and that automatic or manual indexing of an audio file is done on the basis of those features, Frame Segment Trees (FS-trees) can be used to identify segments of audio marked by the same indexes. Telescopic vector trees (TV-trees) are known from their applications in text processing and, recently, in data clustering algorithms. In this paper, we use them jointly with FS-trees to construct a new Query Answering System (QAS) for audio data. Audio segments are returned by QAS as answers to user queries. A heuristic strategy for building adaptable TV-trees is proposed.
TL;DR: A new approach to the automatic semantic indexing of digital photographs based on the extraction of logic relations from their textual descriptions using an ontology for the domain of application is presented.
Abstract: In this paper we present a new approach to the automatic semantic indexing of digital photographs based on the extraction of logic relations from their textual descriptions. The method is based on shallow parsing and propositional analysis of the descriptions using an ontology for the domain of application. We describe the semantic representation formalism, the ontology, and the algorithms involved in the automatic derivation of semantic indexes from texts linked to images. The method has been integrated into the Scene of the Crime Information System, a crime management system for storing, indexing and retrieval of crime information.
TL;DR: The process by which words are extracted from text corpora and a thesaurus for storage on the basis of specified conditions is described and it is concluded that the compiled dictionaries are of practical use for indexing since the recall is more important in that case.
TL;DR: A server automatically indexes audio-textual (DAT) and audio-visual digital documents with a difficulty of comprehensibility index (IDC) for each document, based on an index (IVE) obtained by comparing the document's elocution speed with at least one threshold.
Abstract: The server (SD) automatically indexes audio-textual (DAT) and audio-visual digital documents in particular with a difficulty of comprehensibility index (IDC) of each specific document based on an index (IVE) resulting from a comparison of the elocution speed of the assessed document with at least a threshold. The difficulty of comprehensibility index may also depend on the numbers (nbFG) of predetermined grammatical structures included in each document as well as a vocabulary index (IVC) determined relative to predetermined glossaries. Each document is thus associated with a label assembling all the indices thereof to enable in particular students and teachers of a foreign language in front of their terminals (TE) to search for the documents in a base (SGBD) depending on their comprehension and their knowledge of the language of the documents.
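To make the indices concrete, here is a minimal sketch of how a speed index (IVE) and a vocabulary index (IVC) might be computed and assembled into a label. The threshold values, the glossary-level scheme, and all function names are assumptions made for illustration; the abstract names the indices but does not state how they are calculated or combined.

```python
def speed_index(word_count, duration_seconds, thresholds=(120.0, 160.0)):
    """IVE sketch: how many thresholds (in words per minute) the speech rate reaches.

    The threshold values are hypothetical.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    rate = word_count / (duration_seconds / 60.0)
    return sum(rate >= threshold for threshold in thresholds)


def vocabulary_index(words, glossaries):
    """IVC sketch: the first glossary level whose cumulative vocabulary covers
    every word, or len(glossaries) if no level does."""
    vocabulary = set(words)
    known = set()
    for level, glossary in enumerate(glossaries):
        known |= glossary
        if vocabulary <= known:
            return level
    return len(glossaries)


def document_label(words, duration_seconds, glossaries, nb_fg):
    """Assemble the indices into one label; nb_fg is the externally supplied
    count of predetermined grammatical structures found in the document."""
    return {
        "IVE": speed_index(len(words), duration_seconds),
        "IVC": vocabulary_index(words, glossaries),
        "nbFG": nb_fg,
    }
```

A search interface could then filter documents by comparing these label fields with a learner's declared level.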
TL;DR: Preliminary results show that although this method is not as precise as others in terms of resource categorization, it can significantly benefit indexing.
Abstract: This paper is positioned within the development of an automated indexing system for the CISMeF quality-controlled health gateway. For disambiguation purposes, we wish to perform text categorization prior to indexing. Hence, a global approach is necessary, in contrast to the classical analytical methods based on the analysis of keyword counts extracted from the text. The use of statistical compression models enables us to proceed without keyword extraction at this stage. Preliminary results show that although this method is not as precise as others in terms of resource categorization, it can significantly benefit indexing.
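The general idea of compression-based categorization can be sketched with an off-the-shelf compressor. The example below uses Python's `zlib` as a stand-in; the abstract does not name the statistical compression model used in the paper, and the category names and texts here are invented. A document is assigned to the category whose training text lets it be encoded in the fewest additional bytes.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: dict[str, str]) -> str:
    """Return the category whose training text makes the document cheapest to encode.

    zlib is a stand-in here, not the paper's compression model.
    """
    def extra_bytes(category: str) -> int:
        base = training[category]
        return compressed_size(base + " " + document) - compressed_size(base)

    return min(training, key=extra_bytes)


# Hypothetical training texts, one per category.
TRAINING = {
    "cardiology": "heart attack myocardial infarction coronary artery " * 20,
    "dermatology": "skin rash eczema dermatitis psoriasis " * 20,
}
```

No keywords are extracted at any point: the compressor's own model of each category's text does the work, which is what makes the approach global rather than analytical.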