TL;DR: This document is only a minor revision to the text/enriched MIME type, and is only intended to be used in the short term until other MIME types for text formatting in Internet mail are developed and deployed.
Abstract: MIME [RFC-1521] defines a format and general framework for the representation of a wide variety of data types in Internet mail. This document defines one particular type of MIME data, the text/enriched MIME type. The text/enriched MIME type is intended to facilitate the wider interoperation of simple enriched text across a wide variety of hardware and software platforms. This document is only a minor revision to the text/enriched MIME type that was first described in [RFC-1523] and [RFC-1563], and is only intended to be used in the short term until other MIME types for text formatting in Internet mail are developed and deployed.
TL;DR: This paper looks at what representation to use for documents and which extra information hidden in HTML pages to take into consideration to improve the classification task, and how to deal with the very high number of features of texts.
Abstract: As the Web expands exponentially, the need to bring some order to its content becomes apparent. Hypertext categorization, that is, the automatic classification of web documents into predefined classes, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and metadata provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) what representation to use for documents and which extra information hidden in HTML pages to take into consideration to improve the classification task, and (ii) how to deal with the very high number of features of texts. A hypertext dataset and four well-known learning algorithms (Naive Bayes, K-nearest neighbor, support vector machines and C4.5) were used to exploit the enriched text representation along with feature reduction. The results showed that enhancing the basic text content with HTML page keywords, title and anchor links improved the accuracy of the classification algorithms.
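The enriched representation described in this abstract (body text plus page keywords, title and anchor links) can be illustrated with a minimal sketch. It is not the paper's implementation: the class `EnrichedPageParser`, the function `enriched_features`, and the choice to prefix each token with its source field are all assumptions made for illustration, and the parser does not filter out script or style content.

```python
from collections import Counter
from html.parser import HTMLParser


class EnrichedPageParser(HTMLParser):
    """Hypothetical helper: collects body text, title, meta keywords, anchor text."""

    def __init__(self):
        super().__init__()
        self.fields = {"text": [], "title": [], "keywords": [], "anchor": []}
        self._open = []  # tracked tags (<title>, <a>) that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Keywords are assumed to be a comma-separated list in `content`.
            content = attrs.get("content") or ""
            self.fields["keywords"] += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in ("title", "a") and tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        words = data.lower().split()
        if "title" in self._open:
            self.fields["title"] += words
        else:
            self.fields["text"] += words
            if "a" in self._open:
                # Anchor text counts as body text and as a separate field.
                self.fields["anchor"] += words


def enriched_features(html):
    """Return a bag of words whose tokens are prefixed by their source field."""
    parser = EnrichedPageParser()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for field, tokens in parser.fields.items():
        for token in tokens:
            counts[f"{field}:{token}"] += 1
    return counts


if __name__ == "__main__":
    page = (
        '<html><head><title>Cheap Flights</title>'
        '<meta name="keywords" content="travel, airfare"></head>'
        '<body><p>Book now</p><a href="/deals">Flight deals</a></body></html>'
    )
    for feature, count in sorted(enriched_features(page).items()):
        print(feature, count)
```

Keeping the fields separate lets a classifier weight a word in the title differently from the same word in the body. The resulting counts could then be passed through a feature-reduction step before training any of the four learners named in the abstract.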
TL;DR: Using a personal computer based system, a standard text is processed graphically so as to display an enriched text that includes visual clues as to the pronunciation and/or meaning of words in the original text.
Abstract: Using a personal computer based system, a standard text is processed graphically so as to display an enriched text that includes visual clues as to the pronunciation and/or meaning of words in the original text. In one embodiment, the visual clues indicate the phonetic structure of words in the text. In another embodiment, pictograms are added to the text to indicate the meanings of the words in the text. A method of 'morphing' between the graphical display of standard and enriched texts is also disclosed.
TL;DR: A multilayered context enriched text translation interface includes a translation layer and a simulation layer, which comprises one or more text objects and displays mimicked views of an application GUI.
Abstract: A multilayered context enriched text translation interface includes a simulation layer comprising one or more text objects and a translation layer. The interface displays one or more mimicked views of an application GUI in the simulation layer. Subsequent to a user engaging a text object, the interface displays a prompt for a text translation of the text object within the translation layer. In certain embodiments, the mimicked views are graphical reproductions of the application GUI pages with functionality of one or more text objects of the application GUI disabled. In certain embodiments, the prompt includes an accentuation object to visually accentuate the text object, a text-editing object to receive the text translation of the text object, and a link object to visually connect the accentuation object and the text-editing object.
TL;DR: This thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles clustering documents with multiple feature spaces.
Abstract: This project is a step forward in the study of text mining where enhanced text representation with semantic information plays a significant role. It develops effective methods of entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmarking methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles clustering documents with multiple feature spaces.