About: Unicode is a research topic. Over its lifetime, 1360 publications have been published within this topic, receiving 13934 citations. The topic is also known as: The Unicode Standard & Unicode Standard.
TL;DR: The authors compared biLSTMs with word, character, and Unicode byte embeddings for POS tagging and showed that biLSTMs are less sensitive to training data size and label corruptions than previously assumed.
Abstract: Bidirectional long short-term memory (biLSTM) networks have recently proven successful for various NLP sequence modeling tasks, but little is known about their reliance on input representations, target languages, data set size, and label noise. We address these issues and evaluate biLSTMs with word, character, and Unicode byte embeddings for POS tagging. We compare biLSTMs to traditional POS taggers across languages and data sizes. We also present a novel biLSTM model, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words. The model obtains state-of-the-art performance across 22 languages, and works especially well for morphologically complex languages. Our analysis suggests that biLSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed.
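To make the difference between character and Unicode byte inputs concrete, the following minimal sketch turns one word into the two kinds of ID sequence that could feed an embedding layer. It assumes UTF-8 as the byte encoding; the abstract does not say which encoding the authors used, and the function names `to_char_ids` and `to_byte_ids` are hypothetical.

```python
def to_char_ids(word):
    """Return one ID per character: the Unicode code point of each character."""
    return [ord(ch) for ch in word]


def to_byte_ids(word):
    """Return one ID per byte of the UTF-8 encoding (assumed), each in 0..255."""
    return list(word.encode("utf-8"))


# "naive" spelled with U+00EF (LATIN SMALL LETTER I WITH DIAERESIS)
word = "na\u00efve"
char_ids = to_char_ids(word)   # 5 IDs, one per character
byte_ids = to_byte_ids(word)   # 6 IDs, because U+00EF takes two bytes in UTF-8
```

Under this assumption a byte-level model needs at most 256 embedding rows for any language, at the cost of longer input sequences for non-ASCII text.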
TL;DR: GATE is a framework and graphical development environment that enables users to develop and deploy language engineering components and resources in a robust fashion; its thorough Unicode support allows applications and resources to be developed in multiple languages.
Abstract: In this paper we present GATE, a framework and graphical development environment which enables users to develop and deploy language engineering components and resources in a robust fashion. The GATE architecture has enabled us not only to develop a number of successful applications for various language processing tasks (such as Information Extraction), but also to build and annotate corpora and carry out evaluations on the applications generated. The framework can be used to develop applications and resources in multiple languages, based on its thorough Unicode support.
TL;DR: An open multilingual wordnet, with large wordnets for over 26 languages and smaller ones for 57 languages, is built by combining wordnets with open licences, data from Wiktionary, and the Unicode Common Locale Data Repository.
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
TL;DR: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement to the Uniform Resource Identifier (URI), together with a mapping from IRIs to URIs, so that IRIs can be used instead of URIs, where appropriate, to identify resources.
Abstract: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement of the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources. The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs.
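The mapping from IRIs to URIs mentioned in this abstract can be illustrated with a simplified sketch: each non-ASCII character is encoded as UTF-8 and every resulting byte is percent-encoded, while ASCII characters pass through unchanged. This covers only the percent-encoding step. It leaves out Unicode normalization and any special handling of internationalized host names, and the function name `iri_to_uri` is hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI as UTF-8 bytes.

    Simplified illustration only: ASCII characters (including any existing
    percent-escapes) are copied unchanged, and host names get no special
    treatment.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # One "%XX" triplet per byte of the character's UTF-8 encoding.
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


# U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9
uri = iri_to_uri("http://example.com/r\u00e9sum\u00e9")
```

Because the output contains only ASCII, software that understands URIs but not IRIs can still process the identifier.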
TL;DR: The authors release emoji2vec, pre-trained embeddings for all Unicode emoji learned from their descriptions in the Unicode emoji standard, which can be readily used in downstream social natural language processing applications alongside word2vec.
Abstract: Many current natural language processing applications for social media rely on representation learning and utilize pre-trained word embeddings. There currently exist several publicly available, pre-trained sets of word embeddings, but they contain few or no emoji representations even as emoji usage in social media has increased. In this paper we release emoji2vec, pre-trained embeddings for all Unicode emoji which are learned from their description in the Unicode emoji standard. The resulting emoji embeddings can be readily used in downstream social natural language processing applications alongside word2vec. We demonstrate, for the downstream task of sentiment analysis, that emoji embeddings learned from short descriptions outperform a skip-gram model trained on a large collection of tweets, while avoiding the need for contexts in which emoji need to appear frequently in order to estimate a representation.