Unicode

Papers published on a yearly basis

Papers

Proceedings Article•10.18653/V1/P16-2067•

Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss

Barbara Plank¹, Anders Søgaard², Yoav Goldberg³•Institutions (3)

University of Groningen¹, University of Copenhagen², Bar-Ilan University³

19 Apr 2016

TL;DR: The authors compare biLSTMs with word, character, and Unicode byte embeddings for POS tagging and show that biLSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed.

Abstract: Bidirectional long short-term memory (biLSTM) networks have recently proven successful for various NLP sequence modeling tasks, but little is known about their reliance on input representations, target languages, data set size, and label noise. We address these issues and evaluate bi-LSTMs with word, character, and Unicode byte embeddings for POS tagging. We compare bi-LSTMs to traditional POS taggers across languages and data sizes. We also present a novel biLSTM model, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words. The model obtains state-of-the-art performance across 22 languages, and works especially well for morphologically complex languages. Our analysis suggests that biLSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed.
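
To make the distinction between character and Unicode byte inputs concrete, here is a minimal sketch that splits a word into the two kinds of units. It is only an illustration: the abstract does not describe the paper's input pipeline, and the function names `char_units` and `utf8_byte_ids` are hypothetical. The use of UTF-8 as the byte encoding is also an assumption.

```python
def char_units(text: str) -> list[str]:
    """Split text into Unicode code points (a character-level view)."""
    return list(text)


def utf8_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return the byte values (a byte-level view).

    Assumption: UTF-8 is the byte encoding; every id falls in range(256).
    """
    return list(text.encode("utf-8"))


word = "S\u00f8gaard"  # "Søgaard"; the escape avoids normalization ambiguity
print(char_units(word))     # 7 character units
print(utf8_byte_ids(word))  # 8 byte ids, because "ø" takes two bytes in UTF-8
```

A byte-level vocabulary never exceeds 256 symbols regardless of the language, whereas a character-level vocabulary grows with the script; the price is longer input sequences for non-ASCII text.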

556 citations

Proceedings Article•10.3115/1073083.1073112•

GATE: an Architecture for Development of Robust HLT applications

Hamish Cunningham¹, Diana Maynard¹, Kalina Bontcheva¹, Valentin Tablan¹•Institutions (1)

University of Sheffield¹

6 Jul 2002

TL;DR: The paper presents GATE, a framework and graphical development environment that enables users to develop and deploy language engineering components and resources in a robust fashion; its thorough Unicode support allows applications and resources to be developed in multiple languages.

Abstract: In this paper we present GATE, a framework and graphical development environment which enables users to develop and deploy language engineering components and resources in a robust fashion. The GATE architecture has enabled us not only to develop a number of successful applications for various language processing tasks (such as Information Extraction), but also to build and annotate corpora and carry out evaluations on the applications generated. The framework can be used to develop applications and resources in multiple languages, based on its thorough Unicode support.

451 citations

Proceedings Article•

Linking and Extending an Open Multilingual Wordnet

Francis Bond, Ryan Foster

1 Aug 2013

TL;DR: The authors build an open multilingual wordnet, with large wordnets for over 26 languages and smaller ones for 57 languages, by combining wordnets with open licences, data from Wiktionary, and the Unicode Common Locale Data Repository.

Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.

355 citations

Internationalized Resource Identifiers (IRIs)

Michel Suignard

1 Jan 2005

TL;DR: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement to the Uniform Resource Identifier (URI), together with a mapping from IRIs to URIs, so that IRIs can be used instead of URIs, where appropriate, to identify resources.

Abstract: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement of the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources. The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs.
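
The IRI-to-URI mapping mentioned in the abstract can be sketched roughly as follows, assuming its core step is to percent-encode the UTF-8 bytes of every non-ASCII character. This is a simplified illustration, not a complete implementation of the specification: the function name `iri_to_uri` is hypothetical, and the sketch skips Unicode normalization and any special handling of the host component.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI as UTF-8 bytes.

    Simplifying assumptions: the input is already a well-formed IRI, ASCII
    characters (including existing %XX escapes) are passed through unchanged,
    and the host name is treated like any other component.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII is copied as-is
        else:
            # One %XX triplet per UTF-8 byte of the character
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.org/ros\u00e9"))
# prints http://example.org/ros%C3%A9
```

Because ASCII input is returned unchanged, an identifier that is already a URI maps to itself, which is consistent with the abstract's goal of avoiding incompatibilities with existing software.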

327 citations

Proceedings Article•10.18653/V1/W16-6208•

emoji2vec: Learning Emoji Representations from their Description

Ben Eisner¹, Tim Rocktäschel², Isabelle Augenstein², Matko Bošnjak², Sebastian Riedel²•Institutions (2)

Samsung¹, University College London²

1 Nov 2016

TL;DR: The authors release emoji2vec, pre-trained embeddings for all Unicode emoji, learned from their descriptions in the Unicode emoji standard, which can be readily used in downstream social natural language processing applications alongside word2vec.

Abstract: Many current natural language processing applications for social media rely on representation learning and utilize pre-trained word embeddings. There currently exist several publicly-available, pre-trained sets of word embeddings, but they contain few or no emoji representations even as emoji usage in social media has increased. In this paper we release emoji2vec, pre-trained embeddings for all Unicode emoji which are learned from their description in the Unicode emoji standard. The resulting emoji embeddings can be readily used in downstream social natural language processing applications alongside word2vec. We demonstrate, for the downstream task of sentiment analysis, that emoji embeddings learned from short descriptions outperform a skip-gram model trained on a large collection of tweets, while avoiding the need for contexts in which emoji need to appear frequently in order to estimate a representation.

314 citations

Year	Papers
2026	1
2025	25
2024	37
2023	92
2022	91
2021	50
