Tokenization (data security)

Papers published on a yearly basis

Papers

Proceedings Article • DOI 10.18653/V1/2020.ACL-DEMOS.14

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi¹, Yuhao Zhang¹, Yuhui Zhang², Jason Bolton¹, Christopher D. Manning¹

Stanford University¹, Tsinghua University²

16 Mar 2020

TL;DR: This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.

Abstract: We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

1,790 citations

Posted Content

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

28 Jan 2021 - arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes T2T-ViT, which uses a Tokens-to-Token transformation to progressively structurize an image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced.

Abstract: Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure represented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformer motivated by CNN architecture design after empirical study. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 3.0% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets by directly training on ImageNet. For example, T2T-ViT with comparable size to ResNet50 (21.5M parameters) can achieve 83.3% top-1 accuracy in image resolution 384×384 on ImageNet.
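
To make the token-length reduction described in this abstract concrete, the sketch below counts how many tokens remain after each step of aggregating overlapping neighborhoods of tokens. It models only the sequence-length arithmetic of a sliding-window split, not any learned layers. The function names and the kernel, stride, and padding values are assumptions chosen for illustration; they are not stated in the abstract.

```python
def tokens_per_side(side, kernel, stride, padding):
    """Number of sliding-window positions along one side of a square grid."""
    return (side + 2 * padding - kernel) // stride + 1


def t2t_token_counts(image_side, steps):
    """Return the total token count after each aggregation step.

    Each step merges every kernel x kernel neighborhood into one token,
    so the grid shrinks according to the stride of that step.
    """
    counts = []
    side = image_side
    for kernel, stride, padding in steps:
        side = tokens_per_side(side, kernel, stride, padding)
        counts.append(side * side)
    return counts


# Hypothetical (kernel, stride, padding) settings for three steps;
# these are illustrative assumptions, not values from the abstract.
ASSUMED_STEPS = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]

if __name__ == "__main__":
    print(t2t_token_counts(224, ASSUMED_STEPS))  # [3136, 784, 196]
```

Under these assumed settings, a 224×224 input shrinks from 3136 tokens to 196 over three steps, which illustrates how overlapping windows keep local structure while the sequence gets shorter.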

1,532 citations

Proceedings Article • DOI 10.18653/V1/S17-2126

DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis.

Christos Baziotis¹, Nikos Pelekis², Christos Doulkeridis²

National Technical University of Athens¹, University of Piraeus²

1 Aug 2017

TL;DR: Two deep-learning systems that competed at SemEval-2017 Task 4 “Sentiment Analysis in Twitter” are presented, which use Long Short-Term Memory networks augmented with two kinds of attention mechanisms, on top of word embeddings pre-trained on a big collection of Twitter messages.

Abstract: In this paper we present two deep-learning systems that competed at SemEval-2017 Task 4 “Sentiment Analysis in Twitter”. We participated in all subtasks for English tweets, involving message-level and topic-based sentiment polarity classification and quantification. We use Long Short-Term Memory (LSTM) networks augmented with two kinds of attention mechanisms, on top of word embeddings pre-trained on a big collection of Twitter messages. Also, we present a text processing tool suitable for social network messages, which performs tokenization, word normalization, segmentation and spell correction. Moreover, our approach uses no hand-crafted features or sentiment lexicons. We ranked 1st (tie) in Subtask A, and achieved very competitive results in the rest of the Subtasks. Both the word embeddings and our text processing tool are available to the research community.
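
As a rough illustration of what tokenization and word normalization for social network messages can involve, the sketch below splits a tweet-like string with a single regular expression and replaces URLs and user mentions with placeholder tokens. It is not the authors' tool: the pattern, the placeholder strings, and the function name are all assumptions, and segmentation and spell correction are left out.

```python
import re

# Order matters: earlier alternatives win, so URLs, mentions, hashtags and
# emoticons are matched before the generic word and punctuation patterns.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://\S+)
  | (?P<user>@\w+)
  | (?P<hashtag>\#\w+)
  | (?P<emoticon>[:;=][-']?[)(DPp/\\|])
  | (?P<word>\w+(?:'\w+)?)
  | (?P<punct>[^\w\s])
    """,
    re.VERBOSE,
)

# Hypothetical placeholder tokens used to normalize high-variance items.
PLACEHOLDERS = {"url": "<url>", "user": "<user>"}


def tokenize_tweet(text):
    """Split text into tokens, lowercasing words and masking URLs and mentions."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        kind = match.lastgroup  # name of the alternative that matched
        if kind in PLACEHOLDERS:
            tokens.append(PLACEHOLDERS[kind])
        elif kind == "word":
            tokens.append(match.group().lower())
        else:
            tokens.append(match.group())
    return tokens


if __name__ == "__main__":
    print(tokenize_tweet("@bob Loving the new phone :) #happy https://t.co/x"))
```

Keeping emoticons and hashtags as single tokens, rather than letting a generic punctuation rule split them, preserves signals that matter for sentiment classification.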

589 citations

Proceedings Article • DOI 10.3115/1219840.1219911

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop

Nizar Habash¹, Owen Rambow¹

Columbia University¹

25 Jun 2005

TL;DR: This paper presents an approach to using a morphological analyzer to tokenize and morphologically tag Arabic words in one process, learning classifiers for individual morphological features as well as ways of using these classifiers to choose among the entries output by the analyzer.

Abstract: We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

543 citations

Journal Article • DOI 10.1021/ACS.JCIM.6B00207

ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature.

Matthew C. Swain¹, Jacqueline M. Cole¹

University of Cambridge¹

06 Oct 2016 - Journal of Chemical Information and Modeling

TL;DR: ChemDataExtractor provides an extensible, chemistry-aware natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing, and makes novel use of multiple rule-based grammars tailored for interpreting specific document domains such as textual paragraphs, captions, and tables.

Abstract: The emergence of “big data” initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing an...

453 citations

Year	Papers
2022	4
2021	150
2020	137
2019	104
2018	84
2017	63

Related Topics (5)

Performance Metrics