TL;DR: This work shows that transformation-based learning can be applied at a higher level of textual interpretation to locate chunks in tagged text, including non-recursive “baseNP” chunks.
Abstract: Transformation-based learning, a technique introduced by Eric Brill (1993b), has been shown to do part-of-speech tagging with fairly high accuracy. This same method can be applied at a higher level of textual interpretation for locating chunks in the tagged text, including non-recursive “baseNP” chunks. For this purpose, it is convenient to view chunking as a tagging problem by encoding the chunk structure in new tags attached to each word. In automatic tests using Treebank-derived data, this technique achieved recall and precision rates of roughly 93% for baseNP chunks (trained on 950K words) and 88% for somewhat more complex chunks that partition the sentence (trained on 200K words). Working in this new application and with larger template and training sets has also required some interesting adaptations to the transformation-based learning approach.
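To make the "chunking as tagging" idea concrete, here is a minimal sketch of encoding chunk spans as one tag per word and decoding them back. It is an illustration only: the function names, the span representation, and the exact tag convention (B on the first word of every chunk, I inside a chunk, O outside) are assumptions made for this example and may differ from the tag set used in the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical representation: [start, end) token indices


def chunks_to_tags(n_tokens: int, chunks: List[Span]) -> List[str]:
    """Encode non-overlapping chunk spans as one tag per token.

    Assumed convention: "B" = first word of a chunk, "I" = inside a chunk,
    "O" = outside any chunk.
    """
    tags = ["O"] * n_tokens
    for start, end in chunks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def tags_to_chunks(tags: List[str]) -> List[Span]:
    """Recover chunk spans from a tag sequence produced by chunks_to_tags."""
    chunks: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        # A "B" or an "O" closes any chunk that is currently open.
        if tag in ("B", "O") and start is not None:
            chunks.append((start, i))
            start = None
        if tag == "B":
            start = i
        elif tag == "I" and start is None:
            # Tolerate an "I" with no preceding "B" by opening a chunk here.
            start = i
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks


tokens = ["The", "cat", "sat", "on", "the", "mat"]
base_nps = [(0, 2), (4, 6)]  # "The cat", "the mat"
tags = chunks_to_tags(len(tokens), base_nps)
```

Once chunks are expressed this way, any per-word tagger, including a transformation-based one, can in principle be trained to predict the chunk tags, and the predicted tags can be decoded back into spans for chunk-level scoring.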
TL;DR: This paper describes the CoNLL-2000 shared task on text chunking: dividing text into syntactically related, non-overlapping groups of words.
Abstract: We describe the CoNLL-2000 shared task: dividing text into syntactically related non-overlapping groups of words, so-called text chunking. We give background information on the data sets, present a general overview of the systems that have taken part in the shared task and briefly discuss their performance.
TL;DR: This paper investigates incorporating diverse lexical, syntactic and semantic knowledge into feature-based relation extraction using SVM. It shows that base phrase chunking information is very effective and accounts for most of the syntactic performance improvement, while additional information from full parsing gives limited further enhancement.
Abstract: Extracting semantic relationships between entities is challenging. This paper investigates the incorporation of diverse lexical, syntactic and semantic knowledge in feature-based relation extraction using SVM. Our study illustrates that the base phrase chunking information is very effective for relation extraction and contributes most of the performance improvement from the syntactic aspect, while additional information from full parsing gives limited further enhancement. This suggests that most of the useful information in full parse trees for relation extraction is shallow and can be captured by chunking. We also demonstrate how semantic information, such as WordNet and Name List, can be used in feature-based relation extraction to further improve the performance. Evaluation on the ACE corpus shows that effective incorporation of diverse features enables our system to outperform the previously best-reported systems on the 24 ACE relation subtypes and to significantly outperform tree kernel-based systems by over 20 in F-measure on the 5 ACE relation types.
TL;DR: A stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data is described.
TL;DR: This paper presents the first phase of building YADAC, a multi-genre Dialectal Arabic corpus that is compiled using Web data from microblogs and question-answer pairs extracted from online knowledge market services in which both questions and answers are user-generated.
Abstract: This paper presents the first phase of building YADAC ― a multi-genre Dialectal Arabic (DA) corpus ― that is compiled using Web data from microblogs (i.e. Twitter), blogs/forums and online knowledge market services in which both questions and answers are user-generated. In addition to introducing two new genres to the current efforts of building DA corpora (i.e. microblogs and question-answer pairs extracted from online knowledge market services), the paper highlights and tackles several new issues related to building DA corpora that have not been handled in previous studies: function-based Web harvesting and dialect identification, vowel-based spelling variation, linguistic hypercorrection and its effect on spelling variation, unsupervised Part-of-Speech (POS) tagging and base phrase chunking for DA. Although the algorithms for both POS tagging and base-phrase chunking are still under development, the results are promising.