Big5

Papers

Journal Article • DOI: 10.1007/S10032-010-0116-6

SCUT-COUCH2009—a comprehensive online unconstrained Chinese handwriting database and benchmark evaluation

Lianwen Jin, Yan Gao, Gang Liu, Yunyang Li, Kai Ding

01 Mar 2011-International Journal on Document Analysis and Recognition

TL;DR: The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples and some evaluation results on the database are reported using state-of-the-art recognizers for benchmarking.

Abstract: A comprehensive online unconstrained Chinese handwriting dataset, SCUT-COUCH2009, is introduced in this paper. As a revision of SCUT-COUCH2008 [1], the SCUT-COUCH2009 database consists of more datasets with larger vocabularies and more writers. The database is built to facilitate research on unconstrained online Chinese handwriting recognition. It is comprehensive in the sense that it consists of 11 datasets of different vocabularies, named GB1, GB2, TradGB1, Big5, Pinyin, Letters, Digit, Symbol, Word8888, Word17366 and Word44208. In particular, the SCUT-COUCH2009 database contains handwritten samples of 6,763 single Chinese characters in the GB2312-80 standard, 5,401 traditional Chinese characters of the Big5 standard, 1,384 traditional Chinese characters corresponding to the level 1 characters of the GB2312-80 standard, 8,888 frequently used Chinese words, 17,366 daily-used Chinese words, 44,208 complete words from the Fourth Edition of “The Contemporary Chinese Dictionary”, 2,010 Pinyin and 184 daily-used symbols. The samples were collected using PDAs (Personal Digital Assistants) and smart phones with touch screens and were contributed by more than 190 persons. The total number of character samples is over 3.6 million. The SCUT-COUCH2009 database is the first publicly available large vocabulary online Chinese handwriting database containing multi-type character/word samples. We report some evaluation results on the database using state-of-the-art recognizers for benchmarking.

82 citations

Description of the NTU System used for MET-2.

Hsin-Hsi Chen¹, Yung-Wei Ding, Shih-Chung Tsai, Guo-Wei Bian

National Taiwan University¹

1 Jan 1998

TL;DR: This paper employs different types of information from different levels of text to extract named entities, including character conditions, statistic information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, cache and n-gram model.

Abstract: Named entities form the major components of a document. Once we capture these fundamental entities, we can understand a document to some degree. This paper employs different types of information from different levels of text to extract named entities, including character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, a cache and an n-gram model. In the formal run of MET-2, the F-measures P&R, 2P&R and P&2R are 79.61%, 77.88% and 81.42%, respectively.

INTRODUCTION

People, affairs, time, places and things are five basic entities in a document. Once we capture these fundamental entities, we can understand a document to some degree. The Natural Language Processing Laboratory (NLPL) in the Department of Computer Science and Information Engineering (CSIE), National Taiwan University (NTU), started to study the named entity extraction problem in 1993. At first, we focused on the extraction of Chinese person names, transliterated person names [1] and organization names [2]. The training and testing data in these experiments were selected from three Taiwan newspaper corpora (China Times, Liberty Times News and United Daily News). At the 16th International Conference on Computational Linguistics, Chen and Lee [3] reported that the precision and recall rates for the extraction of Chinese person names, transliterated person names and organization names are (88.04%, 92.56%), (50.62%, 71.93%) and (61.79%, 54.50%), respectively.

We have applied these results in several applications. Chen and Wu [4] considered person names as one of the clues in sentence alignment. Chen and Lee [3] showed their application to anaphora resolution. Chen and Bian [5] proposed a method to construct white pages for Internet/Intranet users automatically: we extract information from World Wide Web documents, including proper nouns, E-mail addresses and home page URLs, and find the relationships among these data.
Chen, Ding and Tsai [6,7] dealt with proper noun extraction for information retrieval. In MUC-7 and MET-2, we took part in the named entity extraction tasks for both English and Chinese. We extended our previous work on this problem to cover more named entity types such as locations, date/time expressions and monetary and percentage expressions.

Several issues had to be addressed during this extension. One of the major differences between Chinese and English language processing is that segmentation is required for Chinese. That is, we have to identify word boundaries in Chinese sentences beforehand. That makes Chinese named entity extraction tasks more challenging. Besides, the vocabulary set and the Chinese coding set used in Taiwan and in China are not the same. The documents adopted in MET-2 are selected from newspapers in China, so we have to transform simplified Chinese characters in the GB coding set to traditional Chinese characters in the Big-5 coding set before testing. A word that is known may become unknown due to this transformation. For example, the character "u" in "u" (early morning) is used in traditional Chinese characters. However, "D" is used in simplified Chinese characters and it is also a legal traditional Chinese character that denotes another meaning. In other words, the mapping from GB to Big5 is "D", which is an unknown word based on our dictionary. The different vocabulary sets of China and Taiwan result in different segmentation.

This paper is organized as follows. Section 2 illustrates the flow of named entity extraction and the summary scores of our team in the MET-2 formal run. Sections 3, 4 and 5 propose methods to extract named people, organizations and locations. Section 6 deals with the rest of the named entities, i.e., date/time expressions and monetary and percentage expressions. After each section, we discuss the sources of errors in the formal run. Section 7 gives concluding remarks.

FLOW OF NAMED ENTITY EXTRACTION

The following shows the flow of named entity extraction in the MET-2 formal run.
(1) Transform Chinese texts in GB codes into texts in Big-5 codes.
(2) Segment Chinese texts into a sequence of tokens.
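
The first two steps of the flow above can be sketched in Python. This is only an illustration under stated assumptions, not the NTU system: the simplified-to-traditional table `S2T`, the toy `DICTIONARY`, and the greedy forward maximum-matching segmenter are all hypothetical stand-ins, since the excerpt does not say which mapping table or segmentation method the authors used. The sketch does show why a one-to-one character mapping matters: any character the table maps to the wrong variant produces a word the dictionary does not know.

```python
# Hypothetical one-to-one simplified -> traditional table (toy data only,
# not the mapping used by the NTU system).
S2T = {"国": "國", "汉": "漢", "语": "語"}

# Toy traditional-Chinese dictionary used for segmentation (illustrative).
DICTIONARY = {"中國", "漢語"}


def gb_to_big5(gb_bytes: bytes) -> bytes:
    """Step (1): decode GB2312 bytes, map characters, re-encode as Big5."""
    text = gb_bytes.decode("gb2312")
    traditional = "".join(S2T.get(ch, ch) for ch in text)
    return traditional.encode("big5")


def segment(text: str, dictionary=DICTIONARY, max_len: int = 4) -> list:
    """Step (2): greedy forward maximum matching (an assumed method).

    At each position, take the longest dictionary word; fall back to a
    single character when nothing longer matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


gb_bytes = "中国汉语".encode("gb2312")
big5_bytes = gb_to_big5(gb_bytes)
tokens = segment(big5_bytes.decode("big5"))
print(tokens)
```

With a word missing from the dictionary, the segmenter falls back to single characters, which is the "known word becomes unknown" effect the introduction describes.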

54 citations

Patent

User interface and database structure for chinese phrasal stroke and phonetic text input

Lu Zhang, Pim Van Meurs, Lian He, Ethan R. Bradford, Jianchao Wu, Jenny Huang-Yu Lai, Keng Chong Wong, Siu Ming Louis Leung

20 Jul 2005

TL;DR: This patent describes a stroke and phonetic text input system for Chinese in which the input is a phrase rather than a single character, and the stroke sequence entered by the user can be split into a few groups that are separated by zero or more delimiters.

Abstract: The invention provides a stroke and phonetic text input entry system that has substantially the same definition of stroke match as that used in T9, where the input is a phrasal input rather than a character input. The invention solves the problem of Chinese phrasal stroke and phonetic text input by allowing users to enter an arbitrary number of strokes for each character in a phrase, where each character is separated by a delimiter. In this way, the invention provides a system that is easily learned and efficiently applied. Thus, the invention makes it possible for users to enter multiple characters while keeping their single character input habits. Each Chinese character has a standard stroke sequence in Guo Biao (GB), which is the standard for mainland China, or multiple sequences for BIG5 Chinese Character Encoding for Traditional (Complex) Characters, which is the de facto standard in Taiwan but not used in mainland China. With the invention, users do not have to enter the complete sequence for a single character, but instead can stop at any point and enter a delimiter, which indicates the end of the previous character and the start of the next character. The whole stroke sequence entered by the user can then be split into a few groups that are separated by zero or more delimiters. Phrases can then be identified by user entry of groups of characters. The presently preferred phrase matching criteria are as follows: the first stroke group matches the leading stroke sequence of the first character of the phrase; the second stroke group matches the leading stroke sequence of the second character of the phrase, etc; the phrases that match the entered stroke sequence are presented to the user for selection. A user interface design for Chinese phrasal stroke text input is also provided.
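
The phrase matching criteria in the abstract can be illustrated with a short Python sketch. It is a minimal reading of the abstract, not the patented implementation: the delimiter character `*`, the digit coding of strokes, the `STROKES` table and the `PHRASES` list are all hypothetical toy data chosen for the example.

```python
DELIMITER = "*"  # assumed delimiter key; the abstract does not name one

# Hypothetical stroke codes per character (digits stand for stroke classes).
# Toy data for illustration, not taken from the GB or Big5 stroke standards.
STROKES = {
    "中": "2512",
    "国": "25112141",
    "文": "4134",
    "大": "134",
    "人": "34",
}

# Toy phrase list standing in for the phrase database.
PHRASES = ["中国", "中文", "大人", "人大"]


def split_groups(entry: str, delimiter: str = DELIMITER) -> list:
    """Split the entered stroke sequence into per-character stroke groups."""
    return entry.split(delimiter)


def match_phrases(entry: str, phrases=PHRASES, strokes=STROKES) -> list:
    """Return phrases whose i-th character starts with the i-th stroke group.

    An empty group (nothing typed yet after a delimiter) matches any
    character, because every stroke sequence starts with the empty string.
    """
    groups = split_groups(entry)
    matches = []
    for phrase in phrases:
        if len(phrase) < len(groups):
            continue  # more groups entered than the phrase has characters
        if all(strokes[ch].startswith(g) for ch, g in zip(phrase, groups)):
            matches.append(phrase)
    return matches


print(match_phrases("25*25"))
print(match_phrases("25*"))
print(match_phrases("1*3"))
```

Prefix matching per group is what lets a user stop partway through a character and move on: the candidates narrow as more strokes are added to any group.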

19 citations

Journal Article • DOI: 10.1002/SPE.427

A page‐shift transformation format of ISO 10646

Pei-Chi Wu

01 Jan 2002-Software - Practice and Experience

TL;DR: This paper proposes a page‐shift transformation format of ISO 10646, called UTF‐S, suitable for replacing locale‐specific character sets with ISO 10646 in Internet applications, such as the World Wide Web.

Abstract: ISO 10646 Universal Character Set (UCS) or Unicode covers symbols in most of the World's written languages. There are various UCS transformation formats (UTF). UTF-8 is compatible with systems that assume 8-bit characters. One of the problems with UTF-8 is its space efficiency. For files containing most Asian characters such as Han ideographs, the file sizes increase by about 50% by using UTF-8. Although the Standard Compression Scheme for Unicode (SCSU) can compress Unicode strings to the size of a locale-specific character set, it is complicated and is not intended to serve as a general purpose interchange format. This paper proposes a page-shift transformation format of ISO 10646, called UTF-S. There are four pages: 1-byte, 2-byte, 3-byte and 4-byte. Shift to page 0 uses a special code ; shift to page 1, 2, and 3 uses ISO 2022 shift codes SO, SS2, and SS3, respectively. We test several text files and compare these UTF with Big5, a locale-specific character set. The result shows that the space efficiency of UTF-S is better than that of UTF-16 and UTF-8 and is close to that of SCSU. UTF-S is suitable for replacing locale-specific character sets with ISO 10646 in Internet applications, such as the World Wide Web. Copyright © 2001 John Wiley & Sons, Ltd.
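
The roughly 50% size increase mentioned in the abstract can be checked with a short Python sketch that uses the standard `big5`, `utf-8` and `utf-16-le` codecs. The sample string is an arbitrary choice for illustration; UTF-S and SCSU are not included because Python's standard library has no codec for either.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` under three encodings."""
    return {
        "big5": len(text.encode("big5")),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" avoids the 2-byte byte-order mark that "utf-16" adds.
        "utf-16": len(text.encode("utf-16-le")),
    }


sample = "中文字元編碼"  # six traditional Han ideographs (arbitrary sample)
sizes = encoded_sizes(sample)
growth = sizes["utf-8"] / sizes["big5"] - 1

print(sizes)
print(f"UTF-8 growth over Big5: {growth:.0%}")
```

Each of these ideographs takes 2 bytes in Big5 and 3 bytes in UTF-8, which is where the 50% figure comes from for text that consists mostly of Han characters. Text mixed with ASCII grows less, since ASCII stays at 1 byte in both encodings.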

Performance Metrics

No. of papers in the topic in previous years
Year	Papers
2011	1
2005	1
2002	1
1998	1