Named Entity Recognition
Named entity recognition (NER), also known as entity identification or entity extraction, is a subtask of information extraction that seeks to locate linguistic units in unstructured text and classify them into predefined categories such as the names of persons, organizations, and locations.
NER recall is often limited by an imbalanced class distribution, in which the NONE class dominates the entity classes. Classifiers trained on such data typically achieve higher precision but lower recall, and they tend to overproduce the NONE class.
We have designed two kinds of features to overcome this problem: 0/1 features and non-local features. Our final systems use these features together with local features to perform the NER task.
Our method proved effective in "The Fourth International Chinese Language Processing Bakeoff & the First CIPS Chinese Language Processing Evaluation" (Bakeoff-04), where it took first place in three NER subtasks. (Paper [1] below is our technical report for Bakeoff-04.)
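The recall problem described above can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: the token-level scoring, the label names PER/ORG/LOC, and the helper precision_recall are hypothetical and are not taken from our systems. It shows how a tagger that overproduces NONE can keep perfect precision while losing half of the entities.

```python
def precision_recall(gold, pred, none_label="NONE"):
    """Token-level precision and recall over the entity (non-NONE) classes."""
    # A true positive is a token given an entity label that matches the gold label.
    tp = sum(1 for g, p in zip(gold, pred) if p != none_label and p == g)
    predicted = sum(1 for p in pred if p != none_label)  # entity labels produced
    actual = sum(1 for g in gold if g != none_label)     # entity labels in gold
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical data: 16 NONE tokens and only 4 entity tokens (imbalanced).
gold = ["NONE"] * 16 + ["PER", "PER", "ORG", "LOC"]
# A cautious tagger that falls back to NONE for two of the four entities.
pred = ["NONE"] * 16 + ["PER", "NONE", "NONE", "LOC"]

precision, recall = precision_recall(gold, pred)
print(precision, recall)  # 1.0 0.5
```

Every entity label the tagger emits is correct, so precision is 1.0, but two of the four gold entities are missed, so recall is 0.5.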
Related Papers:
[1] "Chinese Word Segmentation and Named Entity Recognition Based on Conditional Random Fields", Xinnian Mao, Yuan Dong, Saike He, Haila Wang, The Sixth SIGHAN Workshop on Chinese Language Processing (SIGHAN-6), Hyderabad, India ACL2007.
[2] "Using Non-local Features to Improve Named Entity Recognition Recall", Xinnian Mao, Xu Wei, Yuan Dong, Saike He and Haila Wang, 2007. In Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation, 303-310, Seoul, Korea.
[3] "A Two-Step Approach for Chinese Named Entity Recognition Based on Conditional Random Fields and Maximum Entropy", Xinnian Mao, Yuan Dong, Wenbo Pang, Saike He, Haila Wang. Chinese Computing Technologies and Related Linguistic Issues: Proceedings of the 7th International Conference on Chinese Computing.
Chinese Word Segmentation
Words are the basic linguistic units of natural language. However, Chinese texts are character based, not word based. Thus, the identification of lexical words or the delimitation of words in running texts is a prerequisite of Natural Language Processing.
Chinese word segmentation (CWS) can be cast, simply and effectively, as a character sequence labeling problem. A prevailing technique for this kind of labeling task is Conditional Random Fields (CRFs). Although CRFs perform very well on known words (words that appear in both the training and test data), further improvement of CWS systems is usually limited by the comparatively large fraction of unknown words (words that appear only in the test data).
To address this nontrivial issue, we adopt a semi-supervised methodology that incorporates an unsupervised method into supervised segmentation. Controlled experiments have been conducted, taking a champion CWS system [1] from Bakeoff-04 as the baseline.
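The character sequence labeling formulation mentioned above can be sketched as follows. This is a minimal illustration, assuming the common B/M/E/S tag set (Begin, Middle, End, Single); the tag set, the example sentence, and the helper names words_to_tags and tags_to_words are assumptions for illustration and may differ from what our systems use. A sequence labeler such as a CRF would be trained to predict these tags from the characters.

```python
def words_to_tags(words):
    """Convert a list of segmented words into per-character B/M/E/S tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


words = ["我", "喜欢", "自然语言"]  # hypothetical gold segmentation
tags = words_to_tags(words)
print(tags)  # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(tags_to_words("".join(words), tags) == words)  # True
```

Under this formulation, segmenting a sentence is equivalent to predicting one tag per character, and the word boundaries are recovered from the tags.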
Related Papers:
[1] "Chinese Word Segmentation and Named Entity Recognition Based on Conditional Random Fields", Xinnian Mao, Yuan Dong, Saike He, Haila Wang, The Sixth SIGHAN Workshop on Chinese Language Processing (SIGHAN-6), Hyderabad, India ACL2007.
[2] "Normalized Accessor Variety in Chinese Word Segmentation Based on Conditional Random Fields", Saike He, Xiaojie Wang, Yuan Dong, CNCCL-2009.
Coreference Resolution
Coreference resolution is an important subtask in natural language processing. It focuses on determining whether two mentions, within a single document or across documents, refer to the same real-world entity.
Traditional methods solve this task in two steps:
(1) a classification phase that examines the pairwise similarity score between mentions, with each pair scored independently of the others;
(2) a clustering phase that groups together all mentions that refer to the same entity, using a threshold on the similarity score above which mentions are considered coreferent.
However, the similarity scores derived in the first step are inherently noisy, and the answer to one pairwise coreference decision may not be independent of another. A classic case: if we measure the similarity between all three possible pairs among three mentions, two of the similarities may be above the threshold while the third is below it. This inconsistency, caused by noise and imperfect measurement, cannot be resolved by local optimization.
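The three-mention case can be reproduced in a few lines. The sketch below is an illustration only: the mention names, the scores, the threshold of 0.5, and the helpers threshold_links and cluster are hypothetical. It links pairs whose score exceeds the threshold, clusters by transitive closure, and then reports pairs that end up in the same cluster even though their own pairwise decision was "not coreferent".

```python
from itertools import combinations


def threshold_links(scores, threshold):
    """Step 1 output: pairs whose similarity score exceeds the threshold."""
    return {pair for pair, score in scores.items() if score > threshold}


def cluster(mentions, links):
    """Step 2: group mentions by transitive closure of the links (union-find)."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for m in mentions:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())


mentions = ["A", "B", "C"]  # hypothetical mentions
scores = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.3}  # hypothetical scores

links = threshold_links(scores, threshold=0.5)
clusters = cluster(mentions, links)
cluster_of = {m: i for i, group in enumerate(clusters) for m in group}

# Pairs merged by transitivity although their own score was below the threshold.
conflicts = [
    (a, b)
    for a, b in combinations(mentions, 2)
    if cluster_of[a] == cluster_of[b] and (a, b) not in links
]
print(clusters)   # [['A', 'B', 'C']]
print(conflicts)  # [('A', 'C')]
```

The pair (A, C) is judged not coreferent on its own, yet it is forced into the same cluster through B, which is exactly the kind of inconsistency that independent pairwise decisions cannot detect.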
In contrast, globally optimized clustering decisions can solve this problem to some extent: the coreference space is represented as an undirected edge-weighted graph, and clustering is cast as finding the best cut of the graph. Best-Cut is a clustering algorithm in this category, and it has been employed in our coreference resolution system for the NIST ACE08 evaluation.
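To make the graph formulation concrete, the sketch below finds a minimum-weight cut of a tiny mention graph by exhaustive search. It is not the Best-Cut algorithm used in our system; it is a brute-force stand-in that only shows what "cutting" an edge-weighted coreference graph means. The mentions, the edge weights, and the function min_cut_bruteforce are hypothetical, and the search is exponential, so it is usable on toy graphs only.

```python
from itertools import combinations


def min_cut_bruteforce(nodes, weights):
    """Return (side, other_side, cut_weight) for a minimum-weight bipartition.

    nodes   -- list of at least two mention identifiers
    weights -- dict mapping (a, b) pairs to similarity scores (edge weights)
    """
    best_weight, best_side = None, None
    first, rest = nodes[0], nodes[1:]
    # Keep the first node on one side so each bipartition is enumerated once;
    # sizes stop before len(rest) so the other side is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side = {first, *extra}
            # Sum the weights of edges whose endpoints fall on different sides.
            cut = sum(w for (a, b), w in weights.items() if (a in side) != (b in side))
            if best_weight is None or cut < best_weight:
                best_weight, best_side = cut, side
    other = [n for n in nodes if n not in best_side]
    return sorted(best_side), other, best_weight


mentions = ["A", "B", "C"]  # hypothetical mentions
weights = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.3}  # hypothetical scores

side, other, cut_weight = min_cut_bruteforce(mentions, weights)
print(side, other)  # ['A', 'B'] ['C']
```

Because the decision looks at all edges at once, the cheapest separation removes C (total weight about 1.1) rather than breaking the strong A-B link. A real system would also need a criterion for when to stop cutting, which this sketch leaves out.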
Related Papers:
[1] "A Neural Networks-Based Graph Algorithm for Cross-Document Coreference Resolution", Saike He, Yuan Dong, Haila Wang, 2008 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'08).
Semantic Role Labeling
Semantic parsing of text corpora is needed to support tasks such as information extraction and question-answering. In particular, shallow semantic parsing focuses on identifying the semantic roles of the arguments of a verb (or any predicate) rather than parsing the whole sentence in detail.
Traditional shallow semantic parsing systems based on machine learning focus on selecting features and their combinations to train a statistical model. However, we usually do not know in advance what a good model looks like, nor which features to select. Multi-task Learning (MTL) provides us with a substantial solution to this nontrivial problem.
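As an illustration of what shallow semantic parsing produces, the sketch below decodes a chunk-style tagging of a sentence into labeled argument spans for one predicate. The BIO encoding, the PropBank-style role labels (A0, A1, AM-TMP, V), the example sentence, and the helper bio_to_spans are assumptions made for illustration and are not taken from our system.

```python
def bio_to_spans(tokens, tags):
    """Decode BIO chunk tags into (role, text) spans, in sentence order."""
    spans, label, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        # An open span ends on O, on a new B- tag, or on an I- tag with another label.
        if label is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.append((label, " ".join(tokens[start:i])))
            label = None
        # Any non-O tag with no span open starts a new span.
        if tag != "O" and label is None:
            label, start = tag[2:], i
    return spans


# Hypothetical sentence, tagged with respect to the predicate "approved".
tokens = ["The", "committee", "approved", "the", "budget", "yesterday"]
tags = ["B-A0", "I-A0", "B-V", "B-A1", "I-A1", "B-AM-TMP"]

for role, text in bio_to_spans(tokens, tags):
    print(role, "->", text)
# A0 -> The committee
# V -> approved
# A1 -> the budget
# AM-TMP -> yesterday
```

Only the predicate and its arguments are identified; no full syntactic analysis of the sentence is required, which is what distinguishes shallow semantic parsing from detailed parsing.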
Related Papers:
[1] "Incorporating Multi-task Learning in Conditional Random Fields for Chunking in Semantic Role Labeling", Saike He, Taozheng Zhang, Xue Bai, Xiaojie Wang, Yuan Dong, 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'09).
Opinion Analysis
Recently, the research community has shown growing interest in automatically detecting language in which an opinion is expressed, together with the polarity of the expression, its targets, and the opinion holders. Applications include tracking responses to and opinions about commercial products and governmental policies, monitoring blog entries for potential political scandals, and so on.
Our opinion analysis system, built for the NTCIR-8 international contest, ranked 2nd in the MOAT task.
Related Papers:
[1] "NECLC at Multilingual Opinion Analysis Task in NTCIR8", Bo Chen and Saike He, In Proceedings of NTCIR-8 Workshop Meeting, June 15–18, 2010, Tokyo, Japan.
[2] "Opinion and Polarity Detection within Far-East Languages in NTCIR-7", Olena ZUBARYEVA and Jacques SAVOY, In Proceedings of NTCIR-7 Workshop Meeting, December 16–19, 2008, Tokyo, Japan.