
Named Entity Recognition

Named entity recognition (NER), also known as entity identification or entity extraction, is a subtask of information extraction that seeks to locate linguistic units in unstructured text and classify them into predefined categories such as the names of persons, organizations, and locations.

    NER recall is often limited by an imbalanced label distribution in which the NONE class dominates the entity classes. Classifiers trained on such data typically have higher precision than recall, because they tend to overproduce the NONE class.
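
    The effect of overproducing NONE can be seen in a small token-level calculation. The sketch below is illustrative only: the label sequences are made up, and the function name precision_recall is an assumption, not part of our system.

```python
def precision_recall(gold, pred, none_label="NONE"):
    """Token-level precision and recall over the entity classes (NONE is ignored)."""
    # A true positive is a token predicted as an entity with the correct class.
    true_pos = sum(1 for g, p in zip(gold, pred) if p != none_label and g == p)
    predicted = sum(1 for p in pred if p != none_label)  # tokens tagged as entities
    actual = sum(1 for g in gold if g != none_label)     # tokens that are entities
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical data: 16 NONE tokens and 4 entity tokens.
gold = ["NONE"] * 16 + ["PER", "PER", "ORG", "LOC"]
# A cautious classifier falls back to NONE on two of the four entity tokens.
pred = ["NONE"] * 16 + ["PER", "NONE", "NONE", "LOC"]

precision, recall = precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

    Every entity the classifier does predict is correct, so precision stays high, while the entity tokens it labels NONE pull recall down.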

    To overcome this problem, we designed two kinds of effective features: 0/1 features and non-local features. Our final systems combine these features with local features to perform the NER task.

    Our method proved effective in "The Fourth International Chinese Language Processing Bakeoff & the First CIPS Chinese Language Processing Evaluation (Bakeoff-04)", where our systems took first place in three NER subtasks. (Paper [1] below is our technical report for Bakeoff-04.)

Chinese Word Segmentation

Words are the basic linguistic units of natural language. However, Chinese texts are character based, not word based. Thus, the identification of lexical words or the delimitation of words in running texts is a prerequisite of Natural Language Processing.

    Chinese word segmentation (CWS) can be cast as a simple and effective character sequence labeling problem. A prevailing technique for this kind of labeling task is Conditional Random Fields (CRFs). Although CRFs perform very well on known words (words that appear in both the training and the testing data), further improvement of CWS systems is usually limited by the comparatively large fraction of unknown words (words that appear only in the testing data).
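
    As an illustration of the character-labeling formulation, the sketch below converts a segmented sentence into per-character tags and back. The B/M/E/S tag set (begin, middle, end, single) is a common choice and is assumed here; it is not necessarily the tag set used in our system, and the function names are hypothetical.

```python
def words_to_tags(words):
    """Turn a list of words into parallel lists of characters and B/M/E/S tags."""
    chars, tags = [], []
    for word in words:
        if not word:
            continue  # skip empty strings so chars and tags stay aligned
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
        chars.extend(word)
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


segmented = ["自然", "语言", "处理", "是", "人工智能", "的", "分支"]
chars, tags = words_to_tags(segmented)
print("".join(chars))
print(" ".join(tags))
print(tags_to_words(chars, tags) == segmented)
```

    A sequence labeler such as a CRF would be trained to predict the tag list from the character list; segmentation then reduces to decoding the predicted tags with tags_to_words.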

    To address this nontrivial issue, we adopt a semi-supervised methodology that incorporates an unsupervised method into supervised segmentation. Controlled experiments have been conducted, taking a champion CWS system [1] from Bakeoff-04 as the baseline.
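
    The fraction of unknown words in a test set can be measured directly. The following is a minimal sketch with made-up word lists; the function name oov_rate is an assumption for illustration.

```python
def oov_rate(train_words, test_words):
    """Fraction of test tokens whose word never occurs in the training data."""
    vocabulary = set(train_words)
    if not test_words:
        return 0.0
    unknown = [w for w in test_words if w not in vocabulary]
    return len(unknown) / len(test_words)


# Hypothetical segmented data, one token per list element.
train = ["我", "喜欢", "北京", "的", "秋天"]
test = ["我", "喜欢", "上海", "的", "夏天"]

rate = oov_rate(train, test)
print(f"unknown-word rate: {rate:.2f}")
```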

Coreference Resolution

Coreference resolution is an important subtask in Natural Language Processing. It addresses the problem of determining whether two mentions, within a single document or across documents, refer to the same real-world entity.

    Traditional methods solve this task in two steps:
    (1) a classification phase that examines the pairwise similarity score of each mention pair independently of the others;
    (2) a clustering phase that groups together all mentions that refer to the same entity, using a threshold above which mentions are considered coreferent.

    However, the similarity scores derived in the first step are inherently noisy, and the answer to one pairwise coreference decision may not be independent of another. A classic case: if we measure the similarity of all three possible pairs among three mentions, two of the similarities may be above the threshold while the third is below it. This inconsistency, caused by noise and imperfect measurement, cannot be resolved by local optimization.
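
    The three-mention case can be reproduced in a few lines. The sketch below is illustrative: the mentions, scores, and threshold are invented, and the transitive-closure clustering is one simple way to realize the threshold step, not necessarily the one used in any particular system.

```python
from itertools import combinations


def inconsistent_triples(mentions, score, threshold):
    """Find triples in which exactly two of the three pairs clear the threshold."""
    bad = []
    for a, b, c in combinations(mentions, 3):
        pairs = ((a, b), (b, c), (a, c))
        links = [score[frozenset(p)] >= threshold for p in pairs]
        if sum(links) == 2:
            bad.append((a, b, c))
    return bad


def cluster_by_threshold(mentions, score, threshold):
    """Link every pair above the threshold, then take the transitive closure."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in combinations(mentions, 2):
        if score[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


# Hypothetical mentions and pairwise similarity scores.
mentions = ["Mr. Smith", "Smith", "he"]
score = {
    frozenset(("Mr. Smith", "Smith")): 0.9,
    frozenset(("Smith", "he")): 0.7,
    frozenset(("Mr. Smith", "he")): 0.3,
}

print(inconsistent_triples(mentions, score, threshold=0.5))
print(cluster_by_threshold(mentions, score, threshold=0.5))
```

    Here the closure places all three mentions in one cluster even though one pair scored only 0.3. The local decisions disagree, and the clustering step has to override one of them without knowing which score was the noisy one.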

    In contrast, globally optimized clustering decisions can solve this problem to some extent. The coreference space is represented as an undirected edge-weighted graph, and clustering is cast as finding the best cut of the graph. Best-Cut is a clustering algorithm of this kind, and we employed it in our coreference resolution system for the NIST ACE08 evaluation.
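
    To make the graph view concrete, the sketch below finds a plain minimum-weight two-way cut of a tiny mention graph by exhaustive search. It is not the Best-Cut algorithm, whose cut scoring is not described here; the graph, the weights, and the function names are assumptions for illustration only.

```python
from itertools import combinations


def cut_weight(weights, side_a):
    """Total weight of the edges that cross between side_a and the other nodes."""
    return sum(w for (u, v), w in weights.items() if (u in side_a) != (v in side_a))


def minimum_cut(nodes, weights):
    """Try every two-way split of the nodes; feasible only for very small graphs."""
    first, rest = nodes[0], nodes[1:]
    best_weight, best_split = None, None
    # Keep the first node on side A so each split is visited once, and stop one
    # short of len(rest) so that side B is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side_a = {first, *extra}
            weight = cut_weight(weights, side_a)
            if best_weight is None or weight < best_weight:
                best_weight = weight
                best_split = (sorted(side_a), sorted(set(nodes) - side_a))
    return best_weight, best_split


# Hypothetical mention graph; edge weights are pairwise similarity scores.
nodes = ["m1", "m2", "m3", "m4"]
weights = {
    ("m1", "m2"): 0.9,
    ("m2", "m3"): 0.2,
    ("m3", "m4"): 0.8,
    ("m1", "m3"): 0.1,
}

weight, split = minimum_cut(nodes, weights)
print(weight, split)
```

    The split is chosen by looking at all edges at once, so a single noisy pairwise score cannot force two mentions together or apart on its own. A practical system would also need a rule for when to stop cutting, which this sketch omits.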

Semantic Role Labeling

Semantic parsing of text corpora is needed to support tasks such as information extraction and question-answering. In particular, shallow semantic parsing focuses on identifying the semantic roles of the arguments of a verb (or any predicate) rather than parsing the whole sentence in detail.

    Traditional shallow semantic parsing systems, which employ machine learning methods, focus on selecting features and feature combinations to train a statistical model. However, we usually do not know what a good model looks like, nor which features to select. Fortunately, Multi-task Learning (MTL) provides us with a substantial solution to this nontrivial problem.

Opinion Analysis

Recently, the research community has paid increasing attention to automatically detecting language in which an opinion is expressed, along with the polarity of the expression, its targets, and its opinion holders. Applications include tracking responses to and opinions about commercial products and governmental policies, and monitoring blog entries for potential political scandals.

    Our Opinion Analysis system, built for the NTCIR-8 international contest, ranked 2nd in the MOAT task.