Word Sense Disambiguation Corpus Automatic Acquisition
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||Natural language processing; Word sense disambiguation; Language model; Pointwise mutual information|
Polysemy, the phenomenon that one word has several senses, poses many difficulties for the processing of natural language by computer. Ultimately, many problems in natural language understanding come down to resolving ambiguous terms. More than 60 years have passed since the impact of this issue was first noted, and during that period researchers have proposed numerous approaches to word sense disambiguation (WSD). With the development of large-scale text-processing technology, supervised machine learning methods have come to predominate in WSD tasks because of their high accuracy. However, the success of these methods depends heavily on sufficient training data. Annotating such data is time-consuming and laborious, and it is difficult to guarantee consistency. The data sparseness caused by the lack of training data restricts the wider adoption of supervised methods. Some studies have therefore aimed at acquiring training corpora automatically. Among them, a method that uses synonyms to expand the training corpus has lower resource costs and better extensibility. However, experiments show that the corpus obtained by this method contains too much noise and is highly biased. Therefore, focusing on how to obtain an effective training corpus automatically, this work proposes a two-stage expansion-verification strategy, in which the verification stage eliminates the noise that the expansion stage introduces into the training corpus. We focus on the verification capability of two approaches, one based on a language model and the other on pointwise mutual information.
For comparison in the subsequent experiments, an SVM-based supervised WSD system was developed in this work. Experiments on the Semeval-2007 English lexical sample corpus show that the SVM with a linear kernel performs best.
Next, we use synonyms of the target words in the Senseval-3 Chinese corpus and the Semeval-2007 English corpus to obtain candidate WSD corpora from the Web and from raw corpora. We then filter these candidates using the language model and pointwise mutual information approaches, and add the filtered expansion corpora to the respective supervised systems. The results show that both approaches are capable of verification and improve the final performance of the system. The language model approach improves accuracy on the Senseval-3 Chinese lexical sample corpus from 62% to 63.06%. On the Semeval-2007 English lexical sample corpus, the pointwise mutual information verification approach improves accuracy from 88.19% to 88.46%.
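The abstract does not specify how pointwise mutual information is used for verification, so the following is only a minimal sketch of one plausible filter, not the method of this work. It assumes sentence-level co-occurrence counts, PMI defined as log2(p(w1, w2) / (p(w1) p(w2))), a candidate score equal to the mean finite PMI between the target word and its context words, and a hypothetical threshold of 0. All function names (`build_counts`, `pmi`, `score_candidate`, `verify`) are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(sentences):
    """Count in how many sentences each word and each unordered word pair occurs."""
    word_counts, pair_counts = Counter(), Counter()
    for tokens in sentences:
        vocab = sorted(set(tokens))  # sorted so each pair has one canonical order
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))
    return word_counts, pair_counts, len(sentences)


def pmi(w1, w2, word_counts, pair_counts, n):
    """PMI(w1, w2) = log2(p(w1, w2) / (p(w1) * p(w2))); -inf if the pair never co-occurs."""
    joint = pair_counts[tuple(sorted((w1, w2)))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (word_counts[w1] * word_counts[w2]))


def score_candidate(target, tokens, word_counts, pair_counts, n):
    """Mean PMI between the target and the context words it has been seen with."""
    scores = [pmi(target, t, word_counts, pair_counts, n)
              for t in set(tokens) if t != target]
    finite = [s for s in scores if s != float("-inf")]
    return sum(finite) / len(finite) if finite else float("-inf")


def verify(target, candidates, word_counts, pair_counts, n, threshold=0.0):
    """Keep only candidates whose score exceeds the (assumed) threshold."""
    return [c for c in candidates
            if score_candidate(target, c, word_counts, pair_counts, n) > threshold]
```

In such a setup, `build_counts` would be run over a trusted reference corpus, and `verify` would be applied to tokenized candidate sentences retrieved through synonyms, with the synonym already replaced by the target word. The threshold would need tuning on held-out data.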