Research and Implementation of Chinese Cross-Document Coreference Resolution |
|
Author | LuChangSheng |
Advisor | ZhuQiaoMing |
School | Suzhou University |
Major | Computer Application Technology |
Keywords | Anaphora resolution; Chinese cross-document coreference resolution; Biographical information; Compatibility information; Document-level information; Vector space model; B-Cubed algorithm |
CLC | TP391.1 |
Type | Master's thesis |
Year | 2010 |
Cross-document coreference resolution is a focal point and one of the difficulties of natural language processing, and an important component of application systems for information retrieval, information extraction, and multi-document summarization. For decades, research was limited to coreference resolution within a single document. As research has deepened, cross-document coreference resolution has attracted growing attention: building coreference chains across documents not only yields more detailed information about an entity, but can also feed valuable information back into within-document anaphora resolution and help it make breakthrough progress. Research on Chinese cross-document coreference resolution is still in its infancy.

Building on an in-depth analysis of existing English cross-document coreference resolution techniques, this thesis designs a Chinese cross-document coreference resolution system with two parts: cross-document coreference resolution for Chinese person names and for Chinese place names. For Chinese person names, a two-step scheme is proposed. First, biographical information and compatibility information are extracted and used to merge, split, and label coreference chains, forming an initial set of chains. Then a clustering method based on the vector space model (VSM) clusters these chains to form the final set of coreference chains. For Chinese place names, a strategy that combines document-level information extraction with VSM-based clustering is proposed. In addition, because Chinese cross-document coreference resolution lacks a corpus, we collected and organized 113 same-name items from search engines; after processing and manual proofreading and checking, they serve as the corpus of Chinese person names and place names.
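The VSM-based clustering step can be illustrated with a minimal sketch. The abstract does not specify the features, term weighting, similarity measure, or threshold, so everything below is an assumption made for illustration: each coreference chain is represented as a bag of context words, similarity is cosine over raw term counts, and the most similar pair of clusters is merged greedily until no pair reaches a hypothetical `threshold`. The names `cosine` and `cluster_chains` are illustrative and do not come from the thesis.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_chains(chains, threshold=0.3):
    """Greedy agglomerative clustering of chains (lists of context tokens).

    Assumption: a cluster's vector is the sum of its members' term counts,
    and merging stops when no pair of clusters has similarity >= threshold.
    Returns a sorted list of clusters, each a sorted list of chain indices.
    """
    clusters = [([i], Counter(tokens)) for i, tokens in enumerate(chains)]
    while True:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(clusters[a][1], clusters[b][1])
                if s >= threshold and (best is None or s > best[0]):
                    best = (s, a, b)
        if best is None:
            break
        _, a, b = best
        merged = (clusters[a][0] + clusters[b][0],
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(members) for members, _ in clusters)


# Toy example: two chains about a physicist, two about a footballer,
# all assumed to share the same surface name.
chains = [
    ["professor", "university", "physics"],
    ["physics", "professor", "lab"],
    ["football", "striker", "club"],
    ["club", "football", "goal"],
]
print(cluster_chains(chains, threshold=0.3))
```

In this toy input the two physics chains and the two football chains each share two of three terms, so they merge pairwise, while the two resulting groups share no terms and stay separate. A real system would likely use weighted features (for example TF-IDF) and a tuned threshold.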
The system is evaluated with the B-Cubed algorithm. On the Chinese person-name corpus, the F-measure reaches 95.71%, with precision and recall of 92.41% and 99.25%, respectively. On the Chinese place-name corpus, the F-measure reaches 89.30%, with precision and recall of 100% and 80.66%, respectively. In particular, the thesis systematically studies how system performance is affected by different features and feature combinations, different similarity calculation methods, different threshold intervals, and the use or omission of biographical information, compatibility information, and document-level information. It also studies the relationship between Chinese anaphora resolution and Chinese cross-document coreference resolution. By comparing experimental results and examining the experimental errors, the thesis identifies the types of errors in Chinese cross-document coreference resolution and proposes solutions, laying a foundation for future work. The experiments show that the Chinese cross-document coreference resolution system achieves good performance.
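For readers unfamiliar with the metric, the following is a minimal sketch of B-Cubed scoring in its common per-mention formulation, which may differ in detail from the variant used in the thesis. Both inputs are assumed to be dictionaries mapping a mention ID to a cluster label; the function names `b_cubed` and `f_measure` and the toy data are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def b_cubed(system, gold):
    """Per-mention B-Cubed precision, recall, and F.

    system, gold: dict mapping mention id -> cluster label
    (assumed to cover the same set of mentions).
    """
    mentions = list(gold)

    def average_overlap(a, b):
        # For each mention, the fraction of its cluster in `a`
        # that also shares its cluster in `b`.
        total = 0.0
        for m in mentions:
            same_a = {x for x in mentions if a[x] == a[m]}
            same_b = {x for x in mentions if b[x] == b[m]}
            total += len(same_a & same_b) / len(same_a)
        return total / len(mentions)

    precision = average_overlap(system, gold)
    recall = average_overlap(gold, system)
    return precision, recall, f_measure(precision, recall)


# Toy example: five mentions, two gold entities, three system clusters.
gold = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
system = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Z"}
p, r, f = b_cubed(system, gold)
print(round(p, 4), round(r, 4), round(f, 4))
```

The same `f_measure` helper can be applied to the reported precision and recall pairs: the harmonic mean of 92.41 and 99.25 and that of 100 and 80.66 agree with the reported F-measures of 95.71% and 89.30% to within rounding.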