Research on Key Technologies of Information Extraction
|School||Beijing University of Posts and Telecommunications|
|Course||Signal and Information Processing|
|Keywords||named entity recognition; entity relation extraction; comprehensive information theory; maximum entropy; conditional random fields|
With the development of Internet technology and the growth in the volume of electronic text, it has become difficult for users to find the information that interests them on the web. Intelligent tools have been proposed to address this information overload, and information extraction is one of them: the purpose of an information extraction system is to extract domain-specific information from natural language text. According to the evaluation tasks of the ACE conference, information extraction research covers named entity recognition, entity relation extraction, anaphora resolution and event detection. Our research concentrates on named entity recognition and entity relation extraction, applying advanced machine learning algorithms and Comprehensive Information Theory (CIT).
1. We propose a new approach to named entity recognition that combines statistical and rule-based methods and takes full account of human knowledge. The method differs by entity type: for persons and locations, rules are applied before the statistical model; for organizations, the order is reversed. The Maximum Entropy (ME) and Conditional Random Fields (CRFs) algorithms are used for the statistical component. The main contributions are as follows.
Firstly, before recognizing persons and locations, we collect candidate person and location names with rules and then pass the candidate entities to the relevant model for recognition.
We propose a dynamic priority method to solve the problem that part of a foreign personal name may be collected or recognized as a Chinese personal name. We collect high-frequency ambiguous characters that can appear both in a Chinese surname and in a foreign personal name. The method searches forward and backward in the context for characters that may belong to a Chinese or a foreign personal name and, according to what it finds, chooses the appropriate model to recognize the candidate personal name. Experiments show that the dynamic priority method is promising: both recall and precision of personal name recognition improve. Location names often end with specific words such as "省/province"; location recognition differs from person recognition in the search direction used when collecting candidate entities. We design the following features for the recognition models: the entity's contextual surroundings, entity-specific contextual semantics, and the degree to which different words or characters contribute to recognizing an entity. To capture the last of these, we propose a probabilistic feature that replaces the binary feature for person recognition and distinguishes instances by assigning different probability values, which lets the model explore finer-grained differences between instances. The probabilistic feature is one of several differences between our model and most previous models; we also explore several new features, including confidence functions and feature positions.
Secondly, we use cascaded multi-models. To improve performance, we use separate sub-models for Chinese personal names, foreign personal names, locations and organizations, arranged in a cascade.
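The dynamic priority step described above can be illustrated with a minimal Python sketch. It is only an assumption about how such a selector might look: the character sets, the window size, the tie-breaking rule and the function name `choose_name_model` are all hypothetical and are not taken from the thesis.

```python
# Illustrative character inventories (assumptions, not the thesis's lists).
# Characters that are common in transliterated foreign names.
FOREIGN_NAME_CHARS = set("斯尔克夫姆")
# Characters that are common in Chinese given names.
CHINESE_NAME_CHARS = set("伟芳军敏华")


def choose_name_model(text, pos, window=3):
    """Choose a sub-model for the ambiguous character at text[pos].

    Looks `window` characters backward and forward, counts evidence for a
    foreign versus a Chinese personal name, and returns "foreign" or
    "chinese". Ties fall back to "chinese" (an arbitrary choice here).
    """
    start = max(0, pos - window)
    # Context excludes the ambiguous character itself.
    context = text[start:pos] + text[pos + 1:pos + 1 + window]
    foreign = sum(ch in FOREIGN_NAME_CHARS for ch in context)
    chinese = sum(ch in CHINESE_NAME_CHARS for ch in context)
    return "foreign" if foreign > chinese else "chinese"


# "马" is ambiguous: a Chinese surname, or part of a transliteration.
print(choose_name_model("卡尔马克思出生于德国", 2))
print(choose_name_model("记者马伟华报道", 2))
```

In this sketch the candidate is routed to a single sub-model; a real system could instead pass both scores on as features.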
At the same time, we introduce new techniques in these sub-models, each with different features.
Thirdly, the organization recognition method differs from that for persons and locations. Because organization names vary in length, we use phrase recognition technology: we design four labels for recognizing organizations, so that organization recognition reduces to a label recognition task.
Finally, we took part in the SIGHAN (2006) named entity recognition open track on the Microsoft Research Asia (MSRA) corpus and achieved the highest F-measure. We also ran experiments on a 7M corpus of one month of the People’s Daily (January 1998). We applied the ME and CRFs algorithms separately to named entity recognition; the experiments show that CRFs outperform the other approaches.
2. We propose an automatic entity relation extraction approach based on CRFs that extracts the relation between two entities in a sentence. Our work concentrates on the following.
Firstly, we collect and tag a corpus. For the "management succession" domain, we collect text from the Internet and the People’s Daily and apply preprocessing steps, namely word segmentation and POS tagging; the text is then converted into XML format. On the processed corpus, we manually tag three entity relation types and a number of negative instances in the data set; the relations hold among position, person and company. This tagging work is the foundation of our subsequent research.
Secondly, based on the tagged data set, we propose a new CRF-based approach to automatic entity relation extraction and construct a system architecture to carry out the relation extraction experiments. In addition, we choose different features for the different relation types, including morphological, grammatical and semantic features.
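To make the setup concrete, the following Python sketch shows one way relation candidates among position, person and company entities might be built from a segmented, POS-tagged sentence before they are scored by a model. The pairing of the three entity types into three relation types, the feature names and the function `candidate_pairs` are assumptions for illustration; the thesis does not specify them here.

```python
from itertools import combinations

# Assumed mapping from an unordered pair of entity types to a relation type.
RELATION_TYPES = {
    frozenset({"person", "position"}): "person-position",
    frozenset({"person", "company"}): "person-company",
    frozenset({"position", "company"}): "position-company",
}


def candidate_pairs(tokens, entities):
    """Yield one feature dict per pair of differently typed entities.

    tokens   -- list of (word, pos_tag) tuples for one sentence
    entities -- list of (start, end, entity_type) token spans, end exclusive
    """
    for left, right in combinations(sorted(entities), 2):
        rel_type = RELATION_TYPES.get(frozenset({left[2], right[2]}))
        if rel_type is None:  # two entities of the same type: skip
            continue
        between = tokens[left[1]:right[0]]
        yield {
            "relation_type": rel_type,
            "left_type": left[2],
            "right_type": right[2],
            "words_between": [word for word, _ in between],  # lexical cue
            "pos_between": [tag for _, tag in between],      # grammatical cue
            "distance": len(between),
        }


tokens = [("Acme", "NN"), ("chairman", "NN"), ("Li", "NR"), ("resigned", "VV")]
entities = [(0, 1, "company"), (1, 2, "position"), (2, 3, "person")]
for features in candidate_pairs(tokens, entities):
    print(features)
```

Each candidate would then carry a manually assigned label, either one of the relation types or a negative instance, when used as training data.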
Finally, we compare the performance of ME and CRFs; the experiments show that CRFs outperform the other approaches.
3. We propose a new entity relation extraction approach based on CIT in this paper. With the help of CIT, we use syntactic, semantic and pragmatic knowledge to uncover implicit entity relations and explicit relations among entities at the same time. Our work concentrates on the following.
We first acquire syntactic knowledge, which consists of many extraction patterns, with a machine learning algorithm. Owing to the lack of a tagged corpus, we propose an unsupervised learning method to obtain the patterns. Based on the bootstrapping algorithm, we design a hierarchical knowledge extraction model in which an inner specific-word extraction model and an outer pattern extraction model are nested within each other to extract knowledge automatically; the resulting specific dictionary and pattern rules can be used for entity relation extraction.
We also build the comprehensive information knowledge base. We use the semantic frame method combined with "pattern-action" rules to analyze the results of pattern extraction, obtain implicit relations through analysis and inference, and at the same time correct wrongly extracted entity relations. After inference and revision, the complete result is sent to the user. The experiments show that the CIT-based approach can effectively solve relation extraction among entities.
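The bootstrapping idea behind the pattern acquisition step can be sketched in Python as follows. This is a deliberately simplified, flat version that alternates between inducing patterns and extracting words; it is not the hierarchical nested model of the thesis. The pattern form (a left and right neighbor token), the `min_support` threshold and the function name `bootstrap` are assumptions for illustration.

```python
def bootstrap(sentences, seed_words, rounds=5, min_support=2):
    """Grow a word list and a pattern set from a few seed words.

    sentences  -- list of token lists (already segmented)
    seed_words -- initial specific words, e.g. known position titles
    A pattern is a (left_token, right_token) context around a known word.
    """
    words = set(seed_words)
    patterns = set()
    for _ in range(rounds):
        # Pattern step: count the contexts observed around known words.
        counts = {}
        for sent in sentences:
            for i in range(1, len(sent) - 1):
                if sent[i] in words:
                    ctx = (sent[i - 1], sent[i + 1])
                    counts[ctx] = counts.get(ctx, 0) + 1
        patterns |= {ctx for ctx, n in counts.items() if n >= min_support}
        # Word step: any token that fills a trusted context is a new word.
        new_words = {
            sent[i]
            for sent in sentences
            for i in range(1, len(sent) - 1)
            if (sent[i - 1], sent[i + 1]) in patterns
        }
        if new_words <= words:  # nothing new was learned: stop early
            break
        words |= new_words
    return words, patterns


corpus = [
    ["appointed", "chairman", "of", "Acme"],
    ["appointed", "president", "of", "Globex"],
    ["appointed", "director", "of", "Initech"],
    ["named", "director", "at", "Acme"],
]
words, patterns = bootstrap(corpus, {"chairman", "president"})
print(sorted(words), sorted(patterns))
```

The support threshold is what keeps a context seen only once from becoming a pattern; without some such filter, bootstrapping tends to drift away from the seed concept.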