Research on Word Alignment Based on Statistics and Linguistics and Correlation Fusion Strategy
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||statistical method; word alignment; linguistic knowledge; multiple classifier combination|
The boom of the Internet and the growth of information available all over the world have created a great demand for understanding and disseminating content across languages. Against this background, machine translation, a classical topic, has gained new room for development. Word alignment is an intermediate result of statistical machine translation and plays an important role in it. It is also widely applied in other natural language processing tasks, such as word sense disambiguation and translation lexicon building.

Traditionally, statistical word alignment requires a large corpus. How to cope with data sparseness so as to improve alignment on a small corpus is one of the hot topics in word alignment. This paper proposes a method that combines statistical and linguistic knowledge to address this problem.

We adopt the classic IBM Model as the basic model. By combining dictionaries, rules, and syntactic structures, and by using position information and part of speech as constraints, we refine its output in three ways: adding potentially correct alignments, deleting potentially erroneous alignments, and disambiguating uncertain alignments in which several occurrences of the same word link to a single word. Experiments show that the dictionary-based and syntactic-structure-based methods improve precision and recall, respectively. The rule-based method performs well on both measures and reaches the lowest alignment error rate (AER), 0.2503.

Additionally, we borrow the idea of multiple classifier combination. Regarding the alignment models as independent classifiers, we combine them with simple voting and weighted voting strategies. Experiments show that every combination strategy increases precision compared with the individual alignment classifiers. The weighted voting strategy achieves the highest recall and the lowest AER, raising recall by 17.22% and lowering AER by 36.47%.
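The voting strategies and the AER measure mentioned above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `weighted_vote` and `aer`, the majority threshold, the example weights, and the toy alignments are all assumptions made for illustration. The abstract does not say how the weights were chosen or which AER definition was used, so the sketch assumes the commonly used formulation AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted link set, S the sure gold links, and P the possible gold links.

```python
from collections import defaultdict


def aer(predicted, sure, possible=None):
    """Alignment error rate under the assumed sure/possible formulation.

    Links are (source_index, target_index) pairs. If no separate set of
    possible links is given, the sure links are used for both roles.
    """
    predicted, sure = set(predicted), set(sure)
    possible = sure if possible is None else set(possible) | sure
    if not predicted and not sure:
        return 0.0
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


def weighted_vote(alignments, weights=None, threshold=0.5):
    """Combine several aligners' link sets by (weighted) voting.

    A link is kept when the total weight of the aligners proposing it
    exceeds `threshold` times the total weight. With weights=None every
    aligner counts equally, which reduces to simple majority voting.
    """
    if weights is None:
        weights = [1.0] * len(alignments)
    if len(weights) != len(alignments):
        raise ValueError("need exactly one weight per aligner")
    scores = defaultdict(float)
    for links, weight in zip(alignments, weights):
        for link in set(links):
            scores[link] += weight
    cutoff = threshold * sum(weights)
    return {link for link, score in scores.items() if score > cutoff}


# Toy outputs of three hypothetical aligners for one sentence pair.
aligner_a = {(0, 0), (1, 1), (2, 2)}
aligner_b = {(0, 0), (1, 2), (2, 2)}
aligner_c = {(0, 0), (1, 1), (2, 1)}
gold_sure = {(0, 0), (1, 1), (2, 2)}

simple = weighted_vote([aligner_a, aligner_b, aligner_c])
weighted = weighted_vote([aligner_a, aligner_b, aligner_c],
                         weights=[1.0, 3.0, 1.0])

print(sorted(simple), aer(simple, gold_sure))
print(sorted(weighted), aer(weighted, gold_sure))
```

In this toy case the heavily weighted second aligner overrides the majority on one link, which shows why the choice of weights matters: in practice they would presumably be derived from each aligner's measured quality on held-out data rather than set by hand.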