Research on Key Technologies of English-Chinese Machine Translation System
|School||PLA Information Engineering University|
|Keywords||machine translation; morphological analysis; part of speech; data fusion; base noun phrase identification; rough set; shallow syntactic parsing; example-pattern method; structure transform; target language generation|
Machine translation is recognized as a key research area in natural language processing with broad application prospects. Focusing on English-Chinese machine translation, this dissertation covers the translation mechanism, source language analysis, example-pattern matching, structure transformation, and target language generation. It proposes several new ideas and develops an English-Chinese machine translation system with higher translation accuracy.
The dissertation puts forward a combined transfer-based (TB) and example/pattern-based (EPB) translation mechanism, TB-EPB, which integrates the stability of TB with the higher translation accuracy of the example-based (EB) approach. Under the EPB mechanism, input sentences are matched against example patterns at several layers after morphological analysis and shallow syntactic parsing. Experiments indicate that the TB-EPB mechanism produces better translation quality and also runs faster.
Following the system model, we construct a rule-based morphological analyzer, whose lexical rules and data structures are specified in our rule description language, and a comprehensive dictionary, for which we design a hash algorithm for entry lookup. We also present the algorithms of the morphological analysis sub-modules, including morphological preprocessing, morphological parsing, unknown word processing, combination parsing, and part-of-speech (POS) tagging.
After studying four corpus-based POS tagging methods, we propose a novel data fusion strategy for POS tagging, the correlative voting method, analyze its theoretical advantages, and compare it experimentally with other fusion strategies.
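To make the idea of data fusion in POS tagging concrete, the following is a minimal sketch of plain plurality voting over the outputs of several taggers. It is a generic baseline of the kind a fusion strategy is usually compared against, not the correlative voting method itself, whose details are not given here. The function name `plurality_vote`, the tag labels, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def plurality_vote(tag_sequences):
    """Fuse per-token tags from several taggers by plurality vote.

    tag_sequences: list of tag lists, one list per tagger, all of the
    same length. Ties go to the earliest-listed tagger whose tag is
    among the top-voted tags (an assumed tie-breaking rule).
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")

    fused = []
    for i in range(length):
        votes = Counter(seq[i] for seq in tag_sequences)
        top = max(votes.values())
        winners = {tag for tag, count in votes.items() if count == top}
        # Walk the taggers in order and take the first tag among the winners.
        fused.append(next(seq[i] for seq in tag_sequences if seq[i] in winners))
    return fused


# Hypothetical outputs of three taggers for a three-token sentence.
outputs = [
    ["DT", "NN", "VBZ"],
    ["DT", "VB", "VBZ"],
    ["DT", "NN", "NNS"],
]
fused_tags = plurality_vote(outputs)
```

In this sketch every tagger has equal weight at every token; a more refined fusion strategy would typically weight votes by how reliable each tagger is for the tag or context in question.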
The experimental results show that data fusion describes the linguistic knowledge used in POS tagging more completely and therefore improves tagging performance, and that correlative voting outperforms the other fusion methods, with the tagging error rate reduced by 27.85% on average.
For base noun phrase (BaseNP) identification, this dissertation proposes a novel rough-set-based approach, which uses rough set theory to solve the BaseNP tagging subtask and implements BaseNP identification with a finite state transducer (FST). The rough-set-based rule learning mechanism and the related algorithms are introduced, flow charts of BaseNP tagging and identification are described, and a solution to instance collision is put forward to improve identification performance. Detailed experimental steps and results are also provided, together with a comparison with representative similar systems.
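The split between BaseNP tagging and BaseNP identification can be illustrated with a small sketch. It assumes that the tagging step labels each token as B (begins a BaseNP), I (inside one), or O (outside), and then uses a two-state pass over those labels to collect the phrases. The B/I/O scheme, the function name `extract_base_nps`, and the handling of a stray I label are assumptions for illustration; the tag set and transducer actually used in the dissertation are not specified here.

```python
def extract_base_nps(tokens, chunk_tags):
    """Collect base noun phrases from tokens labeled with B/I/O tags.

    Acts as a two-state machine: outside a phrase (current is empty)
    or inside one (current holds the tokens collected so far).
    """
    if len(tokens) != len(chunk_tags):
        raise ValueError("tokens and chunk_tags must have the same length")

    phrases = []
    current = []
    for token, tag in zip(tokens, chunk_tags):
        if tag == "B":
            # A new phrase starts; close any phrase still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I":
            # Continue the open phrase; a stray I with no open phrase
            # is treated as a phrase start (an assumed repair rule).
            current.append(token)
        else:
            # "O" or any unrecognized tag closes the open phrase.
            if current:
                phrases.append(" ".join(current))
                current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical tagged sentence.
words = ["The", "quick", "fox", "saw", "a", "dog"]
tags = ["B", "I", "I", "O", "B", "I"]
base_nps = extract_base_nps(words, tags)
```

Keeping the two steps separate means the learned tagging rules can change without touching the phrase-collection pass, which depends only on the label sequence.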