Research of Some Key Issues in Highly Adaptive Example-Based Machine Translation |
|
Author | RenFeiLiang |
Tutor | YuGe;YaoTianShun |
School | Northeastern University |
Course | Computer Software and Theory |
Keywords | natural language processing; machine translation; example-based machine translation; language model; Chinese word segmentation; coreference resolution; word sense disambiguation |
CLC | TP391.2 |
Type | PhD thesis |
Year | 2008 |
The Internet is truly a medium without borders. It provides an information platform for people from different countries and regions, and allows them to communicate with each other and express themselves freely. But this Internet world poses a great challenge: if an ordinary Chinese reader wants to browse news on an American or a German website, foreign-language ability is a prerequisite for understanding that news well. This is why so many companies and research organizations (including ours) have devoted themselves to the research and development of machine translation.
In this dissertation, we propose a highly adaptive example-based machine translation model based on shallow parsing. It is easy to build and easy to port, so researchers can use it to build example-based machine translation systems across multiple languages. The dissertation investigates the key issues that arise when developing a translation system based on this method. Its main work and innovations are as follows.
1. We propose an example-based machine translation model based on finite-state-automaton transfer generation. The method first selects translation examples that are similar to the input text. It then analyzes these examples together with the input text and assigns states according to the analysis results. Next, it constructs an automaton and generates the translation by finite-state-automaton transfer. During generation, it uses a language model to resolve word sense ambiguity. The method exploits both the source-similarity-based nature of example-based machine translation and the target-similarity-based nature of statistical machine translation, and it also uses rules to translate some special expressions.
Overall, the method combines techniques from example-based, statistical, and rule-based machine translation. Experimental results indicate that the proposed method is effective and that the system's performance is encouraging. Using this translation method, we took part in the English-to-Chinese and Chinese-to-English limited machine translation evaluations at the 3rd Symposium on Statistical Machine Translation (SSMT 2007), and ranked fifth and seventh, respectively.
2. To address the many coreference phenomena in bilingual resources obtained from the Internet, we propose a hybrid machine learning method for coreference resolution that combines conditional random fields with active learning. As part of this method, we propose a novel cascade clustering algorithm. Based on these methods, we took part in the EDR (Entity Detection and Recognition) task of ACE (Automatic Content Extraction) 2007, organized by NIST, and finished as runner-up.
3. We propose a word sense disambiguation method based on an n-gram language model. It takes fluency as the only criterion for choosing among word senses, and uses the n-gram language model to evaluate that fluency. The method is language-independent and easy to replicate, and experimental results indicate that it is effective.
4. We study Chinese word segmentation with support vector machines (SVMs) in a comprehensive and systematic way. In these experiments, we propose a dynamic weighting method for assigning feature weights, and experimental results show that it greatly improves word segmentation performance.
5. To address the high time cost of SVMs on the Chinese word segmentation task, we propose an algorithm that removes a large number of redundant samples from the training set.
Experiments show that almost 40% of the training samples can be removed from the training set while the final system's performance remains almost unchanged.
6. We propose a method for building a translation memory system based on n-grams. The method does not need any language parser. It can provide both exact translation proposals at the sentence level and relevant translation proposals at the sub-sentence level. Experiments indicate that the method is very fast and can satisfy some real-time use cases.
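To make the idea behind contribution 6 concrete, the following is a minimal Python sketch of a parser-free translation memory indexed by word n-grams. It is an illustration under stated assumptions, not the dissertation's implementation: the abstract does not give the indexing scheme or the similarity measure, so the class name `NgramTranslationMemory`, the use of whitespace-separated word bigrams, the Dice coefficient as the score, and the `min_score` threshold are all hypothetical choices made for this example. Ranked partial matches stand in here, roughly, for the "relevant proposals" the abstract mentions.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NgramTranslationMemory:
    """Toy translation memory indexed by word n-grams (hypothetical design)."""

    def __init__(self, n=2):
        self.n = n
        self.pairs = []                # (source tokens, target string)
        self.exact = {}                # normalized source string -> pair index
        self.index = defaultdict(set)  # n-gram -> indices of pairs containing it

    def add(self, source, target):
        """Store one translation pair and index its source-side n-grams."""
        tokens = source.lower().split()  # assumption: whitespace tokenization
        pair_id = len(self.pairs)
        self.pairs.append((tokens, target))
        self.exact[" ".join(tokens)] = pair_id
        for gram in ngrams(tokens, self.n):
            self.index[gram].add(pair_id)

    def lookup(self, query, min_score=0.3):
        """Return (score, target) proposals, best first.

        An exact sentence match is returned alone with score 1.0.
        Otherwise stored sentences are ranked by the Dice coefficient
        over shared n-grams (an assumed similarity measure).
        """
        tokens = query.lower().split()
        key = " ".join(tokens)
        if key in self.exact:
            return [(1.0, self.pairs[self.exact[key]][1])]

        query_grams = set(ngrams(tokens, self.n))
        shared = defaultdict(int)  # pair index -> number of shared n-grams
        for gram in query_grams:
            for pair_id in self.index.get(gram, ()):
                shared[pair_id] += 1

        results = []
        for pair_id, count in shared.items():
            stored_grams = set(ngrams(self.pairs[pair_id][0], self.n))
            score = 2.0 * count / (len(query_grams) + len(stored_grams))
            if score >= min_score:
                results.append((score, self.pairs[pair_id][1]))
        results.sort(key=lambda item: item[0], reverse=True)
        return results
```

In this sketch, after adding "the cat sat on the mat" and "the dog sat on the floor" with their translations, looking up the first sentence verbatim yields a single exact proposal, while looking up "a cat sat on the mat" yields both stored translations ranked by bigram overlap. Because lookup only touches the inverted index entries for the query's own n-grams, its cost grows with the number of candidate sentences sharing an n-gram rather than with the size of the whole memory, which is one plausible reason an n-gram-based design can be fast.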