Research on Chinese Sentential Intelligence Input Method
|Course||Applied Computer Technology|
|Keywords||Chinese sentential intelligence input; Statistical Language Model; State Space Model|
Ever since computers were introduced in China, Chinese text input has been a problem. After more than 20 years of research, Chinese keyboard input methods have developed from character input and word input to sentential input. This development makes input methods increasingly intelligent and lets them draw on characteristics of the language to improve performance. This paper mainly studies intelligent sentential input, which converts the input code into candidate Chinese words and then selects the most probable candidate sentence as the final result according to Chinese usage. It supports continuous input and does not interrupt the user's train of thought. Despite these merits, it is not widely used because of its low conversion accuracy and its high consumption of system resources. This paper studies sentential input, taking Pinyin input as an example, in order to improve the performance of the S&R stroke input method.

The Chinese intelligent sentential input problem can be described by the source-channel model of information theory. We suppose that the source produces a sentence S with probability p(S) and that the noisy channel transforms the text sentence into a Pinyin sequence A according to p(A|S). The problem is then to recover the original text sentence S from the Pinyin sequence A observed at the channel output, that is, to choose the sentence with the maximum a posteriori probability p(S|A) as the output. Chinese intelligent sentential input can therefore be realized with statistical methods. In the N-gram model, natural language is regarded as a discrete Markov model under the assumption that the probability of the current word depends only on the preceding n-1 words and is independent of all other words. We use the bigram model because of the time and space constraints of a Chinese intelligent sentential input system. The unigram and bigram statistics are obtained with the SRILM statistical language modeling toolkit.
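
The source-channel formulation and the bigram assumption described above can be written out as follows. This is a sketch in our own notation, not taken from the paper: the symbols \hat{S}, w_i, m, and the sentence-start marker w_0 are assumptions introduced for illustration.

```latex
% Decoding rule: p(A) does not depend on S, so it drops out of the argmax.
\[
\hat{S} = \arg\max_{S} p(S \mid A)
        = \arg\max_{S} \frac{p(A \mid S)\, p(S)}{p(A)}
        = \arg\max_{S} p(A \mid S)\, p(S)
\]
% Bigram approximation of the source probability for S = w_1 w_2 \dots w_m,
% where w_0 is an assumed sentence-start marker.
\[
p(S) = \prod_{i=1}^{m} p(w_i \mid w_0, \dots, w_{i-1})
     \approx \prod_{i=1}^{m} p(w_i \mid w_{i-1})
\]
```

The first line is Bayes' rule. The second line is the n = 2 case of the Markov assumption, in which each word depends only on the one word before it.
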
The statistical information is sorted by block position code and can be looked up by binary search, which gives an efficient statistical language model.

The Chinese sentential Pinyin intelligent input system consists of a Pinyin preprocessing module, a state space generation module, and a machine learning module. The Pinyin preprocessing module segments the continuous input Pinyin stream with the minimum-segmentation algorithm and passes the resulting discrete Pinyin sequence to the next module. The state space generation module builds the state space from the input syllables and inserts the corresponding candidate words. We combine the state space model with the Viterbi dynamic programming algorithm, merge the system statistical language model with the user statistical language model by weighting, compute the cumulative probability of every candidate node, and select the candidate sentence with the highest probability by backtracking. This module outputs the best sentence to the user. If the sentence is wrong, the user corrects it, and the corrected sentence is sent to the adaptive learning module. There the user statistical language model is updated and the correction is memorized, so that the system adapts better the more it is used.

In the state space model, an insertion only needs to generate the candidate words related to the most recently inserted Pinyin node; the earlier Pinyin nodes and candidate word nodes are left unchanged. Deletion is simple: only the candidate nodes generated by the deleted Pinyin node have to be removed. The best sentence is recovered by following the parent pointers back from the highest-probability candidate node whose right pointer reaches the end of the Pinyin node chain.

We implement a Pinyin sentential input method based on the above framework. The training corpus is the January 1998 People's Daily corpus released by the Institute of Computational Linguistics at Peking University. The model trained on it is smoothed by linear interpolation.
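
The decoding step described above (candidate nodes, cumulative probabilities, parent pointers, backtracking) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The lexicon, the counts, the weight `LAMBDA`, and the function names are all hypothetical toy values. The interpolation shown mixes bigram and unigram estimates for smoothing, and it does not model the weighting of system and user language models.

```python
import math

# Hypothetical toy lexicon: syllable tuple -> candidate words.
LEXICON = {
    ("wo",): ["我", "窝"],
    ("men",): ["们", "门"],
    ("wo", "men"): ["我们"],
    ("shi",): ["是", "市"],
}
# Hypothetical toy counts standing in for corpus statistics.
UNIGRAM = {"我": 50, "窝": 2, "们": 30, "门": 10, "我们": 40, "是": 60, "市": 8}
BIGRAM = {("<s>", "我们"): 20, ("我们", "是"): 15, ("<s>", "我"): 25, ("我", "们"): 5}
TOTAL = sum(UNIGRAM.values())
CONTEXT = dict(UNIGRAM, **{"<s>": 45})  # assumed number of sentence starts
LAMBDA = 0.7  # assumed interpolation weight


def bigram_prob(prev, word):
    """Linear interpolation: LAMBDA * P(word | prev) + (1 - LAMBDA) * P(word)."""
    p_uni = UNIGRAM.get(word, 0) / TOTAL
    prev_count = CONTEXT.get(prev, 0)
    p_bi = BIGRAM.get((prev, word), 0) / prev_count if prev_count else 0.0
    return LAMBDA * p_bi + (1 - LAMBDA) * p_uni


def decode(syllables):
    """Viterbi search over a lattice of candidate words; returns the best sentence."""
    n = len(syllables)
    # nodes[i] holds candidate nodes whose word ends after syllable i.
    # A node is (cumulative log probability, word, parent node).
    nodes = [[] for _ in range(n + 1)]
    nodes[0].append((0.0, "<s>", None))
    for end in range(1, n + 1):
        for start in range(end):
            for word in LEXICON.get(tuple(syllables[start:end]), []):
                best = None
                for parent in nodes[start]:
                    p = bigram_prob(parent[1], word)
                    if p <= 0.0:
                        continue
                    score = parent[0] + math.log(p)
                    if best is None or score > best[0]:
                        best = (score, word, parent)
                if best is not None:
                    nodes[end].append(best)
    if not nodes[n]:
        return None  # no candidate sentence covers the whole input
    node = max(nodes[n], key=lambda nd: nd[0])
    words = []
    while node[2] is not None:  # follow parent pointers back to the start
        words.append(node[1])
        node = node[2]
    return "".join(reversed(words))


print(decode(["wo", "men", "shi"]))
```

Because each node stores only its best parent, appending a syllable creates new nodes without touching earlier ones, which mirrors the insertion behavior described for the state space model.
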
The test corpus comes from the CNLP platform and covers Art, Literature, Education, Philosophy, Communication, Space, Energy, Electronics, Medical, and Agriculture. The average character conversion accuracy reaches 83.81%. After integrating linguistic knowledge such as the long-word-first principle and some grammar rules, the accuracy rises to 85.42%.

This paper solves the Pinyin string segmentation ambiguity problem with the state space model. A Pinyin string without separators may admit more than one segmentation. The state space model retains all possible segmentations so that they compete in the Pinyin-to-sentence conversion and the best answer is found, whereas a general segmentation algorithm retains only one segmentation. This prevents the correct result from being excluded from the competition by a wrong segmentation. To avoid constructing the state space model many times, we further propose a phoneme-based sentential input method. It integrates syllable segmentation into sentence conversion and inserts one phoneme at a time rather than one segmented syllable. The optimal result is obtained from a single construction of the state space model, which is composed of all the candidate words produced by all possible syllable combinations.

Finally, we design a test program that compares this input method with the Microsoft Pinyin input method in order to improve our sentential input function.

Our work includes:
(1) Construct a statistical language model from limited resources, starting with little previously accumulated data.
(2) Add a sentential input function on top of a Pinyin word input method.
The results are good enough to come close to the level of the Microsoft Pinyin input method.
(3) Propose a solution to the Pinyin string segmentation ambiguity problem using the state space model, which retains all possible segmentation results.
(4) Propose a phoneme-based sentential input method to improve the conversion accuracy for sentences whose segmentation is ambiguous.
(5) Implement a test program that compares the Microsoft Pinyin input method with ours, which helps us improve our method.
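
The segmentation ambiguity discussed above can be made concrete with a small Python sketch that enumerates every way to split an unseparated Pinyin string into syllables. It is an illustration only, not the paper's phoneme-based method. The syllable set is a tiny hypothetical subset, and `segmentations` is a name introduced here.

```python
# Tiny hypothetical subset of valid Pinyin syllables, for illustration only.
SYLLABLES = {"xi", "an", "xian", "fang", "fan", "gan"}
MAX_SYLLABLE_LEN = 6  # the longest Pinyin syllables (e.g. "zhuang") have 6 letters


def segmentations(pinyin):
    """Return every split of `pinyin` into syllables from SYLLABLES."""
    if not pinyin:
        return [[]]  # one way to segment the empty string: no syllables
    results = []
    for i in range(1, min(len(pinyin), MAX_SYLLABLE_LEN) + 1):
        head = pinyin[:i]
        if head in SYLLABLES:
            for rest in segmentations(pinyin[i:]):
                results.append([head] + rest)
    return results


print(segmentations("xian"))    # two readings: xi + an, or xian
print(segmentations("fangan"))  # two readings: fan + gan, or fang + an
```

A segmenter that keeps only one of these splits can discard the intended reading before conversion starts. Keeping all of them as competing paths is the motivation given above for the state space model.
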