Open-Domain Automatic Question Answering System Based on an Ultra-Large-Scale Question-Answer Pair Database and a Speech Interface
|School||University of Science and Technology of China|
|Course||Signal and Information Processing|
|Keywords||Question Answering; Information Extraction; Speech Interface; Chinese Processing|
The invention and development of the Internet made it possible, for the first time, to share knowledge and information equally and to spread it widely around the world. Search engines such as Google and Baidu let people sift through billions of web pages to find a single piece of information. Currently, however, the desired information can only be specified by one or several keywords, which satisfies just the basic requirement of information retrieval. In the past few years, researchers have kept exploring what the next generation of search engine should be. In this exploration, automatic Question Answering (QA) systems have attracted wide attention and intensive research because they can directly answer questions posed in natural language. Against this background, with the aim of constructing an open-domain QA system, and motivated by the far-reaching influence that rapidly accumulating large-scale Frequently Asked Question (FAQ) corpora are expected to have on question answering research, this thesis presents an in-depth study of how to construct an FAQ-based QA system. The thesis first draws on previous work and adopts a traditional keyword-based document retrieval system as the baseline. It then explores, for the first time, the optimal setting of each baseline component under the conditions of a QA task with a large number of question-answer pairs available, and reaches a series of valuable conclusions. Next, focusing on the QA pair ranking function, which is the core component of the QA system, we redesign the function and apply supervised training for further optimization. With these novel approaches, the performance of the FAQ-based QA system improves significantly.
We also extend the traditional text-based QA system with a speech interface and, for the first time, implement a practical, open-domain QA system with a speech interface. The major contents and contributions of this thesis are as follows.
First, we analyze the opportunities and challenges for QA research that arise from the rapidly accumulating, huge number of question/answer pairs on the Internet. Millions of FAQ web pages, together with the recent popularity of knowledge-sharing web sites such as http://zhidao.baidu.com, provide an incomparably rich corpus for QA research, but also pose new challenges to traditional QA technology. According to our experimental results, for 76.5% of common questions at least one correct answer can be found among 3.8 million automatically extracted question-answer pairs, and an improvement of 8 to 10% can be expected if the QA pair corpus is doubled. This indicates the significant value of QA pair corpora from the Internet and the broad, optimistic prospects of FAQ-based QA systems.
Second, with FAQ-based QA system development in mind, we study question-answer pair extraction and propose an extraction algorithm based on a Decision Tree and a Markov Model. Experiments show that the precision of this algorithm reaches 99%, which demonstrates its suitability for practical applications. We also construct a QA pair database consisting of 3.9 million high-quality QA pairs extracted from http://zhidao.baidu.com. This database provides strong, fundamental support for the later research in this thesis.
Third, we explore the optimal setting of each component of the baseline FAQ-based QA system. We first construct a database for objective evaluation, which consists of 1000 common questions and their related QA pairs drawn from a database of 3.8 million QA pairs.
Then we explore different settings for each key component of the baseline QA system, which is borrowed from keyword-based document retrieval, and reach a series of novel and valuable conclusions: 1) among the three common ranking functions used in traditional document retrieval (TFIDF, BM25 and the language-model-based retrieval method), the simplest, TFIDF, is the most suitable for the QA task; 2) among the three fields of a question-answer pair (question, question description and answer, referred to as Q, D and A respectively), the Q field is the most important to the QA system, followed by A; 3) different Chinese word segmentation should be applied to different fields: for the Q field, the best performance is achieved by segmenting all text into single characters, while for the other two fields, traditional dictionary-based word segmentation is preferred. Experimental results show that the finalized baseline system successfully answers 43.88% of users' questions when only one answer can be returned to the user.
Fourth, based on the finalized baseline system, this thesis performs an in-depth analysis of the essential difference between ranking documents against a few input keywords and ranking QA pairs against an input question in natural language. Starting from the TFIDF ranking function and targeting the FAQ-based QA system, we design a novel unified ranking function with four parameters that control four influences: the word frequency and the IDF of co-occurring words, the IDF of unseen words, and document length. Experimental results show that this design improves the performance of the QA system significantly. Further, since more features can be used in the ranking function, we employ a weighted linear model to combine the contributions of numerous features extracted from the user's question and the QA pairs.
The features proposed in this thesis include semantic similarity (according to Tongyicicilin), edit distance, part of speech and bigram co-occurrence. A hill-climbing algorithm is employed for supervised training. Experimental results show that the accuracy of the QA system rises significantly to 52.37% (a 19.35% relative improvement). We also study confidence measures for the FAQ-based QA system. We find that the accuracy of the system's responses improves, though not significantly, if the questions with the lowest confidence are automatically rejected, and that the effectiveness of the confidence measure increases if the optimization goal in the supervised learning step is modified.
Fifth, for the first time, we introduce a speech interface into an open-domain QA system, which extends the research and application scope of QA systems. We first analyze the great value and the major challenges of a QA system with a speech interface, and point out the advantages of the FAQ-based QA system as well as the inherent conflict between speech recognition and question answering. To construct our QA system with a speech interface, named SpeechQoogle, we employ large-vocabulary continuous speech recognition to handle the user's spoken question, and a corpus-based speech synthesis system to convert the generated text answer back into speech. We then customize the acoustic model and the language model of the speech recognition module, which raises recognition accuracy enough to meet the basic requirement of a QA system. We also examine the contributions of the pinyin layers of the recognition results, the confidence measure and the n-best hypotheses, and experiments show that they yield a slight improvement.
Finally, 36.7% of common questions can be successfully answered by our SpeechQoogle system through the speech channel alone, which indicates the feasibility and promise of building a QA system with a speech interface on top of large-scale QA pairs.
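As a rough illustration of the baseline configuration summarized above (TFIDF ranking over the Q field, with the Q field segmented into single characters), the following Python sketch ranks a handful of QA pairs against a user question. It is not the implementation used in this thesis: the class name `QFieldTfidfRanker`, the smoothed IDF formula, the length normalization and the toy QA pairs are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_tokens(text):
    """Segment text into single characters, dropping spaces and punctuation."""
    return [ch for ch in text if ch.isalnum()]


class QFieldTfidfRanker:
    """Toy TF-IDF ranker that indexes only the question (Q) field of QA pairs."""

    def __init__(self, qa_pairs):
        # qa_pairs: list of (question, answer) tuples (hypothetical toy data).
        self.qa_pairs = qa_pairs
        self.doc_tfs = [Counter(char_tokens(q)) for q, _ in qa_pairs]
        n_docs = len(qa_pairs)
        doc_freq = Counter()
        for tf in self.doc_tfs:
            doc_freq.update(tf.keys())
        # Smoothed IDF (an assumed variant; the thesis does not specify one here).
        self.idf = {t: math.log((n_docs + 1) / (df + 1)) + 1.0
                    for t, df in doc_freq.items()}
        # Length normalization: Euclidean norm of each TF-IDF vector.
        self.norms = [
            math.sqrt(sum((c * self.idf[t]) ** 2 for t, c in tf.items())) or 1.0
            for tf in self.doc_tfs
        ]

    def score(self, query_tf, index):
        """Length-normalized TF-IDF dot product over co-occurring characters."""
        doc_tf = self.doc_tfs[index]
        dot = sum(qc * doc_tf[t] * self.idf[t] ** 2
                  for t, qc in query_tf.items() if t in doc_tf)
        return dot / self.norms[index]

    def rank(self, question, top_k=1):
        """Return the top_k QA pairs whose Q field best matches the question."""
        query_tf = Counter(char_tokens(question))
        order = sorted(range(len(self.qa_pairs)),
                       key=lambda i: self.score(query_tf, i),
                       reverse=True)
        return [self.qa_pairs[i] for i in order[:top_k]]


# Hypothetical QA pairs standing in for the large extracted database.
qa_pairs = [
    ("如何学习英语", "多听多说多练"),
    ("北京今天天气怎么样", "晴，气温二十度"),
    ("怎样才能学好数学", "多做练习题"),
]
ranker = QFieldTfidfRanker(qa_pairs)
best_question, best_answer = ranker.rank("怎么学英语？", top_k=1)[0]
print(best_question, "->", best_answer)
```

Character-level segmentation lets the query "怎么学英语" match the stored question "如何学习英语" on the shared characters 学, 英 and 语 even though the two are worded differently, which is one plausible reading of why single-character segmentation helps on short question fields.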