
Chinese Spoken Document Retrieval Method Based on Stop-word Processing

Author JiangBin
Tutor LiHaiFeng
School Harbin Institute of Technology
Course Computer Science and Technology
Keywords Chinese SDR; stop-word; entropy; syllable lattice; VSM
CLC TP391.1
Type Master's thesis
Year 2008

With the development of the Internet and multimedia technology, the number of spoken documents has increased rapidly, and effective retrieval methods for spoken documents have become increasingly important. As a new field of speech recognition, spoken document retrieval (SDR) aims to search a collection of spoken documents and return the query-related spoken document segments or whole spoken documents to users. Using indexes of the spoken documents created beforehand, it can search efficiently by content. This thesis investigates strategies to improve the performance of Chinese SDR. Because stop-words occur frequently in spoken documents, this thesis introduces stop-word processing into SDR. Stop-words are defined as words that appear frequently in documents but carry no useful information for retrieval. Including such non-content stop-words is expected to have a negative influence on SDR performance. Given the particular characteristics of SDR, this thesis applies an entropy method to extract stop-words and designs a stop-word extraction algorithm. Compared with the word-frequency method, this method performs better and reflects context better. This thesis provides a complete online processing pipeline for spoken document retrieval: index creation based on syllable lattices, similarity calculation between the query and each spoken document based on the vector space model (VSM), ranking of the results by similarity, and output of the results to users. Every spoken document is represented by a feature vector constructed from its syllable lattice: the acoustic scores of syllables and syllable pairs are extracted by searching every syllable lattice of the spoken document and together form its feature vector.
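The entropy-based idea described above can be illustrated with a small sketch. It is not the thesis's algorithm: the abstract gives no formulas, so the function names, the sentence-boundary padding tokens, and the choice to rank terms by the smaller of their left and right entropies are all assumptions made for illustration. The intuition is that a term whose left and right neighbours are highly varied appears in many different contexts, which makes it a plausible stop-word candidate.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbour frequencies."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def left_right_entropy(sentences):
    """Map each term to (left entropy, right entropy) over its neighbours.

    `sentences` is a list of token lists. The padding tokens "<s>" and
    "</s>" are a hypothetical convention for sentence boundaries.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            left[padded[i]][padded[i - 1]] += 1
            right[padded[i]][padded[i + 1]] += 1
    return {w: (shannon_entropy(left[w]), shannon_entropy(right[w])) for w in left}


def stopword_candidates(sentences, top_n):
    """Rank terms by min(left, right) entropy, highest first (assumed rule)."""
    scores = left_right_entropy(sentences)
    ranked = sorted(scores, key=lambda w: (-min(scores[w]), w))
    return ranked[:top_n]
```

Called on a few token sequences in which one particle occurs between varied neighbours, `stopword_candidates` places that particle first, whereas a pure frequency count could not distinguish it from an equally frequent content word that always occurs in the same context.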
Because of the recognition error rate of the automatic speech recognizer (ASR) and because multiple characters can share one syllable, the syllables of stop-words are weighted by a penalty value that reduces their weight in the feature vector; the value is set to 0.1 after comparing retrieval results for different values. Cosine similarity is used to estimate the relevance between the query and a document. Experiments show that the improved system achieves a clear improvement over the baseline system. The main contributions of this thesis are: (1) a stop-word extraction algorithm based on left-right entropy that extracts stop-words properly from syllable lattices; and (2) an improved VSM based on stop-word penalization that improves the performance of the retrieval system.
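A minimal sketch of the penalized vector space model may help make the scoring step concrete. Only the penalty value 0.1 and the use of cosine similarity come from the abstract. The sparse dictionary representation, the function names, and the decision to apply the penalty to both the query vector and the document vector are assumptions, and the per-term scores stand in for the acoustic scores of syllables and syllable pairs taken from the lattice.

```python
import math

STOPWORD_PENALTY = 0.1  # value reported in the abstract


def penalized_vector(term_scores, stop_terms, penalty=STOPWORD_PENALTY):
    """Scale the score of every stop-word term by `penalty`.

    `term_scores` maps a syllable or syllable pair to its score;
    `stop_terms` is the set of terms belonging to stop-words.
    """
    return {
        term: score * penalty if term in stop_terms else score
        for term, score in term_scores.items()
    }


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents, stop_terms):
    """Return document ids sorted by penalized cosine similarity, best first."""
    q = penalized_vector(query, stop_terms)
    scored = [
        (cosine_similarity(q, penalized_vector(vec, stop_terms)), doc_id)
        for doc_id, vec in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda x: (-x[0], x[1]))]
```

Without the penalty, a document that matches the query only on a stop-word syllable can score as high as one that matches on a content syllable. After scaling the stop-word terms down, the content match dominates the ranking.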
