Research on Indexing Model of Retrieval System and Retrieval-related Technologies
|Course||Computer Software and Theory|
|Keywords||information retrieval; indexing model; inter-relevant successive tree; Chinese segmentation; textual classification; frequent itemsets mining|
With the improvement of computer system performance, the rapid spread of the Internet, and the advance of enterprise informatization, Chinese information resources are accumulating at a very large scale. Information retrieval is the technology of finding what people need in these massive information resources.

Information retrieval, which generally refers to text information retrieval, consists of the storage, organization, representation, and retrieval of information; its core technology is the indexing and retrieval of text. After several years of rapid development, information retrieval has entered an intelligent and networked stage. To improve precision and user experience, technologies related to indexing are being studied widely in addition to indexing itself. Word segmentation, which plays a role comparable to stemming in English, is specific to processing Chinese text, and it determines retrieval precision to some degree. Automatic text classification is useful for information organization and navigation. Its aim is to help users find, organize, and represent information better and to meet the higher demands of knowledge extraction, so it also helps users evaluate retrieval results quickly. Text association analysis, especially frequent itemset mining, can help transform a user’s retrieval need into retrieval keywords, which makes interaction with a retrieval system friendlier.

This paper studies the indexing model of information retrieval and its related technologies, including Chinese segmentation, fast textual classification, and textual association analysis. It proposes a novel indexing model based on the sorted duality inter-relevant successive tree, a fast segmentation algorithm based on the inter-relevant successive tree, a fast KNN algorithm based on simulated annealing, and a novel efficient algorithm for mining frequent patterns.
Our primary work is as follows.

1 Research on indexing model improvement of inter-relevant successive tree
The inter-relevant successive tree is an excellent indexing model proposed by Chinese researchers. Its merits are fast index creation, high space efficiency, and the ability to restore the original text from the index. To meet the demands of Internet applications that deal with massive data, the paper studies this index model further and proposes a sorted successive indexing model based on the inter-relevant successive tree. By intersecting sorted subtrees, the model returns the expected results quickly and improves the time efficiency of retrieval.

2 A fast segmentation algorithm based on inter-relevant successive tree
Chinese retrieval precision is closely related to Chinese segmentation. Quite a few current segmentation algorithms achieve good precision at the cost of time. On the Internet, a segmentation algorithm must compromise between efficiency and precision: it should first meet the required segmentation speed and then maximize precision. To improve the speed of segmentation, the paper proposes a new algorithm built on the inter-relevant successive tree data structure. The main causes of low segmentation precision are ambiguous words and words that are not in the dictionary, most of which are names of organizations and places. The paper studies the characteristics of organization and place names, summarizes their features, and then proposes a new segmentation algorithm that combines rules with machine learning methods. Experiments confirm that it is an excellent segmentation algorithm with higher precision and time efficiency.

3 Fast KNN algorithm based on simulated annealing
In the highly responsive setting of the Internet, textual classification faces two important issues: one is that the categories keep changing, and the other is the massive volume of data.
The first issue can be addressed by adopting a template-matching algorithm, the k-nearest neighbors (KNN) algorithm. For the second, we sort the high-dimensional text features and then borrow the idea of simulated annealing, so that, within a tolerable loss of precision, the algorithm classifies documents quickly. Experiments on different Chinese document sets show that it has good prospects for practical application.

4 Textual frequent itemsets mining algorithm based on projection sum tree
Since the running time grows exponentially with the number of items, improving the time efficiency of mining is a key factor in frequent itemset mining. The paper proposes a novel data structure, the projection sum tree. The items are counted and summed while the projection sum tree is being created, so no counting or summing is needed during mining. The algorithm is depth-first and traverses the tree only once, which improves time efficiency. Experiments show that this algorithm achieves higher efficiency than similar algorithms.

5 Chinese yellowpage information retrieval system (phase I) for Yellow Page Information Co. of China Telecom Group
Using the above-mentioned technologies, we built a yellowpage information retrieval system. Although this system is specialized for yellowpage information, the technologies can be used in any retrieval system, where they are equally effective and practical.
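
As an illustration of why sorted order helps the intersection step described under item 1, here is a minimal Python sketch. It assumes each query term maps to a plain sorted list of document IDs, which is a simplification and not the inter-relevant successive tree itself. The names `intersect_sorted`, `intersect_all`, and `postings`, as well as the sample data, are hypothetical and do not come from the paper.

```python
def intersect_sorted(a, b):
    """Return the IDs present in both sorted lists, in ascending order."""
    i, j = 0, 0
    result = []
    # Advance whichever pointer holds the smaller ID; each list is read once.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def intersect_all(lists):
    """Intersect several sorted lists, starting with the shortest one."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_sorted(result, other)
        if not result:
            break  # nothing left to match, stop early
    return result


# Hypothetical postings: term -> sorted document IDs (illustrative data only).
postings = {
    "hotel": [1, 4, 7, 9, 12],
    "shanghai": [2, 4, 9, 10, 12, 15],
    "airport": [4, 5, 12],
}
hits = intersect_all([postings[t] for t in ("hotel", "shanghai", "airport")])
print(hits)
```

Because both inputs are sorted, each pairwise intersection takes time linear in the combined list lengths, and starting from the shortest list keeps the intermediate results small. How the paper's sorted subtrees realize this ordering is not specified in the abstract.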