A study of information retrieval techniques based on log analysis
|School||University of Electronic Science and Technology|
|Course||Applied Computer Technology|
|Keywords||Log analysis; User behavior; Session segmentation; Query expansion; Nutch|
With the explosive growth of the Internet, search engines have become the primary tool with which users retrieve and filter information, and their importance is increasingly evident. Faced with this flood of information, however, search engines still cannot fully meet users' retrieval needs: they often return records unrelated to the user's intent, which makes retrieval inefficient. Starting from how query terms are used in Chinese retrieval, this paper explains the significance of information retrieval techniques based on query expansion. It then models the search log analysis process around the characteristics of user search behavior and the similarity between query terms, and applies the log analysis results to an improved query expansion model. The goal is to address the poor search results caused by short, semantically ambiguous queries. The work covers the following three aspects:
1. A log analysis model based on search behavior. Under the traditional HTTP session model, the time span of a single session often contains multiple search topics. This paper segments HTTP sessions according to the similarity between query terms and then merges the resulting segments according to a session similarity measure defined in the paper, so that sessions are split along the user's retrieval behavior. An analysis of actual search logs shows that this behavior-based log analysis model is better suited to extracting latent user feedback from search logs.
2. A study of query expansion methods.
This paper first discusses and compares the main query expansion methods. It then observes that historical query terms express users' retrieval intent, while the index terms of page documents reflect how the search engine positions those web documents. The improved query expansion method therefore associates the two according to term-frequency probability and uses the resulting association set as the source of expansion terms. It also assigns weights among the expansion terms. Experiments show that this query expansion method achieves higher precision than the other methods compared.
3. Design and implementation of a Nutch-based prototype system. Building on Nutch, the open source project of the Apache Software Foundation, this paper implements a query expansion module and improves Nutch's word segmentation. The query expansion module expands the original query terms using an expansion dictionary (thesaurus). Nutch's default segmentation is substantially improved to better support Chinese retrieval. Finally, experiments compare the prototype system with Nutch in terms of segmentation quality and first-page hit rate.
In summary, this paper works from actual search log data with the goal of improving search engine retrieval quality. In log analysis, it segments HTTP sessions, filters out irrelevant search log data, and mines the latent user feedback in the search logs. In query expansion, it associates historical query terms from the search logs with the index terms of the retrieved results and uses these associations for expansion. Experiments confirm that the improved method obtains better results. The paper closes with a summary of the thesis work and an analysis of directions for follow-up research.
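The session segmentation and merging step described in the first aspect can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold the thesis uses, so everything here is an assumption made for illustration: query similarity is taken to be Jaccard similarity over character bigrams (which needs no word segmenter for Chinese), session similarity is the same measure over the pooled bigrams of each segment, and the function names and the threshold of 0.2 are hypothetical.

```python
from typing import List, Set


def char_bigrams(query: str) -> Set[str]:
    """Character bigrams of a query with whitespace removed (assumed feature set)."""
    text = "".join(query.split()).lower()
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def segment_session(queries: List[str], threshold: float = 0.2) -> List[List[str]]:
    """Split one time-ordered HTTP session wherever consecutive queries are dissimilar."""
    segments: List[List[str]] = []
    for query in queries:
        if segments and jaccard(char_bigrams(segments[-1][-1]),
                                char_bigrams(query)) >= threshold:
            segments[-1].append(query)   # same topic as the previous query
        else:
            segments.append([query])     # topic changed: start a new segment
    return segments


def segment_similarity(a: List[str], b: List[str]) -> float:
    """Assumed session similarity: Jaccard over the pooled bigrams of each segment."""
    grams_a = set().union(*(char_bigrams(q) for q in a))
    grams_b = set().union(*(char_bigrams(q) for q in b))
    return jaccard(grams_a, grams_b)


def merge_segments(segments: List[List[str]], threshold: float = 0.2) -> List[List[str]]:
    """Merge each segment into the first earlier group it is similar enough to."""
    merged: List[List[str]] = []
    for segment in segments:
        for group in merged:
            if segment_similarity(group, segment) >= threshold:
                group.extend(segment)
                break
        else:
            merged.append(list(segment))  # no similar group found
    return merged
```

Calling `merge_segments(segment_session(queries))` on the time-ordered queries of one HTTP session returns groups of queries that each approximate a single search topic. A real log would likely also need a time-gap rule and a tuned threshold, neither of which is specified in the abstract.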