Design and Implementation of Distributed Index and Search System Based on Cloud Platform

Author YangJianDan
Advisor BaoYuBin
School Northeastern University
Course Computer Software and Theory
Keywords cloud computing, distributed search, parallel indexing, Hadoop, Lucene, inverted index
CLC TP391.3
Type Master's thesis
Year 2011

With the development of computer technology and the arrival of the Internet era, the amount of information on the Internet is growing explosively. Faced with this volume of data, indexing time grows linearly with the number of files to be indexed, and under high traffic or with large volumes of index data, search servers cannot process requests within a limited time. Consequently, how to create indexes quickly and how to search them efficiently have become crucial issues. In addition, the results returned by current search engines (such as Google and Baidu) contain only Web page data and no structured data, so users must open a Web page to find the structured information they need; the results cannot show detailed information directly, which makes for a poor user experience. Solving these two kinds of problems is extremely important for obtaining information from the Internet.

To solve these problems, we designed and implemented a distributed index and search system with a layered architecture on a cloud computing platform. First, for the massive volume of data to be indexed, we propose a parallel method that uses Lucene and runs on multiple nodes of a Hadoop cluster to create inverted indexes. Because multiple machines index data simultaneously, this method greatly accelerates indexing. Second, we propose a distributed retrieval method based on Katta that resolves the problem of high traffic and large-scale index files slowing down search. On the one hand, the system caches previous search results at different levels; on a cache hit it returns the results directly, and otherwise it executes the search. On the other hand, the system distributes index files across many nodes of a Katta cluster and stores multiple copies of them, and multiple nodes search the index files at the same time, which improves the retrieval speed, reliability, and scalability of the system. Third, we present a way of displaying search results that shows structured data in the form of a tree and shows Web data in the manner of Baidu and Google, to improve the user query experience.

Finally, based on an analysis of Web data, we chose Web pages containing mobile and company information to test the system comprehensively. The experiments and practical application show that the designed system can quickly create indexes over massive data and respond quickly to queries. Moreover, the query results display structured data in an intuitive way. On the whole, the system also has good scalability and fault tolerance.
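
The indexing and retrieval approach described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation and does not use Lucene, Hadoop, or Katta: worker threads stand in for cluster nodes, a plain dictionary stands in for the multi-level result cache, and index replication is not modeled. All names (`build_shard`, `build_shards`, `DistributedSearcher`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_shard(docs):
    """Build an inverted index (term -> set of doc ids) for one data partition."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def build_shards(partitions):
    """Index all partitions concurrently; each worker stands in for one node."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return list(pool.map(build_shard, partitions))


class DistributedSearcher:
    """Checks a result cache first, then queries every shard in parallel."""

    def __init__(self, shards):
        self.shards = shards
        # Assumption: one cache level only; the abstract describes several.
        self.cache = {}

    def _search_shard(self, shard, terms):
        # AND semantics: a document must contain every query term.
        postings = [shard.get(term, set()) for term in terms]
        return set.intersection(*postings) if postings else set()

    def search(self, query):
        key = query.lower()
        if key in self.cache:
            return self.cache[key]  # cache hit: skip the shard search
        terms = key.split()
        with ThreadPoolExecutor(max_workers=max(1, len(self.shards))) as pool:
            partial = list(
                pool.map(lambda shard: self._search_shard(shard, terms), self.shards)
            )
        # Merge the per-shard hits into one sorted result list.
        result = sorted(set().union(*partial))
        self.cache[key] = result
        return result


if __name__ == "__main__":
    partitions = [
        {"d1": "cheap mobile phone", "d2": "company profile"},
        {"d3": "mobile company news"},
    ]
    searcher = DistributedSearcher(build_shards(partitions))
    print(searcher.search("mobile"))
    print(searcher.search("mobile company"))
```

In this sketch each partition is indexed independently, which is what allows the indexing work to be spread across machines, and each query is answered by merging the partial results from all shards. A real deployment would also have to handle ranking, shard replication, and node failure, which are outside the scope of this example.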