Research on a Search Engine Indexing System Based on Web Page Blocks
|Course||Applied Computer Technology|
|Keywords||Page segmentation; Indexing system; Classification; Search engine|
Existing search engines index and retrieve web pages as whole units. However, a single page may contain blocks on different topics. If the keywords a user submits happen to fall in blocks with different topics, the search engine returns the page even though it is unrelated to the user's request. To improve the search engine indexing system, this work introduces the idea of page segmentation. VIPS is chosen as the page segmentation algorithm. In practice, however, the classic VIPS algorithm does not control segmentation granularity well: it may cut a page too coarsely or too finely. To handle these two cases, a node depth threshold and a leaf node count threshold are introduced, so that VIPS can segment adaptively according to the characteristics and size of the page. Using pages crawled from three portal sites as a test set, comparative tests against the classic algorithm show that the improved algorithm is effective. A given page is first segmented into blocks; blocks on related topics are then merged, based on their content, into sub-documents, and each sub-document is indexed separately. As a result, the search engine returns the original page only when all the keywords submitted by the user are contained within a single sub-document. In designing the block-based indexing system, a number of rules were developed to filter out blocks unrelated to the main content, and the remaining blocks were classified. Finally, three groups of seed keywords were defined and submitted as queries to Google to obtain a test set, and retrieval results over this collection were compared with and without the improved index. Experiments show that the proposed indexing scheme considerably improves retrieval accuracy and the F1 score.
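The difference between page-level and sub-document-level indexing can be illustrated with a minimal sketch. It assumes the page has already been segmented and its blocks merged into sub-documents; the sample pages, the whitespace tokenizer, and the function names `build_index`, `search`, and `search_pages_by_block` are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_index(units):
    """Build an inverted index mapping each term to the ids of units containing it."""
    index = defaultdict(set)
    for unit_id, text in units.items():
        for term in tokenize(text):
            index[term].add(unit_id)
    return index


def search(index, query):
    """Return the ids of units that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def search_pages_by_block(block_index, query):
    """Search sub-documents, then map each hit back to its original page id."""
    return {page_id for page_id, _ in search(block_index, query)}


# Hypothetical pages, each already split into topic-coherent sub-documents.
pages = {
    "page1": [
        "football match results and league table",
        "stock market prices fell today",
    ],
    "page2": ["football league transfer prices rise"],
}

# Page-level index: all sub-documents of a page are treated as one unit.
page_index = build_index(
    {page_id: " ".join(blocks) for page_id, blocks in pages.items()}
)

# Sub-document index: each unit is identified by (page id, block number).
block_index = build_index(
    {
        (page_id, i): block
        for page_id, blocks in pages.items()
        for i, block in enumerate(blocks)
    }
)

page_hits = search(page_index, "football prices")
block_hits = search_pages_by_block(block_index, "football prices")
```

Here the page-level index matches both pages, because `page1` contains "football" and "prices" in two unrelated blocks. The sub-document index matches only `page2`, where both keywords occur within the same sub-document.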