Link Context-Based Web Page Prediction
|Course||Computer Software and Theory|
|Keywords||Focused crawling; Link context; Web page prediction|
Recently, focused crawling has become one of the hottest research topics in the field of network information retrieval. Unlike traditional general-purpose web crawlers, which traverse the web and collect pages indiscriminately, a focused crawler is designed to gather pages on a specific topic prescribed by the user. Web page prediction forecasts the content of the pages behind unvisited URLs, calculates each page's content similarity to the topic from that forecast, and then prioritizes the queue of unvisited URLs to determine the order in which they are picked up and crawled. It is the crucial part of the whole page-collection process and therefore the core of focused crawling technology.

Generally speaking, users with knowledge of the relevant field are skillful at judging whether a web page will interest them before they click on it. Simulating this human judgment in a computer program, however, is very challenging.

Traditional web page prediction methods commonly take the whole page as the smallest unit of processing. With the rapid development of the Internet, web page design has become more complicated and page content has shifted from a single-topic model to a multi-topic model. A prediction scheme based on the whole page is apt to overlook important details that are highly relevant to the user's retrieval topic, so some topic-related pages are never visited or are visited only after a delay, and this problem grows more serious as pages become more complex.

Link context-based web page prediction, the algorithm presented in this paper, is designed to solve this problem effectively. A URL's anchor text and its surrounding context usually contain simple but accurate semantic clues about the content of the corresponding page.
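The queue prioritization described above can be sketched as a best-first frontier. This is a minimal illustration, not the thesis's implementation; the class name, scores, and URLs are assumptions made for the example.

```python
import heapq

class CrawlFrontier:
    """Priority queue of unvisited URLs, ordered by predicted topic similarity."""

    def __init__(self):
        self._heap = []      # entries: (negated score, insertion order, url)
        self._counter = 0    # tie-breaker so equal scores pop in FIFO order

    def push(self, url, predicted_similarity):
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-predicted_similarity, self._counter, url))
        self._counter += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

frontier = CrawlFrontier()
frontier.push("http://example.org/gcc-howto", 0.82)
frontier.push("http://example.org/celebrity-news", 0.05)
frontier.push("http://example.org/kernel-faq", 0.91)
print(frontier.pop())  # http://example.org/kernel-faq
```

The crawler repeatedly pops the most promising URL, fetches it, scores the new outgoing links, and pushes them back onto the frontier.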
At the same time, regardless of a page's topic model, texts and links relevant to a specific topic are often designed and placed in the same block. If a text block contains a link, incorporating the block's text into the link context by an appropriate method is a good way to enlarge the URL's anchor text effectively, thereby improving the accuracy of web page prediction.

That a URL's anchor text belongs to the link context is beyond doubt. It is usually precise, yet too short for a computer program to make judgments from those few words alone. The text in the neighborhood of the anchor tag is another important source of link context, and how to pick up that text is the focus and the difficult point of this study. After exploring a variety of methods, we propose an advanced link context extraction algorithm, A-GPant. It combines the advantages of the DOM Offset method and the Aggregation Node approach, and applies techniques for identifying a page's navigation column and removing advertisements. The algorithm first selects candidate URLs, according to certain criteria, from all the outgoing links extracted from the currently visited page, and then picks up the link context for each candidate URL from the page's DOM hierarchy. Web page prediction keeps the crawler focused on a specific topic by making judgments from the heuristic information the link context provides. Experimental evidence shows that link contexts extracted by our A-GPant algorithm have higher average context similarity, lower zero-similarity frequency, and more topic-relevant words than those produced by the Aggregation Node approach.

Secondly, how to gather the relevant knowledge required as the topic description in the web page prediction process is another challenge. Depending on the topic a user prescribes, different fields of knowledge are needed.
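A-GPant itself combines DOM offsets and aggregation nodes with navigation-column and advertisement filtering; the sketch below shows only the simpler underlying idea, namely taking the text of a link's enclosing block as its context, using Python's standard `html.parser`. The block-tag set and the sample HTML are illustrative assumptions.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "div", "td"}  # illustrative choice of block elements

class LinkContextParser(HTMLParser):
    """For each <a href=...>, record the full text of its enclosing block
    element as a crude link context (anchor text included)."""

    def __init__(self):
        super().__init__()
        self.contexts = {}      # href -> context string
        self._block_text = []   # text fragments seen inside the current block
        self._block_links = []  # hrefs seen inside the current block
        self._anchor_href = None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._block_text, self._block_links = [], []
        elif tag == "a":
            self._anchor_href = dict(attrs).get("href")

    def handle_data(self, data):
        self._block_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_href:
            self._block_links.append(self._anchor_href)
            self._anchor_href = None
        elif tag in BLOCK_TAGS:
            # Normalize whitespace and assign the block text to every link in it.
            context = " ".join(" ".join(self._block_text).split())
            for href in self._block_links:
                self.contexts[href] = context
            self._block_text, self._block_links = [], []

parser = LinkContextParser()
parser.feed('<div><p>Install the <a href="/gcc">GCC compiler</a>'
            ' on any Linux distribution.</p></div>')
print(parser.contexts["/gcc"])  # Install the GCC compiler on any Linux distribution.
```

A real extractor would, as the thesis describes, also weigh the DOM distance between text and anchor and strip navigation and advertising blocks before accepting their text as context.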
Whether we can gather this topic-related knowledge rapidly and accurately is another issue we need to address.

Traditional topic description gathering methods all rely on a hierarchical directory structure, which strictly limits the user's behavior. Moreover, the topic descriptions they generate contain a very large number of words, and the dimension of the corresponding dictionary is too high for prioritizing the unvisited URLs from the prediction result. Because a link context contains far fewer words than a whole page, it matches only a few word-based features; the distinctions between any two URLs become so small that the link context similarity calculations show essentially no difference between them. Prioritizing the unvisited URLs then becomes meaningless, and a focused crawler guided by the link context-based prediction scheme does not work as expected.

To fit the concision and shortness of link context, we develop a flexible and quick method to gather the topic description according to the user's requirement. The anchor texts found in the pages that link back to a given page describe that page's content at both fine and coarse granularity. We use the seed URLs' backward pages to gather the topic description quickly and to build a dictionary containing many topic-related words and word-based features. This dictionary and feature set suit the link context and serve as the standard features for measuring the similarity between the topic and the link context picked up for each URL encountered during crawling. They also amplify the differences between the link context similarities of different URLs, enhancing the role of link context in web page prediction. The dictionary can be changed or updated quickly according to the seed URLs' topic tendency. In theory, the optimum solution is to build an expert dictionary suited to the size and length of the link context.
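A similarity measure of the kind described, comparing a short link context against a compact topic dictionary, can be sketched as cosine similarity over word counts. The dictionary contents and weighting are illustrative assumptions; the thesis's actual feature weighting may differ.

```python
import math
from collections import Counter

def topic_similarity(link_context, topic_dictionary):
    """Cosine similarity between a link context string and a topic word list.

    Plain term counts are used here; a TF-IDF or other weighting scheme
    could be substituted without changing the structure.
    """
    ctx = Counter(w.lower() for w in link_context.split())
    topic = Counter(w.lower() for w in topic_dictionary)
    dot = sum(ctx[w] * topic[w] for w in ctx)
    norm = (math.sqrt(sum(c * c for c in ctx.values()))
            * math.sqrt(sum(c * c for c in topic.values())))
    return dot / norm if norm else 0.0

# Illustrative small dictionary for a "Linux" topic.
topic = ["linux", "kernel", "gcc", "distribution", "shell"]
print(topic_similarity("Install the GCC compiler on any Linux distribution", topic))
print(topic_similarity("Celebrity gossip and fashion news", topic))  # 0.0
```

Because the dictionary is small and topic-focused, on-topic and off-topic link contexts receive clearly separated scores, which is exactly what the frontier prioritization needs.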
In practice, however, that is hard to do. Experiments show that in most cases the small dictionary generated automatically from backward URLs matches the performance of a manually built expert dictionary in measuring link context similarity to the topic, while being far more flexible to generate and update.

Finally, we build a focused crawling system guided by the complete algorithm presented here. A comprehensive experiment with different seed URLs on the topic of Linux shows clearly that this approach outperforms both the Best-First and Breadth-First algorithms in harvest ratio and efficiency in most cases.

Future work, including new challenges and technological possibilities, is discussed at the end of this paper.
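The harvest ratio used in the evaluation is conventionally the fraction of crawled pages that are relevant to the topic. A minimal sketch, assuming relevance is decided by a similarity threshold (the threshold value below is an illustrative choice, not the thesis's):

```python
def harvest_ratio(page_scores, relevance_threshold=0.1):
    """Fraction of crawled pages whose topic similarity meets the threshold.

    page_scores: topic similarity score of each crawled page, in crawl order.
    """
    if not page_scores:
        return 0.0
    relevant = sum(1 for s in page_scores if s >= relevance_threshold)
    return relevant / len(page_scores)

print(harvest_ratio([0.8, 0.05, 0.6, 0.0, 0.3]))  # 3 of 5 pages relevant -> 0.6
```

A better link context extractor raises this ratio because the crawler spends its fetch budget on pages the prediction step correctly ranked as on-topic.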