Template independent web information extraction
|Course||Applied Computer Technology|
|Keywords||web information extraction; machine learning; template-independent|
With the rapid growth of the Internet, the amount of information that people can access has increased exponentially, and web information extraction technology, which can manage this volume of information automatically, is attracting increasing attention from researchers. For any specific user, however, only a small part of that information is useful, so techniques that retrieve the relevant text from huge volumes of web data are becoming more and more important. Information extracted from the Internet is useful not only to end users but also as input for building intelligent query and data mining systems. As a result, web information extraction has become one of the active research topics in the field of information retrieval.

In this thesis, we first introduce the key technologies of web information extraction. For data representation, we parse the HTML source of a web page into a DOM tree and use the nodes of the DOM tree as training samples, describing the structural information of each sample with visual features and hand-designed features. We then introduce two categories of web information extraction methods: template-dependent methods and template-independent methods. By presenting and analyzing these methods, we summarize the advantages, disadvantages, and scope of application of each category.

Second, we study the meaning and goals of extraction from news pages and forum pages. Starting from the characteristics of the web corpus, we design an HTML parser to support the analysis, debugging, and labeling of the corpus. We then build a wrapper by establishing a model and training a classifier.

Experimental results on both English and Chinese corpora show that the F-value reaches 96.7% on news pages and 89.1% on forum pages. Comparative experiments demonstrate that our methods significantly improve extraction accuracy, and the absolute results show that our method is suitable for use in a real system.
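The data representation described above, in which DOM tree nodes serve as training samples, can be illustrated with a minimal sketch. It is not the parser or the feature set used in this work: the class and function names (`Node`, `TreeBuilder`, `node_samples`) and the chosen features (tag, depth, text length, link density, child count) are assumptions made for illustration only. Visual features are omitted because they would require a rendering engine.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text located directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Surrounding whitespace is dropped, so lengths are approximate.
        self.current.text += data.strip()


def count_chars(node, in_link=False):
    """Return (all characters, characters inside <a>) for a subtree."""
    in_link = in_link or node.tag == "a"
    total = len(node.text)
    linked = total if in_link else 0
    for child in node.children:
        child_total, child_linked = count_chars(child, in_link)
        total += child_total
        linked += child_linked
    return total, linked


def node_samples(html):
    """Turn every element of the page into one feature dictionary."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    samples = []

    def visit(node, depth):
        total, linked = count_chars(node)
        samples.append({
            "tag": node.tag,
            "depth": depth,
            "text_len": total,
            "link_density": linked / total if total else 0.0,
            "n_children": len(node.children),
        })
        for child in node.children:
            visit(child, depth + 1)

    for child in builder.root.children:
        visit(child, 0)
    return samples
```

Each dictionary returned by `node_samples` would correspond to one training sample. In such a setup, a node with long text and low link density is a plausible candidate for main content, while a node made up almost entirely of link text is more likely navigation. The actual labels would come from a manually annotated corpus.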