
Study on Key Techniques of Web Mining for Intelligent Information Retrieval

Author YuanFang
Advisor YuGe
School Northeastern University
Course Computer Software and Theory
Keywords Intelligent Information Retrieval; Data Mining; Web Mining; Personalized Services; Data Preprocessing; Information Extraction; Clustering Analysis; Classification Rule; Web User; Web Page; Ontology; Conceptual Retrieval
CLC TP391.3
Type PhD thesis
Year 2006
Downloads 1681
Citations 9

Since the WWW appeared in 1991, it has developed rapidly and is becoming an important information source for human society. As Internet technologies continue to develop and mature, the WWW will serve as an important medium through which people obtain information. In the past, it was convenient for people to search for useful information, but with the enormous growth in the amount of information on the Internet, people find it increasingly difficult to find what they need. The reason is that traditional information retrieval technology no longer copes well with such massive amounts of information. A more intelligent information retrieval technology for massive information retrieval on the Internet is therefore urgently needed.

This dissertation studies several key Web mining techniques for intelligent information retrieval. It focuses mainly on data preprocessing, classification and clustering of Web pages and Web users, conceptual retrieval, and personalized services. We propose or improve several Web mining algorithms for intelligent information retrieval, and we also develop a prototype intelligent information retrieval system.

Data preprocessing includes information extraction from PDF documents, Chinese word segmentation, and Web log preprocessing. For information extraction from PDF documents, we propose a rule extraction algorithm based on format infusion and an information extraction algorithm based on a tree model. For Chinese word segmentation, we propose a method based on a gradually enriched dictionary; compared with either dictionary matching alone or the statistical method alone, this new method obtains much better results. For Web log preprocessing, the dissertation mainly discusses path completion and gives a new algorithm.

In the research on Web page classification, this dissertation discusses various text classification methods and concentrates on the k-nearest neighbor (k-NN) method, which achieves comparatively high accuracy in text classification. To improve the efficiency of k-NN, we propose a training sample reduction method based on class density and a gradual classification pattern. The reduction method computes the density of each class in the training set and the average density of the whole training set, and then deletes some samples from the high-density classes. The gradual classification pattern reduces the proportion of cases in which the whole document must be analyzed by intelligently simulating manual classification.
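
The idea of reducing k-NN training samples by class density can be sketched as follows. This is only a minimal illustration in Python under stated assumptions, not the dissertation's algorithm: the abstract does not define density or say which samples are deleted. Here a class's density is assumed to be its sample count divided by the mean distance to the class centroid, the average density is assumed to be the mean of the class densities, and a class denser than average is thinned to the fraction average/density by keeping samples evenly spaced by distance from the centroid. All function names are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid_of(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def class_density(vectors):
    # ASSUMPTION: density = sample count / mean distance to the class centroid.
    center = centroid_of(vectors)
    mean_dist = sum(euclidean(v, center) for v in vectors) / len(vectors)
    return len(vectors) / (mean_dist + 1e-9)  # epsilon avoids division by zero


def reduce_training_set(samples):
    """Thin out classes whose density exceeds the average class density.

    samples: list of (vector, label) pairs. Returns a new, smaller list.
    """
    by_class = defaultdict(list)
    for vector, label in samples:
        by_class[label].append(vector)

    densities = {label: class_density(vs) for label, vs in by_class.items()}
    # ASSUMPTION: "average density" = mean of the per-class densities.
    average = sum(densities.values()) / len(densities)

    reduced = []
    for label, vectors in by_class.items():
        density = densities[label]
        if density <= average:
            reduced.extend((v, label) for v in vectors)  # keep sparse classes whole
            continue
        # ASSUMPTION: keep the fraction average/density of a dense class,
        # picking samples evenly spaced by distance from the centroid.
        keep = max(1, round(len(vectors) * average / density))
        center = centroid_of(vectors)
        ordered = sorted(vectors, key=lambda v: euclidean(v, center))
        step = len(ordered) / keep
        reduced.extend((ordered[int(i * step)], label) for i in range(keep))
    return reduced


def knn_predict(samples, query, k=3):
    """Plain majority-vote k-NN over (vector, label) pairs."""
    nearest = sorted(samples, key=lambda s: euclidean(s[0], query))[:k]
    votes = defaultdict(int)
    for _, label in nearest:
        votes[label] += 1
    return max(votes, key=votes.get)


# Toy data: a tightly packed class and a widely spread one.
samples = [([0.1 * (i % 5), 0.1 * (i // 5)], "dense") for i in range(20)]
samples += [([10, 10], "sparse"), ([14, 10], "sparse"),
            ([10, 14], "sparse"), ([14, 14], "sparse")]
reduced = reduce_training_set(samples)
print(len(samples), "->", len(reduced))
print(knn_predict(reduced, [0.2, 0.1]), knn_predict(reduced, [11, 11]))
```

Because k-NN compares each query against every stored sample, classification time grows with the size of the training set. Thinning only the dense classes shrinks that cost while leaving sparse classes, where each sample carries more information, untouched.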
