Research and Implementation of Topic Web Spider in Enterprise Competitive Intelligence
|School||Xi'an University of Electronic Science and Technology|
|Course||Applied Computer Technology|
|Keywords||Topic Web Spider; Correlation Calculation; Search Algorithm; Enterprise Competitive Intelligence|
Competitive intelligence systems are increasingly becoming an indispensable tool for modern enterprises, and the Internet has become an important channel through which enterprises gather information. However, web information is scattered all over the world. How to extract specific topics from web resources and deliver valuable intelligence in a timely manner has become a new problem in the field of information gathering. In recent years, the subject-oriented web spider has emerged and become an important tool for topic search.

Drawing on research into the key technologies of topic web spiders in China and abroad, this paper focuses on web content analysis, text feature vector extraction, topic relevance calculation, and web search algorithms. We parse each web page into a document tree and then obtain its text and relevant links by traversing the tree. After obtaining the content, we segment the text into words. According to the characteristics of web documents, we improve the TF-IDF feature weight calculation and propose a feature vector calculation algorithm based on frequency and tags (the FAT algorithm). Based on the words of the feature vector, together with the anchor text and the web content, we propose a link topic correlation calculation algorithm (the LTC algorithm), so that the web spider downloads as many topic-relevant pages as possible.

For the web search algorithm, we introduce a non-greedy selection strategy and a genetic search strategy, and propose a non-greedy genetic search algorithm (the NGGS algorithm) that expands the search space and avoids the local optimum problem.

On the basis of this research, we design a topic web spider system (the BlueSpider system) and describe its implementation in detail through pictures, class diagrams, flowcharts, and tables.
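The pipeline summarized above (parse the page, weight terms by frequency and tag, then score each link from its anchor text and the page content) can be illustrated with a minimal Python sketch. This is not the FAT or LTC algorithm itself, whose formulas the abstract does not give. The tag weights, the regex tokenizer (standing in for word segmentation), the `anchor_share` mixing parameter, and all function and class names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: terms in titles and headings count more than body text.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}


def tokenize(text):
    """Simple stand-in for word segmentation: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PageParser(HTMLParser):
    """Tracks open tags while parsing; collects weighted terms and links."""

    def __init__(self):
        super().__init__()
        self.stack = []                # currently open tags
        self.term_weights = Counter()  # term -> frequency scaled by tag weight
        self.links = []                # (href, anchor_text) pairs
        self.current_link = None

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            self.current_link = [dict(attrs).get("href"), ""]

    def handle_endtag(self, tag):
        if tag == "a" and self.current_link is not None:
            if self.current_link[0]:
                self.links.append(tuple(self.current_link))
            self.current_link = None
        if tag in self.stack:
            # Pop back to the matching open tag (also drops unclosed void tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if SKIPPED_TAGS.intersection(self.stack):
            return
        # Use the heaviest enclosing tag as the weight for this text run.
        weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                     default=DEFAULT_WEIGHT)
        for term in tokenize(data):
            self.term_weights[term] += weight
        if self.current_link is not None:
            self.current_link[1] += data


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_links(html, topic, anchor_share=0.6):
    """Rank links by a mix of anchor-text and page-content similarity to the topic.

    anchor_share is an assumed mixing parameter, not a value from the thesis.
    """
    parser = PageParser()
    parser.feed(html)
    parser.close()
    page_sim = cosine(parser.term_weights, topic)
    scored = []
    for href, anchor in parser.links:
        anchor_sim = cosine(Counter(tokenize(anchor)), topic)
        scored.append((href, anchor_share * anchor_sim + (1 - anchor_share) * page_sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = ('<html><head><title>Market intelligence</title></head><body>'
              '<h1>Competitor news</h1><p>Pricing report</p>'
              '<a href="/a">competitor pricing</a><a href="/b">contact us</a>'
              '</body></html>')
    topic_vector = {"competitor": 1.0, "pricing": 1.0, "intelligence": 1.0}
    for href, score in score_links(sample, topic_vector):
        print(href, round(score, 3))
```

In a crawler, the ranked links would feed the download queue. A non-greedy strategy such as the one the abstract describes would occasionally pick lower-ranked links rather than always taking the top one.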