Research on Distributed Text Data Filtering Technology and System Implementation Based on MapReduce |
|
Author | LiHu |
Tutor | ZouPeng |
School | National University of Defense Science and Technology |
Course | Computer Science and Technology |
Keywords | Text filtering; MapReduce; Distributed computing; Vector space model; Feature extension |
CLC | TP391.1 |
Type | Master's thesis |
Year | 2011 |
The rapid development of the Internet has brought explosive growth in information, and obtaining useful information accurately from this flood has become urgent and necessary. Information filtering technology, driven by user needs, removes information that does not meet the requirements from a dynamic information stream and automatically selects what is useful. Faced with massive volumes of data, traditional methods can no longer keep up, and distributed computing platforms are the inevitable direction of future development.

Content-based text filtering represents each text with the vector space model and judges its relevance by computing the cosine of the angle between the text vector and the user interest template. The theory is mature, the method is simple and easy to understand, and its accuracy is relatively high. The MapReduce framework provides distributed parallel processing of massive data on large computer clusters. Users only need to supply custom map and reduce functions to implement most distributed computing tasks, and many real-world computations are easily expressed in the MapReduce model.

Taking the content-based text filtering model as its basis and addressing the shortcomings of existing text filtering systems, this thesis studies the key technologies involved in real-time filtering of massive data using the MapReduce framework. The main work is as follows:

(1) Study of the theory and technology of content-based information filtering systems, with in-depth analysis and discussion of several key technologies and of the strengths and weaknesses of existing methods in practical applications.
(2) In-depth analysis of the working principle of the MapReduce framework and its components, with examples illustrating MapReduce-based distributed application development.
(3) Design of a feature-item extension model based on the HowNet Chinese knowledge base, which merges feature items that have the same meaning, reducing the vector dimension while improving the accuracy of the representation.
(4) An algorithm for computing the TF-IDF values of feature items on the MapReduce framework, which parallelizes the computation by decomposing it into subtasks.
(5) Design and implementation of a prototype distributed text data filtering system based on the MapReduce framework; experiments demonstrate the feasibility of the approach.
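
The abstract does not give the thesis's actual algorithms, so the following is only a minimal sketch of the general idea: TF-IDF weights computed through map and reduce style passes, followed by cosine-similarity filtering against a user interest template. Everything here is an assumption made for illustration: the function names (`map_term_counts`, `reduce_sum`, `tf_idf`, `cosine`, `filter_docs`) are hypothetical, the map and reduce phases are simulated in a single Python process rather than run on a cluster, tokenization is plain whitespace splitting (no Chinese word segmentation and no HowNet-based merging of feature items), and the weighting uses length-normalized term frequency with `log(N / df)` as the inverse document frequency.

```python
import math
from collections import defaultdict


def map_term_counts(doc_id, text):
    """Map phase: emit ((doc_id, term), 1) for every token in a document."""
    for term in text.lower().split():
        yield (doc_id, term), 1


def reduce_sum(pairs):
    """Reduce phase: group pairs by key and sum their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tf_idf(docs):
    """Build sparse TF-IDF vectors for {doc_id: text} with map/reduce-style passes."""
    # Pass 1: occurrences of each term in each document.
    counts = reduce_sum(
        pair for doc_id, text in docs.items() for pair in map_term_counts(doc_id, text)
    )
    # Pass 2: document frequency per term; each (doc, term) key counts once.
    doc_freq = reduce_sum((term, 1) for (_doc, term) in counts)
    # Pass 3: total number of tokens per document, used to normalize TF.
    doc_len = reduce_sum((doc, n) for (doc, _term), n in counts.items())

    n_docs = len(docs)
    vectors = defaultdict(dict)
    for (doc, term), n in counts.items():
        tf = n / doc_len[doc]
        idf = math.log(n_docs / doc_freq[term])  # assumed IDF form
        vectors[doc][term] = tf * idf
    return dict(vectors)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_docs(vectors, profile, threshold):
    """Keep the documents whose similarity to the interest template meets the threshold."""
    return [doc for doc, vec in vectors.items() if cosine(vec, profile) >= threshold]
```

In this sketch the user interest template is simply another sparse vector of term weights, for example `{"mapreduce": 1.0, "cluster": 1.0}`, and the threshold is a tuning parameter chosen by the caller. Each pass reuses the same sum reducer, which is one way the computation could be decomposed into independent tasks; on a real cluster each pass would be a separate job.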