Research and Implementation of De-duplication Technology Oriented to Large-scale Short Texts
|School||National University of Defense Science and Technology|
|Keywords||Text De-duplication; Text Mining; ARFA; ARFA-SA|
With the rapid development of computer science and communication technology, the volume of short texts from instant messaging, BBS, newsgroups, and e-mail has been growing quickly. Although this growth has brought convenience to people's lives, it has also made useful information harder to obtain, because the flood of short texts is beyond users' control. At the same time, useless and harmful information seriously affects the decisions of government departments, companies, and other managers. Research shows that close to half of these massive text messages are repeated information. Through de-duplication, users can not only optimize data storage but also find hot topics for analysis and decision-making.
Automatic de-duplication, as a basic technology in text mining, can be used not only in data preparation, such as data cleaning, data merging, and data exchange, but also in data analysis, such as duplicate record detection. At present, automatic de-duplication mainly comprises field matching techniques and duplicate record detection. Field matching techniques can effectively detect mismatches in a database, for example spelling mistakes, abbreviations, and superfluous words. Duplicate record detection places duplicate or identical texts into the same category through machine learning and intelligent methods.
The application of text de-duplication techniques is restricted by the brevity and large scale of short texts. Because feature selection is not effective for short texts, classification and clustering cannot be applied well in the de-duplication field.
With regard to the application of de-duplication in text mining, and taking user requirements into account, this paper introduces:
1. The Association Rule and Feature Code Based Fast Remove Duplication Algorithm (ARFA).
Considering text attributes, ARFA implements de-duplication by differentiating texts through association rules and detecting duplicate texts through feature codes. Experiments show that the algorithm performs well, handles large-scale information effectively, and achieves a high compression ratio.
2. ARFA-SA, which implements de-duplication on the basis of ARFA. It rests on the hypothesis that similarity transfer occurs when the similarity between texts exceeds a threshold value. Under this hypothesis, identical or similar texts are put into the same group through similarity computation.
3. The application of ARFA and ARFA-SA. Applying the de-duplication algorithms in a data mining system realizes duplicate record detection and data storage optimization. Duplicate record detection covers identifying users who send group messages, users who receive group messages, and the related short text IDs. Storage optimization covers removing or merging redundant data.
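The two-stage idea described above (feature codes for identical texts, then similarity transfer for near-identical ones) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not say how the feature code or the similarity is computed, so the sketch assumes an MD5 hash of the normalized text as the feature code and Jaccard similarity over character bigrams as the similarity measure. The names `feature_code`, `bigrams`, `jaccard`, and `group_texts`, and the default threshold, are hypothetical. The association-rule stage of ARFA is not modeled here.

```python
import hashlib
from collections import defaultdict


def feature_code(text):
    """Assumed feature code: MD5 of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def bigrams(text):
    """Character bigrams of the normalized text (assumed similarity features)."""
    s = " ".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_texts(texts, threshold=0.6):
    """Group identical or similar texts; returns sorted lists of text indices."""
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Stage 1: texts with the same feature code are exact duplicates.
    first_seen = {}
    for i, text in enumerate(texts):
        code = feature_code(text)
        if code in first_seen:
            union(first_seen[code], i)
        else:
            first_seen[code] = i

    # Stage 2: similarity transfer. If sim(A, B) and sim(B, C) both exceed
    # the threshold, A, B, and C share a group even if sim(A, C) does not.
    reps = sorted(first_seen.values())
    grams = {i: bigrams(texts[i]) for i in reps}
    for x in range(len(reps)):
        for y in range(x + 1, len(reps)):
            i, j = reps[x], reps[y]
            if jaccard(grams[i], grams[j]) > threshold:
                union(i, j)

    groups = defaultdict(list)
    for i in range(len(texts)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

Calling `group_texts(messages)` on a list of short texts returns groups of indices; keeping one text per group corresponds to the storage optimization described in item 3. The pairwise comparison in stage 2 is quadratic in the number of distinct texts, so a large-scale system would need some form of candidate filtering before it, which this sketch omits.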