Research on Compression, Operation and Query Processing Methods of Massive Datasets
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||Massive Data; Scientific and Statistical Database; Compressed Database; Column-Compressed Storage|
Information technology has developed rapidly, and we have entered an era of massive data. Managing massive data is an urgent task for the informatization of society. Storing and managing such data efficiently while supporting SQL queries effectively is a great challenge for DBMSs.

Massive databases, such as scientific and statistical databases, are widely used in earthquake monitoring, weather forecasting, physics and chemistry experiments, and so on. Such databases contain a great deal of redundancy: the same data appear repeatedly in different places. If the data are stored directly, storage is wasted and query performance is degraded. In addition, the relation schema is relatively stable, and the candidate values for each attribute are limited. Newly arriving data are only appended to the end of the current data area, without updating existing data. Queries typically involve only a small minority of the many attributes.

Compressed database technology combines data compression with database technology to handle storage and querying in massive databases. It comprises data compression methods, data operation algorithms, and query processing techniques.

In this paper, we propose a new compression method and storage architecture that are suitable for massive databases and support data operations and query processing efficiently.

The compression method adopts the idea of Column-Compressed Storage and uses Binary Encoding, Unary Encoding, K-of-N Encoding, and Superimposed Encoding to compress massive data. The encoded data are then stored according to their encoding bits, using an extended run-length encoding.

We also propose data operation algorithms, including selection and projection, that work on compressed data without decompressing it. Operations on the original data are converted into operations on the compressed bit files, which are simple to implement. A prototype for compressing and querying data in massive databases is designed and implemented with the above techniques. Theoretical analysis and preliminary experimental results show that compression using column-oriented storage can reduce storage space, lower query cost, and improve query efficiency.
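To make the idea of selecting on compressed bit files concrete, the following is a minimal Python sketch, not the method of this paper. It assumes a dictionary-based binary encoding of a single column, one bit vector per encoding bit, and plain run-length encoding in place of the extended run-length encoding described above; K-of-N and Superimposed Encoding are not covered. All function names are hypothetical.

```python
from itertools import groupby


def binary_encode(column):
    """Map each distinct value to a fixed-width integer code (assumed scheme)."""
    dictionary = {v: i for i, v in enumerate(sorted(set(column)))}
    width = max(1, (len(dictionary) - 1).bit_length())
    return dictionary, width, [dictionary[v] for v in column]


def bit_slices(codes, width):
    """Split the codes into one bit vector per bit position, most significant first."""
    return [[(c >> (width - 1 - j)) & 1 for c in codes] for j in range(width)]


def rle_encode(bits):
    """Plain run-length encoding: a list of (bit, run_length) pairs."""
    return [(b, len(list(g))) for b, g in groupby(bits)]


def rle_not(runs):
    """Complement a run-length encoded bit vector without expanding it."""
    return [(1 - b, n) for b, n in runs]


def rle_and(a, b):
    """AND two run-length encoded bit vectors of equal total length, run by run."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)  # merge with the previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out


def select_equal(slices, code, width):
    """Bitmap (still run-length encoded) of rows whose code equals `code`."""
    result = None
    for j, runs in enumerate(slices):
        want = (code >> (width - 1 - j)) & 1
        term = runs if want else rle_not(runs)
        result = term if result is None else rle_and(result, term)
    return result


def positions(runs):
    """Row numbers of the 1 bits in a run-length encoded bitmap."""
    rows, start = [], 0
    for bit, n in runs:
        if bit:
            rows.extend(range(start, start + n))
        start += n
    return rows


if __name__ == "__main__":
    column = ["rain", "sun", "sun", "sun", "rain", "snow", "sun", "sun"]
    dictionary, width, codes = binary_encode(column)
    slices = [rle_encode(s) for s in bit_slices(codes, width)]
    hits = select_equal(slices, dictionary["sun"], width)
    print(positions(hits))  # [1, 2, 3, 6, 7]
```

In this sketch, the selection `value = 'sun'` touches only the run-length encoded bit vectors of one column: each vector is used as is or complemented, depending on the corresponding bit of the target code, and the results are ANDed run by run. Projection under the same assumptions would amount to reading only the bit vectors of the requested columns.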