Implementation of Data Compression, Operation and Query Processing System Based on BAP
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||Massive Data; High Frequency Data; Compressed Database; Data Operation; Column-Compressed Storage System|
With the development of information technology and its wide application in finance, traffic, national defense, and environment and ecosystem monitoring, massive data is flooding the whole world. This is a great challenge to DBMSs. As the capacity-to-price ratio of disks keeps rising, the real problem is no longer how to store massive data but how to store it and execute queries on it efficiently.

Massive high frequency data contains a great deal of redundancy: the same data appears repeatedly in different places. Such redundancy not only wastes storage but also degrades query performance. Making full use of compressed database technology can reduce both storage and I/O bandwidth. Research on compressed database technology covers the design of compression algorithms and of query algorithms over compressed data.

There has been renewed interest in column-oriented database architectures in recent years. For read-mostly query workloads such as those found in data warehouse and decision support applications, “column-stores” have been shown to perform particularly well relative to “row-stores”. Compared with row-oriented architectures, storing data in columns presents a number of opportunities for improved performance from compression algorithms.

Building on existing relational database techniques, this paper studies data compression methods and storage architectures suitable for high frequency data, together with the corresponding query processing technology, including data operations and some query optimizations. The main results are as follows:

It proposes a compression and storage strategy called TIDC. TIDC is a column-oriented compression method based on attribute partition. It uses position information (called TupleID in the paper) to connect all the attributes in the database.
By storing only the positions and values of the non-constant data within each attribute, TIDC reduces the storage required and keeps a complete mapping from the original data to the compressed data. Queries can therefore be answered by operating directly on the compressed data, without decompressing it.

It presents data operation algorithms on TIDC-compressed data, including selection, projection and join, together with some optimization strategies.

It proposes a compression algorithm and data operation algorithms for the BAP method, including selection, projection and join, and also gives some optimization strategies for the corresponding query processing.

A prototype compressed DBMS using the above techniques is implemented. Theoretical analysis and preliminary experimental results show that column-oriented compression and storage based on attribute partition can greatly reduce storage space, lower query I/O cost and improve query efficiency. Moreover, the volume of data has less effect on query efficiency with TIDC than with BAP.
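
The idea of storing only positions and values of non-constant data in a column can be illustrated with a minimal Python sketch. The abstract does not specify TIDC's actual layout, so the sketch rests on an assumption: a column entry is kept only where the value differs from the previous tuple, paired with its TupleID. The function names (`compress_column`, `value_at`, `select_ranges`) and the sample data are hypothetical and not taken from the paper.

```python
from bisect import bisect_right

def compress_column(values):
    """Keep (tuple_id, value) only where the value differs from the previous row.

    Assumption: "non-constant data" is read here as a change point in the column.
    """
    entries = []
    for tuple_id, value in enumerate(values):
        if not entries or entries[-1][1] != value:
            entries.append((tuple_id, value))
    return entries

def value_at(entries, tuple_id):
    """Recover the value of one tuple by binary search over stored positions."""
    positions = [pos for pos, _ in entries]
    idx = bisect_right(positions, tuple_id) - 1
    return entries[idx][1]

def select_ranges(entries, total_rows, predicate):
    """Selection on the compressed column, without expanding it.

    Returns half-open TupleID ranges [start, end) whose value satisfies
    the predicate; each stored entry is tested once.
    """
    ranges = []
    for i, (start, value) in enumerate(entries):
        end = entries[i + 1][0] if i + 1 < len(entries) else total_rows
        if predicate(value):
            ranges.append((start, end))
    return ranges

# Hypothetical high frequency column with long constant runs.
prices = [10, 10, 10, 12, 12, 10, 10, 15]
compressed = compress_column(prices)
matching = select_ranges(compressed, len(prices), lambda v: v == 10)
```

In this sketch the eight-row column shrinks to four stored entries, and the selection touches only those entries. Because the result is expressed as TupleID ranges, it could be intersected with ranges from other columns to connect attributes, which is one plausible reading of how TupleID links the attribute partitions.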