Research on Processing Methods of Data Stream Based on Parallel Computing
|School||Dalian University of Technology|
|Course||Applied Computer Technology|
|Keywords||Data Stream; Parallel Computing; GPU; Trend Prediction; Frequent Itemsets; Data Stream Correlation|
Mining large volumes of data at high speed is attracting significant attention worldwide, and continuous data stream mining demands high-performance methods. Unlike static data, a data stream is acquired sequentially and must be accessed continuously in real time. Limited computational and storage resources make it challenging to keep trend and correlation analysis on data streams both accurate and online, and processing delay has become a severe bottleneck for data stream mining. This thesis focuses on parallel computing models and algorithms for trend and correlation analysis on data streams. These models and algorithms run efficiently on both high-performance CPUs (Central Processing Units) and GPUs (Graphics Processing Units). The main research contents are summarized as follows:

Firstly, we present a new online analysis method derived from the classical Hilbert-Huang Transform (HHT) to process nonlinear and non-stationary time-series data streams. The method incorporates radial basis function (RBF) neural networks for online prediction on the streams. We design a chain-style, rewritable sliding window for reading and writing the time-series data stream. The window divides the data into several segments, which are processed in parallel by multiple CPU threads for prediction and then joined into a final stream. The online HHT method not only provides adaptive time-frequency analysis but also accelerates computation. The partitioned results it produces also reduce the input complexity of the RBF neural networks.
Compared with existing methods, the proposed method can handle online short-term trend prediction for time-series data streams.

Secondly, we propose a new genetic algorithm with nested sliding windows (NSWGA) to replace the complicated pattern trees widely used in frequent itemset mining of data streams. This improved genetic algorithm uses nested sliding windows to segment data streams and leverages MPI parallel processing to efficiently discover all frequent patterns in the nested windows. It maintains frequent itemsets incrementally by adding new data and removing expired data, and it enables efficient processing within limited buffer space.

Thirdly, we build a generic GPU-based processing framework for data streams to tackle processing delay and efficiency issues. This framework adapts to the characteristics of data streams and meets high-performance requirements. We construct a parallel computing architecture for stream blocks with two granularity levels (big and small), using the SIMT mode of the GPU and the basic-window model of the sliding window. The big-granularity parallelism handles parallel control of the divided tasks, while the small-granularity parallelism is organized by the computing thread grid and extracts synopsis data for the various parallel mining algorithms. Both aim at efficient data exchange and high-performance parallel algorithms. Furthermore, we give a new parallel quantile computing method, GSQ (GPU Stream Quantiles), within this generic framework. It calls a GPU kernel to generate histograms of synopsis data using hash functions and then answers quantile queries on the data stream. Experimental results show significant improvements in processing bandwidth, response time, and speedup.

Fourthly, we address the constraints on memory resources and execution sequences in correlation analysis of multiple data streams on the CPU.
We propose a four-layer sliding window framework for multiple data streams, which crosses the bus and coordinates the CPU and GPU. Once the multiple data streams are completely mapped into GPU memory and an SID index is created for each, the basic-window offsets can be computed in parallel. We then construct the parallel correlation algorithm GSSCCA (GPU Single-Dimensional Stream Canonical Correlation Analysis) using s→Thread and s→Block multi-level parallel computing. Experimental results show that the algorithm achieves high accuracy and fast computation.

Fifthly, high-dimensional data streams impose more complex constraints on resources and execution sequences than single-dimensional data streams, affecting both calculation accuracy and performance. To address this issue, we present GMSCCA (GPU Multi-Dimensional Stream Canonical Correlation Analysis), a correlation analysis method for high-dimensional data streams, based on a study of the related mathematical model. Using a data cube pattern and dimensionality reduction, the method completes the calculation quickly and accurately under limited computing resources and high efficiency requirements. It also offers a balanced trade-off between performance and approximation accuracy.
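
As an illustration of the basic-window idea behind the single-dimensional correlation analysis summarized above, the following is a minimal CPU-only sketch in Python. It is not the thesis's GSSCCA algorithm. It assumes that each basic window is summarized by a few running sums, and it relies on the fact that for two single-dimensional streams the canonical correlation equals the absolute value of the Pearson correlation. The class name `BasicWindowCorrelation` and the parameters `basic_size` and `num_basic` are hypothetical names chosen for this example.

```python
import math
from collections import deque


class BasicWindowCorrelation:
    """Sliding-window correlation of two 1-D streams from basic-window sums.

    Hypothetical illustration: the sliding window consists of `num_basic`
    basic windows of `basic_size` points each. Only six sums are stored per
    basic window, so expiring old data means dropping one small summary.
    """

    def __init__(self, basic_size, num_basic):
        self.basic_size = basic_size
        # deque(maxlen=...) discards the oldest basic window automatically.
        self.windows = deque(maxlen=num_basic)
        self._buf = []  # points of the basic window currently being filled

    def push(self, x, y):
        """Add one synchronized pair of values from the two streams."""
        self._buf.append((x, y))
        if len(self._buf) == self.basic_size:
            xs, ys = zip(*self._buf)
            self.windows.append((
                len(xs),                            # n
                sum(xs),                            # sum of x
                sum(ys),                            # sum of y
                sum(a * a for a in xs),             # sum of x^2
                sum(b * b for b in ys),             # sum of y^2
                sum(a * b for a, b in self._buf),   # sum of x*y
            ))
            self._buf = []

    def correlation(self):
        """Return |Pearson r| over all completed basic windows, or None."""
        if not self.windows:
            return None
        # Add up each statistic across the basic windows still in range.
        n, sx, sy, sxx, syy, sxy = (sum(col) for col in zip(*self.windows))
        cov = n * sxy - sx * sy
        var_x = n * sxx - sx * sx
        var_y = n * syy - sy * sy
        if var_x <= 0 or var_y <= 0:
            return None  # a constant stream has no defined correlation
        return abs(cov) / math.sqrt(var_x * var_y)
```

Points in a partially filled basic window are not reflected in the result until that window completes. Because the per-window summaries are independent of one another, they are the kind of work that could be assigned to separate threads or blocks; the sum-of-squares formula used here is simple but can lose precision on long windows with large values.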