Design and Implementation of ETL Oriented on Big Data
|School||Huazhong University of Science and Technology|
|Course||Computer Software and Theory|
|Keywords||data processing; extraction; transformation; loading; big data; workflow|
With the development of information technology, more and more data are generated. These data contain large amounts of structured data as well as large amounts of unstructured and semi-structured data. Data volumes are larger, data growth is faster, data formats are more complex, and the demand for data processing is more urgent; all of this brings new challenges to ETL. Designing an ETL system that can handle big data effectively therefore has important practical significance.

First, based on a requirements analysis of the characteristics of big data, we put forward the functional objectives and performance goals of the system. According to the requirements of big data processing, we design an ETL architecture that effectively supports big data processing, and we design the corresponding ETL workflow. To optimize the ETL workflow and improve data processing efficiency, we design ETL rules for classification and merging that are suitable for a big data environment. At the same time, according to the characteristics of MapReduce, we give the design of the MapReduce workflow and the mapping rules between the MapReduce workflow and the ETL workflow.

Next, the implementation of the system is introduced. The universal data access module implements data extraction and loading, especially the extraction of unstructured data. The workflow module parses metadata to generate the local workflow and MapReduce workflow models. The execution module carries out the procedure from data extraction to data loading. The metadata management module realizes the storage of metadata.

Finally, experiments show that the system realizes big data processing and meets the design goals, and that the use of MapReduce can improve ETL data processing efficiency to a certain degree.
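To make the mapping between an ETL workflow and a MapReduce workflow concrete, the following is a minimal single-process sketch (not the thesis system itself): a transform step is expressed as a map function emitting key-value pairs, and an aggregation step as a reduce function merging the values for each key. All names and the sample records are illustrative assumptions.

```python
from collections import defaultdict

def map_step(record):
    # An ETL transform step expressed as a map: normalize a raw
    # record and emit a (key, value) pair. Fields are hypothetical.
    name, amount = record
    yield (name.strip().lower(), float(amount))

def reduce_step(key, values):
    # An ETL aggregation step expressed as a reduce: merge all
    # values collected for one key.
    return (key, sum(values))

def run_mapreduce(records):
    # Single-process simulation of the map -> shuffle -> reduce flow
    # that a MapReduce workflow would run in parallel on a cluster.
    groups = defaultdict(list)
    for record in records:                    # map phase
        for key, value in map_step(record):
            groups[key].append(value)         # shuffle: group by key
    return dict(reduce_step(k, vs) for k, vs in sorted(groups.items()))

raw = [(" Alice ", "3"), ("alice", "2"), ("Bob", "5")]
print(run_mapreduce(raw))  # {'alice': 5.0, 'bob': 5.0}
```

In this framing, each ETL node that operates on records independently maps to a map task, while nodes that merge records by key map to reduce tasks, which is the essence of the workflow mapping rules described above.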
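A universal data access layer of the kind described above can be pictured as a dispatcher that routes each source format to a parser and returns records in one uniform shape. The sketch below is an assumption-laden illustration, not the thesis implementation; the format names and the line-wrapping treatment of unstructured text are invented for the example.

```python
import csv
import io
import json

def extract_records(payload, fmt):
    # Hypothetical "universal data access" dispatcher: structured,
    # semi-structured, and unstructured inputs all come back as
    # lists of dicts so downstream ETL steps see one record shape.
    if fmt == "json":                      # semi-structured
        return json.loads(payload)
    if fmt == "csv":                       # structured, with header row
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "text":                      # unstructured: wrap raw lines
        return [{"line_no": i, "text": line}
                for i, line in enumerate(payload.splitlines(), 1)]
    raise ValueError(f"unsupported format: {fmt}")

print(extract_records("a,b\n1,2", "csv"))  # [{'a': '1', 'b': '2'}]
```

The point of the uniform record shape is that the workflow and execution modules never need format-specific logic; only the extraction edge of the pipeline does.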