Spatially clustered join on heterogeneous scientific data sets

B Dong, S Byna, K Wu - … Conference on Big Data (Big Data), 2015 - ieeexplore.ieee.org
2015 IEEE International Conference on Big Data (Big Data), 2015
In the era of data-intensive scientific discovery, data analysis is critical for scientists to identify essential information from the mountains of data generated by large-scale simulations or experiments. A generic operation in scientific data analysis is to combine information from multiple data sets, which are stored in heterogeneous file formats. This operation is typically known as a join in the database management field. Currently, a join operation involving multiple data sets in different file formats is time-consuming because of the need to prepare data (i.e., to convert data into a uniform format or to ingest it into a database) and to run the join algorithms. Furthermore, data processing languages, such as SQL (Structured Query Language), cannot easily express typical scientific analysis tasks such as interpolation. In this paper, we propose three techniques to address these challenges: a two-level data model to process data from different file formats without converting to a uniform format, a data organization structure known as Multi-Dimensional Binning (MDBin), and a join processing algorithm known as Spatially Clustered Join (SCJoin). Together, these techniques allow scientific data files to be used for query processing with less I/O cost and faster query response time, without the extra cost of file format conversion and data ingestion. Evaluation of our proposed techniques in joining and interpolating data sets generated by a plasma physics simulation studying space weather phenomena showed up to 8X improvement over FastQuery. Querying with our solution outperforms SciDB, a popular array data management system for scientific data, by 43X-143X. We also demonstrate that our methods scale to 64K CPU cores when analyzing 32 TB of data on a large-scale supercomputing system.
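The abstract does not give the details of MDBin or SCJoin, so the following is only a generic illustration of the underlying idea: grouping records into spatial bins so that a join touches only nearby data. It is a minimal Python sketch, not the paper's algorithm. The function names (`bin_points`, `binned_join`), the 2D uniform grid, and the use of a nearest-sample lookup as a stand-in for interpolation are all assumptions made for this example.

```python
from collections import defaultdict
from math import floor


def bin_points(points, cell):
    """Group (x, y, value) records by the integer grid cell that contains them.

    Hypothetical helper: a plain uniform grid, not the paper's MDBin layout.
    """
    bins = defaultdict(list)
    for x, y, value in points:
        bins[(floor(x / cell), floor(y / cell))].append((x, y, value))
    return bins


def binned_join(particles, field, cell):
    """Pair each particle with the closest field sample found in the
    particle's own cell or one of the 8 adjacent cells.

    Samples farther away than the adjacent cells are never examined, so a
    particle with no nearby sample is paired with None.
    """
    field_bins = bin_points(field, cell)
    result = []
    for px, py, pid in particles:
        cx, cy = floor(px / cell), floor(py / cell)
        best = None  # (squared distance, field value)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for fx, fy, fval in field_bins.get((cx + dx, cy + dy), ()):
                    d2 = (fx - px) ** 2 + (fy - py) ** 2
                    if best is None or d2 < best[0]:
                        best = (d2, fval)
        result.append((pid, best[1] if best else None))
    return result


# Made-up sample data: three field samples and three particles.
field = [(0.5, 0.5, 10.0), (1.5, 0.5, 20.0), (5.5, 5.5, 99.0)]
particles = [(0.6, 0.4, "a"), (1.4, 0.6, "b"), (9.0, 9.0, "c")]
joined = binned_join(particles, field, cell=1.0)
print(joined)  # [('a', 10.0), ('b', 20.0), ('c', None)]
```

The point of the sketch is the access pattern: each particle inspects at most nine bins instead of the whole field data set, which is the kind of I/O reduction that spatial clustering aims for. A real implementation would work on file-resident arrays and would replace the nearest-sample lookup with a proper interpolation scheme.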