DOI: 10.1145/3078597.3078599

ArrayUDF: User-Defined Scientific Data Analysis on Arrays

Published: 26 June 2017

Abstract

User-Defined Functions (UDFs) allow application programmers to specify analysis operations on data while leaving the data management tasks to the system. This general approach enables numerous custom analysis functions and is at the heart of modern Big Data systems. Even though the UDF mechanism can theoretically support arbitrary operations, a wide variety of common operations, such as computing the moving average of a time series or the vorticity of a fluid flow, are hard to express and slow to execute. Since these operations are traditionally performed on multi-dimensional arrays, we propose to extend the expressiveness of structural locality for supporting UDF operations on arrays. We further propose an in situ UDF mechanism, called ArrayUDF, to implement the structural locality. ArrayUDF allows users to define computations on adjacent array cells without the use of join operations, and it executes the UDF directly on arrays stored in data files without first loading their content into a data management system. Additionally, we present a thorough theoretical analysis of the data access cost to exploit the structural locality, which enables ArrayUDF to automatically select the best array partitioning strategy for a given UDF operation. In a series of performance evaluations on large scientific datasets, we have observed that ArrayUDF, using the generic UDF interface, consistently outperforms Spark, SciDB, and RasDaMan.
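
To make the idea of a UDF defined on adjacent array cells concrete, the following is a minimal Python sketch of the moving-average example mentioned in the abstract. It is only an illustration under stated assumptions and is not ArrayUDF's actual interface: the names `Stencil`, `apply_udf`, and `moving_average` are hypothetical, the array is a one-dimensional in-memory list rather than a multi-dimensional array stored in a data file, and clamping at the array edges is an assumed boundary policy.

```python
class Stencil:
    """Hypothetical view of one cell and its neighbors, addressed by relative offset."""

    def __init__(self, cells, index):
        self._cells = cells
        self._index = index

    def __call__(self, offset):
        # Assumed boundary policy: clamp out-of-range neighbors to the nearest edge cell.
        pos = min(max(self._index + offset, 0), len(self._cells) - 1)
        return self._cells[pos]


def apply_udf(cells, udf):
    """Run the user-defined function once per cell and collect the results."""
    return [udf(Stencil(cells, i)) for i in range(len(cells))]


def moving_average(s):
    """User-defined function: three-point moving average around the current cell."""
    return (s(-1) + s(0) + s(1)) / 3.0


series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = apply_udf(series, moving_average)
```

The user function refers to neighbors only by relative offset (`s(-1)`, `s(1)`), so no join is needed to bring adjacent cells together; how the system partitions the array and supplies boundary cells is left to the runtime.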

    Published In

    HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
    June 2017
    254 pages
    ISBN:9781450346993
    DOI:10.1145/3078597
    © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. array structural locality
    2. arrayudf
    3. mapreduce
    4. scidb
    5. spark
    6. user-defined data analysis

    Qualifiers

    • Research-article

    Conference

    HPDC '17
    Acceptance Rates

    HPDC '17 Paper Acceptance Rate 19 of 100 submissions, 19%;
    Overall Acceptance Rate 166 of 966 submissions, 17%

    Article Metrics

    • Downloads (last 12 months): 80
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 04 Sep 2024.

    Cited By

    • (2024) Datacubes as enabler for advanced decision support systems in land management. Land Degradation & Development 35:11 (3579-3592). DOI: 10.1002/ldr.5153. Online publication date: 26-May-2024
    • (2023) ADT-FSE: A New Encoder for SZ. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). DOI: 10.1145/3581784.3607044. Online publication date: 12-Nov-2023
    • (2022) A Serverless-Based, On-the-Fly Computing Framework for Remote Sensing Image Collection. Remote Sensing 14:7 (1728). DOI: 10.3390/rs14071728. Online publication date: 3-Apr-2022
    • (2022) Babelfish. Proceedings of the VLDB Endowment 15:2 (196-210). DOI: 10.14778/3489496.3489501. Online publication date: 4-Feb-2022
    • (2022) Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL. Proceedings of the 34th International Conference on Scientific and Statistical Database Management (1-12). DOI: 10.1145/3538712.3538715. Online publication date: 6-Jul-2022
    • (2022) Shadow: Exploiting the Power of Choice for Efficient Shuffling in MapReduce. IEEE Transactions on Big Data 8:1 (253-267). DOI: 10.1109/TBDATA.2019.2943473. Online publication date: 1-Feb-2022
    • (2022) Recursive SQL and GPU-support for in-database machine learning. Distributed and Parallel Databases 40:2-3 (205-259). DOI: 10.1007/s10619-022-07417-7. Online publication date: 9-Jul-2022
    • (2021) Array DBMS. Proceedings of the VLDB Endowment 14:12 (3186-3189). DOI: 10.14778/3476311.3476404. Online publication date: 28-Oct-2021
    • (2021) X-composer. Proceedings of the Platform for Advanced Scientific Computing Conference (1-10). DOI: 10.1145/3468267.3470621. Online publication date: 5-Jul-2021
    • (2021) FasTensor User Interface. User-Defined Tensor Data Analysis (23-71). DOI: 10.1007/978-3-030-70750-7_3. Online publication date: 22-Feb-2021
