Abstract
Today’s ETL tools let developers write custom code as user-defined functions (UDFs) to extend the expressiveness of standard ETL operators. However, the custom code of a UDF may execute inefficiently because of a poor implementation (e.g., one that does not use parallel processing or adequate data structures). In this paper we address the problem of optimizing UDFs in data-intensive workflows and present our approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.
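To make the idea of a cost model that picks a degree of parallelism concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes a hypothetical runtime estimate made of a serial part, a parallelizable part divided across workers, and a per-worker coordination overhead, plus a hypothetical monetary budget as the user-defined constraint. All names (`UdfProfile`, `estimated_time`, `choose_parallelism`) and all parameters are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UdfProfile:
    """Hypothetical profile of one parallelizable UDF (all values assumed)."""
    serial_s: float    # time that cannot be parallelized, in seconds
    parallel_s: float  # parallelizable work on a single worker, in seconds
    overhead_s: float  # coordination overhead added per worker, in seconds


def estimated_time(profile: UdfProfile, workers: int) -> float:
    """Assumed runtime estimate: serial part + split work + overhead."""
    return (profile.serial_s
            + profile.parallel_s / workers
            + profile.overhead_s * workers)


def choose_parallelism(
    profile: UdfProfile,
    max_workers: int,
    price_per_worker_s: float,
    budget: float,
) -> Optional[Tuple[int, float, float]]:
    """Return (workers, time, cost) with the lowest estimated time whose
    monetary cost stays within the budget, or None if nothing fits."""
    best = None
    for workers in range(1, max_workers + 1):
        time_s = estimated_time(profile, workers)
        cost = time_s * workers * price_per_worker_s
        if cost > budget:
            continue  # violates the assumed budget constraint
        if best is None or time_s < best[1]:
            best = (workers, time_s, cost)
    return best


if __name__ == "__main__":
    profile = UdfProfile(serial_s=10.0, parallel_s=100.0, overhead_s=1.0)
    print(choose_parallelism(profile, max_workers=16,
                             price_per_worker_s=1.0, budget=200.0))
```

The exhaustive loop is adequate for a single UDF with a small worker limit. Choosing degrees of parallelism jointly for several UDFs under one shared budget would need a combinatorial formulation, which this sketch does not attempt.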
Acknowledgements
The work of Fawad Ali is partially supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).
The work of Robert Wrembel is partially supported by: (1) the grant No. 2015/19/B/ST6/02637 of the National Science Center and (2) the grant of the Polish National Agency for Academic Exchange, within the Bekker programme.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Ali, S.M.F., Wrembel, R. (2019). Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science, vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_27
DOI: https://doi.org/10.1007/978-3-030-28730-6_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28729-0
Online ISBN: 978-3-030-28730-6
eBook Packages: Computer Science, Computer Science (R0)