Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics

Ali, Syed Muhammad Fawad; Wrembel, Robert

doi:10.1007/978-3-030-28730-6_27

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11695))

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

996 Accesses
8 Citations

Abstract

Today’s ETL tools provide capabilities for developing custom code as user-defined functions (UDFs) to extend the expressiveness of standard ETL operators. However, a custom code of an UDF may execute inefficiently due to its poor implementation (e.g., due to the lack of using parallel processing or adequate data structures). In this paper we address the problem of the optimization of UDFs in data-intensive workflows and presented our approach to construct a cost model to determine the degree of parallelism for parallelizable UDFs.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

CHF34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: CHF 24.95; Price includes VAT (Switzerland)

eBook: CHF 70.00; Price excludes VAT (Switzerland)

Softcover Book: CHF 87.50; Price excludes VAT (Switzerland)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Framework to Optimize Data Processing Pipelines Using Performance Metrics

Optimization of data flow execution in a parallel environment

Article 22 August 2018

Budget Constrained Scheduling Strategies for On-line Workflow Applications

Notes

References

Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB J. 24(4), 557–581 (2015)
Article Google Scholar
Ali, S.M.F.: Next-generation ETL framework to address the challenges posed by Big Data. In: International Workshop Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP) (2018)
Google Scholar
Ali, S.M.F., Mey, J., Thiele, M.: Parallelizing user-defined functions in the ETL workflow using orchestration style sheets. Int. J. Appl. Math. Comput. Sci. (AMCS) 29, 69–79 (2019)
Article Google Scholar
Ali, S.M.F., Wrembel, R.: From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J. 26, 1–25 (2017)
Article Google Scholar
Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: ACM Symposium on Cloud Computing, pp. 119–130 (2010)
Google Scholar
Borthakur, D.: The Hadoop distributed file system: Architecture and design. Hadoop Project Website, vol. 11, p. 21 (2007)
Google Scholar
Caruccio, L., Deufemia, V., Polese, G.: Visual data integration based on description logic reasoning. In: International Database Engineering Applications Symposium, pp. 19–28 (2014)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Article Google Scholar
Evans, J.P., Steuer, R.E.: A revised simplex method for linear multiple objective programs. Math. Program. 5(1), 54–72 (1973)
Article MathSciNet Google Scholar
Friedman, E., Pawlowski, P., Cieslewicz, J.: SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. VLDB Endowment 2(2), 1402–1413 (2009)
Article Google Scholar
Gartner: Magic Quadrant for Data Integration Tools. https://www.gartner.com/doc/3883264/magic-quadrant-data-integration-tools. Accessed 18 Mar 2019
Große, P., May, N., Lehner, W.: A study of partitioning and parallel UDF execution with the SAP HANA database. In: International Conference on Scientific and Statistical Database Management, p. 36. ACM (2014)
Google Scholar
Halasipuram, R., Deshpande, P.M., Padmanabhan, S.: Determining essential statistics for cost based optimization of an ETL workflow. In: International Conference on Extending Database Technology (EDBT), pp. 307–318 (2014)
Google Scholar
Herodotou, H., et al.: Starfish: a self-tuning system for big data analytics. In: Conference on Innovative Data Systems Research (CIDR), vol. 11, pp. 261–272 (2011)
Google Scholar
Hueske, F., et al.: Peeking into the optimization of data flow programs with MapReduce-style UDFs. In: International Conference on Data Engineering (ICDE), pp. 1292–1295 (2013)
Google Scholar
Hueske, F., et al.: Opening the black boxes in data flow optimization. VLDB Endowment 5(11), 1256–1267 (2012)
Article Google Scholar
Ibaraki, T., Hasegawa, T., Teranaka, K., Iwase, J.: The multiple choice knapsack problem. J. Oper. Res. Soc. Japan 21(1), 59–93 (1978)
MathSciNet MATH Google Scholar
IBM: IBM InfoSphere DataStage Balanced Optimization. IBM Whitepaper. Accessed 18 Mar 2019
Google Scholar
Informatica: How to Achieve Flexible, Cost-effective Scalability and Performance through Pushdown Processing. https://www.informatica.com/downloads/pushdown_wp_6650_web.pdf. Accessed 18 Mar 2019
Jovanovic, P., Romero, O., Simitsis, A., Abelló, A.: Incremental consolidation of data-intensive multi-flows. IEEE Trans. Knowl. Data Eng. 28(5), 1203–1216 (2016)
Article Google Scholar
Karagiannis, A., Vassiliadis, P., Simitsis, A.: Scheduling strategies for efficient ETL execution. Inf. Syst. 38(6), 927–945 (2013)
Article Google Scholar
Kumar, N., Kumar, P.S.: An efficient heuristic for logical optimization of ETL workflows. In: VLDB Workshop on Enabling Real-Time Business Intelligence, pp. 68–83 (2010)
Google Scholar
Lawler, E.L., Wood, D.E.: Branch-and-bound methods: a survey. Oper. Res. 14(4), 699–719 (1966)
Article MathSciNet Google Scholar
Lella, R.: Optimizing BDFS jobs using InfoSphere DataStage Balanced Optimization. https://www.ibm.com/developerworks/data/library/techarticle/dm-1402optimizebdfs/index.html. Accessed 18 Mar 2019
Liu, X., Iftikhar, N.: An ETL optimization framework using partitioning and parallelization. In: ACM Symposium on Applied Computing, pp. 1015–1022 (2015)
Google Scholar
Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: an extensible logical optimizer for UDF-heavy data flows. Inf. Syst. 52, 96–125 (2015)
Article Google Scholar
Russom, P.: Data lakes: purposes, practices, patterns, and platforms. TDWI white paper (2017)
Google Scholar
Simitsis, A., Vassiliadis, P., Sellis, T.K.: State-space optimization of ETL workflows. IEEE Trans. Knowl. Data Eng. 17(10), 1404–1419 (2005)
Article Google Scholar
Skoutas, D., Simitsis, A., Sellis, T.: Ontology-driven conceptual design of ETL processes using graph transformations. J. Data Semant. 13, 120–146 (2009)
Article Google Scholar
Terrizzano, I., Schwarz, P., Roth, M., Colino, J.E.: Data wrangling: the challenging journey from the wild to the lake. In: Conference on Innovative Data Systems Research (CIDR) (2015)
Google Scholar
Vaandrager, F.: Model learning. Commun. ACM 60(2), 86–95 (2017)
Article Google Scholar
Vaisman, A.A., Zimányi, E.: Data Warehouse Systems - Design and Implementation. Data-Centric Systems and Applications. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54655-6
Book Google Scholar
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: ACM SIGMOD International Conference on Management of Data (2010)
Google Scholar
Witt, C., Bux, M., Gusew, W., Leser, U.: Predictive performance modeling for distributed batch processing using black box monitoring and machine learning. Inf. Syst. 82, 34–52 (2019)
Article Google Scholar
Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
Article Google Scholar

Download references

Acknowledgements

The work of Fawad Ali is partially supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

The work of Robert Wrembel is partially supported by: (1) the grant No. 2015/19/B/ST6/02637 of the National Science Center and (2) the grant of the Polish National Agency for Academic Exchange, within the Bekker programme.

Author information

Authors and Affiliations

Poznan University of Technology, Poznan, Poland
Syed Muhammad Fawad Ali & Robert Wrembel

Authors

Syed Muhammad Fawad Ali
View author publications
You can also search for this author in PubMed Google Scholar
Robert Wrembel
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Syed Muhammad Fawad Ali .

Editor information

Editors and Affiliations

University of Maribor, Maribor, Slovenia
Tatjana Welzer
Alpen-Adria Universität Klagenfurt, Klagenfurt, Austria
Johann Eder
University of Maribor, Maribor, Slovenia
Vili Podgorelec
University of Maribor, Maribor, Slovenia
Aida Kamišalić Latifić

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ali, S.M.F., Wrembel, R. (2019). Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science(), vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_27

Download citation

DOI: https://doi.org/10.1007/978-3-030-28730-6_27
Published: 13 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28729-0
Online ISBN: 978-3-030-28730-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Framework to Optimize Data Processing Pipelines Using Performance Metrics

Optimization of data flow execution in a parallel environment

Budget Constrained Scheduling Strategies for On-line Workflow Applications

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Framework to Optimize Data Processing Pipelines Using Performance Metrics

Optimization of data flow execution in a parallel environment

Budget Constrained Scheduling Strategies for On-line Workflow Applications

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation