Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2019)

Abstract

Today’s ETL tools provide capabilities for developing custom code as user-defined functions (UDFs) to extend the expressiveness of standard ETL operators. However, the custom code of a UDF may execute inefficiently because of a poor implementation (e.g., one that does not use parallel processing or adequate data structures). In this paper, we address the problem of optimizing UDFs in data-intensive workflows and present our approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.

Notes

  1. https://github.com/pentaho/pentaho-kettle.

  2. https://aws.amazon.com/emr/.

  3. https://aws.amazon.com/ec2/.

  4. https://calculator.s3.amazonaws.com/index.html.

  5. http://lpsolve.sourceforge.net/.

  6. https://github.com/fawadali/MCKPCostModel.

Acknowledgements

The work of Fawad Ali is partially supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

The work of Robert Wrembel is partially supported by: (1) the grant No. 2015/19/B/ST6/02637 of the National Science Center and (2) the grant of the Polish National Agency for Academic Exchange, within the Bekker programme.

Corresponding author

Correspondence to Syed Muhammad Fawad Ali.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ali, S.M.F., Wrembel, R. (2019). Towards a Cost Model to Optimize User-Defined Functions in an ETL Workflow Based on User-Defined Performance Metrics. In: Welzer, T., Eder, J., Podgorelec, V., Kamišalić Latifić, A. (eds) Advances in Databases and Information Systems. ADBIS 2019. Lecture Notes in Computer Science, vol 11695. Springer, Cham. https://doi.org/10.1007/978-3-030-28730-6_27

  • DOI: https://doi.org/10.1007/978-3-030-28730-6_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28729-0

  • Online ISBN: 978-3-030-28730-6

  • eBook Packages: Computer Science (R0)
