Research on the Optimization of Spark Big Table Equal Join

  • Conference paper
Artificial Intelligence and Security (ICAIS 2019)

Part of the book series: Lecture Notes in Computer Science ((LNSC,volume 11633))

Abstract

The equal join of big tables is one of the key operations Spark performs when processing large-scale data. When Spark joins large tables, however, both the network transmission overhead and the I/O cost are high, so this paper proposes an optimized join method for large tables in Spark. First, the method introduces a Split Compressed Bloom Filter algorithm, which is suited to filtering data sets whose volume is not known in advance. Then, a Maxdiff histogram is used to analyze the data distribution of the tables being joined and to identify the skewed data in the data set. The RDD is split according to these statistics, each part is joined with a suitable join algorithm, and the sub-results are combined to obtain the final result. Experiments show that the proposed method has clear advantages over Spark's original method in shuffle write, shuffle read, and task running time.
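
The filter, split, join, and combine steps described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation and rests on several stated assumptions: a plain Bloom filter stands in for the Split Compressed Bloom Filter, a simple frequency threshold stands in for the Maxdiff histogram, lists of (key, value) pairs stand in for RDDs, and every name (`BloomFilter`, `hash_join`, `optimized_join`, `skew_threshold`) is hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


class BloomFilter:
    """Plain Bloom filter (assumption: not the paper's Split Compressed variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def hash_join(left, right):
    """Equi-join two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def optimized_join(left, right, skew_threshold=3):
    # Step 1: build a filter over the right-hand keys and drop left rows
    # that cannot match, so fewer rows would need to be shuffled.
    bloom = BloomFilter()
    for key, _ in right:
        bloom.add(key)
    left = [(k, v) for k, v in left if bloom.might_contain(k)]

    # Step 2: find skewed keys (assumption: a frequency threshold
    # replaces the Maxdiff histogram analysis).
    counts = Counter(k for k, _ in left)
    skewed = {k for k, c in counts.items() if c >= skew_threshold}

    # Step 3: split both inputs into a skewed part and the remainder.
    left_skew = [r for r in left if r[0] in skewed]
    left_rest = [r for r in left if r[0] not in skewed]
    right_skew = [r for r in right if r[0] in skewed]
    right_rest = [r for r in right if r[0] not in skewed]

    # Step 4: join each part and combine the sub-results. A distributed
    # engine could pick a different join strategy per part; this sketch
    # uses the same hash join for both.
    return hash_join(left_skew, right_skew) + hash_join(left_rest, right_rest)
```

Because a Bloom filter never produces false negatives, the pre-filter can only remove rows that have no join partner, so the combined output matches an unoptimized join of the same inputs up to row order.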

Acknowledgements

This paper is partially supported by the Education Technology Foundation of the Ministry of Education (No. 2017A01020), the Major Project of the Hebei Province Education Department (No. 2017GJJG083), and the Graduate Innovation Program of Hebei University of Economics and Business in 2018.

Author information

Corresponding author

Correspondence to Suzhen Wang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, S., Zhang, L., Zhang, Y. (2019). Research on the Optimization of Spark Big Table Equal Join. In: Sun, X., Pan, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2019. Lecture Notes in Computer Science, vol 11633. Springer, Cham. https://doi.org/10.1007/978-3-030-24265-7_37

  • DOI: https://doi.org/10.1007/978-3-030-24265-7_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-24264-0

  • Online ISBN: 978-3-030-24265-7

  • eBook Packages: Computer Science, Computer Science (R0)
