Abstract
The equi-join of large tables is one of the key operations in Spark for processing large-scale data. However, when Spark performs an equi-join on large tables, both the network transmission overhead and the I/O cost are high. This paper therefore proposes an optimized large table join method for Spark. First, the method introduces a Split Compressed Bloom Filter algorithm that is suitable for filtering data sets whose volume is not known in advance. Next, a Maxdiff histogram is used to analyze the data distribution of the tables being joined and to identify the skewed data in the data set. The RDD is then split according to these statistics, each part is joined with a suitable join algorithm, and the sub-results are combined to obtain the final result. Experiments show that the proposed method has clear advantages over the original Spark method in shuffle write, shuffle read, and task running time.
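The pipeline described in the abstract (filter, detect skew, split, join, combine) can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation and makes several assumptions: a plain fixed-size Bloom filter stands in for the Split Compressed Bloom Filter, the skew detector is a simplified Maxdiff-style cut at the largest gap between adjacent sorted key frequencies, and Python lists stand in for RDDs. All names (`BloomFilter`, `skewed_keys`, `hash_join`, `optimized_join`) are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


class BloomFilter:
    """Plain fixed-size Bloom filter (assumption: NOT the paper's
    Split Compressed variant, which handles unknown data volume)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False positives are possible, false negatives are not.
        return all(self.bits[pos] for pos in self._positions(key))


def skewed_keys(rows):
    """Simplified Maxdiff-style cut: sort keys by frequency and treat
    every key above the largest adjacent frequency gap as skewed."""
    freq = sorted(Counter(k for k, _ in rows).items(), key=lambda kv: kv[1])
    if len(freq) < 2:
        return set()
    gaps = [freq[i + 1][1] - freq[i][1] for i in range(len(freq) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[cut] == 0:  # uniform frequencies: nothing is skewed
        return set()
    return {k for k, _ in freq[cut + 1:]}


def hash_join(left, right):
    """Equi-join of (key, value) rows: build an index on right, probe with left."""
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def optimized_join(big, small):
    # Step 1: drop rows of the big table whose key cannot match.
    bf = BloomFilter()
    for k, _ in small:
        bf.add(k)
    candidates = [row for row in big if bf.might_contain(row[0])]
    # Step 2: split the surviving rows into skewed and non-skewed parts.
    hot = skewed_keys(candidates)
    skew_part = [r for r in candidates if r[0] in hot]
    rest_part = [r for r in candidates if r[0] not in hot]
    # Step 3: join each part and combine the sub-results. Both parts use the
    # same hash join here; a distributed version would pick a different
    # strategy per part.
    return hash_join(skew_part, small) + hash_join(rest_part, small)
```

Because a Bloom filter never produces false negatives, the filtered join returns the same rows as an unfiltered join; false positives only cost extra work in the join step.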
Acknowledgements
This work is partially supported by the Education technology Foundation of the Ministry of Education (No. 2017A01020), the Major Project of the Hebei Province Education Department (No. 2017GJJG083), and the 2018 Graduate Innovation Program of Hebei University of Economics and Business.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, S., Zhang, L., Zhang, Y. (2019). Research on the Optimization of Spark Big Table Equal Join. In: Sun, X., Pan, Z., Bertino, E. (eds) Artificial Intelligence and Security. ICAIS 2019. Lecture Notes in Computer Science, vol 11633. Springer, Cham. https://doi.org/10.1007/978-3-030-24265-7_37
DOI: https://doi.org/10.1007/978-3-030-24265-7_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24264-0
Online ISBN: 978-3-030-24265-7
eBook Packages: Computer Science, Computer Science (R0)