{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,8]],"date-time":"2024-08-08T00:14:28Z","timestamp":1723076068816},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,5,29]]},"abstract":"In recent years, there has been a growing recognition that high-quality training data is crucial for the performance of machine learning models. This awareness has catalyzed both research endeavors and industrial initiatives dedicated to data acquisition to enhance diverse dimensions of model performance. Among these dimensions, model confidence holds paramount importance; however, it has often been overlooked in prior investigations into data acquisition methodologies. To address this gap, our work focuses on improving the data acquisition process with the goal of enhancing the confidence of Machine Learning models. Specifically, we operate within a practical context where limited samples can be obtained from a large data pool. We employ well-established model confidence metrics as our foundation, and we propose two methodologies, Bulk Acquisition (BA) and Sequential Acquisition (SA), each geared towards identifying the sets of samples that yield the most substantial gains in model confidence. Recognizing the complexity of BA and SA, we introduce two efficient approximate methods, namely kNN-BA and kNN-SA, restricting data acquisition to promising subsets within the data pool. To broaden the applicability of our solutions, we introduce a Distribution-based Acquisition approach that makes minimal assumption regarding the data pool and facilitates the data acquisition across various settings. 
Through extensive experimentation encompassing diverse datasets, models, and parameter configurations, we demonstrate the efficacy of our proposed methods across a range of tasks. Comparative experiments with alternative applicable baselines underscore the superior performance of our proposed approaches.","DOI":"10.1145\/3654934","type":"journal-article","created":{"date-parts":[[2024,5,30]],"date-time":"2024-05-30T13:44:53Z","timestamp":1717076693000},"page":"1-25","source":"Crossref","is-referenced-by-count":0,"title":["Data Acquisition for Improving Model Confidence"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"http:\/\/orcid.org\/0009-0000-0658-1094","authenticated-orcid":false,"given":"Yifan","family":"Li","sequence":"first","affiliation":[{"name":"York University, Toronto, Ontario, Canada"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-8170-2327","authenticated-orcid":false,"given":"Xiaohui","family":"Yu","sequence":"additional","affiliation":[{"name":"York University, Toronto, Ontario, Canada"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-5648-0638","authenticated-orcid":false,"given":"Nick","family":"Koudas","sequence":"additional","affiliation":[{"name":"University of Toronto, Toronto, Ontario, Canada"}]}],"member":"320","published-online":{"date-parts":[[2024,5,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Moloud Abdar Farhad Pourpanah Sadiq Hussain Dana Rezazadegan Li Liu Mohammad Ghavamzadeh Paul Fieguth Xiaochun Cao Abbas Khosravi U Rajendra Acharya et al. 2021. A review of uncertainty quantification in deep learning: Techniques applications and challenges. Information fusion 76 (2021) 243--297.","DOI":"10.1016\/j.inffus.2021.05.008"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3328526.3329589"},{"volume-title":"Data classification","author":"Aggarwal Charu C","key":"e_1_2_1_3_1","unstructured":"Charu C Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and S Yu Philip. 2014. 
Active learning: A survey. In Data classification. Chapman and Hall\/CRC, 599--634."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551858"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073012.1073017"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517855"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589317"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/3523210.3523223"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300078"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219166.3219195"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/3397230.3397235"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI 2022","volume":"432","author":"Chouraqui Gabriella","year":"2022","unstructured":"Gabriella Chouraqui, Liron Cohen, Gil Einziger, and Liel Leman. 2022. A geometric method for improved uncertainty estimation in real-time. In Uncertainty in Artificial Intelligence, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI 2022, 1--5 August 2022, Eindhoven, The Netherlands (Proceedings of Machine Learning Research, Vol. 180), James Cussens and Kun Zhang (Eds.). PMLR, 422--432. https:\/\/proceedings.mlr.press\/v180\/chouraqui22a.html"},{"key":"e_1_2_1_13_1","volume-title":"A Holistic Assessment of the Reliability of Machine Learning Systems. arXiv preprint arXiv:2307.10586","author":"Corso Anthony","year":"2023","unstructured":"Anthony Corso, David Karamadian, Romeo Valentin, Mary Cooper, and Mykel J Kochenderfer. 2023. A Holistic Assessment of the Reliability of Machine Learning Systems. arXiv preprint arXiv:2307.10586 (2023)."},{"key":"e_1_2_1_14_1","unstructured":"Dawex. 2023. Dawex. 
https:\/\/www.dawex.com\/en\/"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2012.2211477"},{"key":"e_1_2_1_16_1","unstructured":"Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http:\/\/archive.ics.uci.edu\/ml"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407800"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1992.4.1.1"},{"key":"e_1_2_1_19_1","volume-title":"International conference on machine learning. PMLR, 1321--1330","author":"Guo Chuan","year":"2017","unstructured":"Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In International conference on machine learning. PMLR, 1321--1330."},{"key":"e_1_2_1_20_1","volume-title":"International Conference on Machine Learning. PMLR, 3942--3952","author":"Gupta Chirag","year":"2021","unstructured":"Chirag Gupta and Aaditya Ramdas. 2021. Distribution-free calibration guarantees for histogram binning without sample splitting. In International Conference on Machine Learning. PMLR, 3942--3952."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2009.36"},{"key":"e_1_2_1_22_1","volume-title":"To Trust Or Not To Trust A Classifier. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018","author":"Jiang Heinrich","year":"2018","unstructured":"Heinrich Jiang, Been Kim, Melody Y. Guan, and Maya R. Gupta. 2018. To Trust Or Not To Trust A Classifier. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3--8, 2018, Montr\u00e9al, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol\u00f2 Cesa-Bianchi, and Roman Garnett (Eds.). 5546--5557. 
https:\/\/proceedings.neurips.cc\/paper\/2018\/hash\/7180cffd6a8e829dacfc2a31b3f72ece-Abstract.html"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1080\/01431161.2019.1601285"},{"key":"e_1_2_1_24_1","volume-title":"ICML","volume":"96","author":"Kohavi Ron","year":"1996","unstructured":"Ron Kohavi, David H Wolpert, et al. 1996. Bias plus variance decomposition for zero-one loss functions. In ICML, Vol. 96. Citeseer, 275--283."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i02.5583"},{"key":"e_1_2_1_26_1","unstructured":"Alex Krizhevsky Geoffrey Hinton et al. 2009. Learning multiple layers of features from tiny images. (2009)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i4.20327"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3467861.3467872"},{"key":"e_1_2_1_29_1","volume-title":"Dealer: End-to-End Data Marketplace with Model-based Pricing. arXiv:2003.13103 [cs.DB]","author":"Liu Jinfei","year":"2020","unstructured":"Jinfei Liu. 2020. Dealer: End-to-End Data Marketplace with Model-based Pricing. arXiv:2003.13103 [cs.DB]"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","unstructured":"Aleksej Logacjov and Astrid Ustad. 2023. HAR70. UCI Machine Learning Repository. DOI: https:\/\/doi.org\/10.24432\/C5CW3D.","DOI":"10.24432\/C5CW3D"},{"key":"e_1_2_1_31_1","volume-title":"Active Class Selection. In Machine Learning: ECML","author":"Lomasky R.","year":"2007","unstructured":"R. Lomasky, C. E. Brodley, M. Aernecke, D. Walt, and M. Friedl. 2007. Active Class Selection. In Machine Learning: ECML 2007."},{"key":"e_1_2_1_32_1","volume-title":"Learning under concept drift: A review","author":"Lu Jie","year":"2018","unstructured":"Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. 2018. Learning under concept drift: A review. 
IEEE transactions on knowledge and data engineering 31, 12 (2018), 2346--2363."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389768"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389768"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3328526.3329587"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3522567"},{"key":"e_1_2_1_37_1","volume-title":"An analysis of approximations for maximizing submodular set functions-I. Mathematical programming 14","author":"Nemhauser George L","year":"1978","unstructured":"George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. 1978. An analysis of approximations for maximizing submodular set functions-I. Mathematical programming 14 (1978), 265--294."},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9--15","volume":"4850","author":"N\u00f8kland Arild","year":"2019","unstructured":"Arild N\u00f8kland and Lars Hiller Eidnes. 2019. Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9--15 June 2019, Long Beach, California, USA (Proceedings of Machine Learning Research, Vol. 97). PMLR, 4839--4850."},{"key":"e_1_2_1_39_1","unstructured":"Brent Pedersen Matthias Adam Stewart Sean Gillies Howard Butler. 2023. R-Tree Implementation. https:\/\/github.com\/Toblerity\/rtree"},{"key":"e_1_2_1_40_1","unstructured":"Burr Settles. 2009. Active learning literature survey. (2009)."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157804"},{"key":"e_1_2_1_42_1","volume-title":"International Conference on Artificial Intelligence and Statistics. PMLR, 1308--1318","author":"Shui Changjian","year":"2020","unstructured":"Changjian Shui, Fan Zhou, Christian Gagn\u00e9, and Boyu Wang. 2020. 
Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics. PMLR, 1308--1318."},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. 1308--1318","author":"Shui Changjian","year":"2020","unstructured":"Changjian Shui, Fan Zhou, Christian Gagn\u00e9, and Boyu Wang. 2020. Deep Active Learning: Unified and Principled Method for Query and Training. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. 1308--1318."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00917"},{"key":"e_1_2_1_45_1","unstructured":"WorldQuant. 2023. WorldQuant. https:\/\/data.worldquant.com"},{"key":"e_1_2_1_46_1","unstructured":"Xignite. 2023. xignite. https:\/\/aws.amazon.com\/solutionspace\/financial-services\/solutions\/xignite-market-data-cloudplatform\/"},{"key":"e_1_2_1_47_1","volume-title":"International Conference on Machine Learning. PMLR, 10767--10777","author":"Yang Zitong","year":"2020","unstructured":"Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, and Yi Ma. 2020. Rethinking bias-variance trade-off for generalization of neural networks. In International Conference on Machine Learning. PMLR, 10767--10777."},{"key":"e_1_2_1_48_1","volume-title":"International conference on machine learning. PMLR, 11117--11128","author":"Zhang Jize","year":"2020","unstructured":"Jize Zhang, Bhavya Kailkhura, and T Yong-Jin Han. 2020. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International conference on machine learning. PMLR, 11117--11128."},{"key":"e_1_2_1_49_1","volume-title":"Berry","author":"Zhang Meng","year":"2020","unstructured":"Meng Zhang, Ahmed Arafa, Ermin Wei, and Randall A. Berry. 2020. Optimal and Quantized Mechanism Design for Fresh Data Acquisition. 
arXiv:2006.15751"},{"key":"e_1_2_1_50_1","volume-title":"Raul Castro Fernandez, and Mladen Kolar","author":"Zhao Boxin","year":"2023","unstructured":"Boxin Zhao, Boxiang Lyu, Raul Castro Fernandez, and Mladen Kolar. 2023. Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm. arXiv preprint arXiv:2306.02543 (2023)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3654934","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,7]],"date-time":"2024-08-07T22:48:32Z","timestamp":1723070912000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3654934"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,29]]},"references-count":50,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,5,29]]}},"alternative-id":["10.1145\/3654934"],"URL":"https:\/\/doi.org\/10.1145\/3654934","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2024,5,29]]}}}