{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,1,9]],"date-time":"2023-01-09T23:11:58Z","timestamp":1673305918171},"reference-count":19,"publisher":"IGI Global","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,7,1]]},"abstract":"
A video\u2019s soundtrack is usually highly correlated with its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as \u201cengine sounds\u201d or \u201coutdoor\/indoor sounds.\u201d These approaches come with three major drawbacks: manual definitions do not scale as they are highly domain-dependent, manual definitions are highly subjective with respect to annotators, and a large part of the audio content is omitted since the predefined concepts are usually found only in a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question \u201cwho spoke when?\u201d by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. 
It also discusses how diarization can be tuned to better reflect the acoustic properties of general sounds as opposed to speech, and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.","DOI":"10.4018\/jmdem.2012070101","type":"journal-article","created":{"date-parts":[[2012,11,19]],"date-time":"2012-11-19T18:58:27Z","timestamp":1353351507000},"page":"1-19","source":"Crossref","is-referenced-by-count":2,"title":["On the Applicability of Speaker Diarization to Audio Indexing of Non-Speech and Mixed Non-Speech\/Speech Video Soundtracks"],"prefix":"10.4018","volume":"3","author":[{"given":"Robert","family":"Mertens","sequence":"first","affiliation":[{"name":"International Computer Science Institute, University of California, Berkeley, USA"}]},{"given":"Po-Sen","family":"Huang","sequence":"additional","affiliation":[{"name":"Beckman Institute, University of Illinois at Urbana-Champaign, USA"}]},{"given":"Luke","family":"Gottlieb","sequence":"additional","affiliation":[{"name":"International Computer Science Institute, University of California, Berkeley, USA"}]},{"given":"Gerald","family":"Friedland","sequence":"additional","affiliation":[{"name":"International Computer Science Institute, University of California, Berkeley, USA"}]},{"given":"Ajay","family":"Divakaran","sequence":"additional","affiliation":[{"name":"SRI International Sarnoff, USA"}]},{"given":"Mark","family":"Hasegawa-Johnson","sequence":"additional","affiliation":[{"name":"Beckman Institute, University of Illinois at Urbana-Champaign, USA"}]}],"member":"2432","reference":[{"issue":"3","key":"jmdem.2012070101-0","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1961189.1961199","article-title":"LIBSVM: A library for support vector machines.","volume":"2","author":"C.-C.Chang","year":"2011","journal-title":"ACM Transactions on Intelligent Systems and 
Technology"},{"key":"jmdem.2012070101-1","doi-asserted-by":"crossref","unstructured":"Chaudhuri, S., Harvilla, M., & Raj, B. (2011). Unsupervised learning of acoustic unit descriptors for audio content representation and classification. In Proceedings of the 12th Annual International Conference Interspeech.","DOI":"10.21437\/Interspeech.2011-602"},{"key":"jmdem.2012070101-2","doi-asserted-by":"crossref","unstructured":"Friedland, G., & Vinyals, O. (2008, October). Live speaker identification in conversations. In Proceedings of the ACM International Conference on Multimedia, Vancouver, BC, Canada (pp. 1017-1018).","DOI":"10.1145\/1459359.1459558"},{"key":"jmdem.2012070101-3","doi-asserted-by":"crossref","unstructured":"Huang, J., Liu, Z., Wang, Y., Chen, Y., & Wong, E. K. (1999). Integration of multimodal features for video scene classification based on HMM. In Proceedings of the IEEE 3rd Workshop on Multimedia Signal Processing (pp. 53-58).","DOI":"10.1109\/MMSP.1999.793797"},{"key":"jmdem.2012070101-4","doi-asserted-by":"crossref","unstructured":"Huang, Y., Vinyals, O., Friedland, G., Muller, C., Mirghafori, N., & Wooters, C. (2007). A fast-match approach for robust, faster than real-time speaker diarization. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 693-698).","DOI":"10.1109\/ASRU.2007.4430196"},{"key":"jmdem.2012070101-5","doi-asserted-by":"crossref","unstructured":"Imseng, D., & Friedland, G. (2009, December). Robust speaker diarization for short speech recordings. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 432-437).","DOI":"10.1109\/ASRU.2009.5373254"},{"key":"jmdem.2012070101-6","unstructured":"Jiang, Y.-G., Zeng, X., Ye, G., Bhattacharya, S., Ellis, D., Shah, M., & Chang, S.-F. (2010). Columbia-ucf trecvid2010 multimedia event detection: Combining multiple modalities, contextual concepts, and temporal matching. 
In Proceedings of the NIST TRECVID Workshop on Video Retrieval Evaluation."},{"key":"jmdem.2012070101-7","doi-asserted-by":"crossref","unstructured":"Lan, M., & Low, H. (2005). A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (pp. 1032-1033).","DOI":"10.1145\/1062745.1062854"},{"key":"jmdem.2012070101-8","doi-asserted-by":"publisher","DOI":"10.1023\/A:1012491419635"},{"key":"jmdem.2012070101-9","doi-asserted-by":"publisher","DOI":"10.1145\/1126004.1126005"},{"key":"jmdem.2012070101-10","unstructured":"Li, H., Bao, L., Gao, Z., Overwijk, A., Liu, W., & Zhang, L. \u2026Hauptmann, A. (2010). Informedia@trecvid 2010. In Notebook for NIST\u2019s TREC Video Retrieval Evaluation."},{"key":"jmdem.2012070101-11","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2007.911304"},{"key":"jmdem.2012070101-12","doi-asserted-by":"crossref","unstructured":"Mertens, R., Lei, H., Gottlieb, L., Friedland, G., & Divakaran, A. (2011, November 28-December 1). Acoustic super models for large scale video event detection. In Proceedings of the International ACM Workshop on Events in Multimedia, Scottsdale, AZ (pp. 19-24).","DOI":"10.1145\/2072508.2072513"},{"key":"jmdem.2012070101-13","unstructured":"NIST TRECVid. (2011). Evaluation. Retrieved December 15, 2011, from http:\/\/www-nlpir.nist.gov\/projects\/trecvid\/"},{"key":"jmdem.2012070101-14","doi-asserted-by":"publisher","DOI":"10.1108\/00220410410560582"},{"key":"jmdem.2012070101-15","unstructured":"Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H., Bugalho, M., & Trancoso, I. (2010). On the use of audio events for improving video scene segmentation. In Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (pp. 
1-4)."},{"key":"jmdem.2012070101-16","doi-asserted-by":"publisher","DOI":"10.1561\/1500000014"},{"key":"jmdem.2012070101-17","doi-asserted-by":"publisher","DOI":"10.1109\/2.493456"},{"key":"jmdem.2012070101-18","unstructured":"Wooters, C., & Huijbregts, M. (2008). Multimodal technologies for perception of humans. In R. Stiefelhagen & J. Garofolo (Eds.), Proceedings of the First International Evaluation Workshop on Classification of Events, Activities and Relationships (LNCS 4122, pp. 509-519)."}],"container-title":["International Journal of Multimedia Data Engineering and Management"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.igi-global.com\/viewtitle.aspx?TitleId=72890","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,1]],"date-time":"2022-06-01T18:40:55Z","timestamp":1654108855000},"score":1,"resource":{"primary":{"URL":"https:\/\/services.igi-global.com\/resolvedoi\/resolve.aspx?doi=10.4018\/jmdem.2012070101"}},"subtitle":[""],"short-title":[],"issued":{"date-parts":[[2012,7,1]]},"references-count":19,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2012,7]]}},"URL":"https:\/\/doi.org\/10.4018\/jmdem.2012070101","relation":{},"ISSN":["1947-8534","1947-8542"],"issn-type":[{"value":"1947-8534","type":"print"},{"value":"1947-8542","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,7,1]]}}}