Elsevier

Information Systems

Volume 65, April 2017, Pages 52-64

Modeling user interest in social media using news media and Wikipedia

https://doi.org/10.1016/j.is.2016.11.003

Abstract

Social media has become an important source of information and a medium for following and spreading trends, news, and ideas all over the world. Although determining the subjects of individual posts is important to extract users' interests from social media, this task is nontrivial because posts are highly contextualized and informal and have limited length. To address this problem, we propose a user modeling framework that maps the content of texts in social media to relevant categories in news media. In our framework, the semantic gaps between social media and news media are reduced by using Wikipedia as an external knowledge base. We map term-based features from a short text and a news category into Wikipedia-based features such as Wikipedia categories and article entities. A user's microposts are thus represented in a rich feature space of words. Experimental results show that our proposed method using Wikipedia-based features outperforms other existing methods of identifying users' interests from social media.

Introduction

Social media services such as Twitter and Facebook attract and encourage millions of users to share and exchange their ideas and opinions and to participate in events. Millions of new posts are generated daily from such open broadcasting platforms, and most of this information is stored in various text formats. Capturing users' interests from texts in social media data has become an important research topic in the area of personalized recommender systems. However, it is difficult to estimate the interests of social media users directly from social media data because their posts do not contain any category information [1]. To address this problem, Han and Lee [1] proposed an approach to map the contents of texts in social media into categories of a news corpus. Social media and news media are similar in that many current issues are posted in both. News media, however, contains additional information because news articles are categorized by experts into predefined categories. In Ref. [1], users' interests were estimated by comparing the features of news categories and features of personal social media data, where features were extracted from keywords in documents. This method is effective for categorization tasks, where the category of a social media post can easily be identified by distinguishable keywords such as “Obama” or “football”, but limitations still remain in dealing with short posts, abbreviations, and infrequently used topical terms. For instance, social media users often use the term “SNS” instead of the full name “social networking service”, whereas news media use the term “social media service”. Furthermore, terms that rarely occur in news media, such as the Indonesian dish “Nasi kuning”, do not provide any categorization information. We refer to this issue as the semantic gap between social media and news media.
To resolve this semantic gap, we employ Wikipedia as an external knowledge resource. Containing more than three million articles, Wikipedia is currently the world's largest knowledge resource. Each article describes a single topic with a succinct and well-formed title. Wikipedia also contains rich information about relationships between different articles in the forms of categories, interlinks, and redirect pages. A redirect page contains no content by itself; instead, it sends the user to another page, usually an article or a section of an article. For instance, searching for the keyword “UK” in Wikipedia results in an article page with the title “United Kingdom”. Thus, abbreviations in social media data can be resolved using Wikipedia. The problem caused by short texts or by infrequently occurring topical terms can be solved by enriching the terms using Wikipedia articles, categories, and their relationships. If we search for the term “Nasi kuning” in Wikipedia articles, we can extract not only the “Food” category but also other semantically related categories of the article.
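The redirect mechanism described above can be sketched with a toy lookup table. The `REDIRECTS` mapping below is a hypothetical, hand-built stand-in for redirect data extracted from a real Wikipedia dump, not the paper's actual implementation:

```python
# Minimal sketch of resolving social-media abbreviations through Wikipedia
# redirect pages. REDIRECTS is a toy table standing in for the redirect
# data that would be extracted from a Wikipedia dump.
REDIRECTS = {
    "UK": "United Kingdom",
    "SNS": "Social networking service",
}

def resolve_title(term, redirects):
    """Follow a redirect if one exists; otherwise keep the term as-is."""
    return redirects.get(term, term)

# Abbreviations are expanded; unknown terms (e.g. "Nasi kuning") pass
# through unchanged and are handled later via article/category enrichment.
resolved = [resolve_title(t, REDIRECTS) for t in ["UK", "SNS", "Nasi kuning"]]
```

Searching "UK" thus yields the article title "United Kingdom", while a term with no redirect entry is left for the article- and category-based enrichment step.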
In this study, we propose a new method to estimate users' interest in social media by mapping social media content into news categories. In this way, features generated using Wikipedia are used to resolve the semantic gap between social media data and news media data. Our proposed Wikipedia-based feature generators consist of the following three components: (i) Wiki-CF-ICF (Wikipedia-category frequency-inverse category frequency), which exploits the category information of Wikipedia articles; (ii) Wiki-AF-IAF (Wikipedia-article frequency-inverse article frequency), which uses the contents of Wikipedia articles; and (iii) Wiki-Cluster, which clusters features from Wiki-CF-ICF to filter out noise categories. Users' interests are estimated by calculating the similarity of features of a social media user and features of a news category. In the proposed approach, we regard each post of a user as a document instead of using an author-pool approach that generates a document for each user by aggregating all messages posted [1]. We evaluated the proposed approach by measuring the similarity score between a user's interests as estimated by our approach and as manually labeled by annotators. The evaluation results show that a mixture of Wikipedia-based approaches outperformed other approaches, such as term frequency-inverse category frequency (TF-ICF) [1] and Latent Dirichlet Allocation (LDA) [2], and that the approach of profiling users by regarding each post in social media as a document estimates users' interests more accurately than the author-pool approach. The remainder of this paper is organized as follows: Section 2 describes related works. Section 3 presents the framework of the proposed method. Section 4 describes the methodology for modeling user interest. Section 5 presents the experimental results. Section 6 presents the conclusions of the study.
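The final estimation step — comparing a user's features with each news category's features — can be illustrated with a small similarity computation. The feature vectors and category names below are invented for illustration; in the paper the vectors come from the Wikipedia-based feature generators:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors given as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented Wikipedia-based feature vectors for two news categories.
category_features = {
    "Sports": {"Football": 0.9, "Olympic Games": 0.5},
    "Food": {"Cuisine": 0.8, "Indonesian cuisine": 0.6},
}

# One user post, already mapped into the same Wikipedia feature space;
# the user's interest in each news category is the similarity score.
post_features = {"Indonesian cuisine": 0.7, "Cuisine": 0.3}
interest = {cat: cosine(post_features, feats)
            for cat, feats in category_features.items()}
top_category = max(interest, key=interest.get)
```

Treating each post as its own document, per-post interest scores such as these can then be aggregated over all of a user's posts.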

Section snippets

Related works

Many personalization systems have been constructed by analyzing the web documents visited by users [3], [4]. However, the increasing popularity of social media services such as Twitter and Facebook has shifted personalization systems toward analyzing users' activities on these platforms. These works use either bag-of-words [5], [6] or topic-model [7] approaches to generate user profiles. In Ref. [5], a user profile is modeled as a bag-of-words feature vector generated from the user's tweets with term…

System framework

In this section, we describe the architecture of our proposed system, as shown in Fig. 1. First, a term-based feature generator is designed and used to map a document from either a message or a news category into a term vector T. The term vector T consists of all pairs of terms in the document and their corresponding TF-ICF weights [1], which are used to extract the document's significant terms. In the term-based feature generator, TF-ICF is calculated as the product of the term frequency (TF) in a…
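The TF-ICF weighting can be sketched as follows, assuming the common logarithmic form of the inverse factor (the exact normalization used in Ref. [1] may differ) and toy category vocabularies invented for illustration:

```python
import math
from collections import Counter

def tf_icf(doc_terms, category_vocabularies):
    """Weight each document term by its frequency (TF) times the inverse
    category frequency log(N / n_t), where n_t counts the news categories
    whose vocabulary contains the term. Terms appearing in every category
    receive weight 0; terms found in no category are skipped."""
    tf = Counter(doc_terms)
    n = len(category_vocabularies)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for vocab in category_vocabularies if term in vocab)
        if df:
            weights[term] = freq * math.log(n / df)
    return weights

# Toy vocabularies for three news categories.
cats = [{"football", "goal"}, {"election", "senate"}, {"goal", "recipe"}]
weights = tf_icf(["goal", "goal", "football"], cats)
```

Note that "football", which occurs in only one category, outweighs the more frequent but less discriminative "goal".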

Methodology

In this section, we describe a term-based feature generator using a TF-ICF approach as described in Ref. [1] and propose three Wikipedia-based feature generators: (1) a Wiki-CF-ICF feature generator using Wikipedia categories, (2) a Wiki-Cluster feature generator to extract representative features for news categories, and (3) a Wiki-AF-IAF feature generator using Wikipedia articles. Subsequently, an approach for estimating users' interests using Wikipedia-based feature generators will be…
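A minimal sketch of the Wiki-CF-ICF idea, under two assumptions flagged here and in the comments: the article-to-category table is a toy, hand-built mapping rather than a real Wikipedia dump, and the CF-ICF weight is assumed to take the same logarithmic form as TF-ICF (the paper's exact formulation may differ):

```python
import math
from collections import Counter

# Toy article -> Wikipedia-category table; a real system would extract
# this mapping from a Wikipedia dump.
ARTICLE_CATEGORIES = {
    "Nasi kuning": ["Indonesian cuisine", "Rice dishes"],
    "Rendang": ["Indonesian cuisine", "Meat dishes"],
}

def wiki_cf_icf(doc_articles, news_cat_features):
    """Wikipedia-category frequency (CF) over the document's matched
    articles, reweighted by log(N / n_c), where n_c is the number of news
    categories whose feature sets contain that Wikipedia category
    (assumed form, analogous to TF-ICF)."""
    cf = Counter()
    for article in doc_articles:
        cf.update(ARTICLE_CATEGORIES.get(article, []))
    n = len(news_cat_features)
    return {
        c: f * math.log(n / max(1, sum(c in s for s in news_cat_features)))
        for c, f in cf.items()
    }

news_cats = [{"Indonesian cuisine"}, {"Politics"}, {"Sports"}]
weights = wiki_cf_icf(["Nasi kuning", "Rendang"], news_cats)
```

Even a rare term such as "Nasi kuning" now contributes weight to semantically related Wikipedia categories instead of being dropped as unseen.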

Evaluations

To verify the suitability of our proposed approach, we evaluated it on real social media data. In this evaluation, we compared user interest as estimated by our proposed approach with user interest as labeled by annotators.

Conclusion and future work

In this study, we focused on estimating the interests of social media users. The authors in Ref. [1] proposed a method to map categories in news media to social media users. However, directly using news media has some limitations due to the heterogeneity of the two different types of media. We proposed a novel approach that exploits Wikipedia to address the semantic gap between social media data and news media data. The contributions of this work include (1) designing and implementing a…

Acknowledgements

This research was supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Creative Content Agency (KOCCA) under the Culture Technology (CT) Research & Development Program 2016.

References (40)

  • J. Han, H. Lee, Characterizing user interest using heterogeneous media, in: Proceedings of the 23rd international...
  • D. Blei et al.

    Latent Dirichlet allocation

    J. Mach. Learn. Res.

    (2003)
  • D. Godoy, A. Amandi, Modeling user interests by conceptual clustering, in: Information Systems, Vol. 31, 2006, pp....
  • K. Ramanathan, K. Kapoor, Creating user profiles using Wikipedia, Vol. 5829, 2009, pp....
  • J. Chen, R. Nairn, L. Nelson, M. Bernstein, Short and tweet: experiments on recommending content from information...
  • B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, M. Demirbas, Short text classification in Twitter to improve...
  • J. Weng, E. Lim, J. Jiang, Q. He, TwitterRank: finding topic-sensitive influential twitterers, in: Proceedings of...
  • G. Salton et al.

    Introduction to Modern Information Retrieval

    (1986)
  • Q. Pu, G.-W. Yang, Short-text classification based on ICA and LSA, in: ISNN'06 Proceedings of the Third international...
  • X. Phan et al.

    A hidden topic-based framework towards building applications with short web documents

    IEEE Trans. Knowl. Data Eng.

    (2011)
  • S. Zelikovitz, Transductive LSI for short text classification problems,...
  • M. Sahlgren, R. Coster, Using bag-of-concepts to improve the performance of support vector machines in text...
  • L. Hong, B. Davison, Empirical study of topic modeling in Twitter, in: Proceedings of the First Workshop on Social...
  • X. Yan, J. Guo, Y. Lan, X. Cheng, A biterm topic model for short texts, in: Proceedings of the 22nd international...
  • M. Sahami, T. Heilman, A web-based kernel function for measuring the similarity of short text snippets, in: Proceedings...
  • D. Bollegala, Y. Matsuo, M. Ishizuka, Measuring semantic similarity between words using web search engines, in:...
  • E. Gabrilovich, S. Markovitch, Feature generation for text categorization using world knowledge, in: IJCAI '05,...
  • E. Gabrilovich, S. Markovitch, Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization...
  • S. Banerjee, K. Ramanathan, Clustering short texts using Wikipedia, in: Proceedings of the 30th annual international...
  • X. Hu, N. Sun, C. Zhang, T. Chua, Exploiting internal and external semantics for the clustering of short texts using...