Elsevier

Information Systems

Volume 65, April 2017, Pages 52-64

Modeling user interest in social media using news media and Wikipedia

https://doi.org/10.1016/j.is.2016.11.003

Abstract

Social media has become an important source of information and a medium for following and spreading trends, news, and ideas all over the world. Although determining the subjects of individual posts is important to extract users' interests from social media, this task is nontrivial because posts are highly contextualized and informal and have limited length. To address this problem, we propose a user modeling framework that maps the content of texts in social media to relevant categories in news media. In our framework, the semantic gaps between social media and news media are reduced by using Wikipedia as an external knowledge base. We map term-based features from a short text and a news category into Wikipedia-based features such as Wikipedia categories and article entities. A user's microposts are thus represented in a rich feature space of words. Experimental results show that our proposed method using Wikipedia-based features outperforms other existing methods of identifying users' interests from social media.

Introduction

Social media services such as Twitter and Facebook attract and encourage millions of users to share and exchange their ideas and opinions and to participate in events. Millions of new posts are generated daily from such open broadcasting platforms, and most of this information is stored in various text formats. Capturing users' interests from texts in social media data has become an important research topic in the area of personalized recommender systems. However, it is difficult to estimate the interests of social media users directly from social media data because their posts do not contain any category information [1]. To address this problem, Han and Lee [1] proposed an approach to map the contents of texts in social media into categories of a news corpus. Social media and news media are similar in that many current issues are posted in both. News media, however, contains additional information because news articles are categorized by experts into predefined categories. In Ref. [1], users' interests were estimated by comparing the features of news categories and features of personal social media data, where features were extracted from keywords in documents. This method is effective for categorization tasks, where the category of a social media post can easily be identified by distinguishable keywords such as “Obama” or “football”, but limitations still remain in dealing with short posts, abbreviations, and infrequently used topical terms. For instance, social media users often use the term “SNS” instead of the full name “social networking service”, whereas news media use the term “social media service”. Furthermore, terms that rarely occur in news media, such as the Indonesian dish “Nasi kuning”, do not provide any categorization information. We refer to this issue as the semantic gap between social media and news media.
To resolve this semantic gap, we employ Wikipedia as an external knowledge resource. Containing more than three million articles, Wikipedia is currently the world's largest knowledge resource. Each article describes a single topic with a succinct and well-formed title. Wikipedia also contains rich information about relationships between different articles in the forms of categories, interlinks, and redirect pages. A redirect page contains no content by itself; instead, it sends the user to another page, usually an article or a section of an article. For instance, searching for the keyword “UK” in Wikipedia results in an article page with the title “United Kingdom”. Thus, abbreviations in social media data can be resolved using Wikipedia. The problem caused by short texts or by infrequently occurring topical terms can be solved by enriching the terms using Wikipedia articles, categories, and their relationships. If we search for the term “Nasi kuning” in Wikipedia articles, we can extract not only the “Food” category but also other semantically related categories of the article.
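The redirect mechanism described above can be sketched with a toy lookup table. The `REDIRECTS` mapping below is a hypothetical, hand-built stand-in for redirect data extracted from a real Wikipedia dump, not the paper's actual implementation:

```python
# Minimal sketch of resolving social-media abbreviations through Wikipedia
# redirect pages. REDIRECTS is a toy table standing in for the redirect
# data that would be extracted from a Wikipedia dump.
REDIRECTS = {
    "UK": "United Kingdom",
    "SNS": "Social networking service",
}

def resolve_title(term, redirects):
    """Follow a redirect if one exists; otherwise keep the term as-is."""
    return redirects.get(term, term)

# Abbreviations are expanded; unknown terms (e.g. "Nasi kuning") pass
# through unchanged and are handled later via article/category enrichment.
resolved = [resolve_title(t, REDIRECTS) for t in ["UK", "SNS", "Nasi kuning"]]
```

Searching "UK" thus yields the article title "United Kingdom", while a term with no redirect entry is left for the article- and category-based enrichment step.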
In this study, we propose a new method to estimate users' interest in social media by mapping social media content into news categories. In this way, features generated using Wikipedia are used to resolve the semantic gap between social media data and news media data. Our proposed Wikipedia-based feature generators consist of the following three components: (i) Wiki-CF-ICF (Wikipedia-category frequency-inverse category frequency), which exploits the category information of Wikipedia articles; (ii) Wiki-AF-IAF (Wikipedia-article frequency-inverse article frequency), which uses the contents of Wikipedia articles; and (iii) Wiki-Cluster, which clusters features from Wiki-CF-ICF to filter out noise categories. Users' interests are estimated by calculating the similarity of features of a social media user and features of a news category. In the proposed approach, we regard each post of a user as a document instead of using an author-pool approach that generates a document for each user by aggregating all messages posted [1]. We evaluated the proposed approach by measuring the similarity score between a user's interests as estimated by our approach and as manually labeled by annotators. The evaluation results show that a mixture of Wikipedia-based approaches outperformed other approaches, such as term frequency-inverse category frequency (TF-ICF) [1] and Latent Dirichlet Allocation (LDA) [2], and that the approach of profiling users by regarding each post in social media as a document estimates users' interests more accurately than the author-pool approach. The remainder of this paper is organized as follows: Section 2 describes related works. Section 3 presents the framework of the proposed method. Section 4 describes the methodology for modeling user interest. Section 5 presents the experimental results. Section 6 presents the conclusions of the study.
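The final estimation step — comparing a user's features with each news category's features — can be illustrated with a small similarity computation. The feature vectors and category names below are invented for illustration; in the paper the vectors come from the Wikipedia-based feature generators:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors given as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented Wikipedia-based feature vectors for two news categories.
category_features = {
    "Sports": {"Football": 0.9, "Olympic Games": 0.5},
    "Food": {"Cuisine": 0.8, "Indonesian cuisine": 0.6},
}

# One user post, already mapped into the same Wikipedia feature space;
# the user's interest in each news category is the similarity score.
post_features = {"Indonesian cuisine": 0.7, "Cuisine": 0.3}
interest = {cat: cosine(post_features, feats)
            for cat, feats in category_features.items()}
top_category = max(interest, key=interest.get)
```

Treating each post as its own document, per-post interest scores such as these can then be aggregated over all of a user's posts.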

Section snippets

Related works

Many personalization systems have been constructed by analyzing the web documents visited by users [3], [4]. However, the increasing popularity of social media services such as Twitter and Facebook has shifted personalization systems toward analyzing users' activities on these platforms. These works use either bag-of-words [5], [6] or topic-model [7] approaches to generate user profiles. In Ref. [5], a user profile is modeled as a bag-of-words feature vector generated from the user's tweets with term…

System framework

In this section, we describe the architecture of our proposed system, as shown in Fig. 1. First, a term-based feature generator is designed and used to map a document from either a message or a news category into a term vector T. The term vector T consists of all pairs of terms in the document and their corresponding TF-ICF weights [1], which are used to extract the document's significant terms. In the term-based feature generator, TF-ICF is calculated as the product of the term frequency (TF) in a…
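The TF-ICF weighting can be sketched as follows, assuming the common logarithmic form of the inverse factor (the exact normalization used in Ref. [1] may differ) and toy category vocabularies invented for illustration:

```python
import math
from collections import Counter

def tf_icf(doc_terms, category_vocabularies):
    """Weight each document term by its frequency (TF) times the inverse
    category frequency log(N / n_t), where n_t counts the news categories
    whose vocabulary contains the term. Terms appearing in every category
    receive weight 0; terms found in no category are skipped."""
    tf = Counter(doc_terms)
    n = len(category_vocabularies)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for vocab in category_vocabularies if term in vocab)
        if df:
            weights[term] = freq * math.log(n / df)
    return weights

# Toy vocabularies for three news categories.
cats = [{"football", "goal"}, {"election", "senate"}, {"goal", "recipe"}]
weights = tf_icf(["goal", "goal", "football"], cats)
```

Note that "football", which occurs in only one category, outweighs the more frequent but less discriminative "goal".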

Methodology

In this section, we describe a term-based feature generator using a TF-ICF approach as described in Ref. [1] and propose three Wikipedia-based feature generators: (1) a Wiki-CF-ICF feature generator using Wikipedia categories, (2) a Wiki-Cluster feature generator to extract representative features for news categories, and (3) a Wiki-AF-IAF feature generator using Wikipedia articles. Subsequently, an approach for estimating users' interests using Wikipedia-based feature generators will be…
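A minimal sketch of the Wiki-CF-ICF idea, under two assumptions flagged here and in the comments: the article-to-category table is a toy, hand-built mapping rather than a real Wikipedia dump, and the CF-ICF weight is assumed to take the same logarithmic form as TF-ICF (the paper's exact formulation may differ):

```python
import math
from collections import Counter

# Toy article -> Wikipedia-category table; a real system would extract
# this mapping from a Wikipedia dump.
ARTICLE_CATEGORIES = {
    "Nasi kuning": ["Indonesian cuisine", "Rice dishes"],
    "Rendang": ["Indonesian cuisine", "Meat dishes"],
}

def wiki_cf_icf(doc_articles, news_cat_features):
    """Wikipedia-category frequency (CF) over the document's matched
    articles, reweighted by log(N / n_c), where n_c is the number of news
    categories whose feature sets contain that Wikipedia category
    (assumed form, analogous to TF-ICF)."""
    cf = Counter()
    for article in doc_articles:
        cf.update(ARTICLE_CATEGORIES.get(article, []))
    n = len(news_cat_features)
    return {
        c: f * math.log(n / max(1, sum(c in s for s in news_cat_features)))
        for c, f in cf.items()
    }

news_cats = [{"Indonesian cuisine"}, {"Politics"}, {"Sports"}]
weights = wiki_cf_icf(["Nasi kuning", "Rendang"], news_cats)
```

Even a rare term such as "Nasi kuning" now contributes weight to semantically related Wikipedia categories instead of being dropped as unseen.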

Evaluations

To verify the suitability of our proposed approach, we evaluated it on real social media data. In this evaluation, we compared user interest as estimated by our proposed approach with user interest as labeled by annotators.

Conclusion and future work

In this study, we focused on estimating the interests of social media users. The authors in Ref. [1] proposed a method to map categories in news media to social media users. However, directly using news media has some limitations due to the heterogeneity of the two different types of media. We proposed a novel approach that exploits Wikipedia to address the semantic gap between social media data and news media data. The contributions of this work include (1) designing and implementing a…

Acknowledgements

This research was supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Creative Content Agency (KOCCA) under the Culture Technology (CT) Research & Development Program 2016.

References (40)

  • J. Han, H. Lee, Characterizing user interest using heterogeneous media, in: Proceedings of the 23rd international...
  • D. Blei et al.

    Latent Dirichlet allocation

    J. Mach. Learn. Res.

    (2003)
  • D. Godoy, A. Amandi, Modeling user interests by conceptual clustering, in: Information Systems, Vol. 31, 2006, pp....
  • K. Ramanathan, K. Kapoor, Creating user profiles using Wikipedia, Vol. 5829, 2009, pp....
  • J. Chen, R. Nairn, L. Nelson, M. Bernstein, Short and tweet: experiments on recommending content from information...
  • B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, M. Demirbas, Short text classification in Twitter to improve...
  • J. Weng, E. Lim, J. Jiang, Q. He, TwitterRank: finding topic-sensitive influential twitterers, in: Proceedings of...
  • G. Salton et al.

    Introduction to Modern Information Retrieval

    (1986)
  • Q. Pu, G.-W. Yang, Short-text classification based on ICA and LSA, in: ISNN'06 Proceedings of the Third international...
  • X. Phan et al.

    A hidden topic-based framework towards building applications with short web documents

    IEEE Trans. Knowl. Data Eng.

    (2011)
  • S. Zelikovitz, Transductive LSI for short text classification problems,...
  • M. Sahlgren, R. Coster, Using bag-of-concepts to improve the performance of support vector machines in text...
  • L. Hong, B. Davison, Empirical study of topic modeling in Twitter, in: Proceedings of the First Workshop on Social...
  • X. Yan, J. Guo, Y. Lan, X. Cheng, A biterm topic model for short texts, in: Proceedings of the 22nd international...
  • M. Sahami, T. Heilman, A web-based kernel function for measuring the similarity of short text snippets, in: Proceedings...
  • D. Bollegala, Y. Matsuo, M. Ishizuka, Measuring semantic similarity between words using web search engines, in:...
  • E. Gabrilovich, S. Markovitch, Feature generation for text categorization using world knowledge, in: IJCAI '05,...
  • E. Gabrilovich, S. Markovitch, Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization...
  • S. Banerjee, K. Ramanathan, Clustering short texts using Wikipedia, in: Proceedings of the 30th annual international...
  • X. Hu, N. Sun, C. Zhang, T. Chua, Exploiting internal and external semantics for the clustering of short texts using...