Abstract
While large Knowledge Graphs (KGs) already cover a broad range of domains to an extent sufficient for general use, they typically lack emerging entities that are just starting to attract public interest. This disqualifies such KGs from tasks like entity-based media monitoring, since a large portion of news inherently covers entities that have not been noted by the public before. Such entities are unlinkable, which ultimately means that they cannot be monitored in media streams. This is the first paper that thoroughly investigates all types of challenges that arise from out-of-KG entities for entity linking tasks. Through large-scale analytics of news streams, we quantify the importance of each challenge for real-world applications. We then propose a machine learning approach that tackles the most frequent but least investigated challenge, i.e., when entities are missing in the KG and cannot be considered by entity linking systems. We construct a publicly available benchmark data set based on English news articles and editing behavior on Wikipedia. Our experiments show that predicting whether an entity will be added to Wikipedia is challenging. However, we can reliably identify emerging entities that could be added to the KG according to Wikipedia’s own notability criteria.
A. Rettinger—The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346.
Notes
- 1.
This fact results from our empirical analysis, see Sect. 2.2 for more details.
- 2.
Emerging relates to trending: Entities can emerge only once. Once they have become notable, any (repeated) increase in public interest is just a trend.
- 3.
- 4.
- 5.
As we are interested in novel/emerging entities, we do not consider deletions of entities or surface forms within \(\varDelta t\).
- 6.
The remaining few entities are not parseable by the Stanford parser.
- 7.
Of the 300 novel entities manually tagged as named entities, 95 were classified as type Person, 51 as type Location, 27 as type Organization, and 24 as type Event (a subtype of Misc).
- 8.
For 11,639 of those 41,579 novel entities, however, only the Wikipedia title or redirects changed (due to typo correction or outsourcing of parts of a page). That is, on average over 700 “really” novel entities are inserted into Wikipedia each day. For the task of Emerging Entity Detection (see Sect. 4), we only consider really novel entities that emerge (i.e., that recently gained public interest for the first time).
- 9.
See http://trec-kba.org/, requested June 26, 2016.
- 10.
An entity is here understood as “noun phrase that could have a Wikipedia-style article if there were no notability or newness considerations, and which would have semantic types.” [12].
- 11.
Note that any text annotation method for Wikipedia could have been applied here.
- 12.
- 13.
We also experimented with aggregating all features for each NP series, but this did not yield better evaluation results.
- 14.
- 15.
We also evaluated machine learning algorithms specialized for imbalanced and time-series data, such as cost-sensitive AdaBoost, cost-sensitive one-class classifiers, and recurrent neural networks. However, these did not yield better results.
- 16.
See more information on our website.
- 17.
Given the Wikipedia status of 2015-04-04 as the reference KG.
- 18.
Some of those entities were inserted later.
- 19.
Investigations revealed that the already existing Wikipedia entities were not annotated by x-LiSA because no suitable surface form was available for those entities. In most of those cases, the entity was a person, and only the family name was mentioned in the news article and extracted. However, the set of known surface forms from Wikipedia contained only the full name of the entity. Resolving those issues is left to future work.
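The counts reported in Notes 7 and 8 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch and not part of the original paper; all variable names are illustrative. The length of the observation window is not stated in these notes, so the sketch only derives an upper bound for it from the "over 700 per day" figure.

```python
# Counts taken from Notes 7 and 8; variable names are illustrative assumptions.
total_novel = 41_579             # novel entities observed (Note 8)
title_or_redirect_only = 11_639  # only title or redirects changed (Note 8)

# Entities that are "really" novel according to Note 8.
really_novel = total_novel - title_or_redirect_only

# Note 8 reports an average of over 700 really novel entities per day.
# The window length is not given here, so this is only an upper bound in days.
max_days = really_novel / 700

# Type distribution of the 300 manually tagged named entities (Note 7).
typed = {"Person": 95, "Location": 51, "Organization": 27, "Event": 24}
tagged_total = 300
typed_sum = sum(typed.values())
other = tagged_total - typed_sum  # entities outside the four listed types
shares = {name: count / tagged_total for name, count in typed.items()}

print(really_novel, round(max_days, 1), typed_sum, other)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these numbers, 29,940 entities remain as really novel, which at more than 700 per day corresponds to a window of fewer than 43 days. The four listed types account for 197 of the 300 tagged entities; Note 7 does not say how the remaining 103 were typed.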
References
Ben-Hur, A., Horn, D., Siegelmann, H.T., Vapnik, V.: Support vector clustering. J. Mach. Learn. Res. 2, 125–137 (2002)
Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pp. 9–16, Trento, Italy (2006)
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, E., Mitchell, T.: Toward an architecture for never-ending language learning (2010)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint Conference on EMNLP-CoNLL, pp. 708–716, Prague, Czech Republic. Association for Computational Linguistics, June 2007
Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T.: Entity disambiguation for knowledge base population. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, Stroudsburg, PA, USA, pp. 277–285. Association for Computational Linguistics (2010)
Dutta, A., Meilicke, C., Stuckenschmidt, H.: Enriching structured knowledge with open information. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Republic and Canton of Geneva, Switzerland, pp. 267–277 (2015)
Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Stroudsburg, PA, USA, pp. 1535–1545. Association for Computational Linguistics (2011)
Gottipati, S., Jiang, J.: Linking entities to a knowledge base with query expansion. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Stroudsburg, PA, USA, pp. 804–813. Association for Computational Linguistics (2011)
Hoffart, J., Altun, Y., Weikum, G.: Discovering emerging entities with ambiguous names. In: Proceedings of the 23rd International Conference on World Wide Web, WWW 2014, New York, NY, USA, pp. 385–396. ACM (2014)
Ji, H., Nothman, J., Hachey, B., Florian, R.: Overview of TAC-KBP2015 tri-lingual entity discovery and linking (2015)
Lin, T., Etzioni, O.: No noun phrase left behind: detecting and typing unlinkable entities. In: Proceedings of the 2012 Joint Conference on EMNLP and CoNLL, EMNLP-CoNLL 2012, Stroudsburg, PA, USA, pp. 893–903. ACL (2012)
Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, New York, NY, USA, pp. 509–518. ACM (2008)
Nakashole, N., Tylenda, T., Weikum, G.: Fine-grained semantic typing of emerging entities. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 1488–1497 (2013)
Parada, C., Sethy, A., Dredze, M., Jelinek, F.: A spoken term detection framework for recovering out-of-vocabulary words using the web. Paragraph 10(71.24), 323K (2010)
Soboroff, I., Harman, D.: Novelty detection: the TREC experience. In: HLT 2005, Stroudsburg, PA, USA, pp. 105–112. ACL (2005)
Trampuš, M., Novak, B.: Internals of an aggregated web news feed. In: Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012, pp. 431–434 (2012)
Wang, C., Chakrabarti, K., Cheng, T., Chaudhuri, S.: Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, New York, NY, USA, pp. 719–728. ACM (2012)
Wu, Z., Song, Y., Giles, C.L.: Exploring multiple feature spaces for novel entity discovery. In: AAAI 2016, AAAI - Association for the Advancement of Artificial Intelligence, February 2016
Yosef, M.A., Bauer, S., Hoffart, J., Spaniol, M., Weikum, G.: HYENA: hierarchical type classification for entity names. In: COLING 2012, pp. 1361–1370 (2012)
Zhang, L., Färber, M., Rettinger, A.: xLiD-Lexica: cross-lingual Linked data lexica. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2101–2105. ELRA (2014)
Zhang, L., Rettinger, A.: X-LiSA: cross-lingual semantic annotation. PVLDB 7(13), 1693–1696 (2014)
Zhao, S., Li, C., Ma, S., Ma, T., Ma, D.: Combining POS tagging, lucene search and similarity metrics for entity linking. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds.) WISE 2013. LNCS, vol. 8180, pp. 503–509. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41230-1_44
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Färber, M., Rettinger, A., El Asmar, B. (2016). On Emerging Entity Detection. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (eds) Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10024. Springer, Cham. https://doi.org/10.1007/978-3-319-49004-5_15
DOI: https://doi.org/10.1007/978-3-319-49004-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49003-8
Online ISBN: 978-3-319-49004-5
eBook Packages: Computer Science, Computer Science (R0)