On Emerging Entity Detection

  • Conference paper
Knowledge Engineering and Knowledge Management (EKAW 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10024)

Abstract

While large Knowledge Graphs (KGs) already cover a broad range of domains to an extent sufficient for general use, they typically lack emerging entities that are just starting to attract public interest. This disqualifies such KGs from tasks like entity-based media monitoring, since a large portion of news inherently covers entities that have not been noted by the public before. Such entities are unlinkable, which ultimately means that they cannot be monitored in media streams. This is the first paper that thoroughly investigates all types of challenges that arise from out-of-KG entities for entity linking tasks. Through large-scale analytics of news streams, we quantify the importance of each challenge for real-world applications. We then propose a machine learning approach that tackles the most frequent but least investigated challenge, i.e., when entities are missing in the KG and cannot be considered by entity linking systems. We construct a publicly available benchmark data set based on English news articles and editing behavior on Wikipedia. Our experiments show that predicting whether an entity will be added to Wikipedia is challenging. However, we can reliably identify emerging entities that could be added to the KG according to Wikipedia’s own notability criteria.

A. Rettinger—The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346.

Notes

  1. This fact results from our empirical analysis, see Sect. 2.2 for more details.

  2. Emerging relates to trending: Entities can emerge only once. Once they have become notable, any (repeated) increase in public interest is just a trend.

  3. See https://en.wikipedia.org/wiki/Wikipedia:Notability.

  4. See http://people.aifb.kit.edu/mfa/emerging-entity-detection/.

  5. As we are interested in novel/emerging entities, we do not consider deletions of entities or surface forms within \(\varDelta t\).

  6. The remaining few entities are not parseable by the Stanford parser.

  7. Of the set of 300 novel entities manually tagged as named entities, 95 were classified as type Person, 51 as type Location, 27 as type Organization, and 24 as type Event (a subtype of Misc).

  8. For 11,639 of those 41,579 novel entities, however, only the Wikipedia title or redirects changed (due to typo correction or outsourcing of parts of a page). That is, on average over 700 “really” novel entities are inserted into Wikipedia each day. For the task of Emerging Entity Detection (see Sect. 4), we only consider really novel entities that emerge (i.e., that recently gained public interest for the first time).

  9. See http://trec-kba.org/, accessed June 26, 2016.

  10. An entity is here understood as a “noun phrase that could have a Wikipedia-style article if there were no notability or newness considerations, and which would have semantic types” [12].

  11. Note that any text annotation method for Wikipedia could have been applied here.

  12. See http://people.aifb.kit.edu/mfa/emerging-entity-detection.

  13. We also experimented with aggregating all features for each NP series, but this did not yield better evaluation results.

  14. See http://dumps.wikimedia.org/other/pagecounts-raw/.

  15. We also evaluated machine learning algorithms specialized for imbalanced and time-series data, such as cost-sensitive AdaBoost, a cost-sensitive one-class classifier, and recurrent neural networks. However, this did not yield better results.

  16. See more information on our website.

  17. Given the Wikipedia status of 2015-04-04 as the reference KG.

  18. Some of those entities were inserted later.

  19. Investigations revealed that the already existing Wikipedia entities were not annotated by x-LiSA because no suitable surface forms were available for those entities. In most of those cases, the entity was a person, and only the family name was mentioned in the news article and extracted. However, the set of known surface forms from Wikipedia contained only the full name of the entity. Resolving those issues is left to future work.
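
The counts quoted in note 8 can be checked with a few lines of arithmetic. The following is a minimal sketch in Python and is not part of the paper: the variable names are illustrative, and because the length of the observation window \(\varDelta t\) is not stated in these notes, the code only derives the upper bound on the window that is implied by the stated average of over 700 entities per day.

```python
# Figures quoted in note 8.
novel_total = 41_579             # all entities counted as novel
title_or_redirect_only = 11_639  # only the title or redirects changed

# Entities that note 8 calls "really" novel.
really_novel = novel_total - title_or_redirect_only

# Share of novel entities that were mere title/redirect changes.
rename_share = title_or_redirect_only / novel_total

# Note 8 reports an average of MORE than 700 really novel entities per day.
# Assumption: the window length is unknown here, so we only compute the
# upper bound on the number of days that is consistent with that average.
min_daily_rate = 700
max_window_days = really_novel / min_daily_rate

print(really_novel)               # 29940
print(round(rename_share, 3))     # 0.28
print(round(max_window_days, 1))  # 42.8
```

Under these figures, roughly 28% of the entities counted as novel are only title or redirect changes, and the stated daily average is consistent with an observation window of fewer than 43 days.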

References

  1. Ben-Hur, A., Horn, D., Siegelmann, H.T., Vapnik, V.: Support vector clustering. J. Mach. Learn. Res. 2, 125–137 (2002)

  2. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pp. 9–16, Trento, Italy (2006)

  3. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, E., Mitchell, T.: Toward an architecture for never-ending language learning (2010)

  4. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  5. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint Conference on EMNLP-CoNLL, pp. 708–716, Prague, Czech Republic. Association for Computational Linguistics, June 2007

  6. Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T.: Entity disambiguation for knowledge base population. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, Stroudsburg, PA, USA, pp. 277–285. Association for Computational Linguistics (2010)

  7. Dutta, A., Meilicke, C., Stuckenschmidt, H.: Enriching structured knowledge with open information. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Republic and Canton of Geneva, Switzerland, pp. 267–277 (2015)

  8. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Stroudsburg, PA, USA, pp. 1535–1545. Association for Computational Linguistics (2011)

  9. Gottipati, S., Jiang, J.: Linking entities to a knowledge base with query expansion. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Stroudsburg, PA, USA, pp. 804–813. Association for Computational Linguistics (2011)

  10. Hoffart, J., Altun, Y., Weikum, G.: Discovering emerging entities with ambiguous names. In: Proceedings of the 23rd International Conference on World Wide Web, WWW 2014, New York, NY, USA, pp. 385–396. ACM (2014)

  11. Ji, H., Nothman, J., Hachey, B., Florian, R.: Overview of TAC-KBP2015 tri-lingual entity discovery and linking (2015)

  12. Lin, T., Etzioni, O.: No noun phrase left behind: detecting and typing unlinkable entities. In: Proceedings of the 2012 Joint Conference on EMNLP and CoNLL, EMNLP-CoNLL 2012, Stroudsburg, PA, USA, pp. 893–903. ACL (2012)

  13. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, New York, NY, USA, pp. 509–518. ACM (2008)

  14. Nakashole, N., Tylenda, T., Weikum, G.: Fine-grained semantic typing of emerging entities. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 1488–1497 (2013)

  15. Parada, C., Sethy, A., Dredze, M., Jelinek, F.: A spoken term detection framework for recovering out-of-vocabulary words using the web. Paragraph 10(71.24), 323K (2010)

  16. Soboroff, I., Harman, D.: Novelty detection: the TREC experience. In: HLT 2005, Stroudsburg, PA, USA, pp. 105–112. ACL (2005)

  17. Trampuš, M., Novak, B.: Internals of an aggregated web news feed. In: Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012, pp. 431–434 (2012)

  18. Wang, C., Chakrabarti, K., Cheng, T., Chaudhuri, S.: Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, New York, NY, USA, pp. 719–728. ACM (2012)

  19. Wu, Z., Song, Y., Giles, C.L.: Exploring multiple feature spaces for novel entity discovery. In: AAAI 2016, AAAI - Association for the Advancement of Artificial Intelligence, February 2016

  20. Yosef, M.A., Bauer, S., Hoffart, J., Spaniol, M., Weikum, G.: HYENA: hierarchical type classification for entity names. In: COLING 2012, pp. 1361–1370 (2012)

  21. Zhang, L., Färber, M., Rettinger, A.: xLiD-Lexica: cross-lingual Linked data lexica. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2101–2105. ELRA (2014)

  22. Zhang, L., Rettinger, A.: X-LiSA: cross-lingual semantic annotation. PVLDB 7(13), 1693–1696 (2014)

  23. Zhao, S., Li, C., Ma, S., Ma, T., Ma, D.: Combining POS tagging, lucene search and similarity metrics for entity linking. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds.) WISE 2013. LNCS, vol. 8180, pp. 503–509. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41230-1_44

Author information

Corresponding author

Correspondence to Michael Färber.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Färber, M., Rettinger, A., El Asmar, B. (2016). On Emerging Entity Detection. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (eds) Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10024. Springer, Cham. https://doi.org/10.1007/978-3-319-49004-5_15

  • DOI: https://doi.org/10.1007/978-3-319-49004-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49003-8

  • Online ISBN: 978-3-319-49004-5

  • eBook Packages: Computer Science, Computer Science (R0)
