Unsupervised Keyphrase Extraction from Scientific Publications

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Abstract

We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection. It starts by training word embeddings on the target document to capture semantic regularities among its words. It then uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, which indicates their irrelevance to the semantics expressed by the dimensions of the learned vector representation. Candidate keyphrases consist only of words that are detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state-of-the-art and recent unsupervised keyphrase extraction methods.
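The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `MinCovDet` as the minimum covariance determinant estimator, uses synthetic vectors in place of embeddings trained on the target document, and picks a quantile cutoff on the squared Mahalanobis distance purely for illustration. The function name `outlier_words` and the `quantile` parameter are hypothetical.

```python
import numpy as np
from sklearn.covariance import MinCovDet


def outlier_words(words, vectors, quantile=0.9, random_state=0):
    """Return the words whose vectors lie far from the dominant distribution.

    The cutoff (an upper quantile of the squared Mahalanobis distances)
    is an illustrative assumption, not the rule used in the paper.
    """
    # Robust location/covariance estimate of the bulk of the word vectors.
    mcd = MinCovDet(random_state=random_state).fit(vectors)
    # MinCovDet.mahalanobis returns squared Mahalanobis distances.
    d2 = mcd.mahalanobis(vectors)
    cutoff = np.quantile(d2, quantile)
    return [w for w, d in zip(words, d2) if d > cutoff]


# Synthetic stand-in for locally trained embeddings (assumption):
# 95 "ordinary" word vectors plus 5 vectors placed far from the bulk.
rng = np.random.default_rng(0)
bulk = rng.normal(0.0, 1.0, size=(95, 5))
far = rng.normal(8.0, 1.0, size=(5, 5))
vectors = np.vstack([bulk, far])
words = [f"w{i}" for i in range(100)]

candidates = outlier_words(words, vectors)
print(candidates)
```

In this toy setting the five shifted vectors (`w95` to `w99`) fall among the returned candidates. In the approach described above, the words detected as outliers would then be the only ones allowed to form candidate keyphrases.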

Notes

  1. https://github.com/stanfordnlp/GloVe

  2. https://www.nltk.org/

  3. https://scikit-learn.org

Author information

Corresponding author

Correspondence to Eirini Papagiannopoulou.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Papagiannopoulou, E., Tsoumakas, G. (2023). Unsupervised Keyphrase Extraction from Scientific Publications. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_16

  • DOI: https://doi.org/10.1007/978-3-031-24337-0_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24336-3

  • Online ISBN: 978-3-031-24337-0

  • eBook Packages: Computer Science, Computer Science (R0)
