Knowledge-Based Techniques for Document Fraud Detection: A Comprehensive Study

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Abstract

Due to the availability of cost-effective scanners, printers, and image processing software, document fraud is, unfortunately, quite common nowadays. The main challenges in detecting it are the lack of freely available annotated data and the predominance of approaches based mainly on computer vision. We consider that relying on the textual content of forged documents could provide a different view on their detection, by exploring semantic inconsistencies with the aid of specialized knowledge bases. We thus perform an exhaustive study of existing state-of-the-art methods based on knowledge graph embeddings (KGE), using a synthetically forged, yet realistic, receipt dataset. We also explore incremental data enrichments of the knowledge base in order to analyze the impact of its richness on each KGE method. The reported results show that the performance of the methods varies considerably depending on the type of approach. Also, as expected, performance rises with the size of the data enrichment. Finally, we conclude that, while exploring the semantics of documents is promising, document forgery detection still poses a challenge for KGE methods.

This work was supported by the French defense innovation agency (AID) and the VERINDOC project funded by the Nouvelle-Aquitaine Region.

Notes

  1. Comités opérationnels départementaux anti-fraude (departmental operational anti-fraud committees): https://www.economie.gouv.fr/codaf-comites-operationnels-departementaux-anti-fraude.

  2. https://en.wikipedia.org/wiki/SIREN_code.

  3. https://en.wikipedia.org/wiki/SIRET_code.

  4. https://www.insee.fr/en/accueil.

  5. http://sirene.fr/siren/public/home.

  6. https://api.gouv.fr/les-api/base-adresse-nationale.

  7. The dataset was split into a training set and a test set (80% and 20%, respectively) using the PyKEEN library (https://github.com/pykeen/pykeen) [2], so that redundant triples do not appear in both the training and test sets. The previously presented methods are implemented in PyKEEN, a library that we chose for its completeness, flexibility, and ease of use.
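The split described in note 7 can be illustrated with a minimal sketch. It is not PyKEEN's own splitting routine and not the authors' code: it is a plain-Python approximation that assumes triples are (head, relation, tail) tuples, removes duplicates, and then cuts the shuffled list 80/20. The function name `split_triples`, the seed, and the receipt-style triples are all hypothetical and serve only as an illustration.

```python
import random


def split_triples(triples, train_ratio=0.8, seed=0):
    """Deduplicate (head, relation, tail) triples, then split them train/test."""
    # Removing duplicates first guarantees that no triple lands in both sets.
    unique = sorted(set(triples))
    # A seeded generator keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(len(unique) * train_ratio)
    return unique[:cut], unique[cut:]


# Hypothetical receipt-style triples; entity and relation names are invented.
triples = [
    ("receipt_1", "issued_by", "company_A"),
    ("receipt_1", "has_total", "12.50"),
    ("receipt_1", "has_date", "2017-03-02"),
    ("company_A", "has_siret", "siret_001"),
    ("company_A", "located_in", "city_X"),
    ("receipt_2", "issued_by", "company_B"),
    ("receipt_2", "has_total", "7.90"),
    ("receipt_2", "has_date", "2017-03-05"),
    ("company_B", "has_siret", "siret_002"),
    ("company_B", "located_in", "city_Y"),
    ("receipt_1", "issued_by", "company_A"),  # deliberate duplicate
]

train, test = split_triples(triples)
print(len(train), len(test))
```

With ten distinct triples, the sketch yields eight for training and two for testing. A library splitter may apply further constraints (for example on which entities must appear in training) that this sketch does not attempt to model.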

References

  1. Abiteboul, S.: Semistructured data: from practice to theory. In: Proceedings 16th Annual IEEE Symposium on Logic in Computer Science. IEEE (2001)

  2. Ali, M., Berrendorf, M., Hoyt, C.T., Vermue, L., Sharifzadeh, S., Tresp, V., Lehmann, J.: PyKEEN 1.0: a Python library for training and evaluating knowledge graph embeddings (2020)

  3. Artaud, C., Doucet, A., Ogier, J.M., d’Andecy, V.P.: Receipt dataset for fraud detection. In: First International Workshop on Computational Document Forensics (2017)

  4. Artaud, C., Sidère, N., Doucet, A., Ogier, J.M., d’Andecy, V.P.: Find it! fraud detection contest report. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE (2018)

  5. Artaud, C.: Détection des fraudes : de l’image à la sémantique du contenu : application à la vérification des informations extraites d’un corpus de tickets de caisse. Ph.D. thesis (2019)

  6. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52

  7. Balazevic, I., Allen, C., Hospedales, T.: Multi-relational Poincaré graph embeddings. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  8. Balažević, I., Allen, C., Hospedales, T.M.: TuckER: tensor factorization for knowledge graph completion (2019)

  9. Barzilay, R., Lapata, M.: Modeling local coherence: an entity-based approach. Comput. Linguist. 34(1), 1–34 (2008)

  10. Behera, T.K., Panigrahi, S.: Credit card fraud detection: a hybrid approach using fuzzy clustering & neural network. In: 2015 2nd International Conference on Advances in Computing and Communication Engineering. IEEE (2015)

  11. Berti-Équille, L., Borge-Holthoefer, J.: Veracity of data: from truth discovery computation algorithms to models of misinformation dynamics. Synth. Lect. Data Manag. 7(3), 1–155 (2015)

  12. Bertrand, R., Gomez-Kramer, P., Terrades, O.R., Franco, P., Ogier, J.M.: A system based on intrinsic features for fraudulent document detection. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 106–110. IEEE, Washington, DC, USA (2013)

  13. Bertrand, R., Terrades, O.R., Gomez-Krämer, P., Franco, P., Ogier, J.M.: A conditional random field model for font forgery detection. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE (2015)

  14. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250. SIGMOD’08, Association for Computing Machinery, New York, NY, USA (2008)

  15. Bordes, A., Usunier, N., Garcia-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, Lake Tahoe, Nevada, vol. 2, pp. 2787–2795. Curran Associates Inc., Red Hook, NY, USA (2013)

  16. Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embeddings of knowledge bases. In: Proceedings of the 25th AAAI Conference on Artificial Intelligence, AAAI’11, San Francisco, California, pp. 301–306. AAAI Press (2011)

  17. Cozzolino, D., Gragnaniello, D., Verdoliva, L.: Image forgery detection through residual-based local descriptors and block-matching. In: 2014 IEEE International Conference on Image Processing (ICIP). IEEE (2014)

  18. Cozzolino, D., Poggi, G., Verdoliva, L.: Efficient dense-field copy-move forgery detection. IEEE Trans. Inf. Forensics Secur. 10(11), 2284–2297 (2015)

  19. Cozzolino, D., Verdoliva, L.: Camera-based image forgery localization using convolutional neural networks. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE (2018)

  20. Cozzolino, D., Verdoliva, L.: Noiseprint: a CNN-based camera model fingerprint (2018)

  21. Cruz, F., Sidere, N., Coustaty, M., d’Andecy, V.P., Ogier, J.M.: Local binary patterns for document forgery detection. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE (2017)

  22. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings (2018)

  23. Dong, X., et al.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’14, New York, New York, USA, pp. 601–610. Association for Computing Machinery, New York, NY, USA (2014)

  24. EulerHermes-DFCG: Plus de 7 entreprises sur 10 ont subi au moins une tentative de fraude cette année. https://www.eulerhermes.fr/actualites/etude-fraude-2020.html

  25. Fridrich, J., Kodovsky, J.: Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 7(3), 868–882 (2012)

  26. Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. 24(6), 707–730 (2015). https://doi.org/10.1007/s00778-015-0394-1

  27. Gesese, G.A., Biswas, R., Alam, M., Sack, H.: A survey on knowledge graph embeddings with literals: which model links better literal-ly? (2020)

  28. Goyal, N., Sachdeva, N., Kumaraguru, P.: Spy the lie: fraudulent jobs detection in recruitment domain using knowledge graphs. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, S.-Y. (eds.) KSEM 2021. LNCS (LNAI), vol. 12816, pp. 612–623. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-82147-0_50

  29. He, S., Liu, K., Ji, G., Zhao, J.: Learning to represent knowledge graphs with Gaussian embedding. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (2015)

  30. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products. J. Math. Phys. 6, 1–4 (1927)

  31. Huynh, V.P., Papotti, P.: A benchmark for fact checking algorithms built on knowledge bases. In: 28th ACM International Conference on Information and Knowledge Management, CIKM’19, 3rd-7th November 2019, Beijing, China (2019)

  32. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (volume 1: Long papers) (2015)

  33. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: representation, acquisition and applications (2020)

  34. Kazemi, S.M., Poole, D.: SimplE embedding for link prediction in knowledge graphs (2018)

  35. Kim, J., Kim, H.-J., Kim, H.: Fraud detection for job placement using hierarchical clusters-based deep neural networks. Appl. Intell. 49(8), 2842–2861 (2019). https://doi.org/10.1007/s10489-019-01419-2

  36. Kowshalya, G., Nandhini, M.: Predicting fraudulent claims in automobile insurance. In: 2018 2nd International Conference on Inventive Communication and Computational Technologies (ICICCT). IEEE (2018)

  37. Li, Y., Yan, C., Liu, W., Li, M.: Research and application of random forest model in mining automobile insurance fraud. In: 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). IEEE (2016)

  38. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29 (2015)

  39. Mishra, A., Ghorpade, C.: Credit card fraud detection on the skewed data using various classification and ensemble techniques. In: 2018 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS). IEEE (2018)

  40. Nickel, M., Rosasco, L., Poggio, T.: Holographic embeddings of knowledge graphs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)

  41. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: Proceedings of the 28th International Conference on Machine Learning, ICML’11, pp. 809–816 (2011)

  42. Rabah, C.B., Coatrieux, G., Abdelfattah, R.: The Supatlantique scanned documents database for digital image forensics purposes. In: 2020 IEEE International Conference on Image Processing (ICIP). IEEE (2020)

  43. Rizki, A.A., Surjandari, I., Wayasti, R.A.: Data mining application to detect financial fraud in Indonesia’s public companies. In: 2017 3rd International Conference on Science in Information Technology (ICSITech). IEEE (2017)

  44. Rossi, A., Firmani, D., Matinata, A., Merialdo, P., Barbosa, D.: Knowledge graph embedding for link prediction: a comparative analysis (2020)

  45. Rossi, A., Matinata, A.: Knowledge graph embeddings: are relation-learning models learning relations? In: EDBT/ICDT Workshops (2020)

  46. Shen, A., Mistica, M., Salehi, B., Li, H., Baldwin, T., Qi, J.: Evaluating document coherence modeling. Trans. Assoc. Comput. Linguist. 9, 621–640 (2021)

  47. Shi, B., Weninger, T.: ProjE: embedding projection for knowledge graph completion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)

  48. Sidere, N., Cruz, F., Coustaty, M., Ogier, J.M.: A dataset for forgery detection and spotting in document images. In: 2017 7th International Conference on Emerging Security Technologies (EST). IEEE (2017)

  49. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, WWW’07, pp. 697–706. Association for Computing Machinery, New York, NY, USA (2007)

  50. Sun, Z., Deng, Z.H., Nie, J.Y., Tang, J.: RotatE: knowledge graph embedding by relational rotation in complex space (2019)

  51. Thorne, J., Vlachos, A.: Automated fact checking: task formulations, methods and future directions. CoRR (2018)

  52. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction (2016)

  53. Van Der Maaten, L.: Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15(1), 3221–3245 (2014)

  54. Vidros, S., Kolias, C., Kambourakis, G., Akoglu, L.: Automatic detection of online recruitment frauds: characteristics, methods, and a public dataset. Future Internet 9(1), 6 (2017)

  55. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743 (2017)

  56. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28 (2014)

  57. Yang, B., Yih, W.-t., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases (2015)

  58. Zhang, S., Tay, Y., Yao, L., Liu, Q.: Quaternion knowledge graph embeddings (2019)

  59. Zhang, W., Paudel, B., Zhang, W., Bernstein, A., Chen, H.: Interaction embeddings for prediction and explanation in knowledge graphs. In: Proceedings of the 12th ACM International Conference on Web Search and Data Mining (2019)

Corresponding author

Correspondence to Antoine Doucet.

Copyright information

© 2023 Springer Nature Switzerland AG

Cite this paper

Tornés, B.M., Boros, E., Doucet, A., Gomez-Krämer, P., Ogier, JM., d’Andecy, V.P. (2023). Knowledge-Based Techniques for Document Fraud Detection: A Comprehensive Study. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_2

  • DOI: https://doi.org/10.1007/978-3-031-24337-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24336-3

  • Online ISBN: 978-3-031-24337-0

  • eBook Packages: Computer Science, Computer Science (R0)
