Due to the availability of cost-effective scanners, printers, and image processing software, document fraud is, unfortunately, quite common nowadays. The main challenges in detecting it are the lack of freely available annotated data and the predominance of computer vision approaches. We consider that relying on the textual content of forged documents could provide a different view on their detection by exploring semantic inconsistencies with the aid of specialized knowledge bases. We therefore perform an exhaustive study of existing state-of-the-art methods based on knowledge-graph embeddings (KGE), using a synthetically forged, yet realistic, receipt dataset. We also explore incremental data enrichments of the knowledge base, in order to analyze the impact of the richness of the knowledge base on each KGE method. The reported results show that the performance of the methods varies considerably depending on the type of approach. Also, as expected, performance rises with the size of the data enrichment. Finally, we conclude that, while exploring the semantics of documents is promising, document forgery detection still poses a challenge for KGE methods.
This work was supported by the French defense innovation agency (AID) and the VERINDOC project funded by the Nouvelle-Aquitaine Region.
The dataset was split into a training set and a test set (80% and 20%, respectively) using the PyKEEN library https://github.com/pykeen/pykeen [2], so that no redundant triples appear in both the training and test sets. The previously presented methods are implemented in PyKEEN, a library that we chose for its completeness, flexibility, and ease of use.
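To make the splitting step concrete, the following is a minimal, standard-library sketch of an 80%/20% split in which exact duplicate triples are removed first, so that no triple can appear on both sides. It is an illustration only and not PyKEEN's implementation; the function name `split_triples`, the seed, and the receipt-style triples are hypothetical and do not come from our dataset.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def split_triples(
    triples: List[Triple], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Triple], List[Triple]]:
    """Deduplicate, shuffle, and split triples into train and test lists."""
    # Drop exact duplicates (keeping first-seen order) so that the same
    # triple cannot end up in both the training and the test set.
    unique = list(dict.fromkeys(triples))
    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(unique)
    cut = int(round(train_ratio * len(unique)))
    return unique[:cut], unique[cut:]


# Hypothetical receipt-style triples, with one deliberate duplicate.
triples: List[Triple] = [
    ("receipt_1", "issued_by", "company_A"),
    ("receipt_1", "contains", "product_X"),
    ("receipt_1", "has_total", "12.50"),
    ("receipt_2", "issued_by", "company_A"),
    ("receipt_2", "contains", "product_Y"),
    ("receipt_2", "has_total", "7.90"),
    ("receipt_3", "issued_by", "company_B"),
    ("receipt_3", "contains", "product_X"),
    ("receipt_3", "has_total", "3.20"),
    ("company_A", "located_in", "city_1"),
    ("receipt_1", "issued_by", "company_A"),  # duplicate of the first triple
]

train, test = split_triples(triples)
print(len(train), len(test))
```

A library-level split typically does more than this sketch, for example making sure that every entity and relation seen at test time also occurs in training, which is one reason to rely on an existing library rather than a hand-written split.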
