Wikidata:WikiFactMine/Bridging the gap

Bridging the gap

The material for WikiFactMine starts in a repository of scientific papers. Currently Europe PubMed Central (Europe PMC) is used. WikiFactMine then works from XML downloads of the papers held there.

The next major step applied to a paper is to search it. For the project's purposes, it is not enough to search for one term at a time, as a standard search engine would. The basic reason is that the sought-after statements contain two terms (say, a gene and a protein), and both terms are variable, not fixed. Searching directly for every possible pair is impractical for reasons of scale. A kind of information retrieval infrastructure must be built to sustain the “mental arithmetic” of the dual search.
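As a rough illustration of why an index helps, the sketch below makes one pass over each paper, records which dictionary terms of each type occur in it, and only then forms pairs from what was actually found. It is a minimal sketch under stated assumptions, not a description of WikiFactMine's actual infrastructure: the function names, the toy dictionaries and the restriction to single-word, lower-cased terms are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-cased alphanumeric tokens (a deliberate simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_pairs(papers, terms_a, terms_b):
    """Return {(a, b): set of paper ids} for terms that co-occur in a paper.

    papers  -- dict mapping a paper id to its plain text (hypothetical input shape)
    terms_a -- set of lower-cased terms of the first type, e.g. gene names
    terms_b -- set of lower-cased terms of the second type, e.g. protein names
    """
    pairs = defaultdict(set)
    for paper_id, text in papers.items():
        tokens = set(tokenize(text))
        # Single-term lookups only: which dictionary terms does this paper mention?
        found_a = tokens & terms_a
        found_b = tokens & terms_b
        # Pairs are formed per paper, so no pair is ever searched for directly.
        for a in found_a:
            for b in found_b:
                pairs[(a, b)].add(paper_id)
    return dict(pairs)
```

The cost here grows with the number of papers and the terms each one mentions, rather than with the number of possible pairs in the two dictionaries, which is the point of building the index first.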

Then it is practical to turn up the cases where two given terms of the desired types occur in a paper. They can be presented, by an API, to a human reader in context, in the form of text snippets. For fact mining, if the terms are mentioned far apart in the paper, the mined instance is not going to support any statement of the kind desired, so it can be ruled out.
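The proximity test and the snippets can be sketched as follows. This is an illustrative sketch only: the function names, the measure of distance (a count of intervening tokens), and the default values of `max_gap` and `context` are assumptions made here, not figures taken from the project.

```python
import re


def token_spans(text):
    """Return (lower-cased token, start offset, end offset) for each token in text."""
    return [(m.group(0).lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def nearby_mentions(text, term_a, term_b, max_gap=10, context=30):
    """Return text snippets where term_a and term_b lie within max_gap tokens.

    max_gap -- largest allowed distance in tokens (an arbitrary, assumed threshold)
    context -- number of characters of surrounding text kept on each side
    """
    tokens = token_spans(text)
    pos_a = [i for i, (tok, _, _) in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, (tok, _, _) in enumerate(tokens) if tok == term_b]
    snippets = []
    for i in pos_a:
        for j in pos_b:
            if abs(i - j) > max_gap:
                continue  # mentions too far apart to support a statement
            first, last = min(i, j), max(i, j)
            start = max(0, tokens[first][1] - context)
            end = min(len(text), tokens[last][2] + context)
            snippets.append(text[start:end])
    return snippets
```

An empty result means every pair of mentions was too far apart, so the paper can be ruled out for that pair without a human looking at it.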

When the two terms are close together, human scrutiny usually settles quickly whether the actual wording supports a putative statement. People can handle paraphrase.

The issue that then remains is whether there are too few mined facts. The “signal”, namely the sequence of identifiable statements fit for Wikidata, may be faint. Right now, the matter rests here.