Compute the Freebase curation ratio per property
Open, Needs Triage, Public

Description

Based on the current status of the back end v1, output a curation ratio (approved / rejected) for each main property of the Freebase datasets.
See https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool/Version_1#Statements_per_property

Event Timeline

I just did a quick computation of such ratios: https://docs.google.com/spreadsheets/d/1o4zNWLesoe4OSLmfQsILW0ChprIqoP8ISVS3k3vTc-E/edit?usp=sharing
There are two sheets: one covering only claims (i.e. ignoring references) and one covering only statements with references.
It seems there are some properties with very high quality claims that we could import automatically, and overall the approved/(approved+rejected) ratio for claims is fairly good.
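
For anyone wanting to reproduce this kind of figure, here is a minimal sketch of the approved/(approved+rejected) computation per property. It is not the script used for the spreadsheet: the input shape (a list of `(property, state)` pairs), the state labels `"approved"` and `"rejected"`, the function name `curation_ratios`, and the sample rows are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def curation_ratios(statements):
    """Return {property: approved / (approved + rejected)}.

    `statements` is an iterable of (property, state) pairs. The state
    labels "approved" and "rejected" are assumed names, not necessarily
    the ones stored by the back end. Any other state is treated as not
    yet curated and is left out of the ratio.
    """
    counts = defaultdict(Counter)
    for prop, state in statements:
        counts[prop][state] += 1

    ratios = {}
    for prop, by_state in counts.items():
        reviewed = by_state["approved"] + by_state["rejected"]
        if reviewed:  # skip properties nobody has reviewed yet
            ratios[prop] = by_state["approved"] / reviewed
    return ratios


# Invented sample rows, only to show the input shape.
sample = [
    ("P434", "approved"),
    ("P434", "approved"),
    ("P434", "rejected"),
    ("P19", "rejected"),
    ("P19", "unapproved"),
]
print(curation_ratios(sample))
```

Properties with no reviewed statements are omitted rather than reported as zero, so an unreviewed property is not mistaken for a low-quality one.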

It would seem like the 2018-03-13 spreadsheet should be adequate to call this task complete. I would recommend including some qualitative understanding of the source of the Freebase data in addition to just pure curation ratio when making judgements about how to use which data. Things like MusicBrainz IDs and ISFDB IDs went through a heavily QA'd reconciliation process and are going to be high quality. Films, and to a lesser extent TV shows, were an area of focus for the Freebase team, so will generally be both high quality and relatively complete.

Also, many of the quality issues with the initial data set had nothing to do with the Freebase data itself, but with the junky "evidence" URLs that Google produced after the fact to satisfy the Wikidata call for evidence. These tend to be of much, much lower quality than the data itself.

Of course, after so many years, much of the value of the data has been squandered, but I bet there are still some areas where it could be used to significantly improve Wikidata.

Aklapper removed Tpt as the assignee of this task. Jun 19 2020, 4:12 PM
Aklapper added a subscriber: Tpt.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)
