Suggestions endpoints for SDC image caption addition/translation
Adds endpoints for suggesting Commons files for SDC caption editing.
The suggestion algorithm is as follows:
- A pool of 500 Commons file candidates with the required source and target language characteristics is requested using the CirrusSearch-powered wikimediaeditortaskssuggestions Action API module, along with MIME info from imageinfo;
- Candidates with non-image MIME types are filtered out;
- The candidate set is narrowed to a random sample of 50 images;
- If data on legacy unstructured ImageDescriptions is required, an additional imageinfo request is made for this data for the remaining candidates;
- Info on structured data is requested for the candidates;
- Candidates are again filtered based on the desired properties, and those that remain are returned.
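The in-memory part of the steps above can be sketched roughly as follows. This is an illustration in Python, not the code in this change: the candidate dict shape (the "mime" and "captions" keys), the function names, and the final "no caption in the target language" filter are assumptions made for the example. Only the pool size of 500 and the sample size of 50 come from the description.

```python
import random

POOL_SIZE = 500    # initial candidate pool size, from the description above
SAMPLE_SIZE = 50   # random sample size, from the description above


def is_image(candidate):
    """Keep only candidates whose MIME type is an image type."""
    return candidate.get("mime", "").startswith("image/")


def needs_caption(candidate, target_lang):
    """Hypothetical final filter: no structured caption in the target language."""
    return target_lang not in candidate.get("captions", {})


def suggest(pool, target_lang, rng=random):
    """Filter by MIME type, take a random sample, then filter on captions."""
    images = [c for c in pool[:POOL_SIZE] if is_image(c)]
    sample = rng.sample(images, min(SAMPLE_SIZE, len(images)))
    return [c for c in sample if needs_caption(c, target_lang)]
```

Because the final filter runs after sampling, the number of suggestions returned can be anywhere from 0 to 50.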
Notes:
For reasons yet to be determined, the initial candidate search
occasionally returns invalid candidates, necessitating a follow-up
wbgetentities query in all cases. If this is fixed, and structured
captions info is not required in the suggestions response, then this
query can be eliminated.
There is some debate over whether the inclusion criteria should take
into account the presence or absence of unstructured captions.
Requesting this info slows down the response dramatically.
The imageinfo request, where needed, is made on its own rather than
as part of the initial candidate request. This is to increase the
randomness of the suggestions served. Results from the initial search
are not random across requests, so a large pool is needed to sample
from, and including an imageinfo extmetadata query in that request would
mean lowering its limit to approximately 50 to keep it from timing out.
Lowering the candidate sample size extracted from the initial pool of
candidates would also mitigate the impact of the extmetadata query, but
at the expense of fewer results after filtering and possible
zero-results responses.
To demonstrate this, multiple implementations of both endpoints are
available here for testing, with implementations involving unstructured
captions available via query parameters:
/caption/addition/{_target}?includeUnstructured
Requests unstructured captions and includes them in the response, but
performs no filtering based on their presence or absence.
/caption/addition/{_target}?requireUnstructured
Requests unstructured captions and filters out candidates that do
not have one in the target language.
/caption/translation/from/{source}/to/{_target}?includeUnstructured
Requests unstructured captions and includes them in the response, but
performs no filtering based on their presence or absence.
/caption/translation/from/{source}/to/{_target}?requireUnstructuredSource
Requests unstructured captions and filters out candidates that do not
have one in the source language.
/caption/translation/from/{source}/to/{_target}?requireUnstructuredSourceNo_target
Requests unstructured captions and filters out candidates that do not
have one in the source language or do have one in the target language.
Without a query parameter, unstructured captions are not requested.
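For the translation endpoint, the filtering behavior of these variants can be sketched as below. This is a Python illustration under assumptions, not the code in this change: the Mode enum and its member names are hypothetical stand-ins for the query parameters listed above, and `unstructured` is assumed to be a mapping from language code to legacy description text.

```python
from enum import Enum


class Mode(Enum):
    NONE = "none"                      # no query parameter: captions not requested
    INCLUDE = "include"                # requested and returned, never filtered on
    REQUIRE_SOURCE = "require_source"  # must exist in the source language
    REQUIRE_SOURCE_NO_TARGET = "require_source_no_target"


def keep_for_translation(unstructured, source, target, mode):
    """Decide whether a candidate survives the unstructured-caption filter."""
    if mode in (Mode.NONE, Mode.INCLUDE):
        return True  # no filtering on unstructured captions
    if source not in unstructured:
        return False  # both "require" modes need a source-language caption
    if mode is Mode.REQUIRE_SOURCE_NO_TARGET and target in unstructured:
        return False  # already described in the target language
    return True
```

The addition endpoint would follow the same pattern with a single "require" mode that checks the target language only.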
Note: This code will shrink once the final implementation is agreed
upon.
Bug: T209997
Bug: T220034
Change-Id: I862bd382e4d93921a92467bd5a66435acd3ee53a