
Add a link: prioritize suggestions of underlinked articles
Closed, ResolvedPublic

Description

In seeing the usage of "add a link" in the wikis, community members have suggested that we direct newcomers' energies toward those articles that need links the most. To that end, we want to figure out a way to prioritize suggesting articles that appear underlinked, based on the ratio of links in the article to what would be expected.

We do not have exact logic to determine how "underlinked" an article is. That logic would be part of this task.

We also know there may be technical challenges depending on how this is implemented. We should talk about pros and cons of different implementations. Depending on the logic, perhaps adjustments could be made via community configuration.

Related Objects

Event Timeline

kostajh lowered the priority of this task from High to Medium.Mar 14 2022, 8:11 PM

I have moved this task from the "add a link iteration 2" epic to the "improvements" epic. That's because this improvement will not be a blocker for deploying to more wikis. But it is still a near-term priority and belongs on our sprint board.

Copying a comment made on Slack:

Do we want to prioritize underlinked articles or limit tasks to them? (The latter would potentially mean fewer tasks, e.g. hiwiki only has 7K tasks currently, so it might be worth checking how many tasks such a filter would result in. Of course, with per-wiki configuration, wikis with fewer tasks can always just disable this.)

If we strictly want to prioritize, I can think of two ways:

  • Create a custom CirrusSearch sort, which should be able to use a simple mathematic formula with the number of links and the number of bytes as parameters.
  • When a good task candidate has been found, calculate some sort of underlinkedness score. (This could be more complex, e.g. could look at the Parsoid HTML to filter out links from templates.) Use the score as the weight for the recommendation.link weighted tag. Change HasRecommendationFeature to handle weights (if it needs to be changed, not sure about that).

If we are fine with filtering, then T301096#7773354 could work, but might slow the refresh script down (lots of false positive task candidates, assuming most pages are not underlinked). Pushing the condition into the CirrusSearch query for finding task candidates (either as a filter or as a sort) would be better IMO.

kostajh raised the priority of this task from Medium to High.May 24 2022, 4:24 PM

Per discussion with @MMiller_WMF now, we want to prioritize rather than filter.

I am not exactly sure how we will combine prioritization with the random sorting that we have, though.

  • Create a custom CirrusSearch sort, which should be able to use a simple mathematic formula with the number of links and the number of bytes as parameters.

That sounds like a good plan, assuming that we don't need to ask the search team for any of their time to get this working.

That sounds like a good plan, assuming that we don't need to ask the search team for any of their time to get this working.

I imagine we'd want a +1 from them just to make sure we aren't doing anything stupid with the custom ElasticSearch query fragment the sort is adding, but I think that's a very lightweight commitment.

Change 801010 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] [WIP] Add rescore method for sorting by underlinkedness

https://gerrit.wikimedia.org/r/801010

MediaWiki core defines the outgoing_links field as SearchIndexField::INDEX_TYPE_KEYWORD, which CirrusSearch translates as type: text, analyzer: keyword. Unfortunately, scripts cannot access 'text' fields, regardless of the analyzer, so while a custom rescore function would work with a type: keyword field, it doesn't work with this one. There is a flag for changing that, but I doubt it's feasible to enable. And without scripting, the ElasticSearch sort and rescore logic doesn't have anything resembling array size (number of outgoing links). Not sure how to recover from that.

It seemed to me that such a task was not created, but here it is! :)

Can we, regardless of filtering, prioritize articles with a certain template?

Can we, regardless of filtering, prioritize articles with a certain template?

We can, via the boost-templates CirrusSearch feature, but that would be another task as the mechanism is different (the only overlap would be that both options would require changing our current random sort method to some real sort that doesn't discard the weights from the search query).

To recap the situation: GrowthExperiments is finding link recommendation tasks for users with a hasrecommendation:link query (potentially with other filters like articletopic mixed in), with sort=random to avoid edit conflicts. Link recommendations are useful to get new editors engaged and increase retention, but not so useful to improve an established article (it's hard, both for an algorithm and a new editor, to judge which links would be relevant); the community is much more favorable to link recommendation based edits being done on new / underdeveloped articles. Link frequency is a reasonable approximation for how well-developed an article is, so we'd like to weight search results by that. (And hopefully still avoid edit conflicts; not sure how that would work. But that's a secondary concern for now.)

I thought the best approach would be a custom rescore profile in CirrusSearch, using a boost function like doc['text_bytes'] / doc['outgoing_link'].length. That turns out not to work: outgoing_links is a field with type text, and text fields aren't exposed to scripts; and without scripting, there doesn't seem to be any way to use the length of a field (ie. array size) in a rescore query.

As far as I can see the options are:

  1. Maybe there is something I missed, and there is a way to access text fields in scripts, or to access array size without scripting in a rescore query. That would be great.
  2. The mapping for outgoing_links could be set to fielddata=true, which does allow scripts to access it. But it would probably have an adverse effect on memory usage, and thus performance.
  3. A new outgoing_link_count field could be added to the search index.
  4. We could calculate the rescore factor (article length divided by link count, or something similar) outside Cirrus, and then import the scores (maybe as weights for the recommendation.link/exists weighted tags, and then create a variant of HasRecommendationFeature that would take tag weights into account). But that would probably be both more effort (needs an import mechanism) and less flexible (any changes to the rescore logic would require reimporting the data).

Accessing the array count in realtime without specific mapping will be expensive if even possible. Once indexed the array no longer exists and the only way to find out would be to decode the json blob that contains the entire document, or walk the positions lists and count the gaps. Not really plausible to do while scoring.

A new subfield of outgoing_link would probably be the best bet. Much like how the text.word_count field tokenizes the input into words and counts the number of tokens, the same functionality could be used with the keyword analyzer to count the number of entries in the outgoing_link field. The patch for this should be quite easy, but deploying it into production will require reindexing all wikis. Is the intent to use this across the fleet, or is there a subset of wikis that could be done? A reindex of all wikis is sadly a bit error prone and takes more than a week to process (and has annoying interactions with various maintenance activities), but a few target wikis would be easy enough to do in a couple of days.
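To make the subfield idea more concrete, here is a rough sketch of what a rescore fragment could look like once a count subfield exists. It only builds a JSON-serializable dictionary in the general shape of an Elasticsearch rescore with a function_score script. The subfield name outgoing_link.token_count is borrowed from the CirrusSearch patch title that appears later in this thread; the function name, the window size, and the Painless expression are assumptions for illustration and have not been run against a cluster. This is not the query GrowthExperiments actually sends.

```python
import json


def build_underlinked_rescore(window_size: int = 1000) -> dict:
    """Build an illustrative rescore fragment (hypothetical helper).

    The script approximates "1 - links/words" using two token_count
    subfields. Field availability and the exact script are assumptions.
    """
    # Painless source: guard against missing values, then compute the ratio.
    script_source = (
        "double words = doc['text.word_count'].size() == 0"
        " ? 0 : doc['text.word_count'].value;"
        " if (words == 0) { return 0.0; }"
        " double links = doc['outgoing_link.token_count'].size() == 0"
        " ? 0 : doc['outgoing_link.token_count'].value;"
        " return Math.max(0.0, 1.0 - links / words);"
    )
    return {
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        "functions": [
                            {"script_score": {"script": {
                                "lang": "painless",
                                "source": script_source,
                            }}}
                        ]
                    }
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_underlinked_rescore(), indent=2))
```

The point of the sketch is that a numeric subfield is readable from a script, whereas the text-typed outgoing_link field itself is not.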

Change 816353 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] [DNM] Use params._source for rescoring

https://gerrit.wikimedia.org/r/816353

Thanks a lot for the help @EBernhardson!

Tests for the prioritized search results are available at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing
The first column contains the first 100 results from the current (randomized) search logic, the second column from the new prioritized logic. The sheets called <wiki>_with_topic use a search with a topic filter (video games) enabled. (Topic filtering will affect search rankings, so this is significantly different - it is possible that prioritization will perform well with topics but poorly without topics, or vice versa.)
Let me know if a different format or content would be more useful.

The script used to make these files:

#!/bin/bash

WIKIS="arwiki bnwiki cswiki eswiki frwiki huwiki viwiki"
CIRRUS_SERVER="https://search.svc.eqiad.wmnet:9243"

convert_response_to_csv() {
    jq --raw-output \
        --arg column_header "$1" \
        --arg wiki_lang "${2:0:-4}" \
        '[$column_header], (.hits.hits[]._source.title | { "title": ., "titlee": . | gsub(" "; "_") | @uri } | ["=HYPERLINK(\"https://\($wiki_lang).wikipedia.org/wiki/\(.titlee)\",\"\(.title)\")"]) | @csv'
}

for WIKI in $WIKIS; do
    for WITH_TOPIC in _with_topic ''; do
        curl -s "${CIRRUS_SERVER}/${WIKI}_content/_search?pretty" \
                -H 'Content-Type: application/json' \
                -d @<(jq '. | del(.rescore)' addlink_rescore_underlinked${WITH_TOPIC}.json) \
            | convert_response_to_csv 'Normal' "$WIKI" \
            > tmp_addlink_plain${WITH_TOPIC}.csv
        curl -s "${CIRRUS_SERVER}/${WIKI}_content/_search?pretty" \
                -H 'Content-Type: application/json' \
                -d @addlink_rescore_underlinked${WITH_TOPIC}.json \
            | convert_response_to_csv 'Prioritized' "$WIKI" \
            > tmp_addlink_rescore_underlinked${WITH_TOPIC}.csv
        paste -d, tmp_addlink_plain${WITH_TOPIC}.csv tmp_addlink_rescore_underlinked${WITH_TOPIC}.csv > ${WIKI}${WITH_TOPIC}.csv
    done
done

rm tmp_addlink_{plain,rescore_underlinked}{,_with_topic}.csv



FWIW on huwiki the search seems to perform reasonably well. Assuming performance is decent in other languages as well, I think there are two other potential issues to think about, but we can do that once the new scoring is in production behind a feature flag:

  • Will it result in too many edit conflicts? Previously we used random sorting, so Add Link edits got dispersed over the whole task pool. With underlinked articles sorted on top, edits will cluster around the tasks with the least-linked articles, so collisions could become more frequent. We can probably just add a random factor to the score if that becomes a problem, so I'm not too worried about this.
  • Do we need to tune the service parameters for underlinked tasks so instead of trying to link the first few linkable words it finds, it suggests some of the most relevant links? We could increase the model score threshold, but that would slow down the request and possibly cause a timeout.

Moving to QA, for the lack of a better column. The sample searches at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing need to be spot-checked.

FWIW on huwiki the search seems to perform reasonably well. Assuming performance is decent in other languages as well, I think there are two other potential issues to think about, but we can do that once the new scoring is in production behind a feature flag:

  • Will it result in too many edit conflicts? Previously we used random sorting, so Add Link edits got dispersed over the whole task pool. With underlinked articles sorted on top, edits will cluster around the tasks with the least-linked articles, so collisions could become more frequent. We can probably just add a random factor to the score if that becomes a problem, so I'm not too worried about this.

I think we'd just see an increase in "No suggestion found" messages shown to users, rather than edit conflicts. Once a user does a link recommendation edit, the cache entry with the link recommendation metadata is removed, so a subsequent visit by another user to that page would trigger the "No suggestion found".

  • Do we need to tune the service parameters for underlinked tasks so instead of trying to link the first few linkable words it finds, it suggests some of the most relevant links?

That sounds like a follow-up task to me but I defer to @KStoller-WMF.

We could increase the model score threshold, but that would slow down the request and possibly cause a timeout.

Do you mean, instead of using the cached link recommendation metadata, we'd issue a new request to the service with different parameters with a higher score threshold?

Moving to QA, for the lack of a better column. The sample searches at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing need to be spot-checked.

@KStoller-WMF @MShilova_WMF @Trizek-WMF who should do the spot-checking? The ambassadors, QA / @Etonkovidova, Growth engineers, some combination of those? See also T301096#8100507 for more context on that file.

Do you mean, instead of using the cached link recommendation metadata, we'd issue a new request to the service with different parameters with a higher score threshold?

I was thinking of calculating the length/links ratio in LinkRecommendationUpdater and making the score threshold depend on that.

Moving to QA, for the lack of a better column. The sample searches at https://drive.google.com/drive/folders/1mXx-U8Xsof06VdCARXsBKLNSTg-fbDER?usp=sharing need to be spot-checked.

@KStoller-WMF @MShilova_WMF @Trizek-WMF who should do the spot-checking? The ambassadors, QA / @Etonkovidova, Growth engineers, some combination of those? See also T301096#8100507 for more context on that file.

Ambassadors are currently working on comparing the search samples. I just created T314299: Add a link: check prioritized suggestions of underlinked articles for clarity.

Do we need to tune the service parameters for underlinked tasks so instead of trying to link the first few linkable words it finds, it suggests some of the most relevant links?

I added T314343 to the "Growth: "add a link" structured task 3.0 Epic".

@Tgr, a question from @Dyolf77_WMF: are ar.wiki prioritized articles reviewed (unflagged)?

@Tgr, a question from @Dyolf77_WMF: are ar.wiki prioritized articles reviewed (unflagged)?

Not necessarily, neither the sorted nor the unsorted query takes flagrev status into account in any way.

Growth Ambassadors completed an initial Quality of Add a link suggestions review and we found that the majority of article suggestions that were prioritized were better. About 70% of suggestions that were Prioritized were better than the unprioritized (Normal) article suggestions. To be clear, this was a manual review, so there is some subjectivity and the results are likely not statistically significant.

The main difference between "Normal" and "Prioritized" (besides ratio of links to words) is that Prioritized suggestions are more likely to be longer articles (and potentially more often poorer-quality). There is some concern that adding links to articles that are already low-quality might be seen as a very low-impact edit. However there is also the perspective that we are bringing more attention to these articles that need more attention, so perhaps that's an OK tradeoff.

@Trizek-WMF - What else would you add to this summary?
@Tgr - you mentioned that it might be a fair amount of work to complete this task / reindex search results. Do you have a rough estimate of what it would take so we can decide if the impact is worth that effort?

@Tgr - you mentioned that it might be a fair amount of work to complete this task / reindex search results. Do you have a rough estimate of what it would take so we can decide if the impact is worth that effort?

It will be a fair amount of time (search index schema changes take a month or two), not necessarily a fair amount of work. The minimal version of the patch is almost done (IIRC the only thing left is exposing the new settings in Special:EditGrowthConfig), the question is what extra changes will we need? Do we want some kind of beta period where only some people see it? How much will we have to fine-tune the weight function?

Optimistically, there is 2-3 days' worth of work left, plus the Search team has to add a new index (which I *think* isn't a lot of work, but I'm not sure).

@Trizek-WMF - What else would you add to this summary?

Now that we heard from everyone, I closed the subtask. In T314299#8142535 I wrote this summary:

Overall, the new priority setup is better, even if it suggests longer articles, with fewer links. These longer articles are often low-quality articles, such as unreviewed machine translated articles, lists of series' episodes, movie plots. Some of them are unchanged bot creations.

Suggesting these low quality contents may be problematic, as we encourage newcomers to add links on articles that don't need them (synopsis) or that require more urgent efforts (reviewing translations, adding citations to walls of text, etc.).

@KStoller-WMF, @Tgr, my advice would be to refine the new model to identify articles that are underlinked walls of text with no images. These characteristics are often the ones of translated articles or lists of episodes. Excluding lists would be nice too.

my advice would be to refine the new model to identify articles that are underlinked walls of text with no images. These characteristics are often the ones of translated articles or lists of episodes. Excluding lists would be nice too.

Thank you so much for the evaluation and summary! I agree we might want to consider refining the model slightly.
But I am worried about the scope of this task increasing too much. It seems like some of the concerns mentioned could be mitigated if communities customized their "Add links between articles" configuration, right?

Current "Add links between articles" options to configure that might help:

  • Articles containing categories defined here will not be displayed to users as tasks for this type of task.
  • Landing page to learn more about the link recommendation task type.
  • The list of section names where no link should be recommended.

@Tgr - Or are there any relatively straight-forward ways to proceed with @Trizek-WMF's suggestion that I'm missing? Is it possible to exclude lists?
(My concern is both that we don't want this task to continue to increase in scope and it seems like excluding "underlinked walls of text with no images" is somewhat contradictory to what this task is all about).

There are two quality control steps in the system which are easy for us to change:

  • Validating task candidates in the periodic job that fills up the task pool. (E.g. we discard the recommendation if the article has had an Add Link edit recently.) This is fairly flexible but 1) we can only completely exclude tasks here (and thus reduce the task pool), not prioritize them; and 2) we probably don't want to exclude a lot of tasks, because we still need to iterate through excluded tasks so it might slow things down. (Excluding a small fragment of potential tasks is fine; excluding the majority of them might be problematic.)
  • Changing the search query that ranks and/or filters these tasks. This is limited to what the search engine can currently do: templates, categories, links, whether the article is old, whether it has a linked wikidata item. (You can check the data for yourself with <domain>/w/api.php?action=query&format=json&prop=cirrusdoc&titles=<title>.)
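As a small convenience for checking what the search index holds for a page, the API URL pattern quoted above can be assembled like this. The helper name and the https scheme are assumptions; the query parameters are the ones given in the comment.

```python
from urllib.parse import urlencode


def cirrusdoc_url(domain: str, title: str) -> str:
    """Return the cirrusdoc API URL for one page title (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "cirrusdoc",
        "titles": title,
    }
    # urlencode percent-encodes the title so spaces and punctuation are safe.
    return f"https://{domain}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    print(cirrusdoc_url("hu.wikipedia.org", "Budapest"))
```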

Change 831988 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add token_count subfield to outgoing_link

https://gerrit.wikimedia.org/r/831988

Change 816353 abandoned by Gergő Tisza:

[mediawiki/extensions/GrowthExperiments@master] [DNM] Use params._source for rescoring

Reason:

Not needed anymore as there's now an outgoing_link.token_count field

https://gerrit.wikimedia.org/r/816353

@KStoller-WMF is this something you would like to see rolled out to pilot wikis first, or can it go live on all wikis at once?

My suggestion: Let's rollout to pilot wikis first, have Ambassadors communicate and see if we get any feedback. Then rollout to all wikis after about a month if we don't hear of any issues.

Although @Trizek-WMF is welcome to suggest a different rollout strategy. I've added a communication task (T322868) to go along with this rollout and the (eventual) resolution of the associated epic.

The formula I ended up with is

$random = [random value between 0 and 1];
if ( [article length in bytes] > $minimumLength ) {
    $linkRatio = [links in article] / [words in article];
    $rawUnderlinkednessScore = 1 - $linkRatio;
    $underlinkednessScore = $rawUnderlinkednessScore ^ 4;
} else {
    $underlinkednessScore = 0;
}
$finalScore = $weight * $underlinkednessScore + (1 - $weight) * $random;

where $minimumLength and $weight come from task type configuration (underlinkedMinLength and underlinkedWeight, respectively). The default minimum length is 300 bytes; the weight has no default and doubles as a feature flag: the new algorithm will only be used if it's set.

Using links/words rather than links/bytes like in earlier explorations seemed nicer as links are usually one word, so conceptually the raw underlinkedness score is a good approximation for the chance of a randomly selected word in the article not being a link. The ^4 factor is very arbitrary; the raw underlinkedness score is always very close to 1, which would result in the random factor dominating. The power function seemed like the most straightforward method for a monotonic [0, 1] -> [0, 1] mapping that spreads out the area near 1; I chose 4 by importing a few random enwiki articles and seeing what value gives a wide range of results. This might or might not work well in reality.
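For anyone who wants to play with the numbers, the formula can be sketched in Python as below. This is a standalone approximation, not the code in GrowthExperiments: the function names are made up, and the guard against a zero word count and the clamping of the link ratio to 1 are assumptions added so the sketch cannot divide by zero or go negative.

```python
import random


def underlinkedness_score(link_count: int, word_count: int, length_bytes: int,
                          minimum_length: int = 300, exponent: int = 4) -> float:
    """Return a score in [0, 1]; higher means fewer links per word."""
    # Articles at or below the minimum length get no underlinkedness boost.
    if length_bytes <= minimum_length or word_count <= 0:  # word guard: assumption
        return 0.0
    link_ratio = min(link_count / word_count, 1.0)  # clamp to 1: assumption
    raw_score = 1.0 - link_ratio
    # The power spreads out raw scores that cluster just below 1.
    return raw_score ** exponent


def final_score(underlinkedness: float, weight: float, rng: random.Random) -> float:
    """Blend the underlinkedness score with a random component."""
    return weight * underlinkedness + (1.0 - weight) * rng.random()


if __name__ == "__main__":
    rng = random.Random()
    for links in (1, 5, 20):
        score = underlinkedness_score(links, 100, 2000)
        print(links, round(score, 4), round(final_score(score, 0.5, rng), 4))
```

With 100 words, going from 1 link to 5 links only moves the raw score from 0.99 to 0.95, but the fourth power moves it from roughly 0.96 to roughly 0.81, which gives the random term less room to swamp the difference.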

If you want to examine the results, the simplest approach is to open the homepage with a ?debug=1 flag, and get the search URLs that are logged to the JS console. These will show the details of the scoring (to the extent CirrusSearch exposes them). If you want to test manually with Special:Search, use the cirrusRescoreProfile=growth_underlinked flag (and probably the hasrecommendation:link search query, although it would work with anything else as well). Use cirrusDumpResult=1&cirrusExplain=pretty to make CirrusSearch explain the scoring. (Note that you will get a different result set every time you make a request. I considered using a pseudorandom seed, but the built-in random sort in CirrusSearch is not compatible with using a rescore profile, and I didn't see any great way to pass a URL parameter, or data about the current user, to the rescore function builder.)
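The manual test URL described above could be put together roughly like this. The parameter names and values come from the comment; the helper name and the /w/index.php?title=Special:Search entry point are assumptions about how the wiki is set up.

```python
from urllib.parse import urlencode


def rescore_debug_url(domain: str, query: str = "hasrecommendation:link") -> str:
    """Return a Special:Search URL that asks CirrusSearch to explain scoring."""
    params = {
        "title": "Special:Search",
        "search": query,
        "cirrusRescoreProfile": "growth_underlinked",
        "cirrusDumpResult": "1",
        "cirrusExplain": "pretty",
    }
    # The colon in the search keyword is percent-encoded by urlencode.
    return f"https://{domain}/w/index.php?{urlencode(params)}"


if __name__ == "__main__":
    print(rescore_debug_url("cs.wikipedia.org"))
```

Because of the random component, two requests to the same URL are expected to return different result sets.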

My suggestion: Let's rollout to pilot wikis first, have Ambassadors communicate and see if we get any feedback. Then rollout to all wikis after about a month if we don't hear of any issues.

Although @Trizek-WMF is welcome to suggest a different rollout strategy. I've added a communication task (T322868) to go along with this rollout and the (eventual) resolution of the associated epic.

Sounds good, I made T322910: Add a link: Deploy prioritization of underlinked articles feature as the deployment task.

Change 801010 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add rescore method for sorting by underlinkedness

https://gerrit.wikimedia.org/r/801010

My suggestion: Let's rollout to pilot wikis first, have Ambassadors communicate and see if we get any feedback. Then rollout to all wikis after about a month if we don't hear of any issues.

Works for me.

Change 874968 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Fix underlinkedness rescore logic

https://gerrit.wikimedia.org/r/874968

Change 874969 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add test for UnderlinkedFunctionScoreBuilder

https://gerrit.wikimedia.org/r/874969

Change 874968 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Fix underlinkedness rescore logic

https://gerrit.wikimedia.org/r/874968

Change 875371 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.40.0-wmf.14] Fix underlinkedness rescore logic

https://gerrit.wikimedia.org/r/875371

Change 875372 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.40.0-wmf.17] Fix underlinkedness rescore logic

https://gerrit.wikimedia.org/r/875372

Change 875371 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.40.0-wmf.14] Fix underlinkedness rescore logic

https://gerrit.wikimedia.org/r/875371

Change 875372 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.40.0-wmf.17] Fix underlinkedness rescore logic

https://gerrit.wikimedia.org/r/875372

Mentioned in SAL (#wikimedia-operations) [2023-01-04T22:11:58Z] <kindrobot@deploy1002> Started scap: Backport for [[gerrit:875371|Fix underlinkedness rescore logic (T301096)]], [[gerrit:875372|Fix underlinkedness rescore logic (T301096)]]

Mentioned in SAL (#wikimedia-operations) [2023-01-04T22:13:48Z] <kindrobot@deploy1002> kindrobot and tgr: Backport for [[gerrit:875371|Fix underlinkedness rescore logic (T301096)]], [[gerrit:875372|Fix underlinkedness rescore logic (T301096)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-01-04T22:27:18Z] <kindrobot@deploy1002> Finished scap: Backport for [[gerrit:875371|Fix underlinkedness rescore logic (T301096)]], [[gerrit:875372|Fix underlinkedness rescore logic (T301096)]] (duration: 15m 20s)

Change 874969 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add test for UnderlinkedFunctionScoreBuilder

https://gerrit.wikimedia.org/r/874969

I'm writing the newsletter, and I look for which tense I should use regarding this task. Should I use past or future (or present, as it is ongoing)?

I'm writing the newsletter, and I look for which tense I should use regarding this task. Should I use past or future (or present, as it is ongoing)?

It's enabled on pilot wikis at the moment, per T301096#8399811. But is it time to roll out to all wikis where link recommendation is deployed? cc @KStoller-WMF

My current plan:

After this change has been released for a month (AKA in early February), I will:

  • Gather qualitative feedback: discuss with Ambassadors / Growth pilot wikis
  • Gather quantitative data: compare "add a link" edit counts and revert rates pre and post change.

That being said, I don't think we'll have enough revert data to make that data significant, so presumably we will have to rely more on Ambassador / community feedback and release as long as there isn't community concern.

@Trizek-WMF to help set expectations, should we set a release date for late February (maybe Thursday, February 23rd?) and we'll plan to stick to that unless we discover something unexpected in the feedback or data?

Kirsten added this topic to the next Ambassadors' meeting. They will ask patrollers at their wikis for one week, and then we will decide on what to do next.

Community Ambassadors completed an initial evaluation that confirmed that prioritizing underlinked articles resulted in better article suggestions. In the 181 articles evaluated, the suggestions that resulted from the new prioritization formula were considered better ~69% of the time.

By evaluating edit and revert data on Growth pilot wikis before and after the release of T301096 (and using other wikis with "add a link" as a baseline), it appears that the new prioritization formula may result in more newcomers completing the "add a link" task and a lower revert rate. (However, this was not a true A/B test, and revert rates are low enough that it is difficult to draw a conclusion with confidence.)

Given these results, and the fact that we haven't received negative feedback about the change from Growth pilot wikis, I suggest we move forward with releasing this new prioritization formula to all wikis that have the "add a link" task. I've opened a new task to cover that work: T330535: Add a link: prioritize suggestions of underlinked articles: scale to all wikis
