Page MenuHomePhabricator

Ensure that AbuseFilter applies to the content of all MCR slots
Closed, ResolvedPublic

Description

AbuseFilter should be able to filter by the content of all slots.

At present, filter variables that are backed by the ParserOutput will match all slots, since the ParserOutput objects exposed to AbuseFilter is the combiend output of all slots. In contrast, filter variables backed by the raw content (wikitext) apply only to the main slot.

As a baseline, AbuseFilter could simply concatenate the raw content as returned by Content::getTextForSearchIndex before initializing the variables. Ideally, AbuseFilter would allow filters to specify which slot the should apply to.

Event Timeline

Jdforrester-WMF created this task.
Jdforrester-WMF renamed this task from AbuseFilter applies to the content of all slots to Ensure that AbuseFilter applies to the content of all MCR slots.Nov 5 2018, 7:43 PM

@Jdforrester-WMF I was just going to ask what's to be done here, since I don't know much about MCR. This also means that I probably won't be able to help much with this. Anyway, AF uses the EditFilterMergedContent for edits. If there are no changes to the logic related to this hook, AF will probably work as usual.

daniel renamed this task from Ensure that AbuseFilter applies to the content of all MCR slots to AbuseFilter applies to the content of all slots.Nov 5 2018, 7:50 PM
daniel removed tstarling as the assignee of this task.
daniel updated the task description. (Show Details)
daniel added a subscriber: tstarling.
Daimona renamed this task from AbuseFilter applies to the content of all slots to Ensure that AbuseFilter applies to the content of all MCR slots.Nov 5 2018, 7:51 PM
Daimona assigned this task to tstarling.

Sorry, didn't mean to retain the assignment to Tim.

One way to do this is to create a bunch of new variables for every slot - content, diff, links etc. The other way is to have a per-filter configuration setting so that a filter can apply to a specific slot. The first is more flexible but also makes filters more complicated / harder to read. (And then neither makes it possible to have filters that apply some criteria to every slot individually, unless you also add some way of iterating slots.) Probably a good idea to consult filter maintainers before making that decision (and to split the baseline and the ideal versions of this task as the baseline is pretty straightforward and the full version is not).

@Jdforrester-WMF I was just going to ask what's to be done here, since I don't know much about MCR. This also means that I probably won't be able to help much with this. Anyway, AF uses the EditFilterMergedContent for edits. If there are no changes to the logic related to this hook, AF will probably work as usual.

Well, T192307: Provide MCR-aware hook points exists, and in reading the code in EditPage's runPostMergeFilters it seems to me unlikely how the Content object would ever have anything other than just the main slot in it, so… I doubt EditFilterMergedContent is currently providing a path for this to work right now, no.

There's also T106136: Wikidata edits are not tagged by AbuseFilter on item creations which may affect the UploadWizard use case, but I think we escape that for now as we make a series of API edits on top of the extant File item immediately after it's been uploaded.

Uh, as filter maintainer I don't really fancy adding tons of new variables. However, adding logic to apply filters only to some slots could be IMHO even worse. If we decide to add new variables, I guess we should make the existing ones refer to all slots, to keep B/C. But again, I'm new to MCR, so all I can do is probably express my thoughts as filter maintainer.

Ramsey-WMF raised the priority of this task from Medium to High.Nov 8 2018, 5:48 PM
Ramsey-WMF subscribed.

Changing priority to reflect the need for this before we launch file captions on SDC.

After taking a bit of a closer look at it, it looks like the "old_wikitext", "new_wikitext", "old_content_model", and "new_content_model" would need to be replaced with something that allows fetching the values for each slot individually.

Then any calculated variables that are currently calculated based on these strings (e.g. everything in AbuseFilter::getEditVars()) should likely start with RevisionRecords or RenderedRevisions instead, and on a case-by-case basis decide whether you'll have full-page versions and/or per-slot versions of the variable.

How exactly that should be reflected in AbuseFilter's filter language I have no idea, nor how AbuseFilter's caching of old hits using the old variables might need to be considered.

I think a good approach could be to:

  • Keep current variables as they are, i.e. include all slots (concatenated)
  • Add new variables, for instance new_wikitext_slot, added_lines_slot, ..., every of them being an array. To retrieve its content for the nth slot, you just have to access e.g. new_wikitext_slot[1] and proceed as usual

I don't know if this will be feasible and painless for filter maintainers, but I guess it'd be a relatively simple but effective approach. However, please note that:

  1. I haven't investigated what code changes would be necessary
  2. Since all these arrays will have variable length, we'll need to change the syntax checker, which right now fails with non-fixed-length arrays (part of T204654)
  3. In any case, also given point 2, this will not be easy.

Maybe move language to old_slot_content and new_slot_content given that many slots aren't wikitext whilst we're at it?

@Jdforrester-WMF That would be easy, but I'm not sure it's worth it. It'd be similar to T173889, and as for that task we would end up keeping old variables anyway. Unless we want to edit hundreds of filters.

@Jdforrester-WMF That would be easy, but I'm not sure it's worth it. It'd be similar to T173889, and as for that task we would end up keeping old variables anyway. Unless we want to edit hundreds of filters.

Almost every filter is going to need to be upgraded to work with the new world anyway; this way makes it much easier for filter editors to see immediately whether the filter has been updated yet. The current variables would stay the same.

Arrays are an abomination in AbuseFilter, I don't think it's a good idea to expose them more in their current form. Also for sane handling of arbitrary slots (not an SDC blocker but needed eventually) you'd need associative arrays.
Although if separate variables are used instead then you'd need variable variables, that's not much nicer either. (Or a function returning slots by name; that has sane syntax but doesn't really fit the current concept of functions.)

@Jdforrester-WMF Huh? Weren't we planning to make the change so that nothing will need to be updated? It wouldn't be easy to update all those filters...
@Tgr Well, that's an important point. I think we should instead improve arrays (and yeah, they pretty suck right now) somehow, maybe introducing associative arrays and changing my previous idea to use them.

Replacing strings-pretending-to-be-arrays with actual arrays would almost certainly be a breaking change. How should AF handle breaking changes? (Versioning? Something akin to deprecation warnings? Just hope that filter maintainers will deal with it?)

I agree it's a worthwhile goal, it is not a small task though.

@Tgr If correctly supervisioned, it won't be a breaking change. Arrays are currently handled as strings in several function, that's because of poor initial implementation. Just think of how common is for instance added_lines contains/like/rlike ...: in these cases we just cast the array to string, because added_lines was turned into array when lots of filters already used a syntax like that, and people decided not to break everything. A step forward is to add array-specific functions, see https://gerrit.wikimedia.org/r/#/c/424298/. This way we add new ways to play with arrays (mostly treating them as such), without changing anything in the current syntax.
Adding associative arrays would be the same: for sure a big (and possibly useful) change, but it wouldn't interfere with existing filters.
The two points above (specific functions and associative arrays) would IMHO make arrays acceptable, and we could use such arrays in this patch after having addressed T204654.

I suggest to use variable suffixes instead of arrays. old_content_model_main, old_content_model_mediainfo, etc. Not a great example, but I suppose you see what I mean.

I suggest to use variable suffixes instead of arrays. old_content_model_main, old_content_model_mediainfo, etc. Not a great example, but I suppose you see what I mean.

But that would mean every wiki needs to manually update each of their filters each time a new slot rolls out, even if it's the same content type as before – e.g. if we add a documentation wikitext slot to templates. Letting filter authors match against content_model generically would be more powerful and involve less work for them…

But that would mean every wiki needs to manually update each of their filters each time a new slot rolls out, even if it's the same content type as before – e.g. if we add a documentation wikitext slot to templates. Letting filter authors match against content_model generically would be more powerful and involve less work for them…

I suppose it depends on the filter. You are probably right that most filters should apply across slots, and should rather depend on the content model.

Perhaps the best approach would then to simply run all filters against each slot. The variables would then refer to the content of that slot. There would be no need to concatenate any text or otherwise combine output. We could in the future allow users to specify which slots and/or which models each filter should apply to, but for now, we could just run them all against each of the slots slots.

We could in the future allow users to specify which slots and/or which models each filter should apply to

That's pretty trivial by exposing the slot name as a variable.

But the per-slot approach prevents filters that concerns themselves with the relationship between multiple slots. Prevent adding $thing to the wikitext if it is already in the structured data slot, flag edits that update templatedata but do not change the documentation slot, that kind of thing.

I think per-slot filters aren't the right way. As for using suffixes versus arrays, the idea is the same, and both ways are good. The advantage of arrays could be a less clogged variables list (and you get the idea that you're using a part of the whole). Another thing to take into account: each part of variable should be lazy-loaded independently of others, e.g. if the whole page has new_wikitext for slot A and B, only load the required ones, like if they were separate variables. Effectively making them separate variables would help with this.

I doubt there's much benefit in lazy-loading, these filters typically run on page save when all the content slots are in memory anyway.

@Tgr that is true for old_wikitext_slot and new_wikitext_slot (or whatever we decide to call them), but not for derived variables like added_lines_slot, edit_delta_slot etc. I also think that for users it would be quite confusing to see lazy-loaded array elements of the same array (in /details and /examine), so we should take this into account.

I would as a first version just lump everything together so at least the data is available to the abusefilter editors. The more advanced version should make the distinction between wikitext, wikibase text (in abusefilter it's no json anymore), etc.

Jdforrester-WMF lowered the priority of this task from High to Medium.Dec 4 2018, 8:01 PM
Jdforrester-WMF raised the priority of this task from Medium to Needs Triage.
Jdforrester-WMF triaged this task as High priority.
Jdforrester-WMF moved this task from To Do to Blocked on the Structured Data Engineering board.
  NODES
Idea 8
idea 8
Note 2
Project 11
USERS 3
Verify 2