Maniphest T205254

Investigate usage of "text" in AbuseFilter rules on wikidata.org
Closed, ResolvedPublic
Assigned To

Authored By

	Addshore
	Sep 24 2018, 9:06 AM

Description

We want to reduce the amount of "text" provided to AbuseFilter from Wikibase entities in T205252.
Before we can do that we need to see what rules are in place and which bits of the text are actually used.

This should cover:

Rules in entity namespaces (Item, Property, Lexeme)
AbuseFilter text-related variables (from https://www.mediawiki.org/wiki/Extension:AbuseFilter/Rules_format):
- old_wikitext, new_wikitext
- edit_diff, edit_diff_pst
- added_lines, added_lines_pst, removed_lines, removed_lines_pst
- new_pst, new_html, new_text
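
A scan for these variables could be sketched roughly as follows. This is only an illustration in Python: it assumes the filter rules have already been exported as plain strings keyed by filter ID, and the `rules` dict and `find_text_variables` helper are hypothetical, not part of AbuseFilter.

```python
import re

# Text-related AbuseFilter variables listed in the task description.
TEXT_VARIABLES = [
    "old_wikitext", "new_wikitext",
    "edit_diff", "edit_diff_pst",
    "added_lines", "added_lines_pst",
    "removed_lines", "removed_lines_pst",
    "new_pst", "new_html", "new_text",
]

# Word boundaries keep "edit_diff" from matching inside "edit_diff_pst".
VARIABLE_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, TEXT_VARIABLES)) + r")\b"
)


def find_text_variables(rules):
    """Map each filter ID to the sorted text variables its rule mentions."""
    hits = {}
    for filter_id, rule in rules.items():
        found = sorted(set(VARIABLE_RE.findall(rule)))
        if found:
            hits[filter_id] = found
    return hits


# Hypothetical rules, for illustration only (not real wikidata.org filters).
rules = {
    1: 'page_namespace == 0 & added_lines rlike "normal"',
    2: 'user_editcount < 10 & summary contains "wbsetlabel"',
    3: 'edit_diff_pst irlike "spam" | new_wikitext contains "x"',
}
print(find_text_variables(rules))
```

Filters that only look at the summary or user variables drop out of the result, which narrows down the set of rules that still need a manual look.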

For example

Statement GUIDs are provided as one of the lines in the "text". So are strings such as the following used by abuse filter rules? "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"
The rank of statement is also included as a line such as "normal". Is this used in abuse filter rules?
etc.

Related Objects

Status	Assigned	Task
Resolved	Addshore	T204109 Investigate some Wikidata write api queries taking over 30 seconds. (In EntityContent::collectValues?)
Resolved	Addshore	T205252 Reduce text returned by EntityContent::getTextForFilters
Resolved	Addshore	T205254 Investigate usage of "text" in AbuseFilter rules on wikidata.org

Event Timeline

Addshore triaged this task as Medium priority. Sep 24 2018, 9:06 AM

Addshore created this task.

Addshore moved this task from Inbox to Investigate & Discuss on the [DEPRECATED] wdwb-tech board.

Lydia_Pintscher added a project: Wikidata.org. Sep 25 2018, 3:47 PM

Addshore moved this task from incoming to needs discussion or investigation on the Wikidata board. Sep 26 2018, 6:38 AM

Addshore moved this task from Investigate & Discuss to LEGACY Freezer 🥶 on the [DEPRECATED] wdwb-tech board. Oct 8 2018, 12:49 PM

Addshore moved this task from LEGACY Freezer 🥶 to Legacy Inbox on the [DEPRECATED] wdwb-tech board. Oct 8 2018, 12:51 PM

Addshore mentioned this in T209687: WikibaseMediaInfo should define keys to ignore when passing test to AbuseFilter. Nov 16 2018, 9:34 AM

Addshore raised the priority of this task from Medium to High. Nov 16 2018, 9:37 AM

Addshore mentioned this in T205252: Reduce text returned by EntityContent::getTextForFilters.

Just a comment: "_text" variables are for page title, and so are "_prefixedtext" variables. So, if you're interested in covering such variables, then you also have to include "_title" and "_prefixedtitle" per T173889.

Addshore mentioned this in T204109: Investigate some Wikidata write api queries taking over 30 seconds. (In EntityContent::collectValues?). Dec 14 2018, 12:11 PM

In T205254#4753194, @Daimona wrote:

Just a comment: "_text" variables are for page title, and so are "_prefixedtext" variables. So, if you're interested in covering such variables, then you also have to include "_title" and "_prefixedtitle" per T173889.

So, this ticket only cares about the "wikitext" in all of its forms, not the title, we should update the description!

Daimona updated the task description. Feb 5 2019, 11:52 AM

Daimona updated the task description. Feb 5 2019, 11:56 AM

Description updated! Searching for all of the variables yields 76 matches. Checking by hand is feasible, but not optimal. Is there a list of what data we're looking for (e.g. GUIDs and rank, mentioned in task desc)? I'd like to see if I can extract a regex from there.
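
As a starting point for such a regex, here is a rough Python sketch covering the two kinds of data named in the task description so far. The GUID pattern is an assumption derived from the single example in the description (entity ID, `$`, then a UUID), and `classify_line` is a hypothetical helper.

```python
import re

# Assumed GUID shape: entity ID, "$", then a UUID, as in the task's example.
STATEMENT_GUID_RE = re.compile(
    r"^[A-Z]\d+(?:-[A-Z]\d+)?\$"
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}"
    r"-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$"
)

# Statement ranks that can show up as standalone lines.
RANKS = {"preferred", "normal", "deprecated"}


def classify_line(line):
    """Label one line of the filter text as 'guid', 'rank' or 'other'."""
    line = line.strip()
    if STATEMENT_GUID_RE.match(line):
        return "guid"
    if line in RANKS:
        return "rank"
    return "other"


print(classify_line("Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"))
print(classify_line("normal"))
print(classify_line("Douglas Adams"))
```

The same patterns could be tried against the text of the matching rules to see whether any of them target GUID-like or rank-like strings.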

Lucas_Werkmeister_WMDE mentioned this in T215422: Migrate Wikibase to use comment_data field instead of SummaryFormatter. Feb 6 2019, 3:14 PM

Rules in entity namespaces (Item, Property, Lexeme)

Nothing for lexemes yet.

Statement GUIDs are provided as one of the lines in the "text". So are strings such as the following used by abuse filter rules? "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"

This has been making abuse filter matching harder.

The rank of statement is also included as a line such as "normal". Is this used in abuse filter rules?

Sometimes.

Just a random comment: data actually used by existing abuse filters like the rank can be moved from added_lines to new AF variables defined via hooks.

In T205254#5034129, @matej_suchanek wrote:

Statement GUIDs are provided as one of the lines in the "text". So are strings such as the following used by abuse filter rules? "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7"

This has been making abuse filter matching harder.

Yup, the format just being a collection of lines is a pretty insane thing to have to try to match.

In T205254#5034297, @Daimona wrote:

Just a random comment: data actually used by existing abuse filters like the rank can be moved from added_lines to new AF variables defined via hooks.

Indeed, to know what to move to different vars we would need some sort of overview of all of the elements used.

Is statement GUID used?
Is language ever used?
Are the reference etc hashes ever used?
Are various elements of some data types ever used? (Time values contain lots of 0s, for example for before, after, etc.)

Is statement GUID used?

Probably not.

Is language ever used?

It is usually matched against in the summary (see https://www.wikidata.org/wiki/Special:AbuseFilter/33).

Are the reference etc hashes ever used?

Probably not.

Are various elements of some data types ever used? (Time values contain lots of 0s, for example for before, after, etc.)

No. There are two filters which deal with complex datatypes (#55 and #93), but they don't need it. That doesn't mean we wouldn't want to create filters that check for invalid data...

So we could get rid of:

statement guids
all hashes
language keys (actually already done for items)
some keys from complex values:
- timevalues (before, after)
- possibly some others.

We can either use the current approach which is to define keys which should be ignored at all levels of the JSON, or create a slightly more complex layered method of filtering.

The current list is:

		return [
			'language',
			'site',
			'type',
		];

and could be something like:

		return [
			'language',
			'site',
			'type',
			'hash',
			'id',
			'before',
			'after',
		];

but this would have some slightly unexpected consequences, as 'id' is pretty generic for the statement guids, and we would also now be excluding the _target ID for statements when they are added etc.
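
To make that side effect concrete, here is a minimal Python sketch of ignoring keys at every level. It is not the Wikibase implementation: `collect_values` is a hypothetical helper, and the statement structure is heavily simplified for illustration.

```python
# Extended ignore list from the proposal above.
IGNORED_KEYS = {"language", "site", "type", "hash", "id", "before", "after"}


def collect_values(data, ignored=IGNORED_KEYS):
    """Return scalar values of nested data, skipping ignored keys at every level."""
    values = []
    if isinstance(data, dict):
        for key, value in data.items():
            if key in ignored:
                continue  # the whole subtree under an ignored key is dropped
            values.extend(collect_values(value, ignored))
    elif isinstance(data, list):
        for item in data:
            values.extend(collect_values(item, ignored))
    else:
        values.append(str(data))
    return values


# Simplified, hypothetical statement: "id" appears both as the statement
# GUID and as the ID of the entity the statement points at.
statement = {
    "id": "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7",
    "rank": "normal",
    "mainsnak": {
        "property": "P50",
        "hash": "abc123",
        "datavalue": {"type": "wikibase-entityid", "value": {"id": "Q42"}},
    },
}
print(collect_values(statement))
```

In this sketch the GUID and the hash are gone as intended, but the target `Q42` disappears as well, because the flat list cannot tell the two `id` keys apart.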

This will need some slight refactoring in EntityContent and related classes, we need a customizable way (that is efficient) per entity type.
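
One possible shape for the layered, per-entity-type variant is sketched below in Python. Everything here is an assumption for illustration: the path patterns, the `IGNORED_BY_TYPE` table and the simplified item structure are hypothetical, and no attempt is made at the efficiency the real code would need.

```python
# Hypothetical per-entity-type ignore rules, written as key paths.
# "*" matches any single key or any list element.
IGNORED_BY_TYPE = {
    "item": [
        ("claims", "*", "*", "id"),                # statement GUID only
        ("claims", "*", "*", "mainsnak", "hash"),  # snak hash
    ],
}


def path_matches(path, pattern):
    """True if the path has the pattern's length and every segment matches."""
    return len(path) == len(pattern) and all(
        p == "*" or p == s for s, p in zip(path, pattern)
    )


def collect_values(data, patterns, path=()):
    """Return scalar values, skipping subtrees whose path matches a pattern."""
    if any(path_matches(path, p) for p in patterns):
        return []
    if isinstance(data, dict):
        children = list(data.items())
    elif isinstance(data, list):
        children = [("*", item) for item in data]  # list positions are wildcards
    else:
        return [str(data)]
    values = []
    for key, value in children:
        values.extend(collect_values(value, patterns, path + (key,)))
    return values


# Simplified, hypothetical item with one statement.
item = {
    "claims": {
        "P50": [
            {
                "id": "Q56596767$199BCB00-D1ED-40A5-B001-439BC5F434F7",
                "rank": "normal",
                "mainsnak": {
                    "hash": "abc123",
                    "datavalue": {"value": {"id": "Q42"}},
                },
            }
        ]
    }
}
print(collect_values(item, IGNORED_BY_TYPE["item"]))
```

Because the rules are tied to a position in the structure, the statement GUID can be dropped while the target ID further down is kept, and each entity type can carry its own list.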

Going to close this investigation ticket now, as we have made some headway and know which step we will take next to shave some time off saves.
I will leave the parent ticket open so that the work can be done there.

Restricted Application added a project: User-Addshore. Jun 20 2019, 9:53 PM

Addshore mentioned this in T226216: Statement GUIDs should not appear in AbuseFilter text for Wikibase. Jun 20 2019, 10:06 PM
