Page MenuHomePhabricator

Bypass query length limit for incategory search
Closed, ResolvedPublic

Description

Creating a separate task for this and split it form T107947 since it involves quite some discussion by now.
First patch by @EBernhardson has been merged: https://gerrit.wikimedia.org/r/#/c/236195/

But according to @jkroll it seems it doesn't cover all cases that are relevant for DeepCat-Gadget:

I think the current patch doesn't work for all cases yet. The gadget replaces certain search terms by their "expanded" variations. Users can combine several subcategory searches with other search terms and with each other.
For instance, this search string 'Music deepcat:Cello deepcat:Person' would be transformed by the gadget to something like this:
Music incategory:id:123|incategory:id:456[...] incategory:id:432|incategory:id:765[...]
So that the search result would be the intersection of categories 'Cello' and 'Person', including subcategories, and pages which contain the word 'Music'.
If I understand the patch correctly, the current regexp wouldn't allow such a search string to exceed the length limit. Can that be fixed?

Event Timeline

Is the bug you're referring to not one where our Product Manager
states that this is not currently /meant/ to work?

@Ironholds I am referring to https://gerrit.wikimedia.org/r/#/c/236195/. And, technically, it's not a bug..

So that we can avoid the cat-and-mouse situation where we change the query length exception only to find out afterwards about another kind of query that the gadget can emit that is not covered by the exception, please provide an exhaustive list of all query signatures that your DeepCat gadget can query the API with. Thanks!

@Ironholds I am referring to https://gerrit.wikimedia.org/r/#/c/236195/. And, technically, it's not a bug..

So that we can avoid the cat-and-mouse situation where we change the query length exception only to find out afterwards about another kind of query that the gadget can emit that is not covered by the exception, please provide an exhaustive list of all query signatures that your DeepCat gadget can query the API with. Thanks!

For any search string given by a user, the gadget replaces all occurences of 'deepcat:CATEGORY' with the list of the page_ids of CATEGORY and its subcategories recursively, in the form

incategory:id:123|id:234|id:345 [...]

A query can contain one or more such category lists. Each list has one or more entries. All other search terms and operators are untouched and fed to Cirrus unmodified. The order of search terms is preserved as the user typed them.

Corrected example:
User types this search string:

Music +deepcat:Cello -deepcat:Person

Modified search string passed to Cirrus:

Music +incategory:id:123|id:5|id:234 -incategory:id:432|id:7|id:4321

It's probably easiest to just allow the maximum query length to be bypassed if there's an incategory operator anywhere in it, then. In the interests of resolving this once and for all, now would be the time to let me know if that doesn't work. :-)

Change 236570 had a related patch set uploaded (by Deskana):
Bypass max query length if query contains incategory operator.

https://gerrit.wikimedia.org/r/236570

Stas proposed a solution in two pass.

  1. limit the query with a rather large value (btw it's already limited by GET no?)
  2. extract all the special syntax (incategorie, intitle, insource...) as we do already
  3. check the resulting query size after extraction with the low limit (300)

This would allow users to run long queries with special syntax but a "normal query" would be limited to 300 chars.

Examples:

  • normal queries without special syntax will be limited to 300 chars.
  • random garbage queries will be limited to 300 chars.
  • a query with : incategory:id:1|id:2|...id:100 with some words would be hard limited to the GET size and maybe another hard limit in cirrus (3000?) but the with some words part will be limited to 300 chars.

Would that solution works?

Oops, sorry @Deskana I haven't seen that you uploaded a patch before my comment. I think your solution works also.

It's probably easiest to just allow the maximum query length to be bypassed if there's an incategory operator anywhere in it, then. In the interests of resolving this once and for all, now would be the time to let me know if that doesn't work. :-)

That should cover every case, I see no reason why it wouldn't work.

Change 236570 merged by jenkins-bot:
Bypass max query length if query contains incategory operator.

https://gerrit.wikimedia.org/r/236570

The above patch is merged and should resolve this issue. However, it won't take effect for around a week, as I believe it was merged after the branch was cut.

Thank you all for resolving this.

  NODES
Note 1
Project 5
USERS 2