One of the options for T213505 (RfC: OpenGraph descriptions in wiki pages) is to port the existing description parsing logic in the Page Content Service summary endpoint to PHP and do it in MediaWiki during parsing or page rendering; this task is about estimating the difficulty of that.
The process of creating the summary looks like this (a rough PHP sketch of a few of these steps follows the list):
1. get Parsoid HTML
2. split via <section> and take the lead section (parsoidSections.createDocumentFromLeadSection)
3. remove dewiki IPA markup (stripGermanIPA.js)
4. remove elements matching certain CSS selectors (rmElements.js, conf)
5. remove some useless spans, e.g. replace <span>[</span>1<span>]</span> with [1] (rmBracketSpans.js)
6. remove comment nodes (rmComments.js)
7. remove some common HTML attributes (rmAttributes.js, conf)
8. remove the mw* ids generated by Parsoid (rmMwIdAttributes.js)
9. fetch the contents of the first non-empty <p> block plus any trailing content until the next <p> (extractLeadIntroduction.js)
10. turn links into spans, and drop spans which do not do any styling (summarize.js#37-38)
11. remove elements matching certain CSS selectors (summarize.js#39-44)
12. remove Parsoid-specific HTML attributes (summarize.js#45)
13. if the lead does not seem to contain math formulas, remove parentheses that are nested inside other parentheses, that have a space inside and a space/nbsp right before them, or that are (or have become) empty (summarize.js#46-67 and 17-27)
14. collapse whitespace (summarize.js#69-73, summarize.js#78-79)
15. call sanitize-html, which discards non-text tags like <script>, removes a number of non-whitelisted tags while keeping their contents, and removes non-whitelisted attributes/styles from the allowed tags (sanitizeSummary.js)
16. remove space/nbsp before punctuation in some cases (summarize.js#81-88)
17. do a standard DOM HTML-to-text transformation on the result (summarize.js#93)
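To make the estimate a bit more concrete, here is a rough, hypothetical sketch (not existing PCS or MediaWiki code; all function and parameter names are made up) of what the steps that matter for a plain-text description — removing elements by selector, taking the first non-empty <p> plus its trailing siblings, and converting to text — could look like in PHP, assuming the lead section HTML is already available as a string and the removal rules are expressed as XPath:

```php
<?php
// Hypothetical sketch only: roughly what the plain-text-relevant steps
// (remove elements by selector, extract the lead introduction, DOM-to-text)
// could look like in PHP.

function extractPlainTextSummary( string $leadSectionHtml, array $removalXPaths ): string {
	$doc = new DOMDocument();
	// The XML prolog forces UTF-8; @ suppresses warnings about HTML5 tags libxml doesn't know.
	@$doc->loadHTML( '<?xml encoding="utf-8"?>' . $leadSectionHtml );
	$xpath = new DOMXPath( $doc );

	// Steps 4/11: remove elements matching the configured selectors
	// (expressed here as XPath; CSS selectors would need a translation layer).
	foreach ( $removalXPaths as $query ) {
		foreach ( iterator_to_array( $xpath->query( $query ) ) as $node ) {
			$node->parentNode->removeChild( $node );
		}
	}

	// Step 9: first non-empty <p>, plus trailing siblings until the next <p>
	// (assumes comments were already removed, as in step 6).
	$intro = '';
	foreach ( $xpath->query( '//p' ) as $p ) {
		if ( trim( $p->textContent ) === '' ) {
			continue;
		}
		$node = $p;
		while ( $node !== null ) {
			if ( $node !== $p && $node->nodeName === 'p' ) {
				break;
			}
			// Step 17: plain DOM-to-text.
			$intro .= $node->textContent;
			$node = $node->nextSibling;
		}
		break;
	}

	// Step 14: collapse (ASCII) whitespace.
	return trim( preg_replace( '/\s+/', ' ', $intro ) );
}
```

This leaves out the lead-section splitting and the IPA handling, but it suggests that the applicable steps map onto standard DOM calls plus a little configuration.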
This is pretty complex, but most of it is there to produce a safe and readable HTML extract. og:description is text-only (and probably whitespace-insensitive), so only steps 1-4, 9, 11, 13, 17, and a small part of 15 apply. Almost all of that is simple and more configuration than code: remove elements matching certain CSS selectors, drill down to a part of the document that could be expressed with an XPath, and do the DOM spec's HTML-to-text conversion. The exceptions are steps 3 and 13. Step 3 becomes very simple when you don't care about HTML: just replace IPA blocks with some special Unicode character, then, after the rest of the transforms are done, drop the replacement plus the preceding/following brackets. That leaves step 13 as the only complicated logic that would have to be duplicated, and it is just a series of regular expressions.

So altogether this is a few dozen lines of logic and another few dozen lines of configuration, assuming there is support for manipulating an HTML document via CSS selectors and standard DOM commands. The code would have to be maintained in two places (the generation of HTML descriptions would remain in PCS, and some of these steps are used by other endpoints as well), but even so it does not seem like a huge burden.
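As an illustration of the placeholder idea for step 3, a minimal hypothetical sketch (the selector and function names are invented; the real dewiki IPA selector is part of the stripGermanIPA.js configuration) might look like:

```php
<?php
// Hypothetical sketch of the placeholder idea, not the actual PCS logic.

// A private-use codepoint that should never occur in article text.
const IPA_PLACEHOLDER = "\u{F8FF}";

// Before the other transforms: replace each matched IPA element's contents with the marker.
function markIpaNodes( DOMXPath $xpath ): void {
	// Illustrative selector only; the real one is configuration.
	foreach ( $xpath->query( '//span[contains(@class, "IPA")]' ) as $node ) {
		while ( $node->firstChild ) {
			$node->removeChild( $node->firstChild );
		}
		$node->appendChild( $node->ownerDocument->createTextNode( IPA_PLACEHOLDER ) );
	}
}

// After the text extraction: drop the marker together with any surrounding brackets.
function stripIpaPlaceholders( string $text ): string {
	$marker = preg_quote( IPA_PLACEHOLDER, '/' );
	$text = preg_replace( '/\s*[\[(]?\s*' . $marker . '\s*[\])]?\s*/u', ' ', $text );
	return trim( preg_replace( '/ {2,}/', ' ', $text ) );
}
```

Whether a placeholder or outright node removal is simpler in practice would need to be checked against the dewiki markup; the point is only that the HTML-preserving parts of stripGermanIPA.js are not needed for plain text.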
A precondition for doing this is that the code needs to be fast enough to run during a parse or render (if it has to be done asynchronously, we can just use PCS as it is now). Current PCS latency is 200ms p50, 250ms p75, 1s p95, 50s p99. That's obviously unworkable, but given that we'd omit most of the processing that's currently done, processing time might become significantly shorter. Also, much of the remaining processing could probably be optimized (e.g. the element removal currently does a separate DOM pass for every element).
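For instance, if the ported code removes elements selector by selector, as in the sketch above, one easy win would be a single union query so the document is only walked once (again just a hypothetical sketch using the names from that sketch):

```php
// Hypothetical: one XPath union query instead of a query per selector.
$combined = implode( ' | ', $removalXPaths );
foreach ( iterator_to_array( $xpath->query( $combined ) ) as $node ) {
	if ( $node->parentNode !== null ) {
		$node->parentNode->removeChild( $node );
	}
}
```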
Another question is whether this needs to use Parsoid HTML (in which case it would be blocked on the porting of Parsoid to MediaWiki). At a glance, not really: there are a few Parsoid-specific selectors, like [data-mw*="_target":{"wt":"IPA"], but those could be worked around by adding better markup in the templates. However, there might be differences between the HTML structure generated by MediaWiki and Parsoid that would break the current logic; that's hard to tell without implementing and testing it.
PCS has a good amount of unit tests for this; it might be worth porting those as well. There is also a growing list of page titles that are good candidates for testing summaries.
Deprioritizing; I think it makes sense to run the RfC first and see which options are acceptable in the first place.
Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)
Note also that doing proper language converter redirects (T240068 is one example) and handling flagged revisions (T209936) become a lot easier if this code is moved to core (in PHP). It would also avoid a round trip, since both the Parsoid and legacy HTML are cached in core.
(There's another LanguageConverter-related bug somewhere, which I can't find right now, where a request to the summary endpoint for title [[A]] from mobile doesn't return any results because the title actually exists in the DB as [[B]], where B is the language-converted form of A.)
Recent discussions with other teams interested in the summary endpoint have progressed towards extracting the logic into a JS library instead, so that those teams can build their own interfaces on top of the summary extraction logic.
I'll add this to the icebox to revisit in the future since we might want to close this as invalid depending on the results of the aforementioned effort.
@Jgiannelos and I have evaluated the difficulty of porting PCS summary logic to PHP and included our findings in this doc.