
What percentage of articles created by autoconfirmed users are deleted? What percentage by non-autoconfirmed?
Closed, Resolved · Public

Description

Now that we have some numbers from T149021, let's dig into the data a little deeper. Can we rerun Tilman's query, but restrict it to deleted articles? So we would have:

day	creations_that_have_been_deleted	non_autoconfirmed_creations_that_have_been_deleted

Event Timeline

@Tbayer: Any update on this? It's one of the follow-up questions that we've gotten after posting the report.

@kaldari: I have not started working on this yet (I've been focusing on other parts of the new pages investigation instead). It might be fairly easy to do using the revision_is_deleted field; I'll aim to take a closer look tomorrow and find out.

@Tbayer: TonyBallioni has specifically requested that we look at it with a threshold of 90 days since article creation.

@kaldari the request was mainly put that way for a manageable timeframe after the creation of an article and past when the new user backlog tends to fall off.

Also, I'm not a technical person, but my concern with using the "revision is deleted" field is that it might also pick up significant contributions beyond just page creations that have been hidden via revision deletion for one reason or another (mainly copyright violations). If these were included, they might unintentionally skew the data that is reported.

Thanks to you all for working on this. It is very much appreciated.

Having a threshold ("deleted after 90 days") is a good idea. It lets us have numbers that won't change later, and it lets us compare apples to apples (eliminating the bias of new articles being more likely to have survived).

I'm afraid that our position in the discussion is undercut significantly if we can't come up with this stat and T149021 after ten days. The conversation is basically moving on without us.

Can we cut scope on the requests to get at least a preliminary idea? Maybe we don't have to count redirects, or something?

I know these are hard numbers to crunch, and it's not officially anyone's job, but we run the risk of looking like we don't know what we're talking about.

I'm afraid that our position in the discussion is undercut significantly if we can't come up with this stat and T149021 after ten days.

Well, I already put some work into this last week and on Wednesday/Thursday, but I'm afraid I have had many competing priorities. Sorry that this is taking a while, but the fact that T149021 has been open since October despite many different people working on it should illustrate that it's not entirely trivial to get this right.

The conversation is basically moving on without us.

It wasn't clear to me that this piece of data is central to our continued participation in the discussion. I was under the impression that the only reason we kept working on it (after choosing instead to use other data in the report that is more pertinent to the central question about patrollers' burdens) is that the community requested it. To be clear, that's of course a good reason to work on it (and as discussed with @kaldari the other day, I have actually set aside more time to work on it today). But I have put most of the time I had available in this area into completing T167471, also because it was a more direct follow-up to the report we published.

Can we cut scope on the requests to get at least a preliminary idea? Maybe we don't have to count redirects, or something?

After the work earlier this week, I'm actually comfortable for now with using the redirect detection as improved there (even though it's of course still an approximation, and ideally we would prefer T167396 to be resolved to get 100% exact results).
Considering that the community is - rightfully - attentive to potential sources of inaccuracies and that we got considerable pushback even for publishing perfectly valid stats (at least before explaining things in detail on the report talk page), I'm not comfortable with posting preliminary numbers in this discussion right now.

I know these are hard numbers to crunch, and it's not officially anyone's job, but we run the risk of looking like we don't know what we're talking about.

Where have we been talking about this topic (deletion rates) specifically? Perhaps we can already use other data or previous research to back up these statements, while wrapping up this task here (I'm happy to take a look). Also, again, it has been my impression that we are much more at risk of looking like we don't know what we're talking about if we present "preliminary ideas" that are misunderstood as hard data.

@kaldari the request was mainly put that way for a manageable timeframe after the creation of an article and past when the new user backlog tends to fall off.

Also, I'm not a technical person, but my concern with using the "revision is deleted" field is that it might also pick up significant contributions beyond just page creations that have been hidden via revision deletion for one reason or another (mainly copyright violations). If these were included, they might unintentionally skew the data that is reported.

Yes, good point, that's one of the several corner cases we need to check to make sure they don't impact the result too much. (I mentioned a few others on May 25, and, as mentioned above, looked at the issue of redirect creation edit summaries earlier this week.) We are not looking for a perfectly accurate result, but we need to make sure we know these sources of error and assess their potential impact.

Regarding this particular one, my initial plan is simply to look at a sample of entries from the revision deletion log and count (either manually or via a query) how many of these affect the initial revisions that we are relying on here. If that leads to the conclusion that the planned method won't be affected too much, we'll leave it as is. Let me know in case you have further thoughts about this.
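
For illustration, a minimal sketch of the kind of sampling described here, assuming direct access to a database replica. The connection settings, date range, and sample size are placeholders, and on newer replica schemas the log comment lives in a separate comment table; this is not the query actually used.

```
# Rough sketch only: pull a sample of revision-deletion log entries for manual
# spot checking, as described above. Connection settings are placeholders.
import pymysql

conn = pymysql.connect(database="enwiki_p", read_default_file="~/.my.cnf")  # placeholder

SAMPLE_QUERY = """
SELECT log_timestamp, log_namespace, log_title, log_comment
FROM logging
WHERE log_type = 'delete'
  AND log_action = 'revision'          -- revision-deletion entries
  AND log_timestamp >= '20161101000000'
ORDER BY log_timestamp
LIMIT 200
"""

with conn.cursor() as cur:
    cur.execute(SAMPLE_QUERY)
    for ts, ns, title, comment in cur.fetchall():
        # Each row is then spot-checked: does it affect the initial revision
        # of an article counted in this analysis?
        print(ts, ns, title, comment)
```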

I wrote a script to get some sample data and have posted the results at https://en.wikipedia.org/wiki/User:MusikAnimal_(WMF)/NPP_analysis

This is based off of the 9,201 articles created in the first week of November 2016 that were not redirects at the time of creation. I chose this date range because it is outside the current NPP backlog – meaning the pages have either been marked as reviewed or since deleted. This took a while to run (~2 hours), but it made me more confident in the numbers because through the log output I was able to do some spot checking. I did not compare the time of the deletion with the time of page creation, so I don't have an answer for the "percentage that were deleted within 90 days of creation". This is something I can do though, and will try to get to that next.

In this report, new articles are defined as any mainspace page where the first revision was within the date range, going off of the revision and archive tables and finding revisions where the parent ID is 0. This means any pages that were moved to the mainspace should be accounted for, assuming the initial revision was within the date range (first week of November). However, older redirects that became articles are not accounted for.
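
A rough sketch of that definition as a query (my reading of the description above, not the script's actual code; the date range and connection are placeholders, and replica access is assumed):

```
# Sketch of the "new article" definition described above: mainspace pages whose
# first revision (parent ID 0) falls in the date range, taken from both the
# revision and archive tables. Dates and connection are placeholders.
import pymysql

conn = pymysql.connect(database="enwiki_p", read_default_file="~/.my.cnf")  # placeholder

NEW_ARTICLES_QUERY = """
SELECT rev_page AS page_id, rev_id, rev_timestamp
FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 0
  AND rev_parent_id = 0
  AND rev_timestamp BETWEEN '20161101000000' AND '20161107235959'
UNION
SELECT ar_page_id, ar_rev_id, ar_timestamp
FROM archive
WHERE ar_namespace = 0
  AND ar_parent_id = 0
  AND ar_timestamp BETWEEN '20161101000000' AND '20161107235959'
"""

with conn.cursor() as cur:
    cur.execute(NEW_ARTICLES_QUERY)
    creations = cur.fetchall()

print(len(creations), "page creations in the sampled week")
```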

In summary:

  • Autoconfirmed users:
    • Created 6,726 (73%) of the 9,201 total
    • 890 (13%) of those they created were deleted and 784 (12%) were turned into a redirect
  • Non-autoconfirmed users:
    • Created 2,475 (27%) of the 9,201 total
    • 1,883 (76%) of those they created were deleted and 91 (4%) turned into a redirect

In determining if a user was autoconfirmed at the time of page creation, the script first checks if they were manually granted "confirmed" before the page was created. Then it does the math of checking the registration date and edit count at the time the page was created.
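
A minimal sketch of that check, assuming the inputs (any manual "confirmed" grant, registration time, and edit count at creation time) have already been gathered; the 4-day/10-edit thresholds are the standard English Wikipedia autoconfirmed criteria, and this is my reading of the description rather than the script itself.

```
# Minimal sketch of the autoconfirmed check described above. Inputs are assumed
# to be gathered separately (e.g. the "confirmed" grant from the logging table,
# registration time and edit count from revision/archive).
from datetime import datetime, timedelta
from typing import Optional

def was_autoconfirmed(creation_time: datetime,
                      registration_time: datetime,
                      edits_before_creation: int,
                      confirmed_granted_at: Optional[datetime]) -> bool:
    # A manual "confirmed" grant before the page was created counts immediately.
    if confirmed_granted_at is not None and confirmed_granted_at <= creation_time:
        return True
    # Standard enwiki autoconfirmed threshold: account at least 4 days old
    # and at least 10 edits at the time of page creation.
    account_old_enough = creation_time - registration_time >= timedelta(days=4)
    return account_old_enough and edits_before_creation >= 10

# Example with made-up values:
print(was_autoconfirmed(datetime(2016, 11, 3, 12, 0), datetime(2016, 10, 1), 27, None))  # True
```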

For determining if the page was a redirect at the time of creation, the script looks for #REDIRECT in the source of the first revision, and if the page was deleted, it instead looks for "Redirected" in the edit summary.
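
In code, the described logic amounts to roughly the following (a sketch, not the script itself; the anchored, case-insensitive regex is an assumption about how the check behaves, and as discussed below the edit-summary fallback is the weaker part):

```
# Sketch of the redirect check described above: look for #REDIRECT in the first
# revision's wikitext, and fall back to the edit summary for deleted pages
# whose content is unavailable.
import re
from typing import Optional

REDIRECT_IN_TEXT = re.compile(r"\A\s*#REDIRECT", re.IGNORECASE)

def was_redirect_at_creation(first_rev_text: Optional[str], edit_summary: str) -> bool:
    if first_rev_text is not None:
        return bool(REDIRECT_IN_TEXT.match(first_rev_text))
    # Deleted page, content not available: fall back to the edit summary.
    return "Redirected" in edit_summary

print(was_redirect_at_creation("#redirect [[Example]]", ""))         # True
print(was_redirect_at_creation(None, "Redirected page to Example"))  # True
```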

In regards to T166269#3336571, I think such pages are included in this report, but I'm not sure what the issue is. I don't think I ran into a case where the username was revision-deleted (or else it should have errored out), so I was able to tell if they were autoconfirmed. Revdel'd edits still show up in revision and archive, so computing their edit count shouldn't have been affected.

I'm probably overlooking something but I'm mostly confident in these numbers, and I'd be interested to see what we compute by other means!

I'm just curious to know what has changed that makes it so difficult to come up with nice clean graphs of a kind - the exact kind we need now - that a volunteer user was able to do so quickly for us at my request almost exactly 6 years ago to the day.

(In my understanding this task seems to be about running queries to get percentage data, not about potential visualization obstacles?)

Anything that helps present the information in pictures leads to a more rapid understanding, especially when offered with some analog display drawn on a representative time window. The figures presented by MusikAnimal cover only November 2016, which do not permit an important comparison with the curve shown here covering the period that gave rise to the current discussions, and the older stats here, which gave rise to ACTRIAL.

I wrote a script to get some sample data and have posted the results at https://en.wikipedia.org/wiki/User:MusikAnimal_(WMF)/NPP_analysis

That script solution looks great overall. I like that it avoids two limitations of the Data Lake approach (that we have been pursuing all along in T149021 and subsequently here) by using the revision and logging tables directly, e.g. to determine redirect status exactly by looking at the revision content.

This is based off of the 9,201 articles created in the first week of November 2016 that were not redirects at the time of creation. I chose this date range because it is outside the current NPP backlog – meaning the pages have either been marked as reviewed or since deleted. This took a while to run (~2 hours), but it made me more confident in the numbers because through the log output I was able to do some spot checking.

Yes, so performance is of course the downside of not using the Data Lake tables. FWIW, for the somewhat related query in T167471 I got a very similar time (2.5 hours for 10 days' worth of unreviewed articles in one run, and eventually 3.5 days for the entire backlog stretching over about six months).
...

In determining if a user was autoconfirmed at the time of page creation, the script first checks if they were manually granted "confirmed" before the page was created. Then it does the math of checking the registration date and edit count at the time the page was created.

This looks good, except that (unless I'm misreading the script) it seems to omit deleted edits when determining the user's edit count. These are part of the definition of autoconfirmed. I initially made the same mistake myself last week when working on T167471 until Aaron Halfaker pointed this out to me. It should be easy to correct in your code though, by doing a UNION with the archive table.
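
For reference, the kind of fix meant here is sketched below (placeholder names and connection; on newer replica schemas the rev_user_text/ar_user_text columns are replaced by the actor table):

```
# Sketch: count a user's edits *including deleted ones* before a cutoff time,
# by UNIONing revision with archive as suggested above. Placeholders throughout.
import pymysql

conn = pymysql.connect(database="enwiki_p", read_default_file="~/.my.cnf")  # placeholder

EDIT_COUNT_QUERY = """
SELECT COUNT(*) FROM (
  SELECT rev_id FROM revision
  WHERE rev_user_text = %(user)s AND rev_timestamp < %(cutoff)s
  UNION ALL
  SELECT ar_rev_id FROM archive
  WHERE ar_user_text = %(user)s AND ar_timestamp < %(cutoff)s
) AS all_edits
"""

with conn.cursor() as cur:
    cur.execute(EDIT_COUNT_QUERY, {"user": "Example user", "cutoff": "20161103120000"})
    (edit_count,) = cur.fetchone()
print(edit_count)  # live + deleted edits before the page creation
```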

For determining if the page was a redirect at the time of creation, the script looks for #REDIRECT in the source of the first revision, and if the page was deleted, it instead looks for "Redirected" in the edit summary.

OK, a few things:

  • Does this rely on #REDIRECT being capitalized like that? (This doesn't need to be the case - example.)
  • You mention in the comments that the regex function crashes for some pages - perhaps it's possible to avoid that by including the condition that the #REDIRECT has to be in the first line, as MediaWiki requires anyway.
  • It's a bit unsatisfying that in the case of deleted pages, it falls back to the less accurate method of testing the edit summary. Any reason we can't check the content directly there too? Obviously, using different counting methods for deleted and survived pages may affect the calculated deletion ratios directly.
  • In any case, if you are indeed falling back to testing the edit summary, looking for "Redirected" certainly isn't enough - per T149021#3331707, that only catches a small minority of redirects. You are welcome to use the more accurate regex developed at T149021#3331707, but again, that's still just an approximation, and if we can check the content directly (and have already paid much of the performance price ;), we should do so.

In regards to T166269#3336571, I think such pages are included in this report, but I'm not sure what the issue is. I don't think I ran into a case where the username was revision-deleted (or else it should have errored out), so I was able to tell if they were autoconfirmed. Revdel'd edits still show up in revision and archive, so computing their edit count shouldn't have been affected.

The concern there, as I understand it, is not about retrieving user names, but about cases where the initial revision was rev-deleted but the page (later revisions) is still live. Such pages would be mistakenly counted as deleted when relying on the revision_is_deleted field of the Data Lake table. (As mentioned, that may turn out not to matter much numerically, and also there is probably a slightly more complicated and slower way to do it - by using the page deletion events in the same table - that avoids the issue.)

I'm probably overlooking something but I'm mostly confident in these numbers, and I'd be interested to see what we compute by other means!

I haven't yet had time again this week to work further on the Data Lake solution; considering that we have now already put preliminary numbers in front of people, I may just post results of the "naive" Data Lake query here for now for comparison purposes, without yet working on the remaining checks and corrections.

I'm just curious to know what has changed that makes it so difficult to come up with nice clean graphs of a kind - the exact kind we need now - that a volunteer user was able to do so quickly for us at my request almost exactly 6 years ago to the day.

As @Aklapper already pointed out, visualizing the data is really not the problem here - rather, what's complicated is obtaining accurate data to visualize.

There are surely volunteers with excellent knowledge about this topic. Can you get that user to publish the SQL queries they mention on that page? ("The statistics were gathered via SQL queries to the toolserver database.")
As you can see from the discussion at T149021 and this task, we always try to share the underlying queries when discussing such analyses (in an open source way), and it helps a lot for spotting errors, learning from each other, etc.

In particular, I am curious how autoconfirmed status was determined in that analysis - as mentioned above, one needs to include deleted edits for that, and I wonder if the private tables containing them were available on the Toolserver.

And it would also be great to learn how that 2011 analysis handled the issue of redirects that has caused many of the complications worked on here and in T149021. As my colleague Neil pointed out several months ago in T161146, "filtering out redirects ... can significantly alter the conclusions" about deletion rates (linking to a whole academic paper from 2014 that focused entirely on complications that arise from redirects, and also lists, in the "Impact of redirects" chapter, concrete examples of previous Wikipedia research that had gone off the rails for not considering these complications). I'm not saying such mistakes were made in the 2011 analysis, and again, there are of course many community members who are familiar with the database tables in question. However, this is an area where it is easy to come up with quick but wrong data, so double-checking is valuable.

Thanks MusikAnimal, for getting these statistics. While I think a larger sample would be great too, it does give us an idea of what the survival rate is after all the pages have gone through NPP, which is helpful.

There are surely volunteers with excellent knowledge about this topic. Can you get that user to publish the SQL queries they mention on that page?

The user, who incidentally wrote some of the most useful tools ever for NPP, retired 4 years ago. I suppose you could try emailing him. Seems to me however that it is not so much topic knowledge that is required here as knowing how to do some data mining.

In determining if a user was autoconfirmed at the time of page creation, the script first checks if they were manually granted "confirmed" before the page was created. Then it does the math of checking the registration date and edit count at the time the page was created.

This looks good, except that (unless I'm misreading the script) it seems to omit deleted edits when determining the user's edit count. These are part of the definition of autoconfirmed. I initially made the same mistake myself last week when working on T167471 until Aaron Halfaker pointed this out to me. It should be easy to correct in your code though, by doing a UNION with the archive table.

You are correct! Indeed an easy fix. I'll include deleted edits with the next run.

For determining if the page was a redirect at the time of creation, the script looks for #REDIRECT in the source of the first revision, and if the page was deleted, it instead looks for "Redirected" in the edit summary.

OK, a few things:

  • Does this rely on #REDIRECT being capitalized like that? (This doesn't need to be the case - example.)

It is case-insensitive.

  • You mention in the comments that the regex function crashes for some pages - perhaps it's possible to avoid that by including the condition that the #REDIRECT has to be in the first line, as MediaWiki requires anyway.

This is for some edge cases where the wikitext itself is so large the library I'm using to parse it crashes. I tried to increase the limit, but instead I figured it was safe to assume such pages were not redirects.

  • It's a bit unsatisfying that in the case of deleted pages, it falls back to the less accurate method of testing the edit summary. Any reason we can't check the content directly there too? Obviously, using different counting methods for deleted and survived pages may affect the calculated deletion ratios directly.

I forgot that I could authenticate as MusikBot II which is an admin who can see deleted edits, so I will do that!

In regards to T166269#3336571, I think such pages are included in this report, but I'm not sure what the issue is. I don't think I ran into a case where the username was revision-deleted (or else it should have errored out), so I was able to tell if they were autoconfirmed. Revdel'd edits still show up in revision and archive, so computing their edit count shouldn't have been affected.

The concern there, as I understand it, is not about retrieving user names, but about cases where the initial revision was rev-deleted but the page (later revisions) is still live. Such pages would be mistakenly counted as deleted when relying on the revision_is_deleted field of the Data Lake table. (As mentioned, that may turn out not to matter much numerically, and also there is probably a slightly more complicated and slower way to do it - by using the page deletion events in the same table - that avoids the issue.)

I don't think this should be an issue here as the records should still exist in revision and archive. I don't actually know what happens when an edit is suppressed (oversight) though, but as I said my script should have failed if this happened because it's doing computations on fields that would return nil. I can add some checks to see which fields are missing and keep track of how often this happens, just out of curiosity.

I'm just curious to know what has changed that makes it so difficult to come up with nice clean graphs of a kind - the exact kind we need now - that a volunteer user was able to do so quickly for us at my request almost exactly 6 years ago to the day.

I agree with @Tbayer that it's difficult to say how reliable the stats this volunteer offered are since they did not go into any detail. I question if they accurately determined autoconfirmed vs non-autoconfirmed at the time of page creation, along with handling other edge cases, especially if they did this "so quickly" upon request, but I guess we'll never know because they evidently chose not to publish their code.

Anything that helps present the information in pictures leads to a more rapid understanding, especially when offered with some analog display drawn on a representative time window. The figures presented by MusikAnimal cover only November 2016, which do not permit an important comparison with the curve shown here covering the period that gave rise to the current discussions, and the older stats here, which gave rise to ACTRIAL.

Indeed the requirements here didn't mention anything about visualizing the data, but I can keep track of per-day data my script produces so that we could build a chart. Regarding the February-March curve you speak of, the script will tell us how many articles were created by autoconfirmed users vs non-autoconfirmed, but the "survival rate" statistic won't be that telling because many of the articles in that date range have still not been reviewed. Note also I'm currently looking at all pages created in a date range, including those created by autopatrolled users, so this data does not necessarily explain why the backlog grew so much.

We were asked for the "percentage that were deleted within 90 days of creation", so the February-March date range will suffice. So what I'm going to do is run the script again on this date range using the modifications Tbayer suggested, making note of how many articles were deleted within 90 days of creation. I can also make note of how many of these articles actually went into the backlog (users who were not autopatrolled), and finally record per-day statistics so we can build a chart. I think this covers everything. If I'm missing anything else, please speak up, because the script will take quite a while to run.
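
A minimal sketch of the 90-day criterion, assuming creation and deletion timestamps are already known:

```
# Sketch of the "deleted within 90 days of creation" check described above.
from datetime import datetime, timedelta
from typing import Optional

def deleted_within_90_days(created_at: datetime, deleted_at: Optional[datetime]) -> bool:
    if deleted_at is None:  # page still live (or deleted after the data was gathered)
        return False
    return deleted_at - created_at <= timedelta(days=90)

print(deleted_within_90_days(datetime(2017, 2, 15), datetime(2017, 3, 1)))  # True
print(deleted_within_90_days(datetime(2017, 2, 15), None))                  # False
```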

@MusikAnimal Thanks for working on this, and being so clear about the data sources and considerations. Can't wait to see the new run. :)

...

For determining if the page was a redirect at the time of creation, the script looks for #REDIRECT in the source of the first revision, and if the page was deleted, it instead looks for "Redirected" in the edit summary.

OK, a few things:

...

  • You mention in the comments that the regex function crashes for some pages - perhaps it's possible to avoid that by including the condition that the #REDIRECT has to be in the first line, as MediaWiki requires anyway.

This is for some edge cases where the wikitext itself is so large the library I'm using to parse it crashes. I tried to increase the limit, but instead I figured it was safe to assume such pages were not redirects.

To clarify, this was basically a suggestion to only run the regex over the first line instead of the whole page (as MediaWiki itself does in essence). This may also make things a bit faster.
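
Concretely, the suggestion amounts to something like this (illustrative only, not the script's actual code):

```
# Sketch: run the redirect regex over the first line only, rather than the
# full (possibly very large) wikitext.
import re

REDIRECT_RE = re.compile(r"#REDIRECT", re.IGNORECASE)

def first_line_is_redirect(wikitext: str) -> bool:
    first_line = wikitext.split("\n", 1)[0]
    return bool(REDIRECT_RE.search(first_line))

print(first_line_is_redirect("#Redirect [[Example]]\nvery long article text..."))  # True
```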

  • It's a bit unsatisfying that in the case of deleted pages, it falls back to the less accurate method of testing the edit summary. Any reason we can't check the content directly there too? Obviously, using different counting methods for deleted and survived pages may affect the calculated deletion ratios directly.

I forgot that I could authenticate as MusikBot II which is an admin who can see deleted edits, so I will do that!

Cool - it looks like you were already querying the archive table anyway, being on a private server, so authenticating as admin should not be necessary.

In regards to T166269#3336571, I think such pages are included in this report, but I'm not sure what the issue is. I don't think I ran into a case where the username was revision-deleted (or else it should have errored out), so I was able to tell if they were autoconfirmed. Revdel'd edits still show up in revision and archive, so computing their edit count shouldn't have been affected.

The concern there, as I understand it, is not about retrieving user names, but about cases where the initial revision was rev-deleted but the page (later revisions) is still live. Such pages would be mistakenly counted as deleted when relying on the revision_is_deleted field of the Data Lake table. (As mentioned, that may turn out not to matter much numerically, and also there is probably a slightly more complicated and slower way to do it - by using the page deletion events in the same table - that avoids the issue.)

I don't think this should be an issue here as the records should still exist in revision and archive. I don't actually know what happens when an edit is suppressed (oversight) though, but as I said my script should have failed if this happened because it's doing computations on fields that would return nil. I can add some checks to see which fields are missing and keep track of how often this happens, just out of curiosity.

Again, that was a discussion about the Data Lake approach and the use of revision_is_deleted there, not the script.

When I said 'quickly', I meant that Scottywong made those graphs in a matter of days rather than weeks or months. IIRC, he did it over a weekend using two computers and they ran all night. What we need to understand is that these requests are being made in relation to the state of the New Page Patrol process. Graphics do help, so MusikAnimal's stats would look nice (for the non-tech-minded volunteers) if presented as a pie chart. That said, however, as New Page Reviewing appears to reveal some unusual and changing patterns, the request to see some graphs over a June 2016 through June 2017 sample does not seem unreasonable.

To clarify, this was basically a suggestion to only run the regex over the first line instead of the whole page (as MediaWiki itself does in essence). This may also make things a bit faster.

This has definitely gone too far down the rabbit hole. I agree with Danny that we should just forget about accounting for redirects if it's going to make getting this data significantly difficult.

There are surely volunteers with excellent knowledge about this topic. Can you get that user to publish the SQL queries they mention on that page?

The user, who incidentally wrote some of the most useful tools ever for NPP, retired 4 years ago. I suppose you could try emailing him.

Actually it looks like he's only semi-retired, with several edits just last month. I left a comment on the talk page asking if he still has the queries.

Seems to me however that it is not so much topic knowledge that is required here as knowing how to do some data mining.

Not sure what this opinion is based on - and also, this task is not actually about data mining per se. The most time-consuming part for everyone involved here has been to get to know the specific database tables involved (I, personally, had not worked with them before, being a data analyst from the WMF Reading department who is normally focused on readership stats) and to figure out how to handle the various corner cases.

To clarify, this was basically a suggestion to only run the regex over the first line instead of the whole page (as MediaWiki itself does in essence). This may also make things a bit faster.

This has definitely gone too far down the rabbit hole. I agree with Danny that we should just forget about accounting for redirects if it's going to make getting this data significantly difficult.

The redirect condition is part of MediaWiki's basic article definition and accounting for redirects has been part of this investigation since last November (long before I jumped in to help out). As mentioned above (see the quote from Neil and the link there) one can go seriously astray when ignoring them.

As for that remark about the regex, it was just a small suggestion since I had been looking at the code anyway, also considering that the script (like my own, similar query in T167471) still has serious performance problems. No need to get hung up on it; the other issues discussed above are more important.

To clarify, this was basically a suggestion to only run the regex over the first line instead of the whole page (as MediaWiki itself does in essence). This may also make things a bit faster.

I think the content comes back as one big blob, with \n where new lines would be, but the regex I'm using is /^#REDIRECT/i so it's only looking at the very beginning.

I forgot that I could authenticate as MusikBot II which is an admin who can see deleted edits, so I will do that!

Cool - it looks like you were already querying the archive table anyway, being on a private server, so authenticating as admin should not be necessary.

I'm actually using the API to get the revision content, since that does not exist in the archive (or revision) table itself, correct? Rather it's on some external storage system. I don't know how to query for it but the API is fast enough, so all good.
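
For reference, fetching a revision's wikitext through the API looks roughly like this (a sketch using action=query with prop=revisions; the revision ID is a placeholder, and deleted revisions would need prop=deletedrevisions plus the corresponding rights):

```
# Sketch: fetch a revision's wikitext via the MediaWiki API, as described above.
# The revision ID is a placeholder.
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_revision_text(rev_id: int) -> str:
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API, params=params).json()
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

print(get_revision_text(123456789)[:80])  # placeholder revision ID
```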

This has definitely gone too far down the rabbit hole. I agree with Danny that we should just forget about accounting for redirects if it's going to make getting this data significantly difficult.

No worries, we've got it figured out. There are some very rare cases where it took the user a few edits to get the redirect to work (maybe misspelled the word "redirect", or forgot the #), but not much we can do about that.

I do not see any comment by Tbayer on Scottywong's en.Wiki talk page.
It might help to ask Halfak (WMF). He did major research into NPP at Patroller Work Load. He seems to know a lot about getting data. I believe he is one of the few WMF employees from that era who is still around today. More important than his research is the discussion on its talk page - but that affects a different current discussion.

Are we still working on this, or is it complete?

Are we still working on this, or is it complete?

For my part, as mentioned in T166269#3347536, I still intend to complete the Data Lake approach in the sense of posting the results it yields when applying all the refinements/corrections we have figured out so far. We can then compare with the slower but more precise script-based approach and decide whether it's justifiable to use the Data Lake results for a longer-term analysis (e.g. look at deletion rate trends over several years), and in any case update the Data Lake query later to the more precise redirect detection when T161146 is resolved.

I worked on this last week after the above discussion. Writing down the query was straightforward, but I then ran into a new problem with it that probably needs to be looked into by the analytics engineers - filed as T168497: Data Lake queries abort with HDFS write fail.

I do not see any comment by Tbayer on Scottywong's en.Wiki talk page.

It was actually left on the talk page of his analysis. Is that not on your watchlist? I also subsequently sent him an email in case the notification did not reach him.
There hasn't been a reply yet, so he may indeed be fully retired or busy at the moment. So, as discussed above, for now we should not rely too much on these 2011 stats until we know more about how they were generated and which definitions were used. In any case, as you correctly noted, they would need to be updated in order to draw conclusions about the situation in 2017.

It might help to ask Halfak (WMF). He did major research into NPP at Patroller Work Load. He seems to know a lot about getting data. I believe he is one of the few WMF employees from that era who is still around today. More important than his research is the discussion on its talk page - but that affects a different current discussion.

Yes, we already synced with Aaron (I myself on June 7, and I think Kaldari on other occasions); I had already mentioned some of his input above.

An important amendment to the data/computation I've done here. I just read the definition of editcount more carefully and found that it includes Flow posts [1]. Since these are not revisions, there could be a significant difference between the way I approximated historical edit counts and user_editcount. If there are people who become autoconfirmed by posting 10 times on Flow boards, I am underestimating the number of autoconfirmed users.

[1] https://www.mediawiki.org/wiki/Manual:User_table#user_editcount
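
A quick spot check of how large that gap can get for a given user might look like this (a sketch with placeholder names; on newer replica schemas the user-name columns are replaced by the actor table):

```
# Sketch: compare the stored user_editcount with the number of actual revisions
# (live + deleted) for one user. A large gap may indicate non-revision
# contributions such as Flow posts. Placeholders throughout.
import pymysql

conn = pymysql.connect(database="enwiki_p", read_default_file="~/.my.cnf")  # placeholder

QUERY = """
SELECT
  (SELECT user_editcount FROM user WHERE user_name = %(u)s) AS stored_count,
  (SELECT COUNT(*) FROM revision WHERE rev_user_text = %(u)s)
  + (SELECT COUNT(*) FROM archive WHERE ar_user_text = %(u)s) AS revision_count
"""

with conn.cursor() as cur:
    cur.execute(QUERY, {"u": "Example user"})
    stored, revisions = cur.fetchone()
print("user_editcount:", stored, "revisions (live + deleted):", revisions)
```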

Thanks @Milimetric - that's important to know! However I think it is unlikely to affect the results for the English Wikipedia, considering that (AFAIK) Flow was only enabled on two pages there before it was uninstalled - and from a quick skim these didn't see any posts from non-autoconfirmed users before being converted to wikitext. We should still keep this caveat in mind though in case this analysis is extended to other wikis in the future.

@Tbayer, Joseph is awesome and did a deep dive into this to make sure. He's got hard data here: https://phabricator.wikimedia.org/T168648, but the basic result is that your intuition is correct. The vast majority of wikis have user_editcount equal to cumulative_revision_count the vast majority of the time. There are exceptions though, which may be interesting to look at and consider.

He also broke down the analysis to look specifically at users with user_editcount < 10, so you can know how applicable that field is for this question, and that looks good too.

nshahquinn-wmf moved this task from Triage to Backlog on the Product-Analytics board.
kzimmerman closed this task as Resolved. Edited Apr 18 2019, 7:12 PM

@kaldari @DannyH I'm going to close this, but do let us know if it needs to be reopened. / cc @jmatazzoni @Niharika

@kzimmerman - Since Analytics was unable to provide any information for us, we did the work ourselves (per T166269#3346965). I think this ticket is a great example of Analytics' difficulty with providing Product with basic ball-park data to help with decision-making and community discussions due to Analytics' self-imposed requirements for perfect data (which is almost never possible in the context of Wikipedia). We need some sort of process where we can get Analysts to give us ball-park data quickly, rather than waiting months for reports that never materialize. This is one of the main reasons why PMs are asking for dashboards.

For the record, and speaking just for the Analytics Engineering team, our team's mission is creating and maintaining a platform for data collection and analysis [1]. As part of that, we sometimes create packaged datasets, which we try to make as stable as possible given the messy source data. Ballpark data was never and should never be our responsibility; there's a huge appetite for it, and it would've completely derailed our team. PMs, analysts, and anyone else can access our data and run queries like T166269#3346965. I thought that was clear from our many communications about this, but maybe the main problem is that we still don't fully understand each other's roles in the organization. Hopefully this is clear going forward, especially with @kzimmerman's wonderful team organizing the use of data at WMF.

[1] as described on our team page https://wikitech.wikimedia.org/wiki/Analytics

@Milimetric - Thanks for explaining that and thanks for your help on this. I'm glad we have a Product Analytics team now and hopefully requests like this will go smoother in the future.

@kaldari I get that you are frustrated, but I find the hostility of your reply unwarranted. I made the very same point that @Milimetric just did 1.5 years ago when ACTRIAL started; it should not be a surprise that a team tasked with providing a huge part of our infrastructure cannot attend to custom ad-hoc data requests. It is not possible to do those two things and do them well. We can help with data requests, and we do daily, but it is not the main responsibility of Analytics Engineering.

Also, it is important to point out that dashboards already exist (many improvements needed, of course; created by PMs, analysts, and others). In case you are not aware of these, please see: https://superset.wikimedia.org/slicemodelview/list/ Also, as of this month you can visually explore edit data; see, for example, pages created by autoconfirmed users in all namespaces on English Wikipedia: https://bit.ly/2ITuL8J (you need LDAP authentication to see either)

@Nuria, @Milimetric - Crap, I didn't mean to sound hostile. I love the Analytics Engineering team! Y'all are literally the smartest people I know. Right before I wrote that comment, I had had a 1:1 meeting with Kate which consisted mainly of me explaining the pain points that Product has around data analysis. I was pointing out this ticket to her as an example of cases where product has a hard time getting basic ball-park data to help with decision making and community discussions (since it was directly related to our discussion). It wasn't meant as a criticism of the Analytics Engineering team. I'm sorry if the lack of context made it seem that way! My sincere apologies.

@kaldari Thank you for providing context! Really! Let's be in closer touch about other things you think you need so we can help you better.

I didn't take much offense, no apologies necessary :) I was just making sure things are clear and all. And obviously I very much appreciate the work that @kaldari is doing and as always happy to help in any way we can.
