Essay

RfA trend line haruspicy: fact or fancy?

This user essay, originally titled "RFA trend lines", was started in 2020. You may edit it, but please do so on the original page and not The Signpost.

Trends in support percentage during a request for adminship are rarely informative, and these trends are difficult to interpret even when they might be informative.

As a first-order approximation, let's assume there's an RfA where no new information comes to light over the course of the request and everyone !votes independently of each other. In this case, if we were to poll every Wikipedian, there would be some global, unobserved support percentage for the population; call it p. Given an RfA with n participants, each !vote can be considered a Bernoulli trial with success probability p. The number of supports, s, at any given time is the sum of the results of those Bernoulli trials, so it can be modeled as a binomial distribution with n trials and probability p.
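To make the model concrete, here is a minimal sketch using R's built-in binomial functions. The values n = 150 and p = 0.70 are illustrative assumptions, not figures from any real RfA.

```r
# Illustrative assumptions: 150 participants, 70% population support
n <- 150
p <- 0.70

# Expected number of supports and its standard deviation under Binomial(n, p)
expectedSupports <- n * p
sdSupports <- sqrt(n * p * (1 - p))

# Central 95% range for the final support percentage under the model
supportRange <- qbinom(c(0.025, 0.975), size = n, prob = p) / n

# One simulated RfA: each !vote is a Bernoulli trial (1 = support, 0 = oppose)
votes <- rbinom(n, size = 1, prob = p)
s <- sum(votes)

print(expectedSupports)
print(round(sdSupports, 2))
print(supportRange)
print(s / n)
```

The range printed by `supportRange` gives a rough sense of how far the observed final percentage can sit from p through sampling noise alone.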

RfAs run for multiple days and are among the most attended discussions on the project; this suggests that the final support percentage is a reliable stand-in for the population support percentage. By contrast, the trend line tells us almost nothing and may in fact be misleading. Our binomial model is the same one we would use to model the ratio of heads to tails in successive coin flips. Imagine we are going to flip a coin for a contest and we want to prove that the coin we are flipping is fair. We flip it 150 times and track the number and order of heads and tails. After 150 coin flips, the ratio of heads to tails would be very informative: if it is far away from a 50% split then the coin is not fair. The order in which these flips occur, however, is uninformative, and in fact, using it as evidence for an argument is a logical fallacy known as the gambler's fallacy.
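The coin-flip point can be illustrated with a short sketch: the same 150 flips, taken in two different orders, trace different running percentages but end at exactly the same value. The seed and the helper name `runningMean` are assumptions made for this illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

flips <- rbinom(150, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
shuffled <- sample(flips)                   # same flips, different order

# Running proportion of heads after each flip (hypothetical helper)
runningMean <- function(x) cumsum(x) / seq_along(x)

trendOriginal <- runningMean(flips)
trendShuffled <- runningMean(shuffled)

# The final proportion is identical; only the path to it differs
print(trendOriginal[150] == trendShuffled[150])

# An exact binomial test of fairness uses only the count of heads, not the order
fairness <- binom.test(sum(flips), length(flips), p = 0.5)
print(fairness$p.value)
```

Plotting `trendOriginal` and `trendShuffled` on the same axes shows two different-looking lines built from identical evidence about the coin.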

Our first-order approximation of RfA trend lines represents a hypothesis regarding !voting behavior. Absent evidence to the contrary, we assume editors review the candidate and comment independently of others, just as the result of a coin flip does not depend on prior results. But an RfA is not a series of independent tests. The information available to a !voter includes not only other comments, but also new answers to questions and summary statistics like the current support percentage. These can consciously or unconsciously affect how a participant !votes, and they justify an alternate hypothesis: each !vote is related to the ones that came before it (and maybe even after it). If the population support percentage, p, doesn't change, then this distinction is immaterial to our model.

Reconsider the coin flip example: if the probability of getting heads depends on the previous result such that getting a heads changes the probability from 50% to 50% (i.e., no change), then the dependent model and the independent model will produce exactly the same results. Differences only arise if the dependence changes the underlying probability. In statistical terms, we can say that the binomial distribution is robust against violations of the independence assumption as long as the sample size is much smaller than the population. For example, let's assume that getting a heads increased the likelihood of getting another heads. In that situation our independent trial model will be accurate at first but will become less accurate as we add trials, since the non-independence keeps compounding and makes heads more and more likely. Bringing this back to RfA, the influence of prior !votes on later ones is not a serious threat to the binomial (independent trial) model. It would only affect our model if there were thousands of !voters or if there was a major shift in the underlying probability.
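The compounding case can be sketched as follows. The function `simulateFlips` and its `bump` parameter are hypothetical, introduced only for this illustration: every heads raises the chance of heads on all later flips by `bump`, and `bump = 0` reproduces the independent model.

```r
# Hypothetical dependent-flip model: each heads raises the probability of
# heads on all later flips by `bump` (capped at 1). bump = 0 is independent.
simulateFlips <- function(nFlips, bump, pStart = 0.5) {
  flips <- integer(nFlips)
  p <- pStart
  for (i in seq_len(nFlips)) {
    flips[i] <- rbinom(1, size = 1, prob = p)
    if (flips[i] == 1) {
      p <- min(1, p + bump)
    }
  }
  flips
}

set.seed(2)  # arbitrary seed
independent <- simulateFlips(150, bump = 0)
dependent <- simulateFlips(150, bump = 0.01)

# Share of heads in the first 30 flips and in the last 30 flips
earlyLate <- function(x) c(early = mean(x[1:30]), late = mean(x[121:150]))
print(earlyLate(independent))
print(earlyLate(dependent))
```

With a small `bump`, the early flips of the two models should look similar, and any divergence tends to show up only as the trials accumulate.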

Editors look at trend lines because they believe that (or want to evaluate whether) earlier !votes influenced later ones to such an extent that a major shift occurred in the underlying probability. Considering that !votes are non-independent, this intuition makes sense but is flawed. Essentially, this is a model selection problem, and the starting assumption ought to be the null hypothesis. As discussed above, this means that without evidence, we should assume that the order of !votes is not meaningful, just like the order of coin flips. Claiming that a coin is unfair because of the order of heads and tails is fallacious, so we cannot reject the null hypothesis on the basis of the trend line alone; we need some other kind of evidence. What is critical to understand in the context of RfA is that the trend line cannot tell us whether a change in the underlying support percentage occurred; it is only useful if we already assume that a change happened, and even then it can only help us determine when.
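To illustrate the "when" question, here is a minimal sketch of one possible approach, a maximum-likelihood search for a single switch point. This is an assumption of the sketch, not a method the essay prescribes, and `estimateSwitch` is a hypothetical helper. Note that it returns a split for any input, including data with no change at all, which is why it cannot show that a change occurred.

```r
# Assumes exactly one change in the support probability occurred.
# For each candidate split k, fit separate support rates before and after k
# and keep the split with the highest Bernoulli log-likelihood.
estimateSwitch <- function(votes) {
  n <- length(votes)
  bernLogLik <- function(x) {
    pHat <- mean(x)
    if (pHat == 0 || pHat == 1) return(0)  # perfect fit: likelihood of 1
    sum(x) * log(pHat) + sum(1 - x) * log(1 - pHat)
  }
  candidates <- 1:(n - 1)
  scores <- sapply(candidates, function(k) {
    bernLogLik(votes[1:k]) + bernLogLik(votes[(k + 1):n])
  })
  candidates[which.max(scores)]  # last !vote before the estimated switch
}

set.seed(3)  # arbitrary seed
votes <- c(rbinom(90, 1, 0.76), rbinom(60, 1, 0.60))
print(estimateSwitch(votes))
```

With probabilities as close as 76% and 60%, the estimate can land far from the true switch, so it is best read as a rough guide even when a change is already known to have happened.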

Like any hypothesis testing tool, a trend line is only useful if we already have a hypothesis. Unless there is an independent reason to believe the information available to participants has changed, the trend line is most likely to reflect randomness in the sample rather than a meaningful pattern. Without a rational argument as to why early !voters did not have the same information as late !voters, an argument from trend-line data is weak.
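How often randomness alone produces an apparent trend can be checked with a simulation of the null model. The sketch below assumes a constant support probability of 0.70, 150 !votes per RfA, and 2000 simulated RfAs; all three values are illustrative.

```r
set.seed(4)  # arbitrary seed
nVotes <- 150
p <- 0.70       # illustrative constant support probability
nSims <- 2000   # number of simulated RfAs

# Each column is one simulated RfA in which p never changes
votes <- matrix(rbinom(nVotes * nSims, 1, p), nrow = nVotes)

# Supports among the first 50 !votes and among the last 50 !votes
earlyCount <- colSums(votes[1:50, ])
lateCount <- colSums(votes[101:150, ])

# Share of simulations where support "fell" by 10 points or more
# (5 fewer supports out of 50) purely by chance
apparentDrop <- mean(earlyCount - lateCount >= 5)
print(apparentDrop)
```

Any nonzero share here corresponds to RfAs whose trend line appears to decline even though nothing about the underlying support changed.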

Example

A simulated RfA with 150 !votes. Can you tell where the underlying support percentage changed?
The accompanying image shows a trend line for the support percentage in a simulated RfA that ended within the discretionary range. It is a series of 150 Bernoulli trials, but at some point the underlying probability of support changed from just above the 75% threshold for an outright pass (76 percent) to well below the 65% threshold for an outright fail (60 percent). The point at which this change occurred is difficult to determine from the trend line alone, and in fact the graph looks like other simulations where the underlying support percentage was above the discretionary range the entire time. The change in probability took effect at the 90th !vote, yet the trend line alone offers little evidence of it. These simulations can be replicated (in spirit, since it's a random simulation) using the following R code:
# Config variables
N = 150 # How many !votes to simulate
switchPoint = 90 # At what vote should the probability switch
p.start = 0.76 # Probability of support before switchPoint
p.end = 0.6 # Probability of support after switch point

# Data lists
voteList = c()
meanSeries = c()

# Simulation
for(i in 1:N) {
  if ( i < switchPoint ) {
    p = p.start
  } else {
    p = p.end
  }
  voteList[i] = rbinom(1,1,p)
  meanSeries[i] = mean(voteList)
}

# Plot the result
plot(1:N,meanSeries,xlab='!vote number',ylab='Support percentage',type='l')