Page MenuHomePhabricator

Spike of "unknown" errors experienced by users of UploadWizard after wmf.7 deployment
Closed, ResolvedPublic

Description

I just checked our error logging data, and there is a new type of error, "unknown", coindicing with the deployment of 1.28.0-wmf.7. This seems to correspond to API errors with "undefined" error code.

pasted_file (820×1 px, 93 KB)

pasted_file (820×1 px, 106 KB)

Unlike other weird spikes visible on this graph earlier, this doesn't just affect a single user.

Event Timeline

The exact, whole API response is {"error":{"info":""}}, which doesn't really help narrow down the problem…

I briefly reviewed all core changes between wmf.6 and wmf.7 and don't see anything obvious that would cause this. @aaron @Anomie Any ideas perhaps? This weird {"error":{"info":""}} response is produced for API requests with action=upload&stash=1 (unfortunately from the data I have I can't tell if they are async, or chunked, or otherwise special, and I was not able to reproduce).

That really shouldn't happen. Do you have the API request that caused that response?

The only thing that I can think of that would produce that kind of response (besides something explicitly making that response for some strange reason) would be if something passed false as the code into UsageException (likely via ApiBase::dieUsage()). But at a glance I don't see anything that would be likely to do that sort of thing; if nothing else, we could add some validation into UsageException::__construct() and log with a backtrace if it's getting a non-string for $codestr.

That really shouldn't happen. Do you have the API request that caused that response?

No. Assuming we don't manage to solve it like our grandparents would have, by thinking about it really hard, adding some logging will probably be the next step. (For some time now I've been thinking that logging some information about all API requests that return an error response would be really convenient…)

I think server-side logging would be a lot easier, since from the client-side we'd have to figure out some way of uploading a copy of the problematic file somewhere…

Ideally we'd add logging and redeploy it at least long enough to get
backtraces.

Ideally we'd add logging and redeploy it at least long enough to get
backtraces.

Let's do that. Who's on point?

greg triaged this task as Unbreak Now! priority.Jun 27 2016, 7:45 PM
greg added a project: UploadWizard.

For the record, this now blocks further rollout of wmf.7. The other blocker (save-timing regression) is hopefully now addressed.

@Anomie / @matmarex: I don't care who, but someone should move us forward here to not block all other developers.

For the record, this is not a UBN for Multimedia, and I don't consider it a blocker to wmf.7 deployment. I do not have time to work on this right now. (I did not expect anyone to care about UploadWizard problems, that's a first…)

greg lowered the priority of this task from Unbreak Now! to High.Jun 27 2016, 9:06 PM
greg added a subscriber: dr0ptp4kt.

For the record, this is not a UBN for Multimedia, and I don't consider it a blocker to wmf.7 deployment.

UBN! -> High.

Also, I missed that it wasn't you who made this a blocker for wmf.7, it was @dr0ptp4kt.

Adam: reasoning?

I do not have time to work on this right now. (I did not expect anyone to care about UploadWizard problems, that's a first…)

When things are marked as blocking the deployment, I care greatly :)

Given the conversation in multiple channels on IRC I'm inclined to rescind my proclamation and remove this as a blocker, unless @dr0ptp4kt has other reasoning.

Change 296353 had a related patch set uploaded (by Anomie):
UsageException: Validate that $codestr is a non-empty string

https://gerrit.wikimedia.org/r/296353

Change 296353 merged by jenkins-bot:
UsageException: Validate that $codestr is a non-empty string

https://gerrit.wikimedia.org/r/296353

Change 296357 had a related patch set uploaded (by Bartosz Dziewoński):
UsageException: Validate that $codestr is a non-empty string

https://gerrit.wikimedia.org/r/296357

Change 296357 merged by jenkins-bot:
UsageException: Validate that $codestr is a non-empty string

https://gerrit.wikimedia.org/r/296357

Mentioned in SAL [2016-06-28T00:26:24Z] <dereckson@tin> Synchronized php-1.28.0-wmf.7/includes/api/ApiMain.php: UsageException to try to catch T138585 issue (duration: 00m 27s)

I had @Anomie's patch backported to wmf.7, but it doesn't seem like anything is getting logged. Or I'm searching logstash wrong. Anomie, can you verify?

And the errors keep appearing in our logs:

mysql:research@dbstore1002.eqiad.wmnet [log]> select left(timestamp, 8), count(*) from UploadWizardErrorFlowEvent_11772725 where event_code='api/error/undefined' group by left(timestamp, 8);
+--------------------+----------+
| left(timestamp, 8) | count(*) |
+--------------------+----------+
...
| 20160518           |        4 |
| 20160522           |        3 |
| 20160528           |        1 |
| 20160530           |        2 |
| 20160609           |        1 |
| 20160615           |        2 |
| 20160617           |        1 |
| 20160622           |       38 |
| 20160623           |      351 |
| 20160624           |      233 |
| 20160625           |      219 |
| 20160626           |      461 |
| 20160627           |      444 |
| 20160628           |      171 |
+--------------------+----------+
21 rows in set (0.28 sec)

mysql:research@dbstore1002.eqiad.wmnet [log]> select left(timestamp, 8), count(*) from UploadWizardErrorFlowEvent_11772725 where event_code='unknown' group by left(timestamp, 8);
+--------------------+----------+
| left(timestamp, 8) | count(*) |
+--------------------+----------+
...
| 20160509           |        4 |
| 20160516           |        2 |
| 20160518           |        4 |
| 20160522           |        3 |
| 20160528           |        1 |
| 20160530           |        2 |
| 20160609           |        1 |
| 20160615           |        2 |
| 20160617           |        1 |
| 20160622           |        1 |
| 20160623           |      197 |
| 20160624           |       89 |
| 20160625           |      101 |
| 20160626           |      273 |
| 20160627           |      257 |
| 20160628           |       80 |
+--------------------+----------+
241 rows in set (0.26 sec)

Some more investigation…

  • The reason we see this now is actually probably a recent commit in UploadWizard, that I forgot about entirely (35d0724e0dcc5abf8ac7a7222b81019c48d05abf). So probably a false alarm about wmf.7 problems. But the errors themselves are real (the difference is that they're not new; they were always there, we just didn't know about them).
  • Many of the 'unknown' errors look to be produced by broken proxy servers between us and the user that clearly can't handle file uploads. I'm not pasting them here because some look rather unique, but anyone with access to stat1003 can see for themselves. We should be detecting this and displaying a better error message, I guess…
  • I still have no idea what the deal with 'api/error/undefined' is. I guess something could be producing {"error":{"info":""}} somewhere in our JS, but I don't see anything that would.

I'm not seeing anything logged for the new exception either.

Looking through your JS, I see one issue with error logging in 35d0724e, but that doesn't seem likely to cause this issue.

I also see in mw.FormDataTransport.prototype.parseResponse that, if the JSON parsing fails for some reason, you synthesize a response as { error: { code: evt._target.code, info: evt._target.responseText } }. If I understand things correctly, evt._target is the XMLHttpRequest object, which has no code property, which may well be where your "undefined" is coming from. If the JSON decoding failed because evt._target.responseText is empty, that would synthesize the {"error":{"info":""}} that you're seeing.

Yes, that looks obvious now. I thought that a JS object like { code: undefined } would serialize to JSON as { "code": undefined }, not just {}.

Change 296429 had a related patch set uploaded (by Bartosz Dziewoński):
Report fake API errors with saner error codes

https://gerrit.wikimedia.org/r/296429

With this patch, the errors should be logged as 'parsererror'.

Change 296429 merged by jenkins-bot:
Report fake API errors with saner error codes

https://gerrit.wikimedia.org/r/296429

  NODES
Idea 4
idea 4
Note 3
Project 7
USERS 2
Verify 1