Real-time collaboration issues detected

Incident Report for Mural

Postmortem

The degradation of the real-time collaboration service on Oct 30th was caused by a self-inflicted DDoS: our WebSocket authentication protocol retried indefinitely whenever the authentication handshake timed out. The trigger was high server-side response time, itself the result of high load on our servers during a slow deployment process, which made handshake timeouts occur more often. Each timed-out client retried immediately, adding more load and causing further timeouts. We fixed the issue by improving the authentication protocol to tolerate more variation in server-side response time and by adding an "exponential backoff" retry strategy to prevent a self-inflicted DDoS in the future.
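As an illustration only, a retry loop with capped exponential backoff might look like the Python sketch below. It is not the deployed fix: the function names, the base delay, the cap, and the attempt limit are hypothetical values chosen for the example, and the random jitter is an addition not mentioned above, included because it keeps many clients from retrying at the same instant.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a randomized delay in seconds for a zero-based attempt number.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...)
    and is clamped to `cap`. The actual delay is drawn uniformly from
    [0, bound] so that clients do not all retry together.
    """
    source = rng or random
    bound = min(cap, base * (2 ** attempt))
    return source.uniform(0, bound)


def authenticate_with_backoff(handshake: Callable[[], bool],
                              max_attempts: int = 6,
                              base: float = 0.5,
                              cap: float = 30.0,
                              sleep: Callable[[float], None] = time.sleep,
                              rng: Optional[random.Random] = None) -> bool:
    """Try `handshake` up to `max_attempts` times, backing off between tries.

    `handshake` is a hypothetical callable that returns True on success and
    raises TimeoutError when the server does not answer in time.
    Returns False once the attempts are exhausted instead of retrying forever.
    """
    for attempt in range(max_attempts):
        try:
            if handshake():
                return True
        except TimeoutError:
            pass  # treat a timeout like any other failed attempt
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, base, cap, rng))
    return False
```

The two properties that matter for this incident are the bounded number of attempts and the growing wait between them: a slow server sees fewer retries per client over time rather than a constant stream.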

Posted Nov 05, 2018 - 11:05 GMT-03:00

Resolved

This incident has been resolved.
Posted Oct 30, 2018 - 14:15 GMT-03:00

Monitoring

Issue was identified and fixed. We continue to monitor the behavior of the web application.
Posted Oct 30, 2018 - 14:14 GMT-03:00

Identified

The WebSocket authentication service is down. Users cannot edit murals. We are working on a fix.
Posted Oct 30, 2018 - 12:30 GMT-03:00

Investigating

We're investigating problems with real time collaboration in murals.
Posted Oct 30, 2018 - 12:22 GMT-03:00
This incident affected: Mural Application (Canvas).