As part of an ongoing effort to improve realtime performance and perceived lag in realtime collaboration we applied some experimental optimizations to our beta environment.
The optimization consisted in refactoring websocket notification code. This refactored code contained a bug that was not found during testing environment previous tests due to a very low probability edge case scenario.
In order for our tests to be as accurate as possible our realtime notification subsystem in the beta environment that was used for the test was linked to our production pub/sub cluster, resulting full-cluster broadcasts to span to production subscribers.
8:10 PM GMT - We merged the offending code containing the defect described above. This merge triggered an automated rolling deployment to our beta environment.
8:23 PM GMT - The API rolling deployment completed in the beta environment effectively deploying the defective code to every beta API server
9:16 PM GMT - We executed an emergency rollback of our beta API, restoring every server to the previous version without the defective code.
• We fully rolled back the defective code both in our beta servers and our source control
• We included negative broadcast tests to every regression test batch
• We are working on automated regression tests to assert notification broadcasts are not possible
• We will sever the link between our pub/sub cluster in beta and production to prevent any beta code from ever again affecting production collaboration sessions\
No test content was ever persisted in production murals. Our websocket notification subsystem only reports about changes in the database but never performs any change to persistent data. As a result of the defective broadcast of (what should have been multicast) notifications, production collaboration sessions received notifications about test changes in our test mural. These notifications were volatile and a mural reload (page refresh in the web app, or leave and enter the mural again in native apps) fixed the issue.
No unauthorized anonymous user joined the murals. When the mural clients received the extraneous notifications, incorrectly assumed those notifications were sent by an anonymous user.
This is the reason some anonymous avatar appeared in the realtime collaboration sessions. These avatars did not represent a connected user but the client code incorrectly mapping unknown notifications to anonymous users.
No user content was leaked to other users or MURAL employees. The defective code only affected beta notification servers and only MURAL employees have access to beta environments. The content broadcast was produced as part of our internal load test. Because user notifications are issued by production realtime servers not affected by the defect, no user content was ever broadcast or otherwise exposed.
Both the correct multicast and the defective broadcast operations are directed from a notification producer (the notification server) to mural collaborators (in multicast) and to every connected user (in broadcast). During the incident the only notification producers capable of issuing defective broadcasts were the beta notification servers, limiting the scope of broadcast messages to only messages produced internally as part of our load test.