Web application unavailable
Incident Report for MURAL
Postmortem

The issue was caused by internal administrative issues at our cloud infrastructure provider, Microsoft Azure.

In Q4 2018 we started working with Microsoft Azure to upgrade our existing contract from a Pay-As-You-Go model to an Enterprise Agreement. This new form of contract includes several upgrades that would benefit our services and, thus, our customers.

In all our conversations with Microsoft, it was made explicit that moving our existing Azure subscriptions from the existing billing mode to the Enterprise Agreement would entail no service interruption whatsoever. They confirmed this numerous times.

However, because of their internal processes, they mistakenly identified a critical part of our resources as not belonging to our new billing groups and went on to pause those resource groups when making the transition. Microsoft marked the resources as non-billable, and this had two consequences:

  1. All resources in these groups were immediately stopped by their automated billing processes.
  2. All access from our team to these resource groups was downgraded to read-only: we could only “see” the stopped resources, and the Microsoft tooling prevented us from performing any significant action.

On top of that, one of the components in this group was our production database servers. So our production database was effectively, and erroneously, stopped by Microsoft Azure without prior warning, even after they had assured us several times that we would have no service disruption at all.

This situation effectively prevented our team from putting our Business Continuity Plan into effect: we were locked out of our database resources, including our backups, so our only option was to work with Microsoft to re-enable access.

Our points of contact within Microsoft were unable to resolve this situation quickly. We opened ticket after ticket, escalated everything we could, and reached out through all of our networks, both public and personal, in order to get hold of someone inside Microsoft who could actually help.

Once we finally got hold of a qualified decision maker inside Microsoft, the issue was resolved almost instantly, and we were able to bring the platform back up in a matter of minutes.

The incident took so long to resolve because of internal Microsoft bureaucracy: they effectively locked us out of our accounts and were unable to let us back in to solve the issue.

Posted Apr 22, 2020 - 18:02 GMT-03:00

Resolved
We were able to fix issues caused by our infrastructure provider. All systems are now green. Sorry for the inconvenience.
Posted Feb 16, 2019 - 23:51 GMT-03:00
Identified
We have identified the source of the problem and are working on a fix.
Posted Feb 16, 2019 - 20:42 GMT-03:00
Investigating
We're investigating reports of the web application being inaccessible for some users.
Posted Feb 16, 2019 - 18:54 GMT-03:00
This incident affected: Canvas.