⚠️ ⚠️ Active Incident Update ⚠️ ⚠️

Three days of downtime is fully unacceptable for a community that puts literal dozens of hours of prep, planning, and scheduling into each session. Considering how long this has gone on, I’d like to give everyone more information, from the beginning through to our current status.

Friday

We received reports around 12:30 AM EST that uploads weren’t working. I responded within the hour, alerted the necessary staff, and began diagnosing the issue. Our image processor for uploads had gone down. It’s a self-healing system, but neither the self-healing nor my own intervention could bring the service back online. We found that other related services were also down, and I began mapping out the affected services and systems.

By Friday night, we had roped in everyone we needed at the time and had alerted previous system owners to be ready to assist soon.

Saturday

Our partners provided support personnel for the incident, who began doing triage. Somewhere along the line, support personnel were able to get image processing back online.

We found that several critical backend systems, including our ability to deploy fixes, had failed prior to the incident. We hadn’t noticed these failures because no changes had been released recently. During my follow-up triage, I informed all responders that our services were still marked as below minimum availability and that the incident wasn’t resolved.

Today

At 9:10 AM EST, our site health checks and backend alerting automation sent out emergency alerts that the site was down. During an automated self-healing attempt, critical backend services went offline one after another, then tried to come back online. This caused a cascading failure that we are still trying to recover from.

We’re still on the case. I can’t provide an ETA with any certainty. Please don’t hesitate to @ me with any questions, concerns, or even just to vent. We sincerely apologize to everyone who had their games affected this weekend.

Sketch (Quality Cleric)🛡 — Yesterday at 1:29 PM

⚠️ ⚠️ Active Incident Update ⚠️ ⚠️

Hello all, there’s been some progress on the current situation. We’ve isolated the problem to two possible sources, and we’re working on resolving the issues entirely.

First, we still don’t have an ETA. If you have sessions planned for tonight or tomorrow, you may want to consider rescheduling or finding a temporary alternative while we figure this out. We hope to be up soon, but right now it’s difficult to tell when we’ll be back online.

As far as we understand, none of the issues we’re experiencing has put any data at risk. The problem lies in our ability to bring services back up again; the databases remain healthy and are not at risk. Once we’re back online, the infrastructure will simply reconnect everything and we can continue as normal.

Personnel not working on getting the site back up are working on a better error page, so that visitors are directed to this Discord for updates rather than just being met with 500 errors. In the meantime, please don’t hesitate to reach out for updates, questions, or concerns.