Here is what happened today. The NA platform went crazy and it was totally preventable.
TL;DR: We missed an error that showed up in the logs while restarting the platform after the deploy. Not fixing that error in time led to instability hours later.
The longer version:
The platform has static and dynamic configuration values. Static values are stored in a database and persist between restarts of the platform; dynamic values are set while the platform is running.
When we started the platform today, the static values did not load. Our automated deploy process failed to detect this particular error condition because it occurs only once in a blue moon.
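One way to catch this class of failure is to treat missing static values as fatal at startup instead of letting them surface under load hours later. Here is a minimal sketch of that idea; the table name, config keys, and database layout are all hypothetical, not our actual schema:

```python
import sqlite3
import sys

# Hypothetical set of static keys the platform cannot run without.
REQUIRED_KEYS = {"max_sessions", "queue_depth", "rate_limit"}

def load_static_config(db_path):
    """Load static configuration values from the database into a dict."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT key, value FROM static_config").fetchall()
    finally:
        conn.close()
    return dict(rows)

def validate_or_abort(config):
    """Fail fast at startup if required static values are missing,
    rather than going live with an incomplete configuration."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        print(f"FATAL: static config keys missing: {sorted(missing)}",
              file=sys.stderr)
        sys.exit(1)
    return config
```

A check like this turns a quiet log line into a hard deploy failure, which is exactly the kind of error we want to surface before the platform goes live.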
When the platform went live several hours later with important configuration values missing, it buckled under load. We attempted to stabilize it by reducing the load on the platform, taking the following steps:
1. Stopping additional players from logging in
2. Turning off all queues
These steps let players who were already logged in stay on the platform, rather than dropping everyone and causing a huge login queue.
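The two mitigation steps above amount to flipping runtime flags while leaving existing sessions alone. A minimal sketch of that shape (the class and flag names are hypothetical, not our actual code):

```python
class LoadShedder:
    """Reduce platform load without dropping players who are already connected."""

    def __init__(self):
        self.logins_enabled = True
        self.queues_enabled = True

    def shed_load(self):
        # Step 1: stop additional players from logging in.
        self.logins_enabled = False
        # Step 2: turn off all queues.
        self.queues_enabled = False

    def allow_login(self, already_connected):
        # Existing sessions are untouched; only brand-new logins are rejected.
        return already_connected or self.logins_enabled
```

The key design point is the `already_connected` check: shedding load this way avoids the mass disconnect that would otherwise create a huge login queue on its own.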
We did load all of the missing configuration values, but the errors in the logs did not give us confidence that the platform was stable, so we restarted it. We recognize that this was a preventable incident. We are taking the following steps to prevent similar incidents in the future:
1. Find an engineer who will be made a scapegoat and make him do pushups till he collapses. (This might be hard as all of us on the team are responsible.)
2. Improve error detection during a deploy.
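For step 2, one simple, concrete starting point is scanning the post-deploy startup logs for known error signatures and flagging the deploy if any match. A sketch, assuming the patterns below are placeholders rather than our real log format:

```python
import re

# Hypothetical error signatures to look for in startup logs after a deploy.
ERROR_PATTERNS = [
    re.compile(r"ERROR.*static config"),
    re.compile(r"failed to load"),
]

def scan_deploy_logs(lines):
    """Return the log lines matching any known error pattern,
    so the deploy pipeline can fail loudly instead of silently."""
    return [line for line in lines
            if any(pat.search(line) for pat in ERROR_PATTERNS)]
```

Even a crude scan like this would have flagged today's missed log error at deploy time instead of hours later.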
I will be around for some time to answer any questions.