Today’s exercise is not an exercise! The investigation I’m about to describe was triggered by a production outage affecting multiple APIs - albeit not at the same time - at a customer. For obvious reasons I’ll not be able to share the code and will have to alter / obfuscate some of the commands’ results. I still think this is a valuable exercise, as it is a classic example of an easy-to-make mistake leading to a catastrophic result.
The other interesting point is that I had no idea what the issue was, so I’ll not be following a script as I did in the first instalment.
The symptoms
A week prior we had another production outage. One of our APIs became unresponsive: requests would sit in Chrome waiting for minutes for the server to reply. The logs did not yield anything interesting and, according to New Relic, we had a quasi-perfect Apdex.
The working theory was that requests were failing early in the pipeline, so that our logging did not pick anything up.
According to New Relic our traffic plummeted during the outage. So basically, New Relic was not capturing any of the failing requests and the dashboards were all green!
At this stage I recommended capturing a memory dump before restarting the web app. The latter happened but the former did not. The app recycled and started to serve production traffic again, but we were left with few clues. Afterwards the operations team engaged Microsoft and learned that the thread count had spiked during the outage.
I then wrote a guide explaining how to write a memory dump, and we instructed the operations team to capture one before recycling the app if the outage were to happen again.
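For reference, one way to capture a full memory dump of the worker process is with Sysinternals ProcDump - an illustration only, as the tooling available on the box may differ from what the guide describes:

```
REM Write a full memory dump (replace <PID> with the worker process id)
procdump -accepteula -ma <PID> C:\dumps\api.dmp
```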
We were left with two clues:
Unresponsive web app
High thread count
We had released a new version of this API a week before the outage. The release included a fix I had made for an intermittent 500 error that our testers and front-end teams were experiencing. The patch modified the lifetime of some classes involved in the logging system. The Pull Request had been approved after a heated discussion, and as soon as news of the outage broke, fingers were pointed at my fix.
Stroke of luck
Come Friday morning, I’m sipping my Dirty Chai while absently looking at Slack. Suddenly a message catches my attention:
This is the same API that went down last week! My heart beats slightly faster while I fire up the Azure portal. And sure enough the thread count is far too high for the current dev traffic this API is serving:
I write a memory dump, slurp the end of my Dirty Chai and go to work, thinking I’ll allocate some WinDbg time during the weekend.
We’re down again
A few hours later another API experiences a significant outage. The symptoms are eerily similar. I’m waiting for another of my PRs to be approved, but the process is being delayed as our build server is down. I don’t have access to the memory dump that was written in production during the outage, but it’s likely the outages have a common cause; hence I decide to take my frustration out on the memory dump I wrote in the morning.
Investigation
Threads all the things
The first question I had was: why do we have so many threads and what are they doing?
CLRStack: Provides a stack trace of managed code only
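A minimal sketch of the commands, assuming SOS is already loaded and using the thread numbers from this dump:

```
$$ List the managed threads known to the CLR
!threads

$$ Switch to thread 143 (debugger thread index) and dump its managed stack
~143s
!clrstack
```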
Thread 143
So thread 143 is trying to acquire a lock: System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef). Let’s look at another thread.
Thread 136
Thread 136 is also trying to acquire a lock.
Thread 123
Thread 123 is also trying to acquire a lock. By now it’s safe to assume that many threads are trying to acquire a lock. How many of them exactly? And more importantly: which naughty thread is holding the lock?
Thread 42 is holding the lock. The third column (MonitorHeld) tells us how contended the lock is: the owning thread counts for 1 and each waiting thread counts for 2, so a value of 229 means that 114 threads ((229 - 1) / 2) are waiting on this lock. You can read more about how I computed this number in this Stack Overflow answer.
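For reference, the command that surfaces this - a sketch, as the actual output is obfuscated:

```
$$ List the sync blocks with owned monitors. MonitorHeld counts the owner
$$ as 1 and every waiting thread as 2, hence (229 - 1) / 2 = 114 waiters.
!syncblk
```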
What is thread 42 doing?
The last line executed in our code was:
Despite its name, it seems to be executing async code and, even more troubling, it looks like we’re blocking. I have enough information to locate the source and find the following method:
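I can’t share the actual code, so here is a representative sketch of the pattern, with hypothetical names (ISecretStore stands in for the real secret client):

```csharp
using System.Threading.Tasks;

// Hypothetical abstraction over the real secret client.
public interface ISecretStore
{
    Task<string> GetSecretAsync(string name);
}

public class ApiKeyProvider
{
    private static readonly object Gate = new object();
    private readonly ISecretStore _secrets;

    public ApiKeyProvider(ISecretStore secrets)
    {
        _secrets = secrets;
    }

    // Synchronous signature hiding an async call: the thread takes the lock,
    // then blocks on .Result until the task completes. Every other request
    // piles up behind the Monitor, which is exactly what the dump showed.
    public string GetApiKey()
    {
        lock (Gate)
        {
            return _secrets.GetSecretAsync("api-key").Result;
        }
    }
}
```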
Sure enough, we’re blocking on an async method. I’ll let Stephen Cleary explain what’s wrong, but it can be summed up as: Don’t Block on Async Code (seriously, read the blog post).
Solution
There are two alternatives:
Resolve the secret on application startup. At this stage we’re single-threaded and there is no request context, so we can block without risking a deadlock. It isn’t pretty and will slow down the application startup, but it works.
A nicer option is to resolve the secret when needed (lazy initialization). By then we’ll be in an async method and we’ll be able to call the Async overload, as sketched below.
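A minimal sketch of that second option, reusing the hypothetical ISecretStore from the earlier snippet (Lazy<Task<T>> ensures the secret is only fetched once):

```csharp
using System;
using System.Threading.Tasks;

public class LazyApiKeyProvider
{
    private readonly Lazy<Task<string>> _apiKey;

    public LazyApiKeyProvider(ISecretStore secrets)
    {
        // The async call is started on first use and its result is cached.
        _apiKey = new Lazy<Task<string>>(() => secrets.GetSecretAsync("api-key"));
    }

    // Callers await the task instead of blocking a thread-pool thread.
    public Task<string> GetApiKeyAsync() => _apiKey.Value;
}
```

No thread holds a lock while the secret is being fetched, so the pile-up we saw in the dump cannot happen.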
Conclusion
Once again WinDbg has delivered the goods. I write up my findings and communicate them to my colleagues. It’s 4:30 PM and the build server is still not fixed, so I decide to call it a day.