Lessons Learned from a Production Outage

A little while back, I arrived at my desk to a message no one likes to receive: the production instance of a client’s primary app was down. What’s more, the site had just gone live earlier that week, and thousands of users were trying to log in.

I had fairly high confidence in the overall quality of the code, so I knew it had to be something environment-specific and site-wide to take down the entire app like this. Whenever production goes down, no matter the cause, I always have a twinge of guilt and a feeling of personal responsibility, quickly followed by a drive to solve the problem as rapidly and efficiently as possible.

So, I logged into the site to see what the issue was, and I was greeted with a blank white screen. I opened my browser’s dev tools and hit refresh to get some details for debugging, and the page loaded successfully.

Weird.

I confirmed with the client that they could access the site as well, and, at least for the short term, I chalked it up to an intermittent network outage.

The Cat Came Back

Later that morning, an identical report came in: a blank white screen when trying to access the site. And again, before I could debug it, everything was back to normal. Something was definitely up.

When I checked the health of the web and database servers, everything looked OK. There were more than enough physical resources available, so that wasn’t an issue either.

When in doubt, reboot.

Unfortunately, this Windows server was going on two years of uptime, which meant there were a lot of OS updates that needed to run. All in all, the emergency restart took half an hour. Not good for the middle of the day. And the worst part was, once everything had come back up, the issue was still happening.

Throughout all of this, I started to notice that there seemed to be some periodicity to the system going down.

Adding Some Observability

To test my periodicity theory, I set up some synthetic monitoring with New Relic to check on the site every minute. With some solid data finally in hand, it quickly became clear that the site was down for 3 minutes out of every 5.
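As a rough sketch of how a pattern like that can be pulled out of once-a-minute check results: the snippet below collapses a series of up/down samples into runs. The sample data and the `outage_runs` helper are made up for illustration; this is not the actual New Relic output or API.

```python
from itertools import groupby


def outage_runs(samples):
    """Collapse per-minute probe results into (is_up, run_length) pairs."""
    return [(is_up, len(list(group))) for is_up, group in groupby(samples)]


# Hypothetical results from a once-a-minute probe: True means the page loaded.
samples = [True, True, False, False, False] * 3

runs = outage_runs(samples)
down_lengths = [length for is_up, length in runs if not is_up]
up_lengths = [length for is_up, length in runs if is_up]

print(down_lengths)  # [3, 3, 3]
print(up_lengths)    # [2, 2, 2]
```

When every outage run has the same length and the gaps between them are equally regular, something on a timer is a more likely culprit than random network trouble.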

My first lead was that I had seen this blank screen before, during development, when it was due to a database connection issue. I discovered how to enable file logging in production for the old framework that the app was running on, and upon opening the log file, I was greeted with this message:

Could not connect to database.

So, I knew I was on the right track.

If at First You Don’t Succeed

I had noticed the app was configured to connect to the database via an external IP address, yet the database was located on the same physical machine as the web server. Maybe it was an intermittent network issue after all?

Nope. After changing the settings to connect via localhost, the app went down again a few minutes later.

I then thought it might be too many database connections. The app had just gone live; perhaps connections were leaking, or something was getting overloaded? But when I checked the number of open connections, it never came close to a critical level: the application was closing connections correctly at the end of each request.

The Smoking Gun

I checked the database audit logs to see if there was anything fishy. Indeed, there was.

There were loads of “Login failed for 'sa'.” rows, with reasons of either “Incorrect password” or “Account is locked out.” I cross-referenced the application’s database configuration file. The app was also using the sa account to connect to the database.

Bingo.

I attempted to connect to the production database from my development machine using the sa credentials. Ideally, I would have received a connection timed out error, indicating that a firewall was (appropriately) blocking the connection. Unfortunately, as I predicted, I was able to connect, but login was denied, as the sa user was locked out.
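A check like this can be approximated without a database client at all. The following is a minimal sketch, assuming a plain TCP probe is enough to tell a firewalled port from a reachable one; `probe_port` and its return labels are my own names, not part of any tool used during the incident.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Classify a TCP port as seen from this machine.

    "open"        - the TCP handshake completed (no firewall in the way)
    "filtered"    - no answer before the timeout (typical of a firewall drop)
    "closed"      - the host actively refused the connection
    "unreachable" - any other network-level failure (DNS, routing, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "filtered"
    except ConnectionRefusedError:
        return "closed"
    except OSError:
        return "unreachable"
```

Running something like `probe_port("db.example.com", 1433)` from outside the network (the hostname is a placeholder; 1433 is SQL Server's default port) should ideally come back as "filtered". In this case the equivalent result was "open", which only establishes that the port is reachable; the locked-out login showed up one layer higher.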

Behind the Curtain

So, what was happening?

The database was open to the Internet, and someone on the Internet was repeatedly attempting to log in as sa. For those less acquainted with SQL Server, sa stands for “system administrator,” and is basically equivalent to root.

After 2 minutes of failed login attempts, the sa account would be locked, and since the application was using sa to connect to the database, all connection attempts from the application failed too. Three minutes later, the account would automatically unlock, and the application could reconnect, but so could the attacker.
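The resulting cycle can be modeled in a few lines. This is a deliberately simplified sketch: it assumes the attacker starts hammering the instant the account unlocks, that the lock always lands exactly 2 minutes later, and that it lasts exactly 3 minutes. The function and parameter names are illustrative.

```python
def app_can_connect(minute, minutes_until_lock=2, lockout_minutes=3):
    """Return True if the shared account is unlocked during the given minute."""
    cycle = minutes_until_lock + lockout_minutes
    return minute % cycle < minutes_until_lock


timeline = ["up" if app_can_connect(m) else "DOWN" for m in range(10)]
print(timeline)
# ['up', 'up', 'DOWN', 'DOWN', 'DOWN', 'up', 'up', 'DOWN', 'DOWN', 'DOWN']

down_per_hour = sum(not app_can_connect(m) for m in range(60))
print(down_per_hour)  # 36
```

Under those assumptions the app is unavailable 36 minutes out of every hour, which lines up with the 3-minutes-in-5 pattern the monitoring showed.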

Remediation

Once the cause had become clear, I worked with the client to close the firewall hole so that only approved machines could connect to the database. As soon as the hole was closed, the downtime stopped.

As an extra measure, I also changed the application configuration to use application-specific credentials to connect to the database.

Lessons Learned

There are several security and operations lessons that were reiterated for me that morning:

1. Keep your servers up to date

Aside from the risk of leaving machines open to vulnerabilities by not patching regularly, the time required to perform all of these updates at once increased our time to resolution for this outage and caused a full half-hour of downtime in the middle of the business day.

2. Don’t expose your database to the public

There are very few reasons why a database would need to be open to the Internet. Limiting access can seem like a hassle, especially in a remote-friendly workplace where IP addresses change frequently, but the extra security from forcing connections over a VPN or SSH tunnel is worth the effort.

3. Use application-specific credentials

The sa (or root) account should never be used for application database access. If anyone were to connect to the system with these credentials, they’d have full access to everything in the database, if not the entire server. Not good.

Creating separate credentials for each application and limiting access to only what’s needed for that application will help reduce the potential attack surface should those credentials get compromised.

4. Monitoring and logging are invaluable for debugging production

Having an overview of what’s going on with the system — both on the inside, and from the outside — is valuable information. Using automated alerts to let us know when the site is down is much better than learning it from end users. Providing quick and easy access to see what the problem is, via logs or otherwise, will drastically reduce the time to resolution.

Conclusion

All of the steps that could have helped prevent or minimize this issue take time and effort to maintain. As developers and system administrators, our time is precious, but by prioritizing security best practices like those stated above, we can reduce the potential attack surface available to malicious actors.

When production issues like this do occur, persistence is key. Having multiple hypotheses proven wrong can certainly be frustrating, especially with the added stress of the time-sensitive situation at hand. But with some detective work, and the scientific method, you can eliminate variables and narrow down the suspects until the case is solved.
