On Call DBA Checklist

   Posted by: jasoncbunn   in SQL Server

So, you are on call for your DBA team this week.  Perhaps you have 200 servers, perhaps 1000 or more, but you have enough that you have the automation tools in place.  You are not scanning error logs one at a time or opening backup folders to verify that backups ran.  You know nothing went horribly wrong because you were not paged the night before.  As you walk into work, fire up the PC, and sip your morning coffee, what do you look at?

This was the basic question posed to my DBA team this morning.  A quick count of the aggregate emails we receive, or can receive, turned out to be surprisingly high: 36.  Some of these are alerts (the TempDB version store is filling up, for example), many are configuration problems (the guest account is enabled), some are simmering potential problems (I/O taking longer than 15 seconds), and others are minor errors (databases not backed up).  Many are duplicates, since there are production and non-production versions.

The challenge was to identify the five key emails that we as a team wanted to ensure were read by the on-call DBA.  Sure, we should look at all of these, but which are the critical ones that the on-call DBA is standing up and essentially guaranteeing will be examined?  It was a lively discussion, and it was instructive to force ourselves to pick five, since none of them are trivial or we wouldn’t be alerting ourselves.

Eventually, the team decided on these five emails to highlight and take responsibility to ensure they are examined and processed:

  • Daily summary of critical alerts from the last 24 hours
  • Change Data Capture misconfiguration
  • Mirroring misconfiguration
  • Disk Space warnings
  • Databases not backed up in production

The list is good, and forms a basis of accountability for the on-call DBA.  What I found most interesting is the realization that, in essence, all of the emails are important, and all indicate a greater or lesser degree of instability of the enterprise.  The team realizes that what really needs to happen is that all of these alerts are eventually triaged, documented, and resolved.  Then, whenever any email comes into the mailbox, it is actionable and something we can and should fix.  Gradually, take care of the noise so that the signal can bubble up to the surface.
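For illustration only, here is a minimal Python sketch of how the five must-read emails could be separated from the rest of the alert mailbox.  The subject keywords, the AlertEmail type, and the rule that only the backup alert is restricted to production are all assumptions made for this example; the post does not say how the actual alert emails are named or routed.

```python
from dataclasses import dataclass

# Hypothetical subject keywords for the five must-read emails.
# The real alert subjects are not given in the post.
MUST_READ_KEYWORDS = (
    "daily critical alert summary",
    "change data capture misconfiguration",
    "mirroring misconfiguration",
    "disk space warning",
    "databases not backed up",
)


@dataclass
class AlertEmail:
    subject: str
    environment: str  # assumed values: "production" or "non-production"


def is_must_read(email: AlertEmail) -> bool:
    """Return True if the email matches one of the five key categories."""
    subject = email.subject.lower()
    for keyword in MUST_READ_KEYWORDS:
        if keyword in subject:
            # Assumption: the backup alert is must-read only for production,
            # following the wording of the list above.
            if keyword == "databases not backed up":
                return email.environment == "production"
            return True
    return False


def triage(emails):
    """Split emails into (must_read, backlog), preserving arrival order."""
    must_read = [e for e in emails if is_must_read(e)]
    backlog = [e for e in emails if not is_must_read(e)]
    return must_read, backlog


inbox = [
    AlertEmail("Disk Space Warning: SQLPROD01", "production"),
    AlertEmail("Databases not backed up", "non-production"),
    AlertEmail("Guest account is enabled", "production"),
]
must_read, backlog = triage(inbox)
```

The backlog is not thrown away here.  It is the pile that still has to be triaged, documented, and resolved over time, which is the point of the paragraph above.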

This entry was posted on Tuesday, November 17th, 2015 at 8:00 pm and is filed under SQL Server.
