
Archive for the ‘SQL Server’ Category

17
Nov

On Call DBA Checklist

   Posted by: jasoncbunn

So, you are on call for your DBA team this week.  Perhaps you have 200 servers, perhaps 1000 or more, but you have enough that you have the automation tools in place.  You are not scanning error logs one at a time or opening backup folders to verify that backups ran.  You know nothing went horribly wrong because you were not paged the night before.  As you walk into work, fire up the PC, and sip your morning coffee, what do you look at?

This was the basic question posed to my DBA team this morning.  A quick count of the aggregate emails we receive, or can receive, turned out to be surprisingly high: 36.  Some of these are alerts (the TempDB version store is filling up, for example), many are configuration problems (the guest account is enabled), some are simmering potential problems (I/O taking longer than 15 seconds), and others are minor errors (databases not backed up).  Many are duplicates, since there are production and non-production versions.

The challenge was to identify the five key emails that we as a team wanted to ensure were read by the DBA on call.  Sure, we should look at all of them, but which are the critical ones the on-call DBA stands up and essentially guarantees will be examined?  It was a lively discussion, and forcing ourselves to pick only five was instructive, since none of the alerts are trivial or we wouldn’t be sending them to ourselves in the first place.

Eventually, the team decided on these five emails to highlight and take responsibility to ensure they are examined and processed:

  • Daily summary of critical alerts from the last 24 hours
  • Change Data Capture misconfiguration
  • Mirroring misconfiguration
  • Disk Space warnings
  • Databases not backed up in production

The list is good, and forms a basis of accountability for the on-call DBA.  What I found most interesting is the realization that, in essence, all of the emails are important, and all indicate a greater or lesser degree of instability of the enterprise.  The team realizes that what really needs to happen is that all of these alerts are eventually triaged, documented, and resolved.  Then, whenever any email comes into the mailbox, it is actionable and something we can and should fix.  Gradually, take care of the noise so that the signal can bubble up to the surface.

22
Aug

Backup Share and NTFS Permissions, Too

   Posted by: jasoncbunn

As DBAs, backups are a fundamental part of what we do.  We live in the world of recovery models and full, differential, and transaction log backups.  We know to test our backups regularly by restoring them to another server.  But what else do we have besides database backups?

I never really thought about what happens if we were to lose file permissions.  Think about the backup locations, or the script repositories.  What if all of the service accounts suddenly lost access?  We ran into that this week, and it was only because the access patterns were stored elsewhere that we were able to recover without significant issues.

As it turns out, backing up share and NTFS permissions seems to be pretty straightforward.  Based on this MSDN article, we should be able to back up these permissions and have something in place should it ever happen again.  Just another step towards a more resilient database system.

7
Apr

Joining Infrastructure

   Posted by: jasoncbunn

A little over a year ago, our team of DBAs was transferred to report to the director of infrastructure instead of the director of data services or the director of operations, our previous departments.  At the time, the move was met with a bit of concern from the team.  Data Services brings to mind the goal of delivering data solutions to the organization: building an operational data store or a data warehouse.  Infrastructure was viewed in some ways as keeping the lights on.  Keep the plumbing working.  Not unlike the CEO’s initial impression of infrastructure in The Phoenix Project.

However, over the last year, we’ve come to appreciate the closer connection to the teams that build our servers and provide our storage.  One example is the recent effort to reduce storage allocations for SQL Server.  With the storage team as a separate entity operating “over the wall,” so to speak, we asked for storage and it was delivered; we never got involved in the overall picture of the array.  But as members of the infrastructure team, we’ve learned the difference between storage subscription and storage allocation.  We’ve worked much more closely with the storage admins on I/O concerns and on how the inner workings of the SAN actually function.  The idea of the SAN being “smoke and mirrors,” as it was once described, is finally fading away as we develop a better working relationship with them.  I could describe similar situations with the server and network teams.  A year later, we’re happy to be part of the infrastructure team, although the change has not been without some negatives: we tend to become involved in data services projects later than we used to, we often end up needing to challenge assumptions, and the translation from logical ideas to a physical database has been painful at times.

By the way, we did get paged on Easter night.  We were updating our custom job processing engine and ran into a PowerShell issue.  It turns out that the foreach command in PowerShell 2.0 can loop once over a $null object.  That path wasn’t tested, since we didn’t consider that a few servers still run PowerShell 2.0.  The behavior was fixed in PowerShell 3.0 and is described here.


13
Jun

Database Administration Tiered Support

   Posted by: jasoncbunn

As a database platform footprint grows, it’s natural to add DBAs to your staff to support the growing number of servers, instances, and databases.  In the beginning, the nature of the work of the new DBAs is not significantly different from the work of the first DBAs.  There are simply more heads to divide up the work.  However, at some point, it is worth considering if that approach is really the best one.

At our shop, we are considering the idea of tiered support.  We have hundreds of servers and thousands of databases, with nine DBAs.  The work is getting done: we’re not in crisis mode, database backups are regular and verified, and deployments are relatively smooth.  However, there is a host of metrics that we collect but lack the bandwidth to act on.  For example, our daily monitoring tool collects “interesting” events from the various error logs.  Those are reported in an email, but it is rarely looked at.  Non-production jobs may fail and not be investigated.  There is little proactive maintenance that occurs.

The idea of tiered support makes sense, but it seems to be rarely talked about; at least, I couldn’t find many references out there.  What we’re looking to move to, at least at first, is a two-tier support system.  For the first tier, we’d like to use our operations center to offload routine, non-production work such as granting permissions.  We provided the operations center with a PowerShell script that grants a limited set of permissions: read, write, and execute.  The script limits what can be done, and since it essentially only calls a stored procedure, we only need to grant minimal permissions to the operations center engineers.  The stored procedure constructs the GRANT statement.  Everything is logged in our configuration management tool.  We will be looking for more opportunities to push these routine tasks to the center this year.
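To make the idea concrete, here is a minimal sketch of the kind of gating stored procedure I’m describing.  The procedure and parameter names are hypothetical and the logging to our configuration management tool is omitted; the point is simply that the procedure only knows how to build three narrow GRANT statements.

CREATE PROCEDURE dbo.usp_GrantLimitedPermission
    @DatabaseName sysname,
    @UserName     sysname,
    @Permission   varchar(10)   -- only 'read', 'write', or 'execute'
AS
BEGIN
    SET NOCOUNT ON;

    IF @Permission NOT IN ('read', 'write', 'execute')
    BEGIN
        RAISERROR('Permission must be read, write, or execute.', 16, 1);
        RETURN;
    END;

    -- Build the narrow GRANT in the requested database; QUOTENAME guards the names.
    -- The real version also checks that the target is non-production and logs the request.
    DECLARE @sql nvarchar(max) =
        N'USE ' + QUOTENAME(@DatabaseName) + N'; ' +
        CASE @Permission
            WHEN 'read'    THEN N'GRANT SELECT TO '
            WHEN 'write'   THEN N'GRANT INSERT, UPDATE, DELETE TO '
            WHEN 'execute' THEN N'GRANT EXECUTE TO '
        END + QUOTENAME(@UserName) + N';';

    EXEC sys.sp_executesql @sql;
END;
GO

Since the PowerShell script does little more than call a procedure like this one, the operations center engineers only need EXECUTE permission on it.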

The other step for the operations center is to become the “notification” tier.  We’d like to make the operations center the recipient of the calls and pages, at least as the first contact.  If we provide the proper triage recipes, they will be able to conduct a preliminary analysis, determine whether it is in fact a DBA event, and then call the second tier with that all-important initial investigation complete.  For example, if a server is down, the operations center can go through the steps of attempting to connect to the server and determining whether the server itself is down (an issue for the server team) or the instance is non-responsive (an issue for the DBAs).

As we move in this direction, I have hopes that we’ll eventually get to three tiers.  The first, the operations center, is the reactive tier, responding to pages and alerts.  The second tier monitors day-to-day operations and also responds to requests.  The third tier is the tier that looks at disk capacity, future needs, proactive maintenance and optimization, etc.  Each tier can be an escalation point for the tier below.  It will take some time to get there, but it has the potential to lead to a more capable DBA team that scales better as the server footprint grows.

9
Nov

SQL PASS Summit 2012: Reflections

   Posted by: jasoncbunn

Well, SQL PASS Summit 2012 is over.  I’m writing this on Friday evening, from the hotel.  The fact that I’m not out with newfound SQL Server colleagues and friends tells you something about my experience with the whole networking aspect.  That part was, in fact, a dismal failure, and I fully admit 99.44% of that is my fault.

Before I get to that, I must say that the PASS organization and Microsoft put on one heck of a conference.  The facilities were top-notch, the staff helpful, the food good, and the technology worked (as one would hope at an IT conference!).  It was a very professional and smooth-running affair from start to finish.

The two keynote events were outstanding.  The world of data is changing, and it is clear that Microsoft is aware of this and is moving to become a leader.  It’s exciting and unnerving at the same time.  Exciting because it is all the more clear that there is power in information, and information comes from knowing how to acquire, manage, and transform data.  As a database professional, it is clear that major, major gains in the future will come through efforts in this field.  It is also unnerving, though.  There is a whole lot to learn, and what if the company that pays you is not willing to step in front of this trend?  Personally, the fact that my company didn’t want to pay for a copy of the conference talks to download or take home may be an ominous sign.  It’s easy to get excited here about big data, Hadoop, and transforming whole industries, but back home, if you are immersed in the standard tasks of take that backup, add this index, tune this query, all of what was learned here, and the vision of it all, will be lost.

I did enjoy the technical talks, although it will take some time to digest it all.  I’m so glad the slides and demos are going to be on the PASS website.  There’s a lot to go through, but most of the talks did have specific takeaways that could be immediately applied to several issues back home.

So, about the networking.  SQL PASS tried to do something for first-timers, pairing groups of us with alumni who were willing to help us have a better experience.  I don’t know if it was just my alumni assignee, but after the welcome reception, which I missed because I was flying in that day, the alumni interaction ceased.  No emails, no contact at all.  I did see the guy a couple of times from afar.  I’m not sure what the right answer is, though.  Perhaps it was understood that after the welcome reception we’re on our own.  After all, he’s there for his own reasons, not to babysit us.  I get that.  But perhaps a breakfast with the group, or a mid-point “how’s it going?” session, would help those of us who didn’t get to the reception.

As for the other events, PASS had some good opportunities to meet people, but a combination of my natural introversion and unfortunate choices stymied my feeble efforts.  At the Birds of a Feather tables and one of the breakfasts, the reaction to my arriving at the table and saying hi was a few grunts; the rest didn’t look up from their phones, tablets, or netbooks.  I sat by myself for most of the regional luncheon as people filled in every table but the one I picked, although that did get better as the room filled up.

It was interesting watching everyone with their various electronic devices.  People don’t look up at your arrival when they are catching up on email and whatnot right from their chairs while waiting for the next talk to start.  Heck, I saw one guy updating his fantasy football roster for the weekend.  I wonder if the networking was different five or more years ago, when you sat down next to a person who wasn’t plugged in 24/7.

So, the networking was a bust, and I do blame myself for that.  Hopefully, I’ll be able to come back to a future session (maybe Charlotte 2013) and get better at this.  But overall, it was a good experience, a lot was learned, and I highly recommend it.

6
Nov

Travel Day for SQL PASS Summit

   Posted by: jasoncbunn

Today’s the day.  Traveling to Seattle today for the Summit.  So many people are so very excited online.  I admit I am too, but more intrigued than excited.  I want to see what it’s all about.  Also, there is a lot of potential for networking, so I figured I should update the blog in case there are actual visitors!

Traveling from 3:12 CST to 8:02 PST, which means I’ll be in the airline system for most of the election returns.  That should be interesting.  The country is so very divided.  Will it be a squeaker for the president, or will a new course be chosen?  Either way, half the country will be ticked off by the time I land in Seattle.

12
Oct

Attending SQLPASS Summit in Seattle

   Posted by: jasoncbunn

In early November, I’ll be heading out west for the first time in over six years to attend the Professional Association for SQL Server (PASS) Summit in Seattle. It’s billed as the premier event for SQL Server professionals and I’m really looking forward to it.

I’ve been a DBA for about six years and in that time I’ve progressed to a point where I’m looking forward to expanding my perspectives by learning from the best, and spending some time figuring out how to take things to the next level.

In my day-to-day activities, we have a great group of eight DBAs working to manage over 4000 databases encompassing over 50 TB.  I’d like to think that most of the daily triage and the daily grind, so to speak, are things we are good at by this point.  All databases are backed up at least daily, with reports about missing full backups, or databases in full recovery without a recent transaction log backup, emailed to the team each morning.  We monitor disk space, security changes, error log messages, and a host of other metrics.  We’ve automated backups and weekly maintenance (without a maintenance plan), and have developed a rather sophisticated unattended restore process (allowing developers to manage a well-defined restore operation without direct DBA involvement) and a deployment process that can support both replication and change data capture with relative ease.  Yes, I’m proud of the infrastructure we’ve developed.
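For anyone curious what those morning backup reports boil down to, here is a simplified sketch of the kind of query that drives them.  The thresholds (24 hours for full backups, 2 hours for log backups) are illustrative, not our actual settings.

;WITH LastBackups AS (
    SELECT d.name,
           d.recovery_model_desc,
           MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
           MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log
    FROM sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS b ON b.database_name = d.name
    WHERE d.name <> 'tempdb'
    GROUP BY d.name, d.recovery_model_desc
)
-- Databases with no recent full backup, or full-recovery databases with no recent log backup
SELECT name, recovery_model_desc, last_full, last_log
FROM LastBackups
WHERE last_full IS NULL
   OR last_full < DATEADD(HOUR, -24, GETDATE())
   OR (recovery_model_desc = 'FULL'
       AND (last_log IS NULL OR last_log < DATEADD(HOUR, -2, GETDATE())));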

At the same time, there is so much more to learn.  One of the great new features of SQL Server 2012 is AlwaysOn Availability Groups, and we will be moving in that direction next year, complete with clustering across subnets.  I’d also like to see what other pros have done with monitoring of transactional replication.  Does everyone use tracer tokens?  How widespread is the use of the replication checksum capabilities?
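For reference, tracer token monitoring is straightforward to script.  A minimal example, run at the publisher in the publication database and assuming a publication named MyPublication:

DECLARE @tokenId int;

-- Post a token into the replication stream
EXEC sys.sp_posttracertoken
     @publication = N'MyPublication',
     @tracer_token_id = @tokenId OUTPUT;

-- Once the token flows through, this shows publisher-to-distributor and
-- distributor-to-subscriber latency for it
EXEC sys.sp_helptracertokenhistory
     @publication = N'MyPublication',
     @tracer_id = @tokenId;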

SQL Server is such a big product that there can be whole swaths of the software that you never touch, and I don’t mean simply not using Reporting Services or Integration Services.  Even within the database engine, if XML querying or spatial data types are not part of the daily grind, then they’re not something you are likely to be proficient in.  So do those who take and pass certifications, even the vaunted MCMs, learn a whole host of features just to pass the exams?  I’m curious to see if there is any opinion in that regard.

I’m looking at this trip as an opportunity to relax a bit, and focus on the profession, rather than the job.  What area of expertise can I really develop?  Are there any experiences or proficiencies I can share?  What are the next steps in my journey?  Hopefully I’ll find out a bit more during the week.  The sessions look great, and it’ll be hard to choose which to attend, but I’m sure it’ll be fruitful.

8
Sep

Change Data Capture Access Role

   Posted by: jasoncbunn

When enabling a table for change data capture, one of the options is to set up a user defined role to control access to the change data capture functions.  If a user is not a member of the role that is specified when change data capture is enabled, the user will not be able to use the functions that access change data.  This role is defined when sys.sp_cdc_enable_table is executed. For example:

USE AdventureWorks2012; 
GO 

EXECUTE sys.sp_cdc_enable_table
     @source_schema = N'HumanResources'
   , @source_name = N'Employee'
   , @role_name = N'cdc_Admin';
GO

(source:  http://msdn.microsoft.com/en-us/library/bb522475.aspx)

In this example, a user must be a member of the cdc_Admin role in order to use the CDC functions.  Note that membership in the db_owner database role or the sysadmin fixed server role will supersede this gating functionality.

This week, an interesting situation arose: what if the role doesn’t exist?  This can happen if CDC is set up in one environment and then the database is copied to another environment with different permissions.  If that copy is done via a backup and restore with the KEEP_CDC option, all of the change data capture tables and metadata are retained, including the name of the gating role.  If the role doesn’t exist, the functions are not accessible to anyone with permissions lower than db_owner.
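For illustration, this is the kind of restore that carries the CDC metadata, and the gating role name, along with it; the path and database name are hypothetical:

-- Restoring to another instance; without KEEP_CDC the capture instances would be dropped
RESTORE DATABASE AdventureWorks2012
FROM DISK = N'\\FileServer\Backups\AdventureWorks2012.bak'
WITH KEEP_CDC, RECOVERY;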

This happens because the functions do a simple check.  First, are you sysadmin?  No.  Are you db_owner?  No.  Is there a gating role defined in the cdc.change_tables system table?  Yes.  Are you in it?  No.  Well, tough luck.  The functions usually don’t even fail, but will return a null result.  A good example is the column ordinal function, sys.fn_cdc_get_column_ordinal.  You can misinterpret a missing column as a permissions problem, or vice versa: the function will return null if the column you are interested in is not in the table, but it will also return null if you are not part of the gating role.
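You can run roughly the same check by hand.  Assuming the capture instance from the earlier example, this shows whether the current login would pass each gate; note that IS_MEMBER of a role that doesn’t exist simply returns NULL, which is exactly the trap described above.

SELECT IS_SRVROLEMEMBER('sysadmin')  AS is_sysadmin,
       IS_MEMBER('db_owner')         AS is_db_owner,
       ct.role_name                  AS gating_role,
       IS_MEMBER(ct.role_name)       AS in_gating_role   -- NULL if the role doesn't exist
FROM cdc.change_tables AS ct
WHERE ct.capture_instance = 'HumanResources_Employee';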

If the CDC functions are giving you problems, check cdc.change_tables and ensure the user is a member of the gating role, if one is defined.  If the gating role is defined in cdc.change_tables but doesn’t exist in the database, create it and add the user you are interested in.  The role doesn’t even need any explicit permissions; it just needs to exist, and the user who wishes to use the CDC functions needs to be a member.
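A minimal repair, assuming cdc.change_tables lists a gating role named cdc_Admin that no longer exists and a user named ReportUser who needs access (both names are just placeholders):

IF DATABASE_PRINCIPAL_ID('cdc_Admin') IS NULL
    CREATE ROLE cdc_Admin;

-- On SQL Server 2012 and later; use sp_addrolemember on older versions
ALTER ROLE cdc_Admin ADD MEMBER ReportUser;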

24
Aug

Inbox 0: An Awesome Place to Be

   Posted by: jasoncbunn

As an enterprise DBA, I get a ton of email. Requests, notifications, announcements, server alerts, summaries, meeting minutes, and ongoing conversations about all of the above. In a typical week I probably receive over 200 emails.

Deciding what to do with those emails can be a nightmare.  Without some sort of order and organization, you can quickly find yourself with thousands of emails that are neither junk nor critical.  These emails may contain important information, may be a reminder that you should do something, or may be a note to yourself.  To make matters worse, since we have a whole other inbox for our general DBA team mail and we use the read/unread setting to denote new vs. ongoing tasks, I’ve set Outlook not to mark an email as read when it’s in the preview pane.  So I ended up with literally thousands of unread emails stretching back months.

Until today.  Today, I have nothing in my inbox.  Zero.  Zilch.  Nada.  Getting there took two steps.

First, I realized that, with Enterprise Vault, all emails older than 30 days are archived.  So I went from thousands to maybe one thousand by hitting the delete key to drop all of the shortcuts to the vault out of the inbox.

Second, I took a tip from Getting Things Done, by David Allen.  Every email is categorized, and I am getting much better at deleting intermediate conversational emails.  Today, when I get a new email, it goes into one of these folders in Outlook:

  • Action Pending:  Something to do
  • In Progress:  When I’m going to work on it right away
  • To Read
  • Waiting for Calendar:  When I have to act later.  I flag it with a reminder for the future and forget it
  • Waiting for Response:  If I’ve asked a question, I dump the email here and forget it until someone responds.

My workflow consists of moving inbox emails into the folders.  Once there are no more emails in the inbox, I choose what I want to work on and move the request from Action Pending to In Progress.  Once a task is complete, I move its email into a “Completed” folder that I use to write my weekly status report.  Daily, I check the two Waiting folders to ensure things are current and moving along.

This has freed up a lot of time and reduced the inbox clutter to zero.  When there’s nothing in the Inbox or Action Pending, I complete what is In Progress and then I can concentrate on my own assignments, research and development.  It’s a great feeling to know that you are on top of the inbox monster.  Whatever method you choose, Inbox: 0 is a worthy goal to shoot for.

With 10 days to develop an approach and implement it, we were presented with the need to combine data from two sources into a single set of tables.  One of our premium calculation databases is in production with data from one source system.  Meanwhile, data from a second source system has been converted into the same format, and we are tasked with combining the data from the two systems.  Unfortunately, the converted data was built without considering the identity column in the four key tables.

For example, production may have data using identity values 1 to 5 million for a key table, and the converted data has overlapping identity values, from 1 to 60 million.  The goal is to have a single 65 million row table.  Since the 5 million rows are in production and resequencing the identity columns will have downstream impact, we will be inserting the larger data set into the smaller one.

Our basic plan is to, in this example, add 5 million to the identity column of the larger set and insert it into the smaller set:

INSERT INTO dbo.Prod_Table (
    prodTable_Key,
    <remaining columns>
)
SELECT convertedTable_Key + 5000000,
    <remaining columns>
FROM Converted_Table;

Of course, to do this we set IDENTITY_INSERT on for the target table.
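For completeness, the wrapper around that insert looks something like this; only one table per session can have IDENTITY_INSERT switched on at a time, and the INSERT must list its columns explicitly (as it does above):

SET IDENTITY_INSERT dbo.Prod_Table ON;

-- INSERT INTO dbo.Prod_Table (...) SELECT ... FROM Converted_Table (as shown above)

SET IDENTITY_INSERT dbo.Prod_Table OFF;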

To test this, we created an empty table and inserted the 60 million records into it.  The total duration was about 2.5 hours.  Looking to improve upon this, we first tried, just to see the effect, skipping the identity insert and letting SQL Server populate that column in the target table.  When we ran the test, the transaction log of our database filled just before the load was complete, but we were looking at about 1 hour, 20 minutes.  This is better, but we wanted to see if we could improve further.

Incidentally, we discovered that if we dropped the primary key (the identity column) and just loaded into a heap, SQL Server decided to single-thread the load, which would have taken about 12 hours on the hardware we were testing with.

I read through the SQL Server Data Loading Performance Guide and ended up with a few things to try.  The guide is good, but it is geared towards loading from a flat file via bcp or BULK INSERT.  In our case, the data to load is already in SQL Server, and the time to unload the data to a flat file would negate any bcp gains.  But the guide had one interesting graphic that was the key to our solution.

That graphic shows that if you bulk insert into different, non-overlapping parts of a target table, you can multi-thread the load to improve performance.  Combined with SSIS, we were able to achieve our objective.  We set up a simple SSIS package with separate data flow tasks.  Each data flow selected a subset of the source table and inserted it into the target table.  We set the FastLoadOptions to define the incoming data as ordered.  How many data flow tasks to use and how much data to load in each was determined by trial and error, but we had good results with 12 flow tasks, which could be related to the fact that this is a 24-core server and thus we had 12 threads for reading and 12 for writing.  In the end, we were able to load the 60 million records, about 120 GB of uncompressed data, into our empty table (primary key defined on the identity column) in less than 30 minutes.
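To give a flavor of how the package was split up, each data flow’s source query carved out its own range of the converted key, something like the sketch below.  The range boundaries are illustrative, the non-key columns are omitted for brevity, and the 5,000,000 offset is the same one used earlier.

-- Source query for one of the twelve data flows
SELECT convertedTable_Key + 5000000 AS prodTable_Key
       -- , remaining columns
FROM Converted_Table
WHERE convertedTable_Key > 0
  AND convertedTable_Key <= 5000000
ORDER BY convertedTable_Key;   -- matches the ordered setting in FastLoadOptions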

There are a few tests to run to fully vet this approach.  In particular, I want to test with a compressed target.  However, this approach is certainly viable and was a useful alternative to a plain INSERT INTO…SELECT.