Tuesday, November 23, 2010

NetApp SMO snapshots and changing heads...MESSY!

This post is going to be specific to you NetApp customers out there using your systems to host Oracle mount points or storage, and more specifically those using SnapManager for Oracle (SMO) to back up your Oracle databases (which, funnily enough, doesn't actually require the databases to live on NetApp storage).

As an aside, you should know that I'm on a cross country flight and making my first honest attempt at writing a post on the virtual keyboard of my iPad.

Some basic layout information on our infrastructure: a primary NetApp FAS3140 cluster hosting a dozen or so volumes, with NFS exports mounted to HP DL380 servers over 10GbE. Pretty straightforward. This hosts a single instance of Oracle 10g, which is being backed up using SMO 3.0.2 (3.1 is current).

As far as Oracle layout is concerned, I won't go into gritty detail, but considering this post is about a gotcha I discovered in SMO, we need to establish a few things.

As is typical, Oracle uses a /u01, /u02, etc. format to number the drives/mounts for its structure. Honestly, you can name them whatever you want; SMO just polls Oracle for what the mount points are. I'm only listing what we use in case I refer to them later in the post.
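If you're curious what that polling amounts to, here's a rough sketch of the idea in Python. This is not SMO's actual code, and the connection details are made up; the point is that the tool asks Oracle where its datafiles live rather than assuming any particular /u0x layout.

```python
# Rough sketch of the kind of discovery SMO does: poll Oracle for the
# locations of its datafiles instead of assuming a /u0x naming scheme.
# The credentials and DSN below are illustrative placeholders.
import os
import cx_Oracle

conn = cx_Oracle.connect("system", "password", "dbhost/ORCL")
cur = conn.cursor()
cur.execute("SELECT name FROM v$datafile")

# Collapse full datafile paths down to their directories; the mount
# points (/u01, /u02, ...) fall out of these.
for d in sorted({os.path.dirname(name) for (name,) in cur}):
    print(d)   # e.g. /u01/oradata/ORCL, /u02/oradata/ORCL
```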

SMO is a great product whether you use NetApp for your storage or not. It was basically (unofficially) shaped by Oracle themselves: once the Oracle devs got their hands on it, they began collaborating with NetApp to fine-tune it.

Ok...now to the meat and potatoes of the post. As part of our resiliency, we keep a secondary archive log location active on a remote piece of storage that is also mounted over NFS. There is a default setting in SMO that specifies that secondary archive log locations be included in any SMO snapshots.
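The same kind of polling is how that secondary location gets swept into the snapshot set. Again, a hedged sketch of the idea rather than SMO's actual code, with placeholder connection details:

```python
# Sketch: a backup tool can enumerate every archive log destination,
# including a secondary one on remote storage, straight from Oracle.
import cx_Oracle

conn = cx_Oracle.connect("system", "password", "dbhost/ORCL")
cur = conn.cursor()
cur.execute(
    "SELECT dest_name, destination, status "
    "FROM v$archive_dest WHERE status = 'VALID'"
)
for dest_name, destination, status in cur:
    # LOG_ARCHIVE_DEST_2 pointing at the remote NFS mount shows up
    # here just like the primary destination does.
    print(dest_name, destination, status)
```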

Sounds great, doesn't it? Well, it is, unless you are unaware of it, don't account for it in your capacity planning, or forget about it when replacing the hardware that hosts this secondary location... which is what bit us in the backside this week.

We upgraded our offsite hardware from FAS2050 controllers to FAS3140 controllers. Fairly routine. Snaps continued to run fine once the hostnames were updated and the storage remounted. However, we started getting some log chatter in the SMO jobs about not being able to delete old snapshots that had expired (based on retention policy).

Hmm...after some head scratching and digging, we saw that it was trying to delete snapshots on the secondary archive log location of the old filer we had just replaced. After more head scratching and chatter over coffee, we figured we were just going to have to ride out the log spam until the retention period had passed. We weren't really breaking anything. Or so we thought.

To compound things, further investigation revealed that not only was it not deleting the snapshots on the storage that doesn't exist anymore, it was also not flushing... well, anything. This was brought to our attention when some volumes started filling up because the snap space was eating into the usable space.

Feature request for NetApp: I want to be able to tell my snap products to NOT snap if there's no snap reserve space remaining. A snap backup job failing is not nearly as critical as a LUN filling up because snaps ate all the usable space, crashing whatever application is being hosted and corrupting data. I know there are volume autogrow and snapshot autodelete capabilities, and that's all fine and good, but you can't just delete snapshots at random from something like an SMO backup, because the whole backup becomes invalid if it cannot find or access even one sub-snapshot under the umbrella of an SMO snapshot.
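Here's a minimal sketch of the kind of guard I'm asking for, bolted on from the outside: check the volume's snapshot area before kicking off the backup. The filer and volume names, the SSH approach, and the exact SMO command line are my assumptions for illustration, not gospel.

```python
# Minimal sketch of the guard I wish existed: refuse to run the backup
# when the volume's snapshot area is already full. Hostnames, volume
# names, and the SMO profile name below are illustrative assumptions.
import subprocess
import sys

FILER = "filer1"      # hypothetical filer hostname
VOLUME = "oradata"    # hypothetical volume name

def snapshot_capacity_pct(filer: str, volume: str) -> int:
    # 7-mode "df" reports the snapshot area as /vol/<vol>/.snapshot,
    # with the capacity percentage in the fifth column.
    out = subprocess.check_output(
        ["ssh", filer, "df", f"/vol/{volume}"], text=True
    )
    for line in out.splitlines():
        if "/.snapshot" in line:
            return int(line.split()[4].rstrip("%"))
    raise RuntimeError("snapshot line not found in df output")

if snapshot_capacity_pct(FILER, VOLUME) >= 100:
    print("snap reserve exhausted; refusing to create a new snapshot")
    sys.exit(1)

# Otherwise, run the real backup (profile name is a placeholder):
subprocess.run(["smo", "backup", "create", "-profile", "PROD"], check=True)
```

Note that the capacity column can run past 100% once snapshots spill over into usable space, which is exactly the failure mode I'm griping about.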

So, by changing the hostname of the secondary storage, we nullified ALL SMO backups that included a snapshot of that secondary archive log location. We couldn't delete them gracefully; we had to go through and force-delete the backups, and then traverse every single volume on all the filers and manually remove the snapshots related to those jobs.
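For the curious, the manual cleanup boiled down to a loop like this. Treat it as a hedged sketch: the filer-to-volume map and the SMO snapshot name prefix are assumptions, so eyeball the output of snap list on your own gear before deleting anything.

```python
# Hedged sketch of the manual cleanup: on every filer, walk every
# volume and remove the snapshots left behind by the invalidated SMO
# backups. The layout and name prefix below are assumptions.
import subprocess

FILERS = {
    "filer1": ["oradata", "oralogs"],   # illustrative filer -> volumes map
}
PREFIX = "smo_"                         # assumed SMO snapshot name prefix

for filer, volumes in FILERS.items():
    for vol in volumes:
        # 7-mode: "snap list <vol>" prints one snapshot per line,
        # with the snapshot name in the last column.
        out = subprocess.check_output(
            ["ssh", filer, "snap", "list", vol], text=True
        )
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[-1].startswith(PREFIX):
                # 7-mode: "snap delete <vol> <snapname>"
                subprocess.run(
                    ["ssh", filer, "snap", "delete", vol, fields[-1]],
                    check=True,
                )
```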

I did speak with one of the SMO gurus at NetApp this morning, and he confirmed this behavior. He also confirmed that there is a setting in smo.config that can be changed so the secondary archive log location is not snapped.

Once I get back from Thanksgiving next week, I'll be posting the results of fixing this, with a thorough walkthrough.
