Monday, November 29, 2010

Exchange migration project (Part 1)

One of the bigger projects of my tenure at my current employer is firing up.  In this series of posts over the next couple of months, I'm hoping to highlight and document our methods, go over any hurdles we run into, and explain how we resolved them.  This initial post will serve only as an introduction to the environment and the overall plan of attack.

Our environment's email has always been hosted on an external mail host.  It has been this way for 10+ years, and unfortunately, we have simply outgrown them.

Internally, we have been planning for this, and have turned up a new Exchange 2007 environment (yes, it's virtualized).  It is already hosting about 2000 mailboxes for some happy end-users that we migrated from a shoddy freeware platform called hMail earlier this year.  For them, OWA 2007 was a night-and-day difference from what they had before.

What we are now targeting is the corporate/regional back-office staff who, while fewer in number, are the heavy hitters, with mailboxes ranging in size anywhere from 2GB to 40GB.  And while this is the second phase of the migration as a whole, this phase has many, many "sub-phases."

What I'd like to discuss initially is sub-phase 1: the hurdles we hit, and how we overcame them.

When dealing with Exchange, it's a fairly straightforward process to move email from one place to another.  What most people tend to NOT think about, especially over the course of TEN YEARS, is all of the little granular permissions, delegations, "EVP's admin asst can send email on his behalf, and view all calendars," etc etc etc.  We'll get to this in a later post, as it deserves its own.

To make things even a little more complicated, our host is still running Exchange 2003, and our current environment is Exchange 2007.  While there are defined upgrade paths from 2003 to 2007, there is no supported way to take a 2003 EDB and mount it on a 2007 mailbox server.  So, we were sold on the idea that tools would need to be leveraged, and trusts between forests would have to be established, in order to migrate the users as if they were on a completely different mail platform altogether.
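As a teaser for those later posts: Exchange 2007's native Move-Mailbox cmdlet can do a cross-forest pull once a trust and credentials are in place, so at its simplest a per-user move looks something like the sketch below.  The server, database, and OU names are placeholders, not ours, and whether we end up using this or a third-party tool is a story for another post.

$sourceCred = Get-Credential        # admin account in the hosted (source) forest

Move-Mailbox -Identity "jdoe@hostedmail.local" `
    -TargetDatabase "EXMBX01\First Storage Group\Mailbox Database" `
    -SourceForestGlobalCatalog "hosteddc01.hostedmail.local" `
    -SourceForestCredential $sourceCred `
    -NTAccountOU "OU=Migrated Users,DC=corp,DC=local"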

But wait...how the hell are we going to do all of that between a remote host and our internal domain?

Well, we could do a site-to-site tunnel, but that would require some complex networking and add additional layers of complexity that we weren't interested in, or that the host might not even allow.

After exhausting all options, we settled on the idea of physically relocating the Exch2k3 server, as well as a domain controller from that domain, into a private VLAN inside of our network, essentially hosting the additional domain short-term until we were able to migrate the data off of the mail server completely.  Why?  It seemed much easier than trying to do complex tunneled solutions, the host was willing to let go of the old HP ML370 we're currently running on, and they were willing to replicate our domain information onto a DC that we could then bring in-house.

So, what's required to do this?

1) We need a private VLAN internally.  Coordination with the networking team to carve out ports and a space to host the new servers, and to avoid any crosstalk between the domains.  All outgoing mail would go out to the internet first and come right back in to the new 2007 server.  Could we get all super-cool with routing groups and SMTP connectors?  Sure, but why bother/overcomplicate for a server that has a remaining shelf life of about a month?  We essentially just relocated the hosted solution, the same way they would if they moved datacenters.

2) Public access/interface:  MX records will have to be updated to our IP block, and new NAT/ACL rules will have to be built into our firewalls for this solution.  Again, handled by the networking team, but fairly straightforward, as if it were a new environment.

3) Physical move:  One of our admins is hauling a new box down in the morning that will be the new DC, and we will be taking an outage to relocate the servers.  During this time, the networking team will update Network Solutions, ZIX encryption gateway, and Sprint Spamshark to point to the new VLAN/Public IP.

At this point, we are at a hard cutover.  No more mail will flow to the host, even if we left the server there.  Once we plug in the host's gear and power it up in our datacenter, mail should resume delivery once again.
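A couple of quick sanity checks from a machine outside our network should confirm the cutover took.  Domain and host names below are placeholders, of course:

C:\> nslookup -type=mx ourdomain.com       <-- MX should now resolve into our public IP block
C:\> telnet mail.ourdomain.com 25          <-- SMTP banner should come back through the new NAT rules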


4) Power up the host's domain controller in our datacenter, and ensure that it is a GC.  Actually, this will be verified before it ever leaves the host's datacenter.

5) AD looks good?  Cool.  Check for EventID 13516 to ensure it can accept authentication, and upon success fire up the Exchange server.
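For steps 4 and 5, the checks on the relocated DC are nothing exotic, just the standard 2003-era support tools (Event ID 13516 is the File Replication Service event that fires once SYSVOL is shared out and the box is allowed to act as a domain controller).  Roughly this, as a sketch:

C:\> dsquery server -isgc                        <-- the relocated DC should be listed as a GC
C:\> dcdiag /test:advertising /test:frssysvol    <-- advertising as a DC/GC, and SYSVOL/FRS healthy
C:\> repadmin /showrepl                          <-- no lingering replication errors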

That's the plan, and we'll see how it goes.

Upon success of this implementation/move, we'll take an additional, longer outage over the weekend to attempt to P2V the box and throw a ton of resources at it.  (It only has a 10/100 NIC, for example.)

Further posts will include:

*Virtualizing the Exch2k3 box with a P2V cold conversion.
*Virtualizing the domain controller with a P2V cold conversion.
*Establishing trusts between the two domains.
*Using 3rd party tools to move permissions, delegations, and data into the existing domain.

Stay tuned...

Tuesday, November 23, 2010

NetApp SMO snapshots and changing heads...MESSY!

This post is going to be specific to you NetApp customers out there using your systems to host Oracle mount points or storage, but more specifically those using SnapManager for Oracle (SMO) to back up your Oracle databases (which, funnily enough, doesn't require you to be doing so on NetApp storage).

As an aside, you should know that I'm on a cross country flight and making my first honest attempt at writing a post on the virtual keyboard of my iPad.

Some basic layout information on our infrastructure: a primary NetApp FAS3140 cluster hosting a dozen or so volumes, with NFS exports mounted to HP DL380 servers over 10GbE. Pretty straightforward. This is hosting a single instance of Oracle 10g, and it is being backed up using SMO 3.0.2 (3.1 is current).
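For the curious, the NFS mounts themselves are nothing special, more or less the stock NetApp-recommended options for Oracle datafiles over NFS. The filer and volume names below are illustrative, not our actual layout:

filer1:/vol/ora_data01  /u02  nfs  rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0  0 0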

As far as the Oracle layout is concerned, I won't go into gritty detail, but considering this post is about a gotcha I discovered in SMO, we need to establish a few things.

As is typical, Oracle uses a /u01, /u02, etc. format to number the drives/mounts for its structure. Honestly, you can name them whatever you want; SMO just polls Oracle for what the mount points are. I'm just listing what we use in case I refer to them later in the post.

SMO is a great product regardless of whether you use NetApp for your storage or not. It was basically (unofficially) modified by Oracle themselves: once the Oracle devs got their hands on it, they began collaborating with NetApp to fine-tune it.

Ok...now to the meat and potatoes of the post. As part of our resiliency, we keep a secondary archive log location active on a remote piece of storage that is also mounted over NFS. There is a default setting in SMO that specifies that secondary archive log locations be included in any SMO snapshots.
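If you're not familiar with that setup: the secondary location is just another LOG_ARCHIVE_DEST_n entry in the instance pointing at the remote NFS mount, and SMO discovers it by polling Oracle, the same way it finds the datafile mounts. The paths here are made up for illustration:

SQL> alter system set log_archive_dest_1='LOCATION=/u03/arch' scope=both;
SQL> alter system set log_archive_dest_2='LOCATION=/u08/arch_dr' scope=both;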

Sounds great, doesn't it? Well, it is, unless you are unaware of it, or don't account for it in your capacity planning, or when replacing the hardware that hosts this secondary location.....which is what bit us in the backside this week.

We upgraded our off-site hardware from 2050 controllers to 3140 controllers. Fairly routine. Snaps continued to run fine once the hostnames were updated and the storage remounted. However, we started getting some log chatter in the SMO jobs about not being able to delete old snapshots that had expired (based on retention policy).

Hmm...after some head scratching and digging, we saw that it was trying to delete snapshots on the secondary archive log location of the old filer we had just replaced. After more head scratching and chatter over coffee, we figured we were just going to have to ride out the log spam until the retention period had passed. We weren't really breaking anything. Or so we thought.

To compound things, further investigation revealed that not only was it not deleting the snapshots on the storage that doesn't exist anymore, it was also not flushing...well, anything. This was brought to our attention by some volumes filling up, because the snap space was eating into the usable space.
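On the filer side, the symptom is easy to spot once you know to look: usable space shrinking while a pile of SMO-created snapshots hangs around well past retention. The volume name below is a placeholder:

filer> df -h /vol/ora_arch01        <-- snapshot spill eating into the usable space
filer> snap list ora_arch01         <-- SMO snapshots that should have aged out, but haven't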

Feature request for NetApp: I want to be able to tell my snap products to NOT snap if there's no snap reserve space remaining. A snap backup job failing is not nearly as critical as a LUN filling up because snaps ate up all the usable space, crashing whatever application is being hosted and corrupting data. I know there are volume auto-grow and snapshot auto-delete abilities, and that's all fine and good, but you can't just delete snapshots randomly in something like an SMO backup, because the whole backup becomes invalid if it cannot find or access one sub-snapshot under the whole umbrella of an SMO snapshot.
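For reference, those knobs look like this in 7-mode, and per the above I would NOT point autodelete at volumes backing SMO snapshots, since losing any single one of them invalidates the whole backup set. Volume name and sizes are placeholders:

filer> vol autosize ora_arch01 -m 800g -i 20g on    <-- grow the volume before it fills
filer> snap autodelete ora_arch01 show              <-- review the triggers/commitments first
filer> snap autodelete ora_arch01 on                <-- risky here, for the reason above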

So, by changing the hostname of the secondary storage, we nullified ALL SMO backups that included a snapshot of that secondary archive log location. We couldn't delete them gracefully; we had to go through and force-delete them, as well as traverse every single volume and remove the snapshots related to those jobs manually on all filers.
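The cleanup itself was nothing fancy, just tedious: mark the dead backups as deleted from the SMO side with a forced delete, then walk every volume and clear the orphaned snapshots by hand. Something along these lines, with the profile, label, and volume names genericized:

smo backup list -profile PROD_ORA                  <-- find the backups still referencing the old filer
smo backup delete -profile PROD_ORA -label <label> -force

filer> snap list ora_data01                        <-- anything left behind by those jobs...
filer> snap delete ora_data01 <snapshot_name>      <-- ...gets removed by hand, volume by volume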

I did speak with one of the SMO gurus at NetApp this morning, and he confirmed this behavior. He also confirmed that there is a setting in smo.config that can be changed to not snap the secondary archive log location.

Once I get back from Thanksgiving next week, I'll be posting the results of fixing this, with a thorough walkthrough.

Thursday, November 11, 2010

[NetApp] Paranormal SNAPtivity!!!



Ghostz!  In my NetAppz!


So recently, we upgraded our replication target box (known in proper terms as a "SnapVault Secondary") from a FAS2050 to a FAS3140.


Everything went great on the migration.  I cleared all of the SnapVault relationships out of Protection Mgr, re-baselined to the new box (yes, I know I didn't have to, but I had my reasons), and all was well with the world.


I was reviewing syslogs this morning, and noticed something strange spamming my console...


Thu Nov 11 12:13:00 PST [snapvault.tgt.failure:error]: Could not create Snapshot target "" on volume <vol_name>: volume is not online or does not exist.


The interesting part about this was that...well, this was a particular relationship we had removed completely because its location had moved.  So why are you still trying to SnapVault it?


I re-check Protection Mgr's console, and nope, no Dataset relationship listed.


Snapvault status tells a similar story.
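For the record, that check was just the usual suspects, with nothing for that volume showing up in either one:

filer> snapvault status        <-- no active relationship or transfers for that volume
filer> snapvault status -c     <-- no configured relationship, either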


Where I found it was in the command-line side of configuring SnapVault relationships.


The ol' "snapvault snap sched/unsched" commandset we used to have to use before Protection Mgr came around.


filer> snapvault snap sched

create <vol_name>  0@-
create <vol_name> dfpm_temp 1@-@0


Sure enough...there it was. 


Come to find out, it's a bit flaky in the sense that if any of the snapping/mirroring/protecting/DP technologies are in the process of taking snaps, and you remove the relationship via Protection Mgr during any of those processes, it doesn't always gracefully remove the base commandset from the underlying OS, and these need to be removed manually.


filer> snapvault snap unsched <vol_name>
Listing matching snapshot schedules ...
create <vol_name>  0@-
Unconfigure these snapshot schedules? y


[poltergeist] "I have exorcised the demons! This house is clean!" [/poltergeist]


NetApp customers can reference this KB article.


-Nick

Wednesday, November 10, 2010

Oracle on VMware - ALL Oracle products now Supported!

Oh glorious day!  

A big victory was won today on the Oracle-on-VMware battle front.  Most of the resistance has been around excluding RAC from any support stance, and the overall [lack of] an official support stance.  Before, they just blanket didn’t support it.  They’ve now changed that stance, and updated their support note to show that they DO support all of their products, including RAC (*11.2.0.2 and later*), running on VMware.

The highlight is they don’t CERTIFY.   They DO support all products, but do not CERTIFY VMware as a hardware platform.  And if a problem is found, they request you work with VMware on it, which is exactly what we’ve always wanted anyway!  (i.e. We would call Sun for the v880’s hardware, not Oracle, etc.)

The last obstacle is updating the Enterprise Licensing model to recognize soft-partitioning technologies other than their own, so that customers can actually increase their Oracle footprint, rather than being fear-mongered into decreasing it by licensing models that don't make sense in a virtual world.

Here’s the actual updated metalink article, released on Tuesday and announced today:

Support Position for Oracle Products Running on VMWare Virtualized Environments [ID 249212.1]




Modified 08-NOV-2010     Type ANNOUNCEMENT     Status PUBLISHED
Purpose
---------
Explain to customers how Oracle supports our products when running on VMware
Scope & Application
----------------------
For Customers running Oracle products on VMware virtualized environments.
No limitation on use or distribution.
Support Status for VMware Virtualized Environments
--------------------------------------------------
Oracle has not certified any of its products on VMware virtualized environments. Oracle Support will assist customers running Oracle products on VMware in the following manner: Oracle will only provide support for issues that either are known to occur on the native OS, or can be demonstrated not to be as a result of running on VMware.
If a problem is a known Oracle issue, Oracle support will recommend the appropriate solution on the native OS.  If that solution does not work in the VMware virtualized environment, the customer will be referred to VMware for support.   When the customer can demonstrate that the Oracle solution does not work when running on the native OS, Oracle will resume support, including logging a bug with Oracle Development for investigation if required.
If the problem is determined not to be a known Oracle issue, we will refer the customer to VMware for support.   When the customer can demonstrate that the issue occurs when running on the native OS, Oracle will resume support, including logging a bug with Oracle Development for investigation if required.
NOTE:  Oracle has not certified any of its products on VMware.  For Oracle RAC, Oracle will only accept Service Requests as described in this note on Oracle RAC 11.2.0.2 and later releases.