Thursday, February 10, 2011

ESX/i 4.1 + HP NC522SFP+ = PSoDs

We recently decided to cable all of our ESX hosts exclusively via 10GbE.  We had a ton of extra cables lying around, but still needed to order some to finish the job.  So we ordered them, and off we went rebuilding the environment.

The first two hosts, aside from some issues with Host Profiles (which I could rant about for hours), went flawlessly.  This was because we had enough of the older style cables.  On the third host, I had to start using the new ones we had ordered, which, while nicer looking, did appear completely different.

(i.e. silicon board inside is red instead of green, pull tabs different.  Aesthetic stuff like that.)

@Cisco: I would love some definition as to what is different between these two.  The only different numbers I can find are:

Old:    37-0961-01
New:  37-0961-02
Both are SFP-H10GB-CU*M (* = length of cable in meters)

Anyway, I plugged the first new one in, no link lights.  Hrm, maybe a bad cable.  I'll try another one.  No link lights.  OK, third time's a charm.  Nope, no link lights.

Started digging and apparently, you have to upgrade the firmware of the HP NC522SFP+ 10GbE adapters in order for these new cables to work.

Again, @Cisco..... why is that?  What changed?

OK, not a huge deal.  I went out to HP's website, grabbed the latest Firmware Maintenance DVD ISO (9.20B), burned it, and loaded it.  Everything went fine.

"Wait, why does it say QLogic now instead of NetXen..."

POST process goes fine, and ESX starts to load.....it gets to "networking-drivers..." and BANG.  PSoD.

Crap.

Let the research begin.  I hopped onto HP's support site, and initiated a chat session online with one of their techs.  Kudos, because after about 10 minutes, this guy found the issue.

HP Advisory Link

^This is the advisory.  It's pretty long-winded, so I thought I would sum up what the problem is here in a TLDR version...

First, let me be clear:  Everything worked famously up until this point.  I had zero problems with the Cisco twinax cables, the Nexus 5010s, or the HP NC522SFP+ 10GbE NICs.  The trigger point was Cisco changing the cables, which required the firmware upgrade.


What the advisory says is:

This occurs due to an incompatibility of the VMware ESX/ESXi 4.1 in-box Qlogic 4.0.550 driver with the Qlogic 4.0.520 (or higher) NIC firmware installed on an NC522m or NC522SFP Gigabit Server Adapter.


I can confirm as of today that this only affects 4.1 and up.  I successfully updated the firmware today on some ESX 4.0 boxes and they worked flawlessly.  There is an obvious mismatch between the ESX driver in 4.1 and the HP firmware.  I waited until today to post this so I could test that, and because 4.1U1 was being released.  Apparently, there's no update to the driver in U1.  Seriously, guys?!  :\
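
If you have a pile of hosts to sort through, the condition in the advisory is easy enough to write down as a quick check.  This is only a minimal Python sketch of how I read the advisory, not anything from HP or VMware: the function names are made up for illustration, you have to supply the version strings yourself (it does not query a host), and it assumes that only the in-box 4.0.550 driver on a 4.1 host is affected.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.0.539' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Values taken from the advisory text quoted above.
INBOX_DRIVER = parse_version("4.0.550")        # ESX/ESXi 4.1 in-box driver
FIRST_BAD_FIRMWARE = parse_version("4.0.520")  # "4.0.520 (or higher)"


def is_affected(esx_version, driver_version, firmware_version):
    """Return True if the combination matches the advisory's description.

    Assumption (mine, not the advisory's): any driver other than the
    in-box 4.0.550 one is treated as not affected.
    """
    esx = parse_version(esx_version)
    return (
        esx[:2] == (4, 1)
        and parse_version(driver_version) == INBOX_DRIVER
        and parse_version(firmware_version) >= FIRST_BAD_FIRMWARE
    )
```

For example, `is_affected("4.1", "4.0.550", "4.0.539")` comes back `True`, which is exactly the box I PSoD'd, while the same firmware on an ESX 4.0 host comes back `False`.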

So, how do we fix this?

You need to download a copy of the HP Firmware DVD Bundle 9.20B (iso).  Download Link
You need to download the USB utility to create a bootable USB from the above iso.  Download Link
You need the latest custom drivers from QLogic off the VMware site.  Download Link
You need the latest firmware from HP (as of today, that is 4.0.539, dated 15Dec2010).  Download Link

You will also need ESX/i 4.1 install media.  I'll leave this one to you to acquire based on your license level.
You will also need to burn the QLogic drivers ISO to disc for when we reload the OS.

OK, got it all together?  Good. Let's go through the motions.

1)  Unzip the firmware.9.20B.zip so you can get to the ISO file.
2)  Install the HP USB key creator.
3)  Run the HP USB key creator, and when prompted, point to the ISO file where you unzipped it.
4)  Once the USB key is created, browse the folder structure on the key and look for the subfolder "/hp/swpackages"
5)  Once there, paste the .scexe file from HP firmware into /hp/swpackages (I believe the exact file name is CP14007.scexe).  This will not overwrite, it will just be an additional package.
6)  Put the ESX host in maint mode (at this point, it's useless anyway) and boot it off of the USB key.
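
If you end up building more than one key, step 5 is the part worth scripting.  Here is a minimal Python sketch, assuming the key has already been created by the HP tool and is mounted somewhere you can write to.  The function name and the paths in the usage note are hypothetical; the only things taken from the steps above are the hp/swpackages folder and the idea that the package is added alongside the existing ones rather than replacing anything.

```python
import shutil
from pathlib import Path


def stage_firmware_package(usb_root, package_path):
    """Copy a firmware package into hp/swpackages on a prepared USB key.

    usb_root is wherever the key is mounted (hypothetical, supply your own).
    Refuses to overwrite, since the package should only be an addition.
    """
    usb_root = Path(usb_root)
    package_path = Path(package_path)

    target_dir = usb_root / "hp" / "swpackages"
    if not target_dir.is_dir():
        raise FileNotFoundError(f"{target_dir} not found; is this the HP key?")

    target = target_dir / package_path.name
    if target.exists():
        raise FileExistsError(f"{target} is already on the key")

    # copy2 keeps timestamps and permission bits where the filesystem allows.
    shutil.copy2(package_path, target)
    return target
```

Something like `stage_firmware_package("E:/", "C:/temp/CP14007.scexe")` would be the Windows flavor of it, with both paths swapped for your own.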

I would hope it goes without saying, but in an effort to be pedantic, your NC522SFP+ cards must be installed in the server in order to receive the update.  If you add additional new cards after the fact, you'll need to repeat this process to update them.

7)  Choose INTERACTIVE UPDATE when the load screen appears.
8)  Select the top selection titled "ML/DL 300/500" and at the bottom, check the two "ALLOW NON-BUNDLE" option boxes, and leave the "FORCE" option unchecked.
9)  It will go through an inventory process determining which packages need to be installed.  If you've placed the .scexe in the right place, you will see an option for which one you would like to install.  By default, it will select the most recent one, which is what we want.
10)  Leave the defaults, and click INSTALL.

Reboot the host when prompted (remove the USB key) and insert your vSphere 4.1 media.  Install as usual.  When you get to the screen where you're asked if you want to load custom drivers, choose YES, and insert the disc with the QLogic drivers on it.  Click OK.  You should only see one option, for the nx_nic driver.  Select it, and click OK.  Leave the driver disc in and continue on with the install process.  You will be prompted when you need to re-insert the ESX media.

You should be good to go at this point.  Finish your re-install.

HARD-NOSED CUSTOMER OPINION

This is a PAIN IN THE A$$ process, and HP, Cisco, AND VMware are all accountable here.  This simply cannot happen.  You three are some of the biggest players in this arena (if not THE biggest), and the simple fact that this could slip through the cracks is unacceptable, guys.  QA your stuff.  The fact that this is STILL not fixed after first emerging last summer is very telling of your lack of communication and working together.

-Nick
