Cisco UCS – Zero to Hero in 5 Short Years

I’d love to call Randy Seidl and ask him for an interview. The problem is, I don’t have the street cred that it would take to even make it past his administrative assistant. You see, Mr. Seidl used to work for Hewlett-Packard as their “Senior Vice President of the Americas, Enterprise Servers, Storage, and Networking”.

He doesn’t work for HP any more.

From YouTube/Cisco: "The Worst Predictions in History"

“A year from now the difference will be UCS is dead and we have had phenomenal market share growth in the networking space.”

This is a quote taken from this article over at CRN just prior to HP’s 2010 partner conference, just one year after Cisco launched the UCS platform. HP’s strategy at this point was to try to take market share away from Cisco in their core switching business. I suppose this was a natural response considering Cisco’s foray into enterprise servers was aimed to strike a blow at the heart of HP’s business. HP’s strategy aimed to empower their partners to offer significant discounts and trade-in allowances for any existing Cisco customers, hoping to woo them away from the teal giant. 2-for-1 and 3-for-1 deals weren’t uncommon, and it seemed HP was ready to cut off their nose to spite their face just to grab more of the Ethernet switching pie.

5 years later Cisco remains atop the Ethernet switch market with a 60.4% share.

But we’re not here to talk about Ethernet switching. We’re here to talk about Cisco UCS.

Happy Birthday Unified Computing System

The official Cisco Unified Computing System press release came in March of 2009. By 2010, which marked my first Cisco Live event in Las Vegas, UCS had a lot of hype among the Cisco faithful. I returned home from the conference excited about UCS, because at my day job we were in the process of jumping head-first into virtualization and were looking at different options for servers and storage.

I shared my enthusiasm for UCS but was told Cisco would never touch HP or IBM in market share for servers. We bought Dell.

In just 5 years, Cisco UCS has vaulted to the #1 spot in the Americas (40.9%), and #2 Worldwide (26.3%) for x86 Blade Servers, according to the IDC Worldwide Quarterly Server Tracker for 2014Q1.

HP has fallen from 47.7% to 34.9% in that time.

IBM has plummeted from 34.4% to 10.2% in that time.

Also according to the IDC report, Cisco has the highest industry growth in the total worldwide server market, with 39% revenue growth on a cumulative four quarter basis ending in 2014Q1. This, while HP, IBM, Dell, Oracle, and Fujitsu all report flat or declining results.

Cisco UCS also presently holds 94 performance benchmark records.

The Numbers Don’t Lie

5 years after launch, and 4 years after Mr. Seidl’s bold prediction that UCS would be dead in a year, the numbers reveal the success of the UCS platform. Cisco UCS has established itself as a key player in the enterprise server market, not only in the Americas but worldwide, and growth continues every quarter as the Cisco UCS product team continues to drive innovation and performance within the platform.

Cisco’s UCS business unit deserves a round of applause and a lot of credit.

The numbers don’t lie, but I’d still really like to ask Randy Seidl what he thinks of those numbers.

The official press release can be found here: http://newsroom.cisco.com/release/1426059

One Way Audio on Cisco 7925G Wireless Phones

Knock Knock.

Who’s there?

Voice over Wireless LAN.

Voice over Wireless LAN who?

….

….

Hello?

….

Hello?

Working with a multitude of different technologies is great. I love it, for the most part. That being said, sometimes it can be really frustrating as well. I am an expert in neither voice nor wireless technologies, but I am oftentimes the primary ‘go-to’ person for both of these subjects at work. Now I like working with voice; it’s fun and presents its own interesting challenges sometimes, but for the size of our VoIP deployment, it pretty much just works. Wireless, while still fun to play around with, tends to be my nemesis, as I just haven’t had enough time to really delve into its deeper mysteries. Now, on that rare occasion when the problem is related to both voice AND wireless, things start to get really interesting.

I recently deployed some Cisco 7925G Wireless IP Phones to a number of our sites’ custodians as a replacement for cellular phones. They need to be mobile around the facility in order to troubleshoot issues in places that don’t have a hard line, but don’t require a full-blown cell phone.

Now some caveats: we don’t have sufficient AP coverage for a full-blown VoWLAN deployment, and during testing with the 7925G I did notice some interruption in the call stream when roaming from AP to AP. We also no longer have Cisco as our wireless vendor, so I thought there might be some interoperability issues, but felt that 802.11 was, after all, a standard, right? What could possibly go wrong?

First Reports

The first rumblings of a problem came from some of the custodians saying they had ‘intermittent’ audio. I assumed (somewhat incorrectly) that this meant they were trying to wander around the building or even outside, treating the phone as a cell-phone, and losing sufficient signal from a nearby AP to maintain the call.

I explained to anyone with issues that these were not in fact cellular phones and they needed to stay within reasonable range of an AP to keep their call going. We would add capacity to the wireless as needed in the future, but for now it was the best we could do.

Sent Back

Next I received one of the phones, and its charger, in inter-office mail with a sticky note saying simply: “doesn’t work”. I tested the phone with a few different numbers and it seemed fine. I sent it back to the person who mailed it with a note: “works fine”.

As it turns out, I was wrong.

Definitely Broken

I next heard from another analyst who said all calls from the phone at one site were completely dropping. No audio at all. We tested and found that audio coming from the 7925s was fine, but they were having problems receiving audio. The initial call setup seemed fine and there were a few seconds of clear two-way audio, but almost immediately the received audio failed.

One-way audio – the bane of any voice engineer’s existence. The fact that these were wireless phones as well made troubleshooting the issue even more complicated.

I had initially thought this might be a QoS issue but the wired phones at the site were fine. Wireshark confirmed QoS wasn’t an issue but I could clearly see in the captures that the RTP to the handsets stopped shortly after calls began, resulting in one-way audio.

Viewing the Call Statistics on the phone also confirmed there was definitely some sort of problem. Jitter was extremely high, many packets were being lost on the receive side, and the MOS was around 2.

[Screenshot: 7925G call statistics before the change]
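For reference, the jitter figure that RTP endpoints report is usually the running interarrival estimate defined in RFC 3550. The sketch below shows that calculation in Python. Whether the 7925G computes its on-screen value exactly this way is an assumption on my part, the function name is hypothetical, and the timestamps are made up for illustration.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550.

    Both lists hold one timestamp per packet, in the same unit (e.g. ms).
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Difference in transit time between this packet and the previous one.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        # Smooth with a gain of 1/16, as the RFC specifies.
        jitter += (abs(d) - jitter) / 16
    return jitter


# Packets sent every 20 ms; the third one arrives 10 ms late.
print(interarrival_jitter([0, 20, 40], [5, 25, 55]))
```

A perfectly paced stream gives 0, and every late or early packet nudges the estimate up by one sixteenth of the deviation, which is why a sustained run of delayed frames shows up as a steadily climbing jitter value.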

Settings

I began playing around with the WLAN settings on the 7925G handsets, trying to find what might be causing the issue. Some suggestions from folks on Twitter pointed at forcing the phones to use 2.4 GHz only, while others insisted they would work fine on 5 GHz. Hard setting the frequency didn’t appear to resolve anything, so I continued the ever popular troubleshooting technique of randomly turning options on and off.

I came across the setting labeled “Call Power Save Mode” which was set by default to “U-APSD/PS-Poll” and also presented the option “None”.

Now, I had no idea what this option did, but I set it to “None” and performed a test call. Lo and behold, the issue appeared to go away. Two-way audio persisted through the entire call, and call statistics on the handset were dramatically improved. Jitter was down to 2/22, only 2 packets were dropped, and MOS was up to 4.5.

[Screenshot: 7925G call statistics after the change]

U-APSD/PS-Poll

So what exactly does this option do? U-APSD, or Unscheduled Automatic Power Save Delivery, is a mechanism that allows frames to be queued on a wireless access point in order to save power on a wireless client. When there is no data for the client to receive, it can go back into standby mode, allowing it to save power and battery life.

From Cisco’s Voice over Wireless LAN Design Guide:

The primary benefit of U-APSD is that it allows the voice client to synchronize the transmission and reception of voice frames with the AP, thereby allowing the client to go into power-save mode between the transmission/reception of each voice frame tuple. The WLAN client frame transmission in the access categories supporting U-APSD triggers the AP to send any data frames queued for that WLAN client in that AC. A U-APSD client remains listening to the AP until it receives a frame from the AP with an end-of-service period (EOSP) bit set. This tells the client that it can now go back into its power-save mode. This triggering mechanism is considered a more efficient use of client power than the regular listening for beacons method, at a period controlled by the delivery traffic indication map (DTIM) interval, because the latency and jitter requirements of voice are such that a WVoIP client would either not be in power-save mode during a call, resulting in reduced talk times, or would use a short DTIM interval, resulting in reduced standby times. The use of U-APSD allows the use of long DTIM intervals to maximize standby time without sacrificing call quality. The U-APSD feature can be applied individually across access categories, allowing U-APSD can be applied to the voice ACs in the AP, but the other ACs still use the standard power save feature.
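The trigger-and-deliver behaviour described in that passage can be sketched as a toy model in Python. This is only an illustration of the idea, not how any vendor implements it: the class and method names are hypothetical, there is no radio or timing involved, and real access points also limit how many frames they deliver per service period.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Frame:
    payload: Optional[str]   # None models an empty (null) data frame
    eosp: bool = False       # end-of-service-period bit


@dataclass
class AccessPoint:
    # Downlink voice frames held while the client is in power-save mode.
    buffered: Deque[str] = field(default_factory=deque)

    def enqueue_downlink(self, payload: str) -> None:
        self.buffered.append(payload)

    def on_trigger(self) -> List[Frame]:
        """An uplink frame from the client triggers delivery of queued frames."""
        frames = [Frame(p) for p in self.buffered]
        self.buffered.clear()
        if not frames:
            # Nothing queued: send an empty frame so the client can still doze.
            frames.append(Frame(None))
        frames[-1].eosp = True   # last frame ends the service period
        return frames


@dataclass
class Client:
    awake: bool = False
    received: List[str] = field(default_factory=list)

    def send_voice_frame(self, ap: AccessPoint) -> None:
        self.awake = True                      # wake up to transmit
        for frame in ap.on_trigger():
            if frame.payload is not None:
                self.received.append(frame.payload)
            if frame.eosp:
                self.awake = False             # back to power save


ap = AccessPoint()
phone = Client()
ap.enqueue_downlink("rtp-1")
ap.enqueue_downlink("rtp-2")
phone.send_voice_frame(ap)
print(phone.received, phone.awake)
```

In this model the phone only hears downlink audio when the AP answers its trigger, so the two sides have to agree on the mechanism for audio to flow toward the handset.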

Best Intentions

So why did turning this feature off resolve the one-way audio problem? It seems this is a technology that should help rather than hinder a wireless VoIP call. In this case it appears to do nothing but cause problems.

I can only speculate here because my understanding of this particular mechanism is limited, but I would suspect that even though U-APSD is a standard as part of IEEE 802.11e, the implementations may be somewhat disparate across vendors. Cisco in this case makes the phone and the wireless network is Ruckus. I suspect if I were using Cisco wireless gear, this wouldn’t be an issue. That’s not to blame Ruckus for the problem, of course; it just seems to be one of those minor differences in how vendors implement certain technologies.

This brings about an entirely different topic of discussion, but if this is the case, can anything be done to hold vendors accountable for the little tweaks and changes to technologies that are supposed to be standards designed to improve, not prevent, interoperability?

Troubleshooting MTU size over IPSEC VPN

I recently deployed a couple of wireless access points to two sites that connect to our main office over IPSEC VPN. After a recent firmware update to the wireless controller both access points got stuck in a provisioning loop and appeared to have difficulty communicating with the controller. Both APs repeatedly disconnected due to a “heartbeats lost” error.

Connectivity between the main office and the remote sites appeared fine. Both access points were reachable via ping and ssh. I set up a packet debug on both sites’ firewalls and saw traffic going back and forth between the access points and the controller, and both access points appeared on the controller status window, alternating between “Provisioning” and “Disconnected”.

Needless to say I was slightly baffled.

I opened a ticket with the wireless vendor and (very quickly) received an answer. The MTU for CAPWAP traffic between the access points and the controller is hard set by the controller to 1500*. With these sites connected via IPSEC, that was going to cause some fragmentation due to the overhead that IPSEC was going to add onto the traffic going between sites.

I needed to lower the MTU size on the controller, but to what value? IPSEC doesn’t seem to have a ‘fixed’ header size due to the different encryption options that can be used. So how do I find out exactly how much our particular IPSEC configuration is adding?

ping -f

The -f flag on the Windows ping command sets the Don’t Fragment (DF) bit, which prevents an ICMP packet from being fragmented. This, combined with the -l flag, allows you to set the size of the ICMP payload being sent.

So, assuming a standard Ethernet MTU of 1500, and accounting for an 8-byte ICMP header and a 20-byte IP header, I should be able to send an ICMP payload of 1472 bytes, but 1473 should be too large:

C:\Users\netcanuck>ping 172.16.32.1 -f -l 1472

Pinging 172.16.32.1 with 1472 bytes of data:
Reply from 172.16.32.1: bytes=1472 time=3ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=4ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=4ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=3ms TTL=251

C:\Users\netcanuck>ping 172.16.32.1 -f -l 1473

Pinging 172.16.32.1 with 1473 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Excellent! So now to test across our IPSEC tunnel:

C:\Users\netcanuck>ping 172.16.68.1 -f -l 1472

Pinging 172.16.68.1 with 1472 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Now this makes sense. The MTU size does not account for the IPSEC overhead.

After some testing with different packet sizes I hit on the magic number: 1384 bytes. At 1385 the packets were again rejected as being too large. So some quick math:

ICMP payload: 1384 bytes

ICMP header: 8 bytes

IP header: 20 bytes

Subtotal: 1412 bytes

This leaves 88 bytes of IPSEC overhead (1500 - 1412). I should be able to set the MTU size on the controller to 1412 and the access points should resume functioning normally.

I did in fact set the MTU to 1400 – I like nice, round numbers – and sure enough both access points resumed proper communication with the controller.
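The arithmetic above can be written down as a small Python sketch. The function and constant names are hypothetical, and it assumes IPv4 with no IP options (a 20-byte header) and a 1500-byte Ethernet MTU, as in the test above.

```python
ETHERNET_MTU = 1500   # standard Ethernet MTU assumed in this post
IP_HEADER = 20        # IPv4 header without options
ICMP_HEADER = 8       # ICMP echo header


def path_mtu_from_ping(max_payload: int) -> int:
    """Largest IP packet that fits, given the largest unfragmented ping payload."""
    return max_payload + ICMP_HEADER + IP_HEADER


def tunnel_overhead(max_payload: int, link_mtu: int = ETHERNET_MTU) -> int:
    """Bytes consumed by the tunnel, relative to the plain link MTU."""
    return link_mtu - path_mtu_from_ping(max_payload)


print(path_mtu_from_ping(1472))   # local LAN test
print(path_mtu_from_ping(1384))   # across the IPSEC tunnel
print(tunnel_overhead(1384))      # overhead added by the tunnel
```

With the values from this post, the three lines print 1500, 1412 and 88, so the 1400 I actually configured sits 12 bytes below the measured limit.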

What I Learned Today

Sometimes the simple tools are easy to overlook. Using a standard Windows command prompt and ping with the -f flag is a quick and easy way to diagnose MTU and fragmentation issues across a VPN tunnel.
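If you would rather not hunt for the magic number by hand, the trial-and-error can be turned into a binary search. The Python sketch below is one way to do it, under a few assumptions: the helper names are hypothetical, the probe relies on the Windows ping flags used above (-n, -f, -l) and on English output containing “TTL=” for a successful reply, and the search assumes that if a payload of a given size fits, every smaller one fits too.

```python
import subprocess
from typing import Callable


def windows_ping_fits(host: str, payload: int) -> bool:
    """Send one DF-set ping of the given payload size; True if a reply came back.

    Assumes Windows ping and English output ("TTL=" appears only in replies).
    """
    result = subprocess.run(
        ["ping", "-n", "1", "-f", "-l", str(payload), host],
        capture_output=True,
        text=True,
    )
    return "TTL=" in result.stdout


def largest_payload(fits: Callable[[int], bool], low: int = 0, high: int = 1472) -> int:
    """Binary search for the largest payload size for which fits() is True.

    Returns -1 if not even the smallest size gets through.
    """
    if not fits(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if fits(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Simulated tunnel that passes payloads up to 1384 bytes, as in this post.
print(largest_payload(lambda size: size <= 1384))

# Against a real host (not run here):
# largest_payload(lambda size: windows_ping_fits("172.16.68.1", size))
```

Keeping the probe as a separate function means the search itself can be checked without touching the network, and the same loop works with any other way of testing whether a packet of a given size gets through.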

* It appears from the support documentation for this particular wireless vendor that the MTU size should be 1450 by default which should take into account at least some overhead and explains why these access points were working fine until now. The firmware update seems to have changed this to 1500.

The Problem With “Free”.

It’s rare to have a day go by during which I don’t hear or read about some product that a vendor is now ‘giving away’ or moving to a ‘freemium’ model. In some of the more contentious verticals in the IT industry this seems to be a key tactic for winning new customers and providing value-add for existing ones.

I’m not in marketing or sales, so I can only assume here that the premise behind these gratuitous offerings is to have new, potential customers try the product, fall in love with it, and want to then add more of that company’s products to their infrastructure. There is also a tiny voice in my head that suggests perhaps these organizations might also want their ‘free’ product to become so critical to your operation, that should they decide to charge a fee or licensing for said product at some point in the future, that you’d be forced to pay because it has become something you simply couldn’t live without.

Ultimately the short or long-term goal of offering these products doesn’t really matter. What matters is there is a very big problem with these free products:

They’re free.

They don’t generate revenue, at least directly, for the vendor providing them. This means they are, in all aspects, simply a cost center…a money sink. An expense that perhaps proves the old saying that “you have to spend money to make money”. But the real issue here for you or me as a potential user, or implementer of these products, is that it is very difficult to get any support.

Hello, Bonjour

This particular rant, er, blog post is centered around one such product that everybody seems to be racing to give away. If you, like me, work in an environment that is moving to support the BYOD craze and have anything other than one large, flat network, then Apple’s Bonjour is probably driving you nuts and causing you to sprout gray hair, if you have any left.

Because this particular protocol and all of its relatives (mDNS, Zeroconf) can’t communicate across layer 3 boundaries (they use link-local multicast, which routers do not forward), when someone on your BYOD wifi wants to talk to the Apple TV on your corporate wifi, you need something to broker that connection. Enter the Bonjour Gateway (BG).
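To make the layer 3 limitation concrete: mDNS over IPv4 uses the multicast group 224.0.0.251, which falls inside 224.0.0.0/24, the block reserved for traffic that stays on the local link and is not forwarded by routers. A quick Python check with the standard ipaddress module shows the relationship. The function and constant names are my own, and 239.1.1.1 is just an arbitrary routable-scope multicast address used for contrast.

```python
import ipaddress

MDNS_GROUP_V4 = ipaddress.ip_address("224.0.0.251")
# Local Network Control Block: multicast that routers do not forward.
LOCAL_NETWORK_CONTROL_BLOCK = ipaddress.ip_network("224.0.0.0/24")


def stays_on_local_link(group: ipaddress.IPv4Address) -> bool:
    """True if the multicast group is confined to the local subnet."""
    return group in LOCAL_NETWORK_CONTROL_BLOCK


print(stays_on_local_link(MDNS_GROUP_V4))
print(stays_on_local_link(ipaddress.ip_address("239.1.1.1")))
```

The first check comes back True and the second False, which is why enabling multicast routing alone does not carry Bonjour between subnets and why a gateway has to relay the service announcements.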

Aerohive was first to announce and make available their BG product in early 2012. It is built into their HiveOS on any Aerohive access point, or as a virtual machine that will run on VMware. It’s free up to 2 instances of the virtual appliance. I don’t know what the cost might be for anyone wishing to use more than 2, but I would imagine this is an opportunity to sell actual Aerohive hardware to a potential customer.

Cisco has included it as part of their Wireless LAN Controller (WLC) software beginning with version 7.4. This isn’t free, per se, but is obviously a valuable addition for any existing customer.

In January 2013 Ruckus announced their SmartWay™ technology as “beyond bonjour bridging”, with availability planned for Q2. Again, this is only free in the sense that existing customers would not have to pay for the software upgrade to their existing controllers.

A quick Google search for other vendor offerings shows that pretty much everyone in the wireless space is offering support for Bonjour in some way.

I may be wrong about this but it seems to me that providing a solution for this issue in enterprise networks is/was a priority for each of these vendors. Why then has my experience with getting one of these platforms working been such a disaster?

Aerohive

If you don’t already follow Andrew von Nagy on Twitter (@revolutionwifi), you should. He is a true wifi evangelist and an excellent resource for keeping up-to-date on all things 802.11. His twitter feed was very active with the announcement of the release of Aerohive’s BG.

Working in a K-12 education environment we had already identified this as a need. Staff and students wanted to take advantage of AirPrint and AirPlay and we had to find a solution. I quickly signed up for my free Aerohive BG and HiveManager account.  Installation was easy as it comes in the form of an OVA. It’s pretty much ‘drop it into VMware’ and you are ready to go.

I had some problems with devices being able to see the AirPrint and AirPlay services across subnets. After some tinkering I decided to email Aerohive at the provided “free_bonjour_support@aerohive.com” address with my issue. That email must have ended up in the bit bucket because I received no reply.  I sent out a tweet about a week later asking @Aerohive how long one could expect to wait for support for the BG.  That too was met with silence. Two weeks later I was rather frustrated and sent out another tweet, this one a little more vitriolic:

“Going nowhere fast with Aerohive’s free bonjour gateway. Anyone have alternative suggestions? (That work)”

Now it should be noted that I’m in Canada and this tweet was sent out on November 22nd, 2012 – US Thanksgiving.

Andrew von Nagy responded via twitter and helped me out with some troubleshooting. I have to throw out a big thanks to him for taking the time on a holiday to offer some support.

On that same day, I received a reply to my original email (unsure if Andrew had anything to do with this) and began working with the online support to get the BG working.

A short 10 weeks later, I had resolved the issue (on my own) and closed the support request with Aerohive.  From the original email on November 5th to resolution on January 10th….granted there are a few holidays in there…but that’s a long time to get an issue with an initial configuration resolved.

Ruckus

Just around the same time (January 2013) I managed to get that first BG working, we received word from our current wireless vendor, Ruckus, that they too were working on a BG solution. This was direct from David Callisch, VP of Marketing for Ruckus Wireless. He even offered to let us beta test the new firmware. This is great news! Being able to implement this solution on infrastructure we already own and manage should be quick and easy, right?

It’s mid May, and we still haven’t received the beta firmware.

Also, Ruckus recently pulled their latest 9.6 firmware off their support site, so I have a feeling 9.7 and SmartWay™ are going to miss their targeted Q2 release.

“Ruckus Wireless has decided to remove the 9.6.0.0.264 release for ZoneDirector while we investigate an issue that was discovered after the release.”

Aerohive Revisited

In April I received an email from Aerohive that outlined some major bug fixes and enhancements to their free BG.  While I had been able to get it working with AirPlay somewhat in my previous attempt we had never been able to get AirPrint to work properly. I hoped that this news would mean we could get both pieces to function properly.

Having deleted the VM for the original installation of Aerohive’s BG, I attempted to reinstall it, only to be told that my serial # had already been activated and that I could not reactivate it. Ok, easy fix, right? I fired off an email to “free_bonjour_support@aerohive.com” and explained my situation and asked if I could have a new key or the original key re-enabled.

That email went out April 19th, and I have yet to get any sort of reply.

Free Should Not Mean “free from support”

If these value-added features, or in some cases, fully ‘free’ products are meant to drive potential customers to become paying customers and/or if these products are meant to keep existing customers as loyal, long-term customers with an existing vendor, then I would expect support be as agile and attentive as it would be for any other product or offering from these same vendors.

I shouldn’t be left waiting for an email that never comes, and I certainly shouldn’t have to resort to social media shaming to get action from a vendor. Sadly it seems to be the most effective method of getting things moving, but it should be a last resort not the primary method of seeking resolution.

Perhaps I’m expecting too much from a free product or feature, and I may be misinterpreting the purpose of these add-ons as marketing/sales tools. I might be naive in believing that any truly ‘free’ product is going to become a key part of my infrastructure and solve a major technical hurdle for my users. I can only hope there is actually some sort of benevolent, beneficial reason for vendors to offer these solutions, and hope that they are able to provide some better support in the future.

Otherwise, there are truly free and open products like Avahi that are able to quickly and easily deploy mDNS service discovery options across subnets. If you know a little Linux…

Note: During the writing of this post I was contacted by our local Aerohive rep, who caught wind of a Tweet I sent out yesterday about my BG issue. He’s managed to get me a new serial # for our BG so I can happily reinstall it and give it another go. Social media wins again!