Cisco Catalyst Wifi, Take Two

On November 13th, Cisco announced their next-generation wireless platform with the release of the Catalyst 9800 Series Wireless Controller.

You read that right, the next WLC platform from Cisco is running on Catalyst and expands Cisco’s DNA-Center architecture into the wireless space.

The Catalyst 9800 controllers come in a variety of form factors. The option for a standalone hardware controller is still here with the 9800-40 and 9800-80, or the 9800 series can be run as a VM in a private or public cloud. A third option is now to run embedded wireless on the Catalyst 9k series switches.

Embedded wireless controllers on Catalyst switches…that sounds familiar, doesn’t it?

Cisco made a similar move a few years ago with an architecture called Converged Access. This embedded the wireless controller functionality into IOS XE on the 3650 and 3850 access switches. For various reasons, it did not live up to expectations, and Cisco killed it in IOS XE Everest 16.5.1a in late 2017.

Cisco and Aironet

Cisco acquired Aironet Wireless Communications in 1999 for $799M. Since then, Cisco wireless access points have generally been referred to as “Aironet” products by name. This includes the software that runs on the wireless controllers and access points, AireOS.

AireOS came from Cisco’s acquisition of Airespace in 2005. Airespace were the developers of the AP/Controller model and the Lightweight Access Point Protocol (LWAPP), which was the precursor to CAPWAP.

(Credit to Jake Snyder for correcting me on the origins of AireOS)

Whatever AireOS version is running on your wireless controller is the same one that you have on your access points. Cisco has developed the platform into what it is today, and very little of the original AireOS remains.

With this iteration, or rather reinvention, of the Wireless Controller, Cisco have highlighted three key improvements over the predecessor wireless software.

Always-On

Controller redundancy is always critical to prevent downtime in the event of failure. Here, Cisco are touting stateful switchover with an active/standby model in which client state is synchronized to the standby controller, so clients see no downtime if the active controller fails.

Patches and minor software updates now will not change the base image of the controller. Updates can be done without client downtime. Patches for specific AP models can be done without affecting the base image or other access point models with per-AP device packs. These are installed to the controller and then pushed only to the model of AP they are for.

New AP models can also be joined to the controller without impact to the overall base image with the AP device packs, allowing new hardware to join an existing environment without a major upgrade.

Citing “no disruption” base image/version upgrades, the new 9800 controllers can be updated independently of the access points. Previously, the software version running on the controller and access points was coupled: upgrades were done to the controller, and then pushed to the access points. More often than not, this resulted in interruption to clients on affected access points. Some rebooting of the controller and APs was inevitable, and quite often some orphaned access points never quite upgraded properly or failed to rejoin the controller.

Cisco have made many improvements to the upgrade process over the years, including staged firmware upgrades. However, in large wireless deployments, firmware upgrades would not generally be considered zero-downtime.

With the new controller architecture using an RF-based intelligent rolling upgrade process, Cisco has aimed at eliminating some of these issues. During the upgrade process, the standby or secondary controller is first upgraded to the new image. You can then specify the percentage of access points you would like upgraded at once (5%-25%), and the controller determines which APs should be upgraded using the AP neighbor information and the number of clients on each AP. APs with no clients are upgraded first. Clients on access points that are about to be upgraded are steered toward neighboring access points in order to prevent interruption in service.

The idea of steering clients to other access points, or to 5 GHz radios instead of 2.4 GHz radios, isn’t new. Because I’m not a wireless expert I won’t comment on exactly how it’s done, but it is my understanding that it is difficult to guarantee that the client will “listen” to the steering mechanism. I feel that even with this intelligent RF behind the upgrade process, some clients will inevitably experience a loss of connectivity during the upgrade.

Once the access point is upgraded, it then joins the already-upgraded controller, and resumes servicing clients.

After all access points are joined to the upgraded controller, the primary controller begins its upgrade process.
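To make the batching idea a little more concrete, here is a minimal sketch of how such a selection could work. To be clear, this is my own guess at the shape of the logic, not Cisco's actual algorithm: the AccessPoint type, the plan_batches function, and the rule of never upgrading two neighboring APs in the same batch are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AccessPoint:
    name: str
    clients: int
    neighbors: frozenset  # names of nearby APs (assumed symmetric)


def plan_batches(aps, percent):
    """Group APs into upgrade batches of at most `percent` of the fleet.

    Hypothetical logic: idle APs go first, and two neighboring APs are
    never placed in the same batch so steered clients have somewhere to go.
    """
    if not 5 <= percent <= 25:
        raise ValueError("percent must be between 5 and 25")
    batch_size = max(1, math.floor(len(aps) * percent / 100))
    # Fewest clients first; the name is only a tie-breaker for stable output.
    pending = sorted(aps, key=lambda ap: (ap.clients, ap.name))
    batches = []
    while pending:
        batch_names = []
        chosen = set()
        for ap in pending:
            if len(batch_names) == batch_size:
                break
            if ap.neighbors & chosen:
                continue  # a neighbor is already in this batch, defer this AP
            batch_names.append(ap.name)
            chosen.add(ap.name)
        batches.append(batch_names)
        pending = [ap for ap in pending if ap.name not in chosen]
    return batches
```

With eight APs and 25% per batch, two idle APs that neighbor each other end up in different batches rather than going dark together.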

Secure

Encrypted Traffic Analytics was first announced as part of the Catalyst 9K switch launch, and uses advanced analytics and Cisco Stealthwatch to detect malware in encrypted flows, without the need for SSL decryption. ETA is now available for wireless traffic on the 9800 platform, if deployed in a centralized model, meaning all wireless traffic is tunneled back to the controller.

This is a great feature considering the only other option for gaining visibility into encrypted traffic is usually some form of sketchy certificate man-in-the-middle voodoo. In many situations this works okay for corporate domain-joined machines as here you can control the certificate trusts, but if you provide wireless to any BYOD devices or to the general public in any way, this often results in people not using your wireless because of certificate issues.

Deploy Anywhere

Cisco is offering a lot of flexibility in deployment options for this new wireless controller.

Branch offices can look at the embedded software controller on Catalyst 9K switches for up to 200 APs, and 4K clients.

Edit: Since the original publication of this post, I’ve clarified that the option to run the 9800 controller on a Catalyst 9K switch is only available as an SD-Access Fabric Mode deployment option. SD-Access requires DNA Center. This is an expensive proposition for what could truly have been a fantastic option for small/medium branch office deployments.

Private or public cloud options are available on KVM, VMware, Cisco ENCS, and will be available on AWS. These options support 1000, 3000, and up to 6000 APs, and 10K, 32K, and 64K clients. The AWS public cloud option only supports FlexConnect deployment models, which makes sense as tunneling all client traffic back to your controller in this case would get expensive quickly.

Physical appliance options include the 9800-40 at 2000 APs, 32K clients and 40Gbps (4x10Gbps interfaces), as well as the 9800-80 at 6000 APs, 64K clients, and 80Gbps (8x10Gbps interfaces). The 9800-80 also has a modular option which allows for GigE, 10GigE, 40GigE, and 100GigE uplinks.
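As a quick way to compare the options, the scale numbers quoted in this post can be dropped into a small lookup. This is only a sketch built from the figures above, not an official sizing tool; the "small/medium/large" cloud labels and the candidates function are names I made up for illustration.

```python
# (label, max APs, max clients) as quoted in this post, not from a datasheet.
# The cloud tier labels are hypothetical names for the three quoted sizes.
PLATFORMS = [
    ("Embedded on Catalyst 9K", 200, 4_000),
    ("Cloud (small)", 1_000, 10_000),
    ("9800-40", 2_000, 32_000),
    ("Cloud (medium)", 3_000, 32_000),
    ("Cloud (large)", 6_000, 64_000),
    ("9800-80", 6_000, 64_000),
]


def candidates(aps, clients):
    """Return every listed option whose quoted limits cover the given scale."""
    return [
        label
        for label, max_aps, max_clients in PLATFORMS
        if aps <= max_aps and clients <= max_clients
    ]
```

For example, candidates(2500, 20000) leaves only the medium and large cloud options and the 9800-80. Scale is only one input, of course; the embedded option carries the SD-Access requirement noted above.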

Each of these options has identical setup, configuration, management, and features.

Lessons Learned?

Overall, the presentation of this new wireless platform seems solid. Cisco have acknowledged the problems with Converged Access, and seem to have checked off all of the missing boxes from that first attempt. Feature parity was a big one, and Cisco insists here that all features will be the same up to the existing controller software version 8.8 (current version is 8.5 at the time of this post), so that would give Cisco and their customers quite a bit of time to flesh out the new architecture.

Now, AireOS isn’t going to disappear suddenly. Cisco have said that they are going to continue to develop and support the existing line of controllers and AireOS software until they can be sure that this new architecture has been successfully adopted by their customers. Customers who previously bought into Converged Access may not be lining up to be the first to try out the new platform, but the popularity of the Catalyst 9K switches should provide a good foundation for the embedded controller to gain a foothold.

You can check out Cisco’s presentation at Networking Field Day 19 here:

 

One Way Audio on Cisco 7925G Wireless Phones

Knock Knock.

Who’s there?

Voice over Wireless LAN.

Voice over Wireless LAN who?

….

….

Hello?

….

Hello?

Working with a multitude of different technologies is great. I love it, for the most part. That being said, sometimes it can be really frustrating as well. I am an expert in neither voice nor wireless technologies, but I am oftentimes the primary ‘go-to’ person for both of these subjects at work. Now I like working with voice, it’s fun and presents its own interesting challenges sometimes, but for the size of our VoIP deployment, it pretty much just works. Wireless, while still fun to play around with, tends to be my nemesis, as I just haven’t had enough time to really delve into its deeper mysteries. Now, on that rare occasion when the problem is related to both voice AND wireless, things start to get really interesting.

I recently deployed some Cisco 7925G Wireless IP Phones to a number of our sites’ custodians as a replacement for cellular phones. They need to be mobile around the facility in order to troubleshoot issues in places that don’t have a hard line, but don’t require a full-blown cell phone.

Now some caveats: we don’t have sufficient AP coverage for a full-blown VoWLAN deployment, and during testing with the 7925G I did notice some interruption in the call stream when roaming from AP to AP. We also no longer have Cisco as our wireless vendor, so I thought there might be some interoperability issues, but felt that 802.11 was, after all, a standard, right? What could possibly go wrong?

First Reports

The first rumblings of a problem came from some of the custodians saying they had ‘intermittent’ audio. I assumed (somewhat incorrectly) that this meant they were trying to wander around the building or even outside, treating the phone as a cell phone, and losing sufficient signal from a nearby AP to maintain the call.

I explained to anyone with issues that these were not in fact cellular phones and they needed to stay within reasonable range of an AP to keep their call going. We would add capacity to the wireless as needed in the future, but for now it was the best we could do.

Sent Back

Next I received one of the phones, and its charger, in inter-office mail with a sticky note saying simply: “doesn’t work”. I tested the phone with a few different numbers and it seemed fine. I sent it back to the person who mailed it with a note: “works fine”.

As it turns out, I was wrong.

Definitely Broken

I next heard from another analyst who said all calls from the phone at one site were completely dropping. No audio at all. We tested and found that audio coming from the 7925s was fine, but they were having problems receiving audio. The initial call setup seemed fine and there were a few seconds of clear two-way audio, but almost immediately the received audio failed.

One-way audio – the bane of any voice engineer’s existence. The fact that these were wireless phones as well made troubleshooting the issue even more complicated.

I had initially thought this might be a QoS issue but the wired phones at the site were fine. Wireshark confirmed QoS wasn’t an issue but I could clearly see in the captures that the RTP to the handsets stopped shortly after calls began, resulting in one-way audio.

Viewing the Call Statistics on the phone also confirmed there was definitely some sort of problem. Jitter was extremely high, the count of receiver lost packets was large, and the MOS was around 2.

[Screenshot: 7925G call statistics, before]

Settings

I began playing around with the WLAN settings on the 7925G handsets, trying to find what might be causing the issue. Some suggestions from folks on Twitter pointed at forcing the phones to use 2.4 GHz only, while others insisted they would work fine on 5 GHz. Hard setting the frequency didn’t appear to resolve anything, so I continued the ever popular troubleshooting technique of randomly turning options on and off.

I came across the setting labeled “Call Power Save Mode” which was set by default to “U-APSD/PS-Poll” and also presented the option “None”.

Now, I had no idea what this option did, but I set it to “None” and performed a test call. Lo and behold, the issue appeared to go away. Two way audio persisted through the entire call, and call statistics on the handset were dramatically improved. Jitter was down to 2/22, only 2 dropped packets, and MOS was up to 4.5.

[Screenshot: 7925G call statistics, after]

U-APSD/PS-Poll

So what exactly does this option do? U-APSD, or Unscheduled Automatic Power Save Delivery, is a mechanism that allows frames to be queued on a wireless access point in order to save power on a wireless client. When there is no data for the client to receive, it can go back into standby mode, allowing it to save power and battery life.

From Cisco’s Voice over Wireless LAN Design Guide:

The primary benefit of U-APSD is that it allows the voice client to synchronize the transmission and reception of voice frames with the AP, thereby allowing the client to go into power-save mode between the transmission/reception of each voice frame tuple. The WLAN client frame transmission in the access categories supporting U-APSD triggers the AP to send any data frames queued for that WLAN client in that AC. A U-APSD client remains listening to the AP until it receives a frame from the AP with an end-of-service period (EOSP) bit set. This tells the client that it can now go back into its power-save mode. This triggering mechanism is considered a more efficient use of client power than the regular listening for beacons method, at a period controlled by the delivery traffic indication map (DTIM) interval, because the latency and jitter requirements of voice are such that a WVoIP client would either not be in power-save mode during a call, resulting in reduced talk times, or would use a short DTIM interval, resulting in reduced standby times. The use of U-APSD allows the use of long DTIM intervals to maximize standby time without sacrificing call quality. The U-APSD feature can be applied individually across access categories, allowing U-APSD can be applied to the voice ACs in the AP, but the other ACs still use the standard power save feature.
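The trigger-and-deliver behavior described in that passage can be pictured with a toy model. This is a deliberately simplified sketch, not how any real AP firmware works: the ToyAP class and its method names are invented for illustration, frames are plain strings, and details such as what an AP sends when nothing is buffered are left out.

```python
from collections import deque


class ToyAP:
    """Buffers downlink voice frames for a dozing client (toy model only)."""

    def __init__(self):
        self.buffered = deque()

    def enqueue_downlink(self, frame):
        # The client is asleep, so the frame waits on the AP.
        self.buffered.append(frame)

    def on_trigger(self):
        """An uplink frame from the client triggers delivery.

        Returns (frame, eosp) pairs; only the final frame carries the
        end-of-service-period flag that lets the client doze again.
        """
        frames = list(self.buffered)
        self.buffered.clear()
        last = len(frames) - 1
        return [(frame, index == last) for index, frame in enumerate(frames)]
```

In this model, downlink audio only moves when the client's trigger is honored. If the phone and the AP disagree about that exchange, frames would simply pile up in the queue, which is at least consistent with audio working in one direction only.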

Best Intentions

So why did turning this feature off resolve the one-way audio problem? It seems this is a technology that should help rather than hinder a wireless VoIP call. In this case it appears to do nothing but cause problems.

I can only speculate here because my understanding of this particular mechanism is limited, but I would suspect that even though U-APSD is a standard as part of IEEE 802.11e, the implementations may be somewhat disparate across vendors. Cisco in this case makes the phone and the wireless network is Ruckus. I suspect if I were using Cisco wireless gear, this wouldn’t be an issue. That’s not to blame Ruckus for the problem of course, it just seems to be one of those minor differences in how vendors implement certain technologies.

This brings about an entirely different topic of discussion, but if this is the case, can anything be done to hold vendors accountable for the little tweaks and changes to technologies that are supposed to be standards designed to improve, not prevent, interoperability?

Troubleshooting MTU size over IPSEC VPN

I recently deployed a couple of wireless access points to two sites that connect to our main office over IPSEC VPN. After a recent firmware update to the wireless controller, both access points got stuck in a provisioning loop and appeared to have difficulty communicating with the controller. Both APs repeatedly disconnected due to a “heartbeats lost” error.

Connectivity between the main office and the remote sites appeared fine. Both access points were reachable via ping and ssh. I set up a packet debug on both sites’ firewalls and saw traffic going back and forth between the access points and the controller, and both access points appeared on the controller status window, alternating between “Provisioning” and “Disconnected”.

Needless to say I was slightly baffled.

I opened a ticket with the wireless vendor and (very quickly) received an answer. The MTU for CAPWAP traffic between the access points and the controller is hard set by the controller to 1500*. With these sites connected via IPSEC, that was going to cause some fragmentation due to the overhead that IPSEC was going to add onto the traffic going between sites.

I needed to lower the MTU size on the controller, but to what value? IPSEC doesn’t seem to have a ‘fixed’ header size due to the different encryption options that can be used. So how do I find out exactly how much our particular IPSEC configuration is adding?

ping -f

The -f flag from a Windows command prompt sets the Don’t Fragment (DF) bit, which prevents an ICMP packet from being fragmented. Combined with the -l flag, which sets the size of the ICMP payload, it lets you find the largest packet a path will carry without fragmentation.

So, assuming a standard ethernet MTU of 1500, and accounting for an 8-byte ICMP header and a 20-byte IP header, I should be able to send an ICMP payload of 1472 bytes, but 1473 should be too large:

C:\Users\netcanuck>ping 172.16.32.1 -f -l 1472

Pinging 172.16.32.1 with 1472 bytes of data:
Reply from 172.16.32.1: bytes=1472 time=3ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=4ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=4ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=3ms TTL=251

C:\Users\netcanuck>ping 172.16.32.1 -f -l 1473

Pinging 172.16.32.1 with 1473 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Excellent! So now to test across our IPSEC tunnel:

C:\Users\netcanuck>ping 172.16.68.1 -f -l 1472

Pinging 172.16.68.1 with 1472 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Now this makes sense. The MTU size does not account for the IPSEC overhead.

After some testing with different packet sizes I hit on the magic number: 1384 bytes. At 1385 the packets were again rejected as being too large. So some quick math:

ICMP payload: 1384 bytes

ICMP header: 8 bytes

IP header: 20 bytes

Subtotal: 1412 bytes

This leaves 88 bytes (1500 - 1412) as the IPSEC overhead. I should be able to set the MTU size on the controller to 1412 and the access points should resume functioning normally.
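The same arithmetic can be written down as a few small helpers, which is handy if you need to repeat it for tunnels with different encryption settings. This is just a sketch of the math above; the function names are my own, and it assumes a plain 20-byte IPv4 header with no options.

```python
ETHERNET_MTU = 1500
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one packet of the given MTU."""
    return mtu - IP_HEADER - ICMP_HEADER


def path_mtu_from_payload(payload):
    """Packet size implied by the largest payload that got through."""
    return payload + ICMP_HEADER + IP_HEADER


def tunnel_overhead(payload, link_mtu=ETHERNET_MTU):
    """Bytes consumed by the tunnel, given the largest working payload."""
    return link_mtu - path_mtu_from_payload(payload)
```

Plugging in the numbers from this post: max_ping_payload(1500) gives 1472, path_mtu_from_payload(1384) gives 1412, and tunnel_overhead(1384) gives 88.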

I did in fact set the MTU to 1400 – I like nice, round numbers – and sure enough both access points resumed proper communication with the controller.

What I Learned Today

Sometimes the simple tools are easy to overlook. Using a standard Windows command prompt and ping with the -f flag is a quick and easy way to diagnose MTU and fragmentation issues across a VPN tunnel.
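The trial and error with different packet sizes could also be automated with a binary search. The sketch below is one way to do it, under some assumptions: the search function takes any probe callable that returns True when a payload of that size gets through, and the windows_ping_probe helper assumes the English-language Windows ping output shown earlier. Both names are mine, and the ping wrapper has not been tested against every Windows version.

```python
import subprocess


def largest_unfragmented_payload(probe, low=0, high=1472):
    """Return the largest size in [low, high] for which probe(size) is True.

    Assumes that if a size works, every smaller size works too.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def windows_ping_probe(host):
    """Build a probe that sends one DF-set ping (Windows ping flags)."""
    def probe(size):
        result = subprocess.run(
            ["ping", "-n", "1", "-f", "-l", str(size), host],
            capture_output=True,
            text=True,
        )
        # Assumption: a successful reply line contains "bytes=<size>".
        return f"bytes={size}" in result.stdout
    return probe
```

Calling largest_unfragmented_payload(windows_ping_probe("172.16.68.1")) would be expected to settle on the same value found by hand, in around a dozen pings rather than a long run of guesses.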

* It appears from the support documentation for this particular wireless vendor that the MTU size should be 1450 by default which should take into account at least some overhead and explains why these access points were working fine until now. The firmware update seems to have changed this to 1500.

The Problem With “Free”.

It’s rare to have a day go by during which I don’t hear or read about some product that a vendor is now ‘giving away’ or moving to a ‘freemium’ model. In some of the more contentious verticals in the IT industry this seems to be a key tactic for winning new customers and providing value-add for existing ones.

I’m not in marketing or sales, so I can only assume here that the premise behind these gratuitous offerings is to have new, potential customers try the product, fall in love with it, and want to then add more of that company’s products to their infrastructure. There is also a tiny voice in my head that suggests perhaps these organizations might also want their ‘free’ product to become so critical to your operation that, should they decide to charge a fee or licensing for said product at some point in the future, you’d be forced to pay because it has become something you simply couldn’t live without.

Ultimately the short or long-term goal of offering these products doesn’t really matter. What matters is there is a very big problem with these free products:

They’re free.

They don’t generate revenue, at least directly, for the vendor providing them. This means they are, in all aspects, simply a cost center…a money sink. An expense that perhaps proves the old saying that “you have to spend money to make money”. But the real issue here for you or me as a potential user or implementer of these products is that it is very difficult to get any support.

Hello, Bonjour

This particular blog post is centered around one such product that everybody seems to be racing to give away. If you, like me, work in an environment that is moving to support the BYOD craze and have anything other than one large, flat network, then Apple’s Bonjour is probably driving you nuts and causing you to sprout gray hair, if you have any left.

Because this particular protocol and all of its relatives (mDNS, Zeroconf) can’t communicate across layer 3 boundaries (they use link-local multicast, which routers don’t forward), when someone on your BYOD wifi wants to talk to the Apple TV on your corporate wifi, you need something to broker that connection. Enter the Bonjour Gateway (BG).
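One way to see why this traffic never leaves its own subnet is to look at where it is sent. mDNS uses the multicast groups 224.0.0.251 (IPv4) and ff02::fb (IPv6), both of which are scoped to the local link. The short sketch below checks that with Python's ipaddress module; the stays_on_link function is a name I made up, and it only looks at the destination address, not at anything a particular router might be configured to do.

```python
import ipaddress

MDNS_GROUP_V4 = ipaddress.ip_address("224.0.0.251")
MDNS_GROUP_V6 = ipaddress.ip_address("ff02::fb")

# 224.0.0.0/24 is the IPv4 Local Network Control Block; routers do not
# forward traffic sent to these groups.
LOCAL_NETWORK_CONTROL = ipaddress.ip_network("224.0.0.0/24")


def stays_on_link(addr):
    """True if the destination is a link-scoped multicast group."""
    if addr.version == 4:
        return addr in LOCAL_NETWORK_CONTROL
    # IPv6 multicast: the low nibble of the second byte is the scope,
    # and scope 2 means link-local.
    return addr.is_multicast and (addr.packed[1] & 0x0F) == 0x02
```

Both mDNS groups pass this check, while an ordinary routable multicast group such as 239.1.1.1 does not, which is why something has to sit on both subnets and relay the service announcements.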

Aerohive was first to announce and make available their BG product in early 2012. It is built into their HiveOS on any Aerohive access point, or as a virtual machine that will run on VMware. It’s free up to 2 instances of the virtual appliance. I don’t know what the cost might be for anyone wishing to use more than 2, but I would imagine this is an opportunity to sell actual Aerohive hardware to a potential customer.

Cisco has included it as part of their Wireless LAN Controller (WLC) software beginning with version 7.4. This isn’t free, per se, but is obviously a valuable addition for any existing customer.

In January 2013, Ruckus announced their SmartWay™ technology as “beyond bonjour bridging”, and said it would be available in Q2. Again, this is only free in the sense that existing customers would not have to pay for the software upgrade to their existing controllers.

A quick Google search of some other vendor offerings shows that pretty much everyone in the wireless space is offering support for Bonjour in some way.

I may be wrong about this but it seems to me that providing a solution for this issue in enterprise networks is/was a priority for each of these vendors. Why then has my experience with getting one of these platforms working been such a disaster?

Aerohive

If you don’t already follow Andrew von Nagy on Twitter (@revolutionwifi), you should. He is a true wifi evangelist and an excellent resource for keeping up-to-date on all things 802.11. His twitter feed was very active with the announcement of the release of Aerohive’s BG.

Working in a K-12 education environment we had already identified this as a need. Staff and students wanted to take advantage of AirPrint and AirPlay and we had to find a solution. I quickly signed up for my free Aerohive BG and HiveManager account.  Installation was easy as it comes in the form of an OVA. It’s pretty much ‘drop it into VMware’ and you are ready to go.

I had some problems with devices being able to see the AirPrint and AirPlay services across subnets. After some tinkering I decided to email Aerohive at the provided “free_bonjour_support@aerohive.com” address with my issue. That email must have ended up in the bit bucket because I received no reply.  I sent out a tweet about a week later asking @Aerohive how long one could expect to wait for support for the BG.  That too was met with silence. Two weeks later I was rather frustrated and sent out another tweet, this one a little more vitriolic:

“Going nowhere fast with Aerohive’s free bonjour gateway. Anyone have alternative suggestions? (That work)”

Now it should be noted that I’m in Canada and this tweet was sent out on November 22nd, 2012 – US Thanksgiving.

Andrew von Nagy responded via twitter and helped me out with some troubleshooting. I have to throw out a big thanks to him for taking the time on a holiday to offer some support.

On that same day, I received a reply to my original email (unsure if Andrew had anything to do with this) and began working with the online support to get the BG working.

A short 10 weeks later, I had resolved the issue (on my own) and closed the support request with Aerohive.  From the original email on November 5th to resolution on January 10th….granted there are a few holidays in there…but that’s a long time to get an issue with an initial configuration resolved.

Ruckus

Just around the same time (January 2013) I managed to get that first BG working, we received word from our current wireless vendor, Ruckus, that they too were working on a BG solution. This was direct from David Callisch, VP of Marketing for Ruckus Wireless. He even offered to let us beta test the new firmware. This is great news! Being able to implement this solution on infrastructure we already own and manage should be quick and easy, right?

It’s mid May, and we still haven’t received the beta firmware.

Also, Ruckus recently pulled their latest 9.6 firmware off their support site, so I have a feeling 9.7 and SmartWay™ are going to miss their targeted Q2 release.

“Ruckus Wireless has decided to remove the 9.6.0.0.264 release for ZoneDirector while we investigate an issue that was discovered after the release.”

Aerohive Revisited

In April I received an email from Aerohive that outlined some major bug fixes and enhancements to their free BG.  While I had been able to get it working with AirPlay somewhat in my previous attempt we had never been able to get AirPrint to work properly. I hoped that this news would mean we could get both pieces to function properly.

Having deleted the VM for the original installation of Aerohive’s BG, I attempted to reinstall it, only to be told that my serial # had already been activated and that I could not reactivate it. Ok, easy fix, right? I fired off an email to “free_bonjour_support@aerohive.com” and explained my situation and asked if I could have a new key or the original key re-enabled.

That email went out April 19th, and I have yet to get any sort of reply.

Free Should Not Mean “free from support”

If these value-added features, or in some cases, fully ‘free’ products are meant to drive potential customers to become paying customers and/or if these products are meant to keep existing customers as loyal, long-term customers with an existing vendor, then I would expect support be as agile and attentive as it would be for any other product or offering from these same vendors.

I shouldn’t be left waiting for an email that never comes, and I certainly shouldn’t have to resort to social media shaming to get action from a vendor. Sadly it seems to be the most effective method of getting things moving, but it should be a last resort, not the primary method of seeking resolution.

Perhaps I’m expecting too much from a free product or feature, and I may be misinterpreting the purpose of these add-ons as marketing/sales tools. I might be naive in believing that any truly ‘free’ product is going to become a key part of my infrastructure and solve a major technical hurdle for my users. I can only hope there is actually some sort of benevolent, beneficial reason for vendors to offer these solutions, and hope that they are able to provide some better support in the future.

Otherwise, there are truly free and open products like Avahi that are able to quickly and easily deploy mDNS service discovery options across subnets. If you know a little Linux…

Note: During the writing of this post I had been contacted by our local Aerohive rep who caught wind of a Tweet I sent out yesterday about my BG issue.  He’s managed to get me a new serial # for our BG so I can happily reinstall it and give it another go.  Social media wins again!