Cisco Catalyst Wifi, Take Two

On November 13th, Cisco announced their next-generation wireless platform with the release of the Catalyst 9800 Series Wireless Controller.

You read that right: the next WLC platform from Cisco runs on Catalyst and extends Cisco’s DNA Center architecture into the wireless space.

The Catalyst 9800 controllers come in a variety of form factors. The option for a standalone hardware controller is still here with the 9800-40 and 9800-80, or the 9800 series can be run as a VM in a private or public cloud. A third option is now to run embedded wireless on the Catalyst 9k series switches.

Embedded wireless controllers on Catalyst switches…that sounds familiar, doesn’t it?

Cisco made a similar move a few years ago with an architecture called Converged Access. This embedded the wireless controller functionality into IOS XE on the 3650 and 3850 access switches. For various reasons, it did not live up to expectations, and Cisco killed it in IOS XE Everest 16.5.1a in late 2017.

Cisco and Aironet

Cisco acquired Aironet Wireless Communications in 1999 for $799M. Since then, Cisco wireless access points have generally been referred to as “Aironet” products by name. This includes the software that runs on the wireless controllers and access points, AireOS.

AireOS came from Cisco’s acquisition of Airespace in 2005. Airespace were the developers of the AP/Controller model and the Lightweight Access Point Protocol (LWAPP), which was the precursor to CAPWAP.

(Credit to Jake Snyder for correcting me on the origins of AireOS)

Whatever AireOS version is running on your wireless controller is the same that you have on your access points. Cisco has developed the platform to be what it is today, and very little of it remains what was once the original AireOS.

With this iteration, or rather re-invention of the Wireless Controller, Cisco have highlighted three key improvements to their predecessor wireless software.

Always-On

Controller redundancy is always critical to prevent downtime in the event of failure. Here, Cisco are touting stateful switchover with an active/standby model in which client state is kept in sync on the standby controller, so that clients see no downtime if the active controller fails.

Patches and minor software updates no longer change the base image of the controller, and can be applied without client downtime. Patches for specific AP models can be delivered as per-AP device packs without affecting the base image or other access point models. These are installed on the controller and then pushed only to the AP model they are intended for.

New AP models can also be joined to the controller without impact to the overall base image with the AP device packs, allowing new hardware to join an existing environment without a major upgrade.

Citing “no disruption” base image/version upgrades, Cisco says the new 9800 controllers can be updated independently of the access points, whereas previously the software versions running on the controller and access points were coupled. Upgrades were applied to the controller and then pushed to the access points. More often than not, this interrupted clients on the affected access points. Some rebooting of the controller and APs was inevitable, and quite often a few orphaned access points never quite upgraded properly or failed to rejoin the controller.

Cisco have made many improvements to the upgrade process over the years, including staged firmware upgrades. However, in large wireless deployments, firmware upgrades would not generally be considered zero-downtime.

With the new controller architecture using an RF-based intelligent rolling upgrade process, Cisco has aimed at eliminating some of these issues. During the upgrade process, the standby or secondary controller is first upgraded to the new image. You can then specify the percentage of access points you would like upgraded at once (5%-25%), and the controller determines which APs should be upgraded using the AP neighbor information and the number of clients on each AP. APs with no clients are upgraded first. Clients on access points that are about to be upgraded are steered toward neighboring access points in order to prevent interruption in service.

The idea of steering clients to other access points, or to 5 GHz radios instead of 2.4 GHz radios, isn’t new. Because I’m not a wireless expert I won’t comment on exactly how it’s done, but it is my understanding that it is difficult to guarantee that the client will “listen” to the steering mechanism. I feel that even with this intelligent RF behind the upgrade process, some clients will inevitably lose connectivity during an upgrade.

Once the access point is upgraded, it then joins the already-upgraded controller, and resumes servicing clients.

After all access points are joined to the upgraded controller, the primary controller begins its upgrade process.
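To make the batching idea concrete, here is a minimal sketch of how upgrade groups could be chosen from client counts and neighbor information. It is not Cisco’s algorithm, which isn’t described in detail here; the `AccessPoint` class, the `plan_batches` function, and the rule that two RF neighbors never share a batch are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPoint:
    name: str
    clients: int
    neighbors: set = field(default_factory=set)  # names of RF neighbors


def plan_batches(aps, percent):
    """Group APs into upgrade batches of at most `percent` of the total.

    Hypothetical heuristic: APs with the fewest clients go first, and two
    RF neighbors are never placed in the same batch, so clients on an
    upgrading AP still have a neighbor to be steered toward.
    """
    if not 5 <= percent <= 25:
        raise ValueError("percent must be between 5 and 25")
    batch_size = max(1, len(aps) * percent // 100)
    # Idle APs sort to the front; the name is only a tie-breaker.
    remaining = sorted(aps, key=lambda ap: (ap.clients, ap.name))
    batches = []
    while remaining:
        batch, deferred = [], []
        for ap in remaining:
            conflict = any(
                ap.name in other.neighbors or other.name in ap.neighbors
                for other in batch
            )
            if len(batch) < batch_size and not conflict:
                batch.append(ap)
            else:
                deferred.append(ap)
        batches.append([ap.name for ap in batch])
        remaining = deferred
    return batches
```

With eight APs in a row and `percent=25`, each batch holds two APs that are not adjacent to each other, and the idle ones land in the first batches.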

Secure

Encrypted Traffic Analytics was first announced as part of the Catalyst 9K switch launch, and uses advanced analytics and Cisco Stealthwatch to detect malware in encrypted flows, without the need for SSL decryption. ETA is now available for wireless traffic on the 9800 platform, if deployed in a centralized model, meaning all wireless traffic is tunneled back to the controller.

This is a great feature considering the only other option for gaining visibility into encrypted traffic is usually some form of sketchy certificate man-in-the-middle voodoo. In many situations this works okay for corporate domain-joined machines as here you can control the certificate trusts, but if you provide wireless to any BYOD devices or to the general public in any way, this often results in people not using your wireless because of certificate issues.

Deploy Anywhere

Cisco is offering a lot of flexibility in deployment options for this new wireless controller.

Branch offices can look at the embedded software controller on Catalyst 9K switches, which supports up to 200 APs and 4K clients.

Edit: Since the original publication of this post, I’ve clarified that the option to run the 9800 controller on a Catalyst 9K switch is only available as an SD-Access Fabric Mode deployment option. SD-Access requires DNA Center. This is an expensive proposition for what could truly have been a fantastic option for small/medium branch office deployments.

Private or public cloud options are available on KVM, VMware, Cisco ENCS, and will be available on AWS. These options support 1000, 3000, and up to 6000 APs, and 10K, 32K, and 64K clients. The AWS public cloud option only supports FlexConnect deployment models, which makes sense as tunneling all client traffic back to your controller in this case would get expensive quickly.

Physical appliance options include the 9800-40 at 2000 APs, 32K clients and 40Gbps (4x10Gbps interfaces), as well as the 9800-80 at 6000 APs, 64K clients, and 80Gbps (8x10Gbps interfaces). The 9800-80 also has a modular option which allows for GigE, 10GigE, 40GigE, and 100GigE uplinks.

Each of these options has identical setup, configuration, management, and features.
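As a quick way to compare the numbers above, here is a small sketch that filters the deployment options by AP and client count. The limits are the ones quoted in this post, with “K” read as 1,000. The small/medium/large labels for the cloud tiers, and the pairing of 1000/3000/6000 APs with 10K/32K/64K clients, are my own assumptions, so check Cisco’s data sheets before sizing anything real.

```python
# (option, max APs, max clients) as quoted in this post.
# The cloud tier labels and AP/client pairings are assumptions.
OPTIONS = [
    ("Embedded on Catalyst 9K (SD-Access fabric only)", 200, 4_000),
    ("Private/public cloud, small", 1_000, 10_000),
    ("Private/public cloud, medium", 3_000, 32_000),
    ("Private/public cloud, large", 6_000, 64_000),
    ("9800-40 appliance", 2_000, 32_000),
    ("9800-80 appliance", 6_000, 64_000),
]


def options_for(aps, clients):
    """Return the names of the options whose limits cover the given counts."""
    return [
        name
        for name, max_aps, max_clients in OPTIONS
        if aps <= max_aps and clients <= max_clients
    ]
```

For example, `options_for(2500, 20000)` leaves only the medium and large cloud tiers and the 9800-80.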

Lessons Learned?

Overall, the presentation of this new wireless platform seems solid. Cisco have acknowledged the problems with Converged Access, and seem to have checked off all of the missing boxes from that first attempt. Feature parity was a big one, and Cisco insists here that all features will be the same up to the existing controller software version 8.8 (the current version is 8.5 at the time of this post), so that would give Cisco and their customers quite a bit of time to flesh out the new architecture.

Now, AireOS isn’t going to disappear suddenly. Cisco have said that they are going to continue to develop and support the existing line of controllers and AireOS software until they can be sure that this new architecture has been successfully adopted by their customers. Customers who previously bought into Converged Access may not be lining up to be the first to try out the new platform, but the popularity of the Catalyst 9K switches should provide a good foundation for the embedded controller to gain a foothold.

You can check out Cisco’s presentation at Networking Field Day 19 here:

 

One Way Audio on Cisco 7925G Wireless Phones

Knock Knock.

Who’s there?

Voice over Wireless LAN.

Voice over Wireless LAN who?

….

….

Hello?

….

Hello?

Working with a multitude of different technologies is great. I love it, for the most part. That being said, sometimes it can be really frustrating as well. I am neither an expert in voice nor wireless technologies, but I am oftentimes the primary ‘go-to’ person for both of these subjects at work. Now I like working with voice; it’s fun and presents its own interesting challenges sometimes, but for the size of our VoIP deployment, it pretty much just works. Wireless, while still fun to play around with, tends to be my nemesis, as I just haven’t had enough time to really delve into its deeper mysteries. Now, on that rare occasion when the problem is related to both voice AND wireless, things start to get really interesting.

I recently deployed some Cisco 7925G Wireless IP Phones to a number of our sites’ custodians as a replacement for cellular phones. They need to be mobile around the facility in order to troubleshoot issues in places that don’t have a hard line, but don’t require a full-blown cell phone.

Now some caveats: we don’t have sufficient AP coverage for a full-blown VoWLAN deployment, and during testing with the 7925G I did notice some interruption in the call stream when roaming from AP to AP. We also no longer have Cisco as our wireless vendor, so I thought there might be some interoperability issues, but felt that 802.11 was, after all, a standard, right? What could possibly go wrong?

First Reports

The first rumblings of a problem came from some of the custodians saying they had ‘intermittent’ audio. I assumed (somewhat incorrectly) that this meant they were trying to wander around the building or even outside, treating the phone as a cell-phone, and losing sufficient signal from a nearby AP to maintain the call.

I explained to anyone with issues that these were not in fact cellular phones and they needed to stay within reasonable range of an AP to keep their call going. We would add capacity to the wireless as needed in the future, but for now it was the best we could do.

Sent Back

Next I received one of the phones, and its charger, in inter-office mail with a sticky note saying simply: “doesn’t work”. I tested the phone with a few different numbers and it seemed fine. I sent it back to the person who mailed it with a note: “works fine”.

As it turns out, I was wrong.

Definitely Broken

I next heard from another analyst who said all calls from the phone at one site were completely dropping. No audio at all. We tested and found that audio coming from the 7925s was fine, but they were having problems receiving audio. The initial call setup seemed fine and there were a few seconds of clear two-way audio, but almost immediately the received audio failed.

One-way audio – the bane of any voice engineer’s existence. The fact that these were wireless phones as well made troubleshooting the issue even more complicated.

I had initially thought this might be a QoS issue but the wired phones at the site were fine. Wireshark confirmed QoS wasn’t an issue but I could clearly see in the captures that the RTP to the handsets stopped shortly after calls began, resulting in one-way audio.
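For anyone who wants to check a capture for the same pattern without scrolling through it by eye, here is a rough sketch. It assumes the RTP packets have already been exported from the capture as `(timestamp_in_seconds, direction)` pairs, where direction is either `"to_phone"` or `"from_phone"`. The function name, the labels, and the one-second threshold are all made up for this example.

```python
def find_silent_direction(packets, gap=1.0):
    """Return (direction, last_timestamp) for an RTP stream that stopped
    more than `gap` seconds before the other one, or None if both look fine.

    `packets` is an iterable of (timestamp_seconds, direction) pairs, with
    direction being "to_phone" or "from_phone" (hypothetical labels).
    """
    last_seen = {"to_phone": None, "from_phone": None}
    for timestamp, direction in packets:
        if last_seen[direction] is None or timestamp > last_seen[direction]:
            last_seen[direction] = timestamp

    seen = [t for t in last_seen.values() if t is not None]
    if not seen:
        return None  # no RTP at all, nothing to compare
    call_end = max(seen)

    for direction, last in last_seen.items():
        if last is None:
            return (direction, None)  # this stream never started
        if call_end - last > gap:
            return (direction, last)  # this stream stopped early
    return None
```

A ten-second call where the stream toward the handset stops after three seconds would come back as `("to_phone", <time of the last packet>)`.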

Viewing the Call Statistics on the phone also confirmed there was definitely some sort of problem. Jitter was extremely high, the receiver lost packet count was large, and the MOS was around 2.

[Image: 7925G-Before]

Settings

I began playing around with the WLAN settings on the 7925G handsets, trying to find what might be causing the issue. Some suggestions from folks on Twitter pointed at forcing the phones to use 2.4 GHz only, while others insisted they would work fine on 5 GHz. Hard setting the frequency didn’t appear to resolve anything, so I continued the ever popular troubleshooting technique of randomly turning options on and off.

I came across the setting labeled “Call Power Save Mode” which was set by default to “U-APSD/PS-Poll” and also presented the option “None”.

Now, I had no idea what this option did, but I set it to “None” and performed a test call. Lo and behold, the issue appeared to go away. Two-way audio persisted through the entire call, and call statistics on the handset were dramatically improved. Jitter was down to 2/22, there were only 2 dropped packets, and MOS was up to 4.5.

[Image: 7925G-After]

U-APSD/PS-Poll

So what exactly does this option do? U-APSD, or Unscheduled Automatic Power Save Delivery, is a mechanism that allows frames to be queued on a wireless access point in order to save power on a wireless client. When there is no data for the client to receive, it can go back into standby mode, allowing it to save power and battery life.

From Cisco’s Voice over Wireless LAN Design Guide:

The primary benefit of U-APSD is that it allows the voice client to synchronize the transmission and reception of voice frames with the AP, thereby allowing the client to go into power-save mode between the transmission/reception of each voice frame tuple. The WLAN client frame transmission in the access categories supporting U-APSD triggers the AP to send any data frames queued for that WLAN client in that AC. A U-APSD client remains listening to the AP until it receives a frame from the AP with an end-of-service period (EOSP) bit set. This tells the client that it can now go back into its power-save mode. This triggering mechanism is considered a more efficient use of client power than the regular listening for beacons method, at a period controlled by the delivery traffic indication map (DTIM) interval, because the latency and jitter requirements of voice are such that a WVoIP client would either not be in power-save mode during a call, resulting in reduced talk times, or would use a short DTIM interval, resulting in reduced standby times. The use of U-APSD allows the use of long DTIM intervals to maximize standby time without sacrificing call quality. The U-APSD feature can be applied individually across access categories, allowing U-APSD can be applied to the voice ACs in the AP, but the other ACs still use the standard power save feature.
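The trigger-and-deliver exchange in that excerpt can be sketched as a toy model. This is a simplification for illustration only: the class and function names are made up, there is a single client and a single access category, and a real service period is also bounded by a negotiated maximum length that is ignored here.

```python
from collections import deque


class AccessPointQueue:
    """Toy model of an AP buffering downlink voice frames for one dozing client."""

    def __init__(self):
        self.buffered = deque()

    def enqueue_downlink(self, payload):
        # The client is asleep, so the AP holds the frame instead of sending it.
        self.buffered.append(payload)

    def on_trigger_frame(self):
        """The client sent an uplink frame: deliver everything queued.

        The last delivered frame carries the end-of-service-period (EOSP) bit.
        """
        delivered = []
        while self.buffered:
            payload = self.buffered.popleft()
            delivered.append({"payload": payload, "eosp": not self.buffered})
        if not delivered:
            # Nothing was queued; modeled here as an empty frame with EOSP set.
            delivered.append({"payload": None, "eosp": True})
        return delivered


def client_service_period(ap):
    """Send a trigger, stay awake until EOSP, then go back to sleep."""
    awake = True
    received = []
    for frame in ap.on_trigger_frame():
        if frame["payload"] is not None:
            received.append(frame["payload"])
        if frame["eosp"]:
            awake = False  # EOSP seen: safe to return to power-save mode
            break
    return received, awake
```

The point of the model is that the client only hears downlink audio if its trigger frames and the AP’s delivery and EOSP handling line up; anything left in `buffered` is audio the handset never plays.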

Best Intentions

So why did turning this feature off resolve the one-way audio problem? It seems this is a technology that should help rather than hinder a wireless VoIP call. In this case it appears to do nothing but cause problems.

I can only speculate here because my understanding of this particular mechanism is limited, but I would suspect that even though U-APSD is standardized as part of IEEE 802.11e, the implementations may be somewhat disparate across vendors. In this case Cisco makes the phone and the wireless network is Ruckus. I suspect if I were using Cisco wireless gear, this wouldn’t be an issue. That’s not to blame Ruckus for the problem, of course; it just seems to be one of those minor differences in how vendors implement certain technologies.

This brings up an entirely different topic of discussion, but if this is the case, can anything be done to hold vendors accountable for the little tweaks and changes to technologies that are supposed to be standards designed to improve, not prevent, interoperability?