Troubleshooting MTU size over IPSEC VPN

I recently deployed a couple of wireless access points to two sites that connect to our main office over IPSEC VPN. After a recent firmware update to the wireless controller, both access points got stuck in a provisioning loop and had difficulty communicating with the controller. Both APs repeatedly disconnected with a “heartbeats lost” error.

Connectivity between the main office and the remote sites appeared fine. Both access points were reachable via ping and SSH. I set up a packet debug on both sites’ firewalls and saw traffic going back and forth between the access points and the controller, and both access points appeared in the controller status window, alternating between “Provisioning” and “Disconnected”.

Needless to say I was slightly baffled.

I opened a ticket with the wireless vendor and (very quickly) received an answer. The MTU for CAPWAP traffic between the access points and the controller is hard-set by the controller to 1500*. With these sites connected via IPSEC, that was going to cause fragmentation because of the overhead IPSEC adds to traffic going between sites.

I needed to lower the MTU size on the controller, but to what value? IPSEC doesn’t seem to have a ‘fixed’ header size due to the different encryption options that can be used. So how do I find out exactly how much our particular IPSEC configuration is adding?

ping -f

From a Windows command prompt, the -f flag sets the Don’t Fragment (DF) bit on the ICMP packets ping sends, so a packet too large for the path is rejected instead of being fragmented. Combined with the -l flag, which sets the size of the ICMP payload, this lets you probe exactly how large a packet can make it across the path.

So, assuming a standard Ethernet MTU of 1500 bytes, and accounting for the 8-byte ICMP header and 20-byte IP header, I should be able to send an ICMP payload of 1472 bytes, but 1473 should be too large:

C:\Users\netcanuck>ping 172.16.32.1 -f -l 1472

Pinging 172.16.32.1 with 1472 bytes of data:
Reply from 172.16.32.1: bytes=1472 time=3ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=4ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=4ms TTL=251
Reply from 172.16.32.1: bytes=1472 time=3ms TTL=251

C:\Users\netcanuck>ping 172.16.32.1 -f -l 1473

Pinging 172.16.32.1 with 1473 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
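The arithmetic behind that 1472-byte figure can be sketched out as a quick sanity check (the header sizes are the standard IPv4/ICMP values, nothing vendor-specific):

```python
# Largest ICMP payload (the ping -l value) that fits in one
# unfragmented packet at a given MTU.
IP_HEADER = 20    # bytes: IPv4 header without options
ICMP_HEADER = 8   # bytes: ICMP echo header

def max_icmp_payload(mtu):
    """Largest ping -l value that fits in a single packet at this MTU."""
    return mtu - IP_HEADER - ICMP_HEADER

print(max_icmp_payload(1500))  # 1472
```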

Excellent! So now to test across our IPSEC tunnel:

C:\Users\netcanuck>ping 172.16.68.1 -f -l 1472

Pinging 172.16.68.1 with 1472 bytes of data:
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.

Now this makes sense: the 1472-byte calculation doesn’t account for the IPSEC overhead.

After some testing with different packet sizes I hit on the magic number: 1384 bytes. At 1385 the packets were again rejected as being too large. So some quick math:

ICMP payload: 1384 bytes
ICMP header: 8 bytes
IP header: 20 bytes
Subtotal: 1412 bytes

This leaves 88 bytes of IPSEC overhead. I should be able to set the MTU on the controller to 1412 and the access points should resume functioning normally.
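Working backwards from the measured payload, the same header numbers give the overhead figure directly:

```python
# Derive the IPSEC overhead from the largest payload that survived
# a DF-set ping across the tunnel.
IP_HEADER = 20          # bytes: IPv4 header
ICMP_HEADER = 8         # bytes: ICMP echo header
PATH_MTU = 1500         # bytes: the underlying Ethernet MTU

measured_payload = 1384                                      # found by trial and error
effective_mtu = measured_payload + ICMP_HEADER + IP_HEADER   # the MTU to configure
ipsec_overhead = PATH_MTU - effective_mtu                    # what the tunnel eats

print(effective_mtu, ipsec_overhead)  # 1412 88
```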

I did in fact set the MTU to 1400 – I like nice, round numbers – and sure enough both access points resumed proper communication with the controller.

What I Learned Today

Sometimes the simple tools are easy to overlook. A standard Windows command prompt and ping with the -f flag is a quick and easy way to diagnose MTU and fragmentation issues across a VPN tunnel.
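Hand-testing different sizes works, but the search can also be automated. Here’s a hypothetical sketch (the ping_fits probe shelling out to Windows ping is my own assumption, not something from the troubleshooting above) that binary-searches for the largest payload that passes with DF set:

```python
import subprocess

def ping_fits(host, size):
    """Probe whether a DF-set ping of this payload size gets through.

    Assumes Windows ping flags (-n count, -f DF bit, -l payload size);
    a reply line containing 'TTL=' is taken to mean the size fit.
    """
    result = subprocess.run(
        ["ping", host, "-n", "1", "-f", "-l", str(size)],
        capture_output=True, text=True,
    )
    return "TTL=" in result.stdout

def max_payload(host, lo=0, hi=1472, fits=ping_fits):
    """Binary search for the largest payload that fits without fragmentation."""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the search makes progress
        if fits(host, mid):
            lo = mid              # mid fits: the answer is mid or larger
        else:
            hi = mid - 1          # mid is too big: the answer is below mid
    return lo
```

Something like max_payload('172.16.68.1') would have landed on 1384 in about eleven probes instead of trial and error.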

* It appears from this wireless vendor’s support documentation that the MTU should default to 1450, which would account for at least some overhead and explains why these access points were working fine until now. The firmware update seems to have changed this to 1500.