Justin Ethier

Troubleshooting a ZigBee PAN ID Conflict

ZigBee is a network protocol that enables home and building automation using low-power wireless controllers.

Last year I spent time troubleshooting a series of building networks that tore themselves apart shortly after being commissioned. In each case the network coordinator raised a PAN ID conflict error and re-formed the network with a new PAN ID, leaving devices stranded or split across two separate networks.

The Typical Cause of PAN ID Conflicts

This article explains how corrupt packets can cause a conflict. The suggested workaround on the SiLabs EmberZNet stack is to set a threshold to ensure the coordinator only takes corrective action if more than 63 conflicts are reported in a minute.

This threshold was in place on our building network, making packet corruption an extremely unlikely source of the problem.

Packet Captures to the Rescue

After taking several captures with the Simplicity Studio Network Analyzer, a colleague found an inconsistency in the ZigBee Beacon frames sent by various devices.

Consider the following frames from two different devices:

ZigBee Beacon [15 bytes]
   - Protocol Id: ZigBee Pro (0x00)
   -    .... 0010 = Stack Profile: ZigBee Pro (2)
   -    0010 .... = Network Protocol Version: 0x02
   -    .... .1.. = Router Capacity: true
   -    .000 1... = Depth: 0x01
   -    1... .... = End Device Capacity: true
   - Extended PAN ID: 0123456789ABCDEF
   - Tx Offset: 0xFFFFFF
   - NWK Update ID: 0x01
ZigBee Beacon [15 bytes]
   - Protocol Id: ZigBee Pro (0x00)
   -    .... 0010 = Stack Profile: ZigBee Pro (2)
   -    0010 .... = Network Protocol Version: 0x02
   -    .... .1.. = Router Capacity: true
   -    .111 1... = Depth: 0x0F
   -    1... .... = End Device Capacity: true
   - Extended PAN ID: FEDCBA9876543210
   - Tx Offset: 0xFFFFFF
   - NWK Update ID: 0x00

Based on these captures it became clear that the root cause was an extended PAN ID (EPID) that was reversed on some of the devices.
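As a rough illustration, the fields shown in the two captures can be decoded from a raw 15-byte beacon payload with a few bit masks. The Python sketch below is not taken from the original tooling. The names `BeaconPayload` and `parse_beacon_payload` are hypothetical, the byte layout is inferred from the bit masks in the captures, the least-significant-byte-first ordering of the EPID and Tx offset is an assumption, and the sample bytes are reconstructed from the decoded fields above rather than copied from a real capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconPayload:
    protocol_id: int
    stack_profile: int
    protocol_version: int
    router_capacity: bool
    depth: int
    end_device_capacity: bool
    extended_pan_id: str  # hex string, most significant byte first
    tx_offset: int
    update_id: int


def parse_beacon_payload(payload: bytes) -> BeaconPayload:
    """Decode a 15-byte ZigBee beacon payload into named fields."""
    if len(payload) != 15:
        raise ValueError(f"expected 15 bytes, got {len(payload)}")
    profile_and_version = payload[1]
    capability = payload[2]
    return BeaconPayload(
        protocol_id=payload[0],
        stack_profile=profile_and_version & 0x0F,    # .... xxxx
        protocol_version=profile_and_version >> 4,   # xxxx ....
        router_capacity=bool(capability & 0x04),     # .... .x..
        depth=(capability >> 3) & 0x0F,              # .xxx x...
        end_device_capacity=bool(capability & 0x80), # x... ....
        # Assumption: multi-byte fields arrive least significant byte first.
        extended_pan_id=payload[3:11][::-1].hex().upper(),
        tx_offset=int.from_bytes(payload[11:14], "little"),
        update_id=payload[14],
    )


# Sample payloads rebuilt from the decoded captures (illustrative only).
first = parse_beacon_payload(
    bytes.fromhex("00228C" "EFCDAB8967452301" "FFFFFF" "01")
)
second = parse_beacon_payload(
    bytes.fromhex("0022FC" "1032547698BADCFE" "FFFFFF" "00")
)

print(first.extended_pan_id, second.extended_pan_id)
# prints: 0123456789ABCDEF FEDCBA9876543210
```

Comparing the `extended_pan_id` of every beacon seen on a given PAN ID is a quick way to spot devices that disagree about the EPID.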

The ZigBee 3.0 Specification explains why this would be a problem:

3.6.1.13.1 Detecting a PAN Id Conflict
Any device that is operational on a network and receives an MLME-BEACON-NOTIFY.indication in which the PAN identifier of the beacon frame matches its own PAN identifier but the EPID value contained in the beacon payload is either not present or not equal to nwkExtendedPANID, shall be considered to have detected a PAN Identifier conflict.

This is exactly what was happening on the network! Everything continued to operate normally, but the mismatched EPID values caused conflicts to be reported at a rate higher than our threshold value.
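The detection rule quoted from the specification reduces to a small predicate. The following Python sketch only illustrates that rule and is not code from any ZigBee stack. The function name `detects_pan_id_conflict` and the PAN ID `0x1A2B` are hypothetical, since the captures above do not show the 16-bit PAN ID.

```python
from typing import Optional


def detects_pan_id_conflict(
    own_pan_id: int,
    own_epid: int,
    beacon_pan_id: int,
    beacon_epid: Optional[int],
) -> bool:
    """Return True when a received beacon counts as a PAN ID conflict.

    A conflict requires the same 16-bit PAN ID together with an EPID
    that is either missing (None) or different from our own.
    """
    if beacon_pan_id != own_pan_id:
        return False  # a different PAN entirely, nothing to report
    return beacon_epid is None or beacon_epid != own_epid


PAN_ID = 0x1A2B  # hypothetical value for illustration
GOOD_EPID = 0x0123456789ABCDEF
REVERSED_EPID = 0xFEDCBA9876543210

# Same PAN ID, same EPID: a normal beacon from our own network.
print(detects_pan_id_conflict(PAN_ID, GOOD_EPID, PAN_ID, GOOD_EPID))
# Same PAN ID, reversed EPID: reported as a conflict.
print(detects_pan_id_conflict(PAN_ID, GOOD_EPID, PAN_ID, REVERSED_EPID))
```

Under this rule, a device with a reversed EPID looks like a foreign network that happens to share the PAN ID, even though it is part of the same installation.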

Once the threshold for these errors is exceeded (63 in a minute) the coordinator will select a new PAN ID and broadcast a network update to ask nodes to move to the new PAN. This is problematic because it often has the effect of splitting the network.
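To make the threshold behavior concrete, here is a minimal Python sketch of a counter that only trips when more than 63 conflict reports arrive within a minute. It is an assumption-laden model, not the EmberZNet implementation: the class name `ConflictThreshold` is hypothetical, and the stack may well count over fixed intervals rather than the sliding window used here.

```python
from collections import deque


class ConflictThreshold:
    """Track conflict reports and flag when too many arrive in a window."""

    def __init__(self, limit: int = 63, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def report(self, now: float) -> bool:
        """Record one report at time `now` (seconds).

        Returns True when the number of reports inside the window
        exceeds the limit, meaning corrective action would be taken.
        """
        self._timestamps.append(now)
        # Discard reports that have aged out of the window.
        while now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.limit


# A burst of 64 reports within 32 seconds trips on the 64th report.
burst = ConflictThreshold()
burst_results = [burst.report(i * 0.5) for i in range(64)]

# One report per second never accumulates more than 60 in the window.
steady = ConflictThreshold()
steady_results = [steady.report(float(i)) for i in range(300)]
```

The steady case mirrors a small test network: conflicts are reported continuously, yet the coordinator never acts because the rate stays under the limit.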

These conflicts would also only be raised while devices on the network are sending beacon requests, such as when a device is searching for a network, because beacons are sent in reply to those requests. Typically, once a network is commissioned, the EPID is left out of packets to save space, since ZigBee is designed for low-power, low-data-rate applications:

Other than the scanning and joining processes, the EPID rarely appears in transmitted ZigBee packets due to its large overhead (8 bytes) in the header.

Conclusion

Fortunately for our team, we were able to quickly identify the software bug that was the source of the reversed EPID and patch the system.

But this represents an important consideration for ZigBee networks. Sometimes issues only become apparent when deploying a large-scale distributed network, such as during a customer deployment. For example, the commissioning procedure that led us to the conflict was tested in-house. However, despite having dozens of devices on the test system, the number of conflicts being raised was not large enough to exceed the threshold of 63 per minute. To an end user the network continued to operate as if nothing was wrong.

Perhaps more exhaustive testing would have caught this before it hit the field. But that is often the case! With any large distributed system there will be some level of unintended behavior that, despite our best efforts, only manifests itself in a real-world application. We need to be prepared to deal with these unexpected issues when they do arise, and take the proper precautions to catch as many as possible before customers and production systems are affected.
