F5 TMM Crash

We have a pair of F5 Viprions connected to Cisco Nexus 7K switches (Aggr A & B), as shown here:

[Figure: network diagram of the F5 Viprion pair connected to the Nexus 7K aggregation switches]

TMM Crash:

TMM crashed on one of the F5 Viprions because the following conditions were met:

  1. Your BIG-IP system is processing a large amount of active connections.
  2. You attempt to display the connection table using the tmsh show sys connection command.
  3. You then attempt to cancel the tmsh show sys connection command by using the Ctrl+C key sequence while the command is still in the process of displaying the connection table.

SOL15246

When the Viprion is handling hundreds of thousands of connections and “show sys connection” is executed and then cancelled with “Ctrl+C” before the output completes, TMM crashes. This affects both multi-blade systems such as the Viprion and single units.

BugID: 595773

On the Viprions, in addition to the TMM crash, the “Ctrl+C” is not propagated to all the blades in the multi-blade chassis. This has been identified as BugID: 595773. It is fixed in 11.5.6 code and may also be fixed retroactively in 11.5.4 + HF2 (unconfirmed).

BugID: 579284

Under certain conditions, memory within mcpd can be corrupted. This memory corruption has been identified as BugID: 579284. BugID: 595773, described above, triggers BugID: 579284, resulting in memory corruption within mcpd.

The memory corruption was serious enough to cause loss of inter-blade connectivity. Each blade then acted as a stand-alone system, which caused packets to loop within the network.

This bug will probably be fixed in 12.x code.

Logs from the Viprion:

 May  3 16:20:15 slot1/LB1-domain.com err tmsh[29166]: 01420006:3: operation canceled
 May  3 16:20:31 slot3/LB1-domain.com crit tmm6[17982]: 01010020:2: MCP Connection aborted, exiting
 May  3 16:20:31 slot4/LB1-domain.com info bcm56xxd[9563]: 012c0012:6: Reprogram vDAG cmp state to 0xb for vtrunk default (previous state 0xf)
 May  3 16:20:31 slot3/LB1-domain.com info bcm56xxd[9919]: 012c0012:6: Reprogram vDAG cmp state to 0xb for vtrunk default (previous state 0xf)
 May  3 16:20:31 slot1/LB1-domain.com info bcm56xxd[8234]: 012c0012:6: Reprogram vDAG cmp state to 0xb for vtrunk default (previous state 0xf)
 ...
 May  3 16:20:31 slot4/LB1-domain.com info bcm56xxd[9563]: 012c0012:6: Reprogram vDAG cmp state to 0x2 for vtrunk default (previous state 0xa)
 May  3 16:20:31 slot1/LB1-domain.com info bcm56xxd[8234]: 012c0012:6: Reprogram vDAG cmp state to 0x2 for vtrunk default (previous state 0xa)
 May  3 16:20:31 slot4/LB1-domain.com info bcm56xxd[9563]: 012c0012:6: FFP HDAG installed for default (cmp state 0x2)
 May  3 16:20:31 slot1/LB1-domain.com info bcm56xxd[8234]: 012c0012:6: FFP HDAG installed for default (cmp state 0x2)
 
 ... and the blade logs a restart.
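The sequence in these logs (a tmsh cancel on one blade, followed seconds later by a TMM on another blade losing its MCP connection) can be searched for mechanically. The following is a minimal Python sketch, assuming the log lines keep the layout shown in the excerpt above. The names parse_line and find_cancel_then_abort, the 60-second window, and the year argument are illustrative choices of mine, not anything provided by F5.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the log excerpt in this post:
#   <Mon> <day> <HH:MM:SS> <slotN>/<host> <severity> <proc>[<pid>]: <code>:<level>: <message>
LINE_RE = re.compile(
    r"^\s*(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<slot>slot\d+)/(?P<host>\S+)\s+(?P<severity>\w+)\s+"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<code>[0-9a-fA-F]{8}):\d+:\s+(?P<msg>.*)$"
)


def parse_line(line, year=2016):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # The log omits the year, so the caller has to supply it.
    stamp = " ".join(rec["ts"].split())  # collapse the padding in "May  3"
    rec["time"] = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return rec


def find_cancel_then_abort(lines, window=timedelta(seconds=60)):
    """Pair each tmsh 'operation canceled' with TMM aborts that follow within the window."""
    records = [r for r in map(parse_line, lines) if r is not None]
    cancels = [r for r in records
               if r["proc"] == "tmsh" and "operation canceled" in r["msg"]]
    aborts = [r for r in records
              if r["proc"].startswith("tmm") and "MCP Connection aborted" in r["msg"]]
    return [(c, a) for c in cancels for a in aborts
            if timedelta(0) <= a["time"] - c["time"] <= window]


SAMPLE = [
    " May  3 16:20:15 slot1/LB1-domain.com err tmsh[29166]: 01420006:3: operation canceled",
    " May  3 16:20:31 slot3/LB1-domain.com crit tmm6[17982]: 01010020:2: MCP Connection aborted, exiting",
    " ...",
]

for cancel, abort in find_cancel_then_abort(SAMPLE):
    delay = int((abort["time"] - cancel["time"]).total_seconds())
    print(f'{cancel["slot"]} {cancel["proc"]} cancel -> '
          f'{abort["slot"]} {abort["proc"]} abort after {delay}s')
# prints: slot1 tmsh cancel -> slot3 tmm6 abort after 16s
```

A match only shows that the two messages occurred close together. It does not prove that the cancel caused the abort, so any hit still needs to be checked against the surrounding log lines.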

The following logs were seen on the Cisco Nexus 7K that was connected to the Viprion:

2016 May  3 16:20:26 switch-1 %FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac 4111.3111.abc1 among ports Po66 and Po11 on vlan 100 - Disabling dynamic learning notifications for a period between 120 and 240 seconds on vlan 100
2016 May  3 16:20:33 switch-1 %FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac 4111.3111.a6c1 among ports Po11 and Po66 on vlan 200 - Disabling dynamic learning notifications for a period between 120 and 240 seconds on vlan 200
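To see which port-channels and VLANs were involved in the loop, the STM_LOOP_DETECT messages can be reduced to a small table. This is a rough Python sketch that assumes the message wording is exactly as in the two lines above. The names LOOP_RE, parse_loop_events and vlans_by_port_pair are hypothetical helpers, not part of any Cisco tooling.

```python
import re
from collections import defaultdict

# Pattern built from the two %FWM-2-STM_LOOP_DETECT lines in this post.
LOOP_RE = re.compile(
    r"%FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac "
    r"(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}) among ports "
    r"(?P<port_a>\S+) and (?P<port_b>\S+) on vlan (?P<vlan>\d+)"
)


def parse_loop_events(lines):
    """Extract the MAC, the port pair and the VLAN from each loop-detect message."""
    events = []
    for line in lines:
        m = LOOP_RE.search(line)
        if m:
            events.append({
                "mac": m.group("mac"),
                # Sort the pair so "Po66 and Po11" equals "Po11 and Po66".
                "ports": tuple(sorted((m.group("port_a"), m.group("port_b")))),
                "vlan": int(m.group("vlan")),
            })
    return events


def vlans_by_port_pair(events):
    """Group the affected VLANs by the pair of ports the MAC moved between."""
    grouped = defaultdict(list)
    for ev in events:
        grouped[ev["ports"]].append(ev["vlan"])
    return dict(grouped)


SAMPLE = [
    "2016 May  3 16:20:26 switch-1 %FWM-2-STM_LOOP_DETECT: Loops detected in the network "
    "for mac 4111.3111.abc1 among ports Po66 and Po11 on vlan 100 - Disabling dynamic "
    "learning notifications for a period between 120 and 240 seconds on vlan 100",
    "2016 May  3 16:20:33 switch-1 %FWM-2-STM_LOOP_DETECT: Loops detected in the network "
    "for mac 4111.3111.a6c1 among ports Po11 and Po66 on vlan 200 - Disabling dynamic "
    "learning notifications for a period between 120 and 240 seconds on vlan 200",
]

for ports, vlans in vlans_by_port_pair(parse_loop_events(SAMPLE)).items():
    print("+".join(ports), "vlans", vlans)
# prints: Po11+Po66 vlans [100, 200]
```

Both messages reduce to the same port pair, which is consistent with the MAC addresses flapping between the two port-channels across more than one VLAN.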

Summary of the two conditions that we hit:

  1. TMM crashed because “Ctrl+C” was used to break the “show sys connection” command.
  2. “Ctrl+C” did not propagate to all the blades, causing memory corruption in mcpd and loss of inter-blade connectivity, so the multi-blade Viprion formed a closed loop.

 

 
