Ansible – Installation

If you have root access to your box, you can use the following link to install Ansible. I would recommend Ansible 2.1 or later if your goal is to use Ansible as a network automation tool.
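
For example, with root access, something like the following (a minimal sketch; package availability and names vary by distribution and release):

# Via the system package manager on Debian/Ubuntu
sudo apt-get install ansible

# Or via pip, pinning a specific version
sudo pip install ansible==2.1.0.0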

Creating a Virtual Environment with Ansible:

If you don’t have root access to the bastion host that is used to access the network infrastructure, you can run Ansible in a virtual environment.

pip install --upgrade pip virtualenv virtualenvwrapper
virtualenv ansible2.1
source ansible2.1/bin/activate
pip install ansible==2.1.0.0

Whenever required, the virtual environment can be reactivated:

root@ansible:~$ source ansible2.1/bin/activate

(ansible2.1)root@ansible:~$ ansible --version
ansible 2.1.0.0
  config file = 
  configured module search path = Default w/o overrides

F5 – Bleeding Active Connections

Scenario:

A virtual server is load balancing connections to a pool with two pool members. During the maintenance window, one of the two pool members is disabled and maintenance is completed on it, followed by the other pool member.

However, because the users make continuous API calls every 5 seconds, the existing TCP connections never bleed out. Even after waiting for 24 hours, connections still exist on the disabled pool member.

Solution:

By default, the F5 makes a load balancing decision when the first HTTP request within a TCP connection is received. Subsequent HTTP requests within that TCP connection are sent to the same pool member as the very first request.

By enabling a OneConnect profile with a /32 source mask (255.255.255.255), we were able to force the F5 to make a load balancing decision for every HTTP request instead of only the first.

The OneConnect profile, used along with the Disabled or Forced Offline setting, will move connections from the disabled pool member to an active pool member.
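
A minimal tmsh sketch of this setup (the profile and virtual server names are placeholders):

# Create a OneConnect profile with a /32 source mask so a load balancing
# decision can be made for every HTTP request
tmsh create ltm profile one-connect oneconnect_32 source-mask 255.255.255.255

# Attach the profile to the virtual server
tmsh modify ltm virtual vs_api profiles add { oneconnect_32 }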

Sub-Domain Delegation – GTM/DNS

Let’s say that you have domain.com hosted with a 3rd party DNS provider and you would like to set up GTM (BIG-IP DNS) load balancing by utilizing sub-domain delegation.

In this scenario, there are two GTMs, one in each DC (DC-1 & DC-2). The basic setup has been completed and the GTMs are in a common sync group.

Create A-Records for the 2 GTM using their Listener IP addresses:

 gtm1.wip.domain.com. IN A 100.100.100.100
 gtm2.wip.domain.com. IN A 200.200.200.200

gtm1 and gtm2 exist in DC-1 and DC-2 respectively, and 100.100.100.100 & 200.200.200.200 are the listener IP addresses configured on gtm1 and gtm2.

Delegate the sub-domain to the GTM using NS Records:

 wip.domain.com. IN NS gtm1.wip.domain.com.
 wip.domain.com. IN NS gtm2.wip.domain.com.

Use CNAME records:

www.domain.com. IN CNAME www.wip.domain.com.

The above DNS records (A, NS & CNAME) will be added at the 3rd party DNS provider that hosts domain.com. Any request for www.domain.com will be sent to the 3rd party DNS provider, which will resolve it to www.wip.domain.com because of the CNAME; that name, in turn, will be handled by the GTMs because of the NS & A records.
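
The delegation can be verified with dig (using the example addresses above):

# The NS lookup for the sub-domain should return the two GTM names
dig NS wip.domain.com +short

# Querying a GTM listener directly should return the wide IP answer
dig @100.100.100.100 www.wip.domain.com

# The full chain: CNAME from the 3rd party provider, then A record from the GTMs
dig www.domain.com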

SOL277 – Sub-domain delegation.

OneConnect & HTTP Requests

This is a lightly edited copy of a Q&A from DevCentral; it is quite descriptive and gets the point across.

Current Setup:

We are using the Cookie Insert method for session persistence. The LTM adds a “BigipServer*” cookie to the HTTP response header, with the encoded IP address and port of the pool member as the value. Subsequent requests from the client (in our case, a browser) carry this cookie in the request header, which lets the LTM send the request to the same server. The cookie’s expiry is set to session, so it is cleared when the browser is closed or when we expire it using an iRule.

Use Case:

We have a set of servers configured as pool members serving traffic to logged-in users. At release time, we deploy the code to a new set of servers and add those servers to the LTM pool as well. The LTM now has servers with both the old code and the new code. We disable all servers that have the old code, so that the LTM routes to them only the requests that already carry a “BigipServer*” cookie pointing at those servers. This avoids interrupting users who are already logged in and doing work. All new requests (new users) are load balanced to one of the active servers running the new code. We ask the already logged-in users to log out and log back in once they are done with their current work. We have an iRule configured to expire the LTM cookie during logout, so our expectation is that users will be connected to the new servers when they log in again.

Problem:

Even though the iRule expires the LTM cookie during logout and the cookie is not present in the request header at login, users are still routed to the same disabled server when they log in again. Ideally, the LTM should have load balanced the request to one of the active servers.

Root Cause:

Upon analyzing the network traffic further, we found that whenever the browser still has a persistent TCP connection open with the LTM after logout, it uses that existing TCP connection to send the login request. The LTM routes this login request to the same disabled server that handled the previous request, even though the LTM cookie is not present in the request header. If we close the TCP connection manually after logout (using CurrPorts or some other tool), the browser establishes a new connection with the LTM during login, and the LTM load balances the request to an active server. One option for us is to send “Connection: close” in the response header during logout, but the browser may hold multiple persistent TCP connections (I have seen a browser holding even three connections), so closing a single TCP connection will not help. The other option is to close the browser, but we don’t have that choice for reasons I cannot explain here (trust me).

SOLUTION:

Try using the following:

  1. OneConnect Profile in VS with netmask of /32.
  2. Action on Service Down in the Pool set to Reselect.

(1) will force a load balancing decision to be made for every HTTP request, instead of the default behavior of making the decision only for the first HTTP request within a TCP connection.

(2) will force the HTTP request to be sent to a new pool member when the selected member is down, since the load balancing decision is now made for every HTTP request rather than only the first one within a persistent/keep-alive connection.
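
A minimal tmsh sketch of both settings (the virtual server, profile, and pool names are placeholders):

# (1) OneConnect profile with a /32 source mask, attached to the virtual server
tmsh create ltm profile one-connect oneconnect_32 source-mask 255.255.255.255
tmsh modify ltm virtual vs_app profiles add { oneconnect_32 }

# (2) Reselect a new pool member when the chosen member is down
tmsh modify ltm pool pool_app service-down-action reselect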

A keep-alive connection (also referred to as a persistent connection) is the HTTP/1.1 feature that allows multiple HTTP requests to be sent over a single TCP connection.
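
For example, curl reuses a single TCP connection when given multiple URLs in one invocation (the hostname is a placeholder); the verbose output shows “Re-using existing connection” for the second request:

curl -v http://www.domain.com/ http://www.domain.com/ > /dev/null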

F5 Code Upgrade Steps

This is a rough template of F5 code upgrade steps that may help with your maintenance work.

  1. Before performing any F5 code upgrade, make sure that the “Service Check Date” on the device is AFTER the License Check Date for the new code version, as listed in SOL7727.
  2. Upload the new code to the boot location (partition) that you prefer on the F5.
  3. cpcfg to the new code version location – Example: cpcfg HD1.2

    Although “cpcfg HD1.x” has worked most of the time, I would recommend backing up the .UCS file in a remote location and also saving a copy in “/shared/tmp/<UCS File>”. After saving the UCS file in the “/shared/tmp/” location, you can utilize “load /sys ucs <path/to/UCS> no-license” to load the configuration, as noted in SOL12880.

  4. Reboot. This will take about 5-10 minutes for hotfix updates and about 15-20 minutes when migrating between major code versions.
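
A minimal CLI sketch of the backup and cpcfg steps (the paths, file names, and boot location are placeholders):

# Safety net: save a UCS archive and copy it off-box as well
tmsh save /sys ucs /shared/tmp/pre_upgrade.ucs

# Copy the running configuration to the target boot location
cpcfg HD1.2

# If the configuration must be reloaded manually, per SOL12880:
tmsh load /sys ucs /shared/tmp/pre_upgrade.ucs no-license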

The recommended maintenance window is about one hour. This could change depending on any application-level testing that you would like to incorporate within your maintenance window.

Reference:

F5 Code Upgrade – 10.x to 11.x

Viprion Chassis – Adding New Blade

Normally, when you add a new blade, the current master blade will sync its configuration onto the new blade. Make sure that the existing blade is the master. Back up all relevant configuration both on the device and off the device before adding the new blade.

Make sure that the blades are the same model; see SOL16992.

Look at SOL13965 to identify the master blade.

Considerations when moving a blade between chassis: SOL10541271

F5 – Automating CLI Execution

Purpose:
This is a really simple way to automate CLI command execution on multiple F5 devices using Bash and Expect (Tcl) scripting. The scripts have been tested on Linux and Mac machines.

How to use it:
A bash script (F5_Bash_v1) collects the username/password for F5 access, a text file (F5_Host.txt) stores the management IP addresses of the F5 devices, and an Expect script (F5_Out_v1.exp) executes the CLI commands on the F5 devices.

The bash script is the master script: it obtains the username/password and runs the Expect script against each F5 device.

Setup:
On a linux machine that is utilized to connect to the F5 device:

# Create a working directory and move into it
mkdir F5_Check
cd F5_Check

Within the “F5_Check” directory, create the following 3 files:
F5_Host.txt
F5_Bash_v1
F5_Out_v1.exp

File Content: F5_Host.txt contains the management IP addresses of the F5 devices.
Example:

$ cat F5_Host.txt
10.12.12.200
10.12.12.201
10.12.12.202
10.12.12.203

File Content: F5_Bash_v1

#!/bin/bash
# Collect the username and password for F5 access
echo -n "Enter the username "
read -s -e user
echo -ne '\n'
echo -n "Enter the password "
read -s -e password
echo -ne '\n'

# Feed the expect script a device list & the collected username & passwords
for device in `cat ~/F5_Check/F5_Host.txt`; do
    ./F5_Out_v1.exp "$device" "$password" "$user"
done

File Content: F5_Out_v1.exp

#!/usr/bin/expect -f

# Set variables
set hostname [lindex $argv 0]
set password [lindex $argv 1]
set username [lindex $argv 2]

# Log results
log_file -a ~/F5_Check/F5LOG.log

# Announce which device we are working on and the time
send_user "\n"
send_user ">>>>>  Working on $hostname @ [exec date] <<<<<\n"
send_user "\n"

# SSH access to device
spawn ssh $username@$hostname

expect {
"no)? " {
send "yes\n"
expect "*assword: "
sleep 1
send "$password\r"
}
"*assword: " {
sleep 1
send "$password\r"
}
}

expect "(tmos)#"
send "sys\n"
expect "(tmos.sys)#"

send "show software\n"
expect "#"
send "exit\n"
expect "#"
send "quit\n"

expect ":~\$"
exit
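
To run it, make both scripts executable and launch the bash wrapper (assuming expect is installed on the machine):

cd ~/F5_Check
chmod +x F5_Bash_v1 F5_Out_v1.exp
./F5_Bash_v1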

F5 TMM Crash

We have a pair of F5 Viprions connected to Cisco Nexus 7K aggregation switches (Aggr A & B), as shown here:

[Network diagram: F5 Viprion pair connected to the Nexus 7K aggregation switches A & B]

TMM Crash:

The TMM crashed on one of the F5 Viprions when the following conditions were met:

  1. Your BIG-IP system is processing a large amount of active connections.
  2. You attempt to display the connection table using the tmsh show sys connection command.
  3. You then attempt to cancel the tmsh show sys connection command by using the Ctrl+C key sequence while the command is still in the process of displaying the connection table.

SOL15246

When the Viprion is handling hundreds of thousands of connections and “show sys connection” is executed and subsequently cancelled with Ctrl+C before the connections are displayed, the TMM will crash. This affects both multi-blade systems like the Viprion and single units.
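
If the connection table must be inspected on a busy box, filtering the output is safer than dumping the whole table; a sketch, assuming the filter options are available on your version (the addresses are placeholders):

# Show only connections to one virtual server address instead of the full table
tmsh show sys connection cs-server-addr 10.1.1.100

# Or filter on a specific client address
tmsh show sys connection cs-client-addr 10.2.2.50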

BugID: 595773

On the Viprions, apart from the TMM crash, the Ctrl+C is not propagated to all the blades in the multi-blade chassis. This has been identified as BugID 595773. It has been fixed in the 11.5.6 code, and it may be retroactively fixed in 11.5.4 + HF2 (not confirmed).

BugID: 579284

Under certain conditions, memory within mcpd can be corrupted; this has been identified as BugID 579284. The previously mentioned BugID 595773 can trigger BugID 579284, resulting in memory corruption within mcpd.

The memory corruption was serious enough to cause a loss of inter-blade connectivity; each blade then acted as a stand-alone system, which caused packets to loop within the network.

This bug will probably be fixed in 12.x code.

Logs from the Viprion:

 May  3 16:20:15 slot1/LB1-domain.com err tmsh[29166]: 01420006:3: operation canceled
 May  3 16:20:31 slot3/LB1-domain.com crit tmm6[17982]: 01010020:2: MCP Connection aborted, exiting
 May  3 16:20:31 slot4/LB1-domain.com info bcm56xxd[9563]: 012c0012:6: Reprogram vDAG cmp state to 0xb for vtrunk default (previous state 0xf)
 May  3 16:20:31 slot3/LB1-domain.com info bcm56xxd[9919]: 012c0012:6: Reprogram vDAG cmp state to 0xb for vtrunk default (previous state 0xf)
 May  3 16:20:31 slot1/LB1-domain.com info bcm56xxd[8234]: 012c0012:6: Reprogram vDAG cmp state to 0xb for vtrunk default (previous state 0xf)
 ...
 May  3 16:20:31 slot4/LB1-domain.com info bcm56xxd[9563]: 012c0012:6: Reprogram vDAG cmp state to 0x2 for vtrunk default (previous state 0xa)
 May  3 16:20:31 slot1/LB1-domain.com info bcm56xxd[8234]: 012c0012:6: Reprogram vDAG cmp state to 0x2 for vtrunk default (previous state 0xa)
 May  3 16:20:31 slot4/LB1-domain.com info bcm56xxd[9563]: 012c0012:6: FFP HDAG installed for default (cmp state 0x2)
 May  3 16:20:31 slot1/LB1-domain.com info bcm56xxd[8234]: 012c0012:6: FFP HDAG installed for default (cmp state 0x2)
 
 ... and the blade logs a restart.

The following logs were identified in the Cisco Nexus 7K that was connected to the Viprion:

2016 May  3 16:20:26 switch-1 %FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac 4111.3111.abc1 among ports Po66 and Po11 on vlan 100 - Disabling dynamic learning notifications for a period between 120 and 240 seconds on vlan 100
2016 May  3 16:20:33 switch-1 %FWM-2-STM_LOOP_DETECT: Loops detected in the network for mac 4111.3111.a6c1 among ports Po11 and Po66 on vlan 200 - Disabling dynamic learning notifications for a period between 120 and 240 seconds on vlan 200

Summary of the 2 conditions that we hit:

  1. TMM crash caused by the Ctrl+C used to break the “show sys connection” command.
  2. The Ctrl+C did not propagate to all the blades, causing memory corruption that resulted in loss of inter-blade connectivity, which made the multi-blade Viprion create a forwarding loop.