
Is Your Bypass Showing?

May 11, 2016

I’ve titled this one “Your Bypass Is Showing,” because you just can’t make this stuff up.

As we conduct analysis on major problems, we look for the ordinary first, always with the thought in the back of our minds: "have we seen this one before?"

In most cases it’s something new or different.

 

The Problem

Recently, we were called in to look at a serious performance problem. The client’s corporate servers had been moved from the corporate data center to a cloud solution provider. As this move took place, performance suddenly began to degrade, to the point that every server became unusable.

In an effort to solve the problem, the company IT department had selected a new WAN compression vendor. The old vendor’s WAN compression equipment was still in place, but running in “Bypass” mode. I might add that by this time lots of things had been looked into, such as bandwidth utilization, components in the core infrastructure, and latency and throughput testing. So we began with a fresh approach: take a packet trace to see what might be the problem. When we do this, we always learn quickly what isn’t the problem. Doing it early saves the time and trouble of testing everything, so we can quickly focus on what could be the problem. It also helps clear the congestion in everyone’s thought process.

Throughput Tests and Reviewing Trace Files

We then began running several throughput tests. These revealed several recurring conditions. When a large file of one gigabyte or larger was written to servers in Colorado Springs, TCP errors would begin to show up, always towards the end of the file. The impact was so bad that the file transfer slowed to the point where it looked like it had stopped, although it really hadn’t. Shortly after, SMB (the file transfer protocol) would time out.

Reviewing the trace files showed a difference between the pattern of packets leaving the server and the pattern seen at the new vendor’s WAN compression device at Colorado Springs. This indicated that a device within the network path was changing the packets. Further investigation revealed that the old vendor’s WAN compression equipment, which was operating in bypass mode, was affecting the TCP portion of packets in the traffic.

Specifically, "Selective ACKs" (SACKs) were being introduced into the traffic, as seen in trace files taken on the outbound side of the Colorado Springs data center, but these had not been sent from the server. As the client recovered from processing these SACKs, the write process became very slow, with response times of hundreds of milliseconds.

In normal use, a receiver sends a SACK to tell the sender which non-contiguous blocks of data have arrived, so that only the missing segments are retransmitted. A related case occurs when a host receives data but, due to resource constraints, loses it before it is delivered to the receiving application. For example, the operating system on the receiving host may 'confiscate' memory belonging to other processes to prevent a system deadlock or memory pool exhaustion, and the receive buffer is eliminated before the data is delivered to the application. The sender must then retransmit the lost data, even if it was previously reported in a SACK. This process of the receiver freeing up resources and discarding data it has already selectively acknowledged is known as reneging.
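To check a trace for SACKs like the ones described here, it helps to pull the SACK blocks out of the TCP options field of each packet. The following is a minimal sketch, assuming the raw options bytes have already been extracted from a capture; the function name `parse_sack_blocks` and the sample bytes are made up for illustration. It relies only on the standard option layout: kind 5 for SACK, followed by pairs of 32-bit left and right edges.

```python
import struct

SACK_PERMITTED = 4  # TCP option kind: SACK-Permitted (sent on SYN)
SACK = 5            # TCP option kind: SACK blocks


def parse_sack_blocks(options: bytes) -> list[tuple[int, int]]:
    """Return (left_edge, right_edge) pairs from a raw TCP options field."""
    blocks = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:        # End of Option List
            break
        if kind == 1:        # No-Operation padding, one byte long
            i += 1
            continue
        if i + 1 >= len(options):
            break            # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break            # malformed length, stop parsing
        if kind == SACK:
            body = options[i + 2:i + length]
            # Each block is two 32-bit big-endian sequence numbers.
            for off in range(0, len(body) - 7, 8):
                blocks.append(struct.unpack_from("!II", body, off))
        i += length
    return blocks


# Hypothetical options field: NOP, NOP, SACK with one block 1000..2460.
sample = bytes([1, 1, 5, 10]) + struct.pack("!II", 1000, 2460)
print(parse_sack_blocks(sample))
```

Running a parser like this over the same conversation captured at two points in the path (for example, at the server and at the far-side data center) would show which SACK blocks appear at only one of them.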

While this condition is viewed as a normal response, it was not presented by the server, which indicated that another transparent device within the data path was causing the problem. Further testing revealed that removing the old vendor’s WAN compression device from the network path eliminated the superfluous SACKs.

An examination of further trace files revealed that with the Bypass removed, the superfluous SACKs were gone, and all data was being delivered without any loss.

Server Message Block (SMB) Performance

SMB requests data in blocks of 61,440 bytes and is required to ask for each block of data during the file copy (read or write) operation. Figures 1:1 and 1:2 show measurements taken from the server and client while the problem was being observed.

The server-side response time is quite high, averaging 263 milliseconds.
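To see why a response time of this size matters, here is a rough back-of-the-envelope sketch. It assumes a worst case in which every 61,440-byte block waits for the previous response (real SMB clients may keep several requests outstanding), and it takes "one gigabyte" as 10**9 bytes. The function name is hypothetical.

```python
import math

BLOCK_BYTES = 61_440          # SMB block size stated in the text
AVG_RESPONSE_S = 0.263        # average server response time stated in the text
FILE_BYTES = 1_000_000_000    # assumption: "one gigabyte" taken as 10**9 bytes


def serial_transfer_seconds(file_bytes: int, block_bytes: int, response_s: float) -> float:
    """Time to move a file when each block waits for the previous response."""
    requests = math.ceil(file_bytes / block_bytes)
    return requests * response_s


requests = math.ceil(FILE_BYTES / BLOCK_BYTES)
seconds = serial_transfer_seconds(FILE_BYTES, BLOCK_BYTES, AVG_RESPONSE_S)
print(f"{requests} requests, about {seconds / 60:.0f} minutes")
```

Under these assumptions a one-gigabyte copy needs more than sixteen thousand request/response exchanges, so even a modest per-request delay adds up to a transfer that looks stalled.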

Figure 1:1 Microsoft SMB Service Response Time: statistics show the Min, Max and Avg Service Response Time (SRT) from the server

 

Figure 1:1 Gearbit SMB Service Response Time from Server

Figure 1:2 Microsoft SMB Service Response Time: statistics show the Min, Max and Avg Service Response Time (SRT) from the client

Figure 1:2 SMB Response Time from Client

Figure 1:3 SMB File Write Request and Response Times: the following graph contains the time measurements from the SMB write request and the SMB write response.

Figure 1:3 Write and Response Time Measurements

Throughput Graphs

Figure 2:1 Throughput Graphs: The graph shows the individual throughput for conversations TCP 4153 (RED) and TCP 4184 (GREEN), which were captured while the WAAS was removed from the network. Conversation TCP 4153 took 139 seconds, and the identical file took 151 seconds in conversation TCP 4184.

 

 

Figure 2:1 Throughput Graphs

Established Baseline (Gearbit Lab Tests Network)

At the Gearbit Lab, tests were conducted to provide a throughput baseline. The baseline provides a simulation of achievable throughput without any outside interference.

Figure 3:1 Gearbit Lab Test 1 Poor Throughput: conducted using WAN emulator, RTT 40 ms, Window Size 16,384. Results yielded 6,400,000 bps (800,000 Bps).


Figure 3:2 Gearbit Lab Test 2 Much Better Throughput: conducted using WAN emulator, RTT 40 ms, Window Size 64,440. Results yielded 720,000,000 bps (90,000,000 Bps).

Figure 3:2: Gearbit Lab Test 2 - Improved Throughput

To sum up, testing at the Gearbit Lab demonstrated that a much higher throughput is achievable with the given equipment if a larger transfer window is set. Furthermore, replacing the old WAN compression device noticeably increased performance and eliminated the reneging SACK issue.
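The link between window size and throughput can be sketched with the usual window-over-RTT bound. This is a simplified model, assuming a single TCP connection with exactly one window in flight per round trip and no window scaling. The lab results above came from an emulator whose full settings are not given here, so this sketch illustrates the trend rather than reproducing those figures. The function name is hypothetical.

```python
def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput with one window in flight per round trip."""
    return window_bytes * 8 / rtt_s


RTT_S = 0.040  # 40 ms round-trip time from the lab tests

for window in (16_384, 64_440):   # window sizes from lab tests 1 and 2
    print(f"window {window:>6} bytes -> {window_limited_bps(window, RTT_S):,.0f} bps")
```

In this model the bound grows in direct proportion to the window, which is why a small window on a 40 ms path caps throughput regardless of how much bandwidth the link has.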

 

Visit www.gearbit.com for other case studies.


Heartbeat Packets Inside the Bypass TAP

If the inline security tool goes off-line, the TAP will bypass the tool and automatically keep the link flowing. The Bypass TAP does this by sending heartbeat packets to the inline security tool. As long as the inline security tool is on-line, the heartbeat packets will be returned to the TAP, and the link traffic will continue to flow through the inline security tool.

If the heartbeat packets are not returned to the TAP (indicating that the inline security tool has gone off-line), the TAP will automatically 'bypass' the inline security tool and keep the link traffic flowing. The TAP also removes the heartbeat packets before sending the network traffic back onto the critical link.

While the TAP is in bypass mode, it continues to send heartbeat packets out to the inline security tool so that once the tool is back on-line, it will begin returning the heartbeat packets back to the TAP indicating that the tool is ready to go back to work. The TAP will then direct the network traffic back through the inline security tool along with the heartbeat packets placing the tool back inline.
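The heartbeat behavior described above can be modeled as a small state machine. This is only a toy sketch, not how any particular TAP is implemented: the class name `BypassTap`, the method name, and the threshold of three missed heartbeats are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BypassTap:
    """Toy model of a bypass TAP that watches heartbeats from an inline tool."""
    missed_limit: int = 3      # assumption: heartbeats missed before bypassing
    missed: int = 0
    bypassed: bool = False

    def on_heartbeat_interval(self, heartbeat_returned: bool) -> str:
        """Update state once per heartbeat interval and return the traffic path."""
        if heartbeat_returned:
            # Tool answered: route (or resume routing) traffic through it.
            self.missed = 0
            self.bypassed = False
        else:
            self.missed += 1
            if self.missed >= self.missed_limit:
                # Tool looks off-line: keep the link up by skipping it.
                self.bypassed = True
        return "bypass" if self.bypassed else "inline"


tap = BypassTap()
# True = heartbeat came back, False = it did not (hypothetical sequence).
history = [True, False, False, False, False, True]
print([tap.on_heartbeat_interval(h) for h in history])
```

The key point the model captures is that heartbeats keep being sent while in bypass, so a single returned heartbeat is enough to put the tool back inline.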

Some of you may have noticed a flaw in the logic behind this solution!  You say, “What if the TAP should fail because it is also in-line? Then the link will also fail!” The TAP would now be considered a point of failure. That is a good catch – but in our blog on Bypass vs. Failsafe, I explained that if a TAP were to fail or lose power, it must provide failsafe protection to the link it is attached to. So our network TAP will go into Failsafe mode keeping the link flowing.

Glossary

  1. Single point of failure: a part of an IT network whose failure brings down a larger part of the system, or the entire system.

  2. Heartbeat packet: a soft detection technology that monitors the health of inline appliances.

  3. Critical link: the connection between two or more network devices or appliances whose failure disrupts the network.
