I’ve titled this one “Your Bypass Is Showing,” because you just can’t make this stuff up.
As we conduct analysis on major problems, we look for the ordinary first, always with the thought in the back of our minds: "have we seen this one before?"
In most cases it’s something new or different.
Recently, we were called in to look at a serious performance problem. The client’s corporate servers had been moved from the corporate data center to a cloud solution provider. As this move took place, performance suddenly began to degrade, to the point that the servers were effectively unusable.
In an effort to solve the problem, the company’s IT department had selected a new WAN compression vendor. The old vendor’s WAN compression equipment was still in place, but running in “Bypass” mode. I might add that by this time many things had already been examined, such as bandwidth utilization, components in the core infrastructure, and latency and throughput testing. So we began with a fresh approach: take a packet trace to see what the problem might be. When we do this, we quickly learn what isn’t the problem. Doing this early saves the time and trouble of testing everything, so we can quickly focus on what could be the problem. It also helps clear the congestion in everyone’s thought process.
Throughput Tests and Reviewing Trace Files
We then ran several throughput tests. These began to reveal several recurring conditions: when a file of one gigabyte or larger was written to servers in Colorado Springs, TCP errors would begin to show up, always towards the end of the file. The impact was so severe that the file transfer slowed to the point that it looked as though it had stopped, though it really hadn’t. Shortly after, SMB (the file-sharing protocol) would time out.
Reviewing the trace files showed a difference between the pattern of packets leaving the server and what was being seen at the new vendor’s WAN acceleration device at Colorado Springs. This indicated that a device within the network path was changing the packets. Further investigation revealed that the old vendor’s WAN compression equipment, operating in bypass mode, was altering the TCP portion of the packets in the traffic.
Specifically, "Selective ACKs" (SACK) were being introduced into the traffic as seen by trace files taken on the outbound side of Colorado Springs data center, but these had not been sent from the server. As the client recovered from processing these SACKs, the write process became very slow with response times of 100s of milliseconds.
The normal use of a SACK occurs when a host receives data but, due to resource constraints, the data is lost on the machine before it is delivered to the receiving application. For example, the operating system on the receiving host may 'confiscate' memory belonging to other processes to prevent a system deadlock or memory-pool exhaustion, and the receive buffer is released before the data is delivered to the application. When this happens, the receiving host transmits a SACK to have the lost data retransmitted. This process of freeing up resources on the receiver, and possibly losing data, is also known as reneging.
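On the wire, a SACK travels as TCP option kind 5 (RFC 2018): a list of (left edge, right edge) sequence-number pairs naming the data the receiver is selectively acknowledging. The following is a minimal sketch, not the tooling used in this case study, of how such blocks can be picked out of a raw TCP options field:

```python
# Minimal sketch: detect a TCP Selective ACK (SACK, option kind 5) in a raw
# TCP options field. Per RFC 2018 the option is kind=5, a length byte, then
# pairs of 32-bit (left edge, right edge) sequence numbers. All names here
# are illustrative, not from the case study's actual tooling.
import struct

def parse_sack_blocks(options: bytes):
    """Return a list of (left_edge, right_edge) SACK blocks, or [] if none."""
    blocks = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:          # End of option list
            break
        if kind == 1:          # No-op padding, single byte
            i += 1
            continue
        length = options[i + 1]
        if kind == 5:          # SACK option
            data = options[i + 2:i + length]
            for j in range(0, len(data), 8):
                left, right = struct.unpack("!II", data[j:j + 8])
                blocks.append((left, right))
        i += length
    return blocks

# Example: two NOPs, then one SACK block covering sequence 1000-2000.
opts = bytes([1, 1, 5, 10]) + struct.pack("!II", 1000, 2000)
print(parse_sack_blocks(opts))  # [(1000, 2000)]
```

In practice a display filter such as `tcp.options.sack` in Wireshark isolates these packets directly, which is how the superfluous SACKs stood out in the trace files.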
While this condition is a normal response, it was not generated by the server, demonstrating that another transparent device within the data path was causing the problem. Further testing revealed that removing the old vendor’s WAN compression device from the network path eliminated the superfluous SACKs.
An examination of further trace files confirmed that with the bypassed device removed, the superfluous SACKs were gone and all data was being delivered without any loss.
Server Message Block (SMB) Performance
SMB requests data in 61,440-byte blocks and must ask for each block of data during a file copy (read or write) operation. The following figures 1:1 – 1:2 show measurements taken from the server and client while the problem was being observed.
The server-side response time is quite high, averaging 263 milliseconds.
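Because SMB serializes one 61,440-byte request at a time, a high per-request response time translates directly into transfer time. A rough back-of-the-envelope sketch (assuming one fully serialized request per block, which ignores any pipelining):

```python
# Rough illustration (not the case study's measurement tooling): with one
# serialized SMB request per 61,440-byte block, the per-request service
# response time dominates the total transfer time.
BLOCK_SIZE = 61_440        # bytes per SMB read/write block
FILE_SIZE = 1 * 1024**3    # a one-gigabyte file
SRT = 0.263                # observed average server response time, seconds

blocks = -(-FILE_SIZE // BLOCK_SIZE)   # ceiling division
total_seconds = blocks * SRT

print(blocks)                # 17477 blocks
print(round(total_seconds))  # 4596 seconds -- well over an hour for one file
```

At a healthy sub-millisecond LAN-style response time the same arithmetic yields seconds rather than hours, which is why the 263 ms average was immediately suspect.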
Figure 1:1 Microsoft SMB Service Response Time: statistics show the Min, Max and Avg Service Response Time (SRT) from the server
Figure 1:2 Microsoft SMB Service Response Time: statistics show the Min, Max and Avg Service Response Time (SRT) from the client
Figure 1:3 SMB file Write Request and Response Times: the following graph contains the time measurements from the SMB write request and the SMB write response.
Figure 2:1 Throughput Graphs: the graph indicates the individual throughput for conversations TCP 4153 (RED) and TCP 4184 (GREEN), conducted while the WAAS was removed from the network. The transfer in TCP 4153 took 139 seconds, and the identical file took 151 seconds in the TCP 4184 conversation.
Established Baseline (Gearbit Lab Tests Network)
At the Gearbit Lab, tests were conducted to provide a throughput baseline. The baseline provides a simulation of achievable throughput without any outside interference.
Figure 3:1 Gearbit Lab Test 1 Poor Throughput: conducted using a WAN emulator, RTT 40 ms, Window Size 16,384. Results yielded 6,400,000 bps (800,000 Bps).
Figure 3:2 Gearbit Lab Test 2 Much Better Throughput: conducted using a WAN emulator, RTT 40 ms, Window Size 64,440. Results yielded 720,000,000 bps (90,000,000 Bps).
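The principle behind the two lab tests is the bandwidth-delay product: a single TCP stream can have at most one receive window in flight per round trip, so throughput is bounded by window / RTT. A hedged sketch of that bound using the tests' stated parameters (illustrative only; the lab's measured figures also reflect window scaling, multiple streams, and emulator behavior, so they will not match this simple single-stream bound exactly):

```python
# Bandwidth-delay-product sketch: a single TCP stream can keep at most one
# receive window in flight per round trip, so throughput <= window / RTT.
# Illustrative only -- real transfers also involve window scaling, loss,
# and slow start, so measured numbers will differ from this bound.
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput, in bits per second."""
    return window_bytes * 8 / rtt_seconds

RTT = 0.040  # 40 ms round-trip time, as in the lab tests

small = max_throughput_bps(16_384, RTT)   # the 16,384-byte window test
large = max_throughput_bps(64_440, RTT)   # the 64,440-byte window test
print(f"{small:,.0f} bps vs {large:,.0f} bps")  # roughly 3.3 Mbps vs 12.9 Mbps
```

Whatever the exact numbers, the direction of the effect is the point: with a fixed RTT, raising the window is the only way to raise the single-stream throughput ceiling, which is what the two tests demonstrate.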
To sum up, testing at the Gearbit lab demonstrated that much higher throughput is achievable with the given equipment if a larger transfer window is set. Furthermore, removal of the old WAN compression device noticeably increased performance by eliminating the reneging SACK issue.
Visit www.gearbit.com for other case studies.