Guest blog by Andrew Watters, Raellic CEO/Director
Advances in hardware in recent years have made it possible for anyone with a reasonable level of experience to build a device that captures all traffic on gigabit-level networks without packet loss.
I sell one such device called The Vision™. At today's performance levels, with spinning disks that can each write more than 200 MB/s, it only takes two hard drives in RAID 0 to guarantee full duplex line rate capture to disk on a saturated 1G connection. With three hard drives in RAID 0 topping out at over 600 MB/s, you could potentially tap two full duplex 1G connections and capture both streams to disk at the same time.
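The arithmetic behind those drive counts can be sketched in a few lines. This is a back-of-the-envelope check, not a benchmark: it assumes decimal units (1 Gbit/s is 125 MB/s per direction), a sustained 200 MB/s per drive, and RAID 0 write speed that scales linearly with the number of drives. The function names are illustrative.

```python
import math

def line_rate_mb_per_s(gbit_per_s: float, full_duplex: bool = True) -> float:
    """Worst-case capture rate in MB/s for a saturated link."""
    per_direction = gbit_per_s * 1000 / 8  # 1 Gbit/s = 125 MB/s
    return per_direction * (2 if full_duplex else 1)

def drives_needed(capture_mb_per_s: float, drive_mb_per_s: float = 200.0) -> int:
    """Minimum RAID 0 members, assuming write speed scales linearly (an assumption)."""
    return math.ceil(capture_mb_per_s / drive_mb_per_s)

one_link = line_rate_mb_per_s(1)        # 250.0 MB/s for one full duplex 1G link
two_links = 2 * line_rate_mb_per_s(1)   # 500.0 MB/s for two of them

print(drives_needed(one_link))   # 2
print(drives_needed(two_links))  # 3
```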
With such a low barrier to entry, 1G capture to disk is just not as sexy as it used to be, which raises the question: is there an upper limit on capture to disk performance?
Time Division Packet Steering
Emerging technologies have led me to believe that there is no upper limit, and I propose here the solution I came up with for continuous 100G full duplex line rate capture to disk. Apparently there is no industry-standard term for this strategy, so I suggest calling it "time division packet steering."
The genesis of this project was the Snowden files, which refer repeatedly to the government's ability to tap massive data rate connections such as undersea cables.
I kept ruminating on how I might do it with commercial off the shelf hardware, but I couldn't figure it out until it hit me one day: divide and conquer. Split a high data rate stream into pieces small enough that each machine captures to disk at a rate within its hardware limits. The resulting design is troubling and interesting at the same time because it would enable nation state-level entities to record all of their traffic and replay as much as several days of it leading up to an event of interest. In the U.S., traffic replay would be really useful for investigating cyber incidents and terrorist attacks. In Iran, it would certainly be an instrument of control. I don't claim to have all the answers to such problems.
The Vision Omega™
In any event, I believe there are several ways to capture to disk on a 100G connection without packet loss, some of which would require a lot of further research and development (email me about "wavelength division packet steering" if you like). The most straightforward solution using current, publicly available technology appears to be the following:
- Insert a Garland passive 100G TAP to copy and send 100% of the network traffic from each direction. Each stream goes into an FPGA-based NIC in one of two master machines (one master machine for each direction of traffic);
- Distribute the packets received on the 100G NIC across twenty 10G ports on the same machine on a round robin basis, so that each port receives every 20th packet;
- Send each 10G stream to a cluster unit in a 42U storage cluster;
- Capture the 10G stream on each cluster unit using a modified version of tcpdump;
- Write the stream to a RAID 0 array of three hard drives; and
- Merge the resulting capture files with tcpslice for later analysis.
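The split and merge steps in the list above can be illustrated with a toy model. This is a minimal sketch, not the FPGA logic or the tcpslice implementation: packets are modelled as hypothetical (timestamp, payload) tuples, and Python's `heapq.merge` stands in for the final merge by timestamp.

```python
import heapq

NUM_PORTS = 20  # twenty 10G ports per master machine

def steer_round_robin(packets, num_ports=NUM_PORTS):
    """Hand packets to ports in turn, so port i receives every 20th packet."""
    ports = [[] for _ in range(num_ports)]
    for index, packet in enumerate(packets):
        ports[index % num_ports].append(packet)
    return ports

def merge_captures(ports):
    """Recombine per-port captures into one stream ordered by timestamp.

    Each per-port list is already in arrival order, so a k-way merge suffices.
    """
    return list(heapq.merge(*ports, key=lambda pkt: pkt[0]))

# Hypothetical packets: (timestamp_ns, payload)
packets = [(t * 4, f"pkt{t}") for t in range(100)]
ports = steer_round_robin(packets)

print(len(ports[0]))                      # 5 packets on each of 20 ports
print(merge_captures(ports) == packets)   # True: the original order is recovered
```

The merge only works if every cluster unit stamps packets against a common clock, which is why the timestamp resolution matters in this design.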
Alternatively, I could send all packets received on the 100G NIC during a particular 50 ms time period to one of the 10G ports, again on a round robin basis. Either way, the master machines split each direction of traffic into twenty smaller streams, and each of forty cluster units captures one of those streams at a maximum of 625 MB/s. This rate is near the maximum write performance of three top-of-the-line 8 TB hard drives in RAID 0. With a 42U cluster I would have forty active capture units plus two hot spares, capturing at the theoretical 100G maximum of 25,000 MB/s in full duplex. I would merge the resulting binary capture files using the standard tcpslice utility with high resolution (4 ns) timestamps.
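The time-slice variant and the per-unit rate can be sketched as follows. This is an illustration under stated assumptions, not the actual NIC firmware: timestamps are taken to be integer nanoseconds, units are decimal, and the function names are made up for the example.

```python
WINDOW_NS = 50_000_000  # one 50 ms time slice, in nanoseconds
NUM_PORTS = 20          # twenty 10G ports per direction

def port_for_timestamp(timestamp_ns: int,
                       window_ns: int = WINDOW_NS,
                       num_ports: int = NUM_PORTS) -> int:
    """Return the port that owns the time slice containing this packet."""
    return (timestamp_ns // window_ns) % num_ports

def per_unit_rate_mb_per_s(link_gbit_per_s: float = 100.0,
                           num_ports: int = NUM_PORTS) -> float:
    """Average write rate per cluster unit for one direction of traffic."""
    return link_gbit_per_s * 1000 / 8 / num_ports

print(port_for_timestamp(0))              # 0
print(port_for_timestamp(50_000_000))     # 1
print(port_for_timestamp(1_000_000_000))  # 0 (the cycle repeats every second)
print(per_unit_rate_mb_per_s())           # 625.0
print(40 * per_unit_rate_mb_per_s())      # 25000.0
```

One thing the sketch makes visible: with time slicing, 625 MB/s is an average over the one-second cycle. During its own 50 ms slice a port is handed traffic at the full incoming rate, so the master machine would presumably have to buffer each slice before sending it out over a 10G port.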
The system takes up one and a quarter racks and would cost at least a quarter million dollars to build. If I used two storage clusters, one for each direction of traffic, the amount of traffic available for replay would double. I could add more storage clusters and master machines to scale up to any level of traffic without too many changes.
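A rough way to estimate how much traffic is available for replay is to divide total disk capacity by the capture rate. The sketch below assumes forty active units with three 8 TB drives each, decimal units, no file system or capture file overhead, and both directions saturated the whole time, which is the worst case. On a link that is not saturated, the window would be proportionally longer.

```python
def replay_window_hours(units: int = 40,
                        drives_per_unit: int = 3,
                        drive_tb: float = 8.0,
                        capture_mb_per_s: float = 25_000.0) -> float:
    """Hours of traffic that fit on disk before the oldest data is overwritten."""
    total_mb = units * drives_per_unit * drive_tb * 1_000_000  # decimal TB to MB
    return total_mb / capture_mb_per_s / 3600

print(round(replay_window_hours(), 1))          # 10.7 hours with one cluster
print(round(replay_window_hours(units=80), 1))  # 21.3 hours with two clusters
```

Doubling the number of capture units doubles the window, and the same function can be used to size a deployment for a target replay period.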
I have a feeling this is a smaller scale version of how the U.S. government captures massive data rate connections. According to the Snowden files, they have been doing it for many years, and there doesn't seem to be another way that makes sense:
- Using SSDs doesn't make sense because their service life would be short in such a demanding environment.
- Using RAM disks as buffers doesn't make sense because that would not support continuous line rate capture to disk, only bursts.
What would make sense is using the FPGA NICs to drop uninteresting traffic and capture only interesting traffic to disk. But that would defeat the purpose of 100% capture to disk and would not permit full traffic replay.
The system I propose also enables long-term, indexed storage in a database system of the customer's choice. I suggest iterating over the binary capture files with tshark, converting them to ASCII on the fly, and saving the output in a high performance database.
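As a rough illustration of that indexing step, the sketch below streams tshark's field output into a database. Several things here are assumptions for the sake of a small example: SQLite stands in for whatever high performance database the customer chooses, the four fields and the table layout are arbitrary, and tshark is assumed to be installed and on the PATH when `index_capture` is called.

```python
import sqlite3
import subprocess

# Fields to extract per packet (an arbitrary choice for illustration).
FIELDS = ["frame.time_epoch", "ip.src", "ip.dst", "frame.len"]

def tshark_command(pcap_path):
    """Build a tshark call that prints one comma-separated line per packet."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "separator=,", "-E", "occurrence=f"]
    for field in FIELDS:
        cmd += ["-e", field]
    return cmd

def index_lines(conn, lines):
    """Parse tshark output lines and insert them; returns the rows stored."""
    conn.execute("CREATE TABLE IF NOT EXISTS packets "
                 "(ts REAL, src TEXT, dst TEXT, length INTEGER)")
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split(",")
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        try:
            rows.append((float(parts[0]), parts[1], parts[2], int(parts[3])))
        except ValueError:
            continue  # skip lines with unparseable numbers
    conn.executemany("INSERT INTO packets VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

def index_capture(conn, pcap_path):
    """Run tshark on one capture file and index its output as it streams."""
    proc = subprocess.Popen(tshark_command(pcap_path),
                            stdout=subprocess.PIPE, text=True)
    try:
        return index_lines(conn, proc.stdout)
    finally:
        proc.wait()

# Demonstration with canned lines instead of a real capture file.
conn = sqlite3.connect(":memory:")
stored = index_lines(conn, ["1.5,10.0.0.1,10.0.0.2,60\n", "garbage\n"])
print(stored)  # 1
```

For a real capture file, `index_capture(conn, "unit01.pcap")` would do the same thing with live tshark output; the file name is hypothetical. At the data rates discussed here, indexing would have to run in parallel on each cluster unit rather than on a single machine.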
The Vision Omega with long-term, searchable storage appears to be very similar to the XKEYSCORE system revealed by Snowden. According to the NSA's XKEYSCORE slides, the system had about 150 sites around the world with at least 700 servers, implying that each site captured a big portion of the internet but not all of it. This would probably cost tens of billions of dollars, but it could be done using the approach I propose.
There are, of course, numerous challenges to implementing this system. This is a starting point, not an ending point, and you have to start somewhere. I welcome and encourage comments, because I am going to be sweating bullets for a long time if my financial partner puts up the money to build this device. Contact me directly at firstname.lastname@example.org.