This morning (our time) @martin_casado tweeted a link to a freshly minted IETF draft, draft-davie-stt-00, which describes a new product of Nicira’s engineering effort (with contributions from Broadcom, Rackspace, eBay, Intel and Yahoo!): “A Stateless Transport Tunneling Protocol for Network Virtualisation (STT)”.
Naturally, I was intrigued by the proposal, and would like to share what I’ve learned and some thoughts about it.
Why yet another protocol?
Currently, as far as I know, the tunnelling protocol of choice for Nicira’s NVP is GRE. They do support other protocols, but none of them has been specifically designed to meet the needs of Network Virtualisation in the shape Nicira sees it. That is, until now.
So, what’s so special about this protocol? Nicira hints at it in the introductory paragraph of the Design Rationale section of the draft:
“The primary motivation for STT as opposed to one of the existing tunneling methods is to improve the performance of data transfers from hosts that implement tunnel endpoints.”
Where is this performance improvement to come from? It looks like offloading fragmentation and reassembly to Network Interface Cards (NICs) with TCP Segmentation Offload (TSO) capability is what will deliver it.
The need for segmentation and reassembly
The problem with any kind of tunnelling is that it inevitably adds overhead. It may be more or less, fixed or variable, but it is always there. This means that when the sum of the size of the original frame/packet to be transported inside a tunnelling PDU and the size of the tunnelling protocol’s header(s) is bigger than the transmission path can support, fragmentation has to happen at some layer, or the data isn’t going to make it through.
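To put rough numbers on that, here is a minimal sketch of the size check. The function names are mine, and the figures are illustrative assumptions (a full-size Ethernet frame carried over plain GRE on IPv4 across a path with a 1500-byte MTU), not values taken from the STT draft.

```python
def tunnelled_size(inner_frame_len: int, tunnel_overhead: int) -> int:
    """Size of the outer IP packet that carries the encapsulated frame."""
    return inner_frame_len + tunnel_overhead


def needs_fragmentation(inner_frame_len: int, tunnel_overhead: int, path_mtu: int) -> bool:
    """True when the encapsulated packet no longer fits the path MTU."""
    return tunnelled_size(inner_frame_len, tunnel_overhead) > path_mtu


# Illustrative numbers (assumptions, not taken from the STT draft):
INNER_FRAME = 1514     # 1500-byte payload + 14-byte Ethernet header
GRE_OVERHEAD = 20 + 4  # outer IPv4 header without options + base GRE header
PATH_MTU = 1500

print(tunnelled_size(INNER_FRAME, GRE_OVERHEAD))                 # 1538
print(needs_fragmentation(INNER_FRAME, GRE_OVERHEAD, PATH_MTU))  # True
```

Even this modest overhead pushes a full-size frame over the limit, so something along the path has to split it up.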
Ivan (@ioshints) wrote an excellent article about MTUs and fragmentation, for those wishing to understand the issue in greater depth.
Fragmentation and reassembly are resource-intensive tasks. If they are done by the general-purpose CPU of a server host, they can consume a substantial share of the host’s resources, and may still not be fast enough in some cases.
What does it have to do with the TSO?
Well, TSO can perform segmentation and reassembly in NIC hardware. Unfortunately, it can only do this for TCP packets, and not many tunnelling protocols use TCP (in fact, I’m struggling to think of one – let me know in the comments if you can come up with any). There is a good reason for that: TCP may not feel very good running over TCP, with congestion / transmission rate control happening at two levels.
So, what to do? Nicira decided to do this: use “double encapsulation” for the data to be sent over the tunnel. First, the untagged “private” Ethernet frame that is to be transported is encapsulated with an inner (STT) header, which deals largely with multiplexing and auxiliary NVP functions. Then, this frame is passed on to a NIC that supports TSO for transmission, with instructions to fragment it, if needed, and encapsulate it (or each of its fragments) with a “TCP-like” header.
The trick is that while “TCP-like” (read: looks exactly like TCP, but isn’t) headers are used, TCP itself isn’t. This means there are no “normal” TCP functions happening: no rate control, no ACKs, no handshakes, no state machine. The Sequence Number and Acknowledgement Number fields are used, but for “private purposes”, and in such a way as not to confuse NICs that expect to see a “normal” TCP session.
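To make the “private purposes” idea more concrete, here is a small Python sketch of how such header fields could carry enough information to reassemble a frame without any connection state. The split used here (frame length in the upper 16 bits of the sequence field, fragment offset in the lower 16 bits, a frame identifier in the acknowledgement field) is an assumption for illustration, as are all the function names; check the draft for the exact layout. In real life the segmentation step would be done by the NIC, not by code like this.

```python
from typing import Dict, Iterable, List, Tuple

# (sequence field, acknowledgement field, payload chunk) -- a stand-in for
# one "TCP-like" segment; a real segment carries full IP and TCP headers.
Segment = Tuple[int, int, bytes]


def pack_seq(frame_len: int, offset: int) -> int:
    """Pack total frame length and fragment offset into one 32-bit value."""
    if not (0 <= frame_len <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("frame length and offset must each fit in 16 bits")
    return (frame_len << 16) | offset


def unpack_seq(seq: int) -> Tuple[int, int]:
    """Inverse of pack_seq: returns (frame_len, offset)."""
    return seq >> 16, seq & 0xFFFF


def segment(frame: bytes, frame_id: int, mss: int) -> List[Segment]:
    """Split one encapsulated frame into chunks of at most mss bytes."""
    return [
        (pack_seq(len(frame), off), frame_id, frame[off:off + mss])
        for off in range(0, len(frame), mss)
    ]


def reassemble(segments: Iterable[Segment]) -> Dict[int, bytes]:
    """Rebuild complete frames from segments arriving in any order.

    Simplification: duplicate segments are not detected.
    """
    buffers: Dict[int, bytearray] = {}
    received: Dict[int, int] = {}
    for seq, frame_id, chunk in segments:
        frame_len, offset = unpack_seq(seq)
        buf = buffers.setdefault(frame_id, bytearray(frame_len))
        buf[offset:offset + len(chunk)] = chunk
        received[frame_id] = received.get(frame_id, 0) + len(chunk)
    # Only return frames for which every byte has arrived.
    return {fid: bytes(buf) for fid, buf in buffers.items()
            if received[fid] == len(buf)}


frame = bytes(range(256)) * 12            # 3072-byte stand-in for an STT frame
parts = segment(frame, frame_id=7, mss=1448)
print(len(parts))                         # 3
print(reassemble(reversed(parts))[7] == frame)  # True
```

Note that each segment is self-describing: the receiver needs no handshake and no per-connection state, only a buffer per frame that is still being assembled.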
What do I think about it?
Nicira recognises that using TCP headers without “doing actual TCP” can cause problems, most likely with security devices. Devices that implement a proxy architecture will break STT (as there are no actual TCP sessions); firewalls and intrusion detection systems are likely to freak out, too, and “will need to be enhanced”. The proposed mitigation mechanisms are: (a) a “special” IANA-assigned Destination Port, which is to be treated as “special” by those “enhanced” security devices; and, if that fails, (b) running it over some other sort of tunnel (Eeek!).
While I understand the rationale, I am undecided about whether this is a good idea or not, especially when it needs to be implemented outside a single Data Centre. It would be very interesting to see whether this draft makes it into a standard. As much as I myself like to break things in the name of the greater good, I’m not sure whether this draft is a step or maybe a half-step too far.
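A toy model shows why a stateful device would object. This is a deliberately simplified sketch, not how any real firewall works: the class, its method and the port number are all hypothetical, and the port is a placeholder rather than the value the draft asks IANA to assign.

```python
from typing import FrozenSet, Set, Tuple

Flow = Tuple[str, int, str, int]  # (src IP, src port, dst IP, dst port)


class ToyStatefulFirewall:
    """Accepts TCP segments only for flows it has seen a SYN for,
    unless the destination port is explicitly exempted."""

    def __init__(self, exempt_ports: FrozenSet[int] = frozenset()) -> None:
        self.sessions: Set[Flow] = set()
        self.exempt_ports = exempt_ports

    def accept(self, flow: Flow, syn: bool) -> bool:
        if flow[3] in self.exempt_ports:
            return True            # "enhanced" device: treat the port as special
        if syn:
            self.sessions.add(flow)  # start of a normal TCP session
            return True
        return flow in self.sessions  # mid-stream segments need a known session


HYPOTHETICAL_STT_PORT = 65000      # placeholder, NOT the IANA-assigned value
stt_flow = ("10.0.0.1", 40000, "10.0.0.2", HYPOTHETICAL_STT_PORT)

plain = ToyStatefulFirewall()
enhanced = ToyStatefulFirewall(frozenset({HYPOTHETICAL_STT_PORT}))

# STT never sends a SYN, so every segment looks like mid-stream traffic.
print(plain.accept(stt_flow, syn=False))     # False
print(enhanced.accept(stt_flow, syn=False))  # True
```

The unmodified device drops everything because no handshake ever takes place; only a device that knows about the special port lets the traffic through.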
Update: Nicira guys have published their explanation of the reasons behind design decisions around STT.
Update 2 (13 March 2012): This post seems to continue to be fairly popular, so here are a few more thoughts:
Quantifying the benefits of TSO
This question has been asked in a few places (most recently – in a blog post by Scott Lowe) – “just how much performance benefit does use of TSO bring?”. While I swear I previously saw an actual number mentioned, I can’t seem to find it any more. Instead, Nicira guys say that the benefits are “significant”. Considering that the performance improvement is cited as the primary reason behind the creation of STT, I would say that it is likely that it does deliver the goods, especially since, according to Nicira guys, the protocol is actually already in production or live testing.
Implications of the “fake TCP” for “middlebox” Vendors
One thing that made me wonder (and I touched on it in the original post above) is how soon “middlebox” Vendors would start supporting the fake-TCP (as in “recognise it for what it is”). I asked Nicira guys about just that in the comments on their blog post. The answer they came back with was very interesting:
“My guess is that middlebox vendors are more likely to be swayed by customers than the IETF. There already are a handful that are working on STT support.”
That was very nice to know.
One more small pink elephant in the room
BUM (Broadcast / Unknown unicast / Multicast) traffic handling is something that may be easy to overlook. Other network virtualisation protocols (NVGRE and VXLAN) mandate the use of IP multicast in the underlying transport network to support handling of these types of traffic. STT does not. Its current draft says that “if the underlying physical network supports IP multicast”, then STT tunnels can be established using multicast addresses as the destination.
In my opinion, this is quite an important difference, because running IP multicast is generally far from trivial. I, for one, could very much do without it.