Stateless Transport Tunnelling Protocol (STT): a half-step too far?

This morning (our time) @martin_casado tweeted a link to a freshly minted IETF draft, draft-davie-stt-00, which describes a new product of Nicira’s engineering effort (with contributions from Broadcom, Rackspace, eBay, Intel and Yahoo!): “A Stateless Transport Tunneling Protocol for Network Virtualisation (STT)”.

Naturally, I was intrigued by the proposal, and would like to share what I’ve learned and some thoughts about it.

Why yet another protocol?

Currently, as far as I know, the tunnelling protocol of choice for Nicira’s NVP is GRE. They do support other protocols, but none of them has been specifically designed to meet the needs of Network Virtualisation in the shape Nicira sees it. That is, until now.

So, what’s so special about this protocol? Nicira hints at it in the introductory paragraph of the Design Rationale section of the draft:

“The primary motivation for STT as opposed to one of the existing tunneling methods is to improve the performance of data transfers from hosts that implement tunnel endpoints.”

Where is this performance improvement to come from? It looks like it is the offloading of fragmentation and reassembly to Network Interface Cards (NICs) with TCP Segmentation Offload (TSO) capability that will deliver it.

The need for segmentation and reassembly

The problem with any kind of tunnelling is that it inevitably adds overhead. It may be more or less, it may be fixed or variable, but it is always there. This means that when the sum of the size of the original frame/packet to be transported inside a tunnelling PDU and the size of the tunnelling protocol’s header(s) is bigger than the transmission path can support, fragmentation will need to happen at some layer, or the data isn’t going to make it through.
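To put rough numbers on this, here is a minimal sketch of the overhead arithmetic for an Ethernet frame carried over plain GRE. The header sizes assume an outer IPv4 header without options and a GRE header without optional fields; the function names and the 1500-byte path MTU are my own illustrative choices, not something taken from the STT draft.

```python
# Illustrative header sizes in bytes (assumptions, not from the STT draft).
OUTER_IPV4_HEADER = 20      # outer IPv4 header, no options
GRE_BASE_HEADER = 4         # GRE header with no optional fields
INNER_ETHERNET_HEADER = 14  # untagged Ethernet header of the tunnelled frame


def tunnelled_packet_size(inner_ip_packet_len: int) -> int:
    """Size of the outer IP packet that carries one Ethernet frame over GRE."""
    return (OUTER_IPV4_HEADER + GRE_BASE_HEADER
            + INNER_ETHERNET_HEADER + inner_ip_packet_len)


def needs_fragmentation(inner_ip_packet_len: int, path_mtu: int) -> bool:
    """True if the encapsulated packet no longer fits into the path MTU."""
    return tunnelled_packet_size(inner_ip_packet_len) > path_mtu


if __name__ == "__main__":
    for inner in (1400, 1462, 1500):
        print(inner, tunnelled_packet_size(inner),
              needs_fragmentation(inner, path_mtu=1500))
```

With these assumptions, a VM sending full-size 1500-byte IP packets over a path with a 1500-byte MTU always overshoots by the 38 bytes of tunnel overhead, so something has to fragment.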

Ivan (@ioshints) wrote an excellent article about MTUs and fragmentation, for those wishing to understand the issue in greater depth.

So what?

Fragmentation and reassembly is a resource-intensive task. If it is to be done by the general-purpose CPU of a server host, it can consume a substantial amount of the host’s resources, and may still not be fast enough in some cases.

What does it have to do with the TSO?

Well, a NIC with TSO can perform segmentation in hardware on transmit, and a receive-side counterpart such as Large Receive Offload (LRO) can do the reassembly. Unfortunately, it can only do it to TCP packets, and not many tunnelling protocols use TCP (in fact, I’m struggling to think of one – let me know in comments if you can come up with any). This is for a good reason, because TCP may not feel very good running over TCP, with congestion / transmission rate control happening at two levels.

So, what to do? Nicira decided to do this: use “double encapsulation” for the data to be sent over the tunnel. First, an untagged “private” Ethernet frame which is to be encapsulated and transported, is encapsulated with an inner (STT) header, which deals largely with multiplexing and auxiliary NVP functions. Then, this frame is passed on to a NIC that supports TSO for transmission, with instructions to fragment it, if needed, and encapsulate it (or all of its fragments) with a “TCP-like” header(s).

The trick is, while “TCP-like” (read: looks exactly like TCP, but isn’t) headers are used, TCP itself isn’t. This means there are no “normal” TCP functions happening – no rate control, no ACKs, no handshakes – no state machine. The Sequence number and Acknowledgement number fields are still used, but for “private purposes”, and in such a way as not to confuse NICs that expect to see a “normal” TCP session.
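The idea of reusing TCP header fields without any TCP state can be sketched in a few lines. This is only an illustration under my own assumptions: the 12-byte header is a cut-down stand-in for a real TCP header, the field encoding (frame length and offset packed into the sequence field, a frame identifier in the acknowledgement field) is a hypothetical layout rather than a statement of what the draft specifies, and the port numbers are placeholders.

```python
import struct
from typing import Dict, List

# Hypothetical, cut-down "TCP-like" header for illustration only:
# source port, destination port, 32-bit sequence, 32-bit acknowledgement.
HEADER = struct.Struct("!HHII")


def segment(frame: bytes, frame_id: int, mss: int,
            src_port: int = 49152, dst_port: int = 50000) -> List[bytes]:
    """Split one encapsulated frame into MSS-sized "TCP-like" segments."""
    if not 0 < len(frame) <= 0xFFFF:
        raise ValueError("frame length must fit into 16 bits")
    segments = []
    for offset in range(0, len(frame), mss):
        # Assumed encoding: upper 16 bits = total frame length,
        # lower 16 bits = offset of this segment within the frame.
        seq = (len(frame) << 16) | offset
        header = HEADER.pack(src_port, dst_port, seq, frame_id)
        segments.append(header + frame[offset:offset + mss])
    return segments


def reassemble(segments: List[bytes]) -> Dict[int, bytes]:
    """Rebuild frames from segments in any order; no handshake, no ACKs."""
    parts: Dict[int, Dict[int, bytes]] = {}
    frames: Dict[int, bytes] = {}
    for seg in segments:
        _src, _dst, seq, frame_id = HEADER.unpack_from(seg)
        total, offset = seq >> 16, seq & 0xFFFF
        got = parts.setdefault(frame_id, {})
        got[offset] = seg[HEADER.size:]
        if sum(len(p) for p in got.values()) == total:
            frames[frame_id] = b"".join(got[o] for o in sorted(got))
            del parts[frame_id]
    return frames


if __name__ == "__main__":
    frame = bytes(range(256)) * 10            # 2560-byte stand-in frame
    segs = segment(frame, frame_id=7, mss=1000)
    print(len(segs), reassemble(list(reversed(segs)))[7] == frame)
```

Every segment carries enough information to be placed on its own, so the receiver needs only a short-lived reassembly buffer per frame rather than a per-connection state machine. In the real protocol the splitting would be done by the NIC, not in software like this.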

What do I think about it?

Nicira recognises that using TCP headers without “doing actual TCP” can cause problems, most likely with security devices. Devices that implement a proxy architecture will break STT (as there are no actual TCP sessions); firewalls and intrusion detection systems are likely to freak out, too, and “will need to be enhanced”. The proposed mitigation mechanisms are: (a) a “special” IANA-assigned Destination Port, which is to be treated as “special” by those “enhanced” security devices; and, if that fails, (b) running it over some sort of other tunnel (Eeek!).

While I understand the rationale, I am undecided about whether this is a good idea or not, especially when it will need to be implemented outside a single Data Centre. It would be very interesting to see whether this draft makes it into a standard. As much as I myself like to break things in the name of the greater good, I’m not sure whether this draft is a step or maybe a half-step too far.

Update: Nicira guys have published their explanation of the reasons behind design decisions around STT.

Update 2 (13 March 2012): This post seems to continue to be fairly popular, so here are a few more thoughts:

Quantifying the benefits of TSO

This question has been asked in a few places (most recently in a blog post by Scott Lowe): “just how much performance benefit does use of TSO bring?”. While I swear I previously saw an actual number mentioned, I can’t seem to find it any more. Instead, Nicira guys say that the benefits are “significant”. Considering that the performance improvement is cited as the primary reason behind the creation of STT, I would say that it is likely that it does deliver the goods, especially since, according to Nicira guys, the protocol is actually already in production or live testing.

Implications of the “fake TCP” for “middlebox” Vendors

One thing that made me wonder (and I touched on it in the original post above) is how soon “middlebox” Vendors would start supporting the fake-TCP (as in “recognise it for what it is”). I asked Nicira guys about just that in the comments on their blog post. The answer they came back with was very interesting:

My guess is that middlebox vendors are more likely to be swayed by customers than the IETF. There already are a handful that are working on STT support.

That was very nice to know.

One more small pink elephant in the room

BUM (Broadcast / Unknown unicast / Multicast) traffic handling is something that may be easy to overlook. Other network virtualisation protocols (NVGRE and VXLAN) mandate the use of IP multicast in the underlying transport network to support handling of these types of traffic. STT does not. Its current draft says that “if the underlying physical network supports IP multicast”, then STT tunnels can be established using multicast addresses as destination.

In my opinion, this is quite important, because running IP multicast is generally not trivial. I, for one, could very much do without it.

About Dmitri Kalintsev

Some dude with a blog and opinions ;)

11 responses to “Stateless Transport Tunnelling Protocol (STT): a half-step too far?”

  • Mark

    I’m curious about the mention of LRO in the STT draft. According to the README file for Intel’s latest 10GE NIC drivers for Linux:

    “WARNING: The ixgbe driver compiles by default with the LRO (Large
    Receive Offload) feature enabled. This option offers the lowest CPU
    utilization for receives, but is completely incompatible with
    *routing/ip forwarding* and *bridging*. If enabling ip forwarding or
    bridging is a requirement, it is necessary to disable LRO using compile
    time options as noted in the LRO section later in this document. The
    result of not disabling LRO when combined with ip forwarding or bridging
    can be low throughput or even a kernel panic.”

    If one of STT’s benefits is ostensibly to make use of LRO on the NIC, then it only receives that benefit in use cases where bridging and forwarding aren’t in play, apparently?

  • Bo

    > in fact, I’m struggling to think of one – let me know in comments if you can come up with any

    — How about VXLAN, new feature in vmware MN.next release.

  • Roger

    So, I’m just a little curious about STT and the TCP 3-way handshake at the beginning and the end of a TCP session. Since the VM is using a vNIC into OVS, it also has its own TCP configuration, so the actual TCP connection is being started between the vNICs. Is this correct? I mean the application thinks it is running TCP, the vNIC thinks it’s running TCP, the receiving client or server thinks it is TCP, so how is this being avoided? I mean how does a receiving client or server start to receive TCO frames if the 3-way handshake is not completed first?

    • Dmitri K

      Hi Roger,

      STT sessions are not between VMs; they are between OVS instances, which are fully aware of the fact that they’re talking STT rather than TCP and thus don’t attempt to do the TCP handshake.

      Hope this makes sense.

      — Dmitri

      • Roger

        I understand the part that there is no handshake at the STT level, but the actual connections are between VMs via TCP utilizing the vNIC, so that would mean the 3-way handshake still must occur before a TCP connection between the VMs can be established.

        Otherwise, just how does the receiving VM or application know that the packets are coming or the connection request is being made?

        Second part, how does the VM and application know if it received all of the data in the transfer?

    • Dmitri K

      Hi Roger,

      > the actual connections are between VMs via TCP utilizing the vNIC, so that would mean the 3-way handshake still must occur before a TCP connection between the VMs can be established.

      STT tunnels and TCP connections between VMs operate on different levels. Think of STT tunnels as “virtual cables” connecting OVS instances, running on separate physical hosts. OVSes can then create a “virtual Ethernet network segment”, which is where your VMs attach. The VMs then can run any protocol on top of that virtual Ethernet segment, including IP, and of course TCP on top of it. From the standpoint of VMs on the same virtual Ethernet segment, they are operating as if they were attached to a real VLAN, so there’s no change to how they handle their own TCP connections.

      Cheers,

      — Dmitri

  • elnemesisdivina

    thanks for the abstraction!!!

    vRay
