Yet another networking method for Linux namespaces
All code for this can be found on GitHub.
For an application that I’ve been working on, which leverages rootless containers (or rather, rootless namespaces, as I don’t follow any of the container standards for this), I needed networking. Docker uses bridges by default, which require root, so that wasn’t worth looking at. Podman mentions slirp4netns quite a bit, which requires an extra binary to be installed and so isn’t ideal, but I ultimately added it as an option anyway (after I wrote what I’m about to talk about).
But I wanted something easy that would just work as a default, without needing to install anything extra (like slirp4netns).
I had previously played around with a golang library that allows the creation of tun and tap devices.
And I figured that if I were to bind mount /dev/net/tun into the new namespace, I could then use this device to create a tun interface inside it.
Meaning I’d have a network interface that I have full control over from userspace.
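A minimal sketch of creating the tun device from inside the namespace, using github.com/songgao/water for illustration (any tun/tap library with a similar API would do):

```go
package main

import (
	"log"

	"github.com/songgao/water" // illustrative choice of tun/tap library
)

func main() {
	// This only works because /dev/net/tun was bind mounted into the new namespace.
	ifce, err := water.New(water.Config{DeviceType: water.TUN})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created %s", ifce.Name())

	// Every Read() returns exactly one IP packet routed to the interface.
	packet := make([]byte, 1500)
	n, err := ifce.Read(packet)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read a %d byte packet", n)
}
```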
But, how would it actually communicate to the outside world?
Well.. The host process has network, right?
But, how would those two communicate?
Well, let me introduce you to a wonderful option in exec.Cmd, called ExtraFiles.
This allows us to pass extra file descriptors to the new process.
I initially combined this with the os.Pipe() function, creating one read/write pair to communicate to the container process, and one to communicate back.
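A minimal sketch of that wiring (simplified, not the exact code from the repo; the “container” process here is just a re-exec of the same binary):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		// Child side: ExtraFiles entries show up as fd 3, 4, ... in order.
		fromHost := os.NewFile(3, "from-host") // ExtraFiles[0]
		toHost := os.NewFile(4, "to-host")     // ExtraFiles[1]
		_ = fromHost
		toHost.Write([]byte("hello from the child\n"))
		return
	}

	// Host side: one pipe per direction, the child ends get passed via ExtraFiles.
	containerRead, hostWrite, err := os.Pipe() // host -> container
	if err != nil {
		log.Fatal(err)
	}
	hostRead, containerWrite, err := os.Pipe() // container -> host
	if err != nil {
		log.Fatal(err)
	}
	_ = hostWrite

	cmd := exec.Command("/proc/self/exe", "child") // re-exec ourselves as the "container" process
	cmd.ExtraFiles = []*os.File{containerRead, containerWrite}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	// The parent no longer needs the child's ends once the child has them.
	containerRead.Close()
	containerWrite.Close()

	buf := make([]byte, 1024)
	n, _ := hostRead.Read(buf)
	log.Printf("host got: %s", buf[:n])
	cmd.Wait()
}
```

In the real setup the child would of course also be started in its own namespaces via cmd.SysProcAttr, which is left out here.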
After this I added some code where the container would send all data it received on the tun device to the host over the pipe, and write all the data it read from the pipe back to the tun device. This way, all I would need to do is handle all the magic in the host process. I initially wrote a simple UDP forwarder, which worked pretty much right away. But of course I wanted TCP as well. For about an hour I tried to implement a proof of concept for this, but I ultimately decided that it would be very unlikely that I would get every detail right and, more importantly, safe. So I went to look for another solution. At which point I found gvisor, or more specifically their tcpip stack, which is written completely in golang. Annoyingly enough it is not super straightforward to just use this package as a library, due to the weird build system used. But I ultimately got it working with an older version.
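Conceptually, the forwarding code in the container boils down to two copy loops, treating every read and write as one whole packet (a rough sketch, not the exact code from the repo; whether the channel in the middle actually preserves those packet boundaries is what the rest of this post ends up being about):

```go
package tunnel

import "io"

// forward shuffles raw IP packets between the tun device and the host process.
// One goroutine copies tun -> host, the calling goroutine copies host -> tun.
func forward(tun io.ReadWriter, host io.ReadWriter) {
	go func() {
		buf := make([]byte, 65535)
		for {
			n, err := tun.Read(buf) // one packet per read from the tun device
			if err != nil {
				return
			}
			if _, err := host.Write(buf[:n]); err != nil {
				return
			}
		}
	}()

	buf := make([]byte, 65535)
	for {
		n, err := host.Read(buf)
		if err != nil {
			return
		}
		if _, err := tun.Write(buf[:n]); err != nil { // inject the packet back into the tun device
			return
		}
	}
}
```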
So we now had the basic functionality that we wanted.
Both TCP and UDP connections can be made from within the network namespace started from golang, proxying everything through the host process.
Time to see how the performance is, I suppose.
Of course I would want to use iperf3 for this.
But my testing setup consisted of just a static busybox binary for now, and no libc.
So I could either use a basic alpine rootfs, or use a statically compiled iperf3, which is the route I ended up taking.
So I ran iperf3 -s on my host machine, and ./iperf3-static -c <host ip> from the container.
[ 5] 0.00-1.00 sec 28.2 MBytes 236 Mbits/sec
[ 5] 1.00-2.00 sec 16.9 MBytes 142 Mbits/sec
[ 5] 2.00-3.00 sec 4.51 MBytes 37.8 Mbits/sec
[ 5] 3.00-4.00 sec 8.05 MBytes 67.5 Mbits/sec
[ 5] 4.00-5.00 sec 15.7 MBytes 132 Mbits/sec
[ 5] 5.00-6.00 sec 21.1 MBytes 177 Mbits/sec
[ 5] 6.00-7.00 sec 9.34 MBytes 78.3 Mbits/sec
[ 5] 7.00-8.00 sec 13.9 MBytes 117 Mbits/sec
[ 5] 8.00-9.00 sec 10.6 MBytes 88.6 Mbits/sec
[ 5] 9.00-10.00 sec 9.11 MBytes 76.4 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 137 MBytes 115 Mbits/sec
That’s not terrible, but yikes that’s spiky.
So I went digging into the layers.
Initially I decided to dig into the os.Pipe() function that I’m using for the communication between the two processes.
Just looking at the source code, I see that it’s just using the pipe() system call.
That isn’t too surprising. But while having a look at its man page, I stumbled upon the following.
O_DIRECT (since Linux 3.4)
Create a pipe that performs I/O in "packet" mode. Each write(2) to the pipe is dealt with as a separate packet, and read(2)s from the pipe will read one packet at a time. Note the following points:
* Writes of greater than PIPE_BUF bytes (see pipe(7)) will be split into multiple packets. The constant PIPE_BUF is defined in <limits.h>.
* If a read(2) specifies a buffer size that is smaller than the next packet, then the requested number of bytes are read, and the excess bytes in the packet are discarded. Specifying a buffer size of PIPE_BUF will be sufficient to read the largest possible packets (see the previous point).
* Zero-length packets are not supported. (A read(2) that specifies a buffer size of zero is a no-op, and returns 0.)
Older kernels that do not support this flag will indicate this via an EINVAL error.
Since Linux 4.5, it is possible to change the O_DIRECT setting of a pipe file descriptor using fcntl(2).
Well, that surely sounds interesting.
So rather than a stream, each read/write just acts on whole packets, like a network interface would do anyway.
And it requires Linux 3.4.
Yeah ok, that came out like 10 years ago. So not a problem.
So I just copied the entire os.Pipe() function and added O_DIRECT as an extra flag to it.
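In Go that ends up looking something like this (a sketch of the idea, not a verbatim copy of the stdlib function):

```go
package tunnel

import (
	"os"
	"syscall"
)

// directPipe is os.Pipe(), but with O_DIRECT, so the pipe operates in "packet" mode:
// every write becomes one packet and every read returns at most one packet.
func directPipe() (r *os.File, w *os.File, err error) {
	var p [2]int
	if err := syscall.Pipe2(p[:], syscall.O_CLOEXEC|syscall.O_DIRECT); err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(p[0]), "|0"), os.NewFile(uintptr(p[1]), "|1"), nil
}
```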
Now, let’s try our iperf3 test again.
[ 5] 0.00-1.00 sec 205 MBytes 1.72 Gbits/sec
[ 5] 1.00-2.00 sec 194 MBytes 1.63 Gbits/sec
[ 5] 2.00-3.00 sec 202 MBytes 1.69 Gbits/sec
[ 5] 3.00-4.00 sec 189 MBytes 1.59 Gbits/sec
[ 5] 4.00-5.00 sec 195 MBytes 1.64 Gbits/sec
[ 5] 5.00-6.00 sec 193 MBytes 1.62 Gbits/sec
[ 5] 6.00-7.00 sec 195 MBytes 1.64 Gbits/sec
[ 5] 7.00-8.00 sec 195 MBytes 1.64 Gbits/sec
[ 5] 8.00-9.00 sec 196 MBytes 1.64 Gbits/sec
[ 5] 9.00-10.00 sec 203 MBytes 1.70 Gbits/sec
[ 5] 10.00-10.00 sec 445 KBytes 1.32 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 1.92 GBytes 1.65 Gbits/sec
That’s a… huge difference. And way less spiky. I used it in this state for a while, and added a bunch of basic helper functions to the library: to disallow connecting back to the host, to keep track of packet stats, etc.
Several months later, I happened to stumble upon unix.Socketpair(), a wrapper around the socketpair(2) system call.
It basically creates two connected sockets (file descriptors, so I can just pass them along like the pipes).
Everything written to the first socket will be available to be read from the second socket, and vice versa.
So basically what I was already doing, but nicer.
So naturally I replaced the pipes with this.
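Swapping the pipes for a socket pair looks roughly like this (a sketch using golang.org/x/sys/unix, not the exact code from the repo):

```go
package tunnel

import (
	"os"

	"golang.org/x/sys/unix"
)

// socketPair returns both ends of a connected AF_UNIX socket pair as *os.File,
// so they can be passed around (and put in ExtraFiles) just like the pipe ends.
func socketPair() (*os.File, *os.File, error) {
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM|unix.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(fds[0]), "host"), os.NewFile(uintptr(fds[1]), "container"), nil
}
```

Note that SOCK_STREAM makes the pair behave as a byte stream, just like a plain pipe without O_DIRECT.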
And, performance was back to being spiky.
But after browsing more man pages, I stumbled upon SOCK_SEQPACKET in the socket man page, which describes it as:
Provides a sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call.
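That sounds an awful lot like what O_DIRECT did for the pipes. Assuming the socketpair sketch from earlier, the change is a single flag:

```go
// SOCK_SEQPACKET keeps the reliable, connected behaviour,
// but preserves packet boundaries, like O_DIRECT did for the pipes.
fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_SEQPACKET|unix.SOCK_CLOEXEC, 0)
```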
So I applied it, and performance was basically back to the same level. (Here’s the actual commit for this, including iperf3 results.) After this I decided to play around with tweaking the MTU, which was still at the default of 1500 at this point. And after a while I settled on 32 kibibytes (32768), which gave about the following performance on an iperf3 test.
[ 8] local 10.0.0.1 port 44194 connected to 192.168.100.123 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 8] 0.00-1.00 sec 1.06 GBytes 9.13 Gbits/sec 0 639 KBytes
[ 8] 1.00-2.00 sec 1.04 GBytes 8.96 Gbits/sec 0 639 KBytes
[ 8] 2.00-3.00 sec 1.00 GBytes 8.59 Gbits/sec 0 639 KBytes
[ 8] 3.00-4.00 sec 1.04 GBytes 8.94 Gbits/sec 0 639 KBytes
[ 8] 4.00-5.00 sec 1023 MBytes 8.58 Gbits/sec 0 639 KBytes
[ 8] 5.00-6.00 sec 1.03 GBytes 8.81 Gbits/sec 0 639 KBytes
[ 8] 6.00-7.00 sec 1.01 GBytes 8.69 Gbits/sec 0 639 KBytes
[ 8] 7.00-8.00 sec 1.04 GBytes 8.97 Gbits/sec 0 639 KBytes
[ 8] 8.00-9.00 sec 1.02 GBytes 8.75 Gbits/sec 0 639 KBytes
[ 8] 9.00-10.00 sec 1.01 GBytes 8.68 Gbits/sec 0 639 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 8] 0.00-10.00 sec 10.3 GBytes 8.81 Gbits/sec 0 sender
[ 8] 0.00-10.00 sec 10.3 GBytes 8.81 Gbits/sec receiver
Basically just over 4 times the performance. But that’s basically where this journey ends for now. This solution is far from perfect, but for something that runs completely without root and without the need to install anything extra, it’s good enough. And it gives us complete control over the network traffic from golang on the host, meaning we could view, alter, or firewall the traffic as we please. Which is something I am likely to implement at a later stage for the application I originally wrote this for.
As for the library itself: I still aim to look at either a more reliable way to upgrade gvisor at least once in a while, or to switch to a completely different network stack, if I were to run into one at some point.