Yet another networking method for Linux namespaces
All code for this can be found on GitHub.
For an application that I’ve been working on, which leverages rootless containers (or rather, rootless namespaces, as I don’t follow any of the container standards for this), I needed networking. Docker uses bridges by default, which require root, so that wasn’t worth looking at. Podman mentions slirp4netns quite a bit, which requires an extra binary to be installed and so isn’t ideal, but I ultimately added it as an option anyway (after I wrote what I’m about to talk about).
But I wanted something easy that would just work as a default, without needing to install anything extra (like slirp4netns).
I had previously played around with a golang library that allows the creation of tun and tap devices.
And I figured that if I were to bind mount /dev/net/tun into the new namespace, I could then use this device to create a tun interface inside it.
Meaning I’d have a network interface that I have full control over from userspace.
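A minimal sketch of creating the tun device from inside the namespace, using github.com/songgao/water for illustration (any tun/tap library with a similar API would do):

```go
package main

import (
	"log"

	"github.com/songgao/water" // illustrative choice of tun/tap library
)

func main() {
	// This only works because /dev/net/tun was bind mounted into the new namespace.
	ifce, err := water.New(water.Config{DeviceType: water.TUN})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created %s", ifce.Name())

	// Every Read() returns exactly one IP packet routed to the interface.
	packet := make([]byte, 1500)
	n, err := ifce.Read(packet)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read a %d byte packet", n)
}
```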
But, how would it actually communicate to the outside world?
Well.. The host process has network, right?
But, how would those two communicate?
Well, let me introduce you to a wonderful option in exec.Cmd, called ExtraFiles.
This allows us to pass extra file descriptors to the new process.
I initially combined this with the os.Pipe() function, creating one read/write pair to communicate to the container process, and one to communicate back.
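A minimal sketch of that wiring (simplified, not the exact code from the repo; the “container” process here is just a re-exec of the same binary):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		// Child side: ExtraFiles entries show up as fd 3, 4, ... in order.
		fromHost := os.NewFile(3, "from-host") // ExtraFiles[0]
		toHost := os.NewFile(4, "to-host")     // ExtraFiles[1]
		_ = fromHost
		toHost.Write([]byte("hello from the child\n"))
		return
	}

	// Host side: one pipe per direction, the child ends get passed via ExtraFiles.
	containerRead, hostWrite, err := os.Pipe() // host -> container
	if err != nil {
		log.Fatal(err)
	}
	hostRead, containerWrite, err := os.Pipe() // container -> host
	if err != nil {
		log.Fatal(err)
	}
	_ = hostWrite

	cmd := exec.Command("/proc/self/exe", "child") // re-exec ourselves as the "container" process
	cmd.ExtraFiles = []*os.File{containerRead, containerWrite}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	// The parent no longer needs the child's ends once the child has them.
	containerRead.Close()
	containerWrite.Close()

	buf := make([]byte, 1024)
	n, _ := hostRead.Read(buf)
	log.Printf("host got: %s", buf[:n])
	cmd.Wait()
}
```

In the real setup the child would of course also be started in its own namespaces via cmd.SysProcAttr, which is left out here.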
After this I added some code where the container would send all data it received on the tun device to the host over the pipe, and write all the data it read from the pipe back to the tun device. This way, all I would need to do is handle all the magic in the host process. I initially wrote a simple UDP forwarder, which worked pretty much right away. But of course I wanted TCP as well. For about an hour I tried to implement a proof of concept for this, but I ultimately decided that it would be very unlikely that I would get every detail right and, more importantly, safe. So I went to look for another solution. At which point I found gvisor, or more specifically their tcpip stack, which is written completely in golang. Annoyingly enough it is not super straightforward to just use this package as a library, due to the weird build system used. But I ultimately got it working with an older version.
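Conceptually, the forwarding code in the container boils down to two copy loops, treating every read and write as one whole packet (a rough sketch, not the exact code from the repo; whether the channel in the middle actually preserves those packet boundaries is what the rest of this post ends up being about):

```go
package tunnel

import "io"

// forward shuffles raw IP packets between the tun device and the host process.
// One goroutine copies tun -> host, the calling goroutine copies host -> tun.
func forward(tun io.ReadWriter, host io.ReadWriter) {
	go func() {
		buf := make([]byte, 65535)
		for {
			n, err := tun.Read(buf) // one packet per read from the tun device
			if err != nil {
				return
			}
			if _, err := host.Write(buf[:n]); err != nil {
				return
			}
		}
	}()

	buf := make([]byte, 65535)
	for {
		n, err := host.Read(buf)
		if err != nil {
			return
		}
		if _, err := tun.Write(buf[:n]); err != nil { // inject the packet back into the tun device
			return
		}
	}
}
```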
So we now had the basic functionality that we wanted.
Both TCP and UDP connections can be made from within the network namespace started from golang, proxying everything through the host process.
Time to see how the performance is, I suppose.
Of course I would want to use iperf3 for this.
But my testing setup consisted of just a static busybox binary for now, and no libc.
So I could either use a basic alpine rootfs, or use a statically compiled iperf3, which is the route I ended up taking.
So I ran iperf3 -s on my host machine, and ./iperf3-static -c <host ip> from the container.
[ 5] 0.00-1.00 sec 28.2 MBytes 236 Mbits/sec
[ 5] 1.00-2.00 sec 16.9 MBytes 142 Mbits/sec
[ 5] 2.00-3.00 sec 4.51 MBytes 37.8 Mbits/sec
[ 5] 3.00-4.00 sec 8.05 MBytes 67.5 Mbits/sec
[ 5] 4.00-5.00 sec 15.7 MBytes 132 Mbits/sec
[ 5] 5.00-6.00 sec 21.1 MBytes 177 Mbits/sec
[ 5] 6.00-7.00 sec 9.34 MBytes 78.3 Mbits/sec
[ 5] 7.00-8.00 sec 13.9 MBytes 117 Mbits/sec
[ 5] 8.00-9.00 sec 10.6 MBytes 88.6 Mbits/sec
[ 5] 9.00-10.00 sec 9.11 MBytes 76.4 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 137 MBytes 115 Mbits/sec
That’s not terrible, but yikes that’s spiky.
So I went digging into the layers.
Initially I decided to dig into the os.Pipe() function that I’m using for the communication between the two processes.
Just looking at the source code, I see that it’s just using the pipe() system call.
That isn’t too surprising. But while having a look at its man page, I stumbled upon the following.
O_DIRECT (since Linux 3.4)
Create a pipe that performs I/O in "packet" mode. Each write(2) to the pipe is dealt with as a separate packet, and read(2)s from the pipe will read one packet at a time. Note the following points:
* Writes of greater than PIPE_BUF bytes (see pipe(7)) will be split into multiple packets. The constant PIPE_BUF is defined in <limits.h>.
* If a read(2) specifies a buffer size that is smaller than the next packet, then the requested number of bytes are read, and the excess bytes in the packet are discarded. Specifying a buffer size of PIPE_BUF will be sufficient to read the largest possible packets (see the previous point).
* Zero-length packets are not supported. (A read(2) that specifies a buffer size of zero is a no-op, and returns 0.)
Older kernels that do not support this flag will indicate this via an EINVAL error.
Since Linux 4.5, it is possible to change the O_DIRECT setting of a pipe file descriptor using fcntl(2).
Well, that surely sounds interesting.
So rather than a stream, each read/write just acts on whole packets, like a network interface would do anyway.
And it requires Linux 3.4.
Yeah ok, that came out like 10 years ago. So not a problem.
So I just copied the entire os.Pipe() function and added O_DIRECT as an extra flag to it.
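In Go that ends up looking something like this (a sketch of the idea, not a verbatim copy of the stdlib function):

```go
package tunnel

import (
	"os"
	"syscall"
)

// directPipe is os.Pipe(), but with O_DIRECT, so the pipe operates in "packet" mode:
// every write becomes one packet and every read returns at most one packet.
func directPipe() (r *os.File, w *os.File, err error) {
	var p [2]int
	if err := syscall.Pipe2(p[:], syscall.O_CLOEXEC|syscall.O_DIRECT); err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(p[0]), "|0"), os.NewFile(uintptr(p[1]), "|1"), nil
}
```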
Now, let’s try our iperf3 test again.
[ 5] 0.00-1.00 sec 205 MBytes 1.72 Gbits/sec
[ 5] 1.00-2.00 sec 194 MBytes 1.63 Gbits/sec
[ 5] 2.00-3.00 sec 202 MBytes 1.69 Gbits/sec
[ 5] 3.00-4.00 sec 189 MBytes 1.59 Gbits/sec
[ 5] 4.00-5.00 sec 195 MBytes 1.64 Gbits/sec
[ 5] 5.00-6.00 sec 193 MBytes 1.62 Gbits/sec
[ 5] 6.00-7.00 sec 195 MBytes 1.64 Gbits/sec
[ 5] 7.00-8.00 sec 195 MBytes 1.64 Gbits/sec
[ 5] 8.00-9.00 sec 196 MBytes 1.64 Gbits/sec
[ 5] 9.00-10.00 sec 203 MBytes 1.70 Gbits/sec
[ 5] 10.00-10.00 sec 445 KBytes 1.32 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.00 sec 1.92 GBytes 1.65 Gbits/sec
That’s a… huge difference. And way less spiky. I used it in this state for a while, and added a bunch of basic helper functions to the library: to disallow connecting back to the host, to keep track of packet stats, etc.
Several months later, I happened to stumble upon unix.Socketpair(), a wrapper around the socketpair(2) system call.
It basically creates two connected sockets (file descriptors, so I can just pass them along like the pipes).
Everything written to the first socket will be available to be read from the second socket, and vice versa.
So basically what I was already doing, but nicer.
So naturally I replaced the pipes with this.
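Swapping the pipes for a socket pair looks roughly like this (a sketch using golang.org/x/sys/unix, not the exact code from the repo):

```go
package tunnel

import (
	"os"

	"golang.org/x/sys/unix"
)

// socketPair returns both ends of a connected AF_UNIX socket pair as *os.File,
// so they can be passed around (and put in ExtraFiles) just like the pipe ends.
func socketPair() (*os.File, *os.File, error) {
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM|unix.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(fds[0]), "host"), os.NewFile(uintptr(fds[1]), "container"), nil
}
```

Note that SOCK_STREAM makes the pair behave as a byte stream, just like a plain pipe without O_DIRECT.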
And, performance was back to being spiky.
But after browsing more man pages, I stumbled upon SOCK_SEQPACKET in the socket man page, which describes it as:
Provides a sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call.
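That sounds an awful lot like what O_DIRECT did for the pipes. Assuming the socketpair sketch from earlier, the change is a single flag:

```go
// SOCK_SEQPACKET keeps the reliable, connected behaviour,
// but preserves packet boundaries, like O_DIRECT did for the pipes.
fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_SEQPACKET|unix.SOCK_CLOEXEC, 0)
```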
So I applied it, and performance was basically back to the same level. (Here’s the actual commit for this, including iperf3 results.) After this I decided to play around with tweaking the MTU, which was still at the default of 1500 at this point. And after a while I settled on 32 kibibytes (32768), which gave about the following performance on an iperf3 test.
[ 8] local 10.0.0.1 port 44194 connected to 192.168.100.123 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 8] 0.00-1.00 sec 1.06 GBytes 9.13 Gbits/sec 0 639 KBytes
[ 8] 1.00-2.00 sec 1.04 GBytes 8.96 Gbits/sec 0 639 KBytes
[ 8] 2.00-3.00 sec 1.00 GBytes 8.59 Gbits/sec 0 639 KBytes
[ 8] 3.00-4.00 sec 1.04 GBytes 8.94 Gbits/sec 0 639 KBytes
[ 8] 4.00-5.00 sec 1023 MBytes 8.58 Gbits/sec 0 639 KBytes
[ 8] 5.00-6.00 sec 1.03 GBytes 8.81 Gbits/sec 0 639 KBytes
[ 8] 6.00-7.00 sec 1.01 GBytes 8.69 Gbits/sec 0 639 KBytes
[ 8] 7.00-8.00 sec 1.04 GBytes 8.97 Gbits/sec 0 639 KBytes
[ 8] 8.00-9.00 sec 1.02 GBytes 8.75 Gbits/sec 0 639 KBytes
[ 8] 9.00-10.00 sec 1.01 GBytes 8.68 Gbits/sec 0 639 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 8] 0.00-10.00 sec 10.3 GBytes 8.81 Gbits/sec 0 sender
[ 8] 0.00-10.00 sec 10.3 GBytes 8.81 Gbits/sec receiver
Basically just over 4 times the performance. But that’s basically where this journey ends for now. This solution is far from perfect, but for something that runs completely without root and without the need to install anything extra, it’s good enough. And it gives us complete control over the network traffic from golang on the host, meaning we could view, alter, or firewall the traffic as we please. Which is something I am likely to implement at a later stage for the application I originally wrote this for.
As for the library itself: I still aim to look at either a more reliable way to upgrade gvisor at least once in a while, or to switch to a completely different network stack, if I were to run into one at some point.