Moving to Beta

Originally launched in March 2020, domain-park was my first attempt at open infrastructure for the web. After two years of Alpha, it's time to give this service some tender loving care.

There are three major areas that have received updates in order to bring domain-park into Beta:

  1. Updated and improved management of the production server.
  2. A battle-tested DNS server acting as a load-balancer.
  3. Improvements to the underlying software.

In this post I'll be giving an overview of what changes were made.

The Server

The name servers are a publicly accessible version of the domain-park software.

Which, during the Alpha, was very literal...

The server was running a single instance of domain-park on port 53 of my hand-crafted testing server, which was going to stop receiving security updates in a few months' time.

This is definitely not adequate for real world usage.

The first port of call was to move to a more recent version of Ubuntu Server, which in the world of cloud computing is just a few clicks, and PRESTO, server. At this point in my career I am very used to configuring new servers with all the good stuff like SSH-only access. But this is a server I don't want to be managing by hand, especially when I get to the point of having a number of globally located servers, each of which would need managing.

Enter Ansible, my current tool of choice for managing servers. Ansible allows you to automate server tasks by defining everything in configuration files. You can then use these to either set a desired state for the server to be in (like installing domain-park 😉), or to simply run maintenance tasks like applying security updates and restarting the server.

My test server was already configured to run domain-park using systemd, the service manager used by Ubuntu (I'm not some kind of deranged maniac!). So, porting the existing config to Ansible was a pretty painless experience.
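As a rough sketch of what that porting looks like, a minimal playbook might resemble the following. The host group, file paths, and task layout here are assumptions for illustration, not the actual domain-park configuration:

```yaml
# Hypothetical playbook sketch - names and paths are illustrative only.
- name: Configure a domain-park name server
  hosts: nameservers
  become: true
  tasks:
    - name: Apply pending package updates
      ansible.builtin.apt:
        upgrade: safe
        update_cache: true

    - name: Install domain-park from PyPI
      ansible.builtin.pip:
        name: domain-park

    - name: Install the systemd unit for domain-park
      ansible.builtin.copy:
        src: domain-park.service
        dest: /etc/systemd/system/domain-park.service
      notify: restart domain-park

  handlers:
    - name: restart domain-park
      ansible.builtin.systemd:
        name: domain-park
        state: restarted
        daemon_reload: true
```

The same playbook can then be pointed at a throwaway test host or the production inventory, which is what makes the test-and-production parity cheap.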

With my newfound powers of automation I could now not only easily configure our freshly minted production server, but also easily create test servers with the exact same configuration!

The Load-Balancer

Even with our new automation powers, the actual processing of DNS queries is still the same as it was during the Alpha: a single instance of domain-park responding to queries. What's worse, because we are using Python, we are essentially limited to responding to one query at a time and hoping that the operating system can buffer all the incoming requests (spoiler: it won't do this well).

I want more.

This is a pretty well understood and solved problem for Python web applications, with tooling like uWSGI and gunicorn. Unfortunately for me, there is less Python-specific tooling in the DNS space, with the highest-level library for writing DNS servers having been written by me for this project. Instead, we're going to have to find something that can load-balance between multiple distinct copies of domain-park.

For this purpose I have chosen CoreDNS, for a number of reasons:

  • It has a large number of official and community-maintained plugins.
  • It supports all major DNS protocols.
  • It is widely used.

Apart from the forward plugin that we will use for load-balancing between multiple domain-park instances, there are a number of other useful plugins to consider. There are caching and logging plugins that we will definitely be making use of.
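As a sketch of how these pieces fit together, a minimal Corefile might look something like the following. The ports, instance count, and cache TTL here are assumptions for illustration, not the production configuration:

```txt
# Hypothetical Corefile sketch - ports and values are illustrative only.
. {
    # Spread incoming queries across several local domain-park instances
    forward . 127.0.0.1:5301 127.0.0.1:5302 127.0.0.1:5303 {
        policy round_robin
    }
    cache 300    # cache responses for up to 300 seconds
    log          # query logging
    errors       # error logging
}
```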

Running public name servers also comes with a number of risks and responsibilities around preventing malicious activity. This means using plugins for things like rate-limiting both requests and responses, and enforcing hard timeouts.

Although most DNS resolvers still use UDP by default, the internet is slowly moving towards more modern and secure protocols. Rather than attempting to implement protocols such as DNS-over-HTTPS (DoH) in the Python server, it is much more convenient to simply use CoreDNS, which has this functionality built in. Whilst I have not actually enabled these yet, the configuration changes will be coming in the near future.

Finally, given the public nature of these servers it's important to ensure that they are secure. Although I've taken care to write a secure Python library, it's not exactly battle tested. On the other hand, CoreDNS is the default name server and resolver used in Kubernetes.

The Software

Even with CoreDNS as the public-facing server, we still need to give some love to our Python code, because I'm a very demanding parent.

If domain-park is the brains of our operation, nserver is the muscle: receiving, decoding, and routing incoming queries before encoding and transmitting the domain-park responses. Which means, to improve the robustness and performance of the public name servers I'll need to improve the underlying nserver package.

The current UDP server implementation is pretty robust thanks to the simplicity of working with UDP packets; however, UDP is an unreliable transport. If we begin receiving UDP packets faster than we can process them, the operating system will kindly buffer them for us. However, if this load is sustained, that buffer will eventually fill up and the operating system will simply start dropping packets.
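You can actually see (and nudge) the size of that kernel buffer from Python. A small sketch; the 4 MiB request is an arbitrary value, and the kernel is free to cap it well below that:

```python
import socket

# Inspect the default receive buffer the OS gives a UDP socket. Once this
# buffer fills under sustained load, further packets are silently dropped.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
default_bytes = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Ask for a larger buffer; the kernel may cap the granted size
# (net.core.rmem_max on Linux), so read it back to see what we got.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
granted_bytes = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()
```

A bigger buffer only delays the problem under sustained load, which is why the real fix below moves to TCP.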

Instead, it would be preferable for our backend to use TCP to receive messages. From the start nserver has supported TCP, however its implementation is quite simplistic. Even though the DNS TCP specification supports re-using connections for multiple requests, the initial implementation of the server would simply close the connection after each request.

So, although we are using a reliable protocol, we would be spending a large amount of time doing a three-way handshake on each request - a massive performance hit! A much better approach is to hold connections open so that we can process all those extra requests.
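Re-using a connection works because DNS over TCP frames each message with a two-byte big-endian length prefix (RFC 1035, section 4.2.2), so many messages can share one stream. A sketch of that framing in Python; the helper names are mine, not nserver's API:

```python
import struct

def send_dns_message(sock, payload: bytes) -> None:
    """Write one DNS message: a 2-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!H", len(payload)) + payload)

def recv_dns_message(sock) -> bytes:
    """Read exactly one length-prefixed DNS message from the stream."""
    (length,) = struct.unpack("!H", _recv_exact(sock, 2))
    return _recv_exact(sock, length)

def _recv_exact(sock, count: int) -> bytes:
    """recv() may return fewer bytes than asked for, so loop until done."""
    chunks = []
    while count:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)
```

Because each message carries its own length, the server always knows where one request ends and the next begins on a long-lived connection.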

This is definitely easier said than done, especially for a single-threaded application like ours in Python. We now need to be able to read from any open connection when data arrives, whenever that may be. The when part is actually already solved, as it is required by many high-performance networking applications. In Linux, this is provided through the select, poll, and epoll kernel interfaces. These interfaces allow you to monitor many connections at once and be alerted to which ones have data ready.

Python goes a step further with the selectors module, which hides the details of which underlying function you are using and simply selects the best one for your system. Unfortunately, this is where the simplicity ends, as we now need to manage a potentially large number of connections in a wide variety of states. A few hundred lines of changes later, though, the task was complete in version v0.3.0.
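As a minimal sketch of the selectors pattern (with a simple echo standing in for the real DNS decoding, and function names that are illustrative rather than nserver's actual API):

```python
import selectors
import socket

def create_server(host: str = "127.0.0.1", port: int = 0):
    """Create a non-blocking listening socket and a selector watching it."""
    sel = selectors.DefaultSelector()
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    listener.setblocking(False)
    # data=None marks the listening socket; accepted connections get data="conn"
    sel.register(listener, selectors.EVENT_READ, data=None)
    return sel, listener

def poll(sel: selectors.DefaultSelector, timeout: float = 0.1) -> None:
    """One pass of the event loop: accept new connections and service any
    connection with data waiting. A real server runs this in a loop forever."""
    for key, _events in sel.select(timeout):
        sock = key.fileobj
        if key.data is None:
            # The listener is readable: a new client is waiting to be accepted.
            conn, _addr = sock.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ, data="conn")
        else:
            # An existing connection has data (or has been closed by the peer).
            payload = sock.recv(4096)
            if payload:
                sock.sendall(payload)  # echo back; nserver would answer a query
            else:
                sel.unregister(sock)   # empty read means the peer hung up
                sock.close()
```

The key point is that one thread services every open connection: the selector tells us which sockets are ready, so we never block waiting on a single client.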

The Result

With all these changes, domain-park is now powered by an easily deployed, maintained, robust, and powerful server 🎉. It comfortably handles 1000 queries per second.

Even under high loads in the order of thousands of queries per second, where connections may be unexpectedly dropped, the server still manages to respond to the vast majority of queries without crashing. Even running a basic DNS fuzzer filled with invalid requests is business as usual.

As always, if you have any questions, feedback, or concerns you can: