TLS performance has come a long way over the past few years. Widespread deployment of elliptic curve Diffie–Hellman key exchange protocols has led to significant improvements in both the speed of the initial TLS handshake and the forward secrecy of the encrypted traffic.
Despite this, that initial handshake remains the biggest performance hit from TLS. This is especially painful as it slows down the user’s initial page load. The TLS handshake must be completed before the server can even begin to process the HTTP request. While there are various things that can be done to improve this (like tuning the TCP initial congestion window), the biggest wins come from avoiding it entirely on subsequent connections.
There are two primary mechanisms for avoiding that initial handshake. The first is simply caching the TLS session on the server. When the client shows up again, it says “Hey, last time we were using session xzy123, should we just pick up where we left off?”. If the server has that session ID in its cache, it can resume the session. This has the advantage of being very simple, but has a number of disadvantages in terms of scalability. Caching all of those sessions can start to take a non-trivial amount of resources on a busy server, and if you want to scale out to multiple servers, you need to share the session cache between them (which isn’t an easy task with any of the popular open source web servers).
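For reference, server-side session caching in Nginx is just a couple of directives (the values here are illustrative, not recommendations):

```nginx
# A cache shared across all worker processes; 10 MB holds
# roughly tens of thousands of sessions.
ssl_session_cache shared:SSL:10m;

# How long a cached session remains resumable.
ssl_session_timeout 1h;
```

Note that this cache lives in memory on a single machine, which is exactly the scaling limitation described above.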
The other option is to use TLS session tickets. At the end of the handshake process, the server sends the client an encrypted ‘ticket’ which includes all the information the server needs to resume the session. When the client reconnects it sends the ticket to the server, and if the server is able to decrypt it, it can pick up with the same session. This makes the storage of the session information the client’s problem (which is fine, as it only has to worry about its own session), and makes it easy to scale out to multiple servers (it’s just a small matter of programming to distribute the TLS ticket encryption key so any server is able to decrypt the tickets).
Great! Problem solved! We just stick the same ticket encryption key on all our servers and we’re good to go! Not so fast… We have now completely broken forward secrecy for our clients. Any attacker that is able to compromise that ticket encryption key now has everything they need to decrypt the sessions it was used to store (assume our adversary was playing the long game and recording all historical traffic with the hopes of decrypting it later). Darn…
So, what to do? We need to rotate these keys on a regular basis. How frequently is up to you and your threat model. Hourly, daily, weekly… all would be reasonable choices, based on your particular needs. The exact method for how you would rotate those keys varies depending on your webserver, and your configuration management tooling, but the title of this post includes “Nginx” and “Ansible”, so I guess I’ll use those…
Nginx makes it very easy to rotate these keys. You can specify as many keys as you would like; the first one will be used to encrypt new tickets, and when a ticket is received, Nginx will attempt to decrypt it with each key in turn. To rotate the keys, overwrite the last key with the penultimate key, and work your way up, finally replacing the first key with a brand spanking new one. There are lots of ways to accomplish this, but I like Ansible, so here’s the playbook I use:
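(The original playbook isn't reproduced here; what follows is a sketch consistent with the description below. The paths, filenames, and key count are all illustrative, and the base64 step is explained in a moment.)

```yaml
# Hypothetical sketch: rotate three Nginx session ticket keys.
# Adjust key_dir and the key count to match your setup.
- hosts: webservers
  vars:
    key_dir: /etc/nginx/ticket_keys
  tasks:
    - name: Generate a fresh 48-byte key, base64-encoded so Ansible can't mangle the raw bytes
      command: openssl rand -base64 48
      register: new_key
      run_once: true
      delegate_to: localhost

    - name: Make sure all key files exist (a brand-new server gets random placeholders)
      command: openssl rand -out {{ key_dir }}/ticket_key{{ item }}.key 48
      args:
        creates: "{{ key_dir }}/ticket_key{{ item }}.key"
      with_items: [1, 2, 3]

    - name: "Rotate: key 2 becomes key 3, then key 1 becomes key 2 (the old key 3 is dropped)"
      command: cp -f {{ key_dir }}/ticket_key{{ item }}.key {{ key_dir }}/ticket_key{{ item + 1 }}.key
      with_items: [2, 1]

    - name: Decode the new key into first position
      shell: echo '{{ new_key.stdout }}' | base64 -d > {{ key_dir }}/ticket_key1.key

    - name: Reload Nginx so it picks up the rotated keys
      service: name=nginx state=reloaded
```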
This is a pretty straightforward playbook. We generate a new key, make sure all the key files exist (this could be a newly created server), then rotate them (key ‘n’ becomes key ‘n + 1’, dropping the last one), and set the new key as the first one. If we need to create new keys, we just create them at random. This means that this server won’t be able to pick up old sessions, but it will catch up with the other servers as the keys continue to get rotated.
The only odd thing in the playbook is the fact that we have to convert the new key to base64 before copying it to the server (just to convert it back to raw bytes on the other side). The issue here is that Ansible tries to be too helpful when it comes to dealing with bytes. Nginx expects a TLS session ticket key to be a file containing exactly 48 bytes. Ansible, however, tries to encode the bytes as text, munging it into something that is almost always not 48 bytes. So, we base64 encode the bytes before Ansible sees them, and then decode that base64 string into the new key file.
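The round trip is easy to see with plain shell (a standalone demonstration, not part of the playbook itself):

```shell
# Generate 48 random bytes and base64-encode them in one step,
# so the key travels as plain text rather than raw bytes.
key_b64=$(openssl rand -base64 48)

# On the server side, decode back to the raw key file Nginx expects.
echo "$key_b64" | base64 -d > ticket_key1.key

# Nginx requires the file to be exactly 48 bytes.
wc -c < ticket_key1.key    # prints 48
```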
One important security consideration this post glosses over is the fact that you really shouldn’t store these keys on persistent storage (if someone gets access to your hardware, they could forensically recover old keys). The simplest solution would just be to store them on an in-memory filesystem. If you don’t reboot your servers that often, this would probably work. If you do reboot them frequently, then you’ll need some means to distribute the last N keys to each server after a reboot, otherwise you’ll lose most of the benefit of rolling the keys in the first place. If you’re interested in solving this problem at scale, check out CloudFlare’s excellent write-up of how they’re handling it.
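For the simple case, a small tmpfs mounted over the key directory keeps the keys out of persistent storage (the path and size here are illustrative):

```
# /etc/fstab — mount the key directory as an in-memory filesystem
tmpfs  /etc/nginx/ticket_keys  tmpfs  size=1m,mode=0700  0  0
```

Everything under that mount point vanishes on reboot, which is the point, and also why you need the redistribution step mentioned above.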
The other minor detail is that if you are serving multiple hostnames from the same server, they should each use a different set of keys. If you were using this playbook, you could either run it multiple times with different key paths, or just tweak the playbook a bit to handle the different hosts.
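In Nginx terms, that just means pointing each `server` block at its own key files (the hostnames and paths here are made up):

```nginx
server {
    server_name example.com;
    # Keys for this host only; the first key encrypts new tickets,
    # the rest are tried in turn for decryption.
    ssl_session_ticket_key /etc/nginx/ticket_keys/example.com/1.key;
    ssl_session_ticket_key /etc/nginx/ticket_keys/example.com/2.key;
}

server {
    server_name other.example;
    ssl_session_ticket_key /etc/nginx/ticket_keys/other.example/1.key;
    ssl_session_ticket_key /etc/nginx/ticket_keys/other.example/2.key;
}
```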
This post is intended to lay out the 20% of effort needed to solve 80% of the problem. Even if you store them on persistent storage, simply enabling TLS session tickets and rotating your keys will put you well ahead of most of the TLS deployments out there.