Last week, I dealt with a critical OpenSSH vulnerability that allows unauthenticated root access to a server. It is known as regreSSHion because the bug existed in early versions of OpenSSH, was fixed, and then the fix was accidentally removed in a later release, causing the bug to return.
The vulnerability was discovered by a security company whose specialists were auditing the source code and found paths that could destabilize the server. The bug only affects OpenSSH builds linked against the glibc library.
The vulnerability works like this: when a user tries to log in to a server via SSH with a password, they have about 2 minutes to enter it; the server won't keep an unauthenticated connection open indefinitely. When the 2 minutes expire, the server receives a signal telling it to drop the connection, and the signal handler runs asynchronously, interrupting whatever the daemon was doing at that moment. In that asynchronous window a race condition can occur, leading to memory corruption, which in turn can send control to an arbitrary memory address. If malicious code resides at that address, it executes with root privileges, since the SSH daemon runs as root.
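For reference, that 2-minute window is sshd's LoginGraceTime setting, which defaults to 120 seconds. A quick way to check the effective value on a server (a sketch; it assumes root access and a stock OpenSSH install) is:

```sh
# Print sshd's effective configuration and look for the grace period.
# The race described above is triggered when this timeout expires
# in the middle of an authentication attempt.
sudo sshd -T | grep -i logingracetime
# Expected output on a default install:
#   logingracetime 120
```

If patching isn't possible right away, the published advisory reportedly suggests setting LoginGraceTime to 0 as a stopgap: it removes the timeout (and with it this particular race) at the cost of making it easier to exhaust the available connection slots.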
The malicious code, as I understand it, has to be embedded in the connection itself: the remote server receives it and temporarily keeps it in memory.
Concurrency not only hurts the readability of source code but also leads to bugs that are hard to detect and fix, and this case is no exception. As described, such an attack is non-deterministic: first you need to trigger the race condition, which is not guaranteed to happen on any given attempt, and then you need the server to jump to the right memory address, which is also a matter of probability.
The exploit was tested on a 32-bit architecture. Initially it took about 30 days to gain access; after the algorithm was improved, it takes around 6 hours. Most servers are 64-bit, where the address space is much larger, making a successful exploit even less likely, but security researchers believe it is still a solvable task.
I decided to update the SSH daemon on our servers. Some of our dedicated servers run Ubuntu 20 and even 17; those ship OpenSSH versions older than the release that reintroduced the bug, so they don't need the openssh-server update. Other machines run Ubuntu 22, which does require patching, but I didn't have to do it myself: Hetzner, our provider, updated all cloud servers.
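For the machines that do need it, the check and update are straightforward. A sketch for Ubuntu/Debian (the exact fixed version is whatever your distribution publishes; the regression is reported to affect OpenSSH 8.5p1 up to 9.7p1):

```sh
# Check which OpenSSH version is installed.
ssh -V
dpkg -l openssh-server

# Pull in the security update for the SSH server only.
sudo apt-get update
sudo apt-get install --only-upgrade openssh-server

# Restart the daemon (the unit is called "ssh" on Ubuntu).
sudo systemctl restart ssh
```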
I also fixed a problem with pg_basebackup: sometimes we couldn't back up one of the databases.
This utility might seem strange at first. When you run it, you have to pass a connection string: usually, when we want a binary backup, we simply copy or sync files, while pg_basebackup does the same thing but over a network connection. Why? As I understand it, on connecting, the utility waits for a checkpoint, and only the server knows when that checkpoint happens; after it, the file copy can begin. But what if the files change while the copy is running? We run the backup at night, yet files can still change, for example during autovacuum. A plain copy of a live database therefore yields corrupted files: partially written, partially empty, possibly captured in the middle of a transaction. Such a database is unusable and needs repairing. For this, the write-ahead log (WAL) files are added to the backup, because Postgres writes every change to the WAL first and only then to the data files. Thanks to the WAL, a consistent version of the database can be restored.
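To make that concrete, a typical nightly invocation might look like this (a sketch: the host, user, and paths are made up, and the connecting role needs the REPLICATION privilege). The --checkpoint=fast option asks the server to run the checkpoint immediately instead of waiting for the next scheduled one:

```sh
# Hypothetical nightly binary backup over a replication connection.
# pg_basebackup connects, waits for the server-side checkpoint,
# then copies the data files over the network into a tar archive.
pg_basebackup \
  --dbname="host=db.example.com port=5432 user=backup_user" \
  --pgdata=/backups/base_$(date +%F) \
  --format=tar --gzip \
  --checkpoint=fast \
  --progress
```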
The problem is that we ran the backup utility with an option that adds the WAL files only at the end of the backup. If the backup takes a long time, or the database is being written to actively, many new WAL segments appear and the old ones may be removed, yet the segments that existed at the start of the backup must make it into the backup. If they have already been deleted under the retention policy, the utility returns an error.
To solve this, I changed the option value. Now the utility opens two connections: one for the base backup and another for the WAL, so we no longer wait for the base copy to finish before fetching the necessary log segments.
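I believe in pg_basebackup terms this is the switch from --wal-method=fetch, which collects the WAL only after the base copy finishes, to --wal-method=stream, which opens that second connection and receives the WAL while the copy is still running. A sketch of the before/after, using the same hypothetical connection string as above:

```sh
# Before: WAL segments are fetched only at the end of the backup.
# If the segments from the start of the backup have already been
# recycled by then, pg_basebackup fails with an error.
pg_basebackup -d "host=db.example.com user=backup_user" \
  -D /backups/base --wal-method=fetch

# After: a second replication connection streams WAL in parallel
# with the base copy, so old segments don't need to survive until the end.
pg_basebackup -d "host=db.example.com user=backup_user" \
  -D /backups/base --wal-method=stream
```

As far as I know, --wal-method=stream has been the default in recent PostgreSQL releases, so fetch mode mostly shows up in older backup scripts.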