Prevent docker from using runc

🚧🚧🚧🚧 WARNING!!!! Do NOT do this unless you know what it does and why you would want to do it!!!! 🚧🚧🚧🚧

Feel free to read about it though 🙂


Preface...

So...I'm not sure I would actually recommend doing this. After I finished this project/post, I went back to look at the container/blog again and rebooted the server ( which I had done before, but apparently this time was different ). The container seemed to be running, but I couldn't get any docker command to execute. I believe the commands were timing out, and I don't know whether that was caused by me removing runc or by something else. So, if you want to do this, I recommend doing more testing to validate that it works properly. I ended up going with another solution, which I plan to cover in another post soon.


TL;DR:

Move the runc binary out of your path. You can find it by running which runc; mine was located at /usr/bin/runc. I don't think you can remove containerd.io and still have docker function, but if docker can't access the runc binary, it can't launch containers with it. I just moved mine to root's home, in case I need it in the future. ( More below about other things I am going to do with updates in mind... )

container not being able to launch

Backstory

Why?

So, I have recently been talking about sysbox more, and I wanted to migrate this blog to a container instead of the marketplace deployment that DigitalOcean provides ( which has been pretty much flawless for me, and I would recommend it if you want to give Ghost a try 👻😁 ). I was thinking about alternative solutions for running docker containers from a lower privileged user, to help prevent privilege escalation ( priv esc ). I tried podman-compose, the snap version of docker/docker-compose, docker w/ns-remap, and finally the official docker-compose with sysbox ( currently the winner ( not anymore, read my preface ) ).

Why compose?

I want to eventually get everything I touch configured as IaC ( Infrastructure as Code ), and I think the easiest way for me to do that is to manage a bunch of docker-compose.yml files, instead of having to build my own process that checks for running containers and creates them predictably. So, I want to use the -compose version of these tools. ( I am not ready to run everything in k8s ( kubernetes ) yet 🙂 )

podman-compose

podman-compose ( the link I put here explains what it is ) still seems to be a fairly young project, and while I don't know if it will work out in the long run, I wanted to try it after hearing about it on a podcast episode ( it's been a while, and I don't remember which one 😅 ). So, I spun up a fedora instance on digitalocean ( in my experience podman isn't very native on the debian ( specifically ubuntu ) flavors ), and tried to get things working with podman-compose. I initially got some different errors, but alas I didn't take a screenshot then 🙃. This time I used an almost identical version of my current docker-compose.yml ( changed the password stuff 😁 ). I still got an error ( pic below ), and at the time of writing this blog they even note that they are in a development phase.

permission denied podman-compose

So, I don't fault them, and I know I should have submitted an issue but I don't have time/feel like it now 🙃. ( maybe in the future, but please let me know if I did something wrong, because I would love to use this )

snap docker/docker-compose

Honestly I love that ubuntu has this docker snap and for all my desktop linux installations I prefer to use it, and even for some of my homelab vms that are internal facing. It is super easy, just run these commands on your ubuntu machine:

sudo addgroup --system docker
sudo adduser $USER docker
newgrp docker
sudo snap install docker

That installs docker and adds some special apparmor magic ( that I don't know the details about completely 🙃 ), but it has been instrumental in preventing a 0-day exploit that I was testing out to see if I was vulnerable.

While I love this version of docker, one thing I did try was starting a container with the host file system mounted ( instructions here by one of my co-workers if you want to know more ), which it allowed when I tested it. So, that didn't really close off the attack surface I was trying to eliminate...which means next candidate... ( please let me know if I did something wrong, because I would love to use this )

docker w/ns-remapping

So...this is the good ol' standby ( the userns-remap option in the docker daemon config ) that almost all security people ( including myself, as you can see here ) tell you to use. The problem is that it is a bit hacky when you are using volume bind mounts, because all the files have to be chowned to the remapped ID of the container user. I did this for a pi-hole setup on my LAN a while back, but this time I didn't want to deal with any potential file permission issues. So, I opted for the solution below...
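To make the file ownership issue a bit more concrete, here is a rough sketch of how the host-side owner for a bind mount could be worked out. It assumes the remap user is docker's default dockremap and that it has a single range in /etc/subuid; mapped_uid is a hypothetical helper written for illustration, not something docker ships.

```shell
# Sketch: translate a UID inside the container to the UID the host sees
# when user namespace remapping is enabled.
# mapped_uid is a hypothetical helper, not a docker command.
mapped_uid() {
    subuid_file="$1"    # usually /etc/subuid
    remap_user="$2"     # assumption: dockremap, docker's default remap user
    container_uid="$3"  # UID as seen inside the container, e.g. 0 for root

    # Each /etc/subuid line looks like name:start:count; take the start.
    start="$(awk -F: -v u="$remap_user" '$1 == u { print $2; exit }' "$subuid_file")"
    if [ -z "$start" ]; then
        echo "no subuid range found for $remap_user" >&2
        return 1
    fi

    # Container UID N maps to host UID start + N within that range.
    echo $((start + container_uid))
}

# Usage (as root), assuming ./data is the bind-mounted directory:
#   chown -R "$(mapped_uid /etc/subuid dockremap 0)" ./data
```

Every bind-mounted directory needs this treatment, which is the part that felt hacky to me.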

official docker + sysbox

As you saw in the tweet I linked above in the backstory section, I learned about sysbox from the podcast episode SEDaily did with them. I thought it sounded like a really cool technology, and I have some future plans for it on one of my personal projects as well.

One of the really amazing things I love about sysbox is the shift-fs component, which prevents users from mounting volumes that are above their uid ( pic below ).

sysbox prevents mounting root

So, this gives me the protection against priv esc that I was looking for! The issue is that the default runc runtime is still installed. So, if someone were able to access the docker cli ( or more importantly the docker socket, i.e. become the local ghost user on the server somehow... ), they could just specify the runc runtime with --runtime=runc. So, keep reading to see how I set things up to prevent this.
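Here is a minimal sketch of how that bypass could be checked. It tries to list root's home on the host through a bind mount of /, using whichever runtime you pass in. can_read_host_root is a hypothetical helper, and the alpine image is just an assumption ( any small image with ls should do ).

```shell
# Sketch: can a container started with the given runtime read root's home
# on the host? Returns 0 if it can, non-zero otherwise.
# can_read_host_root is a hypothetical helper; alpine is an assumed image.
can_read_host_root() {
    runtime="$1"   # e.g. runc or sysbox-runc
    docker run --rm --runtime="$runtime" -v /:/host alpine \
        ls /host/root >/dev/null 2>&1
}

# Usage (as the unprivileged user in the docker group):
#   if can_read_host_root runc; then
#       echo "the runc override still works"
#   fi
```

Note that a non-zero result only says the listing failed. It does not say why, so it is worth running the docker command by hand once to see the actual error.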

Setup

system config

I added a lower privileged user: sudo adduser --disabled-{password,login} --gecos '' ghost, with no sudo rights or login ( essentially a service user ).

Added them to the docker group: usermod -aG docker ghost

docker

So, I just did the stock install command for docker recommended here: curl -fsSL https://get.docker.com | bash - ( I pipe to bash because I have audited the script before and noticed that it was in docker's GH repo, so we would have bigger problems if that was compromised... ).

Added docker-compose with this script.

sysbox

I did their standard install instructions, manual for now but I plan to automate it with an ansible role eventually, and configured it to be the default runtime environment.
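For a quick sanity check that the default actually took effect, something like the sketch below might help. It assumes the runtime is registered under the name sysbox-runc ( normally done through the default-runtime and runtimes keys in /etc/docker/daemon.json ); default_runtime and require_sysbox_default are hypothetical helper names.

```shell
# Sketch: confirm which runtime docker uses when none is requested.
# default_runtime and require_sysbox_default are hypothetical helpers.
default_runtime() {
    docker info --format '{{.DefaultRuntime}}'
}

require_sysbox_default() {
    current="$(default_runtime)" || return 1
    if [ "$current" != "sysbox-runc" ]; then
        echo "default runtime is '$current', expected sysbox-runc" >&2
        return 1
    fi
}

# Usage:
#   require_sysbox_default && echo "sysbox is the default runtime"
```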

The problem

So, after all that setup ( and as I briefly talked about in the docker + sysbox section ) and the trials of different platforms, I have a solution that is working and not allowing people to priv esc to root from the containers I am currently running...but there is a problem... If anyone is able to escape out of that container and become the unprivileged docker user that runs the containers, eventually they can figure out how to specify the default runc runtime, and I will be back in the same situation. So, how do I stop that?

Granted, this is kind of hacky in and of itself, but I simply moved the runc binary ( which was at /usr/bin/runc for me ) to root's home, so no one else can access it. I really wish there was a better way, but from what I understand you can't just rip out the containerd.io dependency, because that package includes containerd as well as runc...So, that is my answer for now, but if you have any other suggestions please let me know!
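The move itself could be sketched roughly like this. relocate_binary is a hypothetical helper, and the paths in the usage comment are the ones from my server, so check yours with which runc first. Keep in mind that a later upgrade of the containerd.io package would likely put the binary back, so this probably has to be repeated after updates.

```shell
# Sketch: move a runtime binary out of PATH into a root-only directory.
# relocate_binary is a hypothetical helper, not part of docker or sysbox.
relocate_binary() {
    src="$1"       # e.g. /usr/bin/runc
    dest_dir="$2"  # e.g. /root

    if [ ! -f "$src" ]; then
        echo "nothing to move: $src not found" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    mv "$src" "$dest_dir/" || return 1

    # Strip all group/other permissions from the relocated copy.
    chmod go-rwx "$dest_dir/$(basename "$src")"
}

# Usage (as root):
#   relocate_binary /usr/bin/runc /root
```

To undo it, move the file back to where it came from.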

Technical Info

final docker-compose.yml config

version: '3.8'

services:

  ghost:
    image: ghost:4
    restart: unless-stopped
    ports:
      - 127.0.0.1:8080:2368
    environment:
      # see https://ghost.org/docs/config/#configuration-options
      database__client: mysql
      database__connection__host: db
      database__connection__user: ghost
      database__connection__password: '<password>'
      database__connection__database: ghost
      # this url value is just an example, and is likely wrong for your environment!
      url: http://localhost:8080
    volumes:
      - "./ghost_data:/var/lib/ghost/content"
    cap_drop: 
      - ALL

  db:
    image: mysql:5.7
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: '<password>'
      MYSQL_DATABASE: ghost
      MYSQL_USER: ghost
      MYSQL_PASSWORD: '<password>'
    volumes:
      - "./ghost_db:/var/lib/mysql"
    cap_drop: 
      - ALL