A minimal container runtime written in Python for educational purposes only. It demonstrates the three Linux features that make containers work: namespaces, cgroups, and chroot.
The implementation is most likely full of all kinds of security holes, so don't use it for anything you care about.
pycrate must run on Linux as root.
    cd /pycrate
    python3 pycrate.py run <image.tar.gz> <command> [options]

    # Run a shell inside a container with resource limits
    sudo ./pycrate.py run ubuntu.tar.gz /bin/bash --hostname mycontainer --memory 64 --cpu 50

| Flag | Description | Example |
|---|---|---|
| `--hostname` | Set the container's hostname | `mybox` |
| `--memory` | Memory limit in MB | `64` |
| `--cpu` | CPU limit as a percentage | `50` |
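A command line with this shape could be parsed with `argparse`. This is a minimal sketch, not pycrate's actual parser: the `build_parser` name, the `action` destination, and the `None` defaults are assumptions made for illustration. Only the flags and their meanings come from the table above.

```python
import argparse


def build_parser():
    """Hypothetical parser for: pycrate.py run <image> <command> [options]."""
    parser = argparse.ArgumentParser(prog="pycrate.py")
    subcommands = parser.add_subparsers(dest="action", required=True)

    run = subcommands.add_parser("run", help="run a command inside a container")
    run.add_argument("image", help="path to the image tarball, e.g. ubuntu.tar.gz")
    run.add_argument("command", help="program to execute inside the container")
    # Assumed defaults: None is taken to mean "no limit" / "keep the default".
    run.add_argument("--hostname", default=None, help="container hostname")
    run.add_argument("--memory", type=int, default=None, help="memory limit in MB")
    run.add_argument("--cpu", type=int, default=None, help="CPU limit as a percentage")
    return parser


args = build_parser().parse_args(
    ["run", "ubuntu.tar.gz", "/bin/bash", "--hostname", "mybox", "--memory", "64", "--cpu", "50"]
)
print(args.image, args.command, args.hostname, args.memory, args.cpu)
```

Parsing `--memory` and `--cpu` as integers up front means a typo like `--memory 64MB` fails before any root-only work starts.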
When you run `pycrate.py run`, it goes through four steps that mirror what Docker does with `docker run`:
    ubuntu.tar.gz -> /tmp/container-<id>/
                       |-- bin/
                       |-- etc/
                       |-- lib/
                       |-- proc/
                       |-- ...
The image tarball is extracted into a temporary directory. This becomes the container's root filesystem. A real container runtime like Docker uses a layered filesystem (overlayfs), but a flat tarball achieves the same result for our purposes. The container gets a populated filesystem to run in.
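The extraction step can be sketched with the standard library. This is an illustration under assumptions, not pycrate's actual code: the `extract_image` name is hypothetical, and letting `tempfile.mkdtemp` choose the `<id>` suffix is one possible way to get the `/tmp/container-<id>/` layout shown above.

```python
import os
import tarfile
import tempfile


def extract_image(image_path):
    """Extract a .tar.gz image into a fresh temporary directory and return its path."""
    # mkdtemp appends a random suffix, standing in for the <id> part of the path.
    rootfs = tempfile.mkdtemp(prefix="container-")
    with tarfile.open(image_path, "r:gz") as tar:
        # Newer Pythons support extraction filters. A root filesystem needs
        # setuid bits, device nodes, and absolute symlinks preserved, so this
        # sketch opts out of filtering. Only do that with images you trust.
        if hasattr(tarfile, "fully_trusted_filter"):
            tar.extractall(rootfs, filter="fully_trusted")
        else:
            tar.extractall(rootfs)
    return rootfs
```

The returned path is what later steps would treat as the container's root, and what the parent would delete during cleanup.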
cgroups (control groups) are the kernel mechanism for limiting resources. They work through a virtual filesystem at /sys/fs/cgroup. To limit a process:
- Create a directory under `/sys/fs/cgroup/` (this is the cgroup)
- Write limits to files in that directory (`memory.max`, `cpu.max`)
- Write the process's PID to `cgroup.procs`
The kernel then enforces those limits. If a process exceeds its memory limit and the kernel cannot reclaim enough memory, the OOM killer terminates it. CPU limits use a quota/period model: for example, 50% CPU means the process gets 50ms of CPU time per 100ms window.
The current process is added to the cgroup before namespaces are created, so the forked child inherits membership automatically.
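The three cgroup steps above can be sketched as plain file writes. This is a sketch assuming cgroup v2, not pycrate's actual code: the function names and the 100ms default period are assumptions. The file names and the MB and percentage units come from the text.

```python
import os


def memory_max_value(limit_mb):
    """memory.max takes a byte count, so convert from MB."""
    return str(limit_mb * 1024 * 1024)


def cpu_max_value(percent, period_us=100_000):
    """cpu.max takes "<quota> <period>", both in microseconds."""
    quota_us = period_us * percent // 100
    return f"{quota_us} {period_us}"


def apply_limits(cgroup_dir, pid, memory_mb=None, cpu_percent=None):
    """Create the cgroup directory, write the limits, then add the process."""
    os.makedirs(cgroup_dir, exist_ok=True)
    if memory_mb is not None:
        with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
            f.write(memory_max_value(memory_mb))
    if cpu_percent is not None:
        with open(os.path.join(cgroup_dir, "cpu.max"), "w") as f:
            f.write(cpu_max_value(cpu_percent))
    # Joining the cgroup last means the limits are in place before they apply.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


print(memory_max_value(64))  # 67108864
print(cpu_max_value(50))     # 50000 100000
```

As root on a cgroup v2 host, a call might look like `apply_limits("/sys/fs/cgroup/pycrate-demo", os.getpid(), memory_mb=64, cpu_percent=50)`, where `pycrate-demo` is a made-up name. Depending on the host, the `memory` and `cpu` controllers may also need to be enabled in the parent's `cgroup.subtree_control` before these files exist.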
    os.unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET | CLONE_NEWIPC)

`unshare()` tells the kernel to give this process and its children isolated versions of system resources. Each namespace type isolates something different:
| Namespace | Flag | What it isolates |
|---|---|---|
| PID | `CLONE_NEWPID` | Process IDs (container sees only its own processes) |
| Mount | `CLONE_NEWNS` | Mount table (container's mounts don't affect host) |
| UTS | `CLONE_NEWUTS` | Hostname (container gets its own hostname) |
| Network | `CLONE_NEWNET` | Network stack (container gets its own interfaces) |
| IPC | `CLONE_NEWIPC` | Inter-process communication (shared memory, etc.) |
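The flags in the table are bit masks that get OR-ed into a single integer. The sketch below spells out their Linux values so the combination is visible. The `CONTAINER_FLAGS` and `enter_namespaces` names are hypothetical, and `os.unshare` is only available on Linux with Python 3.12 or newer.

```python
import os

# Linux flag values from <linux/sched.h>, written out so the sketch also
# reads on Python versions that predate the os.CLONE_* constants.
CLONE_NEWNS = 0x00020000   # mount table
CLONE_NEWUTS = 0x04000000  # hostname
CLONE_NEWIPC = 0x08000000  # inter-process communication
CLONE_NEWPID = 0x20000000  # process IDs
CLONE_NEWNET = 0x40000000  # network stack

CONTAINER_FLAGS = (
    CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET | CLONE_NEWIPC
)


def enter_namespaces(flags=CONTAINER_FLAGS):
    """Detach the calling process into new namespaces (needs root)."""
    os.unshare(flags)


print(hex(CONTAINER_FLAGS))  # 0x6c020000
```

One detail worth knowing: with `CLONE_NEWPID`, the calling process itself stays in its original PID namespace. The first child it forks afterwards becomes PID 1 in the new one, which is why a fork has to follow the `unshare()` call.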
After creating namespaces, the process forks:
            os.fork()
           /         \
    Child (pid=0)     Parent
          |             |
    sethostname()     waitpid(child)
          |             |
    setup_filesystem()  cleanup()
          |
    os.execvp(command)
The child becomes the container:
- Sets the hostname inside the UTS namespace using `socket.sethostname()`
- Mounts `/proc` inside the new root. This is a virtual filesystem that exposes process info. Without it, `ps` and `top` won't work.
- Calls `chroot(rootfs)`, which tells the kernel "for this process, `/` now means `rootfs`." The container can only see files inside the extracted image.
- Calls `os.execvp(command)`, which replaces the process with the requested command (e.g. `/bin/bash`). The container is now "running."
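The filesystem part of the child's work might look roughly like the sketch below. `setup_filesystem` is the name used in the diagram, but this body is a guess at what it does, not the actual implementation. Shelling out to the `mount` binary and the `proc_mount_command` helper are assumptions made to keep the example short.

```python
import os
import subprocess


def proc_mount_command(rootfs):
    """Build the argv that mounts a fresh procfs at <rootfs>/proc."""
    return ["mount", "-t", "proc", "proc", os.path.join(rootfs, "proc")]


def setup_filesystem(rootfs):
    """Mount /proc inside the new root, then confine the process to it (needs root)."""
    os.makedirs(os.path.join(rootfs, "proc"), exist_ok=True)
    subprocess.run(proc_mount_command(rootfs), check=True)
    os.chroot(rootfs)
    # chroot does not change the working directory, so move into the new root.
    # Otherwise the process would still be sitting in a directory outside it.
    os.chdir("/")
```

Mounting before `chroot` lets the sketch use the host's `mount` binary, so the image does not need to ship one.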
The parent manages the lifecycle:
- Waits for the child to exit
- Moves itself out of the cgroup and removes it
- Deletes the temporary rootfs directory
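The fork, exec, and wait flow described above follows a standard POSIX pattern, sketched here without any of the root-only steps so it can run anywhere. The `run_and_wait` name, the `child_setup` hook, and the 127 exit code on failure are assumptions of this sketch.

```python
import os


def run_and_wait(argv, child_setup=None):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0 here.
        try:
            if child_setup is not None:
                child_setup()  # e.g. set the hostname, mount /proc, chroot
            os.execvp(argv[0], argv)  # on success this never returns
        finally:
            # Reached only if setup or exec raised. Exit immediately so the
            # child never falls through into the parent's cleanup code.
            os._exit(127)
    # Parent: block until the child exits, then decode the wait status.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


print(run_and_wait(["sh", "-c", "exit 3"]))  # 3
```

In a runtime like this one, the parent's cleanup (leaving and removing the cgroup, deleting the rootfs directory) would run after `run_and_wait` returns. `os.waitstatus_to_exitcode` requires Python 3.9 or newer.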
This is an educational tool and I am no expert in low-level kernel interfaces. Some differences that I know about vs something like Docker:
- chroot instead of pivot_root - `chroot` is a path-lookup trick that a root process can escape. Docker uses `pivot_root`, which actually swaps the mount namespace root, making escape much harder.
- No overlayfs - Docker uses layered filesystems so multiple containers can share base image layers. We extract a flat tarball each time.
- No networking - We create a network namespace but don't set up virtual interfaces, so the container has no network access.
- No cgroup namespace - We skip `CLONE_NEWCGROUP` because our single-process architecture (unlike Docker's daemon model) means the parent needs access to `/sys/fs/cgroup` for cleanup after the container exits.
- Must run as root - Docker's daemon handles privilege separation. We require root directly.