The first time I heard about 'containers' I thought that containers were very small sized stripped down nodes. Or maybe a process that would cut up nodes their hardware in such a way that the machine was divided into isolated components? Well, I can tell you that's not the case.
First things first, if you are reading this article I assume that you've already worked with some kind of containerization. It's good to note that containers are not VMs. Containers provide a way to virtualize an operating-system so that multiple workloads run on top of a single operating-system (in a sense that the same kernel is shared among the containers). VMs virtualize the hardware so that every VM has it's own isolated instance of an operating-system (by using a hypervisor). This might be hard to grasp when you start learning about containerization and seems abstract, but I'll try my best to form a clear understanding of containerization for you.
The container metaphor
Containerization provides a clean separation of concerns. They offer an abstraction from the environment they are run on top of. We pack all of the dependencies and configuration together so we can easily move it around.
An analogy I like to refer to when describing containerization is that we have small apartments that share plumbing and central heating with other small apartments, but every appartment has its own furntiture, its own locked front door and some apartments may have a cat running around the place called Gizmo, whilst others don't. We'll put the apartment in a box including a template of how plumbing and central heating should be done incase a new energy company decides to take over, and we call it a 'container'.
Due to this layer of system abstraction a container becomes very portable, whether you want to deploy the container to a staging environment or production, the content of the container remains the same (from application code to installed dependencies). If we wanted to move our apartment to another building we have everything ready. We now own metadata from if the cat is running around the place or not, to metadata of the available furntiture, to the template of how the plumbing and central heating should be done.
This allows for a far more lightweight portable experience. We have less to bootstrap and use just a fraction of the memory compared to providing a new isolated instance of an operating-system.
So what makes up a container?
How are processes isolated whilst they use the same kernel? To answer this question we'll need to dive into the Linux kernel. Are you already on the edge of your seat?
The Linux kernel has a few features such as:
These features allow us to securely isolate processes from each other which in return makes up this 'apartment in a box' we spoke about earlier, called a 'container'.
But how do you use these kernel features and what makes it so that we can ensure it's isolated? Lets take a look at some of these Linux kernel features.
Namespaces are one of the building blocks of isolation, they perform the isolation at a kernel level. Similar to how chroot works, which jails the process in a different root directory, namespaces separate other aspects of the system. It's an integral part of the architecture of tools like Docker. Namespaces were added in the 2.6.24 Linux kernel release back in 2008 (it's good to note that there are several types of namespaces added to the kernel, not all of them were added in this release).
The Mount or mnt. namespace.
The Mount namespace virtually partitions the file system so that processes running in a mount namespace cannot access file outside of their mount point.
The Process or pid. namespace.
In Linux processes follow a branch, a so called process tree, every PID in the tree refers to an active process. The process namespace cuts off a branch of the process tree, and doesn’t allow access further up the process branch. This is essential because if a process is priviliged to inspect or may even be able to kill other processes we would have a problem. The process that does this remains in the parent namespace, in the original process tree, but makes the child the root of its own process tree.
With PID namespace isolation, processes in the child namespace have no way of knowing of the parent processes its existence. However, processes in the parent namespace have a complete view of processes in the child namespace, as if they were any other process in the parent namespace. The PID namespace prevents the process from seeing or interacting with other processes.
The Network or net. namespace.
A network namespace allows processes to see a different set of networking interfaces and routing tables. In order to provide a usable network interface in the child network namespace we are required to create an additional virtual network interface.
To have incoming and outgoing traffic we must be able to communicate from the parent network namespace to the virtual interface in the child network namespace. This can be achieved by creating a bridge and a routing process. As this is quite a complex subject I recommend to read upon this yourself.
We now know the theory behind system isolation at the kernel level, but what if we also wanted to restrict certain resource usages of this isolated component? This is where control groups come in handy. Lets say that I want this specific apartment to not use more electricity than 0,8702 kWh when someone is home, this can be interpreted as not using more than N amount of memory for a group of isolated processes.
What if we wanted to prevent certain syscalls to be made from the container to the kernel? Apartment 121 should not have the rights to turn on the lights, could be the equivalent of telling a container that it has no rights to access the internet. Telling a container what its allowed to do and what not is defined in profiles. You can pass different profiles to different containers. The default profile for a Docker container can be viewed here in the Docker repository on Github, it blocks 44 syscalls out of the 300 available ones.
Seccomp uses the Berkeley Packet Filter program, this allows you to also setup custom filters. You can limit syscalls by providing conditions on how or when it should be limited when the syscall is made. The seccomp filters replace the syscalls with pointers to the Bpf program, which will perform the execution instead of the syscall.
Lets make a container!
Now that we have a basic understanding of the features, lets do some hands on by making a 'container' from scratch. It turns out that it's pretty easy to make new namespaces, you can just run unshare. Time to make a new PID namespace and run Bash in it.
Where are we? It seems like we are in the branched off process tree we spoke about earlier, I'm not able to see any other processes running in this process tree besides Bash, that's good, we like isolation!
I also added a '--net' argument to the 'unshare' command we ran earlier, this creates an independent IPv4 and IPv6 stack. By default new network namespaces will have a loopback device but no other network devices. To be able to access the internet from your network namespace I recommend reading upon network namespace configurations.
Time to fiddle around with the control group we spoke about earlier. I gave an example of restricting a group of processes their memory usage, before being able to do this, we need to make a new level in the cgroup hierarchy. I named it 'mymemgroup'.
Lets see what's inside of the 'mymemgroup' memory cgroup..
It looks like a bunch of files containing configurations for the memory cgroup.
So many possible ways of restricting our memory usage, lets try to see what the hard limit for our allocated kernel memory is by checking the value of memory.kmem.limit_in_bytes.
This looks like the highest positive signed 64-bit integer (263-1) rounded down to multiples of 4096 (212), but why? After some digging in the Linux Kernel I found that the value is set to the default PAGE_COUNTER_MAX and it's multiplied by PAGE_SIZE when reading the value . It's not the exact number of possible memory in bytes as it's rounded down to where the last bytes drop off.
This number is the maximum for holding a number in binary in a memory store and apparently it's the default value for our memorygroup it's memory.kmem.limit_in_bytes property.
Lets try to alter the value and lower it to 5 megabytes, I rather be on the safe side with our container.
Time to use the cgroup, we can use cgexec to run a task in a given cgroup. Lets run Bash in our 5 megabyte restricted cgroup!
Obviously we want to test the behaviour of the cgroup if we exceed a certain amount of memory allocation, in our case 5 megabytes. I think that 'yum get update' uses a bit more to initiate this process.
'Killed'! It looks like the OOM ("Out of Memory") Killer terminated the process and did its job correctly, thanks OOM Killer.
The OOM Killer is daemon that is called by the Linux kernel if the system is critically low on memory or when a process is trying to bypass a memory limit. The OOM Killer has many more features than the ones I just explained, I recommend to read upon it.
It seems like we did a good job! We created an isolated environment and limited our memory usage successfully.
We have half a 'container' now, I think?
Now that you have half a container you're probably already thinking on how to automate it, run it on top of a machine and have something really cool because it's lightweight and everything is isolated! That's what Docker did, and docker has lots of containerization experts. I would advice you to not try and reinvent the wheel. Playing around with these Linux kernel features is great to get a good grasp on containerization, but that's it (for me). Of course Docker has much more advanced features, but the building blocks are more or less these Linux kernel primitives. If you would like to dive more into this subject, I recommend reading the footnotes at the end of this article.
Containers and microservices
Containers are best used in a microservice architecture, as each service is independent and loosely-coupled to serve its own responsibility as a service proxy, we can easily assign one container to one microservice and measure by this rule. As containers are easy to spin up, it becomes easy to scale-out. Microservices will be able to scale independent from each other. Data can be decentralized as there is no need to maintain the same type of database in every microservice. One might use a document store such as Elasticsearch while another one uses a simple key-store database such as Redis.
Every microservice requires lifecycle management and task management such as: load balancing, scaling policies, routing, health monitoring and security policies. There's some great tools out there such as Kubernetes to orchestrate your containers.
Lets take a local supermarket as an example. The supermarket is split up in 3 logical independent components (microservices).
- A payment gateway API that is the cash register counter.
- An inventory API that keeps track of the items in the store.
- A database that acts as our supplier.
If we suddenly get a burst of payment gateway requests it would be illogical to open up an entire new store, because why would we want to scale out both our inventory API and database if there was no additional load? We rather add another 2 cash register counters, and quick! If one of the employees at the cash register counter would feel unwell due to all the busyness, we would lose a 'container'. In this case tools like Kubernetes would add a new cash register counter to the store and assign a healthy employee to stand behind it.
Bottomline, containers allow for flexibility and portability. Engineers should be able to ship a container to any environment without having to worry about the content (code, runtime, dependencies and configuration) of the container.
Most of my research was done by reading the man pages.
I've completed a course and read a book from Nigel Poulton to brush up on my Kubernetes skills, he's a great educator. We like great educators.
IBM their spectrum symphony contains a very detailed cgroup explanation, I recommend reading this.
Did you enjoy this read? Feel free to buy me a coffee! :)
Contact me? You can do that through firstname.lastname@example.org or LinkedIn.
If you're looking for other articles I recommend you to look in the library.