Are Docker volumes a better option for write-heavy operations than binding directories directly?

Reading through the Docker documentation, I found this passage:
Block-level storage drivers such as devicemapper, btrfs, and zfs perform better for write-heavy workloads (though not as well as Docker volumes).
So does this mean that one should always use Docker volumes when expecting lots of persistent writing?

The container-local filesystem never stores persistent data, so you don't have a choice but to mount something into the container if you want data to live on after the container exits. The "block-level storage drivers" you quote discuss particular install-time options for how images and containers are stored, and aren't related to any particular volume or bind-mount implementation.
As far as performance goes, my general expectation is that the latency of disk I/O will far outweigh any overhead of any particular implementation. Without benchmarking any particular implementation, on a native Linux host, I would expect a named volume, a bind-mount, and writes to the container filesystem to be more or less similar.
From a programming point of view, you will probably get better long-term performance improvement from figuring out how to have fewer disk accesses (for example, by grouping together related database requests into a single transaction) than by trying to optimize the Docker-level storage.
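To illustrate the "fewer disk accesses" point, here is a minimal sketch using Python's built-in sqlite3 module. The table name `events` and the helper names are made up for the example, and the actual speed difference depends on your disk and filesystem, so treat it as something to try rather than a benchmark.

```python
import os
import sqlite3
import tempfile
import time


def insert_one_commit_per_row(conn, rows):
    # Committing after every row makes SQLite finish a transaction per insert.
    for row in rows:
        conn.execute("INSERT INTO events (payload) VALUES (?)", (row,))
        conn.commit()


def insert_in_one_transaction(conn, rows):
    # "with conn" wraps all inserts in one transaction and commits once.
    with conn:
        conn.executemany(
            "INSERT INTO events (payload) VALUES (?)", ((r,) for r in rows)
        )


def timed_insert(insert_fn, rows):
    # Fresh on-disk database so both strategies start from the same state.
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
        conn.execute("CREATE TABLE events (payload TEXT)")
        conn.commit()
        start = time.perf_counter()
        insert_fn(conn, rows)
        elapsed = time.perf_counter() - start
        count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
        conn.close()
    return count, elapsed


if __name__ == "__main__":
    rows = [f"event-{i}" for i in range(500)]
    for fn in (insert_one_commit_per_row, insert_in_one_transaction):
        count, elapsed = timed_insert(fn, rows)
        print(f"{fn.__name__}: {count} rows in {elapsed:.3f}s")
```

Both functions store the same rows; the only difference is how many times the database has to commit to disk. Running it inside a container against a bind mount, a named volume, and the container filesystem is one way to check the "more or less similar" expectation on your own setup.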
The one prominent exception to this is that bind mounts on macOS are known to be very slow, and you should avoid them if your workload involves substantial disk access. (This includes both reading and writing, and includes some interpreted languages that want to read in every possible source file at startup time.) If you're managing something like database storage where you can't usefully access the files directly anyway, use a named volume. For your application code, COPY it into an image in a Dockerfile and do not overwrite it at run time.

should always use docker volumes when expecting lots of persistent writing?
It depends.
Yes, you want some kind of storage external to the container for any persistent data, since data written inside the container is lost when that container is removed.
Whether that should be a host bind or a named volume depends on how you need to manage that data. A host volume is a bind mount to the host filesystem. It gives you direct access to that data, but that direct access also comes with uid/gid permission issues and loses the initialization feature of named volumes.
A named volume with all the defaults is just a bind mount to a folder under /var/lib/docker, so performance would be the same as a host volume if the underlying filesystem is the same. That said, a named volume can be configured to mount just about anything you can mount with the mount command.
Since each of these options can sit on a different underlying filesystem, and the performance difference comes from that underlying filesystem choice, there's no way to answer this in any generic sense. Hence, it depends.

Related

Does Docker manage the filesystem like a standalone OS?

I have a program I'm running in a docker container. After running for 10-12 hours, the program terminated with filesystem-related errors (FileNotFoundError, or similar).
I'm wondering whether the disk filled up, some similar filesystem-related issue occurred, or there was a problem in my code (e.g., one process deleted the file prematurely).
I don't know much about how docker manages files, and I wonder whether docker creates and manages its own FS inside the container. Here are three possibilities I'm considering; I mainly wonder whether #1 could be the case.
1. If docker manages its own filesystem, could it be that although disk space is available on the host machine, the docker container ran out of its own storage space? (I've seen similar issues with a process running out of memory when a memory limit was artificially imposed using cgroups.)
2. Could it be that the host filesystem ran out of space and the files got corrupted or didn't get written correctly?
3. There is some bug in my code.
This is likely a bug in your code. Most programs print the error they encounter, and when a program encounters out-of-space, the error returned by the filesystem is: "No space left on device" (errno 28 ENOSPC).
If you see FileNotFoundError, that means the file is missing. My best theory is that it's coming from your consumer process.
It's still possible though, that the file doesn't exist because the producer ran out of space and you didn't handle the error correctly - you'll need to check your logs.
It might also be a race condition, depending on your application. There really aren't enough details to answer that.
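To make the distinction between "file is missing" and "disk is full" concrete, here is a small sketch of how a producer could report write failures instead of silently dropping them. The function names `classify_os_error` and `write_result` are hypothetical; only the errno values and exception types come from Python itself.

```python
import errno
import os


def classify_os_error(exc):
    # FileNotFoundError is the OSError subclass Python raises for ENOENT.
    if isinstance(exc, FileNotFoundError):
        return "missing file"
    # ENOSPC is errno 28 on Linux: "No space left on device".
    if exc.errno == errno.ENOSPC:
        return "disk full"
    return "other: " + errno.errorcode.get(exc.errno, "unknown")


def write_result(path, data):
    # Hypothetical producer step: surface a failed write instead of ignoring it.
    try:
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk so errors show up here
    except OSError as exc:
        return classify_os_error(exc)
    return "ok"


if __name__ == "__main__":
    print(write_result("/nonexistent-dir/out.bin", b"payload"))
    print(classify_os_error(OSError(errno.ENOSPC, "No space left on device")))
```

If the producer logs something like this, a later FileNotFoundError in the consumer can be traced back to either an earlier "disk full" on the producer side or a genuine logic bug.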
As to the title question:
By default, docker just overlay-mounts an empty directory from the host's filesystem into the container, so the amount of free space on the container is the same as the amount on the host.
If you're using volumes, that depends on the storage driver you use. As @Dan Serbyn mentioned, the default limit for the devicemapper driver is 10 GB. The overlay2 driver - the default driver - doesn't have that limitation.
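One way to check this for your own setup is to ask the filesystem directly from inside the container and compare with the host. This is only a sketch: `free_space_report` is a made-up helper, and it uses the standard library's shutil.disk_usage on whatever path you pass in.

```python
import shutil


def free_space_report(path="/"):
    # disk_usage reports total, used, and free bytes for the filesystem at path.
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": round(usage.total / gib, 2),
        "used_gib": round(usage.used / gib, 2),
        "free_gib": round(usage.free / gib, 2),
    }


if __name__ == "__main__":
    # Run once on the host and once inside the container, then compare.
    print(free_space_report("/"))
```

If the numbers inside the container are much smaller than on the host, a per-container limit from the storage driver is a likely explanation.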
In the current Docker version, there is a default limitation on the Docker container storage of 10 GB.
You can check the disk space that containers are using by running the following command:
docker system df
It's also possible that the file your container is trying to access has access level restrictions. Try to make it available for docker or maybe everybody (chmod 777 file.txt).

Can kubernetes provide a pod with an emptyDir volume from the host backed by a specific filesystem different than the host's?

I know this is a bit weird, but I'm building an application that makes small local changes to ephemeral file/folder systems and needs to sync them with a store of record. I am using NFS right now, but it is slow, not super scalable, and expensive. Instead, I'd love to take advantage of btrfs or zfs snapshotting for efficient syncing of snapshots of a small local filesystem, and push the snapshots into cloud storage.
I am running this application in Kubernetes (in GKE), which uses GCP VMs with ext4 formatted root partitions. This means that when I mount an emptyDir volume into my pods, the folder is on an ext4 filesystem I believe.
Is there an easy way to get an ephemeral volume mounted with a different filesystem that supports these fancy snapshotting operations?
No. GKE doesn't offer that kind of low-level control anyway, but the rest of this answer presumes you've managed to create a local mount of some kind. The easiest answer is a hostPath mount; however, that requires you to manually account for multiple similar pods on the same host so they don't collide. A newer option is an ephemeral CSI volume combined with a CSI plugin that basically reimplements emptyDir. https://github.com/kubernetes-csi/csi-driver-host-path gets most of the way there, but it 1) would require more work for this use case and 2) is explicitly not supported for production use. Failing either of those, you can move the whole kubelet data directory onto another mount, though that might not accomplish what you are looking for.

Applying ZFS snapshot to a non-ZFS FS

This is partly a question of theory and partly a specific (temporary) use case.
Two servers are to be kept in sync with each other: one on-site, the other an off-site backup.
However, the off-site server should have the data duplicated and accessible if need be (not storing archive images of server1).
server1 and server2 are connected over internet via VPN connection
server1 uses ZFS Raid 10
server2 uses ext4 Raid5 (temporary setup, will be replaced in future with ZFS and this use case vanishes)
Can you take a ZFS snapshot on server1, send it to server2 and have it be unpacked/applied to the raid5 array, essentially duplicating server1 via incremental snapshots?
I know that there are other tools for duplicating filesystems, but I was wondering if we can use snapshots on a non-ZFS FS. (The documentation leads me to believe this is not possible, but I do not know enough about this.)
Yes, there are two theoretical options. Both use async replication so will have a nonzero RPO (although from your description that seems acceptable to some extent):
1. Use zfs send to create a stream on the source system, and then use some tool that can understand the contents of that stream and translate to POSIX filesystem primitives on the receiving system.
2. Take a snapshot on the source system and then use an FS-agnostic tool to copy stuff from that snapshot over.
The first one has the benefit of being the most performant option, because ZFS knows what parts of its pool have been changed and only has to look at / send those parts. However, I don’t know of any tool that can actually do this. (Prototypes have been built at ZFS developer hackathons, but there is not a big audience for this type of tool so they’ve never been made production quality AFAIK.)
The second one is less performant because it will have to inspect the data to see what changed, but it has the benefit that tools exist — although you may have to fight with it a little, you can use rsync for this. Also, its RPO might be higher since transferring the data will take a bit longer. The slightly tricky parts will be:
Writing its metadata to a writable part of the pool on the source side, since the snapshot you’re copying will be read-only. (Look in the .zfs/ directory in the root of the filesystem you want to copy to find a readable copy of the snapshot.)
Making the failover target not have intermediate state if the source system dies during an rsync run. Hopefully your target filer has the ability to snapshot before you start an rsync run, so that you can roll back to the “last good state” if the run fails. Otherwise, hopefully your data / application can tolerate some inconsistencies. (Or maybe there’s an rsync option that does this that I haven’t used before.)
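As a rough sketch of the snapshot-plus-rsync approach, the helpers below only build the command lines; they don't run anything. The dataset name `tank/data`, its mountpoint, the snapshot name, and the target `server2:/srv/data/` are all placeholders, and the rsync flags (-a, --delete) are just a common starting point, not a tested recipe for this setup.

```python
import shlex


def build_snapshot_command(dataset, snapshot_name):
    # "zfs snapshot pool/dataset@name" creates a read-only snapshot.
    return ["zfs", "snapshot", f"{dataset}@{snapshot_name}"]


def build_rsync_command(dataset_mountpoint, snapshot_name, remote_target):
    # ZFS exposes snapshots read-only under <mountpoint>/.zfs/snapshot/<name>.
    source = f"{dataset_mountpoint.rstrip('/')}/.zfs/snapshot/{snapshot_name}/"
    # -a preserves permissions and times; --delete removes files that are
    # gone from the snapshot so the target mirrors it.
    return ["rsync", "-a", "--delete", source, remote_target]


if __name__ == "__main__":
    # All names below are hypothetical examples.
    print(shlex.join(build_snapshot_command("tank/data", "nightly")))
    print(shlex.join(
        build_rsync_command("/tank/data", "nightly", "server2:/srv/data/")
    ))
```

Copying from the snapshot path rather than the live filesystem means rsync sees a consistent point-in-time view on the source, even though the target can still end up in an intermediate state if a run is interrupted.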

Why would I want to use VOLUME inside a Dockerfile?

To me, VOLUME in a Dockerfile doesn't seem to be doing anything, whereas -v on the command line actually makes a directory available inside the container.
When I read the Docker manual for VOLUME, it is not really clear to me why I would ever want to write it in the Dockerfile and not just on the command line.
Defining the volume in the Dockerfile doesn't expose the volumes to the host by default. Instead it sets up the linked volume to allow other containers to link to the volume(s) in other Docker containers. This is commonly used in a "Data Container" configuration where you start a container with the sole purpose of persisting data. Here's a simple example:
docker run -d --name docker_data docker/image1
docker run -d --volumes-from docker_data --name new_container docker/image2
Notice the --volumes-from flag.
See http://container-solutions.com/understanding-volumes-docker/ for a more thorough explanation.
In addition to the accepted answer, another consideration for using volumes is performance. The layered filesystems used by Docker (typically AUFS or Devicemapper, depending on which Linux distribution you're using) aren't the fastest and may become a bottleneck in high-throughput scenarios (for example, databases or caching directories).
Volumes, on the other hand, even if not explicitly mapped to a host directory, are still simple bind mounts to the host file system, allowing a higher throughput when writing data.
For further reading, there's an interesting paper by IBM on this topic, which contains some interesting conclusions regarding the performance impact of using Docker volumes (emphasis mine):
AUFS introduces significant overhead which is not surprising since I/O is going through several layers, [...]. Applications that are filesystem or disk intensive should bypass AUFS by using volumes. [...]
Although containers themselves have almost no overhead, Docker is not without performance gotchas. Docker volumes have noticeably better performance than files stored in AUFS.

Move docker data volume containers between CoreOS hosts

For some scenarios a clustered file system is just too much. This is, if I got it right, the use case for the data volume container pattern. But even CoreOS needs updates from time to time. If I'd still like to minimise the downtime of applications, I'd have to move the data volume container along with the app container to another host while the old host is being updated.
Are there any established best practices? A frequently mentioned solution is to "back up" a container with docker export on the old host and docker import on the new host. But this would involve scp-ing tar files to another host. Can this be managed with fleet?
@brejoc, I wouldn't call this a solution, but it may help:
Alternative
1: Use another OS that has clustering, or at least doesn't prevent it. I am now experimenting with CentOS.
2: I've created a couple of tools that help in some use cases. The first tool retrieves data from S3 (usually artifacts) and is unidirectional. The second tool, which I call 'backup volume container', has a lot of potential but needs some feedback. It provides two-way backup/restore for data, from/to many persistent data stores including S3 (but also Dropbox, which is cool). As currently implemented, the first time you run it, it restores to the container. From that point on, it monitors the relevant folder in the container for changes and, upon changes (and after a quiet period), backs up to the persistent store.
Backup volume container: https://registry.hub.docker.com/u/yaronr/backup-volume-container/
File sync from S3: https://registry.hub.docker.com/u/yaronr/awscli/
(docker run yaronr/awscli aws s3 etc etc - read aws docs)

Resources