What does the GlusterFS server option cluster.readdir-optimize control?

I have been trying to optimise the small file performance of my GlusterFS storage cluster.
A number of forum threads and blog posts seem to suggest setting the cluster.readdir-optimize property on the volume, like:
$ gluster volume set test-share cluster.readdir-optimize on
The default for this option (as of GlusterFS v3.10) seems to be off, which makes me think there must be some trade-off to having this feature enabled. However, I have not been able to find any documentation explaining exactly what this option does.
I would like to understand the function of this option before I enable it in production.

As noted in the relevant GlusterFS git repository commit message, the readdir-optimize option supports the following:
Bring in option which is supported by posix xlator
to filter out directory's entries from being returned.
DHT would now request non-first subvols to filter out
directory entries.
I don't fully understand how this directly improves small-file performance in GlusterFS. But according to the GlusterFS documentation, the BD xlator wraps the GlusterFS block back-end and enables GlusterFS volumes to be composed of bricks which are themselves underlying logical volumes.
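If you want to experiment with it, here is a minimal sketch of checking the current value, enabling the option, and comparing directory-listing latency, assuming a volume named test-share and a client mount at /mnt/test-share (both placeholders):
$ gluster volume get test-share cluster.readdir-optimize    # show the current value
$ time ls -l /mnt/test-share/some-dir > /dev/null           # baseline listing time
$ gluster volume set test-share cluster.readdir-optimize on
$ time ls -l /mnt/test-share/some-dir > /dev/null           # compare after enabling
Since the option affects how directory entries are returned on distributed volumes, timing ls on a large directory is the simplest before/after comparison.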

Related

Can kubernetes provide a pod with an emptyDir volume from the host backed by a specific filesystem different than the host's?

I know this is a bit weird, but I'm building an application that makes small local changes to ephemeral file/folder systems and needs to sync them with a store of record. I am using NFS right now, but it is slow, not super scalable, and expensive. Instead, I'd love to take advantage of btrfs or zfs snapshotting for efficient syncing of snapshots of a small local filesystem, and push the snapshots into cloud storage.
I am running this application in Kubernetes (in GKE), which uses GCP VMs with ext4 formatted root partitions. This means that when I mount an emptyDir volume into my pods, the folder is on an ext4 filesystem I believe.
Is there an easy way to get an ephemeral volume mounted with a different filesystem that supports these fancy snapshotting operations?
No. Nor does GKE offer that kind of low-level control anyway, but the rest of this answer presumes you've managed to create a local mount of some kind. The easiest answer is a hostPath mount, however that requires that you manually account for multiple similar pods on the same host so they don't collide. A newer option is an ephemeral CSI volume combined with a CSI plugin that basically reimplements emptyDir. https://github.com/kubernetes-csi/csi-driver-host-path gets most of the way there, but it 1) would require more work for this use case and 2) is explicitly not supported for production use. Failing either of those, you can move the whole kubelet data directory onto another mount, though that might not accomplish what you are looking for.
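For reference, a rough sketch of the hostPath approach, assuming you have already created and mounted a btrfs filesystem at /mnt/btrfs-scratch on the node; the pod name, image, and paths are placeholders, and you would still need to give each pod its own sub-directory to avoid collisions:
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: snapshot-worker
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data                               # the application sees a btrfs-backed directory here
  volumes:
  - name: scratch
    hostPath:
      path: /mnt/btrfs-scratch/snapshot-worker       # per-pod directory on the btrfs mount
      type: DirectoryOrCreate
EOF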

Read from an XFS brick, write to a volume?

Filesystem notifications are not available on volumes, which is why we started reading directly from the brick.
Is it okay to read directly from a brick, but write to a volume so that replication happens?
The volume is created from 3 bricks using a replication strategy. Could anyone please suggest the demerits of reading directly from a brick?
If the file on the brick from which you read is not in sync with the other copy/copies of the replica (i.e. there is a self-heal that is pending), you can get stale data. Reading from the mount ensures that you always get the up to date data.
Though not comparable with inotify, you can use glusterfind to provide some level of filesystem notifications.
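A minimal sketch of glusterfind, assuming a replicated volume named myvol and run from a node in the trusted storage pool; the session name and output file are placeholders:
$ glusterfind create mysession myvol                  # register a change-detection session
$ glusterfind pre mysession myvol /tmp/changes.txt    # list files changed since the last run
$ cat /tmp/changes.txt                                # NEW / MODIFY / DELETE entries per path
$ glusterfind post mysession myvol                    # mark this run as consumed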

Are docker volumes a better option for write-heavy operations than binding directories directly?

Reading through docker documentation I found this passage (located here):
Block-level storage drivers such as devicemapper, btrfs, and zfs perform better for write-heavy workloads (though not as well as Docker volumes).
So does this mean that one should always use Docker volumes when expecting lots of persistent writing?
The container-local filesystem never stores persistent data, so you don't have a choice but to mount something into the container if you want data to live on after the container exits. The "block-level storage drivers" you quote discuss particular install-time options for how images and containers are stored, and aren't related to any particular volume or bind-mount implementation.
As far as performance goes, my general expectation is that the latency of disk I/O will far outweigh any overhead of any particular implementation. Without benchmarking any particular implementation, on a native Linux host, I would expect a named volume, a bind-mount, and writes to the container filesystem to be more or less similar.
From a programming point of view, you will probably get better long-term performance improvement from figuring out how to have fewer disk accesses (for example, by grouping together related database requests into a single transaction) than by trying to optimize the Docker-level storage.
The one prominent exception to this is that bind mounts on MacOS are known to be very slow, and you should avoid them if your workload involves substantial disk access. (This includes both reading and writing, and includes some interpreted languages that want to read in every possible source file at startup time.) If you're managing something like database storage where you can't usefully access the files directly anyway, use a named volume. For your application code, COPY it into an image in a Dockerfile and do not overwrite it at run time.
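As a sketch of that last point, assuming a Postgres container (the volume name and image tag are just examples), the database files live in a named volume rather than a bind mount:
$ docker volume create pgdata
$ docker run -d --name db \
    -e POSTGRES_PASSWORD=example \
    -v pgdata:/var/lib/postgresql/data \
    postgres:15
The data survives removing the container and can be reattached to a new container with the same -v flag.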
Should one always use docker volumes when expecting lots of persistent writing?
It depends.
Yes, you want some kind of storage external to the container for any persistent data, since data written inside the container is lost when that container is removed.
Whether that should be a host bind mount or a named volume depends on how you need to manage that data. A host volume is a bind mount to the host filesystem. It gives you direct access to that data, but that direct access also comes with uid/gid permission issues and loses the initialization feature of named volumes.
A named volume with all the defaults is just a bind mount to a folder under /var/lib/docker, so performance will be the same as a host volume if the underlying filesystem is the same. That said, a named volume can be configured to mount just about anything you can do with the mount command, as sketched below.
Since each of these options can sit on a different underlying filesystem, and the performance difference comes from that underlying filesystem choice, there's no way to answer this in any generic sense. Hence, it depends.
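A sketch of that "mount just about anything" point: the local volume driver accepts mount-style options, so a named volume can be backed by, say, an NFS export (the server address, export path, and volume name are placeholders):
$ docker volume create --driver local \
    --opt type=nfs \
    --opt o=addr=192.168.1.10,rw \
    --opt device=:/exports/appdata \
    nfsdata
$ docker run --rm -v nfsdata:/data busybox ls /data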

User Level Library for Loopback Storage (no loopback device for Spark applications in HPC)

Cray recommends using loopback devices for running Spark on HPC clusters with Lustre file systems [1]. The problem is that most HPC clusters do not give their users access to loopback devices. So I wonder if there is a library that opens just one huge file on Lustre and lets us treat that huge file as a file system, so that we can utilize parallel access to that one file.
This way we could have parallel IO while having proper partitions and one file per partition. Searching didn't turn up anything.
[1] http://wiki.lustre.org/images/f/fb/LUG2016D2_Scaling-Apache-Spark-On-Lustre_Chaimov.pdf
Whether this is possible depends heavily on your application. It would be possible to create e.g. an ext4 filesystem image in a regular file using mke2fs as a regular user, and to access it either with libext2fs linked into your application (probably single-threaded) or via fuse2fs in userspace. fuse2fs may still need root permission to set up (I'm not positive), but after that it would behave like a normal filesystem and does not need a block device.
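A minimal sketch of that idea, assuming mke2fs and fuse2fs (both from e2fsprogs) are available and that fuse2fs can mount without extra privileges on your system; the file names and size are placeholders:
$ truncate -s 100G /lustre/project/spark-scratch.img          # sparse backing file on Lustre
$ mke2fs -t ext4 -F /lustre/project/spark-scratch.img         # build an ext4 filesystem inside the file
$ mkdir -p ~/spark-scratch
$ fuse2fs /lustre/project/spark-scratch.img ~/spark-scratch   # userspace mount, no block device required
$ fusermount -u ~/spark-scratch                               # unmount when finished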

Applying ZFS snapshot to a non-ZFS FS

So this is a bit of a theoretical question as well as a specific (temporary) use case.
Two servers are to be kept in sync with each other: one on-site, the other an off-site backup.
However, the off-site server should have the data duplicated and accessible if need be (not just storing archive images of server1).
server1 and server2 are connected over the internet via a VPN connection.
server1 uses ZFS RAID 10.
server2 uses ext4 RAID 5 (a temporary setup that will be replaced with ZFS in the future, at which point this use case vanishes).
Can you take a ZFS snapshot on server1, send it to server2 and have it be unpacked/applied to the RAID 5 array, essentially duplicating server1 via incremental snapshots?
I know that there are some other tools for duplicating filesystems, but I was wondering if we can use snapshots with a non-ZFS filesystem. (The documentation leads me to believe this is not possible, but I do not know enough about this.)
Yes, there are two theoretical options. Both use async replication, so they will have a nonzero RPO (although from your description that seems acceptable to some extent):
Use zfs send to create a stream on the source system, and then use some tool that can understand the contents of that stream and translate to POSIX filesystem primitives on the receiving system.
Take a snapshot on the source system and then use an FS-agnostic tool to copy stuff from that snapshot over.
The first one has the benefit of being the most performant option, because ZFS knows what parts of its pool have been changed and only has to look at / send those parts. However, I don’t know of any tool that can actually do this. (Prototypes have been built at ZFS developer hackathons, but there is not a big audience for this type of tool so they’ve never been made production quality AFAIK.)
The second one is less performant because it will have to inspect the data to see what changed, but it has the benefit that tools exist — although you may have to fight with it a little, you can use rsync for this. Also, its RPO might be higher since transferring the data will take a bit longer. The slightly tricky parts will be:
Writing its metadata to a writable part of the pool on the source side, since the snapshot you’re copying will be read-only. (Look in the .zfs/ directory in the root of the filesystem you want to copy to find a readable copy of the snapshot.)
Making the failover target not have intermediate state if the source system dies during an rsync run. Hopefully your target filer has the ability to snapshot before you start an rsync run, so that you can roll back to the “last good state” if the run fails. Otherwise, hopefully your data / application can tolerate some inconsistencies. (Or maybe there’s an rsync option that does this that I haven’t used before.)
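A rough sketch of the second option, assuming a dataset named tank/data mounted at /tank/data on server1 and an ssh/rsync-reachable directory on server2 (all names and paths are placeholders):
$ SNAP=backup-$(date +%F)
$ zfs snapshot tank/data@"$SNAP"                         # freeze a point-in-time view on server1
$ rsync -a --delete \
    /tank/data/.zfs/snapshot/"$SNAP"/ \
    server2:/backup/data/                                # copy the read-only snapshot contents to ext4
$ zfs destroy tank/data@"$SNAP"                          # optionally clean up once the copy succeeds
If server2 can take its own snapshot (e.g. via LVM) before the rsync run, that gives you the "last good state" to roll back to if the transfer is interrupted.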
