Linux filesystem nesting and syscall hooking

Using the 2.6.32 Linux kernel, I need to use a specific filesystem on a block device partition, and I want to hook the open/write/read/close (and a few other) syscalls so that what gets read from and written to this partition is handled differently from what the filesystem itself would do.
It would apply only to this partition; other partitions using the same filesystem would behave as usual.
FUSE would have been perfect for this, but I can't use it because of its memory consumption (too large for the targeted system).
How could I hook syscalls between the VFS and the mounted filesystem, e.g. to maintain an intermediate index and buffer all reads and writes?
I tried something like this:
mount -t ext3 /dev/sda1 /my/mount/data
mkfs.vfat /my/mount/data/big_file
mount -o loop -t vfat /my/mount/data/big_file /my/mount/custom_data
where vfat stands in for my custom filesystem, but debugging shows that its file operations are never invoked for file operations done inside the custom_data mount.
Any hints on how I should proceed?

I discovered stackable file systems.
Wrapfs is interesting and should fit my needs: http://wrapfs.filesystems.org/
It allows catching, in an intermediate layer between the VFS and the lower filesystem, all the system calls.
Solved.

Related

How to check whether two files are on the same physical hard disk in Linux (bash or Python)?

I'm optimizing an I/O-intensive Linux program. Is there any way to know whether two given file/folder paths are on the same hard disk?
Thanks.
If, by "same physical hard disk", you mean the same filesystem, then you can use the stat command to get the device ID:
$ stat -c '%D' filename
fd03
If the device IDs match, they're in the same filesystem.
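The same check is easy to script. A small sketch in Python, using only the standard library (`st_dev` from `os.stat` is the same device ID that `stat -c '%D'` prints):

```python
import os
import tempfile

def same_filesystem(path_a, path_b):
    """True if both paths live on the same filesystem,
    judged by the device ID (st_dev) reported by stat(2)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Two files created in the same temporary directory share a device ID.
with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "a"), os.path.join(d, "b")
    open(a, "w").close()
    open(b, "w").close()
    print(same_filesystem(a, b))  # True
```

The same caveat applies as for the stat command: matching device IDs mean the same filesystem, not necessarily the same physical disk.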
To actually determine the physical disk the file is on, you'd have to know the filesystem in use (some filesystems can span multiple disks), and even the "device" itself may be mapped to more than one actual physical disk by a volume manager such as LVM or a RAID controller.

libmount equivalent for FUSE filesystems

What is the libmount-equivalent function to mount a FUSE filesystem? I understand that FUSE is not a real filesystem, and my strace of mount.fuse shows it opening a /dev/fuse file and doing some complicated manipulations.
I tried to see how mount.fuse works by reading its source code, but not only is it needlessly complicated by string manipulation in C, it is also a GPL program.
My question is: am I missing an obvious API to mount FUSE filesystems?
The kernel interface for mounting a FUSE filesystem is described in the kernel documentation, in "linux/Documentation/filesystems/fuse.txt".
In a nutshell, you call mount(2) as you would to mount any filesystem. However, the key difference is that you must provide a mount option fd=n where n is a file descriptor you've obtained by opening /dev/fuse and which will be used by the userspace process implementing the filesystem to respond to kernel requests.
In particular, this means that the mount is actually performed by the user space program that implements the filesystem. Specifically, most FUSE filesystems use libfuse and call the function fuse_main or fuse_session_mount to perform the mount (which eventually call the internal function fuse_mount_sys in mount.c that contains the actual mount(2) system call).
So, if you want to mount a FUSE filesystem programmatically, the correct way to do this is to fork and exec the corresponding FUSE executable (e.g., sshfs) and have it handle the mount on your behalf.
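For example, a minimal sketch of that fork-and-exec approach in Python (sshfs, the remote path and the mount point are all placeholders; any FUSE executable that follows the usual `<source> <mountpoint> [-o options]` convention works the same way):

```python
import subprocess

def build_fuse_mount_cmd(fs_exec, source, mountpoint, options=None):
    """Build the argv for mounting a FUSE filesystem by running its
    own executable, e.g. ['sshfs', 'host:/dir', '/mnt', '-o', 'ro']."""
    cmd = [fs_exec, source, mountpoint]
    if options:
        cmd += ["-o", ",".join(options)]
    return cmd

cmd = build_fuse_mount_cmd("sshfs", "host:/dir", "/mnt/remote", ["ro", "allow_other"])
print(cmd)  # ['sshfs', 'host:/dir', '/mnt/remote', '-o', 'ro,allow_other']
# To actually perform the mount, run the executable and let it
# open /dev/fuse and call mount(2) itself:
# subprocess.run(cmd, check=True)
```

The FUSE executable daemonizes after a successful mount, so `subprocess.run` returning 0 means the filesystem is mounted and being served in the background.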
Note that /sbin/mount.fuse doesn't actually mount anything itself. It's just a wrapper to allow you to mount FUSE filesystems via entries in "/etc/fstab" via the mount command line utility or at boot time. That's why you can't find any mounting code there. It mounts FUSE filesystems the same way I described above, by running the FUSE executable for the filesystem in question to perform the actual mount.

Does Linux need a writeable file system

Does Linux need a writable file system to function correctly? I'm just running a very simple init program. Presently I'm not mounting any partitions, and the kernel has mounted the root partition read-only. Is Linux designed to be able to run with just a read-only file system as long as I stick to mallocs, readlines and text to standard out (puts), or does Linux require a writable file system in order to perform even standard text input and output?
I ask because I seem to be getting kernel panics and complaints about the stack. I'm not trying to run a useful system at the moment; I already have a useful system on another partition. I'm trying to keep things as simple as possible so that I can fully understand them before adding an extra layer of complexity.
I'm running a fairly standard x86-64 desktop.
No, a writable file system is not required. It is entirely possible to run GNU/Linux with only a read-only file system.
In practice you will probably want to mount /proc, /sys, /dev, and possibly /dev/pts for everything to work properly. Note that even some bash commands require a writable /tmp, and some other programs a writable /var.
You can always mount /tmp and /var as ramdisks.
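The ramdisk suggestion can be expressed as fstab entries; a sketch (the sizes, and the choice to cover all of /var rather than just /var/tmp and /var/log, are assumptions to adjust for your system):

```
# /etc/fstab -- root stays read-only; /tmp and /var are backed by RAM
tmpfs  /tmp  tmpfs  defaults,size=64m,mode=1777  0  0
tmpfs  /var  tmpfs  defaults,size=64m            0  0
```

Anything written to these mounts lives in memory and disappears on reboot, which is exactly what you want on a read-only root.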
Yes and no. No, it doesn't need to be writable if the system does almost nothing useful.
Yes: you're running a desktop, so it needs to be writable.
Many processes actually need a writable filesystem, as many system calls can create files; e.g., binding a Unix domain socket can create a file.
Also, many applications write into /var and /tmp.
The way to get around this is to mount the filesystem read-only and overlay an in-memory filesystem on top of it. That way the paths are writable, but writes go to RAM and any changes are thrown away on reboot.
See: overlayroot
No, it's not required. For example, most distributions have a live version of Linux that boots from a CD or USB disk without using a backing hard disk at all.
Also, on normal installations the root partition is remounted read-only when corruption is detected on the disk; that way the system still comes up, with the root partition read-only.
You need to capture the vmcore and the stack trace of the panic from the dmesg output to analyse further.

Is there a way to determine the UFS type in Linux?

I have a UFS partition of unknown type. I would like to know, from Linux, what type of UFS it is (it could be any). Is there a tool (or library) for Linux, or a method, which can solve my problem? I know I can try mounting it as each type in turn, but this message stops me:
>>> WARNING Wrong ufstype may corrupt your filesystem, default is ufstype = old
Please tell me: is there a safe way to mount a UFS partition of unknown type?
You might run, as root:
file -s -L /dev/sda5
if /dev/sda5 is the disk partition. The file(1) command should tell more about the data on that partition, in particular if it is some file system data.
Once you know more about that partition, you can mount it in a safer way. BTW, I would mount it read-only first.

LVM snapshot of mounted filesystem

I'd like to programmatically make a snapshot of a live filesystem in Linux, preferably using LVM. I'd like not to unmount it because I've got lots of files opened (my most common scenario is that I've got a busy desktop with lots of programs).
I understand that because of kernel buffers and general filesystem activity, data on disk might be in some more-or-less undefined state.
Is there any way to "atomically" unmount an FS, make an LVM snapshot and mount it back? It would be OK if the OS blocked all activity for a few seconds to do this. Or maybe some kind of atomic "sync + snapshot"? A kernel call?
I don't know if it is even possible...
You shouldn't have to do anything for most Linux filesystems. It should just work without any effort at all on your part. The snapshot command itself hunts down mounted filesystems using the volume being snapshotted and calls a special hook that checkpoints them in a consistent, mountable state and does the snapshot atomically.
Older versions of LVM came with a set of VFS lock patches that would patch various filesystems so that they could be checkpointed for a snapshot. But with new kernels that should already be built into most Linux filesystems.
This intro on snapshots claims as much.
And a little more research reveals that for kernels in the 2.6 series the ext series of filesystems should all support this. ReiserFS probably also. And if I know the btrfs people, that one probably does as well.
I know that ext3 and ext4 in RedHat Enterprise, Fedora and CentOS automatically checkpoint when a LVM snapshot is created. That means there is never any problem mounting the snapshot because it is always clean.
I believe XFS has the same support. I am not sure about other filesystems.
It depends on the filesystem you are using. With XFS you can use xfs_freeze -f to sync and freeze the FS, and xfs_freeze -u to thaw it again, so you can create your snapshot from the frozen volume, which is guaranteed to be in a consistent state.
Is there any way to "atomically" unmount an FS, make an LVM snapshot and mount it back?
It is possible to snapshot a mounted filesystem, even when the filesystem is not on an LVM volume. If the filesystem is on LVM, or it has built-in snapshot facilities (e.g. btrfs or ZFS), then use those instead.
The below instructions are fairly low-level, but they can be useful if you want the ability to snapshot a filesystem that is not on an LVM volume, and can't move it to a new LVM volume. Still, they're not for the faint-hearted: if you make a mistake, you may corrupt your filesystem. Make sure to consult the official documentation and dmsetup man page, triple-check the commands you're running, and have backups!
The Linux kernel has an awesome facility called the Device Mapper, which can do nice things such as create block devices that are "views" of other block devices, and of course snapshots. It is also what LVM uses under the hood to do the heavy lifting.
In the below examples I'll assume you want to snapshot /home, which is an ext4 filesystem located on /dev/sda2.
First, find the name of the device mapper device that the partition is mounted on:
# mount | grep home
/dev/mapper/home on /home type ext4 (rw,relatime,data=ordered)
Here, the device mapper device name is home. If the path to the block device does not start with /dev/mapper/, then you will need to create a device mapper device, and remount the filesystem to use that device instead of the HDD partition. You'll only need to do this once.
# dmsetup create home --table "0 $(blockdev --getsz /dev/sda2) linear /dev/sda2 0"
# umount /home
# mount -t ext4 /dev/mapper/home /home
Next, get the block device's device mapper table:
# dmsetup table home
home: 0 3864024960 linear 9:2 0
Your numbers will probably be different. The device target should be linear; if yours isn't, you may need to take special considerations. If the last number (start offset) is not 0, you will need to create an intermediate block device (with the same table as the current one) and use that as the base instead of /dev/sda2.
In the above example, home is using a single-entry table with the linear target. You will need to replace this table with a new one, which uses the snapshot target.
Device mapper provides three targets for snapshotting:
The snapshot target, which saves writes to the specified COW device. (Note that even though it's called a snapshot, the terminology is misleading, as the snapshot will be writable, but the underlying device will remain unchanged.)
The snapshot-origin target, which sends writes to the underlying device, but also sends the old data that the writes overwrote to the specified COW device.
The snapshot-merge target, which copies the changes recorded in a COW device back into the underlying device; it is used below to commit a snapshot.
Typically, you would make home a snapshot-origin target, then create some snapshot targets on top of it. This is what LVM does. However, a simpler method would be to simply create a snapshot target directly, which is what I'll show below.
Regardless of the method you choose, you must not write to the underlying device (/dev/sda2), or the snapshots will see a corrupted view of the filesystem. So, as a precaution, you should mark the underlying block device as read-only:
# blockdev --setro /dev/sda2
This won't affect device-mapper devices backed by it, so if you've already re-mounted /home on /dev/mapper/home, it should not have a noticeable effect.
Next, you will need to prepare the COW device, which will store changes since the snapshot was made. This has to be a block device, but can be backed by a sparse file. If you want to use a sparse file of e.g. 32GB:
# dd if=/dev/zero bs=1M count=0 seek=32768 of=/home_cow
# losetup --find --show /home_cow
/dev/loop0
Obviously, the sparse file shouldn't be on the filesystem you're snapshotting :)
Now you can reload the device's table and turn it into a snapshot device:
# dmsetup suspend home && \
dmsetup reload home --table \
"0 $(blockdev --getsz /dev/sda2) snapshot /dev/sda2 /dev/loop0 PO 8" && \
dmsetup resume home
If that succeeds, new writes to /home should now be recorded in the /home_cow file, instead of being written to /dev/sda2. Make sure to monitor the size of the COW file, as well as the free space on the filesystem it's on, to avoid running out of COW space.
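Because the COW file is sparse, `ls -l` shows its apparent size, not the space it actually occupies. A small sketch (standard library only; the file path is a placeholder) of how to check the real allocation, which is the number you want to monitor:

```python
import os

def allocated_bytes(path):
    """Actual disk space used by a (possibly sparse) file.
    st_blocks counts 512-byte units regardless of the
    filesystem's block size."""
    return os.stat(path).st_blocks * 512

# A 32 GiB sparse file starts out consuming (almost) nothing:
with open("/tmp/cow_demo", "wb") as f:
    f.truncate(32 * 1024**3)
print(os.path.getsize("/tmp/cow_demo"))  # apparent size: 34359738368
print(allocated_bytes("/tmp/cow_demo"))  # actual allocation: near 0
os.remove("/tmp/cow_demo")
```

As writes land in the COW file, `allocated_bytes` grows toward the apparent size; alert well before it reaches the free space available on the filesystem holding it.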
Once you no longer need the snapshot, you can merge it (to permanently commit the changes in the COW file to the underlying device), or discard it.
To merge it:
replace the table with a snapshot-merge target instead of a snapshot target:
# dmsetup suspend home && \
dmsetup reload home --table \
"0 $(blockdev --getsz /dev/sda2) snapshot-merge /dev/sda2 /dev/loop0 P 8" && \
dmsetup resume home
Next, monitor the status of the merge until all non-metadata blocks are merged:
# watch dmsetup status home
...
0 3864024960 snapshot-merge 281688/2097152 1104
Note the 3 numbers at the end (X/Y Z). The merge is complete when X = Z.
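Watching by eye works, but if you script the merge, the status line can be parsed. A sketch based on the field layout in the example output above (check `dmsetup status` output on your own kernel before relying on it):

```python
def merge_complete(status_line):
    """Parse a 'dmsetup status' line for a snapshot-merge target, e.g.
    '0 3864024960 snapshot-merge 281688/2097152 1104', and report
    whether the merge is done (used sectors X equal metadata sectors Z)."""
    fields = status_line.split()
    used, _total = fields[3].split("/")
    metadata = fields[4]
    return int(used) == int(metadata)

print(merge_complete("0 3864024960 snapshot-merge 281688/2097152 1104"))  # False
print(merge_complete("0 3864024960 snapshot-merge 1104/2097152 1104"))    # True
```

A polling loop would call this on the output of `dmsetup status home` and proceed to the linear-table swap once it returns True.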
Next, replace the table with a linear target again:
# dmsetup suspend home && \
dmsetup reload home --table \
"0 $(blockdev --getsz /dev/sda2) linear /dev/sda2 0" && \
dmsetup resume home
Now you can dismantle the loop device:
# losetup -d /dev/loop0
Finally, you can delete the COW file.
# rm /home_cow
To discard the snapshot, unmount /home, follow steps 3-5 above, and remount /home. Although Device Mapper will allow you to do this without unmounting /home, it doesn't make sense (since the running programs' state in memory won't correspond to the filesystem state any more), and it will likely corrupt your filesystem.
I'm not sure if this will do the trick for you, but you can remount a file system as read-only. mount -o remount,ro /lvm (or something similar) will do the trick. After you are done with your snapshot, you can remount read-write using mount -o remount,rw /lvm.
FS corruption is "highly unlikely" only as long as you never work in any kind of professional environment. Otherwise you'll meet reality, and you might try blaming "bit rot" or "hardware" or whatever, but it all comes down to having been irresponsible. freeze/thaw (as mentioned a few times, and only if called properly) is sufficient outside of database environments. For databases, you still won't have a transaction-complete backup, and if you think a backup that rolls back some transactions is fine when restored: see the first sentence.
Depending on the activity, you might just have added another 5-10 minutes of downtime if you ever need that backup.
Most of us can easily afford that, but it cannot be general advice.
Be honest about the downsides, guys.
