How can I create a userspace filesystem with FUSE without using libfuse? - linux

I've found that the FUSE userspace library and kernel interface have been ported, since their inception on Linux, to many other systems, and present a relatively stable API with a supposedly small surface area. If I wanted to author a filesystem in userspace, and I were not on Plan 9 or Hurd, I would think that FUSE is my best choice.
However, I am not going to use libfuse. This is partially because of pragmatism; using C is hard in my language of choice (Monte). It's also because I am totally uninterested in writing C support code, and libfuse's recommended usage is incompatible with Monte philosophy. This shouldn't be a problem, since C is not magical and /dev/fuse can be opened with standard system calls.
Going to look for documentation, however, I've found none. There is no documentation that I can find for the /dev/fuse ABI/API, and no stories of others taking this same non-C-bound route. Frustrating.
Does any kind of documentation exist on how to interact in a language-agnostic way with /dev/fuse and the FUSE subsystem of the kernel? If so, could you point me to it? Thanks!
Update: There exists go-fuse, which is in Go, a slightly more readable language than C. However, it does not contain any ABI/API documentation either.
Update: I notice that people have voted to close this. Don't worry, there is no need for that. I have satisfied myself that the documentation that I desire does not yet exist. I will write the documentation myself, publish it, and then link to it in an accepted answer. Hopefully the next person to search for this documentation will not be disappointed.

(I'm not accepting this until it's complete. In the interim, edits are welcome!)
The basic outline of a FUSE session:
1. open() is called on /dev/fuse. I'll call the resulting FD the control FD.
2. mount() is called with the target mount point, filesystem type "fuse" for normal mode or "fuseblk" for block-device mode, and options including "fd=X" where X is the control FD.
3. FUSE-specific structs are transferred on the control FD repeatedly. Communication is request-response: the program read()s filesystem commands from the control FD and then write()s responses back.
4. umount() is called with the target mount point.
5. close() is called on the control FD.
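For the mount() step, the option string seems to need more than fd=X. As far as I can tell from the kernel's FUSE documentation and from libfuse's source, rootmode=, user_id= and group_id= are expected as well. Here is a minimal sketch of building that string; the function name is made up for illustration, and the set of required options is an assumption to check against your kernel:

```python
import os
import stat


def fuse_mount_options(control_fd, uid=None, gid=None, extra=()):
    """Build the data string passed as the last argument of mount(2).

    Assumption: the kernel expects fd=, rootmode= (file type of the mount
    root, in octal), user_id= and group_id=. Verify against your kernel docs.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    opts = [
        "fd=%d" % control_fd,
        "rootmode=%o" % stat.S_IFDIR,  # the root of the mount is a directory
        "user_id=%d" % uid,
        "group_id=%d" % gid,
    ]
    opts.extend(extra)  # e.g. "allow_other"
    return ",".join(opts)
```

The result is what would go into the data argument of mount(), next to the filesystem type "fuse".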
With that all said, there's a handful of complications that one should be aware of. First, mount() is almost always a privileged syscall, so you'll have to be root to mount a FUSE filesystem. However, as one may have noticed, FUSE programs can generally be started as non-root! How?
There's a helper, /bin/fusermount, installed setuid. Usage is totally undocumented, but that's what I'm here for. Instead of open()ing /dev/fuse yourself, run fusermount as a subprocess, passing the target mount point as an argument, any extra mount options you like with -o, and (crucially) with the environment variable _FUSE_COMMFD exported and set to the ASCII string of an open FD, which I'll call the comm FD. You must create the comm FD yourself, and because the FD-passing trick only works over Unix-domain sockets, it has to come from socketpair() rather than pipe(). fusermount will call open() and mount() for you, and share the control FD back to you along the comm FD, using the sendmsg() trick for sharing FDs (SCM_RIGHTS ancillary data). Use recvmsg() on your end of the socket pair to read it back.
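Here is a rough sketch of the FD-passing dance. SCM_RIGHTS only travels over a Unix-domain socket, so the comm FD here comes from socketpair(). The helper names are mine, and mount_with_fusermount() is untested guesswork based on the description above (it assumes fusermount is on PATH and accepts "--" before the mount point); only the send/receive pair is something I'd vouch for.

```python
import array
import os
import socket
import subprocess


def recv_one_fd(sock):
    """Receive a single file descriptor sent as SCM_RIGHTS ancillary data."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial int before decoding.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise OSError("no file descriptor received")


def send_one_fd(sock, fd):
    """The sending side: one dummy byte plus the FD as ancillary data."""
    payload = array.array("i", [fd])
    sock.sendmsg([b"\0"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])


def mount_with_fusermount(mountpoint, options=""):
    """Untested sketch: let fusermount mount and hand back the control FD."""
    ours, theirs = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    env = dict(os.environ, _FUSE_COMMFD=str(theirs.fileno()))
    argv = ["fusermount"]
    if options:
        argv += ["-o", options]
    argv += ["--", mountpoint]
    # pass_fds keeps the comm FD open (same number) in the child.
    proc = subprocess.Popen(argv, env=env, pass_fds=(theirs.fileno(),))
    theirs.close()
    control_fd = recv_one_fd(ours)
    # The caller decides when to close `ours` and reap `proc`.
    return control_fd, ours, proc
```

send_one_fd() is only there so the receiving half can be exercised without fusermount, for example across a socketpair() inside one process.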
Editorial: I really don't understand why this is structured to be so difficult. FDs are inherited by subprocesses; it would have been so much easier to open() the control FD in the top process and pass it down into fusermount. True, there are some confused-deputy dangers, but fusermount is already installed and setuid and dangerous.
Anyway! fusermount will crudely daemonize and take care of calling umount() and close() to clean up once your main process exits.
Things not yet covered:
How is non-blocking access to FUSE handled? Can the control FD just be kicked into non-blocking mode? Does it actually not block, or does it behave like an ordinary file and secretly block on access?
The struct layouts. These can be more or less rediscovered from the C or Go source, but that's no excuse. I'll document them more seriously when I've worked up sufficient masochism.
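As a starting point for the struct layouts, here is a sketch of the two framing headers, transcribed from struct fuse_in_header and struct fuse_out_header in the kernel's linux/fuse.h. Treat the field list as an assumption to verify against the header for your kernel; the function names are mine.

```python
import struct

# Assumed layouts, host byte order, no implicit padding ("=" prefix).
# fuse_in_header:  len, opcode, unique, nodeid, uid, gid, pid, padding
IN_HEADER = struct.Struct("=IIQQIIII")
# fuse_out_header: len, error, unique
OUT_HEADER = struct.Struct("=IiQ")


def parse_request(buf):
    """Split one read() from the control FD into header fields and payload."""
    length, opcode, unique, nodeid, uid, gid, pid, _pad = IN_HEADER.unpack_from(buf)
    return {
        "opcode": opcode,
        "unique": unique,   # must be echoed back in the reply
        "nodeid": nodeid,
        "uid": uid,
        "gid": gid,
        "pid": pid,
        "payload": bytes(buf[IN_HEADER.size:length]),
    }


def build_reply(unique, payload=b"", error=0):
    """Frame a reply; error is 0 or a negated errno value."""
    return OUT_HEADER.pack(OUT_HEADER.size + len(payload), error, unique) + payload
```

Each opcode has its own payload struct after the header, which this sketch leaves as raw bytes.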


How can a program change a directory without using chdir()?

I can find a lot of documentation on using chdir() to change a directory in a program (a command shell, for instance). I was wondering if it is possible to somehow do the same thing without the use of chdir(). Yet, I can't find any documentation or examples of code where a person is changing directories without using chdir() to some capacity. Is this possible?
In Linux, chdir() is a syscall. That means it's not something a program does in its own memory, but it's a request for the OS kernel to do something on the program's behalf.
Granted, it's one of two syscalls that can change directories -- the other one is fchdir(). Theoretically you could use the other one, though whether that's what your professor actually wants is very much open to interpretation.
As for why chdir() and fchdir() can't be reimplemented by an application and have to be called instead: the current working directory is part of the process state that the kernel maintains on a program's behalf, and the program itself can't touch kernel memory without asking the kernel to operate on its behalf.
Things are syscalls because they need to be syscalls -- if something could be done in-process, it would be done that way (crossing the boundary between userspace and kernelspace involves a context-switch penalty; it's not without performance impact). In this case, letting the kernel do accurate bookkeeping as to what a process's working directory is ensures that the working directory is maintained when a new executable is loaded (with execve()), and helps to ensure the integrity of the kernel's records (making sure a program can't pretend to have its current working directory be a directory it doesn't actually have access to).
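For illustration, here is a minimal sketch of the fchdir() route, written in Python rather than C (os.fchdir() is a thin wrapper around the syscall). The function name is made up for the example:

```python
import os


def change_directory_by_fd(path):
    """Switch the working directory with fchdir() instead of chdir()."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fchdir(fd)  # still a syscall: the kernel updates its own record
    finally:
        os.close(fd)
    return os.getcwd()
```

Either way the request ends up in the kernel; only the way the directory is named (path versus open descriptor) differs.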

How do I open a file in a kernel module if calling process is in user space?

I am trying to create a character device driver that dumps /etc/shadow when read from as a non-privileged user. This is for purely academic purposes of course.
I was reading about how reading/writing files in kernel space opens a system to possible exploits. I am trying to implement this in practice.
Please spare me the "don't touch the filesystem in kernel mode" talk. I am precisely trying to exploit the nuances of doing so.
The problem is that the only way I have found so far to open a file in kernel mode is filp_open, which currently produces EACCES when I read from the device file as a non-privileged user. This was confounding at first, as I assumed that I could do anything in kernel space.
For example, when I cat the device file I have created as a non-root user, filp_open produces EACCES in kernel space???
Further investigation has led me to believe that filp_open checks the capabilities of the calling process. This would make sense as it is used internally by open(), but I am in kernel mode here! There must be a way!
I am very new to programming in kernel space. I have extensive application C experience, but I am finding it difficult to navigate the kernel documentation for precisely what I am looking for. Additionally, it seems that more and more symbols within the kernel are not exported for use in modules. As I am developing an exploit proof of concept, I would like it to work without recompiling the kernel. I am finding a lot of code (vfs and syscalls) that is deprecated as the symbols are no longer exported to kernel modules.
Is what I am trying to do a thing that is specifically engineered against? Loading a kernel module requires root to begin with, so I would see this more in the lens of a persistence focused attack rather than an access one.
Also, I got the proof of concept working by just reading from the file when the module is loaded, but this is no fun! Any pointers here are much appreciated.
After some rethinking and digging I have found two solutions to my problem. Thank you to Tsyvarev and stark for the pointers.
Solution 1
The first solution is to elevate the privileges of the calling process before calling filp_open. This is also basically making a rootkit, so not as interesting.
Here is a link to the guide that I found on the subject.
Solution 2
The module will have an init function that by nature must be run with elevated privileges when the module is loaded. So you can open the file pointer there and just close it when the module is unloaded. The caveat is that you have the file pointer open the whole time, so all of the gotchas there are still present. It is better to only read; writing is where things can get a bit tricky. This is the solution I chose in the interim, as I didn't want this thing to be a full rootkit.
Another direction is to use a workqueue or to spawn a kernel thread. It is probably the trickiest option, but also the most in line with my original vision of this demo. I did not test this direction, but it is probably the best solution.

intercepting file system system calls

I am writing an application for which I need to intercept some filesystem system calls, e.g. unlink. Say I want to protect a file abc: if the user deletes it, I need to copy it somewhere else first, so I need unlink to call my code before abc is deleted. I have gone through threads about intercepting system calls, but methods like LD_PRELOAD won't work in my case because I want this to be secure and implemented in the kernel. inotify notifies after the event, so I would not be able to save the file. Could you suggest a suitable method? I would like to implement this in a kernel module instead of modifying the kernel code itself.
Another method was suggested by Graham Lee. I had thought of it, but it has some problems: I would need a hard-link mirror of all the files. It consumes no space, but it could still be problematic because I would have to re-mirror the drive repeatedly to keep the mirror up to date. It also won't work across partitions or on partitions that don't support links. So I want a solution that lets me attach hooks to the files/directories and watch for changes instead of scanning repeatedly.
I would also like to handle writes to files that get modified, for which I cannot use hard links.
I would like to intercept system calls by replacing system calls but I have not been able to find any method of doing that in linux > 3.0. Please suggest some method of doing that.
As far as hooking into the kernel and intercepting system calls go, this is something I do in a security module I wrote:
Look at hijacks.c and symbols.c for the code; how they're used is in the hijack_syscalls function inside security.c. I haven't tried this on linux > 3.0 yet, but the same basic concept should still work.
It's a bit tricky, and you may have to write a good deal of kernel code to do the file copy before the unlink, but it's possible here.
One suggestion could be Filesystems in Userspace (FUSE). That is, write a FUSE module (which is, granted, in userspace) which intercepts filesystem-related syscalls, performs whatever tasks you want, and possibly calls the "default" syscall afterwards.
You could then mount certain directories with your FUSE filesystem and, for most of your cases, it seems like the default syscall behavior would not need to be overridden.
You can watch unlink events with inotify, though this might happen too late for your purposes (I don't know because I don't know your purposes, and you should experiment to find out). The in-kernel alternatives based on LSM (by which I mean SMACK, TOMOYO and friends) are really for Mandatory Access Control so may not be suitable for your purposes.
If you want to handle deletions only, you could keep a "shadow" directory of hardlinks (created via link) to the files being watched (via inotify, as suggested by Graham Lee).
If the original is now unlinked, you still have the shadow file to handle as you want to, without using a kernel module.
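A minimal sketch of the shadow-directory idea, with made-up function names and the inotify part left out. It assumes the shadow directory sits on the same filesystem as the watched file and that the filesystem supports hard links:

```python
import os


def shadow_link(path, shadow_dir):
    """Hard-link `path` into `shadow_dir` so its data survives an unlink."""
    os.makedirs(shadow_dir, exist_ok=True)
    target = os.path.join(shadow_dir, os.path.basename(path))
    os.link(path, target)  # fails with EXDEV across filesystems
    return target


def was_deleted(shadow_path):
    """True once the shadow entry is the only remaining link to the inode."""
    return os.stat(shadow_path).st_nlink == 1
```

Because both names point at the same inode, this only protects against deletion: an in-place write to the original changes the shadow as well.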

Can regular file reading benefit from nonblocking I/O?

It seems not to me and I found a link that supports my opinion. What do you think?
The content of the link you posted is correct. A regular file, opened in non-blocking mode, will always be "ready" for reading; when you actually try to read it, blocking (or more accurately, as your source points out, sleeping) will occur until the operation can succeed.
In any case, I think your source needs some sedatives. One angry person, that is.
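The "always ready" behaviour is easy to check. Here is a small sketch (Linux assumed, function name made up) that opens a regular file with O_NONBLOCK, polls it with select(), and then reads from it:

```python
import os
import select


def probe_regular_file(path):
    """Return (reported_ready, first_bytes) for a file opened O_NONBLOCK."""
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        # Zero timeout: ask whether the descriptor is readable right now.
        readable, _, _ = select.select([fd], [], [], 0)
        # For a regular file this read never fails with EAGAIN; it simply
        # sleeps inside the kernel if the data has to come from disk.
        data = os.read(fd, 16)
        return fd in readable, data
    finally:
        os.close(fd)
```

select() reports the descriptor as readable whether or not the data is already cached, which is why the flag buys nothing here.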
I've been digging into this quite heavily for the past few hours and can attest that the author of the link you cited is correct. However, there appears to be "better" (using that term very loosely) support for non-blocking I/O against regular files in the native Linux kernel since v2.6. The "libaio" package contains a library that exposes the functionality offered by the kernel, but it has some caveats about which types of file systems are supported, and it's not portable to anything outside of Linux 2.6+.
And here's another good article on the subject.
You're correct that nonblocking mode has no benefit for regular files, and is not allowed to. It would be nice if there were a secondary flag that could be set, along with O_NONBLOCK, to change this, but due to the way cache and virtual memory work, it's actually not an easy task to define what correct "non-blocking" behavior for ordinary files would mean. Certainly there would be race conditions unless you allowed programs to lock memory associated with the file. (In fact, one way to implement a sort of non-sleeping IO for ordinary files would be to mmap the file and mlock the map. After that, on any reasonable implementation, read and write would never sleep as long as the file offset and buffer size remained within the bounds of the mapped region.)

how to monitor the syslog(printk) in a LKM

Dear all,
I am a newbie at writing Linux kernel modules.
I used the printk function in the Linux kernel source code (2.4.29) for debugging and displaying messages.
Now I have to read all the messages I added via httpd.
I tried writing the messages into a file instead of using printk, so that I could read the file directly,
but it doesn't work very well.
So I have a stupid question...
Is it possible to write an LKM that monitors the syslog and rewrites it into another file?
I mean, is it possible to make an LKM aware of the messages each time the Linux kernel executes "printk"?
Thanks a lot
That is the wrong way to do it, because printk already does this: its messages are made available through the file /proc/kmsg.
What you want is klogd, a user space utility dealing with /proc/kmsg.
Another option is to use dmesg, which will output the whole content of the kernel buffers holding the printk messages, but I suggest you first read the linked article.
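To show what a klogd-style reader does, here is a rough userspace sketch that copies messages from /proc/kmsg into an ordinary file. It assumes each record carries a "<N>" log-level prefix, needs root to read /proc/kmsg, and competes with klogd if both read at once. The function names are made up.

```python
import re

_LEVEL = re.compile(r"^<(\d+)>(.*)$", re.DOTALL)


def split_kmsg_line(line):
    """Split '<6>text' into (6, 'text'); level is None without a prefix."""
    line = line.rstrip("\n")
    m = _LEVEL.match(line)
    if m is None:
        return None, line
    return int(m.group(1)), m.group(2)


def copy_kernel_log(dst_path, src_path="/proc/kmsg", max_level=7):
    """Append messages at or above the given severity to dst_path.

    Lower numbers are more severe. On /proc/kmsg the loop blocks waiting
    for new messages and never returns on its own.
    """
    with open(src_path, "r", errors="replace") as src, open(dst_path, "a") as dst:
        for line in src:
            level, text = split_kmsg_line(line)
            if level is None or level <= max_level:
                dst.write(text + "\n")
                dst.flush()
```

The resulting file can then be served by httpd like any other, with no kernel module involved.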
You never, ever, ever want to try to open a file on a user space mounted block file system from within the kernel. Imagine if the FS aborted and the kernel was still trying to write to it .. kaboom (amongst MANY other reasons why it's a bad idea) :) As shodanex said, for your purposes, it's much better to use klogd.
Now, generally speaking, you have several ways to communicate meaningful data to userspace programs, such as:
Create a character device driver that causes userspace readers to block while waiting for data. Provide an ioctl() interface to it which lets other programs find out how many messages have been sent, etc.
Create a node in /proc/yourdriver to accomplish the same thing
Really, the most practical means is to just use printk()
