File Writing through C program - linux

When we open a file using fopen() in C (Ubuntu platform and gcc compiler) and write to it, is the content written directly to the hard disk address where the file resides, or is it first brought into primary memory?
What is the actual process by which a file is written to or read from its location on the hard disk through a C program in Linux?

The C library does not perform the actual write to disk; that is the job of the operating system. The C library makes a system call asking the kernel to write the data. It may also keep its own buffer to minimize the number of system calls, and the kernel keeps buffers of its own to optimize the real writes to disk. In general, when working in C you do not need to think at this low a level. However, you do need to make sure that you close the file correctly. The actual disk management is the job of the OS.
The Design of the UNIX Operating System by Maurice J. Bach contains a nice explanation of the Unix kernel. You may want to look at it as a starting point.

Under UNIX-like systems, generally, there are two levels of caching when writing information to a file on disk.
The first is in the C run time libraries, where it's likely to be buffered (unless you turn off buffering in some manner). You can use a C call like fflush to flush these buffers.
The second is at the operating system level, where buffers are held, before being written to the physical disk. A call to fsync can force these buffers to be flushed to disk.

Related

Ensuring completeness of file writes on Linux ext4 file system

Our (embedded) Linux system has an ext4 file system. Now, one of our apps there needs to modify data files using simple file write APIs. The requirement there is that the file updates should be atomic - not in the sense of parallel writes from different apps (we don't have that), but in the sense that each write can't be partially executed in case of a power failure - it can either be fully executed or not. Is this guaranteed? I'm aware of the fact that file writes may not be executed immediately due to caching, but I'm not sure whether these writes can be split by the cache in a way they may become partial, hence my question.
I can alternatively use a copy-write-rename method: copy the original file to a temporary one, make the changes there, and then rename the file back to the original one, counting on the atomic nature of the rename operation. But even then I'm not sure that these operations are guaranteed to be ordered the way I want (especially the write and the rename).
A possibility might be to use (in your user-mode application) the sync(2) system call. Before that, use fflush(3) if using stdio.
To ensure atomicity, you may need to check a lot of code (perhaps even inside the kernel) with static analysis tools like Frama-C, Bismon, or the DECODER project. Of course, this is very costly (above 100k€ or US$ in 2021). Feel free to contact me by email about them. Be aware of Rice's theorem.
At the kernel (or hardware) level, atomicity cannot be guaranteed: for example, a successful write(2) system call of four megabytes (by your application) very probably involves many frames or packets on the SATA cable to your hard disk. If power is lost partway through, data will be lost.
Don't forget that the Linux kernel and GNU libc are open source. You are allowed to study their source code and improve them.
Consider also a hardware approach : adding some UPS.
Another possibility is to extend your C compiler, e.g. coding your GCC plugin, to semi-automatically add calls to sync(2)
Yet another possibility is to generate your C code (e.g. with RefPerSys or GPP or your own C code generator). Jacques Pitrat's last book Artificial Beings, the conscience of a conscious machine explains in detail how to do so.
See also my sync-periodically.c program (GPLv3+ licensed; so no warranty).
You could also improve some open source compiler generating C (like Bigloo) to emit at suitable places calls to sync(2).
PS. Things are more complex if your embedded software is multi-threaded (using several pthreads or processes), or if your hardware has several disks or SSD, or is in space (cosmic rays?) or inside a nuclear power station (radioactivity?)

Telling Linux not to keep a file in the cache when it is written to disk

I am writing a large file to disk from a user-mode application. In parallel to it, I am writing one or more smaller files. The large file won't be read back anytime soon, but the small files could be. I have enough RAM for the application + smaller files, but not enough for the large file. Can I tell the OS not to keep parts of the large file in cache after they are written to disk so that more cache is available for smaller files? I still want writes to the large file to be fast enough.
Can I tell the OS not to keep parts of the large file in cache ?
Yes, you probably want to use some system call like posix_fadvise(2) or madvise(2). In weird cases, you might use readahead(2) or userfaultfd(2) or Linux-specific flags to mmap(2). Or very cleverly handle SIGSEGV (see signal(7), signal-safety(7) and eventfd(2) and signalfd(2)) You'll need to write your C program doing that.
But I am not sure that it is worth your development efforts. In many cases, the behavior of a recent Linux kernel is good enough.
See also proc(5) and linuxatemyram.com
You may want to read the GC handbook. It is relevant to your concerns.
Consider studying for inspiration the source code of existing open-source software such as GCC, Qt, RefPerSys, PostgreSQL, GNU Bash, etc...
Most of the time, it is simply not worth the effort to explicitly code something to manage your page cache.
I guess that mount(2) options in your /etc/fstab file (see fstab(5)...) are in practice more important. Or changing or tuning your file system (e.g. ext4(5), xfs(5)..). Or read(2)-ing in large pieces (1Mbytes).
Play with dd(1) to measure. See also time(7)
Most applications are not disk-bound, and for those that are disk-bound, renting more disk space is cheaper than adding and debugging extra code.
Don't forget to benchmark, e.g. using strace(1) and time(1).
PS. Don't forget your developer costs. They often are a lot above the price of a RAM module (or of some faster SSD disk).

If the size of the file exceeds the maximum size of the file system, what happens?

For example, in a FAT32 partition, the maximum file size is 4GB, but I was able to create a 5GB file with vim. I saved the file and opened it again, and the console output was broken like a staircase. I have three questions.
If the size of the file exceeds the maximum size of the file system, what happens?
In my case, why did the output break?
In Unix system call, stat() can succeed up to a 2GB(2^31 - 1). Does this have anything to do with the file system? Is there a relationship between the limits of data in stat() and the limits of each feature in the file system?
If the size of the file exceeds the maximum size of the file system, what happens?
By definition, that can never happen. What really happens is that some system call (probably write(2) ...) fails, and the code making that call should take care of that case.
Notice that FAT32 filesystems restrict the maximal size of a file to 4 GiB minus one byte. Use a better file system on your USB key if you want more (or split(1) large files into smaller chunks before copying them to your FAT32-formatted USB key).
If using <stdio.h> notice that fflush(3), fprintf(3), fclose(3) (and most other standard functions) can fail (e.g. because they will do some failing write(2)).
the console output was broken like a staircase
Probably because your pseudoterminal was in some broken state. See stty(1), reset(1), termios(3) and read the tty demystified.
In Unix system call, stat() can succeed up to a 2GB(2^31 - 1)
You are misunderstanding stat(2). Read again its documentation
Read Advanced Linux Programming then syscalls(2).
I was able to create a 5GB file with vim
To understand the behavior of vim read first its documentation then study its source code (it is free software, and you can and perhaps should study its code).
You could also use strace(1) to understand what system calls are done by some command or process.

Why are these special device file reads a minimum of PAGE SIZE bytes?

I am coding my 2nd kernel module ever. I am attempting to provide user-space access to a firmware core, as a demo. The demo is under petalinux (an embedded OS specifically tailored to Zynq or Microblaze). I added virtual file system hooks to go between user space and the kernel module, and it seems to work, both on read and write. The only hiccup is that, somewhere between my user application and my kernel module, the OS balloons the size of my request up to PAGE SIZE (4096).
A co-worker commented that I might be mounting the module as a block device rather than a character device. This makes a lot of sense. Someone upstream of my module is certainly caching my results (which, if my understanding of block drivers is accurate, would make perfect sense for, say, the hard drive), but we're tied to a volatile device, so this isn't appropriate. But all the diagnostics I've been able to find suggest that it is mounted as a character device...
mknod /dev/myModule **c** (Dynamically specified Major Number) (Zero)
ls -la /dev/myModule
**c**rw-r--r-- 1 root root 252, - Jan 1 01:05 myModule
Here is the module source I am using to register the virtual file IO hooks.....
alloc_chrdev_region (&moduleMajorNumber, 0, 1, "moduleLayerCDMA");
register_chrdev_region (&moduleMajorNumber, 1, "moduleLayerCDMA");
cdevP = cdev_alloc();
cdevP->ops = &moduleLayerCDMA_fileOperations;
cdevP->owner = THIS_MODULE;
cdev_add(cdevP, moduleMajorNumber, 1);
Any clues?
Your problem comes from the fact that the standard C library buffered I/O routines (fopen, fclose, fread, fgetc & their friends) keep a user-space buffer for every opened file/device. When your program tries to read from that file/device, the library routines try to do read-ahead, to prepare for later read calls and increase the efficiency of the I/O. Similarly, writes with fwrite go through a write buffer, and only get flushed to the system with a system call when the buffer gets full, when closing the file/device, or when explicitly calling fflush.
There are two ways to solve the issue:
The easier might be to simply convert your user-space program to use non-buffered I/O (open, close, read, write & their friends), these are simply making the corresponding system call on a 1:1 basis.
Or handle the problem in your kernel module: disregard the number of bytes asked for in a read if it is more than what you'd like to return in a single system call. You can look at that value as the length of the buffer provided by the caller, and you don't necessarily have to fill it up completely. Of course, in the return value, you have to indicate how many bytes were actually read.

How to ensure read() to read data from the real device each time?

I'm periodically reading from a file and checking the readout to decide subsequent action. As this file may be modified by some mechanism which will bypass the block file I/O manipulation layer in the Linux kernel, I need to ensure the read operation reading data from the real underlying device instead of the kernel buffer.
I know fsync() can make sure all I/O write operations completed with all data written to the real device, but it's not for I/O read operations.
The file has to be kept opened.
So could anyone please tell me how I can meet this requirement on a Linux system? Is there an API similar to fsync() that can be called?
Really appreciate your help!
I believe that you want to use the O_DIRECT flag to open().
I think memory mapping in combination with madvise() and/or posix_fadvise() should satisfy your requirements... Linus contrasts this with O_DIRECT at http://kerneltrap.org/node/7563 ;-).
You are going to be in trouble if another device is writing to the block device at the same time as the kernel.
The kernel assumes that the block device won't be written by any other party than itself. This is true even if the filesystem is mounted readonly.
Even if you used direct IO, the kernel may cache filesystem metadata, so a change in the location of those blocks of the file may result in incorrect behaviour.
So in short - don't do that.
If you wanted, you could access the block device directly - which might be a more successful scheme, but still potentially allowing harmful race-conditions (you cannot guarantee the order of the metadata and data updates by the other device). These could cause you to end up reading junk from the device (if the metadata were updated before the data). You'd better have a mechanism of detecting junk reads in this case.
I am of course, assuming some very simple braindead filesystem such as FAT. That might reasonably be implemented in userspace (mtools, for instance, does)

Resources