Getting disk sector size without raw filesystem permission - linux

I'm trying to get the sector size, specifically so I can correctly size the buffer for reading/writing with O_DIRECT.
The following code works when my app's run as root:
#include <fcntl.h>      /* open(), O_RDONLY, O_NONBLOCK */
#include <sys/ioctl.h>  /* ioctl() */
#include <linux/fs.h>   /* BLKSSZGET */
int fd = open("/dev/xvda1", O_RDONLY|O_NONBLOCK);
int blockSize;          /* BLKSSZGET stores an int, so don't pass a size_t */
int rc = ioctl(fd, BLKSSZGET, &blockSize);
How can I get the sector size without it being run as root?

According to the Linux manpage for open():
In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
So it looks like you may be able to obtain this information using xfsctl()... if you are using XFS.
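For completeness, the XFS query looks roughly like this. This is only a sketch: it assumes the xfsprogs development headers are installed, the file really lives on XFS, and the path is made up; the call and struct dioattr are documented in xfsctl(3).

#include <xfs/xfs.h>    /* xfsctl(), XFS_IOC_DIOINFO, struct dioattr */
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/mnt/xfs/datafile";         /* hypothetical path */
    int fd = open(path, O_RDONLY);
    struct dioattr da;

    if (fd >= 0 && xfsctl(path, fd, XFS_IOC_DIOINFO, &da) == 0)
        printf("buffer alignment %u, min I/O %u, max I/O %u\n",
               da.d_mem, da.d_miniosz, da.d_maxiosz);
    return 0;
}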
Since your underlying block device is a Xen virtual block device and there might be any number of layers below that (LVM, dm-crypt, another filesystem, etc...) I'm not sure how meaningful all of this will really be for you.

You could use the stat(2) syscall (or one of its relatives) on some particular file, then use the st_blksize field. However, that gives a filesystem-related block size, not the sector size preferred by the hardware. But for O_DIRECT I/O on a file that lives in a filesystem, st_blksize may well be the more relevant number.
Otherwise, I would suggest a power-of-two size, perhaps 8 KB or 64 KB, for your O_DIRECT reads (and you may want to align your read buffer to the page size, usually 4 KB).
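To make that concrete, here is a minimal sketch of the stat()-based fallback plus a page-aligned buffer. The path, the 4 KB alignment and the 64 KB transfer size are example values of my own, not anything mandated by the kernel.

#include <sys/stat.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    struct stat st;
    if (stat("/path/to/datafile", &st) != 0)        /* hypothetical path */
        return 1;
    printf("preferred I/O block size: %ld\n", (long)st.st_blksize);

    /* O_DIRECT buffers are safest when page-aligned and sized as a
       multiple of the block size; 64 KB is just an example. */
    void *buf;
    if (posix_memalign(&buf, 4096, 64 * 1024) != 0)
        return 1;
    /* ... use buf with read()/write() on an O_DIRECT file descriptor ... */
    free(buf);
    return 0;
}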

Related

Efficiently inserting blocks into the middle of a file

I'm looking for, essentially, the ext4 equivalent of mremap().
I have a big mmap()'d file that I'm allocating arrays in, and the arrays need to grow. So I want to make the first array larger at its current location, and budge all the other arrays along in the file and the address space to make room.
If this was just anonymous memory, I could use mremap() to budge over whole pages in constant time, as long as I'm inserting a whole number of memory pages. But this is a disk-backed file, so the data needs to move in the file as well as in memory.
I don't actually want to read and then rewrite whole blocks of data to and from the physical disk. I want the data to stay on disk in the physical sectors it is in, and to induce the filesystem to adjust the file metadata to insert new sectors where I need the extra space. If I have to keep my inserts to some multiple of a filesystem-dependent disk sector size, that's fine. If I end up having to copy O(N) sector or extent references around to make room for the inserted extent, that's fine. I just don't want to have 2 gigabytes move from and back to the disk in order to insert a block in the middle of a 4 gigabyte file.
How do I accomplish an efficient insert by manipulating file metadata? Is a general API for this actually exposed in Linux? Or one that works if the filesystem happens to be e.g. ext4? Will a write() call given a source address in the memory-mapped file reduce to the sort of efficient shift I want under the right circumstances?
Is there a C or C++ API function with the semantics "copy bytes from here to there and leave the source with an undefined value" that I should be calling in case this optimization gets added to the standard library and the kernel in the future?
I've considered just always allocating new pages at the end of the file, and mapping them at the right place in memory. But then I would need to work out some way to reconstruct that series of mappings when I reload the file. Also, shrinking the data structure would be a nontrivial problem. At that point, I would be writing a database page manager.
I think I actually may have figured it out.
I went looking for "linux make a file sparse", which led me to this answer on Unix & Linux Stack Exchange, which mentioned the fallocate command-line tool. The fallocate tool has a --dig-holes option, which turns parts of a file that could be represented by holes into actual holes.
I then went looking for "fallocate dig holes" to find out how that works, and I got the fallocate man page. I noticed it also offers a way to insert a hole of some size:
-i, --insert-range
       Insert a hole of length bytes from offset, shifting existing data.
If a command-line tool can do it, Linux can do it, so I dug into the source code for fallocate, which you can find on GitHub:
case 'i':
        mode |= FALLOC_FL_INSERT_RANGE;
        break;
It looks like the fallocate tool accomplishes a cheap hole insert (and a move of all the other file data) by calling the fallocate() Linux-specific function with the FALLOC_FL_INSERT_RANGE flag, added in Linux 4.1. This flag won't work on all filesystems, but it does work on ext4 and it does exactly what I want: adjust the file metadata to efficiently free up some space in the file's offset space at a certain point.
It's not immediately clear to me how this interacts with currently memory-mapped pages, but I think I can work with this.
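The same operation from C looks roughly like the following. A sketch only: the file name and sizes are invented, and per fallocate(2) both the offset and the length must be multiples of the filesystem block size or the call fails with EINVAL.

#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_INSERT_RANGE */
#include <stdio.h>

int main(void)
{
    int fd = open("bigfile.dat", O_RDWR);           /* hypothetical file */
    if (fd < 0)
        return 1;

    /* Shift everything from offset 1 MiB onward another 64 KiB into the
       file, leaving a 64 KiB hole; needs Linux >= 4.1 and e.g. ext4 or XFS. */
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 1024 * 1024, 64 * 1024) != 0)
        perror("fallocate");
    return 0;
}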

Will the Linux kernel read before writing data smaller than the filesystem block size?

For example, the filesystem block size is 4k, but I only write 1 byte to the file using direct IO. Will the kernel read that block into the page cache before writing?
Maybe. Direct IO on Linux is highly variable, and depends on the underlying filesystem.
A 1-byte direct IO write to a file system with a 4k block size may fail entirely, may fall back on using the page cache, or it may go via direct IO right to the file.
And unless it fails, you won't be able to tell very easily.
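If you want to see which behaviour you get, the only practical approach is to try it and look at the result. A sketch (the file name is made up; EINVAL is the failure you typically see for misaligned direct I/O, but even that is not guaranteed):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }
    char c = 'x';
    ssize_t n = write(fd, &c, 1);   /* 1 byte from an unaligned buffer */
    if (n < 0)
        fprintf(stderr, "1-byte direct write failed: %s\n", strerror(errno));
    else
        printf("1-byte direct write succeeded (%zd byte)\n", n);
    close(fd);
    return 0;
}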

If the size of the file exceeds the maximum size of the file system, what happens?

For example, on a FAT32 partition the maximum file size is 4 GB, but I was able to create a 5 GB file with vim. When I saved the file and opened it again, the console output was broken like a staircase. I have three questions.
If the size of the file exceeds the maximum size of the file system, what happens?
In my case, why does it break?
The Unix system call stat() can only succeed up to 2 GB (2^31 - 1). Does this have anything to do with the file system? Is there a relationship between the limits of the data stat() reports and the limits of each feature of the file system?
If the size of the file exceeds the maximum size of the file system, what happens?
By definition, that can never happen. What really happens is that some system call (probably write(2) ...) fails, and the code doing the writing should handle that case.
Notice that FAT32 filesystems restrict the maximal size of a file to 4 gigabytes minus one byte (2^32 - 1). Use a better file system on your USB key if you want more (or split(1) large files into smaller chunks before copying them to your FAT32-formatted USB key).
If you use <stdio.h>, notice that fflush(3), fprintf(3), fclose(3) (and most other standard I/O functions) can fail, e.g. because the underlying write(2) fails.
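In practice that means checking the result of every write instead of ignoring it. A minimal sketch (EFBIG is the errno you would typically see when a write would push a file past the filesystem's maximum file size, ENOSPC when the device is full):

#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a whole buffer, reporting failures such as EFBIG or ENOSPC. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, just retry */
            fprintf(stderr, "write: %s\n", strerror(errno));
            return -1;                  /* e.g. EFBIG at the FAT32 4 GB limit */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}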
the console output was broken like a staircase
probably because your pseudoterminal was left in some broken state. See stty(1), reset(1), termios(3), and read "The TTY demystified".
In Unix system call, stat() can succeed up to a 2GB(2^31 - 1)
You are misunderstanding stat(2). Read its documentation again: st_size is an off_t, which is 64 bits on 64-bit systems (and on 32-bit systems when _FILE_OFFSET_BITS is set to 64), so stat() itself is not limited to 2 GB.
Read Advanced Linux Programming then syscalls(2).
I was able to create a 5GB file with vim
To understand the behavior of vim, first read its documentation, then study its source code (it is free software, and you can, and perhaps should, read its code).
You could also use strace(1) to understand what system calls are done by some command or process.

Why are these special device file reads a minimum of PAGE SIZE bytes?

I am coding my 2nd kernel module ever. I am attempting to provide user-space access to a firmware core, as a demo. The demo is under petalinux (an embedded OS specifically tailored to Zynq or Microblaze). I added virtual file system hooks to go between user space and the kernel module, and it seems to work, both on read and write. The only hiccup is that, somewhere between my user application and my kernel module, the OS balloons the size of my request up to PAGE SIZE (4096).
A co-worker commented that I might have registered the module as a block device rather than a character device. This makes a lot of sense. Someone upstream of my module is certainly caching my results (which, if my understanding of block drivers is accurate, would make perfect sense for, say, a hard drive), but we're tied to a volatile device, so this isn't appropriate. But all the diagnostics I've been able to find suggest that it is registered as a character device...
mknod /dev/myModule c (dynamically specified major number) (zero)
ls -la /dev/myModule
crw-r--r-- 1 root root 252, 0 Jan 1 01:05 myModule
Here is the module source I am using to register the virtual file IO hooks.....
/* Dynamically allocate a major number for one char device.
   (No separate register_chrdev_region() call is needed after this;
   alloc_chrdev_region() already registers the region.) */
alloc_chrdev_region(&moduleMajorNumber, 0, 1, "moduleLayerCDMA");

cdevP = cdev_alloc();
cdevP->ops = &moduleLayerCDMA_fileOperations;
cdevP->owner = THIS_MODULE;
cdev_add(cdevP, moduleMajorNumber, 1);
Any clues?
Your problem comes from the fact that the standard C library's buffered I/O routines (fopen, fclose, fread, fgetc and friends) keep a user-space buffer for every opened file/device. When your program reads from that file/device, the library routines read ahead to prepare for later read calls and improve I/O efficiency. Similarly, writes made with fwrite go through a write buffer and are only flushed to the kernel with a system call when the buffer fills, when the file/device is closed, or on an explicit fflush.
There are two ways to solve the issue:
The easier one might be to convert your user-space program to unbuffered I/O (open, close, read, write and friends); these map 1:1 onto the corresponding system calls.
Or handle the problem in your kernel module: disregard the number of bytes asked for in a read if it is more than what you'd like to return in a single system call. Treat that value as the length of the buffer provided by the caller; you don't necessarily have to fill it completely. Of course, in the return value you have to indicate how many bytes were actually read.
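Inside the module that might look roughly like the following. This is only a sketch: the function body and the single-register transfer are invented; only moduleLayerCDMA_fileOperations comes from the code in the question.

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/uaccess.h>

/* Hooked up as .read in moduleLayerCDMA_fileOperations. */
static ssize_t moduleLayerCDMA_read(struct file *filp, char __user *buf,
                                    size_t count, loff_t *ppos)
{
    u32 value = 0;                          /* e.g. one register of the core */
    size_t len = min(count, sizeof(value));

    /* value = ioread32(...);  -- fetch from the firmware core here */

    if (copy_to_user(buf, &value, len))
        return -EFAULT;
    return len;     /* may be far less than 'count'; the caller must cope */
}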

open limitation based on file size

Is there any limitation on open() based on file size?
My file size is 2 GB; will it open successfully, and can any timing issue arise?
The filesystem is rootfs.
From the open man page:
O_LARGEFILE
(LFS) Allow files whose sizes cannot be represented in an off_t
(but can be represented in an off64_t) to be opened. The
_LARGEFILE64_SOURCE macro must be defined in order to obtain
this definition. Setting the _FILE_OFFSET_BITS feature test
macro to 64 (rather than using O_LARGEFILE) is the preferred
method of accessing large files on 32-bit
systems (see feature_test_macros(7)).
On a 64-bit system, off_t will be 64 bits and you'll have no problem. On a 32-bit system, you'll need the suggested workaround to allow for files larger than 2 GB.
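On a 32-bit build the workaround is just a macro defined before any system header (or, equivalently, the -D_FILE_OFFSET_BITS=64 compiler flag). A sketch with a made-up path:

#define _FILE_OFFSET_BITS 64    /* make off_t 64-bit on 32-bit systems */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    int fd = open("/data/big_2gb_file", O_RDONLY);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct stat st;
    if (fstat(fd, &st) == 0)
        printf("size: %lld bytes\n", (long long)st.st_size);
    return 0;
}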
rootfs may not support large files; consider using a proper filesystem instead (tmpfs is almost the same as rootfs, but with more features).
rootfs is intended only for booting and early use.
