I tried to ask on FUSE's mailing list but I haven't received any response so far... I have a couple of questions. I'm going to implement a low-level FUSE file system and watch over fuse_chan's descriptor with epoll.
I have to fake inodes for all objects in my file system, right? Are there any rules about choosing inodes for objects in VFS (e.g. do I have to use only positive values, or can I use values in some range)?
Can I make fuse_chan's descriptor nonblocking? If yes, please tell me whether I can assume that fuse_chan_recv()/fuse_chan_send() will receive/send a whole request structure, or do I have to override them with functions handling partial sends and receives?
What about buffer size? I see that in fuse_loop() a new buffer is allocated for each call, so I assume that the buffer size is not fixed. However, maybe there is some maximum possible buffer size? I can then allocate a larger buffer and reduce memory allocation operations.
(1) Inodes are defined as unsigned integers, so in theory, you could use any values.
However, since there could be programs which are not careful, I'd play it safe and only use non-zero, positive integers up to INT_MAX.
(2) Fuse uses a special kernel device. While fuse_chan_recv() does not support partial reads, that should not be needed, as the kernel should not return partial packets anyway.
(3) Path names in Linux are at most 4096 chars. This puts a limit on the buffer size:
$ grep PATH_MAX /usr/include/linux/limits.h
#define PATH_MAX 4096 /* # chars in a path name including nul */
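As an illustration of (3): if you are on the libfuse 2.x low-level API, the channel itself reports a buffer size via fuse_chan_bufsize(), so you can allocate one buffer up front and reuse it for every request -- which is roughly what fuse_loop()/fuse_session_loop() do internally. A minimal sketch (verify these function names against your libfuse version):
#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>
#include <errno.h>
#include <stdlib.h>

/* Allocate the channel-sized buffer once and reuse it for every request
 * (libfuse 2.x names; check your headers). */
static int my_loop(struct fuse_session *se)
{
    struct fuse_chan *ch = fuse_session_next_chan(se, NULL);
    size_t bufsize = fuse_chan_bufsize(ch);   /* upper bound on request size */
    char *buf = malloc(bufsize);
    if (!buf)
        return -1;

    while (!fuse_session_exited(se)) {
        struct fuse_chan *tmpch = ch;
        int res = fuse_chan_recv(&tmpch, buf, bufsize);
        if (res == -EINTR)
            continue;             /* interrupted, retry */
        if (res <= 0)
            break;                /* error or unmount */
        fuse_session_process(se, buf, res, tmpch);
    }

    free(buf);
    return 0;
}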
man 2 read says:
EINVAL fd is attached to an object which is unsuitable for reading; or
the file was opened with the O_DIRECT flag, and either the address
specified in buf, the value specified in count, or the current file
offset is not suitably aligned.
Non-direct I/O has no such limits, but why does direct I/O require alignment?
(Kernel 2.6+) This is because direct I/O can be zero-copy from the kernel's perspective (i.e. no more copying takes place in the kernel of the data itself) and disks have a minimum addressable size of I/O known as their "logical block size" (often 512 bytes but might be 4096 bytes or even more). This O_DIRECT requirement (must obey logical block size alignments) is actually described in the man page for open() (see the O_DIRECT section under NOTES).
In the buffered I/O case, the kernel copies data out of the userspace addresses and into its own internal page cache pages (which obey all the alignment rules and will do read-modify-write if necessary to make sure everything is aligned) and then tells the device to do I/O to/from the page cache locations. In the direct I/O case, when everything goes correctly, the memory allocated to your userspace program is the same memory handed to the device to do I/O to/from, and thus your program must obey the alignment because there's nothing in between that will fix things up.
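To make the alignment rule concrete, here is a hedged sketch (the device path is a placeholder): it asks the device for its logical block size with the BLKSSZGET ioctl, allocates an aligned buffer with posix_memalign(), and issues an O_DIRECT read whose size and offset are multiples of that block size. Misaligning any of the three typically gets you EINVAL.
#define _GNU_SOURCE               /* for O_DIRECT */
#include <fcntl.h>
#include <linux/fs.h>             /* BLKSSZGET */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    int lbs = 512;                        /* logical block size, 512 fallback */
    if (ioctl(fd, BLKSSZGET, &lbs) < 0)
        lbs = 512;

    void *buf;
    if (posix_memalign(&buf, lbs, 64 * 1024)) {       /* aligned buffer */
        perror("posix_memalign");
        return 1;
    }

    /* count and offset are both multiples of the logical block size */
    ssize_t n = pread(fd, buf, 64 * 1024, 0);
    printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}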
I'm creating a Linux device driver that creates a character device.
The data that it returns on reads is logically divided into 16-byte units.
I was planning on implementing this division by returning however many units fit into the read buffer, but I'm not sure what to do if the read buffer is too small (<16 bytes).
What should I do here? Or is there a better way to achieve the division I'm trying to represent?
You could act like the datagram socket device driver: it always returns just a single datagram. If the read buffer is smaller, the excess is discarded -- it's the caller's responsibility to provide enough space for a whole datagram (typically, the application protocol specifies the maximum datagram size).
The documentation of your device should specify that it works in 16-byte units, so there's no reason why a caller would want to provide a buffer smaller than this. So any lost data due to the above discarding could be considered a bug in the calling application.
However, it would also be reasonable to return more than 16 bytes at a time if the caller asks for it -- that suggests that the application will split the data up into units itself. This could perform better, since it minimizes system calls. But if the buffer isn't a multiple of 16, you could discard the remainder of the last unit. Just make sure this is documented, so they know to make it a multiple.
If you're worried about generic applications like cat, I don't think you need to. I would expect them to use very large input buffers, simply for performance reasons.
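To make the "whole units only" policy concrete, here is a minimal sketch of such a read() handler (the handler name and the fill_unit() data source are hypothetical): it refuses buffers smaller than one unit, rounds larger buffers down to a multiple of 16, and therefore always returns whole units.
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/uaccess.h>

#define UNIT_SIZE 16

/* Hypothetical data source: fills one 16-byte unit, returns 0 when empty. */
static int fill_unit(char *unit)
{
    memset(unit, 0xAB, UNIT_SIZE);        /* stand-in payload */
    return 1;
}

static ssize_t mydev_read(struct file *filp, char __user *buf,
                          size_t count, loff_t *ppos)
{
    char unit[UNIT_SIZE];
    size_t copied = 0;

    if (count < UNIT_SIZE)
        return -EINVAL;                   /* buffer can't hold a whole unit */

    count -= count % UNIT_SIZE;           /* round down to whole units */

    while (copied < count) {
        if (!fill_unit(unit))             /* no more data right now */
            break;
        if (copy_to_user(buf + copied, unit, UNIT_SIZE))
            return copied ? copied : -EFAULT;
        copied += UNIT_SIZE;
    }

    return copied;                        /* always a multiple of 16 */
}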
I have a user-space application that generates big SCSI writes (details below). However, when I'm looking at the SCSI commands that reach the SCSI target (i.e. the storage, connected by the FC) something is splitting these writes into 512K chunks.
The application basically does 1M-sized direct writes directly into the device:
fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);
This code causes two SCSI WRITEs to be sent, 512K each.
However, if I issue a direct SCSI command, without the block layer, the write is not split.
I issue the following command from the command line:
sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct
I can see one single 1M-sized SCSI WRITE.
The question is, what is splitting the write and, more importantly, is it configurable?
The Linux block layer seems to be the culprit (because SG_IO doesn't pass through it), and 512K seems too arbitrary a number not to be some sort of configurable parameter.
As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):
max_hw_sectors_kb: the maximum size of a single I/O the hardware can accept
max_sectors_kb: the maximum size the block layer will send
max_segment_size and max_segments: the DMA engine limitations for scatter-gather (SG) I/O (the maximum size of each segment and the maximum number of segments for a single I/O)
The segment restrictions matter a lot when the buffer the I/O is coming from is not contiguous; in the worst case each segment can be as small as a page (which is 4096 bytes on x86 platforms). This means SG I/O for one I/O can be limited to a size of 4096 * max_segments.
The question is, what is splitting the write
As you guessed, the Linux block layer.
and, more importantly, is it configurable?
You can fiddle with max_sectors_kb, but the rest is fixed and comes from device/driver restrictions (so in your case I'm going to guess probably not, but you might see bigger I/Os directly after a reboot due to less memory fragmentation).
512K seems too arbitrary a number not to be some sort of a configurable parameter
The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform and have a max_segments of 128, so:
4096 * 128 / 1024 = 512
and that's where 512K could come from.
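If you want to check those numbers on your own machine, here is a small sketch that reads the sysfs limits mentioned above and computes the worst-case scatter-gather bound; the disk name "sdab" and the 4096-byte page size are placeholders for your setup.
#include <stdio.h>

static long read_limit(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long max_sectors_kb = read_limit("/sys/block/sdab/queue/max_sectors_kb");
    long max_segments   = read_limit("/sys/block/sdab/queue/max_segments");
    long page_size      = 4096;           /* typical on x86 */

    printf("max_sectors_kb: %ld KiB\n", max_sectors_kb);
    printf("max_segments:   %ld\n", max_segments);
    if (max_segments > 0)
        printf("worst-case SG-limited I/O: %ld KiB\n",
               max_segments * page_size / 1024);
    return 0;
}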
Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...
The blame is indeed on the block layer; the SCSI layer itself has little regard for the size. You should check, though, that the underlying layers are indeed able to pass your request, especially with regard to direct I/O, since that may be split into many small pages and require a scatter-gather list that is longer than what can be supported by the hardware or even just the drivers (libata is/was somewhat limited).
You should look at and tune /sys/class/block/$DEV/queue. There are assorted files there; the most likely to match what you need is max_sectors_kb, but you can just try it out and see what works for you. You may also need to tune the partitions' variables as well.
There's a max sectors per request attribute of the block driver. I'd have to check how to modify it. You used to be able to get this value via blockdev --getmaxsect but I'm not seeing the --getmaxsect option on my machine's blockdev.
Looking at the following files should tell you if the logical block size is different, possibly 512 in your case. I am not, however, sure whether you can write to these files to change those values (the logical block size, that is).
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
try ioctl(fd, BLKSECTSET, &blocks)
I would like to store a couple of entries to a file (optimized for reading) and a good data structure for that seems to be a B+ tree. It offers a O(log(n)/log(b)) access time where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
Is it right that I only should use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or mmap() those instead? Or is there only a difference if I mmap many blocks and access only a few? What is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4 kB) into the page cache anyway. Reading one byte will transfer 4 kB and return a byte; reading another byte will be served from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch more pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
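As a small illustration of the read-versus-mmap point above, here is a sketch that fetches block p of a file both ways; BLOCK_SIZE and the file name are placeholders for your own layout (note that the mmap offset must be page-aligned).
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    int fd = open("tree.db", O_RDONLY);           /* placeholder file */
    if (fd < 0) { perror("open"); return 1; }
    size_t p = 3;                                 /* block index to fetch */

    /* 1) pread(): the kernel copies the block out of the page cache */
    char *buf = malloc(BLOCK_SIZE);
    if (pread(fd, buf, BLOCK_SIZE, (off_t)p * BLOCK_SIZE) != BLOCK_SIZE)
        perror("pread");

    /* 2) mmap(): the page is mapped into the address space, no copy */
    char *map = mmap(NULL, BLOCK_SIZE, PROT_READ, MAP_PRIVATE,
                     fd, (off_t)p * BLOCK_SIZE);
    if (map == MAP_FAILED)
        perror("mmap");
    else
        munmap(map, BLOCK_SIZE);

    free(buf);
    close(fd);
    return 0;
}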
Filesystems which support delayed allocation don't create new files anywhere on disc. Lots of newer filesystems support packing very small files into their own pages or sharing them with metadata (For example, reiser puts very tiny files into the inode?). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
In Windows, memory-mapped files work faster than the file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements.
Just curious about the maximum file size limits of some popular file systems on Linux: I have seen some go up to the TB scale.
My question is: if the file pointer is 32 bits wide, as on most Linux systems we meet today, doesn't that mean that the maximum distance we can address is 2^32-1 bytes? Then how can we store a file larger than 4 GB?
Furthermore, even if we can store such a file, how can we locate a position beyond the 2^32 range?
To use files larger than 4 GB, you need "large file support" (LFS) on Linux. One of the changes LFS introduced was that file offsets are 64-bit numbers. This is independent of whether Linux itself is running in 32- or 64-bit mode (e.g. x86 vs. x86-64). See e.g. http://www.suse.de/~aj/linux_lfs.html
LFS was introduced mostly in glibc 2.2 and kernel 2.4.0 (roughly in 2000-2001), so any recent Linux distribution will have it.
To use it on Linux, you can either use special functions (e.g. lseek64 instead of lseek), or #define _FILE_OFFSET_BITS 64 before including the system headers; then the regular functions will use 64-bit offsets.
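A minimal sketch of the second approach (the file name is a placeholder): define _FILE_OFFSET_BITS before any system header, and the regular open/lseek calls work with 64-bit offsets even on 32-bit Linux.
#define _FILE_OFFSET_BITS 64      /* must come before any system header */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("huge.bin", O_RDONLY);          /* placeholder file */
    if (fd < 0) { perror("open"); return 1; }

    /* Seek past the 4 GB mark; off_t is 64 bits wide here */
    off_t pos = lseek(fd, (off_t)5 * 1024 * 1024 * 1024, SEEK_SET);
    printf("offset: %lld, sizeof(off_t) = %zu\n",
           (long long)pos, sizeof(off_t));

    close(fd);
    return 0;
}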
In Linux, at least, it's trivial to write programs to work with larger files explicitly (i.e., not just using a streaming approach as suggested by kohlehydrat).
See this page, for instance. The trick usually comes down to having a magic #define before including some of the system headers, which "turn on" the "large file support". This typically doubles the size of the file offset type to 64 bits, which is quite a lot.
There is no relation whatsoever. The FILE * pointer from C stdio is an opaque handle that has no relation to the size of the on-disk file, and the memory it points to can be much bigger than the pointer itself. The function fseek(), used to reposition where we read from and write to, already takes a long, and fgetpos() and fsetpos() use an opaque fpos_t.
What can make working with large files difficult is off_t, used as an offset in various system calls. Fortunately, people realized this would be an issue and came up with "Large File Support" (LFS), which is an altered ABI with a wider offset type off_t. (Typically this is done by introducing a new API and #define-ing the old names to invoke it.)
You can use lseek64 to handle big files. Ext4 can handle 16 TiB files.
Just call read(int fd, void *buf, size_t count) repeatedly.
(So there's no need for a 'pointer' into the file.)
From the filesystem-design point of view, you basically have an index tree (inodes), which points to several pieces of that data (blocks), which form the actual file. Using this model, you can theoretically have infinite file sizes.
UNIX has actual physical limits to file size determined by the number of bytes a 32-bit file pointer can index, about 2 GiB.
consider closing the first file just before it reaches 0x7fffffff bytes in length, and opening an additional new file.
The reasons for some limits of the ext2 file system are the on-disk format of the data and the operating system's kernel. Mostly these factors are determined once, when the file system is built. They depend on the block size and on the ratio of the number of blocks to the number of inodes. In Linux the block size is limited by the architecture's page size.
There are also some userspace programs that can't handle files larger than 2 GB.
The maximum file size is limited to min( (b/4)^3 + (b/4)^2 + b/4 + 12, 2^32 ) * b due to the i_block array (of EXT2_N_BLOCKS entries) and i_blocks (a 32-bit integer value) representing the number of b-byte "blocks" in the file.
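Plugging in b = 4096 as an illustration of the formula: the i_block term alone gives 1024^3 + 1024^2 + 1024 + 12, about 2^30 blocks, i.e. roughly 4 TiB of 4096-byte blocks; the actual limit is the smaller of that and the i_blocks term.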