What is the purpose of struct iov_iter in Linux?

What is the purpose of struct iov_iter? This structure is used in the Linux kernel instead of struct iovec. There is no good documentation for the iter interface. I found one document on LWN but I am not able to understand it. Could anyone please help me understand the iter interface used in the Linux kernel?

One purpose of iovec, which the LWN article states up front, is to process data in multiple chunks.
If you have a number of discrete buffers, chained with pointers, and want to read/write them in one go, you could simply replace this with several read/write ops, but in some cases semantics are associated with read/write boundaries - so ops can't simply be split without changing the meaning. An alternative is to copy all the data in and out of a contiguous buffer, which is wasteful and we want to avoid at all costs.
Using the POSIX readv/writev or, in our case, the iov_iter API reduces the number of system calls, and hence the overhead involved. While in the kernel this doesn't translate to expensive operations like context switches, it is still a minor concern. Drivers also might handle larger chunks of data more efficiently than they would lots of smaller chunks when they have no way to know whether more is coming in the near future - this is especially true of network drivers, although I'm not aware of iov_iter being used there at the moment.
Another instance of the same situation is I/O to raw disk devices, which only allow I/O that starts and ends on block boundaries. A user might occasionally want to perform random access, or overwrite a small piece of the buffer at, say, the start of a block and/or zero the rest.
Scenarios like that are exactly what iovec aims to address; you can construct an iovec which lets you do a whole-block operation spread over several discrete buffers, which might even include a "scratch" buffer for dumping the parts of a block you read but don't care about processing, and a pre-zeroed buffer to chain at the end of the writev to zero out the rest of a block. Again, I should point out that you could use a contiguous buffer with the associated copying and/or zeroing, but the iov_iter API provides an alternative abstraction with less overhead, and one that is perhaps easier to reason about when reading the code.
The term for operations like these in vector processing, or parallel computing, is "scatter/gather processing".
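As a userspace illustration of the same idea, here is a small sketch using the POSIX writev() mentioned above (not the in-kernel iov_iter API itself): a header and a payload kept in separate buffers are written in a single call.

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        char header[] = "HDR:";
        char payload[] = "payload bytes\n";
        struct iovec iov[2];

        /* Two discrete buffers, written to stdout in a single syscall. */
        iov[0].iov_base = header;
        iov[0].iov_len  = strlen(header);
        iov[1].iov_base = payload;
        iov[1].iov_len  = strlen(payload);

        if (writev(STDOUT_FILENO, iov, 2) < 0)
            perror("writev");
        return 0;
    }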

Related

What is the use case for Buffer.allocUnsafe() and Buffer.alloc()?

I am confused about Buffer.allocUnsafe() and Buffer.alloc(). I know that Buffer.allocUnsafe() creates a buffer that may contain old data, but why would I need such a thing if Buffer.alloc() creates a buffer with zero-filled data?
In Node.js, a Buffer is an abstraction over RAM, so if you allocate one in an unsafe way there is a high risk of having leftover data, even some source code, in the buffer instance. Try running console.log(Buffer.allocUnsafe(10000).toString('utf-8')) and I guarantee that you will see some code in your stdout.
Allocation is a synchronous operation, and we know that single-threaded Node.js doesn't really feel good about synchronous stuff. Unsafe allocation is much faster than safe allocation, because the buffer sanitization step takes time. Safe allocation is, well, safe, but there is a performance trade-off.
I'd suggest sticking to safe allocation first, and if you end up with low performance you can think of ways to implement unsafe allocation without exposing private data. Just keep in mind that the allocUnsafe method has the word unsafe in it for a reason. E.g., if you are going to go through a compliance certification like PCI DSS, I'm pretty sure the QSA will notice that and will have a lot of questions.
Buffer.alloc(size, fill, encoding) -> returns a new initialized Buffer of the specified size. This method is slower than Buffer.allocUnsafe(size) but guarantees that newly created Buffer instances never contain old data that is potentially sensitive.
Buffer.allocUnsafe(size) -> the Buffer is uninitialized; the allocated segment of memory might contain old data that is potentially sensitive. Using a Buffer created by Buffer.allocUnsafe() without completely overwriting the memory can allow this old data to be leaked when the Buffer memory is read.
Note: While there are clear performance advantages to using Buffer.allocUnsafe(), extra care must be taken in order to avoid introducing security vulnerabilities into an application.

Why does filling data into a buffer go from low to high addresses?

When we call a function, its stack is something like:
LOW MEMORY ADDRESS
local variables
saved frame pointer
return address
....
HIGH MEMORY ADDRESS
Why is data filled into a buffer in the direction from low to high memory addresses?
Many people tell me "because this is how it works", but I think someone in some book must have written about why we have this behavior, and I'm unable to find a good resource on it.
I think you are misunderstanding or confusing a few things.
In your example you seem to mix up operating system functionality with program and compiler behavior.
If you allocate multiple memory addresses, there is always a lower address and a higher address. You can only change that by writing everything to the same address, which would result in a very limited or useless program.
There are many buffer implementations, depending on your programming language, framework, and so on.
Which one you choose is up to you. If you use a buffer that is already implemented in a library, you of course have to follow the rules by which this buffer adds data, because that is how this specific buffer works. If you are not happy with how this is done, you need to change the chosen buffer or even the whole library, or in extreme cases write your own buffer.
As for how to add data to a buffer: some buffers allow you to add data anywhere in the buffer, at the cost of e.g. performance or reliability. If you wish to do it that way, that is up to you.
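As a small illustration of the "lower address / higher address" point, here is a minimal C sketch (purely illustrative) that prints the addresses of consecutive array elements; by definition in C, buf[i + 1] sits at a higher address than buf[i], regardless of which direction the call stack itself grows.

    #include <stdio.h>

    int main(void)
    {
        char buf[8];

        /* Consecutive elements of an array always sit at increasing
         * addresses: &buf[i + 1] == &buf[i] + 1 by definition in C. */
        for (int i = 0; i < 8; i++)
            printf("&buf[%d] = %p\n", i, (void *)&buf[i]);

        return 0;
    }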

Fastest way to copy a large file locally

I was asked this in an interview.
I said let's just use cp. Then I was asked to mimic the implementation of cp itself.
So I thought okay, let's open the file, read it byte by byte, and write it to another file.
Then I was asked to optimize it further. I thought let's read in chunks and write those chunks. I didn't have a good answer about what would be a good chunk size. Please help me out with that.
Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel.
But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean) since data from one thread might overwrite another's.
So I thought okay, let's read in parallel, put it in a queue, and then have a single thread take it off the queue and write it to the file one piece at a time.
Does that even improve performance? (I mean not for small files, where it would just be more overhead, but for large files.)
Also, is there an OS trick where I could just point two files to the same data on disk? I mean, I know there are symlinks, but apart from that?
"The fastest way to copy a file" is going to depend on the system - all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media - but it doesn't have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into...
In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.
So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
That will likely get you to 95% or even better of the system's maximum copy performance. Even standard C buffered fopen()/fread()/fwrite() probably gets at least 80-90% of the best possible performance.
You can get the last few percentage points in a few ways. First, by matching your IO block size to a multiple of the file system's block size so that you're always reading full blocks from the filesystem. Second, you can use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for single-spinning-disk copy that's not going to matter much, if it's even measurable.
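As a rough sketch of the single-threaded, large-chunk approach described above (error handling is kept minimal, and the 1 MB chunk size is just the example figure from above, not a tuned value):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)   /* 1 MB chunks, as suggested above */

    int copy_file(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        char *buf = malloc(CHUNK);
        ssize_t n;

        if (in < 0 || out < 0 || !buf)
            return -1;

        /* Read and write in large chunks so the kernel can do
         * read-ahead and write coalescing behind our back. */
        while ((n = read(in, buf, CHUNK)) > 0) {
            if (write(out, buf, n) != n)
                return -1;
        }

        free(buf);
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
    }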
You can use various dd options to test copying a file like this. Try using iflag=direct/oflag=direct, or conv=notrunc.
You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
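A hedged sketch of the sendfile() variant (Linux-specific; the loop is needed because a single call may copy less than requested):

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Copy in-kernel: data never passes through a userspace buffer. */
    int copy_sendfile(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct stat st;
        off_t off = 0;

        if (in < 0 || out < 0 || fstat(in, &st) < 0)
            return -1;

        /* sendfile() advances 'off' by however much it copied. */
        while (off < st.st_size) {
            ssize_t n = sendfile(out, in, &off, st.st_size - off);
            if (n <= 0)
                return -1;
        }

        close(in);
        close(out);
        return 0;
    }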
Pre-allocating the destination file may or may not improve copy performance - that will depend on the filesystem. If the filesystem doesn't support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
There just isn't all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk - those disk heads will dance, and that will take time.
SSDs are much easier - to get maximal IO rates, just use parallel IO via multiple threads. But again, the "normal" IO will probably be at 80-90% of maximal.
Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software's IO patterns to the characteristics of the storage, and that can be quite complex.
Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It's really easy to drag a physical disk down to a few KB/sec IO rates - read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won't move much data at all while doing that.
In fact, not "killing yourself" with bad IO patterns is a lot more important than spending a lot of effort attempting to get four or five percentage points faster in optimal cases.
Because if IO is a bottleneck on a simple system, just go buy a faster disk.
But I quickly realized reading in parallel is OK but writing will not work in parallel(without locking I mean) since data from one thread might overwrite others.
Multithreading is not normally going to speed up a process like this. Any performance benefit you may gain could be wiped out by the synchronization overhead.
So I thought okay, lets read in parallel, put it in a queue and then a single thread will take it off the queue and write it to the file one by one.
That's only going to give an advantage on a system that supports asynchronous I/O.
To get the maximum speed you'd want to write in buffer sizes that are multiples of the cluster factor of the disk (assuming a hard-disk file system). This could be sped up on systems that permit queuing asynchronous I/O (as does, say, Windoze).
You'd also want to create the output file with its initial size being the same as the input file. That way your write operations never have to extend the file.
Probably the fastest file copy possible would be to memory-map the input and output files and do a memory copy. This is especially efficient on systems that treat mapped files as page files.
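A minimal sketch of the memory-mapped copy (assumes a non-empty source file and leaves out most error handling; note that the destination has to be sized with ftruncate() first, which also covers the pre-sizing advice above):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map both files and memcpy between the mappings. */
    int copy_mmap(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY);
        int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
        struct stat st;

        if (in < 0 || out < 0 || fstat(in, &st) < 0 ||
            ftruncate(out, st.st_size) < 0)
            return -1;

        void *s = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
        void *d = mmap(NULL, st.st_size, PROT_WRITE, MAP_SHARED, out, 0);
        if (s == MAP_FAILED || d == MAP_FAILED)
            return -1;

        memcpy(d, s, st.st_size);

        munmap(s, st.st_size);
        munmap(d, st.st_size);
        close(in);
        close(out);
        return 0;
    }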

Block based storage

I would like to store a couple of entries in a file (optimized for reading), and a good data structure for that seems to be a B+ tree. It offers O(log(n)/log(b)) access time, where b is the number of entries in one block.
There are many papers etc. describing B+ trees, but I still have some trouble understanding block-based storage systems in general. Maybe someone can point me in the right direction or answer a couple of questions:
Do (all common) file systems create new files at the beginning of a new block? So, can I be sure that seek(0) will set the read/write head to a multiple of the device's block size?
Is it right that I should only use calls like pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE) (with n, p being integers) to ensure that I always read full blocks?
Is it better to read() BLOCK_SIZE bytes into an array or to mmap() them instead? Or is there only a difference if I mmap many blocks and access only a few? Which is better?
Should I try to avoid keys spanning multiple blocks by adding padding bytes at the end of each block? Should I do the same for the leaf nodes, by adding padding bytes between the data too?
Many thanks,
Christoph
In general, file systems create new files at the beginning of a new block because that is how the underlying device works. Hard disks are block devices and thus cannot handle anything less than a "block" or "sector". Additionally, operating systems treat memory and memory mappings in terms of pages, which are usually even larger (sectors are often 512 or 1024 bytes, pages usually 4096 bytes).
One exception to this rule that comes to mind would be ReiserFS, which puts small files directly into the filesystem structure (which, if I remember right, is incidentally a B+ tree!). For very small files this can actually be a viable optimization, since the data is already in RAM without another seek, but it can equally be an anti-optimization, depending on the situation.
It does not really matter, because the operating system will read data in units of full pages (normally 4kB) into the page cache anyway. Reading one byte will transfer 4kB and return a byte, reading another byte will serve you from the page cache (if it's the same page or one that was within the readahead range).
read is implemented by copying data from the page cache, whereas mmap simply remaps the pages into your address space (possibly marking them copy-on-write, depending on your protection flags). Therefore, mmap will always be at least as fast and usually faster. mmap is more comfortable too, but has the disadvantage that it may block at unexpected times when it needs to fetch more pages that are not in RAM (though that is generally true for any application or data that is not locked into memory). read, on the other hand, blocks when you tell it to, and not otherwise.
The same is true under Windows with the exception that memory mapped files under pre-Vista Windows don't scale well under high concurrency, as the cache manager serializes everything.
Generally one tries to keep data compact, because less data means fewer pages, and fewer pages means higher likelihood they're in the page cache and fit within the readahead range. Therefore I would not add padding, unless it is necessary for other reasons (alignment).
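To make the block-aligned access pattern from the question concrete, here is a minimal sketch; the 4096-byte BLOCK_SIZE is an assumption (real code should query the filesystem), and mmap additionally requires the file offset to be page-aligned:

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096   /* assumed; query the filesystem in real code */

    /* Read n whole blocks starting at block p with one block-aligned pread. */
    ssize_t read_blocks(int fd, void *buf, size_t n, off_t p)
    {
        return pread(fd, buf, n * BLOCK_SIZE, p * BLOCK_SIZE);
    }

    /* Alternatively, map the same range; mmap requires the file offset
     * to be page-aligned, which a 4096-byte block size satisfies. */
    void *map_blocks(int fd, size_t n, off_t p)
    {
        return mmap(NULL, n * BLOCK_SIZE, PROT_READ, MAP_PRIVATE,
                    fd, p * BLOCK_SIZE);
    }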
Filesystems which support delayed allocation don't decide where on disk a new file goes right away. Lots of newer filesystems support packing very small files into their own pages or sharing pages with metadata (for example, ReiserFS puts very tiny files into the inode, if I remember correctly). But for larger files, mostly, yes.
You can do this, but the OS page cache will always read an entire block in, and just copy the bits you requested into your app's memory.
It depends on whether you're using direct IO or non-direct IO.
If you're using direct IO, which bypasses the OS's cache, you don't use mmap. Most databases do not use mmap and use direct IO.
Direct IO means that the pages don't go through the OS's page cache, they don't get cached at all by the OS and don't push other blocks out of the OS cache. It also means that all reads and writes need to be done on block boundaries. Block boundaries can sometimes be determined by a statfs call on the filesystem.
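A sketch of what direct IO looks like in practice (Linux-specific O_DIRECT; the 4096-byte alignment is an assumption and should really come from the filesystem or device):

    #define _GNU_SOURCE       /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 4096        /* assumed block/alignment size */

    int read_direct(const char *path, off_t block, size_t nblocks)
    {
        void *buf;
        int fd = open(path, O_RDONLY | O_DIRECT);

        /* O_DIRECT requires the buffer, offset and length to be aligned. */
        if (fd < 0 || posix_memalign(&buf, ALIGN, nblocks * ALIGN) != 0)
            return -1;

        ssize_t n = pread(fd, buf, nblocks * ALIGN, block * ALIGN);

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }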
Most databases seem to take the view that they should manage their own page cache themselves, and use the OS only for physical reads/writes. Therefore they typically use direct and synchronous IO.
Linus Torvalds famously disagrees with this approach. I think the vendors really do it to achieve better consistency of behaviour across different OSs.
Yes. Doing otherwise would cause unnecessary complications in FS design.
And the options (as an alternative to "only") are ...?
In Windows, memory-mapped files work faster than the file API (ReadFile). I guess on Linux it's the same, but you can conduct your own measurements.

Writing to adjacent array elements from different threads?

Are there any modern, common CPUs where it is unsafe to write to adjacent elements of an array concurrently from different threads? I'm especially interested in x86. You may assume that the compiler doesn't do anything obviously ridiculous to increase memory granularity, even if it's technically within the standard.
I'm interested in the case of writing arbitrarily large structs, not just native types.
Note:
Please don't mention the performance issues with regard to false sharing. I'm well aware of these, but they're of no practical importance for my use cases. I'm also aware of visibility issues with regard to data written from threads other than the reader. This is addressed in my code.
Clarification: This issue came up because on some processors (for example, old DEC Alphas) memory could only be addressed at word level. Therefore, writing to memory in non-word size increments (for example, single bytes) actually involved read-modify-write of the byte to be written plus some adjacent bytes under the hood. To visualize this, think about what's involved in writing to a single bit. You read the byte or word in, perform a bitwise operation on the whole thing, then write the whole thing back. Therefore, you can't safely write to adjacent bits concurrently from different threads.
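To make that read-modify-write sequence concrete, here is a minimal C sketch of what "writing a single bit" expands to; the same pattern applied to whole bytes is what made adjacent-byte writes unsafe on word-addressed machines:

    #include <stdint.h>

    /* Setting one bit is really: read the whole byte, modify it, write
     * the whole byte back. Two threads doing this to different bits of
     * the same byte can lose each other's update. */
    void set_bit(uint8_t *byte, int bit)
    {
        uint8_t tmp = *byte;          /* read   */
        tmp |= (uint8_t)(1u << bit);  /* modify */
        *byte = tmp;                  /* write  */
    }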
It's also theoretically possible, though utterly silly, for a compiler to implement memory writes this way when the hardware doesn't require it. x86 can address single bytes, so it's mostly not an issue, but I'm trying to figure out if there's any weird corner case where it is. More generally, I want to know if writing to adjacent elements of an array from different threads is still a practical issue or mostly just a theoretical one that only applies to obscure/ancient hardware and/or really strange compilers.
Yet another edit: Here's a good reference that describes the issue I'm talking about:
http://my.safaribooksonline.com/book/programming/java/0321246780/threads-and-locks/ch17lev1sec6
Writing a native sized value (i.e. 1, 2, 4, or 8 bytes) is atomic (well, 8 bytes is only atomic on 64-bit machines). So, no. Writing a native type will always write as expected.
If you're writing multiple native types (i.e. looping to write an array) then it's possible to have an error if there's a bug in the operating system kernel or an interrupt handler that doesn't preserve the required registers.
Yes, definitely, writing a mis-aligned word that straddles the CPU cache line boundary is not atomic.
