I'm creating a Linux device driver that creates a character device.
The data that it returns on reads is logically divided into 16-byte units.
I was planning on implementing this division by returning however many units fit into the read buffer, but I'm not sure what to do if the read buffer is too small (<16 bytes).
What should I do here? Or is there a better way to achieve the division I'm trying to represent?
You could act like the datagram socket device driver: it always returns just a single datagram. If the read buffer is smaller, the excess is discarded -- it's the caller's responsibility to provide enough space for a whole datagram (typically, the application protocol specifies the maximum datagram size).
The documentation of your device should specify that it works in 16-byte units, so there's no reason why a caller would want to provide a buffer smaller than this. So any lost data due to the above discarding could be considered a bug in the calling application.
However, it would also be reasonable to return more than 16 bytes at a time if the caller asks for it -- that suggests the application will split the data into units itself. This could be more performant, since it minimizes system calls. But if the buffer isn't a multiple of 16, you could discard the remainder of the last unit. Just make sure this is documented, so callers know to make their buffers a multiple of 16.
If you're worried about generic applications like cat, I don't think you need to. I would expect them to use very large input buffers, simply for performance reasons.
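To make this concrete, here is a minimal sketch of a read handler that only hands out whole 16-byte units. It is not a complete driver: my_fetch_units() is a hypothetical helper standing in for whatever actually produces the device's data, and rejecting sub-unit buffers with -EINVAL is just one of the policies discussed above (the datagram-style alternative would copy count bytes and discard the rest of the unit).

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    #define UNIT_SIZE 16

    static ssize_t mydev_read(struct file *file, char __user *buf,
                              size_t count, loff_t *ppos)
    {
        char kbuf[256];                 /* scratch space for a handful of units */
        size_t nbytes;
        ssize_t got;

        /* Round the request down to whole units; reject sub-unit reads. */
        nbytes = (count / UNIT_SIZE) * UNIT_SIZE;
        if (nbytes == 0)
            return -EINVAL;
        if (nbytes > sizeof(kbuf))
            nbytes = sizeof(kbuf);

        /* Hypothetical helper: fills kbuf with up to nbytes of complete
         * units, returns the number of bytes produced or a negative errno. */
        got = my_fetch_units(kbuf, nbytes);
        if (got <= 0)
            return got;

        if (copy_to_user(buf, kbuf, got))
            return -EFAULT;

        *ppos += got;
        return got;
    }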
When we call a function, its stack is something like:
LOW MEMORY ADDRESS
local variables
saved frame pointer
return address
....
HIGH MEMORY ADDRESS
Why is data filled into a buffer from low to high memory addresses?
Many people tell me "because this is how it works", but I think someone in some book must have written down why we have this behavior; I'm just unable to find a good resource on it.
I think you are misunderstanding or confusing a few things.
In your example you seem to mix up operating-system functionality with program and compiler behavior.
If you allocate multiple memory addresses, there is always a lower address and a higher address. You can only change that by writing everything to the same address, which might result in a very limited or useless program.
There are many buffer implementations, depending on your programming language, framework, and so on.
Which one you choose is up to you. If you use a buffer that is already implemented in a library, you of course have to follow the rules by which this buffer adds data, because that is how THIS specific buffer works. If you are not happy with how this is done, you need to change the chosen buffer, or even the whole library, or in extreme cases write your own buffer.
As for how to add data to a buffer: some buffers allow you to add data anywhere in the buffer, at the cost of e.g. performance or reliability. If you wish to do it that way, it's up to you.
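To see the two directions side by side, here is a small illustrative user-space program; the exact layout is compiler- and ABI-dependent, so treat the output as indicative only. Array elements are laid out from low to high addresses, while the locals of each deeper call typically land at lower addresses because the stack grows downward on most common platforms.

    #include <stdio.h>

    /* Print where a local buffer lives; 'depth' lets us compare the
     * addresses of frames created one after another. */
    static void frame(int depth)
    {
        char buf[16];

        printf("depth %d: buf[0]=%p  buf[15]=%p\n",
               depth, (void *)&buf[0], (void *)&buf[15]);

        if (depth > 0)
            frame(depth - 1);
    }

    int main(void)
    {
        /* Typically buf[15] > buf[0] (the array fills low -> high), while
         * deeper frames sit at lower addresses (the stack grows down).
         * Both are implementation details of the compiler and ABI. */
        frame(2);
        return 0;
    }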
I have Python code that sends data to a socket (a rather large file). Should I divide it into 1 KB chunks, or would just conn.sendall(file.read()) be acceptable?
It will make little difference to the sending operation. (I assume you are using a TCP socket for the purposes of this discussion.)
When you attempt to send 1K, the kernel will take that 1K, copy it into kernel TCP buffers, and return success (and probably begin sending to the peer at the same time). At which point, you will send another 1K and the same thing happens. Eventually, if the file is large enough, and the network can't send it fast enough, or the receiver can't drain it fast enough, the kernel buffer space used by your data will reach some internal limit and your process will be blocked until the receiver drains enough data. (This limit can often be pretty high with TCP -- depending on the OS, you may be able to send a megabyte or two without ever hitting it.)
If you try to send in one shot, pretty much the same thing will happen: data will be transferred from your buffer into kernel buffers until/unless some limit is reached. At that point, your process will be blocked until data is drained by the receiver (and so forth).
However, with the first mechanism, you can send a file of any size without using undue amounts of memory -- your in-memory buffer (not including the kernel TCP buffers) only needs to be 1K long. With the sendall approach, file.read() will read the entire file into your program's memory. If you attempt that with a truly giant file (say 40G or something), that might take more memory than you have, even including swap space.
So, as a general purpose mechanism, I would definitely favor the first approach. For modern architectures, I would use a larger buffer size than 1K though. The exact number probably isn't too critical; but you could choose something that will fit several disk blocks at once, say, 256K.
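The question is about Python, but the pattern is language-independent. A rough C sketch of the chunked approach (error handling kept minimal, a 256K buffer as suggested above, and sockfd/filefd assumed to be already open and connected) might look like this:

    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define CHUNK (256 * 1024)   /* a few disk blocks at a time */

    /* Send an open file over a connected socket in chunks, so only
     * CHUNK bytes of the file are ever held in user-space memory. */
    static int send_file_chunked(int sockfd, int filefd)
    {
        static char buf[CHUNK];
        ssize_t nread;

        while ((nread = read(filefd, buf, sizeof(buf))) > 0) {
            char *p = buf;
            ssize_t left = nread;

            /* send() may accept fewer bytes than requested; loop until
             * this chunk is fully handed to the kernel (this is what
             * Python's sendall() does internally). */
            while (left > 0) {
                ssize_t sent = send(sockfd, p, left, 0);
                if (sent < 0)
                    return -1;
                p += sent;
                left -= sent;
            }
        }
        return nread < 0 ? -1 : 0;
    }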
What is the purpose of struct iov_iter? This structure is used in the Linux kernel instead of struct iovec. There is no good documentation for the iter interface. I found one document on LWN, but I was not able to understand it. Could anyone please help me understand the iter interface used in the Linux kernel?
One purpose of iovec, which the LWN article states up front, is to process data in multiple chunks.
If you have a number of discrete buffers, chained with pointers, and want to read/write them in one go, you could simply replace this with several read/write ops, but in some cases semantics are associated with read/write boundaries -- so ops can't simply be split without changing the meaning. An alternative is to copy all the data in and out of a contiguous buffer, which is wasteful and something we want to avoid at all costs.
Using the POSIX readv/writev or, in our case, the iov_iter API, reduces the number of system calls, and hence the overhead involved. While in the kernel this doesn't translate to expensive ops like context switches, it is still a minor concern. Drivers also might handle larger chunks of data more efficiently than they would lots of smaller chunks when they have no way to know if there's more to come in the near future -- this is especially true with network drivers, although I'm not aware of iov_iter being used there at the moment.
Another instance of the same situation is I/O to raw disk devices, which only allow I/O that starts and ends on block boundaries. A user might occasionally want to perform random access or overwrite a small piece of the buffer at, say, the start of a block, and/or zero the rest.
Scenarios like that are exactly what iovec aimed to address; you can construct an iovec which enables you to do a whole-block operation spread over several discrete buffers, which might even include a "scratch" buffer for dumping the parts of a block you read but don't care about processing, and a pre-zeroed buffer to chain at the end of a writev to zero out the rest of a block. Again, I should point out that you can use a contiguous buffer with the associated copying and/or zeroing, but the iov_iter API provides an alternative abstraction with less overhead, and it is perhaps easier to reason about when reading the code.
The term for operations like these in vector processing, or parallel computing, is "scatter/gather processing".
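As a plain user-space illustration of the scatter/gather idea, POSIX writev() writes several discrete buffers with a single system call; the kernel-side iov_iter machinery plays the analogous role inside drivers and filesystems.

    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Three separate buffers written with one system call; the kernel
         * walks the iovec array instead of needing one contiguous buffer. */
        char header[]  = "header ";
        char body[]    = "body ";
        char trailer[] = "trailer\n";

        struct iovec iov[3] = {
            { .iov_base = header,  .iov_len = strlen(header)  },
            { .iov_base = body,    .iov_len = strlen(body)    },
            { .iov_base = trailer, .iov_len = strlen(trailer) },
        };

        if (writev(STDOUT_FILENO, iov, 3) < 0)
            perror("writev");
        return 0;
    }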
What disadvantages (or advantages) are there to setting different values for a send and receive buffer on opposite sides of a connection? It seems to make the most sense (and the norm) to keep these values the same. But if one side (say the sender side) has the resources to double their buffer size, what implications could this have?
I guess a related question is, what disadvantages are there to setting a larger-than-required buffer size? From what I've read, it sounds like you could potentially overflow the receive buffer if your send buffer is larger. Additionally, it seems like there may not be a need to increase buffer sizes as long as your applications are keeping up with the load and can handle max-size messages. It doesn't necessarily mean you could handle more data throughput because you are still limited by the opposite endpoint. Is this correct?
The specific kernel settings in question are as follows:
net.core.wmem_max
net.core.rmem_max
A large buffer size can have a negative effect on performance in some cases. If the TCP/IP buffers are too large and applications are not processing data fast enough, paging can increase. The goal is to specify a value large enough to avoid flow control, but not so large that the buffer accumulates more data than the system can process.
A TCP send buffer size smaller than the receiver's receive buffer size will prevent you from using the maximum available bandwidth.
A UDP send buffer larger than the receiver's receive buffer size will prevent you from finding out at the source about datagrams that are too large.
Neither of these problems is major, unless you're transmitting large amounts of data in the TCP case. In the UDP case you shouldn't attempt to send datagrams larger than 534 bytes (or 576, or whatever that magic number is) anyway.
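For reference, the per-socket buffer sizes are requested with setsockopt() and are capped by net.core.wmem_max / net.core.rmem_max. A small sketch (sockfd assumed to be an existing socket) that also reads the effective value back, since Linux may adjust what you ask for:

    #include <stdio.h>
    #include <sys/socket.h>

    /* Request specific send/receive buffer sizes on a socket; the kernel
     * silently caps the request at wmem_max/rmem_max, so use getsockopt()
     * to see what was actually granted. */
    static void set_buffers(int sockfd, int sndbuf, int rcvbuf)
    {
        int actual;
        socklen_t len = sizeof(actual);

        if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
            perror("SO_RCVBUF");

        if (getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &actual, &len) == 0)
            printf("effective send buffer: %d bytes\n", actual);
    }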
I am developing an active messaging protocol for parallel computation that replaces TCP/IP. My goal is to decrease the latency of a packet. Since the environment is a LAN, I can replace TCP/IP with a simpler protocol to reduce packet latency. I am not writing any device driver; I am just trying to replace the TCP/IP stack with something simpler. Now I want to avoid copying a packet's data from user space to kernel space and vice versa. I have heard of mmap(). Is it the best way to do this? If yes, it would be nice if you could give links to some examples. I am a Linux newbie and I really appreciate your help. Thank you.
Thanks,
Bala
You should use UDP, which is already pretty fast. At least it was fast enough for W32/SQLSlammer to spread through the whole Internet.
About your initial question, see the (vm)splice and tee Linux system calls.
From the manpage:
The three system calls splice(2), vmsplice(2), and tee(2) provide userspace programs with full control over an arbitrary kernel buffer, implemented within the kernel using the same type of buffer that is used for a pipe. In overview, these system calls perform the following tasks:

splice(2) moves data from the buffer to an arbitrary file descriptor, or vice versa, or from one buffer to another.

tee(2) "copies" the data from one buffer to another.

vmsplice(2) "copies" data from user space into the buffer.

Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.
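As a sketch of how these calls can fit together, the hypothetical helper below (error and partial-transfer handling trimmed for brevity) moves data from a file to a socket through a pipe's kernel buffer, so the payload never takes a round trip through user space:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Splice the file into a pipe's kernel buffer, then splice the pipe
     * into the socket; only page references move, not the data itself. */
    static ssize_t splice_file_to_socket(int filefd, int sockfd, size_t len)
    {
        int pipefd[2];
        ssize_t in, out, total = 0;

        if (pipe(pipefd) < 0)
            return -1;

        while (len > 0 &&
               (in = splice(filefd, NULL, pipefd[1], NULL, len,
                            SPLICE_F_MOVE | SPLICE_F_MORE)) > 0) {
            out = splice(pipefd[0], NULL, sockfd, NULL, in,
                         SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out < 0) {
                total = -1;
                break;
            }
            total += out;
            len -= in;
        }

        close(pipefd[0]);
        close(pipefd[1]);
        return total;
    }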
Since the environment is a LAN, I can replace TCP/IP with a simpler protocol to reduce the packet latency
Generally, UDP packets tend to be lost even on a LAN, and they will also be lost if the client does not have enough time to consume them.
So no, do not replace TCP with something else (UDP), because if you do need reliable delivery, TCP will be the fastest option (everything connected to acknowledgments and retransmission is done in kernel space).
Generally, in the normal case, there are no latency drawbacks to using TCP (of course, do not forget the TCP_NODELAY option).
About sharing memory: actually, all memory you allocate is created with mmap, so the kernel will still need to copy it somehow when it creates a packet for the driver.
If you are talking about reducing copying, that is usually done for files/sockets: sendfile() indeed prevents copying data between kernel and user space. But I assume you do not need to send files.
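For completeness, here is roughly what the sendfile() path mentioned above looks like, assuming an already-open file and an already-connected socket; the data is transferred inside the kernel and never passes through a user-space buffer:

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Transfer an entire regular file to a socket; sendfile() updates
     * 'offset' as it goes, so the loop just keeps asking for the rest. */
    static ssize_t send_whole_file(int sockfd, int filefd)
    {
        struct stat st;
        off_t offset = 0;
        ssize_t total = 0;

        if (fstat(filefd, &st) < 0)
            return -1;

        while (offset < st.st_size) {
            ssize_t sent = sendfile(sockfd, filefd, &offset,
                                    st.st_size - offset);
            if (sent < 0)
                return -1;
            if (sent == 0)
                break;              /* nothing more to send */
            total += sent;
        }
        return total;
    }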