I have an RPC server that transfers a large amount of variable-length data to the client. The .x file looks something like this:
struct file
{
opaque data<>;
};
In the server routine, I have
struct file *transfer_1_svc(...)
{
static struct file result; /* static: the stub serializes it after we return */
result.data.data_val = malloc(...);
return &result;
}
My question is who frees the data allocated in the server routine?
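This depends on your server code. If the server stub is generated by rpcgen, you can free the result with the xdr_free function. A common pattern is to make the result static and call xdr_free on it at the start of the next call, so the memory from the previous reply is released before the result is reused. Check the SunRPC developer guide for details: https://docs.oracle.com/cd/E19683-01/816-1435/rpcgenpguide-21470/index.html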
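The "static result, free it on the next call" pattern can be sketched in plain C. This is only an illustration under stated assumptions: struct file_result mimics the layout rpcgen produces for opaque data<>, the names file_result, free_result and transfer_sketch are hypothetical, and free_result stands in for the xdr_free call a real service routine would make.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the rpcgen-generated type from the .x file. */
struct file_result {
    struct {
        unsigned int data_len;
        char *data_val;
    } data;
};

/* Stand-in for xdr_free(): releases whatever the previous reply allocated. */
static void free_result(struct file_result *r)
{
    free(r->data.data_val);
    r->data.data_val = NULL;
    r->data.data_len = 0;
}

/* Shaped like a service routine: the result is static, so it is still valid
 * after return, and the previous reply is released before a new one is built. */
struct file_result *transfer_sketch(const char *payload, size_t len)
{
    static struct file_result result;

    assert(payload != NULL || len == 0);
    free_result(&result);              /* free the previous call's buffer */
    if (len == 0)
        return &result;                /* empty reply, nothing to allocate */

    result.data.data_val = malloc(len);
    if (result.data.data_val == NULL)
        return NULL;                   /* allocation failed */
    memcpy(result.data.data_val, payload, len);
    result.data.data_len = (unsigned int)len;
    return &result;
}
```

The buffer of the last reply stays allocated until the next call arrives, which is the price of letting the stub read the result after the routine has returned.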
I need to find out which files consume the IOPS of my hard disk. Using strace alone will not solve my problem: I want to know which files are actually written to disk, not just to the page cache. I tried SystemTap, but I cannot work out how to find which files (filenames or inodes) consume my IOPS. Are there any tools that would solve my problem?
Yeah, you can definitely use SystemTap for tracing that. When an upper layer (usually the VFS subsystem) wants to issue an I/O operation, it calls the submit_bio and generic_make_request functions. Note that these calls don't necessarily correspond to a single physical I/O operation. For example, writes to adjacent sectors can be merged by the I/O scheduler.
The trick is how to determine the file path name in generic_make_request. It is quite simple for reads, as this function is called in the same context as the read() call. Writes are usually asynchronous, so write() simply updates the page cache entry and marks it as dirty, while submit_bio gets called by one of the writeback kernel threads, which have no information about the original calling process.
Writes can be attributed by looking at the page reference in the bio structure: each page has a mapping field that points to a struct address_space. The struct file that corresponds to an open file also contains f_mapping, which points to the same address_space instance, and it also points to the dentry containing the name of the file (the full path can be assembled with task_dentry_path).
So we need two probes: one to capture attempts to read or write a file and save the path and address_space into an associative array, and a second to capture generic_make_request calls (this is performed by probe ioblock.request).
Here is an example script which counts IOPS:
// maps struct address_space to path name
global paths;
// IOPS per file
global iops;
// Capture attempts to read and write by VFS
probe kernel.function("vfs_read"),
      kernel.function("vfs_write") {
    mapping = $file->f_mapping;
    // Assemble full path name for running task (task_current())
    // from open file "$file" of type "struct file"
    path = task_dentry_path(task_current(), $file->f_path->dentry,
                            $file->f_path->mnt);
    paths[mapping] = path;
}
// Attach to generic_make_request()
probe ioblock.request {
    for (i = 0; i < $bio->bi_vcnt; i++) {
        // Each BIO request may have more than one page
        // to write
        page = $bio->bi_io_vec[i]->bv_page;
        mapping = @cast(page, "struct page")->mapping;
        iops[paths[mapping], rw] <<< 1;
    }
}
// Once per second drain iops statistics
probe timer.s(1) {
    println(ctime());
    foreach ([path+, rw] in iops) {
        printf("%3d %s %s\n", @count(iops[path, rw]),
               bio_rw_str(rw), path);
    }
    delete iops
}
This example script works for XFS, but needs to be updated to support AIO and volume managers (including btrfs). Plus I'm not sure how it will handle metadata reads and writes, but it is a good start ;)
If you want to know more on SystemTap you can check out my book: http://myaut.github.io/dtrace-stap-book/kernel/async.html
Maybe iotop can give you a hint about which processes are doing I/O; from that you can get an idea of the files involved.
iotop --only
The --only option shows only the processes or threads actually doing I/O, instead of all processes or threads.
I am trying to write a FUSE interface for a REST API in Rust. I am using the rust-fuse library. I need the dir path in the readdir callback function when implementing the Filesystem trait, but the function only takes an inode!
How can I find the path to the file? Is it somehow embedded in the Request?
I could create an inode <-> path map, but that makes things too complicated. The Python and Haskell FUSE libraries both pass the path as a parameter to the callback functions rather than an inode.
fn readdir(&mut self,
           req: &Request,
           ino: u64,
           _fh: u64,
           offset: u64,
           mut reply: ReplyDirectory) {
    // ...
}
It appears the library doesn't provide this yet:
From the README (emphasis mine):
To Do
There's still a lot of stuff to be done. Feel free to contribute.
Interrupting a filesystem operation isn't handled yet. An additional
more high level API would be nice. It should provide pathnames instead
inode numbers and automatically handle concurrency and interruption
(like the FUSE C library's high level API).
It appears you will need to assign a unique inode when you open / list the directory/file, keep track of a mapping of inodes to paths, and use that later on.
Depending on your API structure, you may also be able to encode some amount of information into the inode directly. For example, maybe you have < 32 endpoints, so you can encode each endpoint as a 5-bit number and decode that later. Then only a subset of inodes need to have arbitrary values.
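The bit-packing idea can be sketched as follows. It is written in C for illustration; the same shifts and masks carry over directly to a Rust u64. The layout (endpoint id in the top 5 bits, an item id in the remaining 59) and the names make_ino, ino_endpoint and ino_item are assumptions, not part of rust-fuse.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (illustration only): the top 5 bits of the 64-bit inode hold
 * an endpoint id (0..31), the low 59 bits hold an item id within that endpoint.
 * FUSE uses inode 1 for the root directory, which here is endpoint 0, item 1. */
#define ENDPOINT_BITS 5
#define ITEM_BITS     (64 - ENDPOINT_BITS)
#define ITEM_MASK     ((UINT64_C(1) << ITEM_BITS) - 1)

/* Pack an endpoint id and an item id into one inode number. */
static uint64_t make_ino(unsigned endpoint, uint64_t item)
{
    assert(endpoint < (1u << ENDPOINT_BITS));
    assert(item <= ITEM_MASK);
    return ((uint64_t)endpoint << ITEM_BITS) | item;
}

/* Recover the endpoint id from an inode number. */
static unsigned ino_endpoint(uint64_t ino)
{
    return (unsigned)(ino >> ITEM_BITS);
}

/* Recover the item id from an inode number. */
static uint64_t ino_item(uint64_t ino)
{
    return ino & ITEM_MASK;
}
```

With such a scheme, readdir can decode the endpoint from ino without a lookup table, and only items whose ids are not naturally numeric need an explicit inode-to-path map.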
Is it possible to obtain the socket ID in the Linux kernel from an sk_buff struct?
I know I could get the socket using this code:
const struct tcphdr *th = tcp_hdr(skb);
struct sock *sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
struct socket *sock = NULL;
if (sk)
    sock = sk->sk_socket;
Where can I find the ID, and what is its maximum value?
A socket is a file.
You'll find, inside the struct socket, a struct file *file member.
I recommend looking at this question, specifically the link "things you never should do in the Kernel" on the accepted answer, because I'm worried about the reason why you're trying to retrieve the file descriptor from a socket structure in the kernel (usually, you want to do the exact opposite).
To retrieve the file descriptor for a given file inside the kernel, you'll need to iterate over the fdtable (search for files_fdtable()). This is a tremendous amount of work, especially if there is a huge number of open files.
File descriptor values in the current process are bounded by the current capacity of its fd table (which grows on demand, up to the process's open-file limit). That capacity can be retrieved with something like:
files_fdtable(current->files)->max_fds;
I am writing an application server that processes images (large data). I am trying to minimize copies when sending image data back to clients. The processed images I need to send to clients are in buffers obtained from jemalloc. The ways I have thought of sending the data back to the client are:
1) Simple write call.
// Allocate buffer buf.
// Store image data in this buffer.
write(socket, buf, len);
2) I obtain the buffer through mmap instead of jemalloc, though I presume jemalloc already creates the buffer using mmap. I then make a simple call to write.
buf = mmap(file, len); // Imagine proper options.
// Store image data in this buffer.
write(socket, buf, len);
3) I obtain a buffer through mmap like before. I then use sendfile to send the data:
buf = mmap(in_fd, len); // Imagine proper options.
// Store image data in this buffer.
int rc;
rc = sendfile(out_fd, in_fd, &offset, count);
// Deal with rc.
It seems like (1) and (2) will probably do the same thing given jemalloc probably allocates memory through mmap in the first place. I am not sure about (3) though. Will this really lead to any benefits? Figure 4 on this article on Linux zero-copy methods suggests that a further copy can be prevented using sendfile:
no data is copied into the socket buffer. Instead, only descriptors
with information about the whereabouts and length of the data are
appended to the socket buffer. The DMA engine passes data directly
from the kernel buffer to the protocol engine, thus eliminating the
remaining final copy.
This seems like a win if everything works out. I don't know if my mmaped buffer counts as a kernel buffer, though. Also, I don't know when it is safe to re-use this buffer. Since the fd and length are the only things appended to the socket buffer, I assume that the kernel actually writes this data to the socket asynchronously. If it does, what does the return from sendfile signify? How would I know when to re-use this buffer?
So my questions are:
What is the fastest way to write large buffers (images in my case) to a socket? The images are held in memory.
Is it a good idea to call sendfile on a mmapped file? If yes, what are the gotchas? Does this even lead to any wins?
It seems like my suspicions were correct. I got my information from this article. Quoting from it:
Also these network write system calls, including sendfile, might and
in many cases do return before the data sent over TCP by the method
call has been acknowledged. These methods return as soon as all data
is written into the socket buffers (sk buff) and is pushed to the TCP
write queue, the TCP engine can manage alone from that point on. In
other words at the time sendfile returns the last TCP send window is
not actually sent to the remote host but queued. In cases where
scatter-gather DMA is supported there is no seperate buffer which
holds these bytes, rather the buffers(sk buffs) just hold pointers to
the pages of OS buffer cache, where the contents of file is located.
This might lead to a race condition if we modify the content of the
file corresponding to the data in the last TCP send window as soon as
sendfile is returned. As a result TCP engine may send newly written
data to the remote host instead of what we originally intended to
send.
Provided the buffer from a mmapped file is even considered "DMA-able", it seems there is no way to know when it is safe to re-use it without an explicit acknowledgement (over the network) from the actual client. I might have to stick to simple write calls and incur the extra copy. There is a paper (also from the article) with more details.
Edit: This article on the splice call also shows the problems. Quoting it:
Be aware, when splicing data from a mmap'ed buffer to a network
socket, it is not possible to say when all data has been sent. Even if
splice() returns, the network stack may not have sent all data yet. So
reusing the buffer may overwrite unsent data.
For cases 1 and 2: does the operation you marked as // Store image data in this buffer require any conversion, or is it just a plain copy from memory to buf?
If it's just a plain copy, you can call write directly on the pointer obtained from jemalloc.
Assuming that img is a pointer obtained from jemalloc and size is the size of your image, just run the following code:
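/* needs <unistd.h> and <errno.h> */
ssize_t result;
size_t sent = 0;
while (sent < size) {
    /* write() may accept fewer bytes than requested, so loop until done */
    result = write(socket, (const char *)img + sent, size - sent);
    if (result < 0) {
        if (errno == EINTR)
            continue; /* interrupted by a signal: retry */
        /* error handling here */
        break;
    }
    sent += result;
}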
This works correctly for blocking I/O (the default behavior). If you need to write data in a non-blocking manner, you should be able to rework the code on your own, but now you have the idea.
For case 3: sendfile is for sending data from one descriptor to another. That means you can, for example, send data from a file directly to a TCP socket without allocating any additional buffer. So, if the image you want to send to a client is in a file, just go for sendfile. If you have it in memory (because you processed it somehow, or just generated it), use the approach I mentioned earlier.
I am working on a Linux kernel module that requires me to check data right before it is written to a local disk. The data to be written is fetched from a remote disk. Therefore, I know that the data from the fetch is stored in the page cache. I also know that Linux has a data structure that manages block I/O requests in-flight called the bio struct.
The bio struct contains a list of structures called bio_vecs.
struct bio_vec {
/* pointer to the physical page on which this buffer resides */
struct page *bv_page;
/* the length in bytes of this buffer */
unsigned int bv_len;
/* the byte offset within the page where the buffer resides */
unsigned int bv_offset;
};
It has a list of these because the block representation in memory may not be physically contiguous. What I want to do is grab each piece of the buffer using the list of bio_vecs and put them together as one so that I could take an MD5 hash of the block. How do I use the pointer to the page, the length of the buffer and its offset to get the raw data in the buffer? Are there already functions for this or do I have to write my own?
You can use the bio_data(struct bio *bio) function to access the data.
Working with the result of bio_data can be awkward because its return type is void * (so %s won't work), but a small type cast takes care of that. Note that bio_data only gives the address of the bio's current segment, so a bio with several bio_vecs needs one such access per segment.
The following piece of code will do the job:
char *ptr;
int i;

ptr = (char *)bio_data(bio);
for (i = 0; i < 4096; i++) { /* 4096: assumes the bio carries one 4 KB chunk */
    printk("%c", *ptr);
    ptr++;
}
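To hash a whole block, every bio_vec has to be visited and its bytes appended to one buffer. The following is a user-space model of that gather step only, not kernel code: struct seg and gather are hypothetical names, and the page field stands in for the address obtained by mapping bv_page. In the kernel you would walk the segments with an iterator such as bio_for_each_segment and map each page (for example with kmap) before copying; the exact calls differ between kernel versions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space model of struct bio_vec: `page` stands in for the mapped
 * address of bv_page (hypothetical; the kernel stores a struct page *). */
struct seg {
    const unsigned char *page;
    unsigned int len;      /* like bv_len */
    unsigned int offset;   /* like bv_offset */
};

/* Copy every segment's bytes, in order, into one contiguous buffer.
 * Returns the number of bytes copied, or 0 if `out` is too small. */
static size_t gather(const struct seg *segs, size_t nsegs,
                     unsigned char *out, size_t cap)
{
    size_t total = 0;

    assert(segs != NULL && out != NULL);
    for (size_t i = 0; i < nsegs; i++) {
        if (segs[i].len > cap - total)
            return 0;      /* not enough room for this segment */
        memcpy(out + total, segs[i].page + segs[i].offset, segs[i].len);
        total += segs[i].len;
    }
    return total;
}
```

The contiguous buffer can then be passed to the hash function in one call; alternatively, each segment can be fed to an incremental hash update, which avoids the copy.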