get all pages from the page cache - linux

I am trying to write a function in Linux kernel space that walks over a page cache, and searches for a page that contains a specific block.
I don't know how to get the pages in the page-cache one-by-one.
I saw that find_get_page is a function that can help me, but I don't know how to get the first page offset and how to continue.
As I said, I am trying to do something like that:
for(every page in struct address_space *mapping)
{
for(every struct buffer_head in current_page->buffers)
{
check if(my_sector == current_buffer_head->b_blocknr)
...
}
}
Can anyone help to find how to walk over all the page-cache?
I believe that there is a code in Linux kernel that does something like this (for example: when there is a write to a page and the page is searched in the cache), but I didn't find it...
Thanks!

The address_space structure holds all of its pages in a radix tree (mapping->page_tree in your case), so all you need to do is iterate over that tree. The Linux kernel has a radix tree API (include/linux/radix-tree.h), including for_each iterators. For example:
/**
 * radix_tree_for_each_chunk_slot - iterate over slots in one chunk
 *
 * @slot: the void** variable, at the beginning points to chunk first slot
 * @iter: the struct radix_tree_iter pointer
 * @flags: RADIX_TREE_ITER_*, should be constant
 *
 * This macro is designed to be nested inside radix_tree_for_each_chunk().
 * @slot points to the radix tree slot, @iter->index contains its index.
 */
#define radix_tree_for_each_chunk_slot(slot, iter, flags) \
	for (; slot ; slot = radix_tree_next_slot(slot, iter, flags))
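The nested loop from the question can be sketched in user space to show the traversal shape. This is only a model, not kernel code: struct model_page, struct model_bh and find_page_with_block() are hypothetical names. In the kernel the outer loop would be a radix tree iterator over mapping->page_tree, and the inner loop would start from page_buffers(page) (after checking page_has_buffers(page)) and follow the real b_this_page field, which links a page's buffer heads in a circular list.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct buffer_head: only the two fields the search needs. */
struct model_bh {
    unsigned long b_blocknr;        /* block number this buffer maps */
    struct model_bh *b_this_page;   /* circular list of the page's buffers */
};

/* Stand-in for struct page with its attached buffers (NULL if none). */
struct model_page {
    unsigned long index;
    struct model_bh *buffers;
};

/* Return the first page whose buffer list contains the block, or NULL. */
static const struct model_page *
find_page_with_block(const struct model_page *pages, size_t n,
                     unsigned long block)
{
    size_t i;

    for (i = 0; i < n; i++) {
        const struct model_bh *head = pages[i].buffers;
        const struct model_bh *bh = head;

        if (!head)
            continue;               /* page has no buffers attached */
        do {                        /* the list is circular: stop at head */
            if (bh->b_blocknr == block)
                return &pages[i];
            bh = bh->b_this_page;
        } while (bh != head);
    }
    return NULL;
}
```

The do/while form matters because the list has no NULL terminator; the walk ends when it returns to the first buffer head.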

Related

Where do we define the type of structure returned by the kernel when using the "perf_event_open" system call using mmap?

I'm trying to use the syscall perf_event_open to get some performance data from the system.
I am currently working on periodic data retrieval using shared memory with a ring buffer.
But I can't find which structure is returned in each section of the ring buffer. The manual page enumerates all the possibilities, but that's all.
I can't figure out which member of the perf_event_attr structure to fill in to control what type of structure will be returned to the ring buffer.
If you have any information about that, I'll be happy to read it!
The https://github.com/torvalds/linux/blob/master/tools/perf/design.txt documentation has a description of the mmaped ring buffer, and perf script / perf script -D can decode the ring data when it is saved as a perf.data file. Some parts of the doc are outdated, but it is still a useful description of the perf_event_open syscall.
The first mmap page is a metadata page; the remaining 2^n pages are filled with events, where every event starts with an 8-byte struct perf_event_header.
Like stated, asynchronous events, like counter overflow or PROT_EXEC
mmap tracking are logged into a ring-buffer. This ring-buffer is
created and accessed through mmap().
The mmap size should be 1+2^n pages, where the first page is a
meta-data page (struct perf_event_mmap_page) that contains various
bits of information such as where the ring-buffer head is.
/*
* Structure of the page that can be mapped via mmap
*/
struct perf_event_mmap_page {
__u32 version; /* version number of this structure */
__u32 compat_version; /* lowest version this is compat with */
...
}
The following 2^n pages are the ring-buffer which contains events of the form:
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (1 << 1)
#define PERF_RECORD_MISC_OVERFLOW (1 << 2)
struct perf_event_header {
__u32 type;
__u16 misc;
__u16 size;
};
enum perf_event_type
The design.txt doc has incorrect values for enum perf_event_type; check the actual perf_events kernel source instead: https://github.com/torvalds/linux/blob/master/include/uapi/linux/perf_event.h#L707. That uapi/linux/perf_event.h file also has some hints about the record layouts in its comments, for example:
* #
* # The RAW record below is opaque data wrt the ABI
* #
* # That is, the ABI doesn't make any promises wrt to
* # the stability of its content, it may vary depending
* # on event, hardware, kernel version and phase of
* # the moon.
* #
* # In other words, PERF_SAMPLE_RAW contents are not an ABI.
* #
*
* { u32 size;
* char data[size];}&& PERF_SAMPLE_RAW
*...
* { u64 size;
* char data[size];
* u64 dyn_size; } && PERF_SAMPLE_STACK_USER
*...
PERF_RECORD_SAMPLE = 9,
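To make the record layout concrete, here is a minimal user-space sketch of walking the data area record by record. It assumes a local mirror of struct perf_event_header (struct rec_header) so that it compiles without <linux/perf_event.h>; ring_copy() and count_records() are hypothetical helpers, not part of any perf API. In a real reader, head and tail would come from the data_head and data_tail fields of struct perf_event_mmap_page, with the memory barriers the man page describes. The `&& PERF_SAMPLE_*` annotations in the header comments refer to bits of perf_event_attr.sample_type, which is the member that selects which fields a PERF_RECORD_SAMPLE carries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the 8-byte struct perf_event_header. */
struct rec_header {
    uint32_t type;   /* PERF_RECORD_* value */
    uint16_t misc;
    uint16_t size;   /* total record size, header included */
};

/* Copy len bytes starting at logical offset off out of a ring of
 * ring_size bytes (a power of two), handling wrap-around. */
static void ring_copy(const uint8_t *ring, size_t ring_size,
                      uint64_t off, void *dst, size_t len)
{
    size_t start = (size_t)(off & (ring_size - 1));
    size_t first = ring_size - start;

    if (first > len)
        first = len;
    memcpy(dst, ring + start, first);
    memcpy((uint8_t *)dst + first, ring, len - first);
}

/* Count records of one type between tail and head.
 * Returns -1 if a header is truncated or its size is implausible. */
static int count_records(const uint8_t *ring, size_t ring_size,
                         uint64_t tail, uint64_t head, uint32_t type)
{
    int n = 0;

    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
    while (tail < head) {
        struct rec_header h;

        if (head - tail < sizeof(h))
            return -1;
        ring_copy(ring, ring_size, tail, &h, sizeof(h));
        if (h.size < sizeof(h) || h.size > head - tail)
            return -1;
        if (h.type == type)
            n++;
        tail += h.size;          /* the next record starts right after */
    }
    return n;
}
```

The payload that follows each header depends on the record type and, for samples, on sample_type, so a real decoder would switch on h.type before interpreting the remaining h.size - 8 bytes.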

netfilter hook is not retrieving complete packet

I'm writing a netfilter module that deeply inspects packets. However, during tests I found that the netfilter module is not receiving the packet in full.
To verify this, I wrote the following code to dump packet retrieved on port 80 and write the result to dmesg buffer:
const struct iphdr *ip_header = ip_hdr(skb);
if (ip_header->protocol == IPPROTO_TCP)
{
const struct tcphdr *tcp_header = tcp_hdr(skb);
if (ntohs(tcp_header->dest) != 80)
{
return NF_ACCEPT;
}
buff = (char *)kzalloc(skb->len * 10, GFP_KERNEL);
if (buff != NULL)
{
int pos = 0, i = 0;
for (i = 0; i < skb->len; i ++)
{
pos += sprintf(buff + pos, "%02X", skb->data[i] & 0xFF);
}
pr_info("(%pI4):%d --> (%pI4):%d, len=%d, data=%s\n",
&ip_header->saddr,
ntohs(tcp_header->source),
&ip_header->daddr,
ntohs(tcp_header->dest),
skb->len,
buff
);
kfree (buff);
}
}
In a virtual machine running locally, I can retrieve the full HTTP request; on Alibaba Cloud and some other OpenStack-based VPS providers, the packet is cut in the middle.
To verify this, I execute curl http://VPS_IP on another VPS, and I got the following output in dmesg buffer:
[ 1163.370483] (XXXX):5007 --> (XXXX):80, len=237, data=451600ED000040003106E3983D87A950AC11D273138F00505A468086B44CE19E80180804269300000101080A1D07500A000D2D90474554202F20485454502F312E310D0A486F73743A2033392E3130372E32342E37370D0A4163636570743A202A2F2A0D0A557365722D4167656E743A204D012000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000001E798090F5FFFF8C0000007B00000000E0678090F5FFFF823000003E00000040AE798090F5FFFF8C0000003E000000000000000000000000000000000000000000000000000000000000
When decoded, the result is like this
It's totally weird: everything after "User-Agent: M" is "gone" or zeroed. Although skb->len is 237, half of the packet is missing.
Any ideas? Tried both PRE_ROUTING and LOCAL_IN, no changes.
It appears that sometimes you are getting a linear skb, and sometimes your skb is not linear. In the latter case you are not reading the full data contents of an skb.
If skb->data_len is zero, then your skb is linear and the full data contents of the skb are in skb->data. If skb->data_len is not zero, then your skb is not linear, and skb->data contains just the first (linear) part of the data. The length of this area is skb->len - skb->data_len; the skb_headlen() helper function calculates that for convenience. The skb_is_nonlinear() helper function tells whether an skb is linear or not.
The rest of the data can be in paged fragments, and in skb fragments, in this order.
skb_shinfo(skb)->nr_frags tells the number of paged fragments. Each paged fragment is described by a data structure in the array of structures skb_shinfo(skb)->frags[0..skb_shinfo(skb)->nr_frags - 1]. The skb_frag_size() and skb_frag_address() helper functions help with this data. They accept the address of the structure that describes a paged fragment. There are other useful helper functions depending on your kernel version.
If the total size of data in paged fragments is less than skb->data_len, then the rest of the data is in skb fragments. This is the list of skbs attached to this skb at skb_shinfo(skb)->frag_list (see skb_walk_frags() in the kernel).
Please note that there may be no data in the linear part and/or no data in the paged fragments. You just need to process the data piece by piece in the order just described.
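The ordering described above (linear part, then paged fragments, then the frag_list chain) can be sketched as a user-space model. Everything here is hypothetical scaffolding: struct model_skb, struct model_frag, model_headlen() and model_copy_all() only mimic the shape of the kernel structures so the gather logic can be compiled and run outside the kernel. In a real module the existing skb_copy_bits() helper performs this kind of gather on a real sk_buff, and skb_linearize() can be used when a flat buffer is acceptable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct model_frag {
    const uint8_t *addr;   /* what skb_frag_address() would return */
    size_t size;           /* what skb_frag_size() would return */
};

struct model_skb {
    const uint8_t *data;                /* linear area */
    size_t len;                         /* total bytes, chained skbs included */
    size_t data_len;                    /* bytes outside the linear area */
    const struct model_frag *frags;     /* paged fragments */
    size_t nr_frags;
    const struct model_skb *frag_list;  /* first chained skb, or NULL */
    const struct model_skb *next;       /* next skb in a frag_list chain */
};

/* Length of the linear area, as skb_headlen() computes it. */
static size_t model_headlen(const struct model_skb *skb)
{
    return skb->len - skb->data_len;
}

/* Copy at most room bytes from src; returns the number copied. */
static size_t put(uint8_t *dst, size_t room, const uint8_t *src, size_t n)
{
    if (n > room)
        n = room;
    if (n)
        memcpy(dst, src, n);
    return n;
}

/* Gather the whole packet into dst (capacity cap), in the documented order. */
static size_t model_copy_all(const struct model_skb *skb,
                             uint8_t *dst, size_t cap)
{
    size_t pos = 0, i;
    const struct model_skb *it;

    pos += put(dst + pos, cap - pos, skb->data, model_headlen(skb));
    for (i = 0; i < skb->nr_frags; i++)
        pos += put(dst + pos, cap - pos,
                   skb->frags[i].addr, skb->frags[i].size);
    for (it = skb->frag_list; it; it = it->next)
        pos += model_copy_all(it, dst + pos, cap - pos);
    return pos;
}
```

Reading skb->data[i] for i up to skb->len, as the dump loop in the question does, corresponds to treating the whole packet as if it were the first put() call only, which is why the tail of the dump contains unrelated memory.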

register_kprobe() returns EINVAL without additional memory on containing struct

I've written a kernel module (a character device) that registers new KProbes whenever I write to the module.
I have a structure that contains struct kprobe. When I call register_kprobe(), it returns -EINVAL. But when I add a dummy character array to the struct (possibly some other data types would work as well), the KProbe registration succeeds.
Probe Registration
struct my_struct *container = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
(container->probe).addr = (kprobe_opcode_t *) kallsyms_lookup_name("my_exported_fn"); /* my_exported_fn is in code section */
(container->probe).pre_handler = Pre_Handler;
(container->probe).post_handler = Post_Handler;
register_kprobe(&container->probe);
/* Returns -EINVAL if my_struct contains only `struct kprobe`. */
Not working:
struct my_struct {
struct kprobe probe;
};
Working:
struct my_struct {
char dummy[512]; /* At 512, it gets registered consistently. At 256, only sometimes (maybe one out of 5 - 10 attempts gets registered) */
struct kprobe probe;
};
Why does it need this extra bit of memory to be present in the struct?
This may or may not be an unaligned memory access, but in this particular case (I mean your original code before the edit) I suspect that the data is not properly initialised. Namely, register_kprobe() calls the kprobe_addr() function, which in turn performs the following check:
if ((symbol_name && addr) || (!symbol_name && !addr))
goto invalid;
...
invalid:
return ERR_PTR(-EINVAL);
So, if you initialise addr but don't initialise symbol_name, the latter could be a garbage pointer under certain circumstances. Namely, kmalloc() doesn't zero the memory it allocates. Furthermore, depending on the requested size, it takes an object from one of several pools (there are different pools for objects of different sizes), so when you artificially increase the size of the struct, kmalloc() has to allocate a larger object from a different pool. From this perspective, such an object is more likely to happen to contain zeroes rather than garbage, since larger chunks are requested less often.
All in all, I suggest zeroising the memory chunk or using kzalloc().
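The effect can be reproduced in user space with a small model. This is a sketch under stated assumptions: struct model_kprobe, model_check() and register_with_fill() are hypothetical stand-ins, and stale heap contents are simulated with an explicit memset() because reading truly uninitialised memory is undefined behaviour. It also assumes, as on mainstream platforms, that an all-zero-bytes pointer is NULL. In the module itself the corresponding fix would be kzalloc(sizeof(*container), GFP_KERNEL).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-in for struct kprobe: only the fields the check reads. */
struct model_kprobe {
    const char *symbol_name;
    void *addr;
};

/* Mirrors the quoted condition: exactly one of the two must be set. */
static int model_check(const struct model_kprobe *p)
{
    if ((p->symbol_name && p->addr) || (!p->symbol_name && !p->addr))
        return -EINVAL;
    return 0;
}

/* Allocate a probe, pre-fill it with fill_byte, set only addr, and check.
 * fill_byte 0 models kzalloc(); a nonzero byte models stale heap data. */
static int register_with_fill(int fill_byte, void *addr)
{
    struct model_kprobe *p = malloc(sizeof(*p));
    int ret;

    assert(p != NULL);
    memset(p, fill_byte, sizeof(*p));
    p->addr = addr;            /* symbol_name is left as whatever was there */
    ret = model_check(p);
    free(p);
    return ret;
}
```

With a zero fill the check passes because symbol_name is NULL; with any other fill both pointers look set and the model returns -EINVAL, matching the behaviour described in the question.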

Configure kern.log to give more info about a segfault

Currently I can find in kern.log entries like this:
[6516247.445846] ex3.x[30901]: segfault at 0 ip 0000000000400564 sp 00007fff96ecb170 error 6 in ex3.x[400000+1000]
[6516254.095173] ex3.x[30907]: segfault at 0 ip 0000000000400564 sp 00007fff0001dcf0 error 6 in ex3.x[400000+1000]
[6516662.523395] ex3.x[31524]: segfault at 7fff80000000 ip 00007f2e11e4aa79 sp 00007fff807061a0 error 4 in libc-2.13.so[7f2e11dcf000+180000]
(As you can see, the applications causing the segfaults are named ex3.x, meaning "exercise 3 executable".)
Is there a way to ask kern.log to log the complete path? Something like:
[6...] /home/user/cclass/ex3.x[3...]: segfault at 0 ip 0564 sp 07f70 error 6 in ex3.x[4...]
That way I could easily figure out which user/student this ex3.x belongs to.
Thanks!
Beco
That log message comes from the kernel with a fixed format that includes only the first 15 characters of the executable name (tsk->comm is TASK_COMM_LEN = 16 bytes, including the terminating NUL), without the path, as per show_signal_msg. Non-x86 architectures have equivalent code for their segmentation fault messages.
As mentioned by Makyen, without significant changes to the kernel and a recompile, the message given to klogd which is passed to syslog won't have the information you are requesting.
I am not aware of any log transformation or injection functionality in syslog or klogd which would allow you to take the name of the file and run either locate or file on the filesystem in order to find the full path.
The best way to get the information you are looking for is to use crash interception software like apport or abrt or corekeeper. These tools store the process metadata from the /proc filesystem including the process's commandline which would include the directory run from, assuming the binary was run with a full path, and wasn't already in path.
The other more generic way would be to enable core dumps, and then to set /proc/sys/kernel/core_pattern to include %E, in order to have the core file name including the path of the binary.
The short answer is: No, it is not possible without making code changes and recompiling the kernel. The normal solution to this problem is to instruct your students to name their executable <student user name>_ex3.x so that you can easily have this information.
However, it is possible to get the information you desire from other methods. Appleman1234 has provided some alternatives in his answer to this question.
How do we know the answer is "It is not possible to get the full path in the kern.log segfault messages without recompiling the kernel"?
We look in the kernel source code to find out how the message is produced and if there are any configuration options.
The files in question are part of the kernel source. You can download the entire kernel source as an rpm package (or other type of package) for whatever version of linux/debian you are running from a variety of places.
Specifically, the output that you are seeing is produced from whichever of the following files is for your architecture:
linux/arch/sparc/mm/fault_32.c
linux/arch/sparc/mm/fault_64.c
linux/arch/um/kernel/trap.c
linux/arch/x86/mm/fault.c
An example of the relevant function from one of the files(linux/arch/x86/mm/fault.c):
/*
* Print out info about fatal segfaults, if the show_unhandled_signals
* sysctl is set:
*/
static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
if (!unhandled_signal(tsk, SIGSEGV))
return;
if (!printk_ratelimit())
return;
printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
tsk->comm, task_pid_nr(tsk), address,
(void *)regs->ip, (void *)regs->sp, error_code);
print_vma_addr(KERN_CONT " in ", regs->ip);
printk(KERN_CONT "\n");
}
From that we see that the process name printed in the message is tsk->comm (tsk being a struct task_struct *), and that regs->ip (regs being a struct pt_regs *) is passed to print_vma_addr().
Then from linux/include/linux/sched.h
struct task_struct {
...
char comm[TASK_COMM_LEN]; /* executable name excluding path
- access with [gs]et_task_comm (which lock
it with task_lock())
- initialized normally by setup_new_exec */
The comment makes it clear that the path for the executable is not stored in the structure.
For regs->ip where struct pt_regs *regs, it is defined in whichever of the following are appropriate for your architecture:
arch/arc/include/asm/ptrace.h
arch/arm/include/asm/ptrace.h
arch/arm64/include/asm/ptrace.h
arch/cris/include/arch-v10/arch/ptrace.h
arch/cris/include/arch-v32/arch/ptrace.h
arch/metag/include/asm/ptrace.h
arch/mips/include/asm/ptrace.h
arch/openrisc/include/asm/ptrace.h
arch/um/include/asm/ptrace-generic.h
arch/x86/include/asm/ptrace.h
arch/xtensa/include/asm/ptrace.h
From there we see that struct pt_regs is defining registers for the architecture. ip is just: unsigned long ip;
Thus, we have to look at what print_vma_addr() does. It is defined in mm/memory.c
/*
* Print the name of a VMA.
*/
void print_vma_addr(char *prefix, unsigned long ip)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
/*
* Do not print if we are in atomic
* contexts (in exception stacks, etc.):
*/
if (preempt_count())
return;
down_read(&mm->mmap_sem);
vma = find_vma(mm, ip);
if (vma && vma->vm_file) {
struct file *f = vma->vm_file;
char *buf = (char *)__get_free_page(GFP_KERNEL);
if (buf) {
char *p;
p = d_path(&f->f_path, buf, PAGE_SIZE);
if (IS_ERR(p))
p = "?";
printk("%s%s[%lx+%lx]", prefix, kbasename(p),
vma->vm_start,
vma->vm_end - vma->vm_start);
free_page((unsigned long)buf);
}
}
up_read(&mm->mmap_sem);
}
Which shows us that a path was available. We would need to check that it was the path, but looking a bit further in the code gives a hint that it might not matter. We need to see what kbasename() did with the path that is passed to it. kbasename() is defined in include/linux/string.h as:
/**
* kbasename - return the last part of a pathname.
*
* #path: path to extract the filename from.
*/
static inline const char *kbasename(const char *path)
{
const char *tail = strrchr(path, '/');
return tail ? tail + 1 : path;
}
Which, even if the full path is available prior to it, chops off everything except for the last part of a pathname, leaving the filename.
Thus, no amount of runtime configuration will permit printing the full pathname of the file in the segmentation fault messages you are seeing.
NOTE: I've changed all of the links to kernel source to point to archives, rather than the original locations. Those links will get close to the code as it was at the time I wrote this, 2014-09. As should be no surprise, the code evolves over time, so the code which is current when you're reading this may or may not be similar to, or behave in the way that is, described here.

fuse: Setting offsets for the filler function in readdir

I am implementing a virtual filesystem using FUSE, and need some help understanding the offset parameter in readdir.
Earlier we were ignoring the offset and passing 0 to the filler function, in which case the kernel should take care of it.
Our filesystem database stores: directory name, file length, inode number and parent inode number.
How do I calculate the offset?
Is the offset of each component equal to its size, with components sorted in increasing order of inode number? What happens if there is a directory inside a directory: is the offset in that case equal to the sum of the files inside?
Example: in case the dir listing is - a.txt b.txt c.txt
And inode number of a.txt=3, b.txt=5, c.txt=7
Offset of a.txt= directory offset
Offset of b.txt=dir offset + size of a.txt
Offset of c.txt=dir offset + size of b.txt
Is the above assumption correct?
P.S: Here are the callbacks of fuse
The selected answer is not correct
Despite the lack of upvotes on this answer, this is the correct answer. Cracking into the format of the void buffer should be discouraged, and that's the intent behind declaring such things void in C code - you shouldn't write code that assumes knowledge of the format of the data behind void pointers, use whatever API is provided properly instead.
The code below is very simple and straightforward, as it should be. No knowledge of the format of the Fuse buffer is required.
Fictitious API
This is a contrived example of what some device's API could look
like. This is not part of Fuse.
// get_some_file_names() -
// returns a struct with buffers holding the names of files.
// PARAMETERS
// * path - A path of some sort that the fictitious device groks.
// * offset - Where in the list of file names to start.
// RETURNS
// * A name_list, it has some char buffers holding the file names
// and a couple other auxiliary vars.
//
name_list *get_some_file_names(char *path, size_t offset);
Listing the files in parts
Here's a Fuse callback that can be registered with the Fuse system to list the filenames provided by get_some_file_names(). It's arbitrarily named readdir_callback() so its purpose is obvious.
int readdir_callback( char *path,
void *buf, // This is meant to be "opaque".
fuse_fill_dir_t filler, // filler takes care of buf.
off_t off, // Last value given to filler.
struct fuse_file_info *fi )
{
// Call the fictitious API to get a list of file names.
name_list *list = get_some_file_names(path, off);
for (int i = 0; i < list->length; i++)
{
// Feed the file names to filler() one at a time.
if (filler(buf, list->names[i], NULL, off + i + 1))
{
break; // filler() returned 1, requesting a break.
}
incr_num_files_listed(list);
}
if (all_files_listed(list))
{
return 1; // Tell Fuse we're done.
}
return 0;
}
The off (offset) value is not used by the filler function to fill its opaque buffer, buf. The off value is, however, meaningful to the callback as an offset base as it provides file names to filler(). Whatever value was last passed to filler() is what gets passed back to readdir_callback() on its next invocation. filler() itself only cares whether the off value is 0 or not-0.
Indicating "I'm done listing!" to Fuse
To signal to the Fuse system that your readdir_callback() is done listing file names in parts (when the last of the list of names has been given to filler()), simply return 1 from it.
How off Is Used
The off, offset, parameter should be non-0 to perform the partial listings. That's its only requirement as far as filler() is concerned. If off is 0, that indicates to Fuse that you're going to do a full listing in one shot (see below).
Although filler() doesn't care what the off value is beyond it being non-0, the value can still be meaningfully used. The code above is using the index of the next item in its own file list as its value. Fuse will keep passing the last off value it received back to the read dir callback on each invocation until the listing is complete (when readdir_callback() returns 1).
Listing the files all at once
int readdir_callback( char *path,
void *buf,
fuse_fill_dir_t filler,
off_t off,
struct fuse_file_info *fi )
{
name_list *list = get_all_file_names(path);
for (int i = 0; i < list->length; i++)
{
filler(buf, list->names[i], NULL, 0);
}
return 0;
}
Listing all the files in one shot, as above, is simpler - but not by much. Note that off is 0 for the full listing. One may wonder, 'why even bother with the first approach of reading the folder contents in parts?'
The in-parts strategy is useful where a set number of buffers for file names is allocated, and the number of files within folders may exceed this number. For instance, the implementation of name_list above may only have 8 allocated buffers (char names[8][256]). Also, buf may fill up and filler() start returning 1 if too many names are given at once. The first approach avoids this.
The offset passed to the filler function is the offset of the next item in the directory. You can have the entries in the directory in any order you want. If you don't want to return an entire directory at once, you need to use the offset to determine what gets asked for and stored. The order of items in the directory is up to you; it does not need to follow the order of the names, the inodes, or anything else.
Specifically, in the readdir call, you are passed an offset. You want to start calling the filler function with entries that will be at this offset or later. In the simplest case, the length of each entry is 24 bytes + strlen(name of entry), rounded up to the nearest multiple of 8 bytes. However, see the fuse source code at http://sourceforge.net/projects/fuse/ for when this might not be the case.
I have a simple example, where I have a loop (pseudo c-code) in my readdir function:
int my_readdir(const char *path, void *buf, fuse_fill_dir_t filler, off_t offset, struct fuse_file_info *fi)
{
(a bunch of prep work has been omitted)
struct stat st;
int off, nextoff=0, lenentry, i;
char namebuf[(long enough for any one name)];
for (i=0; i<NumDirectoryEntries; i++)
{
(fill st with the stat information, including inode, etc.)
(fill namebuf with the name of the directory entry)
lenentry = ((24+strlen(namebuf)+7)&~7);
off = nextoff; /* offset of this entry */
nextoff += lenentry;
/* Skip this entry if we weren't asked for it */
if (off<offset)
continue;
/* Add this to our response until we are asked to stop */
if (filler(buf, namebuf, &st, nextoff))
break;
}
/* All done because we were asked to stop or because we finished */
return 0;
}
I tested this within my own code (I had never used the offset before), and it works fine.
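The offset arithmetic in that loop can be pulled out into a small, runnable sketch. It uses the same formula as the answer (24 bytes plus the name length, rounded up to a multiple of 8), which is an assumption about libfuse's internal entry layout rather than a documented guarantee. entry_len() and first_entry_at() are hypothetical helper names, not part of the FUSE API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size of one directory entry under the answer's assumption:
 * a 24-byte fixed part plus the name, rounded up to 8 bytes. */
static size_t entry_len(const char *name)
{
    return (24 + strlen(name) + 7) & ~(size_t)7;
}

/* For each name, store in next_off[i] the value to pass to filler()
 * (the offset just past entry i). Returns the index of the first
 * entry whose own offset is >= offset, or count if none remain. */
static size_t first_entry_at(const char *const *names, size_t count,
                             size_t offset, size_t *next_off)
{
    size_t off = 0, first = count, i;

    for (i = 0; i < count; i++) {
        if (first == count && off >= offset)
            first = i;               /* resume point for this call */
        off += entry_len(names[i]);
        next_off[i] = off;
    }
    return first;
}
```

For the listing a.txt, b.txt, c.txt from the question, each entry is 32 bytes under this formula, so the values handed to filler() are 32, 64 and 96, and a readdir call arriving with offset 64 resumes at c.txt. The offsets have nothing to do with file sizes or inode numbers.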
