Where are ioctl parameters (such as 0x1268 / BLKSSZGET) actually specified?

I am looking for a definitive specification describing the expected arguments and behavior of ioctl 0x1268 (BLKSSZGET).
This number is defined in many places, such as linux/fs.h, but none of those places is a definitive reference source, and I can find no specification for it.
Surely, somebody at some point in the past decided that 0x1268 would get the sector size of a device and documented that somewhere. Where does this information come from, and where can I find it?
Edit: I am not asking what BLKSSZGET does in general, nor am I asking what header it is defined in. I am looking for a definitive, standardized source that states what argument types it should take and what its behavior should be for any driver that implements it.
Specifically, I am asking because there appears to be a bug in blkdiscard in util-linux 2.23 (and 2.24): the sector size is queried into a uint64_t, but the high 32 bits are left untouched because BLKSSZGET appears to expect a 32-bit integer. This leads to an incorrect sector size, incorrect alignment calculations, and failures in blkdiscard when it should succeed. So before I submit a patch, I need to determine, with absolute certainty, whether the problem is that blkdiscard should be using a 32-bit integer, or that the driver implementation in my kernel should be using a 64-bit integer.
Edit 2: Since we're on the topic, the proposed patch presuming blkdiscard is incorrect is:
--- sys-utils/blkdiscard.c-2.23 2013-11-01 18:28:19.270004947 -0400
+++ sys-utils/blkdiscard.c 2013-11-01 18:29:07.334002382 -0400
@@ -71,7 +71,8 @@
 {
 	char *path;
 	int c, fd, verbose = 0, secure = 0;
-	uint64_t end, blksize, secsize, range[2];
+	uint64_t end, blksize, range[2];
+	uint32_t secsize;
 	struct stat sb;
 	static const struct option longopts[] = {
@@ -146,8 +147,8 @@
 		err(EXIT_FAILURE, _("%s: BLKSSZGET ioctl failed"), path);

 	/* align range to the sector size */
-	range[0] = (range[0] + secsize - 1) & ~(secsize - 1);
-	range[1] &= ~(secsize - 1);
+	range[0] = (range[0] + (uint64_t)secsize - 1) & ~((uint64_t)secsize - 1);
+	range[1] &= ~((uint64_t)secsize - 1);

 	/* is the range end behind the end of the device ?*/
 	end = range[0] + range[1];
The patch applies to, e.g., the sources at https://www.kernel.org/pub/linux/utils/util-linux/v2.23/.

The answer to "where is this specified?" does seem to be the kernel source.
I asked the question on the kernel mailing list here: https://lkml.org/lkml/2013/11/1/620
In response, Theodore Ts'o wrote (note: he mistakenly included sys-utils/blkdiscard.c in his list of files that pass an int, but that is inconsequential):
BLKSSZGET returns an int. If you look at the sources of util-linux
v2.23, you'll see it passes an int to BLKSSZGET in
sys-utils/blkdiscard.c
lib/blkdev.c
E2fsprogs also expects BLKSSZGET to return an int, and if you look at
the kernel sources, it very clearly returns an int.
The one place it doesn't is in sys-utils/blkdiscard.c, where as you
have noted, it is passing in a uint64 to BLKSSZGET. This looks like
it's a bug in sys-util/blkdiscard.c.
He then went on to submit a patch¹ to blkdiscard at util-linux:
--- a/sys-utils/blkdiscard.c
+++ b/sys-utils/blkdiscard.c
@@ -70,8 +70,8 @@ static void __attribute__((__noreturn__)) usage(FILE *out)
 int main(int argc, char **argv)
 {
 	char *path;
-	int c, fd, verbose = 0, secure = 0;
-	uint64_t end, blksize, secsize, range[2];
+	int c, fd, verbose = 0, secure = 0, secsize;
+	uint64_t end, blksize, range[2];
 	struct stat sb;
 	static const struct option longopts[] = {
I had been hesitant to mention the blkdiscard tool in both my mailing list post and the original version of this SO question specifically for this reason: I know what's in my kernel's source, and it's easy enough to modify blkdiscard to agree with that source, so the tool ended up distracting from the real question of "where is this documented?".
So, as for the specifics, somebody more official than me has also stated that the BLKSSZGET ioctl takes an int, but the general question regarding documentation remained. I then followed up with https://lkml.org/lkml/2013/11/3/125 and received another reply from Theodore Ts'o answering the question. He wrote:
> There was a bigger question hidden behind the context there that I'm
> still wondering about: Are these ioctl interfaces specified and
> documented somewhere? From what I've seen, and from your response, the
> implication is that the kernel source *is* the specification, and not
> document exists that the kernel is expected to comply with; is this
> the case?
The kernel source is the specification. Some of these ioctls are
documented as part of the Linux man pages, for which the project home
page is here:
https://www.kernel.org/doc/man-pages/
However, these document existing practice; if there is a discrepancy
between what the kernel has implemented and the Linux man pages,
it is the Linux man pages which are buggy and which will be changed.
That is, man pages are descriptive, not prescriptive.
I also asked about the use of "int" in general for public kernel APIs; his response is there as well, although that is off-topic here.
Answer: So, there you have it, the final answer is: The ioctl interfaces are specified by the kernel source itself; there is no document that the kernel adheres to. There is documentation to describe the kernel's implementations of various ioctls, but if there is a mismatch, it is an error in the documentation, not in the kernel.
¹ With all the above in mind, I want to point out that an important difference in the patch Theodore Ts'o submitted, compared to mine, is the use of "int" rather than "uint32_t" -- BLKSSZGET, as per kernel source, does indeed expect an argument that is whatever size "int" is on the platform, not a forced 32-bit value.
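For reference, here is a minimal userspace sketch of the correct usage, passing a plain int as the kernel expects (error handling kept minimal; pass a block device path as argv[1]):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int fd, secsize;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The kernel writes through an int pointer here; passing a pointer
	   to a uint64_t leaves the high 32 bits untouched, which is exactly
	   the blkdiscard bug described above. */
	if (ioctl(fd, BLKSSZGET, &secsize) < 0) {
		perror("BLKSSZGET");
		close(fd);
		return 1;
	}
	printf("sector size: %d\n", secsize);
	close(fd);
	return 0;
}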

Related

Atomicity of writev() system call in Linux

I've looked in the kernel source for linux kernel 4.4.0-57-generic and don't see any locks in the writev() source. Is there something I'm missing? I don't see how writev() is atomic or thread-safe.
Not a kernel expert here, but I'll share my point of view anyway. Feel free to spot any mistakes.
Browsing the kernel source (v4.9, though I wouldn't expect it to be much different in nearby versions), and trying to trace the writev(2) system call, I can observe subsequent function calls that create the following path:
1. SYSCALL_DEFINE3(writev, ..)
2. do_writev(..)
3. vfs_writev(..)
4. do_readv_writev(..)
Now the path branches, depending on whether a write_iter method is implemented and hooked into the struct file_operations of the struct file that the system call refers to.
If it's not NULL, the path is:
5a. do_iter_readv_writev(..), which calls the method filp->f_op->write_iter(..) at this point.
If it is NULL, the path is:
5b. do_loop_readv_writev(..), which repeatedly calls the filp->f_op->write method in a loop, once per iovec segment.
So, as far as I understand, the writev() system call is as thread safe as the underlying write() (or write_iter()) is, which of course can be implemented in various ways, e.g. in a device driver, and may or may not use locks according to its needs and its design.
EDIT:
In kernel v4.4 the paths look pretty similar:
1. SYSCALL_DEFINE3(writev, ..)
2. vfs_writev(..)
3. do_readv_writev(..)
and then it depends on whether the write_iter method as a field in struct file_operations of the struct file is NULL or not, just like the case in v4.9, described above.
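As a userspace side note (not kernel code), here is a minimal sketch showing why the question matters: writev() hands several buffers to the kernel in a single system call, and whether they reach the file contiguously under concurrent writers depends on the write_iter/write implementation traced above.

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct iovec iov[2];

	iov[0].iov_base = (void *)"hello ";
	iov[0].iov_len = strlen("hello ");
	iov[1].iov_base = (void *)"world\n";
	iov[1].iov_len = strlen("world\n");

	/* Both segments are submitted in one call; whether they stay
	   contiguous under concurrent writers is up to the underlying
	   implementation. */
	if (writev(STDOUT_FILENO, iov, 2) < 0)
		perror("writev");
	return 0;
}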
VFS (Virtual File System) by itself doesn't guarantee atomicity of a writev() call. It just calls the filesystem-specific .write_iter method of struct file_operations.
It is the responsibility of the specific filesystem implementation to make the method write to the file atomically.
For example, in the ext4 filesystem the function ext4_file_write_iter uses
mutex_lock(&inode->i_mutex);
to make writes atomic.
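A paraphrased sketch of what such a serialized write_iter implementation looks like (not the actual ext4 code; assumes a v4.4-era kernel, where the per-inode lock was still i_mutex):

#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/uio.h>

/* Hedged sketch: serialize all writers on the inode before deferring
   to the generic write path. */
static ssize_t my_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	mutex_lock(&inode->i_mutex);	/* one writer at a time */
	ret = __generic_file_write_iter(iocb, from);
	mutex_unlock(&inode->i_mutex);
	return ret;
}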
I found it in fs.h:
static inline void file_start_write(struct file *file)
{
	if (!S_ISREG(file_inode(file)->i_mode))
		return;
	__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}
and then in super.c:
/*
 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
 * instead.
 */
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
	bool force_trylock = false;
	int ret = 1;

#ifdef CONFIG_LOCKDEP
	/*
	 * We want lockdep to tell us about possible deadlocks with freezing
	 * but it's it bit tricky to properly instrument it. Getting a freeze
	 * protection works as getting a read lock but there are subtle
	 * problems. XFS for example gets freeze protection on internal level
	 * twice in some cases, which is OK only because we already hold a
	 * freeze protection also on higher level. Due to these cases we have
	 * to use wait == F (trylock mode) which must not fail.
	 */
	if (wait) {
		int i;

		for (i = 0; i < level - 1; i++)
			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
				force_trylock = true;
				break;
			}
	}
#endif
	if (wait && !force_trylock)
		percpu_down_read(sb->s_writers.rw_sem + level-1);
	else
		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);

	WARN_ON(force_trylock & !ret);
	return ret;
}
EXPORT_SYMBOL(__sb_start_write);
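(Note that __sb_start_write provides freeze protection for the superblock; it is paired with file_end_write()/__sb_end_write() once the write completes. So it prevents writes from racing with filesystem freezing rather than serializing concurrent writers against each other.)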

Linux OS: /proc/[pid]/smaps vs /proc/[pid]/statm

I would like to calculate the memory usage of a single process. After a little bit of research I came across smaps and statm.
First of all, what are smaps and statm? What is the difference?
statm has an RSS field, and in smaps I sum up all Rss values, but those values are different for the same process. I know that statm measures in pages; for comparison purposes I converted that value to kB, as in smaps, but the values are still not equal.
Why do these two values differ, even though they represent the RSS of the same process?
statm
232214 80703 7168 27 0 161967 0 (measured in pages, page size is 4096)
smaps
Rss 1956
My aim is to calculate the memory usage of a single process. I am interested in two values: USS and PSS.
Can I obtain those two values just by using smaps? Are those values correct?
Also, I would like to return that value as percentage.
I think statm is an approximate simplification of smaps, which is more expensive to produce. I came to this conclusion after looking at the source:
smaps
The information you see in smaps is defined in /fs/proc/task_mmu.c:
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
	(...)
	struct mm_walk smaps_walk = {
		.pmd_entry = smaps_pte_range,
		.mm = vma->vm_mm,
		.private = &mss,
	};

	memset(&mss, 0, sizeof mss);
	walk_page_vma(vma, &smaps_walk);
	show_map_vma(m, vma, is_pid);
	seq_printf(m,
		(...)
		"Rss:            %8lu kB\n"
		(...)
		mss.resident >> 10,
The mss structure is passed to walk_page_vma, which is defined in /mm/pagewalk.c. However, the mss member resident is not filled in walk_page_vma itself; instead, walk_page_vma calls the callback specified in smaps_walk:
.pmd_entry = smaps_pte_range,
.private = &mss,
like this:
if (walk->pmd_entry)
	err = walk->pmd_entry(pmd, addr, next, walk);
So what does our callback, smaps_pte_range in /fs/proc/task_mmu.c, do?
It calls smaps_pte_entry and smaps_pmd_entry in some circumstances, both of which call smaps_account(), which in turn... updates the resident size! All of these functions are defined in the already linked task_mmu.c, so I didn't post the relevant code snippets, as they can easily be seen in the linked sources.
PTE stands for Page Table Entry and PMD is Page Middle Directory. So basically we iterate through the page entries associated with given process and update RAM usage depending on the circumstances.
statm
The information you see in statm is defined in /fs/proc/array.c:
int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
		   struct pid *pid, struct task_struct *task)
{
	unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
	struct mm_struct *mm = get_task_mm(task);

	if (mm) {
		size = task_statm(mm, &shared, &text, &data, &resident);
		mmput(mm);
	}
	seq_put_decimal_ull(m, 0, size);
	seq_put_decimal_ull(m, ' ', resident);
	seq_put_decimal_ull(m, ' ', shared);
	seq_put_decimal_ull(m, ' ', text);
	seq_put_decimal_ull(m, ' ', 0);
	seq_put_decimal_ull(m, ' ', data);
	seq_put_decimal_ull(m, ' ', 0);
	seq_putc(m, '\n');
	return 0;
}
This time, resident is filled by task_statm. That function has two implementations, one in /fs/proc/task_mmu.c and a second in /fs/proc/task_nommu.c. Since they're almost surely mutually exclusive, I'll focus on the implementation in task_mmu.c (the same file that contains the smaps code above). In this implementation we see that
unsigned long task_statm(struct mm_struct *mm,
			 unsigned long *shared, unsigned long *text,
			 unsigned long *data, unsigned long *resident)
{
	*shared = get_mm_counter(mm, MM_FILEPAGES);
	(...)
	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
	return mm->total_vm;
}
it queries some counters, namely MM_FILEPAGES and MM_ANONPAGES. These counters are modified during various memory operations, such as do_wp_page, defined in /mm/memory.c. All of the modifications seem to be done by files located in /mm/, and there seem to be quite a lot of them, so I didn't include them here.
Conclusion
smaps does complicated iteration through all referenced memory regions and updates resident size using the collected information. statm uses data that was already calculated by someone else.
The most important part is that while smaps collects the data each time in an independent manner, statm uses counters that get incremented or decremented during the process life cycle. There are a lot of places that need to do the bookkeeping, and perhaps some places don't update the counters as they should. That's why IMO statm is inferior to smaps, even if it takes fewer CPU cycles to complete.
Please note that this is the conclusion I drew based on common sense, but I might be wrong - perhaps there are no internal inconsistencies in counter decrementing and incrementing, and instead, they might count some pages differently than smaps. At this point I believe it'd be wise to take it to some experienced kernel maintainers.
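To address the USS/PSS part of the question concretely: yes, both can be derived from smaps alone. A hedged userspace sketch (field names follow the smaps format described in proc(5); PSS is the sum of the Pss fields, USS the sum of Private_Clean plus Private_Dirty):

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, pss = 0, uss = 0;
	FILE *f;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof path, "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof line, f)) {
		if (sscanf(line, "Pss: %lu kB", &kb) == 1)
			pss += kb;	/* proportional share of shared pages */
		else if (sscanf(line, "Private_Clean: %lu kB", &kb) == 1 ||
			 sscanf(line, "Private_Dirty: %lu kB", &kb) == 1)
			uss += kb;	/* pages mapped only by this process */
	}
	fclose(f);
	printf("PSS: %lu kB\nUSS: %lu kB\n", pss, uss);
	return 0;
}

For a percentage, divide either sum by MemTotal from /proc/meminfo.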

Configure kern.log to give more info about a segfault

Currently I can find in kern.log entries like this:
[6516247.445846] ex3.x[30901]: segfault at 0 ip 0000000000400564 sp 00007fff96ecb170 error 6 in ex3.x[400000+1000]
[6516254.095173] ex3.x[30907]: segfault at 0 ip 0000000000400564 sp 00007fff0001dcf0 error 6 in ex3.x[400000+1000]
[6516662.523395] ex3.x[31524]: segfault at 7fff80000000 ip 00007f2e11e4aa79 sp 00007fff807061a0 error 4 in libc-2.13.so[7f2e11dcf000+180000]
(As you can see, the apps causing the segfaults are named ex3.x, meaning the exercise 3 executable.)
Is there a way to ask kern.log to log the complete path? Something like:
[6...] /home/user/cclass/ex3.x[3...]: segfault at 0 ip 0564 sp 07f70 error 6 in ex3.x[4...]
So that I can easily figure out whose (which user's/student's) ex3.x this is?
That log message comes from the kernel with a fixed format that includes only the first 16 characters of the executable name, excluding the path, as per show_signal_msg; the corresponding fault-handling code for non-x86 architectures emits equivalent lines.
As mentioned by Makyen, without significant changes to the kernel and a recompile, the message given to klogd which is passed to syslog won't have the information you are requesting.
I am not aware of any log transformation or injection functionality in syslog or klogd that would allow you to take the name of the file and run either locate or file on the filesystem in order to find the full path.
The best way to get the information you are looking for is to use crash-interception software like apport, abrt, or corekeeper. These tools store process metadata from the /proc filesystem, including the process's command line, which would include the directory it was run from, assuming the binary was invoked with a full path rather than found via $PATH.
The other more generic way would be to enable core dumps, and then to set /proc/sys/kernel/core_pattern to include %E, in order to have the core file name including the path of the binary.
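For example (a hedged sketch; per core(5), %E expands to the pathname of the executable with slashes replaced by '!' characters), running as root:

echo 'core.%E.%p' > /proc/sys/kernel/core_pattern

The resulting core file name then carries the full path of the crashed binary even though the kern.log line does not.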
The short answer is: No, it is not possible without making code changes and recompiling the kernel. The normal solution to this problem is to instruct your students to name their executable <student user name>_ex3.x so that you can easily have this information.
However, it is possible to get the information you desire from other methods. Appleman1234 has provided some alternatives in his answer to this question.
How do we know the answer is "not possible to get the full path in the kern.log segfault messages without recompiling the kernel"?
We look in the kernel source code to find out how the message is produced and if there are any configuration options.
The files in question are part of the kernel source. You can download the entire kernel source as an rpm package (or another package format) for whatever version of Linux/Debian you are running, from a variety of places.
Specifically, the output that you are seeing is produced from whichever of the following files is for your architecture:
linux/arch/sparc/mm/fault_32.c
linux/arch/sparc/mm/fault_64.c
linux/arch/um/kernel/trap.c
linux/arch/x86/mm/fault.c
An example of the relevant function from one of the files(linux/arch/x86/mm/fault.c):
/*
 * Print out info about fatal segfaults, if the show_unhandled_signals
 * sysctl is set:
 */
static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
		unsigned long address, struct task_struct *tsk)
{
	if (!unhandled_signal(tsk, SIGSEGV))
		return;

	if (!printk_ratelimit())
		return;

	printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
		task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
		tsk->comm, task_pid_nr(tsk), address,
		(void *)regs->ip, (void *)regs->sp, error_code);

	print_vma_addr(KERN_CONT " in ", regs->ip);

	printk(KERN_CONT "\n");
}
From that we see that the variable used to print the process name is tsk->comm (where tsk is a struct task_struct *), and that the instruction pointer comes from regs->ip (where regs is a struct pt_regs *).
Then from linux/include/linux/sched.h
struct task_struct {
	...
	char comm[TASK_COMM_LEN]; /* executable name excluding path
				     - access with [gs]et_task_comm (which lock
				       it with task_lock())
				     - initialized normally by setup_new_exec */
The comment makes it clear that the path for the executable is not stored in the structure.
As for regs->ip (where regs is a struct pt_regs *), struct pt_regs is defined in whichever of the following is appropriate for your architecture:
arch/arc/include/asm/ptrace.h
arch/arm/include/asm/ptrace.h
arch/arm64/include/asm/ptrace.h
arch/cris/include/arch-v10/arch/ptrace.h
arch/cris/include/arch-v32/arch/ptrace.h
arch/metag/include/asm/ptrace.h
arch/mips/include/asm/ptrace.h
arch/openrisc/include/asm/ptrace.h
arch/um/include/asm/ptrace-generic.h
arch/x86/include/asm/ptrace.h
arch/xtensa/include/asm/ptrace.h
From there we see that struct pt_regs defines the registers for the architecture. ip is just: unsigned long ip;
Thus, we have to look at what print_vma_addr() does. It is defined in mm/memory.c
/*
 * Print the name of a VMA.
 */
void print_vma_addr(char *prefix, unsigned long ip)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;

	/*
	 * Do not print if we are in atomic
	 * contexts (in exception stacks, etc.):
	 */
	if (preempt_count())
		return;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, ip);
	if (vma && vma->vm_file) {
		struct file *f = vma->vm_file;
		char *buf = (char *)__get_free_page(GFP_KERNEL);
		if (buf) {
			char *p;

			p = d_path(&f->f_path, buf, PAGE_SIZE);
			if (IS_ERR(p))
				p = "?";
			printk("%s%s[%lx+%lx]", prefix, kbasename(p),
					vma->vm_start,
					vma->vm_end - vma->vm_start);
			free_page((unsigned long)buf);
		}
	}
	up_read(&mm->mmap_sem);
}
This shows us that a path was available. We would need to check that it was the full path, but looking a bit further in the code hints that it might not matter: we need to see what kbasename() does with the path passed to it. kbasename() is defined in include/linux/string.h as:
/**
 * kbasename - return the last part of a pathname.
 *
 * @path: path to extract the filename from.
 */
static inline const char *kbasename(const char *path)
{
	const char *tail = strrchr(path, '/');
	return tail ? tail + 1 : path;
}
This chops off everything except the last part of the pathname, leaving just the filename, even if the full path was available before it.
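For example, kbasename("/home/user/cclass/ex3.x") returns "ex3.x", which is exactly the truncated form that appears in the log line.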
Thus, no amount of runtime configuration options will permit printing out the full pathname of the file in the segment fault messages you are seeing.
NOTE: I've changed all of the links to kernel source to point to archives, rather than the original locations. Those links will get close to the code as it was at the time I wrote this, 2014-09. As should be no surprise, the code does evolve over time, so the code which is current when you're reading this may or may not be similar or perform in the way described here.

Is there any kill_proc() replacement for proprietary Linux kernel drivers?

I'm in the process of porting 4 proprietary (read: non-GPL) Linux kernel drivers (that I didn't write) from RHEL 5.x to RHEL 6.x (2.6.32 kernel). The drivers all use kill_proc() for signalling the user-space "session", but this function has been removed from more recent kernels (somewhere between 2.6.18 and 2.6.32). I've seen this question asked many times here and elsewhere, and I've searched fairly extensively, but of the many suggested solutions, none work, due either to the functions no longer being exported or to their requiring a GPL-only function (see below). Does anyone know of a solution that could work for a proprietary driver?
given: kill_proc(pid, sig, 1);
The simplest solution I found was to use: kill_proc_info(sig, SEND_SIG_PRIV, pid); however kill_proc_info is no longer exported so it can't be used.
kill_pid_info() has been suggested (this is called by kill_proc_info() after setting an rcu_read_lock(). kill_pid_info() requires a struct pid* so I could use: kill_pid_info(sig, SEND_SIG_PRIV, find_vpid(pid)); however find_vpid() is exported for GPL use only and this is a proprietary driver. Is there another way to get the struct pid*?
kill_pid_info() also sets up an rcu_read_lock() and then calls group_send_sig_info(). Unfortunately, group_send_sig_info() is not exported, and it also requires a struct task_struct*, but the required find_task_by_vpid() function is not exported either.
Another suggestion was kill_pid(), but this also requires a struct pid*, and again, the function find_vpid() is only exported for GPL.
There were also suggestions for send_sig() and send_sig_info(), but these also require a struct task_struct*, and again, find_task_by_pid() is not exported, and pid_task() requires that (GPL'd) find_vpid() to get a struct pid*. Also, these functions don't set an rcu_read_lock(), and they pass a FALSE value for the group flag (whereas kill_proc ended up using a TRUE value), so there could be some subtle differences.
That's all that I could find. Does anyone have a suggestion that will work for my case? Thanks in advance.
Since there have been no responses to my question, I've been reading much of the kernel code and I think I've found a solution.

It seems that the only exported function that provides the same semantics as kill_proc() is kill_pid(). We can't use the GPL find_vpid() function to get the needed struct pid*, but if we can get the struct task_struct*, then we can get the struct pid* from there as:

task->pids[PIDTYPE_PID].pid

Since find_task_by_vpid() is no longer exported, it seems the only way to find the task is to go through the entire task list looking for it. So, the proposed solution is:
int my_kill_proc(pid_t pid, int sig) {
	int error = -ESRCH;           /* default return value */
	struct task_struct* p;
	struct task_struct* t = NULL;
	struct pid* pspid;

	rcu_read_lock();
	p = &init_task;               /* start at init */
	do {
		if (p->pid == pid) {  /* does the pid (not tgid) match? */
			t = p;
			break;
		}
		p = next_task(p);     /* "this isn't the task you're looking for" */
	} while (p != &init_task);    /* stop when we get back to init */
	if (t != NULL) {
		pspid = t->pids[PIDTYPE_PID].pid;
		if (pspid != NULL)
			error = kill_pid(pspid, sig, 1);
	}
	rcu_read_unlock();
	return error;
}
I know it will take a lot more time to search the whole task list rather than using the hash tables, but it's all I've got. Some concerns/questions that I have:

Is the rcu_read_lock() sufficient for this? Would it be better to use something like preempt_disable() instead?
Can the struct task_struct ever NOT have a PIDTYPE_PID entry in the pids array? And if so, is checking for NULL sufficient?
I'm new to working with the kernel; are there any other suggestions to make this better?
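For illustration, a call site in one of the drivers would then change like this (variable and signal names hypothetical):

/* old: kill_proc(session_pid, SIGUSR1, 1); */
error = my_kill_proc(session_pid, SIGUSR1);
if (error == -ESRCH)
	printk(KERN_WARNING "session process %d not found\n", session_pid);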

Variant type storage and alignment issues

I've made a variant type to use instead of boost::variant. Mine works by storing the index of the current type within a list of the possible types, and storing the data in a byte array with enough space to hold the biggest type.
unsigned char data[my_types::max_size];
int type;
Now, the trouble comes when I write a value to this variant type. I use the following:
template<typename T>
void set(T a) {
	int t = type_index(T);
	if (t != -1) {
		type = t;
		puts("writing atom data");
		*((T *) data) = a; // THIS PART CRASHES!!!!
		puts("did it!");
	} else {
		throw atom_bad_assignment;
	}
}
The line that crashes is the one that stores data to the internal buffer. As you can see, I just cast the byte array directly to a pointer of the desired type. This gives me bad address signals and bus errors when trying to write some values.
I'm using GCC on a 64-bit system. How do I set the alignment for the byte array to make sure the address of the array is 64-bit aligned? (or properly aligned for any architecture I might port this project to).
EDIT: Thank you all, but the mistake was somewhere else. Apparently, Intel doesn't really care about alignment: aligned accesses are faster, but unaligned ones still work, and the program runs fine this way. My problem was that I didn't clear the data buffer before writing stuff, and this caused trouble with the constructors of some types. I will not, however, mark the question as answered, so more people can give me tips on alignment ;)
See http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Variable-Attributes.html
unsigned char data[my_types::max_size] __attribute__ ((aligned));
int type;
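(When the aligned attribute is given no explicit value, as above, GCC aligns the variable to the largest alignment ever used for any data type on the target, which is enough here.)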
I believe
#pragma pack(64)
will compile on all modern compilers; it definitely does on GCC. Note, however, that #pragma pack sets the maximum alignment of structure members, so it can reduce alignment but not increase it; for forcing the alignment of the buffer, the aligned attribute shown above is the more reliable tool.
A variant that doesn't mess with packing globally would be:
#pragma pack(push, 64)
// define union here
#pragma pack(pop)
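For completeness, a hedged C11 sketch of aligned variant storage (my_types_max_size and the type index are stand-ins for the real definitions; memcpy also sidesteps the strict-aliasing question that the pointer cast raises):

#include <stdalign.h>
#include <stddef.h>
#include <string.h>

#define my_types_max_size 32	/* hypothetical size of the largest type */

struct variant {
	alignas(max_align_t) unsigned char data[my_types_max_size];
	int type;
};

/* memcpy avoids both misaligned stores and strict-aliasing violations */
static void variant_set_double(struct variant *v, double x)
{
	memcpy(v->data, &x, sizeof x);
	v->type = 1;	/* hypothetical index of double in the type list */
}

int main(void)
{
	struct variant v;

	variant_set_double(&v, 3.14);
	return v.type == 1 ? 0 : 1;
}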
