mutex unlocking and request_module() behaviour - linux

I've observed the following code pattern in the Linux kernel, for example in net/sched/act_api.c and many other places as well:
rtnl_lock();
rtnetlink_rcv_msg(skb, ...);

replay:
    ret = process_msg(skb);
    ...
    /* Try to obtain a symbol which lives in a module; if that
     * fails, try to load the module, otherwise use the symbol. */
    a = get_symbol();
    if (a == NULL) {
        rtnl_unlock();
        request_module();
        rtnl_lock();
        /* Now verify that we can obtain symbols from the requested
         * module, and return -EAGAIN. */
        a = get_symbol();
        module_put();
        return -EAGAIN;
    }
    ...
    if (ret == -EAGAIN)
        goto replay;
    ...
rtnl_unlock();
After request_module() has succeeded, the symbol we are interested in becomes available in kernel memory, and we can use it. What I don't understand is why the code returns -EAGAIN and re-reads the symbol: why can't it just continue right after request_module()?

If you look at the current implementation in the Linux kernel, there is a comment right after the second call equivalent to get_symbol() in your code above (here it is tc_lookup_action_n()) that explains exactly why:
rtnl_unlock();
request_module("act_%s", act_name);
rtnl_lock();

a_o = tc_lookup_action_n(act_name);

/* We dropped the RTNL semaphore in order to
 * perform the module load. So, even if we
 * succeeded in loading the module we have to
 * tell the caller to replay the request. We
 * indicate this using -EAGAIN.
 */
if (a_o != NULL) {
    err = -EAGAIN;
    goto err_mod;
}
Even though the module could be requested and loaded, the RTNL semaphore had to be dropped to load it (module loading is an operation that can sleep, and dropping the lock is not the "standard way" this function runs). Because other changes may have happened while the lock was not held, the function returns -EAGAIN to tell the caller to replay the whole request.
EDIT for clarification:
If we look at the call sequence when a new action is added (which could cause a required module to be loaded), we have this sequence: tc_ctl_action() -> tcf_action_add() -> tcf_action_init() -> tcf_action_init_1().
Now if we trace the -EAGAIN error back up to tc_ctl_action(), in the case RTM_NEWACTION: we see that with an -EAGAIN return value the call to tcf_action_add() is repeated.
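Paraphrasing tc_ctl_action() in net/sched/act_api.c (exact argument names vary across kernel versions), the replay amounts to:

case RTM_NEWACTION:
replay:
    ret = tcf_action_add(net, tca[TCA_ACT_TAB], n, portid, ovr);
    if (ret == -EAGAIN)
        goto replay;
    break;

By the time the request is replayed, the module is already loaded, so the second pass resolves the symbol while holding the lock the whole way through.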

Using fetch-and-add as lock

I am trying to understand how fetch-and-add can be used as a lock. Here is what the book (Operating Systems: Three Easy Pieces) says:
The basic operation is pretty simple: when
a thread wishes to acquire a lock, it first does an atomic fetch-and-add
on the ticket value; that value is now considered this thread’s “turn”
(myturn). The globally shared lock->turn is then used to determine
which thread’s turn it is; when (myturn == turn) for a given thread,
it is that thread’s turn to enter the critical section.
What I do not understand is how the thread checks whether the lock is held by another process before entering the critical section. All I can read is that the value will be incremented; there is no mention of checks!
Another part says:
Unlock is accomplished
simply by incrementing the turn such that the next waiting thread (if
there is one) can now enter the critical section.
Which I cannot interpret in a way where checks will not be performed, which cannot be true because it compromises the whole purpose of locking critical sections. What am I missing here? Thanks.
What I do not understand is how the thread checks whether the lock is held by another process before entering the critical section.
You need an "atomic fetch" for this, maybe something like "while( atomic_fetch(currently_serving) != my_ticket) { /* wait */ }".
If you have "atomic fetch and add", then you can implement "atomic fetch" by doing "atomic fetch and add the value zero", maybe something like "while( atomic_fetch_and_add(currently_serving, 0) != my_ticket) { /* wait */ }".
For reference, the full sequence could be something like:
my_ticket = atomic_fetch_and_add(ticket_counter, 1);
while (atomic_fetch_and_add(currently_serving, 0) != my_ticket) {
    /* wait */
}
/* Critical section (lock successfully acquired). */
atomic_fetch_and_add(currently_serving, 1); /* Release the lock */
Of course you might have a better atomic fetch you can use instead (e.g. for some CPUs any normal aligned load is atomic).
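If it helps to see it compile, here is a minimal sketch using C11 atomics (the type and function names are mine; a plain atomic_load stands in for the fetch-and-add-zero trick, since C11 loads on atomic_uint are atomic):

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;        /* ticket handed to the next arrival */
    atomic_uint currently_serving;  /* ticket currently allowed in */
} ticket_lock;

static void ticket_acquire(ticket_lock *l)
{
    /* Atomically take a ticket; the previous value is this thread's turn. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);

    /* Spin until the lock holder advances currently_serving to our ticket. */
    while (atomic_load(&l->currently_serving) != my_ticket)
        ; /* wait */
}

static void ticket_release(ticket_lock *l)
{
    /* Increment the turn so the next waiting thread (if any) proceeds. */
    atomic_fetch_add(&l->currently_serving, 1);
}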

linux socket: lifetime of ancillary data for sendmsg

I use a cmsg to activate timestamping on a Linux socket's TX path.
ssize_t sendWithOptions
    (int sd, std::vector<uint8_t> &payload, uint32_t destIP, int flags)
{
    msghdr msg { };
    .... // filling standard fields

    std::array<uint8_t, CMSG_LEN(sizeof(__u32))> buf;
    msg.msg_control = buf.data();
    msg.msg_controllen = buf.size();

    auto cmsg { CMSG_FIRSTHDR ( &msg ) };
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = buf.size();
    *reinterpret_cast<__u32 *>(CMSG_DATA(cmsg)) = static_cast<__u32>(flags);

    return sendmsg ( sd, &msg, MSG_DONTWAIT );
}
Leaving the function, buf is automatically destroyed. But does sendmsg need this buffer to live longer?
Do I have a guarantee that the kernel does not need this buffer once the call has returned the number of bytes sent?
Except for specific interfaces, it is generally the case that operating system calls do not rely on user-space to maintain data structures affecting their operation after they are finished. The exceptions will be spelled out in the manual pages.
With sendmsg, in particular, you can rely on the call to complete immediately - whether successful or not. It's fine therefore to use a dynamically allocated buffer as you're doing, and destroy it immediately after the call.
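As a sketch of the same pattern in C (untested; the function name send_with_tstamp is mine, a connected socket is assumed, and CMSG_SPACE is used to size the control buffer, which is the conventional choice):

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

ssize_t send_with_tstamp(int sd, const void *payload, size_t len, uint32_t flags)
{
    struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
    /* Union guarantees correct alignment for the cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(uint32_t))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = { 0 };

    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SO_TIMESTAMPING;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(uint32_t));
    memcpy(CMSG_DATA(cmsg), &flags, sizeof(uint32_t));

    return sendmsg(sd, &msg, MSG_DONTWAIT);
    /* u goes out of scope here; that is fine, the kernel has already
     * copied the control data during the call. */
}

The kernel copies both the iovec array and the control buffer during the call, so nothing in user space needs to outlive sendmsg() here.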
As an example of one exception, aio_write(2) is specifically intended to allow user-space to queue a write operation that will be completed asynchronously. For this call, the data is not consumed until it can be successfully written. Hence, you must not modify the data structures provided in the call until you have confirmed it is complete. That caveat is called out in the NOTES section of the manual page:
... The control block must not be changed while the write operation is in progress. The buffer area being written out must not be accessed during the operation or undefined results may occur. The memory areas involved must remain valid.
In summary: check the manual page for the system call. But most of the time, you don't need to worry about it.

WSASend lpNumberOfBytesSent

I am using WSASend on an IOCP-structured server.
There is one problem.
wsabuf[bufcount - 1].buf = pPacket->GetPacketBufferPtr();
wsabuf[bufcount - 1].len = (int)pPacket->Get_PacketSize();
iSendSize += wsabuf[bufcount - 1].len;
bufcount++;

int retval = WSASend(pSession->socket, wsabuf, bufcount - 1, &sendbytes, flag, &pSession->overlapped_Send, NULL);
if (retval == SOCKET_ERROR)
{
    if (WSAGetLastError() != WSA_IO_PENDING)
    {
        ......
    }
}
if (retval == 0)
{
    if (sendbytes != iSendSize)
    {
        ........
    }
}
.....
In the code above, I store the packet to send in wsabuf and send it with WSASend.
Finally, I compare sendbytes and iSendSize.
However, sendbytes and iSendSize are different.
I do not know why.
The actual number of transferred bytes is returned by the driver only when the operation completes. The I/O subsystem copies this value into the IO_STATUS_BLOCK.Information field associated with the I/O operation, and that is the value the user eventually gets back - but of course only after the operation has completed.
The Win32 API uses OVERLAPPED in place of IO_STATUS_BLOCK: it reinterpret-casts the OVERLAPPED to an IO_STATUS_BLOCK and passes that pointer to the kernel. So OVERLAPPED.InternalHigh will contain the actual number of transferred bytes, but only after the operation has completed. (If an error is returned synchronously, the I/O subsystem does not fill this field, so its value is undefined on error; logically it would be 0.)
WSASend reads the value (after the call into the kernel) from OVERLAPPED.InternalHigh and, if lpNumberOfBytesSent is not NULL, copies it there. If you use a synchronous socket handle, the I/O operation will already have completed at this point (the I/O subsystem waits for it internally before returning to the caller), and a valid value from OVERLAPPED.InternalHigh is copied to *lpNumberOfBytesSent.
In code this looks roughly like:
if (!lpOverlapped)
{
    OVERLAPPED Overlapped = {};
    lpOverlapped = &Overlapped;
}

ZwDeviceIoControlFile(.. reinterpret_cast<IO_STATUS_BLOCK*>(lpOverlapped) ..)

if (lpNumberOfBytesSent)
{
    *lpNumberOfBytesSent = (ULONG)lpOverlapped->InternalHigh;
}
In the case of an asynchronous socket handle, the operation usually has not finished yet when the call returns from the kernel. As a result, lpOverlapped->InternalHigh has not yet been filled with the correct number of bytes, and
*lpNumberOfBytesSent = (ULONG)lpOverlapped->InternalHigh;
produces an incorrect (undefined, if neither you nor the system initialized it to, say, 0) result.
Conclusion: you cannot use sendbytes for an asynchronous I/O operation; its value is undefined there. You can, and need to, get this value when the I/O completes. How you get it depends on how you are notified about completion:
- If you use BindIoCompletionCallback, you get it in FileIOCompletionRoutine, in the dwNumberOfBytesTransfered argument.
- If you use CreateThreadpoolIo, you get it in IoCompletionCallback, in the NumberOfBytesTransferred argument.
- If you use your own IOCP and GetQueuedCompletionStatus, you get back the pointer to the lpOverlapped you used in the call to WSASend (or some other I/O function; it is your task to determine where this lpOverlapped was used) after the operation has completed. At this point you can call GetOverlappedResult for this lpOverlapped (bWait can be set to any value; it does not matter, because the operation has already completed, so the API returns immediately in any case without waiting), and you get the actual number of transferred bytes in lpNumberOfBytesTransferred. However, GetOverlappedResult simply copies the lpOverlapped->InternalHigh value to *lpNumberOfBytesTransferred, so you can also read InternalHigh directly, without calling GetOverlappedResult.
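For the IOCP case, a minimal sketch (assuming iocp is the completion port your socket is associated with):

DWORD bytes;
ULONG_PTR key;
OVERLAPPED *pov;

/* Blocks until some pending overlapped operation on the port completes. */
if (GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE))
{
    /* 'bytes' is the actual number of bytes sent for the operation
     * identified by 'pov'; it equals (ULONG)pov->InternalHigh. */
}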

Atomicity of writev() system call in Linux

I've looked in the kernel source for Linux kernel 4.4.0-57-generic and don't see any locks in the writev() source. Is there something I'm missing? I don't see how writev() is atomic or thread-safe.
Not a kernel expert here, but I'll share my point of view anyway. Feel free to spot any mistakes.
Browsing the kernel (v4.9, though I wouldn't expect it to be much different) and trying to trace the writev(2) system call, I can observe subsequent function calls that create the following path:
1. SYSCALL_DEFINE3(writev, ..)
2. do_writev(..)
3. vfs_writev(..)
4. do_readv_writev(..)
Now the path branches, depending on whether a write_iter method is implemented and hooked into the struct file_operations of the struct file that the system call refers to.
If it's not NULL, the path is:
5a. do_iter_readv_writev(..), which calls the method filp->f_op->write_iter(..) at this point.
If it is NULL, the path is:
5b. do_loop_readv_writev(..), which repeatedly calls the method filp->f_op->write in a loop at this point.
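In rough pseudocode (paraphrased, not verbatim kernel source), the branch inside do_readv_writev() is:

/* Paraphrased shape of the dispatch in do_readv_writev(): */
if (file->f_op->write_iter)
    ret = do_iter_readv_writev(file, &iter, pos, file->f_op->write_iter);
else
    ret = do_loop_readv_writev(file, &iter, pos, file->f_op->write);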
So, as far as I understand, the writev() system call is as thread safe as the underlying write() (or write_iter()) is, which of course can be implemented in various ways, e.g. in a device driver, and may or may not use locks according to its needs and its design.
EDIT:
In kernel v4.4 the paths look pretty similar:
1. SYSCALL_DEFINE3(writev, ..)
2. vfs_writev(..)
3. do_readv_writev(..)
and then it depends on whether the write_iter method (a field in the struct file_operations of the struct file) is NULL or not, just like in v4.9, described above.
VFS (Virtual File System) by itself doesn't guarantee atomicity of a writev() call. It just calls the filesystem-specific .write_iter method of struct file_operations.
It is the responsibility of the specific filesystem implementation to make the method write to the file atomically.
For example, in the ext4 filesystem the function ext4_file_write_iter uses
mutex_lock(&inode->i_mutex);
to make the write atomic.
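Roughly (simplified from fs/ext4/file.c in v4.4; the O_DIRECT and error-handling paths are omitted), the shape is:

static ssize_t ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
    struct inode *inode = file_inode(iocb->ki_filp);
    ssize_t ret;

    mutex_lock(&inode->i_mutex);    /* serializes writers on this inode */
    ret = __generic_file_write_iter(iocb, from);
    mutex_unlock(&inode->i_mutex);

    return ret;
}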
Found it in fs.h:
static inline void file_start_write(struct file *file)
{
    if (!S_ISREG(file_inode(file)->i_mode))
        return;
    __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}
and then in super.c:
/*
 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
 * instead.
 */
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
    bool force_trylock = false;
    int ret = 1;

#ifdef CONFIG_LOCKDEP
    /*
     * We want lockdep to tell us about possible deadlocks with freezing
     * but it's it bit tricky to properly instrument it. Getting a freeze
     * protection works as getting a read lock but there are subtle
     * problems. XFS for example gets freeze protection on internal level
     * twice in some cases, which is OK only because we already hold a
     * freeze protection also on higher level. Due to these cases we have
     * to use wait == F (trylock mode) which must not fail.
     */
    if (wait) {
        int i;

        for (i = 0; i < level - 1; i++)
            if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
                force_trylock = true;
                break;
            }
    }
#endif
    if (wait && !force_trylock)
        percpu_down_read(sb->s_writers.rw_sem + level-1);
    else
        ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);

    WARN_ON(force_trylock & !ret);
    return ret;
}
EXPORT_SYMBOL(__sb_start_write);

How does seccomp-bpf filter syscalls?

I'm investigating the implementation detail of seccomp-bpf, the syscall filtration mechanism that was introduced into Linux since version 3.5.
I looked into the source code of kernel/seccomp.c from Linux 3.10 and want to ask some questions about it.
From seccomp.c, it seems that seccomp_run_filters() is called from __secure_computing() to test the syscall called by the current process.
But looking into seccomp_run_filters(), the syscall number that is passed as an argument is not used anywhere.
It seems that sk_run_filter() is the implementation of the BPF filter machine, but sk_run_filter() is called from seccomp_run_filters() with its first argument (the buffer to run the filter on) set to NULL.
My question is: how can seccomp_run_filters() filter syscalls without using the argument?
The following is the source code of seccomp_run_filters():
/**
 * seccomp_run_filters - evaluates all seccomp filters against @syscall
 * @syscall: number of the current system call
 *
 * Returns valid seccomp BPF response codes.
 */
static u32 seccomp_run_filters(int syscall)
{
    struct seccomp_filter *f;
    u32 ret = SECCOMP_RET_ALLOW;

    /* Ensure unexpected behavior doesn't result in failing open. */
    if (WARN_ON(current->seccomp.filter == NULL))
        return SECCOMP_RET_KILL;

    /*
     * All filters in the list are evaluated and the lowest BPF return
     * value always takes priority (ignoring the DATA).
     */
    for (f = current->seccomp.filter; f; f = f->prev) {
        u32 cur_ret = sk_run_filter(NULL, f->insns);
        if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
            ret = cur_ret;
    }
    return ret;
}
When a user process enters the kernel, the register set is stored to a kernel variable.
The function sk_run_filter implements the interpreter for the filter language. The relevant instruction for seccomp filters is BPF_S_ANC_SECCOMP_LD_W. Each instruction has a constant k, and in this case it specifies the index of the word to be read.
#ifdef CONFIG_SECCOMP_FILTER
    case BPF_S_ANC_SECCOMP_LD_W:
        A = seccomp_bpf_load(fentry->k);
        continue;
#endif
The function seccomp_bpf_load uses the current register set of the user thread to determine the system call information.
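For reference, the word index k maps onto offsets into struct seccomp_data (from include/uapi/linux/seccomp.h), which the kernel synthesizes from that saved register set:

struct seccomp_data {
    int   nr;                  /* system call number */
    __u32 arch;                /* AUDIT_ARCH_* value identifying the ABI */
    __u64 instruction_pointer; /* CPU instruction pointer */
    __u64 args[6];             /* system call arguments */
};

So a filter checks the syscall number by loading the word at offsetof(struct seccomp_data, nr): that is why seccomp_run_filters() never touches its own syscall argument, since the filter program itself fetches the number through seccomp_bpf_load().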
