register_kprobe() returns -EINVAL unless additional memory is added to the containing struct - linux

I've written a kernel module (a character device) that registers new KProbes whenever I write to the device.
I have a structure that contains struct kprobe. When I call register_kprobe(), it returns -EINVAL. But when I add a dummy character array to the struct (possibly some other data types would work as well), the KProbe registration succeeds.
Probe Registration
struct my_struct *container = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
container->probe.addr = (kprobe_opcode_t *) kallsyms_lookup_name("my_exported_fn"); /* my_exported_fn is in the code section */
container->probe.pre_handler = Pre_Handler;
container->probe.post_handler = Post_Handler;
register_kprobe(&container->probe);
/* Returns -EINVAL if my_struct contains only `struct kprobe`. */
Not working:
struct my_struct {
        struct kprobe probe;
};
Working:
struct my_struct {
        char dummy[512]; /* At 512 it registers consistently. At 256 it registers only sometimes (maybe one out of 5-10 attempts). */
        struct kprobe probe;
};
Why does it need this extra bit of memory to be present in the struct?

This may or may not be an unaligned memory access issue, but in this particular case (I mean your original code, before the edit) I suspect that the data is not properly initialised. Namely, register_kprobe() calls the kprobe_addr() function, which performs the following check:
if ((symbol_name && addr) || (!symbol_name && !addr))
        goto invalid;
...
invalid:
        return ERR_PTR(-EINVAL);
So, if you indeed initialise addr but don't initialise symbol_name, the latter may be a garbage pointer under certain circumstances. kmalloc() doesn't zeroise the allocated memory and, furthermore, depending on the requested size, it takes a memory object from a pool of a suitable size (there are separate pools providing objects of different sizes). When you artificially increase the size of the struct, kmalloc() has to allocate a larger object from a different pool, and such an object may happen not to contain garbage, simply because larger chunks are requested less often. That would explain why your workaround only succeeds intermittently.
All in all, I suggest zeroising the memory chunk or using kzalloc().
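A minimal sketch of the fix, reusing the asker's identifiers (my_struct, Pre_Handler, Post_Handler, my_exported_fn are from the question): kzalloc() zeroes the whole containing struct, so probe.symbol_name and every other kprobe field start out NULL.
#include <linux/kallsyms.h>
#include <linux/kprobes.h>
#include <linux/slab.h>

static int register_my_probe(void)
{
        struct my_struct *container;
        int ret;

        /* kzalloc() zeroes the allocation, so probe.symbol_name is NULL */
        container = kzalloc(sizeof(*container), GFP_KERNEL);
        if (!container)
                return -ENOMEM;

        container->probe.addr =
                (kprobe_opcode_t *)kallsyms_lookup_name("my_exported_fn");
        container->probe.pre_handler = Pre_Handler;
        container->probe.post_handler = Post_Handler;

        /* container must stay allocated until unregister_kprobe() */
        ret = register_kprobe(&container->probe);
        if (ret)
                kfree(container);
        return ret;
}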

Related

Why don't multiple threads have to share a lock to call mmap like they do malloc/calloc/sbrk?

I'm working with ptmalloc, and something interesting I came across is what happens when an arena runs out of available chunks (and the top chunk is not large enough) and has to either extend the arena using sbrk() or allocate a non-contiguous region using mmap(). What particularly stood out to me is that in order to allocate more memory using sbrk(), a lock had to be acquired before calling it (in addition to the lock previously obtained to be in sole possession of the current arena). However, no lock needs to be acquired before calling mmap(). I have included the relevant parts of the sys_alloc() function from the malloc.c file of the ptmalloc implementation (for reference) below:
Call to extend arena using sbrk():
if (HAVE_MORECORE && tbase == CMFAIL) { /* Try noncontiguous MORECORE */
    size_t asize = granularity_align(nb + TOP_FOOT_SIZE + SIZE_T_ONE);
    if (asize < HALF_MAX_SIZE_T) {
        char* br = CMFAIL;
        char* end = CMFAIL;
        ACQUIRE_MORECORE_LOCK(); /* LOCK */
        br = (char*)(CALL_MORECORE(asize));
        end = (char*)(CALL_MORECORE(0));
        RELEASE_MORECORE_LOCK(); /* UNLOCK */
        if (br != CMFAIL && end != CMFAIL && br < end) {
            size_t ssize = end - br;
            if (ssize > nb + TOP_FOOT_SIZE) {
                tbase = br;
                tsize = ssize;
            }
        }
    }
}
Call to extend arena using mmap():
if (HAVE_MMAP && tbase == CMFAIL) { /* Try MMAP */
    size_t req = nb + TOP_FOOT_SIZE + SIZE_T_ONE;
    size_t rsize = granularity_align(req);
    if (rsize > nb) { /* Fail if wraps around zero */
        char* mp = (char*)(CALL_MMAP(rsize));
        if (mp != CMFAIL) {
            tbase = mp;
            tsize = rsize;
            mmap_flag = IS_MMAPPED_BIT;
        }
    }
}
Any help understanding why this is able to work even with multiple threads that have the exact same memory pattern (and thus have to extend their arenas at the same time) without having to use locks (i.e., how mmap() is guaranteed to return distinct addresses, even if called simultaneously with a NULL suggested address) would be greatly appreciated.
In the code snippet using sbrk(), the process-wide heap area is extended. Two calls are issued: the first one extends the heap area by asize bytes, and the second one gets the resulting address of the new top of the heap (the so-called program break). The heap area is shared by all the threads of the process, and the current top is global state for all of them. Hence, it is protected by a mutex whenever a thread modifies it (shrink/grow operations).
In the code snippet using mmap(), the current thread allocates a single memory-mapped area for itself. The resulting address is used only by the calling thread. Hence, no mutual exclusion is necessary from the point of view of ptmalloc's global data structures, since they are not modified. The IS_MMAPPED_BIT flag is set in the internal allocation header to tell ptmalloc, when it is later asked to free the chunk, that this is a memory-mapped region. As for mmap() internals, the mutual exclusion is managed inside the kernel.
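To see why concurrent mmap() calls are safe without any user-space lock, here is a minimal standalone sketch (not part of ptmalloc): the kernel serialises changes to the process address space internally, so several threads mapping anonymous memory with a NULL address hint always receive distinct, non-overlapping ranges.
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    /* NULL address hint: the kernel picks the range under its own
       internal lock, so concurrent calls never return overlapping areas. */
    void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("thread %ld got %p\n", id, p);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL); /* each thread printed a distinct range */
    return 0;
}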

Local memory for each CUDA thread

I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need an array temp for each idx so that every thread has its own individual array temp. In this case, it works properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. Suppose we have 1024 threads; then the kernel only runs for around 200 of them. So I am wondering whether temp is shared or not. If it is, maybe there is a collision there. I also did not get any error message. Could someone please explain this?
__device__ void test2(int temp[], int idx) {
    temp[0] = idx;
    printf("%d ", temp[0]);
}

__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof(int));
    test2(temp, idx);
}

int main() {
    test<<<1, 1024>>>();
    cudaDeviceSynchronize(); /* wait for the kernel so its printf output is flushed */
    return 0;
}
My question is: where is "temp" actually stored?
The allocation that temp points to is stored in a place called the device heap. It is a form of global memory. However, the temp variable itself (i.e. the pointer value) is in local memory, not shared with or visible to other threads.
I need an array temp for each idx so that every thread has its own individual array temp.
You will get that, subject to the caveats below. Each thread will have its own individual array, referenced by its local variable temp, with a separate allocation backing it on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is limited to 8MB by default, across all of your device heap allocations. So if enough threads make enough (or large enough) allocation requests, you will run out of space.
When you run out of space, the API signals this by returning a zero pointer value for the allocation (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
You can read more about this in the documentation or in many questions here under the SO cuda tag. If you read any of those sources, you will discover that you can increase the size of the device heap.
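For example, here is a minimal host-side sketch using the standard CUDA runtime calls cudaDeviceSetLimit/cudaDeviceGetLimit to enlarge the device heap before any kernel launch; the 64MB figure is an arbitrary choice for illustration.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t heap = 64 * 1024 * 1024; /* 64 MB instead of the default 8 MB */

    /* Must be called before launching any kernel that uses in-kernel
       malloc/new; the limit cannot be changed once the heap is in use. */
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("device heap is now %zu bytes\n", heap);

    /* launch the kernel here, e.g. test<<<1, 1024>>>(); then synchronize */
    return 0;
}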

linux socket: lifetime of ancillary data for sendmsg

I use a cmsg to activate timestamping on Linux socket TX.
ssize_t sendWithOptions
    (int sd, std::vector<uint8_t> &payload, uint32_t destIP, int flags)
{
    msghdr msg { };
    .... // fill in the standard fields

    std::array<uint8_t, CMSG_LEN(sizeof(__u32))> buf;
    msg.msg_control = buf.data();
    msg.msg_controllen = buf.size();

    auto cmsg { CMSG_FIRSTHDR ( &msg ) };
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = buf.size();
    *reinterpret_cast<__u32 *>(CMSG_DATA(cmsg)) = static_cast<__u32>(flags);

    return sendmsg ( sd, &msg, MSG_DONTWAIT );
}
On leaving the function, "buf" is automatically destroyed; does sendmsg need this buffer to live longer?
Do I have a guarantee that the kernel does not need this buffer once sendmsg has returned the number of bytes sent?
Except for specific interfaces, it is generally the case that operating system calls do not rely on user-space to maintain data structures affecting their operation after they are finished. The exceptions will be spelled out in the manual pages.
With sendmsg, in particular, you can rely on the call to complete immediately - whether successful or not. It's therefore fine to use a buffer with automatic storage duration, as you're doing, and let it be destroyed immediately after the call.
As an example of one exception, aio_write(2) is specifically intended to allow user-space to queue a write operation that will be completed asynchronously. For this call, the data is not consumed until it can be successfully written. Hence, you must not modify the data structures provided in the call until you have confirmed it is complete. That caveat is called out in the NOTES section of the manual page:
... The control block must not be changed while the write operation is in progress. The buffer area being written out must not be accessed during the operation or undefined results may occur. The memory areas involved must remain valid.
In summary: check the manual page for the system call. But most of the time, you don't need to worry about it.
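As an illustration, here is a minimal C sketch of the same pattern (the function name send_with_tx_timestamp and its parameters are made up for this example): the control buffer lives on the stack, aligned via a union as cmsg(3) recommends, and is discarded right after sendmsg() returns, which is safe because the kernel consumes the ancillary data during the call.
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>
#include <linux/net_tstamp.h>   /* SOF_TIMESTAMPING_* flag values */

ssize_t send_with_tx_timestamp(int sd, struct iovec *iov, __u32 ts_flags)
{
    union {                      /* guarantees cmsghdr alignment, per cmsg(3) */
        char buf[CMSG_SPACE(sizeof(__u32))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    msg.msg_iov = iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
    memcpy(CMSG_DATA(cmsg), &ts_flags, sizeof ts_flags);

    /* u goes out of scope right after this returns, which is fine */
    return sendmsg(sd, &msg, MSG_DONTWAIT);
}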

Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that when storing pointers within the shared region, you need to store them as offsets from the start of the shared region. Unfortunately, this complicates the use of atomic instructions (e.g. if you're trying to write a lock-free algorithm).
For example, say you have a bunch of reference-counted nodes in shared memory, created by a single writer. The writer periodically, atomically, updates a pointer 'p' to point to a valid node with a positive reference count. Readers want to atomically increment the counter 'p' points at, because 'p' points to the beginning of a node (a struct) whose first element is a reference count. Since p always points to a valid node, incrementing the ref count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the 'p' pointer are stored in shared memory, then clients suffer a race condition:
x = read p
y = x + offset
Increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
Edit: Using inline assembly and the 'lock' prefix on x86, maybe it's possible to build an "increment ptr X with offset Y by value Z" primitive? Are there equivalent options on other architectures? I haven't written much assembly, so I don't know whether the needed instructions exist.
At a low level, an x86 atomic instruction can do all three of these steps at once:
x = read p
y = x + offset
Increment refcount at y
mov edi, Destination
mov edx, DataOffset
mov ecx, NewData
@Repeat:
mov eax, [edi + edx]                    // load OldData
// to increment instead, set ecx = eax + 1 here before the cmpxchg
lock cmpxchg dword ptr [edi + edx], ecx
jnz @Repeat
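For comparison, here is a C11 sketch of the same read/offset/CAS-increment sequence (all names are illustrative, and it assumes the node's first field is an _Atomic uint32_t reference count). Note that this is still two separate atomic operations, so by itself it does not close the race described in the question; it only spells out what the lock cmpxchg loop above computes.
#include <stdatomic.h>
#include <stdint.h>

/* base: start of the shared mapping; poff: the shared "p", stored as an
   offset from base rather than as a raw pointer. */
static inline void inc_refcount(char *base, _Atomic uint64_t *poff)
{
    uint64_t off = atomic_load(poff);                 /* x = read p     */
    _Atomic uint32_t *rc =
        (_Atomic uint32_t *)(void *)(base + off);     /* y = x + offset */
    uint32_t old = atomic_load(rc);
    while (!atomic_compare_exchange_weak(rc, &old, old + 1))
        ;                                             /* increment at y */
}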
This is trivial on a UNIX system; just use the shared memory functions:
shmget, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria:
If shmaddr is NULL, the system chooses a suitable (unused) address at which to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region where it likes to put things; also, check addresses on the stack and those returned from malloc()/new[].)
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
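A minimal sketch of that approach (the key, the size, and the 0x20000000000 attach address are arbitrary choices for illustration; the address just has to be page-aligned and outside the ranges the kernel hands out for stacks, heaps, and libraries). Every cooperating process runs the same code and gets the segment at the same virtual address, so raw pointers stored inside it are valid everywhere:
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_KEY  0x1234              /* agreed-on key, arbitrary here */
#define SHM_SIZE (1 << 20)           /* 1 MB */
#define SHM_ADDR ((void *)0x20000000000UL)

int main(void)
{
    int id = shmget(SHM_KEY, SHM_SIZE, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    /* Fixed attach address; it is already page-aligned, so no SHM_RND. */
    void *base = shmat(id, SHM_ADDR, 0);
    if (base == (void *)-1) { perror("shmat"); return 1; }

    printf("segment mapped at %p in every cooperating process\n", base);
    return 0;
}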
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
 * Make sure gcc doesn't try to be clever and move things around
 * on us. We need to use _exactly_ the address the user gave us,
 * not some alias that contains the same information.
 */
typedef struct {
        int counter;
} atomic_t;

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 */
#define atomic_read(v) ((v)->counter)

/**
 * atomic_set - set atomic variable
 * @v: pointer of type atomic_t
 * @i: required value
 *
 * Atomically sets the value of @v to @i.
 */
#define atomic_set(v, i) (((v)->counter) = (i))

/**
 * atomic_add - add integer to atomic variable
 * @i: integer value to add
 * @v: pointer of type atomic_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic_add(int i, atomic_t *v)
{
        asm volatile(LOCK_PREFIX "addl %1,%0"
                     : "=m" (v->counter)
                     : "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
        long counter;
} atomic64_t;

/**
 * atomic64_add - add integer to atomic64 variable
 * @i: integer value to add
 * @v: pointer to type atomic64_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic64_add(long i, atomic64_t *v)
{
        asm volatile(LOCK_PREFIX "addq %1,%0"
                     : "=m" (v->counter)
                     : "er" (i), "m" (v->counter));
}
We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.
You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, using 0x20000000000
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at the next higher page boundary. The address of the new mapping is returned as the result of the call.
The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:
MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
ERRORS
EINVAL We don't like addr, length, or offset (e.g., they are too large, or not aligned on a page boundary).
Adding the offset to the pointer does not create the potential for a race; it already exists. Since neither x86 nor ARM (at least) can atomically read a pointer and then access the memory it refers to, you need to protect the pointer access with a lock regardless of whether you add an offset.
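A minimal sketch of that lock-based approach, with a process-shared mutex kept inside the shared region itself (all names here are illustrative, not from the question): the lock makes "read p" and "bump the refcount" a single indivisible unit, provided the writer takes the same lock when swapping p.
#include <pthread.h>
#include <stdint.h>

/* Control block living at the start of the shared region. */
typedef struct {
    pthread_mutex_t lock;    /* must be process-shared; see ctrl_init() */
    uint64_t        current; /* offset of the current node within the region */
} shared_ctrl;

/* One-time initialisation by the creating process. */
static void ctrl_init(shared_ctrl *c)
{
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&c->lock, &a);
    pthread_mutexattr_destroy(&a);
    c->current = 0;
}

/* Reader: read the offset and increment the refcount as one unit. */
static uint32_t *acquire_current(char *base, shared_ctrl *c)
{
    pthread_mutex_lock(&c->lock);
    uint32_t *refcount = (uint32_t *)(base + c->current);
    (*refcount)++;           /* safe: the writer takes the same lock to swap */
    pthread_mutex_unlock(&c->lock);
    return refcount;
}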

Looking for a lock-free RT-safe single-reader single-writer structure

I'm looking for a lock-free design conforming to these requisites:
a single writer writes into a structure and a single reader reads from this structure (this structure exists already and is safe for simultaneous read/write)
but at some point in time, the structure needs to be changed by the writer, which then initialises, switches to and writes into a new structure (of the same type but with new content)
and at the next read, the reader switches to this new structure (if the writer has switched to new structures multiple times in the meantime, the reader discards the intermediate ones, ignoring their data).
The structures must be reused, i.e. no heap memory allocation/free is allowed during write/read/switch operations, for RT purposes.
I have currently implemented a ringbuffer containing multiple instances of these structures; but this implementation suffers from the fact that once the writer has used all the structures present in the ringbuffer, there is no free structure left to switch to. The rest of the ringbuffer contains data which the reader does not have to read but which the writer cannot re-use either. As a consequence, the ringbuffer does not fit this purpose.
Any idea (a name or a pseudo-implementation) of a suitable lock-free design? Thanks for considering this problem.
Here's one. The keys are that there are three buffers and that the reader reserves the buffer it is reading from. The writer writes to one of the other two buffers. The risk of collision is minimal. Plus, this expands: just make your member arrays one element longer than the number of readers plus the number of writers.
class RingBuffer
{
public:
    RingBuffer() : lastFullWrite(0)
    {
        // Initialize the elements of dataBeingRead to false
        for (unsigned int i = 0; i < DATA_COUNT; i++)
        {
            dataBeingRead[i] = false;
        }
    }

    Data read()
    {
        // You may want to check that write has been called once here
        // to prevent read from grabbing junk data. Else, initialize the
        // elements of dataArray to something valid.
        unsigned int indexToRead = lastFullWrite;
        Data dataCopy;
        dataBeingRead[indexToRead] = true;
        dataCopy = dataArray[indexToRead];
        dataBeingRead[indexToRead] = false;
        return dataCopy;
    }

    void write( const Data& dataArg )
    {
        unsigned int writeIndex(0);

        // Search for an unused piece of data.
        // It's O(n), but plenty fast enough for small arrays.
        while( writeIndex < DATA_COUNT && dataBeingRead[writeIndex] )
        {
            writeIndex++;
        }
        dataArray[writeIndex] = dataArg;
        lastFullWrite = writeIndex;
    }

private:
    static const unsigned int DATA_COUNT = 3; // one being read, two to write into
    unsigned int lastFullWrite;
    Data dataArray[DATA_COUNT];
    bool dataBeingRead[DATA_COUNT];
};
Note: as written here, reading your data costs two copies. If you pass your data out of the read function through a reference argument, you can cut that down to one copy.
You're on the right track.
Lock-free communication of fixed-size messages between threads/processes/processors
Fixed-size ring buffers can be used for lock-free communication between threads, processes or processors if there is one producer and one consumer. Some checks to perform:
the head variable is written only by the producer (as an atomic action after writing)
the tail variable is written only by the consumer (as an atomic action after reading)
Pitfall: the introduction of a size variable or a buffer full/empty flag; these are typically written by both producer and consumer and hence will give you an issue.
I generally use ring buffers for this purpose. The most important lesson I've learned is that a ring buffer of size N can never contain more than N-1 elements; that is what keeps full and empty distinguishable while the head and tail variables are written only by the producer and the consumer respectively.
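Here is a minimal C11 sketch of such a single-producer/single-consumer ring buffer (the element type and names are illustrative): head is written only by the producer, tail only by the consumer, and a buffer of size N holds at most N-1 elements so that full and empty can be told apart.
#include <stdatomic.h>
#include <stdbool.h>

#define N 16                      /* power of two keeps the masking cheap */

typedef struct {
    int buf[N];
    _Atomic unsigned head;        /* next free slot, producer-owned */
    _Atomic unsigned tail;        /* next slot to read, consumer-owned */
} spsc_ring;

static bool spsc_push(spsc_ring *r, int v)       /* producer side */
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (((h + 1) & (N - 1)) == t)
        return false;                            /* full: N-1 elements */
    r->buf[h] = v;
    atomic_store_explicit(&r->head, (h + 1) & (N - 1), memory_order_release);
    return true;
}

static bool spsc_pop(spsc_ring *r, int *v)       /* consumer side */
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h)
        return false;                            /* empty */
    *v = r->buf[t];
    atomic_store_explicit(&r->tail, (t + 1) & (N - 1), memory_order_release);
    return true;
}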
Extension for large/variable-size blocks
To use buffers in a real-time environment, you can either use memory pools (often available in optimized form in real-time operating systems) or decouple allocation from usage. The latter fits the question, I believe.
If you need to exchange large blocks, I suggest using a pool of buffer blocks and communicating pointers to buffers through the queues. So use a third queue holding buffer pointers: the allocations can then be done by the application (in the background) while your real-time portion has access to a variable amount of memory.
Application:
while (blockQueue.full != true)
{
    buf = allocate block of memory from heap or buffer pool
    msg = { .... , buf };
    blockQueue.Put(msg)
}

Producer:
msg = blockQueue.Get()
pQueue.Put(msg)

Consumer:
if (pQueue.Empty == false)
{
    msg = pQueue.Get()
    // use info in msg, with the buf pointer
    // optionally indicate that buf is no longer used
}
