Large size kmalloc in the Linux kernel

I am looking at Linux version 4.9.31 and at the kmalloc() implementations for SLAB and SLUB.
The following is the kmalloc() function from include/linux/slab.h:
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
return kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
if (!(flags & GFP_DMA)) {
int index = kmalloc_index(size);
if (!index)
return ZERO_SIZE_PTR;
return kmem_cache_alloc_trace(kmalloc_caches[index],
flags, size);
}
#endif
}
return __kmalloc(size, flags);
}
In the above code, kmalloc_large() is called only when __builtin_constant_p(size) is true.
First question: what is the relationship between __builtin_constant_p(size) and kmalloc_large()? Shouldn't kmalloc_large() be chosen at run time rather than at compile time?
The following are __kmalloc() and __do_kmalloc() from mm/slab.c:
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
unsigned long caller)
{
struct kmem_cache *cachep;
void *ret;
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
ret = slab_alloc(cachep, flags, caller);
kasan_kmalloc(cachep, ret, size, flags);
trace_kmalloc(caller, ret,
size, cachep->size, flags);
return ret;
}
void *__kmalloc(size_t size, gfp_t flags)
{
return __do_kmalloc(size, flags, _RET_IP_);
}
The following is __kmalloc() from mm/slub.c:
void *__kmalloc(size_t size, gfp_t flags)
{
struct kmem_cache *s;
void *ret;
if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
return kmalloc_large(size, flags);
s = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;
ret = slab_alloc(s, flags, _RET_IP_);
trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
kasan_kmalloc(s, ret, size, flags);
return ret;
}
Second question: why does SLUB's __kmalloc() check "size > KMALLOC_MAX_CACHE_SIZE" and call kmalloc_large() at run time?

Your two questions are actually parts of a single question:
What is __builtin_constant_p(size)?
The __builtin_constant_p operator is a GCC-specific extension that checks whether its argument is known to be a constant at compile time. E.g., if you call
p = kmalloc(100, GFP_KERNEL);
then the operator returns true.
But with
size_t size = 100;
p = kmalloc(size, GFP_KERNEL);
the operator returns false*.
Once a function's parameter is known at compile time, the compiler can check it at compile time and optimize accordingly.
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
While size > KMALLOC_MAX_CACHE_SIZE looks like a run-time check here, it is actually a compile-time check, because the outer condition guarantees that size is known at compile time. With that knowledge, the compiler can optimize out the inner branch if its condition is false (and if it is true, the compiler can optimize out the other branches).
E.g.,
p = kmalloc(100000, GFP_KERNEL);
will be compiled into
kmalloc_large(100000, GFP_KERNEL);
and
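p = kmalloc(100, GFP_KERNEL);
will be compiled (with SLAB or SLUB, since the flags do not include GFP_DMA) into
kmem_cache_alloc_trace(kmalloc_caches[kmalloc_index(100)], GFP_KERNEL, 100);
where kmalloc_index(100) is itself evaluated at compile time. With CONFIG_SLOB it becomes __kmalloc(100, GFP_KERNEL) instead.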
But
size_t size = 100000;
p = kmalloc(size, GFP_KERNEL);
will be compiled into
size_t size = 100000;
__kmalloc(size, GFP_KERNEL);
because the compiler cannot resolve the branch at compile time.
The "fall-back" function __kmalloc checks its parameters anyway, for the case when the compile-time checks cannot be performed.
* In my recent tests the compiler did not actually propagate the value of a size variable that had been assigned directly from a constant, but this may change in future gcc versions.
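To see the same dispatch outside the kernel, here is a minimal user-space sketch. It is not kernel code: DEMO_MAX_CACHE_SIZE, demo_runtime_path() and demo_alloc_path() are made-up names, and the wrapper is a macro rather than an inline function because, as far as I know, GCC only reports an inlined function's argument as constant when optimization is enabled.

```c
#include <stddef.h>

/* Hypothetical stand-in for KMALLOC_MAX_CACHE_SIZE; the real value
 * depends on the kernel configuration. */
#define DEMO_MAX_CACHE_SIZE 8192

enum alloc_path {
    PATH_LARGE_COMPILE_TIME,  /* like the direct kmalloc_large() call */
    PATH_CACHE_COMPILE_TIME,  /* like the direct cache allocation */
    PATH_LARGE_RUNTIME,       /* large size detected inside the fall-back */
    PATH_CACHE_RUNTIME        /* small size detected inside the fall-back */
};

/* Plays the role of __kmalloc(): the size is only known at run time,
 * so the size check has to be repeated here. */
static enum alloc_path demo_runtime_path(size_t size)
{
    return size > DEMO_MAX_CACHE_SIZE ? PATH_LARGE_RUNTIME
                                      : PATH_CACHE_RUNTIME;
}

/* Plays the role of the inline kmalloc() wrapper: when the size is a
 * compile-time constant the whole expression folds to one enum value. */
#define demo_alloc_path(size)                                        \
    (__builtin_constant_p(size)                                      \
         ? ((size) > DEMO_MAX_CACHE_SIZE ? PATH_LARGE_COMPILE_TIME   \
                                         : PATH_CACHE_COMPILE_TIME)  \
         : demo_runtime_path(size))
```

With a literal such as demo_alloc_path(100000) the choice is made by the compiler; with a volatile variable holding the same value the call falls through to demo_runtime_path(), which repeats the size check, just as __kmalloc() does in SLUB.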

Related

Where could I find the code of "sched_getcpu()"

I have recently been using the function sched_getcpu() from the header file sched.h on Linux.
However, I'm wondering where I can find the source code of this function.
Thanks.
Under Linux, the sched_getcpu() function is a glibc wrapper around the getcpu system call (sys_getcpu()), whose implementation is architecture-specific.
For the x86_64 architecture, it is defined under arch/x86/include/asm/vgtod.h as __getcpu() (tree 4.x):
#ifdef CONFIG_X86_64
#define VGETCPU_CPU_MASK 0xfff
static inline unsigned int __getcpu(void)
{
unsigned int p;
/*
* Load per CPU data from GDT. LSL is faster than RDTSCP and
* works on all CPUs. This is volatile so that it orders
* correctly wrt barrier() and to keep gcc from cleverly
* hoisting it out of the calling function.
*/
asm volatile ("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
return p;
}
#endif /* CONFIG_X86_64 */
This function is called by __vdso_getcpu(), defined in arch/x86/entry/vdso/vgetcpu.c:
notrace long
__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
{
unsigned int p;
p = __getcpu();
if (cpu)
*cpu = p & VGETCPU_CPU_MASK;
if (node)
*node = p >> 12;
return 0;
}
(See the vDSO documentation, e.g. the vdso(7) man page, for what the vdso prefix means.)
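The packing that __vdso_getcpu() undoes can be tried in user space. This is only a sketch inferred from the two quoted functions: the low 12 bits carry the CPU number and the bits above them the node. struct cpu_node, decode_getcpu_word() and encode_getcpu_word() are hypothetical helper names, not kernel symbols.

```c
#define VGETCPU_CPU_MASK 0xfff  /* same mask as in the quoted header */

struct cpu_node {
    unsigned int cpu;
    unsigned int node;
};

/* Split the word loaded by __getcpu() the same way __vdso_getcpu() does. */
static struct cpu_node decode_getcpu_word(unsigned int p)
{
    struct cpu_node r;

    r.cpu = p & VGETCPU_CPU_MASK;  /* low 12 bits: CPU number */
    r.node = p >> 12;              /* remaining bits: NUMA node */
    return r;
}

/* Inverse operation, only used here to build example input. */
static unsigned int encode_getcpu_word(unsigned int cpu, unsigned int node)
{
    return (node << 12) | (cpu & VGETCPU_CPU_MASK);
}
```

For example, decode_getcpu_word(encode_getcpu_word(5, 3)) gives cpu 5 and node 3.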
EDIT 1: (in reply to arm code location)
ARM code location
It can be found in the arch/arm/include/asm/thread_info.h file:
static inline struct thread_info *current_thread_info(void)
{
return (struct thread_info *)
(current_stack_pointer & ~(THREAD_SIZE - 1));
}
This function is used by raw_smp_processor_id() that is defined in the file arch/arm/include/asm/smp.h as:
#define raw_smp_processor_id() (current_thread_info()->cpu)
That macro is in turn used by the getcpu system call, defined in kernel/sys.c:
SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep, struct getcpu_cache __user *, unused)
{
int err = 0;
int cpu = raw_smp_processor_id();
if (cpup)
err |= put_user(cpu, cpup);
if (nodep)
err |= put_user(cpu_to_node(cpu), nodep);
return err ? -EFAULT : 0;
}
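The masking trick in current_thread_info() is easy to check with plain integers. In this sketch the function name, the addresses and the 8192-byte stack size are assumptions for illustration only; the real THREAD_SIZE comes from the kernel headers for the given architecture.

```c
#include <stdint.h>

/* Round a stack pointer down to the start of its stack area, the way
 * current_thread_info() does. thread_size must be a power of two. */
static uintptr_t stack_area_base(uintptr_t sp, uintptr_t thread_size)
{
    return sp & ~(thread_size - 1);
}
```

Any stack pointer inside the same 8 KiB area maps to the same base address, which is where struct thread_info (and its cpu field) lives in this layout.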

how to implement splice_read for a character device file with uncached DMA buffer

I have a character device driver. It includes a 4MB coherent DMA buffer, implemented as a ring buffer. I also implemented the splice_read call for the driver to improve performance, but this implementation does not work well. Below is a usage example:
(1) Splice 16 pages of device buffer data to pipefd[1] (the DMA buffer is managed in units of pages).
(2) Splice pipefd[0] to the socket.
(3) The receiving side (TCP client) receives the data and then checks its correctness.
I found that the TCP client got errors. The splice_read implementation is shown below (I borrowed it from the vmsplice implementation):
/* splice related functions */
static void rdma_ring_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
put_page(buf->page);
buf->flags &= ~PIPE_BUF_FLAG_LRU;
}
void rdma_ring_spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
{
put_page(spd->pages[i]);
}
static const struct pipe_buf_operations rdma_ring_page_pipe_buf_ops = {
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
.confirm = generic_pipe_buf_confirm,
.release = rdma_ring_pipe_buf_release,
.steal = generic_pipe_buf_steal,
.get = generic_pipe_buf_get,
};
/* To simplify the caller's work, the meanings of the ppos and len parameters
* have been changed to fit the driver's internal ring buffer. ppos
* indicates which page is referred to (it should start from 1, as the csr page
* may not be spliced), and len indicates how many pages are needed.
* We also require that the number of pages per splice does not
* exceed 16; otherwise -EINVAL is returned. If a high-speed device
* needs a larger page count, it can rework this routine. The offset (*ppos) is also
* used to return the total number of bytes that should be transferred; the user can
* compare it with the return value to determine whether all bytes have been transferred.
*/
static ssize_t do_rdma_ring_splice_read(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
struct rdma_ring *priv = to_rdma_ring(in->private_data);
struct rdma_ring_buf *data_buf;
struct rdma_ring_dstatus *dsta_buf;
struct page *pages[PIPE_DEF_BUFFERS];
struct partial_page partial[PIPE_DEF_BUFFERS];
ssize_t total_sz = 0, error;
int i;
unsigned offset;
struct splice_pipe_desc spd = {
.pages = pages,
.partial = partial,
.nr_pages_max = PIPE_DEF_BUFFERS,
.flags = flags,
.ops = &rdma_ring_page_pipe_buf_ops,
.spd_release = rdma_ring_spd_release_page,
};
/* init the spd, currently we omit the packet header, if a control
* is needed, it may be implemented by define a control variable in
* the device struct */
spd.nr_pages = len;
for (i = 0; i < len; i++) {
offset = (unsigned)(*ppos) + i;
data_buf = get_buf(priv, offset);
dsta_buf = get_dsta_buf(priv, offset);
pages[i] = virt_to_page(data_buf);
get_page(pages[i]);
partial[i].offset = 0;
partial[i].len = dsta_buf->bytes_xferred;
total_sz += partial[i].len;
}
error = _splice_to_pipe(pipe, &spd);
/* use ppos to return the total number of bytes that should, in theory, be transferred */
*ppos = total_sz;
return error;
}
/* splice read */
static ssize_t rdma_ring_splice_read(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags)
{
ssize_t ret;
MY_PRINT("%s: *ppos = %lld, len = %ld\n", __func__, *ppos, (long)len);
if (unlikely(len > PIPE_DEF_BUFFERS))
return -EINVAL;
ret = do_rdma_ring_splice_read(in, ppos, pipe, len, flags);
return ret;
}
The _splice_to_pipe is the same as splice_to_pipe in the kernel. As that function is not an exported symbol, I re-implemented it.
I think the main cause is that some kind of page locking is missing, but I don't know where or how.
My kernel version is 3.10.

IOCTL Method - Linux

I have an exam question and I can't quite see how to solve it.
There is a driver whose ioctl method needs to be implemented and tested.
I have to write the ioctl() method, the associated test program, and the common IOCTL definitions.
The ioctl() method should only handle one command. In this command, I need to transmit a data structure from user space to kernel space.
The structure is shown below:
struct data
{
     char label [10];
     int value;
};
The driver must print the IOCTL command data, using printk();
Device name is "/dev/mydevice"
The test program must validate driver mode using an initialized data structure.
I hope someone can help.
Thanks in advance.
My suggestion:
static int f_on_ioctl(struct inode *inode, struct file *file, unsigned int cmd,
unsigned long arg)
{
int ret;
switch (cmd)
{
case PASS_STRUCT:
struct data pass_data;
ret = copy_from_user(&pass_data, arg, sizeof(*pass_data));
if(ret < 0)
{
printk("PASS_STRUCT\n");
return -1;
}
printk(KERN ALERT "Message PASS_STRUCT : %d and %c\n",pass_data.value, pass_data.label);
break;
default:
return ENOTTY;
}
return 0;
}
Definitions:
Common.h
#define SYSLED_IOC_MAGIC 'k'
#define PASS_STRUCT _IOW(SYSLED_IOC_MAGIC, 1, struct data)
The test program:
int main()
{
int fd = open("/dev/mydevice", O_RDWR);
data data_pass;
data_pass.value = 2;
data_pass.label = "hej";
ioctl(fd, PASS_STRUCT, &data_pass);
close(fd);
return 0;
}
Is this completely wrong??
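Not completely, but several details would not compile or behave as intended. On the kernel side, copy_from_user() returns the number of bytes it could not copy (never a negative value), so the usual pattern is to return -EFAULT when it is non-zero; the size should be sizeof(pass_data), label needs %s rather than %c, the log level is KERN_ALERT, and the default case should return -ENOTTY. On the user side, the variable needs the struct keyword, an array cannot be assigned with =, and the return values should be checked. Below is a minimal sketch of the shared definitions and the user-space half only; fill_data() and send_data() are hypothetical helper names, and the driver itself is not shown.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Shared between driver and test program (the "Common.h" part). */
struct data {
    char label[10];
    int value;
};

#define SYSLED_IOC_MAGIC 'k'
#define PASS_STRUCT _IOW(SYSLED_IOC_MAGIC, 1, struct data)

/* Initialise the structure; the label is truncated to fit and always
 * NUL-terminated because the struct is zeroed first. */
static void fill_data(struct data *d, const char *label, int value)
{
    memset(d, 0, sizeof(*d));
    strncpy(d->label, label, sizeof(d->label) - 1);
    d->value = value;
}

/* Open the device, issue the single command, close again.
 * Returns 0 on success and -1 if open() or ioctl() failed. */
static int send_data(const char *path, const struct data *d)
{
    int fd = open(path, O_RDWR);
    int ret;

    if (fd < 0)
        return -1;
    ret = ioctl(fd, PASS_STRUCT, d);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

A test program would call fill_data(&d, "hej", 2) followed by send_data("/dev/mydevice", &d) and then look for the printk() output in the kernel log.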

Check signature of Linux shared-object before load

Goal: Load .so or executable that has been verified to be signed (or verified against an arbitrary algorithm).
I want to be able to verify a .so/executable and then load/execute that .so/executable with dlopen/...
The wrench in this is that there seems to be no programmatic way to check-then-load. One could check the file manually and then load it after.. however there is a window-of-opportunity within which someone could swap out that file for another.
One possible solution that I can think of is to load the binary, check the signature, and then dlopen/execve the /proc/$PID/fd.... path; however, I do not know if that is a viable solution.
Since filesystem locks are advisory in Linux they are not so useful for this purpose... (well, there's mount -o mand ... but this is something for userlevel, not root use).
Many dynamic linkers (including Glibc's) support setting the LD_AUDIT environment variable to a colon-separated list of shared libraries. These libraries are allowed to hook into various points in the dynamic library loading process.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
unsigned int la_version(unsigned int v) { return v; }
unsigned int la_objopen(struct link_map *l, Lmid_t lmid, uintptr_t *cookie) {
if (!some_custom_check_on_name_and_contents(l->l_name, l->l_addr))
abort();
return 0;
}
Compile this with cc -shared -fPIC -o test.so test.c or similar.
You can see glibc/elf/tst-auditmod1.c or latrace for more examples, or read the Linkers and Libraries Guide.
Very very specific to Glibc's internals, but you can still hook into libdl at runtime.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
extern struct dlfcn_hook {
void *(*dlopen)(const char *, int, void *);
int (*dlclose)(void *);
void *(*dlsym)(void *, const char *, void *);
void *(*dlvsym)(void *, const char *, const char *, void *);
char *(*dlerror)(void);
int (*dladdr)(const void *, Dl_info *);
int (*dladdr1)(const void *, Dl_info *, void **, int);
int (*dlinfo)(void *, int, void *, void *);
void *(*dlmopen)(Lmid_t, const char *, int, void *);
void *pad[4];
} *_dlfcn_hook;
static struct dlfcn_hook *old_dlfcn_hook, my_dlfcn_hook;
static int depth;
static void enter(void) { if (!depth++) _dlfcn_hook = old_dlfcn_hook; }
static void leave(void) { if (!--depth) _dlfcn_hook = &my_dlfcn_hook; }
void *my_dlopen(const char *file, int mode, void *dl_caller) {
void *result;
fprintf(stderr, "%s(%s, %d, %p)\n", __func__, file, mode, dl_caller);
enter();
result = dlopen(file, mode);
leave();
return result;
}
int my_dlclose(void *handle) {
int result;
fprintf(stderr, "%s(%p)\n", __func__, handle);
enter();
result = dlclose(handle);
leave();
return result;
}
void *my_dlsym(void *handle, const char *name, void *dl_caller) {
void *result;
fprintf(stderr, "%s(%p, %s, %p)\n", __func__, handle, name, dl_caller);
enter();
result = dlsym(handle, name);
leave();
return result;
}
void *my_dlvsym(void *handle, const char *name, const char *version, void *dl_caller) {
void *result;
fprintf(stderr, "%s(%p, %s, %s, %p)\n", __func__, handle, name, version, dl_caller);
enter();
result = dlvsym(handle, name, version);
leave();
return result;
}
char *my_dlerror(void) {
char *result;
fprintf(stderr, "%s()\n", __func__);
enter();
result = dlerror();
leave();
return result;
}
int my_dladdr(const void *address, Dl_info *info) {
int result;
fprintf(stderr, "%s(%p, %p)\n", __func__, address, info);
enter();
result = dladdr(address, info);
leave();
return result;
}
int my_dladdr1(const void *address, Dl_info *info, void **extra_info, int flags) {
int result;
fprintf(stderr, "%s(%p, %p, %p, %d)\n", __func__, address, info, extra_info, flags);
enter();
result = dladdr1(address, info, extra_info, flags);
leave();
return result;
}
int my_dlinfo(void *handle, int request, void *arg, void *dl_caller) {
int result;
fprintf(stderr, "%s(%p, %d, %p, %p)\n", __func__, handle, request, arg, dl_caller);
enter();
result = dlinfo(handle, request, arg);
leave();
return result;
}
void *my_dlmopen(Lmid_t nsid, const char *file, int mode, void *dl_caller) {
void *result;
fprintf(stderr, "%s(%lu, %s, %d, %p)\n", __func__, nsid, file, mode, dl_caller);
enter();
result = dlmopen(nsid, file, mode);
leave();
return result;
}
static struct dlfcn_hook my_dlfcn_hook = {
.dlopen = my_dlopen,
.dlclose = my_dlclose,
.dlsym = my_dlsym,
.dlvsym = my_dlvsym,
.dlerror = my_dlerror,
.dladdr = my_dladdr,
.dladdr1 = my_dladdr1,
.dlinfo = my_dlinfo,
.dlmopen = my_dlmopen,
.pad = {0, 0, 0, 0},
};
__attribute__((constructor))
static void init(void) {
old_dlfcn_hook = _dlfcn_hook;
_dlfcn_hook = &my_dlfcn_hook;
}
__attribute__((destructor))
static void fini(void) {
_dlfcn_hook = old_dlfcn_hook;
}
$ cc -shared -fPIC -o hook.so hook.c
$ cat > a.c
#include <dlfcn.h>
int main() { dlopen("./hook.so", RTLD_LAZY); dlopen("libm.so", RTLD_LAZY); }
^D
$ cc -ldl a.c
$ ./a.out
my_dlopen(libm.so, 1, 0x80484bd)
Unfortunately, my investigations are leading me to conclude that even if you could hook into glibc/elf/dl-load.c:open_verify() (which you can't), it's not possible to make this race-free against somebody writing over segments of your library.
The problem is essentially unsolvable in the form you've given, because shared objects are loaded by mmap()ing to process memory space. So even if you could make sure that the file that dlopen() operated on was the one you'd examined and declared OK, anyone who can write to the file can modify the loaded object at any time after you've loaded it. (This is why you don't upgrade running binaries by writing to them - instead you delete-then-install, because writing to them would likely crash any running instances).
Your best bet is to ensure that only the user you are running as can write to the file, then examine it, then dlopen() it. Your user (or root) can still sneak different code in, but processes with those permissions could just ptrace() you to do their bidding anyhow.
The DigSig project supposedly solves this at the kernel level.
DigSig currently offers:
run time signature verification of ELF binaries and shared libraries.
support for file's signature revocation.
a signature caching mechanism to enhance performances.
I propose the following solution that should work without libraries *)
int memfd = memfd_create("for-debugging.library.so", MFD_CLOEXEC | MFD_ALLOW_SEALING);
assert(memfd != -1);
// Use any way to read the library from disk and copy the content into memfd
// e.g. write(2) or ftruncate(2) and mmap(2)
// Important! if you use mmap, you have to unmap it before the next step
// fcntl( , , F_SEAL_WRITE) will fail if there exists a writeable mapping
int seals_to_set = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL;
int sealing_err = fcntl(memfd, F_ADD_SEALS, seals_to_set);
assert(sealing_err == 0);
// Only now verify the contents of the loaded file
// then you can safely *) dlopen("/proc/self/fd/<memfd>");
*) Not actually tested it against attacks. Do not use in production without further investigation.
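The copy-and-seal step from the outline above could look roughly like this. It is a sketch under assumptions: make_sealed_copy() is a hypothetical helper, the data is taken from a memory buffer instead of being read from disk, and the signature check itself is left out. It needs a Linux kernel and C library that provide memfd_create().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy len bytes into a fresh memfd and seal it against any further
 * change. Returns the file descriptor, or -1 on error. */
static int make_sealed_copy(const void *buf, size_t len)
{
    const char *p = buf;
    int fd = memfd_create("verified-library", MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd < 0)
        return -1;

    /* write(2) may be partial, so loop until everything is copied. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* After this call the contents and the size can no longer change,
     * and no further seals can be added or removed. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The verification would then read the contents back through the returned descriptor, and only on success would the caller build the path /proc/self/fd/<fd> and pass it to dlopen().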

Kernel Panic after changes in sys_close

I'm doing a course on operating systems and we work in Linux Red Hat 8.0.
As part of an assignment I had to change sys_close and sys_open. The changes to sys_open went in without incident, but when I introduce the changes to sys_close the OS suddenly encounters an error during boot, claiming it cannot mount the root fs, and panics. EIP is reportedly in sys_close when this happens.
Here are the changes I made (look for the "HW1 additions" comment):
In fs/open.c:
asmlinkage long sys_open(const char * filename, int flags, int mode)
{
char * tmp;
int fd, error;
event_t* new_event;
#if BITS_PER_LONG != 32
flags |= O_LARGEFILE;
#endif
tmp = getname(filename);
fd = PTR_ERR(tmp);
if (!IS_ERR(tmp)) {
fd = get_unused_fd();
if (fd >= 0) {
struct file *f = filp_open(tmp, flags, mode);
error = PTR_ERR(f);
if (IS_ERR(f))
goto out_error;
fd_install(fd, f);
}
/* HW1 additions */
if (current->record_flag==1){
new_event=(event_t*)kmalloc(sizeof(event_t), GFP_KERNEL);
if (!new_event){
new_event->type=Open;
strcpy(new_event->filename, tmp);
file_queue_add(*new_event, current->queue);
}
}
/* End HW1 additions */
out:
putname(tmp);
}
return fd;
out_error:
put_unused_fd(fd);
fd = error;
goto out;
}
asmlinkage long sys_close(unsigned int fd)
{
struct file * filp;
struct files_struct *files = current->files;
event_t* new_event;
char* tmp = files->fd[fd]->f_dentry->d_name.name;
write_lock(&files->file_lock);
if (fd >= files->max_fds)
goto out_unlock;
filp = files->fd[fd];
if (!filp)
goto out_unlock;
files->fd[fd] = NULL;
FD_CLR(fd, files->close_on_exec);
__put_unused_fd(files, fd);
write_unlock(&files->file_lock);
/* HW1 additions */
if(current->record_flag == 1){
new_event=(event_t*)kmalloc(sizeof(event_t), GFP_KERNEL);
if (!new_event){
new_event->type=Close;
strcpy(new_event->filename, tmp);
file_queue_add(*new_event, current->queue);
}
}
/* End HW1 additions */
return filp_close(filp, files);
out_unlock:
write_unlock(&files->file_lock);
return -EBADF;
}
The task_struct defined in schedule.h was changed at the end to include:
unsigned int record_flag; /* when zero: do not record. when one: record. */
file_queue* queue;
file_queue and event_t are defined in a separate file as follows:
typedef enum {Open, Close} EventType;
typedef struct event_t{
EventType type;
char filename[256];
}event_t;
typedef struct file_quque_t{
event_t queue[101];
int head, tail;
}file_queue;
file_queue_add works like this:
void file_queue_add(event_t event, file_queue* queue){
queue->queue[queue->head]=event;
queue->head = (queue->head+1) % 101;
if (queue->head==queue->tail){
queue->tail=(queue->tail+1) % 101;
}
}
if (!new_event) {
new_event->type = …
That's equivalent to if (new_event == NULL). I think you mean if (new_event != NULL), which the kernel folks typically write as if (new_event).
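A user-space model of the corrected check, for illustration only: malloc() stands in for kmalloc(), record_event() is a hypothetical helper, and the queue type is a simplified copy of the one in the question. Because file_queue_add() stores the event by value, the temporary allocation can be released afterwards (the posted kernel code never frees it).

```c
#include <stdlib.h>
#include <string.h>

#define QUEUE_SLOTS 101

typedef enum { Open, Close } EventType;

typedef struct event_t {
    EventType type;
    char filename[256];
} event_t;

typedef struct file_queue {
    event_t queue[QUEUE_SLOTS];
    int head, tail;
} file_queue;

/* Same ring-buffer logic as in the question, taking a pointer
 * instead of a by-value event. */
static void file_queue_add(const event_t *event, file_queue *queue)
{
    queue->queue[queue->head] = *event;
    queue->head = (queue->head + 1) % QUEUE_SLOTS;
    if (queue->head == queue->tail)
        queue->tail = (queue->tail + 1) % QUEUE_SLOTS;
}

/* Returns 0 on success, -1 if the allocation failed. */
static int record_event(file_queue *queue, EventType type, const char *name)
{
    event_t *new_event = malloc(sizeof(*new_event));

    if (!new_event)   /* allocation failed: do NOT touch new_event */
        return -1;

    new_event->type = type;
    strncpy(new_event->filename, name, sizeof(new_event->filename) - 1);
    new_event->filename[sizeof(new_event->filename) - 1] = '\0';
    file_queue_add(new_event, queue);
    free(new_event);  /* the queue holds its own copy */
    return 0;
}
```

The point is the direction of the test: the event is only filled in and queued on the path where the pointer is known to be non-NULL.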
Can you please post the stack dump of the error? I don't see where memory is allocated for the queue structure (current->queue). Also, you cannot be sure that a process's record_flag will always be zero if it is never assigned in the kernel, because the kernel is a long-running program and the memory may contain garbage.
It's also possible to find the exact location in the function where the error occurs by looking at the stack trace.
