I am using epoll_ctl() and epoll_wait() system calls.
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
struct epoll_event {
uint32_t events; /* epoll events (bit mask) */
epoll_data_t data; /* User data */
};
typedef union epoll_data {
enter code here`void *ptr; /* Pointer to user-defined data */
int fd; /* File descriptor */
uint32_t u32; /* 32-bit integer */
uint64_t u64; /* 64-bit integer */
} epoll_data_t;
When using epoll_ctl, I can use the union epoll_data to specify the fd. One way is to specify it in "fd" member. Other way is to specify it in my own structure, "ptr" member will point to the structure.
What is the use of "u32" and "u64" ?
I went through the kernel system call implementation and found the following:
1. epoll_ctl initializes the epoll_event and stores it (in some RB tree format)
2. when fd is ready, epoll_wait returns epoll_event that was filled in epoll_ctl. After this I can identify the fd which becomes ready. I don't understand the purpose of "u32" and "u64".
The predefined union is for convinient usages and you will usually use only one of them:
use ptr if you want to store a pointer
use fd if you want to store socket descriptor
use u32/u64 if you want to store a general/opaque 32/64 bit number
Actually epoll_data is 64bit data associated with a socket event for you to store anything to find event handler as epoll_wait returns just 2 things: the event & associated epoll_data.
The fields of epoll_data do not have any predefined meaning; your application assigns a meaning.
Any of the fields can be used to stored the information needed by your application to identify the event.
(u32 or u64 might be useful for something like an array index.)
Related
I've experienced a smashing stack (= buffer overflow) problem recently when trying to run iperf3. I pinpointed the reason to the getsockname() call (https://github.com/esnet/iperf/blob/master/src/net.c#L463) that makes the kernel copy more data (sizeof(sin_addr)) at the designed address (&sa) than the size of the variable on the stack at that address.
getsockname() redirects the call to getname() (AF_INET family) :
https://github.com/torvalds/linux/blob/master/net/ipv4/af_inet.c#L698
If I believe the manpage (ubuntu) it says:
int getsockname(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
The addrlen argument should be initialized to indicate the amount of space (in bytes) pointed to by addr. On return it contains the actual size of the socket address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
But in the previous code excerpt, getname() does not care about the addrlen input value and uses the parameter as an output value only.
I had found a link (can't find it anymore) saying that BSD respects the previous manpage excerpt contrary to linux.
Am I missing something? I find it awkward that the documentation would be that much off, I've checked other linux XXX_getname calls and all I saw didn't care about the input length.
Short answer
I believe that the addrlen value is not checked in kernel just to not waste some CPU cycles, because it should always be of known type (e.g. struct sockaddr), therefore it should always has known and fixed size (which is 16 bytes). So kernel just rewrites addrlen to 16, no matter what.
Regarding the issue you are having: I'm not sure why it's happening, but it doesn't actually seem that it's about size mismatch. I'm pretty sure kernel and userspace both have the same size of that structure which should be passed to getsockname() syscall (proof is below). So basically the situation you are describing here:
...that makes the kernel copy more data (sizeof(sin_addr)) at the designed address (&sa) than the size of the variable on the stack at that address
is not the case. I could only imagine how many application would fail if it was true.
Detailed explanation
Userspace side
In iperf sources you have next definition of sockaddr struct (/usr/include/bits/socket.h):
/* Structure describing a generic socket address. */
struct sockaddr
{
__SOCKADDR_COMMON (sa_); /* Common data: address family and length. */
char sa_data[14]; /* Address data. */
};
And __SOCKADDR_COMMON macro defined as follows (/usr/include/bits/sockaddr.h):
/* This macro is used to declare the initial common members
of the data types used for socket addresses, `struct sockaddr',
`struct sockaddr_in', `struct sockaddr_un', etc. */
#define __SOCKADDR_COMMON(sa_prefix) \
sa_family_t sa_prefix##family
And sa_family_t defined as:
/* POSIX.1g specifies this type name for the `sa_family' member. */
typedef unsigned short int sa_family_t;
So basically sizeof(struct sockaddr) is always 16 bytes (= sizeof(char[14]) + sizeof(short)).
Kernel side
In inet_getname() function you see that addrlen param is rewritten by next value:
*uaddr_len = sizeof(*sin);
where sin is:
DECLARE_SOCKADDR(struct sockaddr_in *, sin, uaddr);
So you see that sin has type of struct sockaddr_in *. This structure is defined as follows (include/uapi/linux/in.h):
/* Structure describing an Internet (IP) socket address. */
#define __SOCK_SIZE__ 16 /* sizeof(struct sockaddr) */
struct sockaddr_in {
__kernel_sa_family_t sin_family; /* Address family */
__be16 sin_port; /* Port number */
struct in_addr sin_addr; /* Internet address */
/* Pad to size of `struct sockaddr'. */
unsigned char __pad[__SOCK_SIZE__ - sizeof(short int) -
sizeof(unsigned short int) - sizeof(struct in_addr)];
};
So sin variable is also 16 bytes long.
UPDATE
I'll try to reply to your comment:
if getsockname wants to allocate an ipv6 instead that may be why it overflows the buffer
When calling getsockname() for AF_INET6 socket, kernel will figure (in getsockname() syscall, by sockfd_lookup_light() function) that inet6_getname() should be called to handle your request. In that case, uaddr_len will be assigned with next value:
struct sockaddr_in6 *sin = (struct sockaddr_in6 *)uaddr;
...
*uaddr_len = sizeof(*sin);
So if you are using sockaddr_in6 struct in your user-space program too, the size will be the same. Of course, if your userspace application is passing sockaddr structure to getsockname for AF_INET6 socket, there will be some sort of overflow (because sizeof(struct sockaddr_in6) > sizeof(struct sockaddr)). But I believe it's not the case for iperf3 tool you are using. And if it is -- it's iperf that should be fixed in the first place, and not the kernel.
I'm having some problems getting both sides of code using uinput working.
Based on Getting started with uinput: the user level input subsystem[dead link; archived] I put together the following writer (minus error handling):
int main(int ac, char **av)
{
int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);
int ret = ioctl(fd, UI_SET_EVBIT, EV_ABS);
ret = ioctl(fd, UI_SET_ABSBIT, ABS_X);
struct uinput_user_dev uidev = {0};
snprintf(uidev.name, UINPUT_MAX_NAME_SIZE, "uinput-rotary");
uidev.absmin[ABS_X] = 0;
uidev.absmax[ABS_X] = 255;
ret = write(fd, &uidev, sizeof(uidev));
ret = ioctl(fd, UI_DEV_CREATE);
struct input_event ev = {0};
ev.type = EV_ABS;
ev.code = ABS_X;
ev.value = 42;
ret = write(fd, &ev, sizeof(ev));
getchar();
ret = ioctl(fd, UI_DEV_DESTROY);
return EXIT_SUCCESS;
}
That seems to work, at least the full input_event structure seems to be written.
I then wrote the most naive reader of events I could come up with:
int main(int ac, char **av)
{
int fd = open(av[1], O_RDONLY);
char name[256] = "unknown";
ioctl(fd, EVIOCGNAME(sizeof(name)), name);
printf("reading from %s\n", name);
struct input_event ev = {0};
int ret = read(fd, &ev, sizeof(ev));
printf("Read an event! %i\n", ret);
printf("ev.time.tv_sec: %li\n", ev.time.tv_sec);
printf("ev.time.tv_usec: %li\n", ev.time.tv_usec);
printf("ev.type: %hi\n", ev.type);
printf("ev.code: %hi\n", ev.code);
printf("ev.value: %li\n", ev.value);
return EXIT_SUCCESS;
}
Unfortunately the reader side doesn't work at all; only manages to read 8 bytes each time, which isn't nearly a full input_event structure.
What silly mistake am I making?
You should also be writing a sync event after the actual event. In your writer side code:
struct input_event ev = {0};
ev.type = EV_ABS;
ev.code = ABS_X;
ev.value = 42;
usleep(1500);
memset(&ev, 0, sizeof(ev));
ev.type = EV_SYN;
ev.code = 0;
ev.value = 0;
ret = write(fd, &ev, sizeof(ev));
getchar();
TL;DR: The kernel expects an EV_SYN event of code SYN_REPORT because individual events may be grouped together, i.e., when they happen at the same point in time.
You can think of it as the kernel interpreting a group of events rather than individual events as specified in struct input_event. An EV_SYN event delimits these groups of events with SYN_REPORT as code, i.e., this provides a grouping mechanism for signaling events that occur simultaneously.
For example, imagine you touch with your finger on the surface of a touchpad:
Given the definition of struct input_event in linux/input.h:
struct input_event {
/* ... */
__u16 type; /* e.g., EV_ABS, EV_REL */
__u16 code; /* e.g., ABS_X for EV_ABS */
__s32 value; /* e.g., the value of the x coordinate for ABS_X */
};
It isn't possible to create a single EV_ABS event that can hold both the codes ABS_X and ABS_Y, as well as the values for specifying the x and y coordinates, respectively – there is just a single code member in struct input_event1.
Instead, two EV_ABS events2 will be created:
One with the code ABS_X and the x coordinate as value.
Another with the code ABS_Y and the y coordinate as value.
It wouldn't be entirely correct to interpret these two events as two events separated in time, i.e., one that indicates a change of the x coordinate and another that occurs later in time and indicates a change of the y coordinate.
Instead of two in-time-separated events, these events should be grouped together and interpreted by the kernel as a single unit of input data change that indicates a change of both the x and y coordinates at the same time. This is why the EV_SYN mechanism described above exists: by generating one EV_SYN event with the SYN_REPORT code just after these two EV_ABS events (i.e., ABS_X and ABS_Y), we are able to group them as a unit.
1This is directly related to the fact that a struct input_event is of fixed size and can't grow arbitrarily.
2Tecnically, more events may be created. For example, another event of type EV_KEY with the code BTN_TOUCH will likely be created. However, this is irrelevant for the point I want to make.
From what I understand, a typical buffer overflow attack occurs when an attack overflows a buffer of memory on the stack, thus allowing the attacker to inject malicious code and rewrite the return address on the stack to point to that code.
This is a common concern when using functions (such as sscanf) that blindly copy data from one area to another, checking one for a termination byte:
char str[8]; /* holds up to 8 bytes of data */
char *buf = "lots and lots of foobars"; /* way more than 8 bytes of data */
sscanf(buf, "%s", str); /* buffer overflow occurs here! */
I noticed some sysfs_ops store functions in the Linux kernel are implemented with the Linux kernel's version of the sscanf function:
static char str[8]; /* global string */
static ssize_t my_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t size)
{
sscanf(buf, "%s", str); /* buf holds more than 8 bytes! */
return size;
}
Suppose this store callback function is set to a writable sysfs attribute. Would a malicious user be able to intentionally overflow the buffer via a write call?
Normally, I would expect guards against buffer overflow attacks -- such as limiting the number of bytes read -- but I see none in a good number of functions (for example in drivers/scsi/scsi_sysfs.c).
Does the implementation of the Linux kernel version of sscanf protect against buffer overflow attacks; or is there another reason -- perhaps buffer overflow attacks are impossible given how the Linux kernel works under the hood?
The Linux sscanf() is vulnerable to buffer overflows; inspection of the source shows this. You can use width specifiers to limit the amount a %s is allowed to write. At some point your str must have had copy_from_user() run on it as well. It is possible the user space to pass some garbage pointer to the kernel.
In the version of Linux you cited, the scsi_sysfs.c does have a buffer overflow. The latest version does not. The committed fix should fix the issue you see.
Short answer:
sscanf, when well called, will not cause buffer overflow, especially in sysfs xxx_store() function. (There are a lot sscanf in sysfs XXX_store() examples), because Linux kernel add a '\0' (zero-terminated) byte after the string (buf[len] = 0;) for your XXX_store() function.
Long answer:
Normally, sysfs are defined to have a strict formatted data. Since you expect 8 bytes at most, it's reasonable to limit the size you get like this:
static char str[8]; /* global string */
static ssize_t my_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t size)
{
if (size > 8) {
printk("Error: Input size > 8: too large\n");
return -EINVAL;
}
sscanf(buf, "%s", str); /* buf holds more than 8 bytes! */
return size;
}
(Note: use 9 rather than 8, if you expect a 8-bytes string plus '\n')
(Note that you do reject some inputs such as those with many leading white spaces. However, who would send a string with many leading white spaces? Those who want to break your code, right? If they don't follow your spec, just reject them.)
Note that Linux kernel purposely inserts a '\0' at offset len (i.e. buf[len] = 0;) when the user write len bytes to sysfs purposely for safe sscanf, as said in a comment in kernel 2.6: fs/sysfs/file.c:
static int
fill_write_buffer(struct sysfs_buffer * buffer, const char __user * buf, size_t count)
{
int error;
if (!buffer->page)
buffer->page = (char *)get_zeroed_page(GFP_KERNEL);
if (!buffer->page)
return -ENOMEM;
if (count >= PAGE_SIZE)
count = PAGE_SIZE - 1;
error = copy_from_user(buffer->page,buf,count);
buffer->needs_read_fill = 1;
/* if buf is assumed to contain a string, terminate it by \0,
so e.g. sscanf() can scan the string easily */
buffer->page[count] = 0;
return error ? -EFAULT : count;
}
...
static ssize_t
sysfs_write_file(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
struct sysfs_buffer * buffer = file->private_data;
ssize_t len;
mutex_lock(&buffer->mutex);
len = fill_write_buffer(buffer, buf, count);
if (len > 0)
len = flush_write_buffer(file->f_path.dentry, buffer, len);
if (len > 0)
*ppos += len;
mutex_unlock(&buffer->mutex);
return len;
}
Higher kernel version keeps the same logic (though already completely rewritten).
Recently when I look into how the thread-local storage is implemented in glibc, I found the following code, which implements the API pthread_key_create()
int
__pthread_key_create (key, destr)
pthread_key_t *key;
void (*destr) (void *);
{
/* Find a slot in __pthread_kyes which is unused. */
for (size_t cnt = 0; cnt < PTHREAD_KEYS_MAX; ++cnt)
{
uintptr_t seq = __pthread_keys[cnt].seq;
if (KEY_UNUSED (seq) && KEY_USABLE (seq)
/* We found an unused slot. Try to allocate it. */
&& ! atomic_compare_and_exchange_bool_acq (&__pthread_keys[cnt].seq,
seq + 1, seq))
{
/* Remember the destructor. */
__pthread_keys[cnt].destr = destr;
/* Return the key to the caller. */
*key = cnt;
/* The call succeeded. */
return 0;
}
}
return EAGAIN;
}
__pthread_keys is a global array accessed by all threads. I don't understand why the read of its member seq is not synchronized as in the following:
uintptr_t seq = __pthread_keys[cnt].seq;
although it is syncrhonized when modified later.
FYI, __pthread_keys is an array of type struct pthread_key_struct, which is defined as follows:
/* Thread-local data handling. */
struct pthread_key_struct
{
/* Sequence numbers. Even numbers indicated vacant entries. Note
that zero is even. We use uintptr_t to not require padding on
32- and 64-bit machines. On 64-bit machines it helps to avoid
wrapping, too. */
uintptr_t seq;
/* Destructor for the data. */
void (*destr) (void *);
};
Thanks in advance.
In this case, the loop can avoid an expensive lock acquisition. The atomic compare and swap operation done later (atomic_compare_and_exchange_bool_acq) will make sure only one thread can successfully increment the sequence value and return the key to the caller. Other threads reading the same value in the first step will keep looping since the CAS can only succeed for a single thread.
This works because the sequence value alternates between even (empty) and odd (occupied). Incrementing the value to odd prevents other threads from acquiring the slot.
Just reading the value is fewer cycles than the CAS instruction typically, so it makes sense to peek at the value, before doing the CAS.
There are many wait-free and lock-free algorithms that take advantage of the CAS instruction to achieve low-overhead synchronization.
I am trying to pass a struct from user space to kernel space. I had been trying for many hours and it isn't working. Here is what I have done so far..
int device_ioctl(struct inode *inode, struct file *filep, unsigned int cmd, unsigned long arg){
int ret, SIZE;
switch(cmd){
case PASS_STRUCT_ARRAY_SIZE:
SIZE = (int *)arg;
if(ret < 0){
printk("Error in PASS_STRUCT_ARRAY_SIZE\n");
return -1;
}
printk("Struct Array Size : %d\n",SIZE);
break;
case PASS_STRUCT:
struct mesg{
int pIDs[SIZE];
int niceVal;
};
struct mesg data;
ret = copy_from_user(&data, arg, sizeof(*data));
if(ret < 0){
printk("PASS_STRUCT\n");
return -1;
}
printk("Message PASS_STRUCT : %d\n",data.niceVal);
break;
default :
return -ENOTTY;
}
return 0;
}
I have trouble defining the struct. What is the correct way to define it? I want to have int pIDs[SIZE]. Will int *pIDs do it(in user space it is defined like pIDs[SIZE])?
EDIT:
With the above change I get this error? error: expected expression before 'struct' any ideas?
There are two variants of the structure in your question.
struct mesg1{
int *pIDs;
int niceVal;
};
struct mesg2{
int pIDs[SIZE];
int niceVal;
};
They are different; in case of mesg1 you has pointer to int array (which is outside the struct). In other case (mesg2) there is int array inside the struct.
If your SIZE is fixed (in API of your module; the same value used in user- and kernel- space), you can use second variant (mesg2).
To use first variant of structure (mesg1) you may add field size to the structure itself, like:
struct mesg1{
int pIDs_size;
int *pIDs;
int niceVal;
};
and fill it with count of ints, pointed by *pIDs.
PS: And please, never use structures with variable-sized arrays in the middle of the struct (aka VLAIS). This is proprietary, wierd, buggy and non-documented extension to C language by GCC compiler. Only last field of struct can be array with variable size (VLA) according to international C standard. Some examples here: 1 2
PPS:
You can declare you struct with VLA (if there is only single array with variable size):
struct mesg2{
int niceVal;
int pIDs[];
};
but you should be careful when allocating memory for such struct with VLA