why in Linux kernel we have to do the check: if (skb_shinfo(head)->frag_list) , at line 640 in file of ip_fragment.c - linux

In Linux kernel V2.6.23, when the receiver executes ip_frag_reasm(struct ipq *qp, struct net_device *dev) function to reasamble a IP packet from its fragments, there is a check as follows:
614 static struct sk_buff *ip_frag_reasm(struct ipq *qp, struct net_device *dev)
615 {
... ...
637 /* If the first fragment is fragmented itself, we split
638 * it to two chunks: the first with data and paged part
639 * and the second, holding only fragments. */
640 if (skb_shinfo(head)->frag_list) {
... ...
I don't know the reason why kernel shall check this condition at line 640, as we know, each ip fragment received from the network is an independent packet, so skb_shinfo(head)->frag_list is NULL obviously. can any one give me a context in which that skb_shinfo(head)->frag_list is not NULL when the skb is received? thanks a lot.

Related

How to read /proc/<pid>/pagemap in a kernel driver?

I am trying to read /proc//pagemap in a kernel driver like this:
uint64_t page;
uint64_t va = 0x7FFD1BF46530;`
loff_t pos = va / PAGE_SIZE * sizeof(uint64_t);
struct file * filp = filp_open("/proc/19030/pagemap", O_RDONLY, 0);
ssize_t nread = kernel_read(filp, &page, sizeof(page), &pos);
I get error -22 in nread (EINVAL, invalid argument) and
"kernel read not supported for file /19030/pagemap (pid: 19030 comm: tester)" in dmesg.
0x7FFD1BF46530 is a virtual address in a user space process pid 19030 (tester). I assume that pos is the offset into the file like in lseek64.
Doing the precise same thing as sudo with same values in a user space process, i.e. reading /proc/19030/pagemap works fine and produces a correct physical address.
The actual thing I am trying to do here is to find the physical address of a user space virtual address. I need the physical address for a device DMA transfer operation and a user space app needs to access this memory. This app allocates 1GB DMA memory with anonymous mmap from THP (Transparent Huge Pages). And I am trying to avoid the need for sudo by reading /proc//pagemap in a kernel driver via ioctl instead.
I would be happy to allocate huge page DMA memory in the driver but don't know how to do that. dma_alloc_coherent is limited to max 4MB allocations. Is there a way to get those allocated as continuous physical memory? I need hundreds of MB or many GB of DMA memory.
Problem with anonymous mmap is that it can only allocate max 1GB huge page as physically continuous memory. Allocating more works but the memory is not physically continuous and unusable for DMA.
Any good ideas or alternative ways of allocating huge pages as DMA memory?
Tried reading file /proc//pagemap in a kernel driver. Expected same results as when reading the file in a user space application which works ok.
"kernel read not supported for file …"
Indeed, as we see in __kernel_read()
if (unlikely(!file->f_op->read_iter || file->f_op->read))
return warn_unsupported(file, "read");
it fails if f_op->read_iter isn't or f_op->read is wired up (implemented), which is both the case for a pagemap file.
You could try pagemap_read() instead. – not feasible for reasons in the comments
When I had the problem of getting the physical address for a virtual address in a driver, I included and copied some kernel code (not that I recommend this, but I saw no other solution); here's an extract.
static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr
, unsigned long sz)
{ return NULL; }
void p4d_clear_bad(p4d_t *p4d) { p4d_ERROR(*p4d); p4d_clear(p4d); }
#include "mm/pagewalk.c"
static int pte(pte_t *pte, unsigned long addr
, unsigned long next, struct mm_walk *walk)
{
*(pte_t **)walk->private = pte;
return 1;
}
/* Scan the real Linux page tables and return a PTE pointer for
* a virtual address in a context.
* Returns true (1) if PTE was found, zero otherwise. The pointer to
* the PTE pointer is unmodified if PTE is not found.
*/
int
get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp)
{
struct mm_walk walk = { .pte_entry = pte, .mm = mm, .private = ptep };
return walk_page_range(addr, addr+PAGE_SIZE, &walk);
}
/* Find physical address for this virtual address. Normally used by
* I/O functions, but anyone can call it.
*/
static inline unsigned long iopa(unsigned long addr)
{
unsigned long pa;
/* I don't know why this won't work on PMacs or CHRP. It
* appears there is some bug, or there is some implicit
* mapping done not properly represented by BATs or in page
* tables.......I am actively working on resolving this, but
* can't hold up other stuff. -- Dan
*/
pte_t *pte;
struct mm_struct *mm;
#if 0
/* Check the BATs */
phys_addr_t v_mapped_by_bats(unsigned long va);
pa = v_mapped_by_bats(addr);
if (pa)
return pa;
#endif
/* Allow mapping of user addresses (within the thread)
* for DMA if necessary.
*/
if (addr < TASK_SIZE)
mm = current->mm;
else
mm = &init_mm;
ATTENTION: I needed the current address space.
You'd have to use mm = file->private_data instead.
pa = 0;
if (get_pteptr(mm, addr, &pte, NULL))
pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK);
return(pa);
}

Units of tx_queue & rx_queue in /proc/net/tcp

On a Linux 2.6.32, i'm looking at /proc/net/tcp and wondering what is the unit of tx_queue and rx_queue.
I can't find this information about receive-queue and transmit-queue in https://www.kernel.org/doc/Documentation/networking/proc_net_tcp.txt
Nor in man 5 proc which shows only:
The "tx_queue" and "rx_queue" are the outgoing and incoming data queue
in terms of kernel memory usage.
Is it bytes? or number of buffers? or maybe i missed a great documentation about this?
Thanks
Short answer - These count bytes. By running netperf TCP_RR with different sizes you could see the exact value of 1 count (only 1 packet in the air in a given time). This value would always show the packet size.
More info:
According to this post:
tx_queue:rx_queue
The size of the transmit and receive queues.
This is per socket. For TCP, the values are updated in get_tcp4_sock() function. It is a bit different in 2.6.32 and 4.14, but the idea is the same. According to the socket state the rx_queue value is updated to either sk->sk_ack_backlog or tp->rcv_nxt - tp->copied_seq. The second value might be negative and is fixed in later kernels to 0 if it is. sk_ack_backlog counts the unacked segments, this is a bit strange since this doesn't seem to be in bytes. Probably missing something here.
From tcp.h:
struct tcp_sock {
...
u32 rcv_nxt; /* What we want to receive next */
u32 copied_seq; /* Head of yet unread data */
Both count in bytes, so tp->rcv_nxt - tp->copied_seq is counting the pending bytes in the receive buffer for incoming packets.
tx_queue is set to tp->write_seq - tp->snd_una. Again from tcp.h:
struct tcp_sock {
...
u32 snd_una; /* First byte we want an ack for */
u32 write_seq; /* Tail(+1) of data held in tcp send buffer */
Here is a bit more clear to see the count is in bytes.
For UDP, it is simpler. The values are updated in udp4_format_sock():
static void udp4_format_sock(struct sock *sp, struct seq_file *f,
int bucket)
{
...
seq_printf(f, "%5d: %08X:%04X %08X:%04X"
" %02X %08X:%08X %02X:%08lX %08X %5u %8d %lu %d %pK %d",
bucket, src, srcp, dest, destp, sp->sk_state,
sk_wmem_alloc_get(sp),
sk_rmem_alloc_get(sp),
sk_wmem_alloca_get and sk_rmem_alloc_get return sk_wmem_alloc and sk_rmem_alloc respectively, both are in bytes.
Hope this helps.

Mismatch between manpage and kernel behavior about getsockname

I've experienced a smashing stack (= buffer overflow) problem recently when trying to run iperf3. I pinpointed the reason to the getsockname() call (https://github.com/esnet/iperf/blob/master/src/net.c#L463) that makes the kernel copy more data (sizeof(sin_addr)) at the designed address (&sa) than the size of the variable on the stack at that address.
getsockname() redirects the call to getname() (AF_INET family) :
https://github.com/torvalds/linux/blob/master/net/ipv4/af_inet.c#L698
If I believe the manpage (ubuntu) it says:
int getsockname(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
The addrlen argument should be initialized to indicate the amount of space (in bytes) pointed to by addr. On return it contains the actual size of the socket address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
But in the previous code excerpt, getname() does not care about the addrlen input value and uses the parameter as an output value only.
I had found a link (can't find it anymore) saying that BSD respects the previous manpage excerpt contrary to linux.
Am I missing something? I find it awkward that the documentation would be that much off, I've checked other linux XXX_getname calls and all I saw didn't care about the input length.
Short answer
I believe that the addrlen value is not checked in kernel just to not waste some CPU cycles, because it should always be of known type (e.g. struct sockaddr), therefore it should always has known and fixed size (which is 16 bytes). So kernel just rewrites addrlen to 16, no matter what.
Regarding the issue you are having: I'm not sure why it's happening, but it doesn't actually seem that it's about size mismatch. I'm pretty sure kernel and userspace both have the same size of that structure which should be passed to getsockname() syscall (proof is below). So basically the situation you are describing here:
...that makes the kernel copy more data (sizeof(sin_addr)) at the designed address (&sa) than the size of the variable on the stack at that address
is not the case. I could only imagine how many application would fail if it was true.
Detailed explanation
Userspace side
In iperf sources you have next definition of sockaddr struct (/usr/include/bits/socket.h):
/* Structure describing a generic socket address. */
struct sockaddr
{
__SOCKADDR_COMMON (sa_); /* Common data: address family and length. */
char sa_data[14]; /* Address data. */
};
And __SOCKADDR_COMMON macro defined as follows (/usr/include/bits/sockaddr.h):
/* This macro is used to declare the initial common members
of the data types used for socket addresses, `struct sockaddr',
`struct sockaddr_in', `struct sockaddr_un', etc. */
#define __SOCKADDR_COMMON(sa_prefix) \
sa_family_t sa_prefix##family
And sa_family_t defined as:
/* POSIX.1g specifies this type name for the `sa_family' member. */
typedef unsigned short int sa_family_t;
So basically sizeof(struct sockaddr) is always 16 bytes (= sizeof(char[14]) + sizeof(short)).
Kernel side
In inet_getname() function you see that addrlen param is rewritten by next value:
*uaddr_len = sizeof(*sin);
where sin is:
DECLARE_SOCKADDR(struct sockaddr_in *, sin, uaddr);
So you see that sin has type of struct sockaddr_in *. This structure is defined as follows (include/uapi/linux/in.h):
/* Structure describing an Internet (IP) socket address. */
#define __SOCK_SIZE__ 16 /* sizeof(struct sockaddr) */
struct sockaddr_in {
__kernel_sa_family_t sin_family; /* Address family */
__be16 sin_port; /* Port number */
struct in_addr sin_addr; /* Internet address */
/* Pad to size of `struct sockaddr'. */
unsigned char __pad[__SOCK_SIZE__ - sizeof(short int) -
sizeof(unsigned short int) - sizeof(struct in_addr)];
};
So sin variable is also 16 bytes long.
UPDATE
I'll try to reply to your comment:
if getsockname wants to allocate an ipv6 instead that may be why it overflows the buffer
When calling getsockname() for AF_INET6 socket, kernel will figure (in getsockname() syscall, by sockfd_lookup_light() function) that inet6_getname() should be called to handle your request. In that case, uaddr_len will be assigned with next value:
struct sockaddr_in6 *sin = (struct sockaddr_in6 *)uaddr;
...
*uaddr_len = sizeof(*sin);
So if you are using sockaddr_in6 struct in your user-space program too, the size will be the same. Of course, if your userspace application is passing sockaddr structure to getsockname for AF_INET6 socket, there will be some sort of overflow (because sizeof(struct sockaddr_in6) > sizeof(struct sockaddr)). But I believe it's not the case for iperf3 tool you are using. And if it is -- it's iperf that should be fixed in the first place, and not the kernel.

Getting pid of pages that are being swapped out

My goal is to find out process-id of pages which are being swapped out. The Linux Kernel function swap_writepage() takes a pointer to struct page as a part of formal argument while swapping a page on backing store. All swap-out operations are done by "kswapd" process. I need to find out pid(s) of the processes whose page is passed as argument in the swap_writepage() function. In order to get that, I was able to find all page table entries associated with that page using rmap structures.
How can I get pid from a pte or from struct page? I have used sytemtap to get the value of struct page pointer, received in swap_writepage() function as argument. Also, the pid() function prints the pid of current process running not the pid of process to which that page belongs which always gives kswapd process.
Here is the example of how reverse mapping used in modern Linux (copied from lxr):
1435 static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
1436 {
1437 struct anon_vma *anon_vma;
1438 struct anon_vma_chain *avc;
1439 int ret = SWAP_AGAIN;
1440
1441 anon_vma = page_lock_anon_vma(page);
1442 if (!anon_vma)
1443 return ret;
1444
1445 list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
1446 struct vm_area_struct *vma = avc->vma;
1447 unsigned long address;
1448
1449 /*
1450 * During exec, a temporary VMA is setup and later moved.
1451 * The VMA is moved under the anon_vma lock but not the
1452 * page tables leading to a race where migration cannot
1453 * find the migration ptes. Rather than increasing the
1454 * locking requirements of exec(), migration skips
1455 * temporary VMAs until after exec() completes.
1456 */
1457 if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
1458 is_vma_temporary_stack(vma))
1459 continue;
1460
1461 address = vma_address(page, vma);
1462 if (address == -EFAULT)
1463 continue;
1464 ret = try_to_unmap_one(page, vma, address, flags);
1465 if (ret != SWAP_AGAIN || !page_mapped(page))
1466 break;
1467 }
1468
1469 page_unlock_anon_vma(anon_vma);
1470 return ret;
1471 }
This example shows for rmap used for unmapping pages. So each anonymous page in ->mapping field holds anon_vma object. anon_vma holds a list of vma areas page is mapped to. Having vma you have mm, having mm you have a task_struct. that's it. If you have any doubts - here is the illustraction
Daniel P. Bovet, Marco Cesati Understanding Linux Kernel chapter 17.2

how to detect a buffer over run on serial port in linux using c++

I have a big problem. At present I am accessing a serial port via the following hooks:
fd = open( "/dev/ttyS1", O_RDWR | O_NOCTTY )
then I read from it using the following chunk of code
i = select( fd + 1, &rfds, NULL, NULL, &tv )
...
iLen = read( fd, buf, MAX_PACKET_LEN )
the problem is that before I read, I need to detect if there were any buffer overruns. Both at the serial port level and the internal tty flip buffers.
We tried cat /proc/tty/driver/serial but it doesn't seem to list the overruns (see output below)
1: uart:16550A port:000002F8 irq:3 tx:70774 rx:862484 fe:44443 pe:270023 brk:30301 RTS|CTS|DTR
According to the kernel sources, you should use the TIOCGICOUNT ioctl. The third ioctl argument should be a pointer to the following struct, defined in <linux/serial.h> :
/*
* Serial input interrupt line counters -- external structure
* Four lines can interrupt: CTS, DSR, RI, DCD
*/
struct serial_icounter_struct {
int cts, dsr, rng, dcd;
int rx, tx;
int frame, overrun, parity, brk;
int buf_overrun;
int reserved[9];
};
I don't know if every driver detect all conditions however.
Dark Templer,
Your serial driver should update the icount.frame/overrun errors.
struct uart_port in serial_core.h has member struct uart_icount, which should be updated in platform serial device driver as per interrupts received.

Resources