Where do we define the type of structure returned by the kernel when using the "perf_event_open" system call using mmap? - linux

I'm trying to use the syscall perf_event_open to get some performance data from the system.
I am currently working on periodic data retrieval using shared memory with a ring buffer.
But I can't find what structure is returned in each section of the ring buffer. The manual page enumerate all possibilities, but that's all.
I can't figure out which member of the perf_event_attr structure to fill in to control what type of structure will be returned to the ring buffer.
If you have some informations about that, I'll be happy to read it !

The https://github.com/torvalds/linux/blob/master/tools/perf/design.txt documentation has description of mmaped ring, and perf script / perf script -D can decode ring data when it is saved as perf.data file. Some parts of the doc are outdated, but it is still useful for perf_event_open syscall description.
First mmap page is metadata page, rest 2^n pages are filled with events where every event has header struct perf_event_header of 8 bytes.
Like stated, asynchronous events, like counter overflow or PROT_EXEC
mmap tracking are logged into a ring-buffer. This ring-buffer is
created and accessed through mmap().
The mmap size should be 1+2^n pages, where the first page is a
meta-data page (struct perf_event_mmap_page) that contains various
bits of information such as where the ring-buffer head is.
/*
* Structure of the page that can be mapped via mmap
*/
struct perf_event_mmap_page {
__u32 version; /* version number of this structure */
__u32 compat_version; /* lowest version this is compat with */
...
}
The following 2^n pages are the ring-buffer which contains events of the form:
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (1 << 1)
#define PERF_RECORD_MISC_OVERFLOW (1 << 2)
struct perf_event_header {
__u32 type;
__u16 misc;
__u16 size;
};
enum perf_event_type
The design.txt doc has incorrect values for enum perf_event_type, check actual perf_events kernel subsystem source codes - https://github.com/torvalds/linux/blob/master/include/uapi/linux/perf_event.h#L707. That uapi/linux/perf_event.h file also has some struct hints in comments, like
* #
* # The RAW record below is opaque data wrt the ABI
* #
* # That is, the ABI doesn't make any promises wrt to
* # the stability of its content, it may vary depending
* # on event, hardware, kernel version and phase of
* # the moon.
* #
* # In other words, PERF_SAMPLE_RAW contents are not an ABI.
* #
*
* { u32 size;
* char data[size];}&& PERF_SAMPLE_RAW
*...
* { u64 size;
* char data[size];
* u64 dyn_size; } && PERF_SAMPLE_STACK_USER
*...
PERF_RECORD_SAMPLE = 9,

Related

What is the `flags` field for in canfd_frame in SocketCAN?

A non-FD ("legacy") CAN frame has the following format in SocketCAN:
struct can_frame {
canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
__u8 can_dlc; /* frame payload length in byte (0 .. 8) */
__u8 __pad; /* padding */
__u8 __res0; /* reserved / padding */
__u8 __res1; /* reserved / padding */
__u8 data[8] __attribute__((aligned(8)));
};
The frame ID, length, and data all have clear places, plus some padding we don't worry about. For a CAN-FD frame, though, there is an extra field:
struct canfd_frame {
canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
__u8 len; /* frame payload length in byte (0 .. 64) */
__u8 flags; /* additional flags for CAN FD */
__u8 __res0; /* reserved / padding */
__u8 __res1; /* reserved / padding */
__u8 data[64] __attribute__((aligned(8)));
};
The flags field looks useful, but I have found no documentation as to what it actually holds. Is it internal (i.e. set by the kernel)? What are the possible flags, and what do they mean?
Thank you!
I found a bit of information here:
http://elixir.free-electrons.com/linux/latest/source/include/uapi/linux/can.h#L112
/*
* defined bits for canfd_frame.flags
*
* The use of struct canfd_frame implies the Extended Data Length (EDL) bit to
* be set in the CAN frame bitstream on the wire. The EDL bit switch turns
* the CAN controllers bitstream processor into the CAN FD mode which creates
* two new options within the CAN FD frame specification:
*
* Bit Rate Switch - to indicate a second bitrate is/was used for the payload
* Error State Indicator - represents the error state of the transmitting node
*
* As the CANFD_ESI bit is internally generated by the transmitting CAN
* controller only the CANFD_BRS bit is relevant for real CAN controllers when
* building a CAN FD frame for transmission. Setting the CANFD_ESI bit can make
* sense for virtual CAN interfaces to test applications with echoed frames.
*/
If I understand this right, the ESI bit is only useful for testing on virtual CAN interaces. The BRS bit is pretty low-level, though, and this does not specify whether the hardware will automatically set it or unset it.

How does seccomp-bpf filter syscalls?

I'm investigating the implementation detail of seccomp-bpf, the syscall filtration mechanism that was introduced into Linux since version 3.5.
I looked into the source code of kernel/seccomp.c from Linux 3.10 and want to ask some questions about it.
From seccomp.c, it seems that seccomp_run_filters() is called from __secure_computing() to test the syscall called by the current process.
But looking into seccomp_run_filters(), the syscall number that is passed as an argument is not used anywhere.
It seems that sk_run_filter() is the implementation of BPF filter machine, but sk_run_filter() is called from seccomp_run_filters() with the first argument (the buffer to run the filter on) NULL.
My question is: how can seccomp_run_filters() filter syscalls without using the argument?
The following is the source code of seccomp_run_filters():
/**
* seccomp_run_filters - evaluates all seccomp filters against #syscall
* #syscall: number of the current system call
*
* Returns valid seccomp BPF response codes.
*/
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
u32 ret = SECCOMP_RET_ALLOW;
/* Ensure unexpected behavior doesn't result in failing open. */
if (WARN_ON(current->seccomp.filter == NULL))
return SECCOMP_RET_KILL;
/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
for (f = current->seccomp.filter; f; f = f->prev) {
u32 cur_ret = sk_run_filter(NULL, f->insns);
if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
ret = cur_ret;
}
return ret;
}
When a user process enters the kernel, the register set is stored to a kernel variable.
The function sk_run_filter implements the interpreter for the filter language. The relevant instruction for seccomp filters is BPF_S_ANC_SECCOMP_LD_W. Each instruction has a constant k, and in this case it specifies the index of the word to be read.
#ifdef CONFIG_SECCOMP_FILTER
case BPF_S_ANC_SECCOMP_LD_W:
A = seccomp_bpf_load(fentry->k);
continue;
#endif
The function seccomp_bpf_load uses the current register set of the user thread to determine the system call information.

Difference between structures ttysize and winsize?

I have following code to get width and height of screen in Linux.
#ifdef TIOCGSIZE
struct ttysize ts;
ioctl(STDIN_FILENO, TIOCGSIZE, &ts);
cols = ts.ts_cols;
lines = ts.ts_lines;
#elif defined(TIOCGWINSZ)
struct winsize ts;
ioctl(STDIN_FILENO, TIOCGWINSZ, &ts);
cols = ts.ws_col;
lines = ts.ws_row;
#endif /* TIOCGSIZE */
What is difference between ttysize and winsize?
The ttysize was the original implementation for SunOS 3.0 (February 1986), and soon after made obsolete by winsize, which adds the size of the window in pixels. Here's
what the definitions look like in <sys/ttycom.h> from SunOS 4:
/*
* Window/terminal size structure.
* This information is stored by the kernel
* in order to provide a consistent interface,
* but is not used by the kernel.
*
* Type must be "unsigned short" so that types.h not required.
*/
struct winsize {
unsigned short ws_row; /* rows, in characters */
unsigned short ws_col; /* columns, in characters */
unsigned short ws_xpixel; /* horizontal size, pixels - not used */
unsigned short ws_ypixel; /* vertical size, pixels - not used */
};
#define TIOCGWINSZ _IOR(t, 104, struct winsize) /* get window size */
#define TIOCSWINSZ _IOW(t, 103, struct winsize) /* set window size */
/*
* Sun version of same.
*/
struct ttysize {
int ts_lines; /* number of lines on terminal */
int ts_cols; /* number of columns on terminal */
};
#define TIOCSSIZE _IOW(t,37,struct ttysize)/* set tty size */
#define TIOCGSIZE _IOR(t,38,struct ttysize)/* get tty size */
The data types are different (an integer would waste memory), and the fields have different names.
The ttysize structure has long been obsolete: if either is provided by the system, winsize is supported. That wasn't true when porting ncurses to SCO OpenServer in 1997, as noted in this chunk from lib_setup.c:
/*
* SCO defines TIOCGSIZE and the corresponding struct. Other systems (SunOS,
* Solaris, IRIX) define TIOCGWINSZ and struct winsize.
*/
#ifdef TIOCGSIZE
# define IOCTL_WINSIZE TIOCGSIZE
# define STRUCT_WINSIZE struct ttysize
# define WINSIZE_ROWS(n) (int)n.ts_lines
# define WINSIZE_COLS(n) (int)n.ts_cols
#else
# ifdef TIOCGWINSZ
# define IOCTL_WINSIZE TIOCGWINSZ
# define STRUCT_WINSIZE struct winsize
# define WINSIZE_ROWS(n) (int)n.ws_row
# define WINSIZE_COLS(n) (int)n.ws_col
# endif
#endif
You might notice that Linux is not mentioned in the comment. According to comments in asm-sparc64/ioctls.h, the ioctl for ttysize was unsupported as of 2.6.16:
/* Note that all the ioctls that are not available in Linux have a
* double underscore on the front to: a) avoid some programs to
* think we support some ioctls under Linux (autoconfiguration stuff)
*/
...
#define TIOCCONS _IO('t', 36)
#define __TIOCSSIZE _IOW('t', 37, struct sunos_ttysize) /* SunOS Specific */
#define __TIOCGSIZE _IOR('t', 38, struct sunos_ttysize) /* SunOS Specific */
#define TIOCGSOFTCAR _IOR('t', 100, int)
#define TIOCSSOFTCAR _IOW('t', 101, int)
#define __TIOCUCNTL _IOW('t', 102, int) /* SunOS Specific */
#define TIOCSWINSZ _IOW('t', 103, struct winsize)
#define TIOCGWINSZ _IOR('t', 104, struct winsize)
A much earlier comment in 1995 added the definitions (without the double-underscore). Possibly a few programs used that with Linux, although winsize was well established on most platforms before Linux was begun. A little more digging finds that the double-underscore was introduced in 1996 (patch-2.1.9 linux/include/asm-sparc/ioctls.h). Given that, very few programs would have used it with Linux.
Further reading:
Garbage-collect struct ttysize (OpenBSD mailing list)
/dev/console MKS manual page
They have the same function (at least in this case - maybe one structure has an additional field not used here).
Historically some (mainly older) Linux versions have different definitions of the IOCTL codes. Therefore some Linux versions have only TIOCGSIZE defined (using a ttysize structure) and some Linux versions have only TIOCGWINSZ defined.
Using the "#ifdef" construction the program can be compiled for both Linux versions.
Newer Linux versions should have both IOCTLs defined.

Geting numbers of files in specific directory in Linux

Is there anyway to get the total number of files in a specific directory not iterating readdir(3)?
I mean only direct members of a specific directory.
It seems that the only way to get the number of files is calling readdir(3) repeatedly until it returns zero.
Is there any other way to get the number in O(1)? I need a solution that works in Linux.
Thanks.
scandir() example, requires dirent.h:
struct dirent **namelist;
int n=scandir(".", &namelist, 0, alphasort); // "." == current directory.
if (n < 0)
{
perror("scandir");
exit(1);
}
printf("files found = %d\n", n);
free(namelist);
I don't think it is possible in O(1).
Just think about inode structure. There's no clue for this.
But if it's OK to get number of files in a filesystem,
you can use statvfs(2).
#include <sys/vfs.h> /* or <sys/statfs.h> */
int statfs(const char *path, struct statfs *buf);
struct statfs {
__SWORD_TYPE f_type; /* type of file system (see below) */
__SWORD_TYPE f_bsize; /* optimal transfer block size */
fsblkcnt_t f_blocks; /* total data blocks in file system */
fsblkcnt_t f_bfree; /* free blocks in fs */
fsblkcnt_t f_bavail; /* free blocks available to
unprivileged user */
fsfilcnt_t f_files; /* total file nodes in file system */
fsfilcnt_t f_ffree; /* free file nodes in fs */
fsid_t f_fsid; /* file system id */
__SWORD_TYPE f_namelen; /* maximum length of filenames */
__SWORD_TYPE f_frsize; /* fragment size (since Linux 2.6) */
__SWORD_TYPE f_spare[5];
};
You can easily get number of files via f_files - f_ffree.
BTW, This is very interesting question. so I voted it up.
In shell script it's damn simple
ll directory_path | wc -l

Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that when storing pointers within the shared region, you need to store them as offsets of the start of the shared region. Unfortunately, this complicates use of atomic instructions (e.g. if you're trying to write a lock free algorithm). For example, say you have a bunch of reference counted nodes in shared memory, created by a single writer. The writer periodically atomically updates a pointer 'p' to point to a valid node with positive reference count. Readers want to atomically write to 'p' because it points to the beginning of a node (a struct) whose first element is a reference count. Since p always points to a valid node, incrementing the ref count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the 'p' pointer are stored in shared memory, then clients suffer a race condition:
x = read p
y = x + offset
Increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
Edit: Using inline assembly and the 'lock' prefix on x86 maybe it's possible to build a "increment ptr X with offset Y by value Z"? Equivalent options on other architectures? Haven't written a lot of assembly, don't know if the needed instructions exist.
On low level the x86 atomic inctruction can do all this tree steps at once:
x = read p
y = x + offset Increment
refcount at y
//
mov edi, Destination
mov edx, DataOffset
mov ecx, NewData
#Repeat:
mov eax, [edi + edx] //load OldData
//Here you can also increment eax and save to [edi + edx]
lock cmpxchg dword ptr [edi + edx], ecx
jnz #Repeat
//
This is trivial on a UNIX system; just use the shared memory functions:
shgmet, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory
segment identified by shmid to the
address space of the calling process.
The attaching address is specified by
shmaddr with one of the following
criteria:
If shmaddr is NULL, the system chooses
a suitable (unused) address at which
to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region that it likes to put things; also, check addresses on the stack and returned from malloc() / new[] .)
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
* Make sure gcc doesn't try to be clever and move things around
* on us. We need to use _exactly_ the address the user gave us,
* not some alias that contains the same information.
*/
typedef struct {
int counter;
} atomic_t;
/**
* atomic_read - read atomic variable
* #v: pointer of type atomic_t
*
* Atomically reads the value of #v.
*/
#define atomic_read(v) ((v)->counter)
/**
* atomic_set - set atomic variable
* #v: pointer of type atomic_t
* #i: required value
*
* Atomically sets the value of #v to #i.
*/
#define atomic_set(v, i) (((v)->counter) = (i))
/**
* atomic_add - add integer to atomic variable
* #i: integer value to add
* #v: pointer of type atomic_t
*
* Atomically adds #i to #v.
*/
static inline void atomic_add(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "addl %1,%0"
: "=m" (v->counter)
: "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
long counter;
} atomic64_t;
/**
* atomic64_add - add integer to atomic64 variable
* #i: integer value to add
* #v: pointer to type atomic64_t
*
* Atomically adds #i to #v.
*/
static inline void atomic64_add(long i, atomic64_t *v)
{
asm volatile(LOCK_PREFIX "addq %1,%0"
: "=m" (v->counter)
: "er" (i), "m" (v->counter));
}
We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.
You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, using 0x20000000000
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel
takes it as a hint about where to
place the mapping; on Linux, the
mapping will be created at the next
higher page boundary. The address of
the new mapping is returned as the
result of the call.
The flags argument determines whether
updates to the mapping are visible to
other processes mapping the same
region, and whether updates are
carried through to the underlying
file. This behavior is determined by
including exactly one of the following
values in flags:
MAP_SHARED Share this mapping.
Updates to the mapping are visible to
other processes that map this
file, and are carried through to the
underlying file. The file may not
actually be updated until msync(2) or
munmap() is called.
ERRORS
EINVAL We don’t like addr, length, or
offset (e.g., they are too large, or
not aligned on a page boundary).
Adding the offset to the pointer does not create the potential for a race, it already exists. Since at least neither ARM nor x86 can atomically read a pointer then access the memory it refers to you need to protect the pointer access with a lock regardless of whether you add an offset.

Resources