Record instruction addresses of page faults with perf - linux

I would like to get the addresses of the instructions that lead to major page faults using perf.
I have a simple program:
#include <time.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/mman.h>
int main(int argc, char* argv[]) {
int fd = open("path to large file several Gb", O_RDONLY);
struct stat st;
fstat(fd, &st);
void* ptr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
const uint8_t* data = (const uint8_t*) ptr;
srand(time(NULL));
size_t i1 = ((double) rand() / RAND_MAX) * st.st_size;
size_t i2 = ((double) rand() / RAND_MAX) * st.st_size;
size_t i3 = ((double) rand() / RAND_MAX) * st.st_size;
printf("%x[%lu], %x[%lu], %x[%lu]\n", data[i1], i1, data[i2], i2, data[i3], i3);
munmap(ptr, st.st_size);
close(fd);
return 0;
}
I compile it using gcc -g -O0 main.c
and run perf record -e major-faults -g -d ./a.out
Next I open the resulting report using perf report -g
The report says that there are 3 major page faults (it's correct),
but I can't make sense of the addresses of the instructions that lead to the page faults.
The report is below:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3 of event 'major-faults'
# Event count (approx.): 3
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................ ......................
#
100.00% 0.00% a.out libc-2.23.so [.] __libc_start_main
|
---__libc_start_main
main
100.00% 100.00% a.out a.out [.] main
|
---0x33e258d4c544155
__libc_start_main
main
100.00% 0.00% a.out [unknown] [.] 0x033e258d4c544155
|
---0x33e258d4c544155
__libc_start_main
main
a.out doesn't contain the address 0x33e258d4c544155 or anything that ends with 155.
The question is: how do I get the addresses of the instructions that lead to the page faults?

For some reason I cannot reproduce your example, i.e. I'm not getting any samples with the major-faults event. But I can explain with a different example.
The perf report output is misleading: it doesn't show three events, it shows three stack levels. It's much easier to understand with perf script, which shows the actual events (including their stacks). The entries look like this (repeated for each sample):
a.out 22107 14721.378764: 10000000 cycles:u:
5653c1afb134 main+0x1b (/tmp/a.out)
7f58bb1eeee3 __libc_start_main+0xf3 (/usr/lib/libc-2.29.so)
49564100002cdb3d [unknown] ([unknown])
Now you see the function stack with the virtual instruction address, nearest symbol and offset from the symbol. If you want to fiddle with the addresses yourself, you can run perf script --show-mmap-events, which tells you:
a.out 22107 14721.372233: PERF_RECORD_MMAP2 22107/22107: [0x5653c1afb000(0x1000) @ 0x1000 00:2b 463469 624179165]: r-xp /tmp/a.out
                                                          ^ Base         ^ size    ^ offset                             ^ file
Then you can do the math for 0x5653c1afb134 by subtracting the base 0x5653c1afb000 and adding the offset 0x1000 - you get the address of the instruction or return address within the file.
You also see that 0x49564100002cdb3d is not mapped and could not be resolved - it's just garbage from the frame-pointer based stack unwinding. You can ignore it. You can also use --call-graph dwarf or --call-graph lbr, which seem to show more sensible stack origins.

Related

How to tell if O_DIRECT is in use?

I'm running an IO intensive process that supports O_DIRECT. Is there a way to tell if O_DIRECT is being used while the process is running?
I tried "iostat -x 1" but I'm not sure which field would help me.
Thanks.
You will have to get the pid of the running process. Once you get the pid, you can do
cat /proc/[pid]/fdinfo/<fd number>
You will also have to know the fd number of the opened file.
The output includes a flags field. It is an octal value showing the flags with which the file descriptor fd was opened. You will have to examine it to know whether O_DIRECT is set or not.
As an example, on my Ubuntu machine (x86_64), I created 2 files - foo1 & foo2
touch foo1 foo2
and then opened foo1 with O_DIRECT and foo2 without O_DIRECT. Below is the program
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main()
{
printf("%u\n", getpid());
int fd1 = open("foo1", O_RDWR|O_DIRECT); //O_DIRECT set
printf("foo1: %d\n", fd1);
int fd2 = open("foo2", O_RDWR); //Normal
printf("foo2: %d\n", fd2);
sleep(60);
close(fd1);
close(fd2);
return 0;
}
On running this I got the output:
8885
foo1: 3 //O_DIRECT
foo2: 4
8885 is the pid. So I did
cat /proc/8885/fdinfo/3 //O_DIRECT
pos: 0
flags: 0140002
mnt_id: 29
-------------------------------
cat /proc/8885/fdinfo/4
pos: 0
flags: 0100002
mnt_id: 29
From the above output you can see that for the descriptor opened with O_DIRECT, the bit 0040000 is also set in the flags field. That is the value of O_DIRECT on x86_64.

How do I read data from bar 0, from userspace, on a pci-e card in linux?

On Windows there is a program called pcitree that allows you to set and read memory without writing a device driver. Is there a Linux alternative to pcitree that will allow me to read memory in block 0 of my PCIe card?
A simple use case would be that I use driver code to write a 32bit integer on the first memory address in block zero of my pci-e card. I then use pcitree alternative to read the value at the first memory address of block zero and see my integer.
Thank you
I found some code online that does what I want: github.com/billfarrow/pcimem.
As I understand it, this code maps the device memory into user space via the mmap system call.
The following was mostly taken from the program's readme and the mmap man page.
mmap takes
a start address
a size
memory protection flags
mapping flags (MAP_SHARED in the example below)
a file descriptor that is linked to bar0 of your PCI card
and an offset
mmap returns a userspace pointer to the memory defined by the start address and size parameters.
This code shows an example of mmap's usage.
//The device path can be found by typing "lspci -v"
// and looking for your device.
fd = open("/sys/devices/pci0001:00/0001:00:07.0/resource0", O_RDWR | O_SYNC);
//mmap returns a userspace address
//0, 4096 tells it to pull one page
ptr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
printf("PCI BAR0 0x0000 = 0x%4x\n", *((unsigned short *) ptr));
I used the approach described above to read the PCI BAR0 register, but I get a segmentation fault. Debugging my code (shown below) with gdb shows that the return value of mmap() is (void *) 0xffffffffffffffff, i.e. MAP_FAILED.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <fcntl.h>
#include <ctype.h>
#include <termios.h>
#include <sys/types.h>
#include <sys/mman.h>
#define PRINT_ERROR \
do { \
fprintf(stderr, "Error at line %d, file %s (%d) [%s]\n", \
__LINE__, __FILE__, errno, strerror(errno)); exit(1); \
} while(0)
#define MAP_SIZE 4096UL
#define MAP_MASK (MAP_SIZE - 1)
int main(int argc, char **argv) {
int fd;
void *ptr;
//The device path can be found by typing lspci -v
//and looking for your device.
fd = open("/sys/bus/pci/devices/0000:00:05.0/resource0", O_RDWR | O_SYNC);
//mmap returns a userspace address
//0, 4096 tells it to pull one page
ptr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
printf("PCI BAR0 0x0000 = 0x%4x\n", *((unsigned short *) ptr));
if(munmap(ptr, 4096) == -1) PRINT_ERROR;
close(fd);
return 0;
}
On a system with functioning /dev/mem in the kernel it is possible to read a bar for a device using:
sudo dd if=/dev/mem skip=13701120 count=1 bs=256 | hexdump
In the above example, 13701120 * 256 is the physical start address from which 256 bytes will be read. See the dd man page for the skip, count and bs parameters.

Computing memory address of the environment within a process

I got the following code from the lecture-slides of a security course.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
extern char shellcode;
#define VULN "./vuln"
int main(int argc, char **argv) {
void *addr = (char *) 0xc0000000 - 4
- (strlen(VULN) + 1)
- (strlen(&shellcode) + 1);
fprintf(stderr, "Using address: 0x%p\n", addr);
// some other stuff
char *params[] = { VULN, buf, NULL };
char *env[] = { &shellcode, NULL };
execve(VULN, params, env);
perror("execve");
}
This code calls a vulnerable program with the shellcode in its environment. The shellcode is some assembly code in an external file that opens a shell and VULN defines the name of the vulnerable program.
My question: how is the shellcode address computed?
The addr variable holds the address of the shellcode (which is part of the environment). Can anyone explain to me how this address is determined? So:
Where does the 0xc0000000 - 4 come from?
Why are the lengths of the shellcode and the program name subtracted from it?
Note that both this code and the vulnerable program are compiled like this:
$ CFLAGS="-m32 -fno-stack-protector -z execstack -mpreferred-stack-boundary=2"
$ cc $CFLAGS -o vuln vuln.c
$ cc $CFLAGS -o exploit exploit.c shellcode.s
$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
0
So address space randomization is turned off.
I understood that the stack is the first thing inside the process (highest memory address). And the stack contains, in this order:
The environment data.
argv
argc
the return address of main
the framepointer
local variables in main
...etc...
Constants and global data are not stored on the stack, which is why I also don't understand why the length of the VULN constant influences the address at which the shellcode is placed.
Hope you can clear this up for me :-)
Note that we're working with a Unix system on an Intel x86 architecture.

I cannot allocate 100KB with "fileuser - memlock unlimited" in /etc/security/limits.conf

I'm using Fedora release 17 (Beefy Miracle) in my lab. I am trying to lock 100KB of resident memory with the mlock C function; the code is as follows.
#include <sys/mman.h>
int main(){
char *p;
mlock(p, 100000);
sleep(100);
}
When I compiled the code with gcc and ran it under strace, I saw the following error:
gcc -o mymlock mymlock.c
strace -e mlock ./mymlock
mlock(0x4c668ff4, 100000) = -1 ENOMEM (Cannot allocate memory)
Why do I get this error if I have "fileuser - memlock unlimited" in limits.conf?
My memory usage:
[fileuser@Rossetti ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:          2900       2674        226          0         58        957
-/+ buffers/cache:       1657       1242
Swap:         4927        146       4781
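My C code was wrong: the pointer passed to mlock was never initialized. Now it works.
New code:
#include <stdlib.h>   /* malloc */
#include <unistd.h>   /* sleep */
#include <sys/mman.h> /* mlock */
int main(){
    char *p = malloc(4096*1024); /* p now points to allocated memory */
    mlock(p, (4096*1024));
    sleep(100);
}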

Read a single sector from a disk

I am trying to read a single specific sector from the disk directly. I've currently run out of ideas and any suggestions how to go about it would be great!
Try something like this to do it from the CLI:
# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 27G 24G 1.6G 94% /
# dd bs=512 if=/dev/sda2 of=/tmp/sector200 skip=200 count=1
1+0 records in
1+0 records out
From man 4 sd:
FILES
/dev/sd[a-h]: the whole device
/dev/sd[a-h][0-8]: individual block partitions
And if you want to do this from within a program, just use a combination of system calls from man 2 ... like open, lseek, and read, with the parameters from the dd example.
I'm not sure what the best programmatic approach is, but from the Linux command-line you could use the dd command in combination with the raw device for your disk to directly read from the disk.
You need to sudo this command to get access to the raw disk device (e.g. /dev/rdisk0, which is the macOS naming; on Linux the whole disk is a block device such as /dev/sda).
For example, the following will read a single 512-byte block from an offset of 900 blocks from the top of disk0 and output it to stdout.
sudo dd if=/dev/rdisk0 bs=512 skip=900 count=1
See the dd man page to get additional information on the parameters to dd.
In C it is something like the following. It requires root permissions. I think you need to open the file with O_DIRECT if you want to read single sectors; otherwise the read goes through the page cache, which works in whole pages. With O_DIRECT the buffer generally has to be aligned for reads as well as for writes, which is why the code uses aligned_alloc.
#define _GNU_SOURCE /* needed for O_DIRECT */
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define SECTOR_SIZE 512
int main(int argc, char *argv[]) {
    off_t offset = 0;  /* byte offset of the first sector to read */
    int length = 5;    /* number of sectors to read */
    ssize_t rc = -1;
    /* O_DIRECT needs a buffer aligned to the sector size */
    unsigned char *sector = aligned_alloc(SECTOR_SIZE, SECTOR_SIZE);
    if (sector == NULL) { perror("aligned_alloc"); return 1; }
    memset(sector, 0, SECTOR_SIZE);
    /* replace XXX with the source block device */
    int fd = open("/dev/XXX", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); free(sector); return 1; }
    lseek(fd, offset, SEEK_SET);
    for (int i = 0; i < length; i++) {
        rc = read(fd, sector, SECTOR_SIZE);
        if (rc < 0) {
            printf("sector read error at offset = %lld + %d\n %s\n", (long long) offset, i, strerror(errno));
            break;
        }
        printf("Sector: %d\n", i);
        for (int j = 0; j < SECTOR_SIZE; j++) {
            printf("%02x", sector[j]); /* index the buffer with j, not i */
            if ((j + 1) % 16 == 0)
                printf("\n");
        }
    }
    free(sector);
    close(fd);
    return 0;
}
The other folks have pretty much covered it. You need to
have access to the disk's device file (either be root or, better, change the permissions on it)
use the file I/O functions to read sectors, i.e. chunks of (usually) 512 bytes, from said disk.
Another alternative is to use hdparm.
For instance:
hdparm --read-sector 16782858 /dev/sda

Resources