VMX performance issue with rdtsc (no rdtsc exiting, using rdtsc offsetting) - linux

I am working on a Linux kernel module (a VMM) to test Intel VMX, running a self-made VM (the VM starts in real mode, then switches to 32-bit protected mode with paging enabled).
The VMM is configured NOT to use RDTSC exiting, and to use TSC offsetting.
The VM then runs rdtsc to check performance, like below.
static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(code)
                     : "cc");
}

uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    // RDTSC copies the contents of the 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

void i386mode_tests(void)
{
    u32 eax, ebx, ecx, edx;
    u32 i = 0;

    asm ("mov %%cr0, %%eax\n"
         "mov %%eax, %0 \n" : "=m" (eax) : :);
    my_printf("Guest CR0 = 0x%x\n", eax);

    cpuid(0x80000001, &eax, &ebx, &ecx, &edx);

    vm_tsc[0] = rdtsc();
    for (i = 0; i < 100; i++) {
        rdtsc();
    }
    vm_tsc[1] = rdtsc();

    my_printf("Rdtsc takes %d\n", vm_tsc[1] - vm_tsc[0]);
}
The output is something like this:
Guest CR0 = 0x80050033
Rdtsc takes 2742
On the other hand, I wrote a host application that does the same thing as above:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "a"(code)
                     : "cc");
}

uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    // RDTSC copies the contents of the 64-bit TSC into EDX:EAX
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
    uint64_t vm_tsc[2];
    uint32_t eax, ebx, ecx, edx, i;

    cpuid(0x80000001, &eax, &ebx, &ecx, &edx);

    vm_tsc[0] = rdtsc();
    for (i = 0; i < 100; i++) {
        rdtsc();
    }
    vm_tsc[1] = rdtsc();

    printf("Rdtsc takes %ld\n", vm_tsc[1] - vm_tsc[0]);
    return 0;
}
It outputs the following:
Rdtsc takes 2325
Running each of the two programs for 40 iterations gives the following averages:
avg(VM)   = 3188.000000
avg(host) = 2331.000000
The difference between running in the VM and on the host is too large to ignore, and it is not what I expected.
My understanding is that with TSC offsetting enabled and RDTSC exiting disabled, there should be little difference in the cost of rdtsc between the VM and the host.
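As a side note on the measurement itself: back-to-back rdtsc instructions are not serializing, so the loops above largely measure how well the reads overlap in the out-of-order core rather than a per-read cost. A fenced variant (a sketch, nothing VMX-specific assumed) may give more stable numbers in both the guest and the host:

uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;
    // LFENCE on both sides keeps this read from overlapping with its neighbours
    asm volatile("lfence\n\t"
                 "rdtsc\n\t"
                 "lfence"
                 : "=a" (lo), "=d" (hi) : : "memory");
    return (uint64_t)hi << 32 | lo;
}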
Here are the relevant VMCS fields:
0xA501E97E = control_VMX_cpu_based
0xFFFFFFFFFFFFFFF0 = control_CR0_mask
0x0000000080050033 = control_CR0_shadow
In the last level of EPT PTEs, bit[5:3] = 6 (Write Back), bit[6] = 1. EPTP[2:0] = 6 (Write Back)
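For what it's worth, decoding the control value quoted above (the decode is mine; the bit positions are those documented in the Intel SDM for the primary processor-based VM-execution controls: bit 3 = "use TSC offsetting", bit 12 = "RDTSC exiting") confirms the intended combination:

uint32_t ctl = 0xA501E97E;      /* control_VMX_cpu_based from above        */
/* (ctl >> 3)  & 1 == 1  ->  use TSC offsetting is enabled                 */
/* (ctl >> 12) & 1 == 0  ->  RDTSC exiting is disabled                     */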
I tested both on bare metal and inside VMware, and got similar results in both cases.
I am wondering if there is anything I have missed here.

Related

What caused the performance degradation when reading a locally modified cache line with concurrent readers accessing it?

Assume that we have multiple threads accessing the same cache line in parallel. One of them (the writer) repeatedly writes to that cache line and reads from it. The other threads (the readers) only repeatedly read from it. I understand that the readers suffer performance degradation, since the invalidation-based MESI protocol requires the readers' copies of the line to be invalidated before the writer's (frequent) writes. But I would expect the writer's own reads of that cache line to be fast, because a local write does not cause such an invalidation.
However, the strange thing is that when I run such an experiment on a dual-socket machine with two Intel Xeon Scalable Gold 5220R processors (24 cores each) running at 2.20GHz, the writer's read of that cache line becomes a performance bottleneck.
This is my test program (compiled with gcc 8.4.0, -O2):
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

#define CACHELINE_SIZE 64

volatile struct {
    /* cacheline 1 */
    size_t x, y;
    char padding[CACHELINE_SIZE - 2 * sizeof(size_t)];
    /* cacheline 2 */
    size_t p, q;
} __attribute__((aligned(CACHELINE_SIZE))) data;

static inline void bind_core(int core) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if ((sched_setaffinity(0, sizeof(cpu_set_t), &mask)) != 0) {
        perror("bind core failed\n");
    }
}

#define smp_mb() asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")

void *writer_work(void *arg) {
    long id = (long) arg;
    int i;
    bind_core(id);
    printf("writer tid: %ld\n", syscall(SYS_gettid));
    while (1) {
        /* read after write */
        data.x = 1;
        data.y;
        for (i = 0; i < 50; i++) __asm__("nop"); // to highlight the bottleneck
    }
}

void *reader_work(void *arg) {
    long id = (long) arg;
    bind_core(id);
    while (1) {
        /* read */
        data.y;
    }
}

#define NR_THREAD 48

int main() {
    pthread_t threads[NR_THREAD];
    int i;
    printf("%p %p\n", &data.x, &data.p);
    data.x = data.y = data.p = data.q = 0;
    pthread_create(&threads[0], NULL, writer_work, (void *) 0);
    for (i = 1; i < NR_THREAD; i++) {
        pthread_create(&threads[i], NULL, reader_work, (void *) (long) i);
    }
    for (i = 0; i < NR_THREAD; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
I use perf record -t <tid> to collect the cycles event for the writer thread and perf annotate writer_work to show the per-instruction breakdown in writer_work:
: while (1) {
: /* read after write */
: data.x = 1;
0.20 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
: data.y;
94.40 : a5b: mov 0x200626(%rip),%rax # 201088 <data+0x8>
0.03 : a62: mov $0x32,%eax
0.00 : a67: nopw 0x0(%rax,%rax,1)
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a70: nop
0.03 : a71: sub $0x1,%eax
5.17 : a74: jne a70 <writer_work+0x50>
0.15 : a76: jmp a50 <writer_work+0x30>
It seems that the load instruction of data.y becomes the bottleneck.
First, I think the performance degradation is related to the cache-line modification: when I comment out the writer's store, the perf result no longer shows the writer's read as the bottleneck. Second, I think it has something to do with the concurrent readers: when I comment out the readers' reads, the writer's read is not blamed by perf either. It is also specifically the readers sharing that cache line that slow the writer down, because when I change the writer's load to read from the other cache line instead of data.y, the count decreases.
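For reference, a cheap way to gather one more data point would be to also count memory-ordering machine clears for the writer thread alongside cycles (the event name below is what perf list reports on recent Intel cores; a suggestion only, not something verified on this exact machine):

$ perf stat -t <writer_tid> -e cycles,machine_clears.memory_ordering -- sleep 10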
There is another possibility, suggested by Peter Cordes: it is whatever instruction follows the store that gets the blame. However, when I move the nop loop to before the load, perf still blames the load instruction:
: writer_work():
: while (1) {
: /* read after write */
: data.x = 1;
0.00 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
0.03 : a5b: mov $0x32,%eax
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a60: nop
0.09 : a61: sub $0x1,%eax
6.24 : a64: jne a60 <writer_work+0x40>
: data.y;
93.60 : a66: mov 0x20061b(%rip),%rax # 201088 <data+0x8>
: data.x = 1;
0.02 : a6d: jmp a50 <writer_work+0x30>
So my question is: what caused the performance degradation when reading a locally modified cache line with concurrent readers accessing it? Any help would be appreciated!

How to get physical and virtual address bits with C/C++ by CPUID command

I'm trying to get the physical and virtual address bit widths in C by using the CPUID instruction on Windows.
I can get the processor information this way, but I'm confused about how to get the address bits.
It looks like I should use leaf 0x80000008, but when I do it the way shown below, all that is displayed is 7-8 digits that keep changing.
I want to understand how this instruction works and solve this problem.
#include <stdio.h>

void getcpuid(int T, int* val) {
    int reg_ax;
    int reg_bx;
    int reg_cx;
    int reg_dx;
    __asm {
        mov eax, T;
        cpuid;
        mov reg_ax, eax;
        mov reg_bx, ebx;
        mov reg_cx, ecx;
        mov reg_dx, edx;
    }
    *(val + 0) = reg_ax;
    *(val + 1) = reg_bx;
    *(val + 2) = reg_cx;
    *(val + 3) = reg_dx;
}

int main() {
    int val[5]; val[4] = 0;
    getcpuid(0x80000002, val);
    printf("%s\r\n", &val[0]);
    getcpuid(0x80000003, val);
    printf("%s\r\n", &val[0]);
    getcpuid(0x80000004, val);
    printf("%s\r\n", &val[0]);
    return 0;
}
When I run this code with EAX = 0x80000002, 0x80000003, 0x80000004, the Intel processor brand string is displayed.
When I instead pass 0x80000008 to get the physical and virtual address bits, all I see is random numbers that keep changing.
I want to know how to use CPUID leaf 0x80000008 to get those address bits.
I'm a programming and operating-systems beginner. Please let me know what I have to do.
The inline assembly you're using may be right; but this depends on which compiler it is. I think it is right for Microsoft's MSVC (but I've never used it and can't be sure). For GCC (and Clang) you'd have to inform the compiler that you're modifying the contents of registers and memory (via a clobber list), and it would be more efficient to tell the compiler that you're outputting 4 values in 4 registers.
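For illustration only, a sketch of the GCC/Clang extended-asm style (not MSVC syntax, and nothing the rest of this answer depends on):

static void getcpuid(uint32_t leaf, uint32_t *val) {
    __asm__ volatile("cpuid"
                     : "=a"(val[0]), "=b"(val[1]), "=c"(val[2]), "=d"(val[3])
                     : "a"(leaf), "c"(0));
}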
The main problem is that you're trying to treat the output as a (null terminated) string; and the data returned by CPUID is never a null terminated string (even for "get vendor string" and "get brand name string", it's a whitespace padded string with no zero terminator).
To fix that you could:
void getcpuid(int T, int* val) {
    unsigned int reg_ax;
    unsigned int reg_bx;
    unsigned int reg_cx;
    unsigned int reg_dx;
    __asm {
        mov eax, T;
        cpuid;
        mov reg_ax, eax;
        mov reg_bx, ebx;
        mov reg_cx, ecx;
        mov reg_dx, edx;
    }
    *(val + 0) = reg_ax;
    *(val + 1) = reg_bx;
    *(val + 2) = reg_cx;
    *(val + 3) = reg_dx;
}
int main() {
    uint32_t val[5]; val[4] = 0;
    getcpuid(0x80000002U, val);
    printf("0x%08X 0x%08X 0x%08X 0x%08X\r\n", val[0], val[1], val[2], val[3]);
    getcpuid(0x80000003U, val);
    printf("0x%08X 0x%08X 0x%08X 0x%08X\r\n", val[0], val[1], val[2], val[3]);
    getcpuid(0x80000004U, val);
    printf("0x%08X 0x%08X 0x%08X 0x%08X\r\n", val[0], val[1], val[2], val[3]);
    return 0;
}
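(As an aside: if what you actually wanted was the brand string as readable text, a common approach is to gather the 48 bytes from leaves 0x80000002-0x80000004 into a buffer and terminate it yourself. A sketch, assuming the same getcpuid helper plus an extra #include <string.h>:)

char brand[49];                       /* 48 bytes of text + our own terminator */
uint32_t val[5];
int i;
for (i = 0; i < 3; i++) {
    getcpuid(0x80000002U + i, val);
    memcpy(&brand[16 * i], val, 16);  /* EAX,EBX,ECX,EDX -> 16 bytes of text   */
}
brand[48] = '\0';
printf("%s\r\n", brand);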
The next problem is extracting the virtual address size and physical address size values. These are 8-bit values packed into the first and second byte of eax; so:
int main() {
    uint32_t val[5]; val[4] = 0;
    int physicalAddressSize;
    int virtualAddressSize;
    getcpuid(0x80000008U, val);
    physicalAddressSize = val[0] & 0xFF;
    virtualAddressSize = (val[0] >> 8) & 0xFF;
    printf("Virtual %d, physical %d\r\n", virtualAddressSize, physicalAddressSize);
    return 0;
}
That should work on most recent CPUs; which means that it's still awful and broken on older CPUs.
To start fixing that you want to check that the CPU supports "CPUID leaf 0x80000008" before you assume it exists:
int main() {
    uint32_t val[5]; val[4] = 0;
    int physicalAddressSize;
    int virtualAddressSize;
    getcpuid(0x80000000U, val);
    if(val[0] < 0x80000008U) {
        physicalAddressSize = -1;
        virtualAddressSize = -1;
    } else {
        getcpuid(0x80000008U, val);
        physicalAddressSize = val[0] & 0xFF;
        virtualAddressSize = (val[0] >> 8) & 0xFF;
    }
    printf("Virtual %d, physical %d\r\n", virtualAddressSize, physicalAddressSize);
    return 0;
}
You can return correct results when "CPUID leaf 0x80000008" doesn't exist. For all CPUs that don't support "CPUID leaf 0x80000008"; virtual address size is 32 bits, and the physical address size is either 36 bits (if PAE is supported) or 32 bits (if PAE is not supported). You can use CPUID to determine if the CPU supports PAE, so it ends up a bit like this:
int main() {
    uint32_t val[5]; val[4] = 0;
    int physicalAddressSize;
    int virtualAddressSize;
    getcpuid(0x80000000U, val);
    if(val[0] < 0x80000008U) {
        getcpuid(0x00000000U, val);
        if(val[0] == 0) {
            physicalAddressSize = 32; // "CPUID leaf 0x00000001" not supported
        } else {
            getcpuid(0x00000001U, val);
            if( (val[3] & (1 << 6)) != 0) {
                physicalAddressSize = 36; // PAE is supported
            } else {
                physicalAddressSize = 32; // PAE not supported
            }
        }
        virtualAddressSize = 32;
    } else {
        getcpuid(0x80000008U, val);
        physicalAddressSize = val[0] & 0xFF;
        virtualAddressSize = (val[0] >> 8) & 0xFF;
    }
    printf("Virtual %d, physical %d\r\n", virtualAddressSize, physicalAddressSize);
    return 0;
}
The other problem is that sometimes CPUID is buggy; which means that you have to trawl through every single errata sheet for every CPU (from Intel, AMD, VIA, etc) to be sure the results from CPUID are actually correct. For example, there are 3 models of "Intel Pentium 4 Processor on 90 nm Process" where "CPUID leaf 0x80000008" is wrong and says "physical addresses are 40 bits" when they are actually 36 bits.
For all of these cases you need to implement work-arounds (e.g. get the CPU vendor/family/model/stepping information from CPUID and if it matches one of the 3 buggy models of Pentium 4, do an "if(physicalAddressSize == 40) physicalAddressSize = 36;" to fix the CPU's bug).
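For reference, the identification fields used for such errata checks come from leaf 0x00000001; a sketch of extracting them (bit layout per the Intel SDM, reusing the getcpuid helper from above):

uint32_t val[5];
int family, model, stepping;

getcpuid(0x00000001U, val);
stepping = val[0] & 0xF;
model    = (val[0] >> 4) & 0xF;
family   = (val[0] >> 8) & 0xF;
if (family == 0xF) {
    family += (val[0] >> 20) & 0xFF;       /* extended family */
}
if (family == 0x6 || family == 0xF) {
    model += ((val[0] >> 16) & 0xF) << 4;  /* extended model  */
}
/* ...then compare vendor/family/model/stepping against the errata sheet and,
   if it matches a known-buggy part, override physicalAddressSize as described. */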

ARM Cortex A7 returning PMCCNTR = 0 in kernel mode, and Illegal instruction in user mode (even after PMUSERENR = 1)

I want to read the cycle count register (PMCCNTR) on a Raspberry Pi 2, which has an ARM Cortex A7 core. I compile a kernel module for it as follows:
#include <linux/module.h>
#include <linux/kernel.h>

int init_module()
{
    volatile u32 PMCR, PMUSERENR, PMCCNTR;

    // READ PMCR
    PMCR = 0xDEADBEEF;
    asm volatile ("mrc p15, 0, %0, c9, c12, 0\n\t" : "=r" (PMCR));
    printk (KERN_INFO "PMCR = %x\n", PMCR);

    // READ PMUSERENR
    PMUSERENR = 0xDEADBEEF;
    asm volatile ("mrc p15, 0, %0, c9, c14, 0\n\t" : "=r" (PMUSERENR));
    printk (KERN_INFO "PMUSERENR = %x\n", PMUSERENR);

    // WRITE PMUSERENR = 1
    asm volatile ("mcr p15, 0, %0, c9, c14, 0\n\t" : : "r" (1));

    // READ PMUSERENR AGAIN
    asm volatile ("mrc p15, 0, %0, c9, c14, 0\n\t" : "=r" (PMUSERENR));
    printk (KERN_INFO "PMUSERENR = %x\n", PMUSERENR);

    // READ PMCCNTR
    PMCCNTR = 0xDEADBEEF;
    asm volatile ("mrc p15, 0, %0, c9, c13, 0\n\t" : "=r" (PMCCNTR));
    printk (KERN_ALERT "PMCCNTR = %x\n", PMCCNTR);

    return 0;
}

void cleanup_module()
{
}

MODULE_LICENSE("GPL");
and, after insmod, I observe the following in /var/log/kern.log:
PMCR = 41072000
PMUSERENR = 0
PMUSERENR = 1
PMCCNTR = 0
When I try to read PMCCNTR from user mode, I get an illegal instruction, even after PMUSERENR has been set to 1.
Why does PMCCNTR read as 0 in kernel mode, and raise an illegal instruction in user mode? Is there something else I need to do, that I'm not doing, to enable the PMCCNTR?
Update 1
Partly solved. The solution to the multi-core issue is to call on_each_cpu like so:
#include <linux/module.h>
#include <linux/kernel.h>

static void enable_ccnt_read(void* data)
{
    // WRITE PMUSERENR = 1
    asm volatile ("mcr p15, 0, %0, c9, c14, 0\n\t" : : "r" (1));
}

int init_module()
{
    on_each_cpu(enable_ccnt_read, NULL, 1);
    return 0;
}

void cleanup_module()
{
}

MODULE_LICENSE("GPL");
I can now read PMCCNTR from userland:
#include <iostream>

unsigned ccnt_read()
{
    volatile unsigned cc;
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc));
    return cc;
}

int main() {
    std::cout << ccnt_read() << std::endl;
}
To run a userland program on a specific core you can use taskset like so (example, run on core 2):
$ taskset -c 2 ./ccnt_read
0
However, PMCCNTR is still not incrementing. It needs to be "switched on" somehow.
Here is the working solution for posterity:
The kernel module:
#include <linux/module.h>
#include <linux/kernel.h>

static void enable_ccnt_read(void* data)
{
    // PMUSERENR = 1
    asm volatile ("mcr p15, 0, %0, c9, c14, 0" :: "r"(1));
    // PMCR.E (bit 0) = 1
    asm volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(1));
    // PMCNTENSET.C (bit 31) = 1
    asm volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1 << 31));
}

int init_module()
{
    on_each_cpu(enable_ccnt_read, NULL, 1);
    return 0;
}

void cleanup_module()
{
}

MODULE_LICENSE("GPL");
The client program:
#include <iostream>

unsigned ccnt_read()
{
    volatile unsigned cc;
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc));
    return cc;
}

int main() {
    std::cout << ccnt_read() << std::endl;
}
What you have done is enable user-level access to the counter; you have not enabled the counter itself. In addition to enabling access, you have to set the 31st bit (the C bit) of PMCNTENSET to enable counting. This, along with your on_each_cpu() change, should enable the functionality you are looking for.
A word of caution: your measurements will be messed up if the process migrates to a different core between CCNT reads.
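One way to reduce that risk, besides taskset as shown above, is to pin the measuring program to a core from inside the code itself; a sketch (plain C, core number chosen arbitrarily):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static unsigned ccnt_read(void)
{
    volatile unsigned cc;
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc));
    return cc;
}

int main(void)
{
    cpu_set_t mask;
    unsigned a, b;

    CPU_ZERO(&mask);
    CPU_SET(2, &mask);                  /* stay on core 2 for the whole run */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    a = ccnt_read();
    b = ccnt_read();
    printf("%u\n", b - a);              /* both reads come from the same core's CCNT */
    return 0;
}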
I am running this chip in simulation, and found a further problem beyond those described above. The performance counter must be reset when it is enabled, otherwise asserts are generated from the undefined counter value. This means that the PMCR register should be set as follows:
// PMCR.E (bit 0) = 1, PMCR.C (bit 2) = 1
asm volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(5));

Reading x86 MSR from kernel module

My main aim is to get the addresses of the last 16 branches maintained by the LBR registers when a program crashes. I have tried two approaches so far:
1) msr-tools
This allows me to read MSR values from the command line. I invoke it from the C program itself and try to read the values. But the register values seem nowhere near the addresses in the program itself. Most probably the registers are getting polluted by other branches in system code. I tried turning off recording of branches in ring 0 and of far jumps, but that doesn't help; I still get unrelated values.
2) accessing through kernel module
OK, I wrote a very simple module (I've never done this before) to access the MSRs directly and hopefully avoid the register pollution.
Here's what I have -
#define LBR 0x1d9 // IA32_DEBUGCTL MSR
                  // I first set this to some non-0 value using wrmsr (msr-tools)

static void __init do_rdmsr(unsigned msr, unsigned unused2)
{
    uint64_t msr_value;
    __asm__ __volatile__ (" rdmsr"
                          : "=A" (msr_value)
                          : "c" (msr)
                         );
    printk(KERN_EMERG "%lu \n", msr_value);
}

static int hello_init(void)
{
    printk(KERN_EMERG "Value is ");
    do_rdmsr(LBR, 0);
    return 0;
}

static void hello_exit(void)
{
    printk(KERN_EMERG "End\n");
}

module_init(hello_init);
module_exit(hello_exit);
But the problem is that every time I use dmesg to read the output, I just get
Value is 0
(I have tried other registers as well; the value always comes out as 0.)
Is there something I am forgetting here? Any help? Thanks.
Use the following:
unsigned long long x86_get_msr(int msr)
{
    unsigned long msrl = 0, msrh = 0;

    /* NOTE: rdmsr always returns its result in the EDX:EAX register pair */
    asm volatile ("rdmsr" : "=a"(msrl), "=d"(msrh) : "c"(msr));
    return ((unsigned long long)msrh << 32) | msrl;
}
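Called from your init routine it would look something like this (a usage sketch; %llx rather than %lu so the full 64-bit value prints correctly):

static int hello_init(void)
{
    printk(KERN_EMERG "Value is 0x%llx\n", x86_get_msr(LBR));
    return 0;
}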
You can use Ilya Matveychikov's answer... or... OR :
#include <asm/msr.h>

int err;
unsigned int msr, cpu;
unsigned long long val;

/* rdmsr without exception handling */
rdmsrl(msr, val);

/* rdmsr with exception handling */
err = rdmsrl_safe(msr, &val);

/* rdmsr on a given CPU (instead of the current one) */
err = rdmsrl_safe_on_cpu(cpu, msr, &val);
And there are many more functions, such as :
int msr_set_bit(u32 msr, u8 bit)
int msr_clear_bit(u32 msr, u8 bit)
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs)
int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8])
Have a look at /lib/modules/<uname -r>/build/arch/x86/include/asm/msr.h
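Putting a couple of those helpers together, a minimal module along these lines (a sketch; the helper names are those found in mainline kernels of that era, and the MSR number is just IA32_DEBUGCTL again) reads the MSR on every online CPU with no hand-written inline assembly:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/cpu.h>
#include <asm/msr.h>

static int __init msr_demo_init(void)
{
    unsigned int cpu;
    u64 val;

    for_each_online_cpu(cpu) {
        if (rdmsrl_safe_on_cpu(cpu, 0x1d9 /* IA32_DEBUGCTL */, &val) == 0)
            printk(KERN_INFO "cpu%u: IA32_DEBUGCTL = 0x%llx\n", cpu, val);
        else
            printk(KERN_INFO "cpu%u: rdmsr faulted\n", cpu);
    }
    return 0;
}

static void __exit msr_demo_exit(void)
{
}

module_init(msr_demo_init);
module_exit(msr_demo_exit);
MODULE_LICENSE("GPL");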

Just out of curiosity: how come the Linux kernel's "optimized" strcpy is much slower than the libc implementation?

I tried to benchmark the optimized string operations under http://lxr.linux.no/#linux+v2.6.38/arch/x86/lib/string_32.c and compare them to the regular strcpy:
#include <stdio.h>
#include <stdlib.h>

char *_strcpy(char *dest, const char *src)
{
    int d0, d1, d2;
    asm volatile("1:\tlodsb\n\t"
                 "stosb\n\t"
                 "testb %%al,%%al\n\t"
                 "jne 1b"
                 : "=&S" (d0), "=&D" (d1), "=&a" (d2)
                 : "0" (src), "1" (dest) : "memory");
    return dest;
}

int main(int argc, char **argv)
{
    int times = 1;
    if (argc > 1)
    {
        times = atoi(argv[1]);
    }
    char a[100];
    for (; times; times--)
        _strcpy(a, "Hello _strcpy!");
    return 0;
}
Timing it with time(1) showed that it is about 10x slower than the regular strcpy (under x86-64 Linux).
Why?
If your string is constant, it's possible that the compiler is inlining the copy (for the plain strcpy call), turning it into a series of unconditional MOV instructions.
Since this is straight-line code without conditions or loops, it would be faster than the Linux variant.
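Roughly speaking (a sketch of what such inlining is equivalent to, not actual compiler output), the 15 bytes of "Hello _strcpy!" including the terminating NUL can be written with two overlapping fixed-size stores and no branches at all:

#include <string.h>

void copy_inlined(char *a)
{
    memcpy(a,     "Hello _s", 8);  /* bytes 0-7                                   */
    memcpy(a + 7, "strcpy!",  8);  /* bytes 7-14, overlapping store incl. the NUL */
}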
