Pthreads & Multicore compiler - multithreading

I'm working with an SMP-capable kernel: Snapgear 2.6.21.
I have created 4 threads in my C application, and I am trying to set thread 1 to run on CPU 1, thread 2 on CPU 2, etc.
However, the compiler sparc-linux-gcc does not recognize these functions:
CPU_SET(int cpu, cpu_set_t *set);
CPU_ZERO(cpu_set_t *set);
and this type: cpu_set_t
It always gives me these errors:
implicit declaration of function 'CPU_ZERO'
implicit declaration of function 'CPU_SET'
'cpu_set_t' undeclared (first use in this function)
Here is my code to bind active thread to processor 0:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);   /* bind to processor 0 */
sched_setaffinity(0, sizeof(mask), &mask);
I have included and defined at the top:
#define _GNU_SOURCE
#include <sched.h>
But I always get the same errors. Can you help me, please?

You should read sched_setaffinity(2) carefully and test its result (and display errno on failure, e.g. with perror).
Actually, I believe you should use pthread_setaffinity_np(3) instead (and of course test its failure, etc...)
Even more, I believe that you should not bother to explicitly set the affinity. Recent Linux kernels are often quite good at dispatching running threads on different CPUs.
So simply use pthreads and don't bother about affinity, unless you see actual issues when benchmarking.
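If you do end up pinning threads, a minimal sketch using pthread_setaffinity_np (assuming your C library provides it; _GNU_SOURCE must be defined before any #include, and you link with -pthread) could look like this. The worker function and the choice of CPU 1 are just placeholders:

#define _GNU_SOURCE            /* must come before any #include */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg)  /* placeholder thread body */
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    cpu_set_t set;

    pthread_create(&tid, NULL, worker, NULL);

    CPU_ZERO(&set);
    CPU_SET(1, &set);           /* ask for CPU 1 only */
    int err = pthread_setaffinity_np(tid, sizeof(set), &set);
    if (err != 0)               /* returns an errno value, not -1/errno */
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));

    pthread_join(tid, NULL);
    return 0;
}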
BTW, passing the -H flag to your GCC (cross-)compiler could be helpful: it shows you the included files. Perhaps also look at the preprocessed form obtained with gcc -C -E; it looks like some header files are missing or not found (maybe a missing -I include-directory at compilation time, or some missing headers on your development system).
BTW, your kernel version looks ancient. Can't you upgrade your kernel to something newer (3.15.x or some 3.y)?

Related

Why does strace ignore some syscalls (randomly) depending on environment/kernel?

If I compile the following program:
$ cat main.cpp && g++ main.cpp
#include <time.h>
int main() {
    struct timespec ts;
    return clock_gettime(CLOCK_MONOTONIC, &ts);
}
and then run it under strace in "standard" Kubuntu, I get this:
strace -tt --trace=clock_gettime ./a.out
17:58:40.395200 +++ exited with 0 +++
As you can see, there is no clock_gettime (full strace output is here).
On the other hand, if I run the same app in my custom-built Linux kernel under QEMU, I get the following output:
strace -tt --trace=clock_gettime ./a.out
18:00:53.082115 clock_gettime(CLOCK_MONOTONIC, {tv_sec=101481, tv_nsec=107976517}) = 0
18:00:53.082331 +++ exited with 0 +++
This is more what I expected: clock_gettime appears.
So, my questions are:
Why does strace ignore/omit clock_gettime if I run it in Kubuntu?
Why does strace's behaviour differ depending on the environment/kernel?
Answer to the first question
From the vdso(7) man page:
strace(1), seccomp(2), and the vDSO
When tracing systems calls with strace(1), symbols (system calls) that are exported by the vDSO will not appear in the trace output. Those system calls will likewise not be visible to seccomp(2) filters.
Answer to the second question:
In the vDSO, clock_gettime and related functions rely on specific clock modes; see __arch_get_hw_counter.
If the clock mode is VCLOCK_TSC, the time is read without a syscall, using RDTSC; if it’s VCLOCK_PVCLOCK or VCLOCK_HVCLOCK, it’s read from a specific page to retrieve the information from the hypervisor. HPET doesn’t declare a clock mode, so it ends up with the default VCLOCK_NONE, and the vDSO issues a system call to retrieve the time.
And indeed:
In the default kernel (from Kubuntu):
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Custom built kernel:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
More info about various clock sources. In particular:
The documentation of Red Hat MRG version 2 states that TSC is the preferred clock source due to its much lower overhead, but it uses HPET as a fallback. A benchmark in that environment for 10 million event counts found that TSC took about 0.6 seconds, HPET took slightly over 12 seconds, and ACPI Power Management Timer took around 24 seconds.
It may be due to the fact that clock_gettime() is part of the optimized syscalls. Look at the vdso mechanism described in this answer.
Considering clock_gettime(), on some architectures (e.g. 32-bit ARM v7 Linux), only a subset of the available clock identifiers are supported by the VDSO implementation. For the others, there is a fallback to the actual system call. Here is the source code of the VDSO implementation of clock_gettime() in the Linux kernel for ARM v7 (file arch/arm/vdso/vgettimeofday.c in the source tree):
notrace int __vdso_clock_gettime(clockid_t clkid, struct timespec *ts)
{
    struct vdso_data *vdata;
    int ret = -1;

    vdata = __get_datapage();

    switch (clkid) {
    case CLOCK_REALTIME_COARSE:
        ret = do_realtime_coarse(ts, vdata);
        break;
    case CLOCK_MONOTONIC_COARSE:
        ret = do_monotonic_coarse(ts, vdata);
        break;
    case CLOCK_REALTIME:
        ret = do_realtime(ts, vdata);
        break;
    case CLOCK_MONOTONIC:
        ret = do_monotonic(ts, vdata);
        break;
    default:
        break;
    }

    if (ret)
        ret = clock_gettime_fallback(clkid, ts);

    return ret;
}
The above source code shows a switch/case over the supported clock identifiers; in the default case, there is a fallback to the actual system call.
On such an architecture, when spying on software like systemd, which uses clock_gettime() with both CLOCK_MONOTONIC and CLOCK_BOOTTIME, strace only shows the calls with the latter identifier, since it is not among the cases supported in VDSO mode.
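For example, a tiny test program like the following (assuming your libc headers expose CLOCK_BOOTTIME) would, when traced with strace --trace=clock_gettime on such an ARM v7 system, show only the CLOCK_BOOTTIME call:

#include <time.h>

int main(void)
{
    struct timespec ts;

    /* Served by the vDSO in the code above: should NOT appear in strace. */
    clock_gettime(CLOCK_MONOTONIC, &ts);

    /* Not in the vDSO switch above: falls back to the real system call,
       so it SHOULD appear in strace. */
    clock_gettime(CLOCK_BOOTTIME, &ts);

    return 0;
}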
Cf. this link for reference

Are there compatibility issues with clang-cl and arch:avx2?

I'm using Windows 10, Visual Studio 2019, Platform: x64 and have the following test script in a single-file Visual Studio Solution:
#include <iostream>
#include <intrin.h>
using namespace std;

int main() {
    unsigned __int64 mask = 0x0fffffffffffffff; //1152921504606846975;
    unsigned long index;
    _BitScanReverse64(&index, mask);
    if (index != 59) {
        cout << "Fails!" << endl;
        return EXIT_FAILURE;
    }
    else {
        cout << "Success!" << endl;
        return EXIT_SUCCESS;
    }
}
In my solution's properties I've set 'Enable Enhanced Instruction Set' to 'Advanced Vector Extensions 2 (/arch:AVX2)'.
When compiling with msvc (setting 'Platform Toolset' to 'Visual Studio 2019 (v142)') the code returns EXIT_SUCCESS, but when compiling with clang-cl (setting 'Platform Toolset' to 'LLVM (clang-cl)') I get EXIT_FAILURE. When debugging the clang-cl run, the value of index is 4, when it should be 59. This suggests to me that clang-cl is reading the bits in the opposite direction of MSVC.
This isn't the case when I set 'Enable Enhanced Instruction Set' to 'Not Set'. In this scenario, both MSVC and clang-cl return EXIT_SUCCESS.
All of the DLLs loaded and shown in the Debug Output window come from C:\Windows\System32###.dll in all cases.
Does anyone understand this behavior? I would appreciate any insight here.
EDIT: I failed to mention earlier: I compiled this on an Intel Core i7-3930K CPU @ 3.20GHz.
Getting 4 instead of 59 sounds like clang implemented _BitScanReverse64 as 63 - lzcnt. Actual bsr is slow on AMD, so yes, there are reasons why a compiler would want to compile a BSR intrinsic to a different instruction.
But then you ran the executable on a computer that doesn't actually support BMI, so lzcnt decoded as rep bsr = bsr, giving the bit-index of the highest set bit instead of the leading-zero count. For your mask, bsr returns 59 where lzcnt would return 4, so the final 63 - x step produces 63 - 59 = 4 instead of 63 - 4 = 59.
AFAIK, all CPUs that have AVX2 also have BMI. If your CPU doesn't have those, you shouldn't expect executables built with /arch:AVX2 to run correctly on it. And in this case the failure mode wasn't an illegal instruction; it was lzcnt running as bsr.
MSVC doesn't generally optimize intrinsics, apparently including this case, so it just uses bsr directly.
Update: i7-3930K is SandyBridge-E. It doesn't have AVX2, so that explains your results.
clang-cl doesn't error when you tell it to build an AVX2 executable on a non-AVX2 computer. The use-case for that would be compiling on one machine to create an executable to run on different machines.
It also doesn't add CPUID-checking code to your executable for you. If you want that, write it yourself. This is C++, it doesn't hold your hand.
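A sketch of such a check using the GCC/clang builtin __builtin_cpu_supports (which I believe clang-cl accepts too; MSVC itself would need __cpuid from <intrin.h> instead), testing the two feature bits relevant here:

#include <stdio.h>
#include <stdlib.h>

/* Ideally this lives in a translation unit compiled without /arch:AVX2,
   so it runs before any AVX2/BMI code can execute. */
static void require_avx2_and_bmi(void)
{
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx2") || !__builtin_cpu_supports("bmi")) {
        fprintf(stderr, "This build requires AVX2 and BMI; this CPU lacks them.\n");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    require_avx2_and_bmi();
    /* ... the rest of the program, free to assume AVX2/BMI ... */
    puts("CPU OK");
    return 0;
}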
target CPU options
MSVC-style /arch options are much more limited than normal GCC/clang style. There aren't any for different levels of SSE like SSE4.1; it jumps straight to AVX.
Also, /arch:AVX2 apparently implies BMI1/2, even though those are different instruction-sets with different CPUID feature bits. In kernel code for example you might want integer BMI instructions but not SIMD instructions that touch XMM/YMM registers.
clang -O3 -mavx2 would not also enable -mbmi. You normally would want that, but if you failed to also enable BMI then clang would have been stuck using bsr (which is actually better for Intel CPUs than 63-lzcnt). I think MSVC's /arch:AVX2 is something like -march=haswell, assuming it also enables FMA instructions.
And nothing in MSVC has any support for making binaries optimized to run on the computer you build them on. That makes sense, it's designed for a closed-source binary-distribution model of software development.
But GCC and clang have -march=native to enable all the instruction sets your computer supports. And also importantly, set tuning options appropriate for your computer. e.g. don't worry about making code that would be slow on an AMD CPU, or on older Intel, just make asm that's good for your CPU.
TL:DR: CPU selection options in clang-cl are very coarse, lumping non-SIMD extensions in with some level of AVX. That's why /arch:AVX2 enabled the integer BMI extensions, while clang -mavx2 would not have.

What happens when you execute an instruction that your CPU does not support?

What happens if a CPU attempts to execute a binary that has been compiled with instructions it doesn't support? I'm specifically wondering about some of the new AVX instructions running on older processors.
I'm assuming this can be tested for, and a friendly message could in theory be displayed to a user. Presumably most low level libraries will check this on your behalf. Assuming you didn't make this check, what would you expect to happen? What signal would your process receive?
A new instruction can be designed to be "legacy compatible", or not.
To the former class belong instructions like tzcnt or xacquire, whose encodings are valid instructions on older architectures: tzcnt is encoded as rep bsf and xacquire is just repne.
The semantics are different, of course.
To the second class belong the majority of new instructions, AVX being one popular example.
When the CPU encounters an invalid or reserved encoding it generates the #UD (for UnDefined) exception - that's interrupt number 6.
The Linux kernel sets the IDT entry for #UD early in entry_64.S:
idtentry invalid_op do_invalid_op has_error_code=0
The entry points to do_invalid_op, which is generated by a macro in traps.c:
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
the macro DO_ERROR generates a function that calls do_error_trap in the same file (here).
do_error_trap uses fill_trap_info (in the same file, here) to create a siginfo_t structure containing the Linux signal information:
case X86_TRAP_UD:
    sicode = ILL_ILLOPN;
    siaddr = uprobe_get_trap_addr(regs);
    break;
From there, the following calls happen:
do_trap in traps.c
force_sig_info in signal.c
specific_send_sig_info in signal.c
which ultimately culminates in calling the signal handler for SIGILL of the offending process.
The following program is a very simple example that generates a #UD:
BITS 64
GLOBAL _start

SECTION .text

_start:
    ud2
We can use strace to check the signal received when running that program:
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x400080} ---
+++ killed by SIGILL +++
as expected.
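As the question notes, a process could in principle catch that signal and show a friendlier message before exiting. A minimal sketch (the handler only uses async-signal-safe calls and must not return, since the faulting instruction would just re-execute):

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void on_sigill(int sig)
{
    (void)sig;
    static const char msg[] =
        "This binary uses instructions your CPU does not support.\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);  /* async-signal-safe */
    _exit(EXIT_FAILURE);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigaction(SIGILL, &sa, NULL);

    __builtin_trap();   /* emits ud2 on x86, so the handler fires */
    return 0;
}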
As Cody Gray commented, libraries don't usually rely on SIGILL; instead, they use a CPU dispatcher or check for the presence of an instruction explicitly.

Error while compiling Linux kernel 2.6.39.4

I am writing a system call that calculates the average waiting time of the FCFS scheduling algorithm.
After following this guide, I have made changes in the files concerned and written this program.
Now, while compiling the kernel, I get this error:
CC arch/x86/lib/strstr_32.o
AS arch/x86/lib/thunk_32.o
CC arch/x86/lib/usercopy_32.o
AR arch/x86/lib/lib.a
LD vmlinux.o
MODPOST vmlinux.o
WARNING: modpost: Found 31 section mismatch(es).
To see full details build your kernel with:
'make CONFIG_DEBUG_SECTION_MISMATCH=y'
GEN .version
CHK include/generated/compile.h
UPD include/generated/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
kernel/built-in.o: In function `sys_atvfcfs':
(.text+0x3e27e): undefined reference to `__floatsisf'
kernel/built-in.o: In function `sys_atvfcfs':
(.text+0x3e286): undefined reference to `__fixsfsi'
make: *** [.tmp_vmlinux1] Error 1
And this is my program:
#include <linux/linkage.h>

asmlinkage long sys_atvfcfs(int at[], int bt[], int n)
{
    int i = 0;
    int j, t, wt[n], sum, q;
    float avgwt;

    for (j = i + 1; j < n; j++)
    {
        if (at[i] > at[j])
        {
            t = at[i];
            at[i] = at[j];
            at[j] = t;
            q = bt[i];
            bt[i] = bt[j];
            bt[j] = q;
        }
    }
    wt[0] = 0;
    sum = 0;
    for (i = 0; i < n - 1; i++)
    {
        wt[i + 1] = wt[i] + bt[i];
        sum = sum + (wt[i + 1] - at[i]);
    }
    avgwt = sum / n;
    return avgwt;
}
Can anyone explain where the problem is?
Google for "linux kernel float usage". It's a special thing. If you can avoid using floating point types, avoid it.
As the answer you've already got says, floating point is a special case for the Linux kernel.
Specifically, one of the basic rules of the kernel is to avoid using the FPU unless you absolutely have to. To expand on what is said there:
The FPU context is not saved; even in user context the FPU state probably won't correspond with the current process: you would mess with some user process' FPU state. If you really want to do this, you would have to explicitly save/restore the full FPU state (and avoid context switches). It is generally a bad idea; use fixed point arithmetic first.
In short, as described in this question and its answers, the kernel asks the CPU not to bother context-switching the FPU registers. So if your process undergoes a context switch, the next application to run will be able to get hold of and modify your FPU registers; you'll then get back the modified state. Not good.
You can enable the FPU yourself with kernel_fpu_begin(), and that is preempt-safe; however, it also makes your code non-preemptible and puts you in a critical section, so you must call kernel_fpu_end() as soon as possible.
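In this particular syscall the floating point is easy to avoid entirely: return the average as a plain integer, or as a fixed-point value scaled by a constant. A minimal sketch of that approach (the scale factor of 100 is an arbitrary choice, and the arrival-time sort from the original code is omitted for brevity):

#include <linux/linkage.h>

/* Sketch: same waiting-time computation, but the result is returned as a
   fixed-point value scaled by 100 (e.g. 1234 means 12.34), so no FPU use. */
asmlinkage long sys_atvfcfs(int at[], int bt[], int n)
{
    int i, sum = 0, wt = 0;

    if (n <= 0)
        return 0;

    for (i = 0; i < n - 1; i++) {
        wt += bt[i];               /* waiting time of process i+1 */
        sum += wt - at[i];
    }

    return ((long)sum * 100) / n;  /* caller divides by 100 */
}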

Disable randomization of memory addresses

I'm trying to debug a binary that uses a lot of pointers. Sometimes, to get quick output for tracking down errors, I print out the addresses of objects and their corresponding values; however, the object addresses are randomized, which defeats the purpose of this quick check.
Is there a way to disable this temporarily/permanently so that I get the same values every time I run the program?
Oops. OS is Linux fsttcs1 2.6.32-28-generic #55-Ubuntu SMP Mon Jan 10 23:42:43 UTC 2011 x86_64 GNU/Linux
On Ubuntu, it can be disabled with:
echo 0 > /proc/sys/kernel/randomize_va_space
On Windows, this post might be of some help...
http://blog.didierstevens.com/2007/11/20/quickpost-another-funny-vista-trick-with-aslr/
To temporarily disable ASLR for a particular program, you can always issue the following (no need for sudo):
setarch `uname -m` -R ./yourProgram
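A quick way to verify that randomization is really off is a tiny program that prints a few addresses; run it several times with and without setarch and compare (the variable names here are arbitrary):

#include <stdio.h>
#include <stdlib.h>

int global_var;

int main(void)
{
    int stack_var;
    void *heap_ptr = malloc(16);

    /* With ASLR disabled these should be identical on every run. */
    printf("stack: %p  heap: %p  global: %p\n",
           (void *)&stack_var, heap_ptr, (void *)&global_var);

    free(heap_ptr);
    return 0;
}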
You can also do this programmatically from C source before a UNIX exec.
If you take a look at the sources for setarch (here's one source):
http://code.metager.de/source/xref/linux/utils/util-linux/sys-utils/setarch.c
You can see it boils down to a system call (syscall) or a function call (depending on what your system defines). From setarch.c:
#ifndef HAVE_PERSONALITY
# include <syscall.h>
# define personality(pers) ((long)syscall(SYS_personality, pers))
#endif
On my CentOS 6 64-bit system, it looks like it uses a function (which probably calls the self-same syscall above). Take a look at this snippet from the include file in /usr/include/sys/personality.h (as referenced as <sys/personality.h> in the setarch source code):
/* Set different ABIs (personalities). */
extern int personality (unsigned long int __persona) __THROW;
What it boils down to is that you can, from C code, call personality() to set ADDR_NO_RANDOMIZE and then exec (just like setarch does).
#include <sys/personality.h>
#include <unistd.h>

#ifndef HAVE_PERSONALITY
# include <syscall.h>
# define personality(pers) ((long)syscall(SYS_personality, pers))
#endif

...

void mycode(char *argv[])
{
    // If requested, turn off the address randomization feature right before execing
    if (MyGlobalVar_Turn_Address_Randomization_Off) {
        personality(ADDR_NO_RANDOMIZE);
    }
    execvp(argv[0], argv); // ... from set-arch.
}
It's pretty obvious you can't turn address randomization off in the process you are in (grin: unless maybe dynamic loading), so this only affects forks and execs later. I believe the Address Randomization flags are inherited by child sub-processes?
Anyway, that's how you can programmatically turn off address randomization in C source code. This may be your only solution if you don't want to force a user to intervene manually and start up with setarch or one of the other solutions listed earlier.
Before you complain about security issues in turning this off, some shared memory libraries/tools (such as PickingTools shared memory and some IBM databases) need to be able to turn off randomization of memory addresses.
