How well does Valgrind handle threads and machine-level synchronization instructions?

I have a highly parallel Windows program that uses lots of threads, hand-coded machine synchronization instructions, and home-rolled parallel-safe storage allocators. Alas, the storage management has a hole (not a synchronization hole in the allocators, I'm pretty sure) and I'd like to find it.
Valgrind has been suggested as a good tool for finding storage management errors.
Any experience here with Valgrind used under these circumstances?

Valgrind does not run on Windows, but it does work with Windows programs running under Wine on Linux. If your program runs under Wine, it has a decent chance of working with Valgrind. See winehq.org for details.
The latest version is pretty good at handling all the 32-bit x86 instructions. It can handle programs that create many threads; just don't expect them to run simultaneously under Valgrind. It runs only one thread at a time, as if the program were running on a single-core machine.

Related

Best way to simulate old, slow processor on modern hardware?

I really like the idea of running and optimizing my software on old hardware, because you can viscerally feel when things are slower (or faster!). The most obvious way to do this is to buy an old system and literally use it for development, but that would slow down my IDE, compiler, and all other development tasks, which is less helpful and (possibly) unnecessary.
I want to be able to:
Run my application at various levels of performance, on demand
At the same time, run my IDE, debugger, compiler at full speed
On a single system
Nice to have:
Simulate real, specific old systems, with some accuracy
Similarly throttle memory speed, and size
Optionally run my build system slowly
Try using QEMU in full emulation mode, but keep in mind that it uses more CPU resources.
https://stuff.mit.edu/afs/sipb/project/phone-project/OldFiles/share/doc/qemu/qemu-doc.html
QEMU has two operating modes:
Full system emulation. In this mode, QEMU emulates a full system (for example a PC), including one or several processors and various peripherals. It can be used to launch different Operating Systems without rebooting the PC or to debug system code.
User mode emulation (Linux host only). In this mode, QEMU can launch Linux processes compiled for one CPU on another CPU.
The supported architectures are listed here:
https://wiki.qemu.org/Documentation/Platforms

Is an operating system kernel an interpreter for all other programs?

So, from my understanding, there are two types of programs: those that are interpreted and those that are compiled. Interpreted programs are executed by an interpreter that is a native application for the platform it's on, and compiled programs are themselves native applications (or system software) for the platform they are on.
But my question is this: is anything besides the kernel actually being directly run by the CPU? A Windows Executable is a "Windows Executable", not an x86 or amd64 executable. Does that mean every other process that's not the kernel is literally being interpreted by the kernel in the same way that a browser interprets Javascript? Or is the kernel placing these processes on the "bare metal" that the kernel sits on top of?
If they're on the "bare metal", how, say, does Windows know that a program is a Windows program and not a Linux program, since they're both compiled for amd64 processors? If it's because of the "format" of the executable, how is that executable able to run on the "bare metal", since, to me, the fact that it's formatted to run on a particular OS would mean that some interpretation would be required for it to run?
Is this question too complicated for Stack Overflow?
They run on the "bare metal", but they do contain operating system-specific things. An executable file will typically provide some instructions to the kernel (which are, arguably, "interpreted") as to how the program should be loaded into memory, and the file's code will provide ways for it to "hook" in to the running operating system, such as by an operating system's API or via device drivers. Once such a non-interpreted program is loaded into memory, it runs on the bare metal but continues to communicate with the operating system, which is also running on the bare metal.
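As a rough illustration of how a loader can tell executable formats apart before any code runs, the sketch below inspects the first bytes of a file. ELF files (the format used on Linux) begin with the bytes 0x7F 'E' 'L' 'F'; Windows PE files begin with an 'MZ' DOS header whose 4-byte field at offset 0x3C points to a 'PE\0\0' signature. The function name `executable_format` is hypothetical, and a real loader validates far more than these few bytes.

```python
import struct
import sys


def executable_format(header: bytes) -> str:
    """Guess an executable's container format from its leading bytes."""
    if header[:4] == b"\x7fELF":
        return "ELF"
    if header[:2] == b"MZ" and len(header) >= 0x40:
        # The little-endian 32-bit field at 0x3C holds the offset of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        if header[pe_offset:pe_offset + 4] == b"PE\0\0":
            return "PE"
        return "MZ (no PE signature found in the bytes given)"
    return "unknown"


if __name__ == "__main__":
    # Inspect the Python interpreter binary itself as a convenient sample file.
    with open(sys.executable, "rb") as f:
        print(sys.executable, "->", executable_format(f.read(4096)))
```

Only this header parsing is "interpreted" by the operating system; once the code sections are mapped into memory, the CPU executes the machine instructions directly.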
In the days of single-process operating systems, it was common for executables to essentially "seize" control of the entire computer and communicate with hardware directly. Computers like the Apple ][ and the Commodore 64 work like that. In a modern multitasking operating system like Windows or Linux, applications and the operating system share use of the CPU via a complex multitasking arrangement, and applications access the hardware via a set of abstractions built in to the operating system's API and its device drivers. Take a course in Operating System design if you are interested in learning lots of details.
Bouncing off Junaid's answer, the way that the kernel blocks a program from doing something "funny" is by controlling the allocation and usage of memory. The kernel requires that memory be requested and accessed through it via its API, and thus protects the computer from "unauthorized" access. In the days of single-process operating systems, applications had much more freedom to access memory and other things directly, without involving the operating system. An application running on an old Apple ][ can read to or write to any address in RAM that it wants to on the entire computer.
One of the reasons why a compiled application won't just "run" on another operating system is that these "hooks" are different for different operating systems. For example, an application that knows how to request the allocation of RAM from Windows might not have any idea how to request it from Linux or the Mac OS. As Disk Crasher mentioned, these low level access instructions are inserted by the compiler.
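To make the point about OS-specific "hooks" concrete, here is a minimal sketch that asks the operating system for a page of memory through Python's standard `mmap` module. The same source runs on Linux and Windows only because the runtime picks the platform's own call underneath (to my understanding, `mmap` on Unix-like systems and `CreateFileMapping`/`MapViewOfFile` on Windows); a compiled program has one of those choices baked in. The helper name `allocate_page_from_os` is hypothetical.

```python
import mmap
import sys


def allocate_page_from_os(size: int = mmap.PAGESIZE) -> mmap.mmap:
    """Request anonymous memory (not backed by a file) from the kernel."""
    # A file descriptor of -1 means "anonymous mapping" on every supported platform.
    return mmap.mmap(-1, size)


buf = allocate_page_from_os()
buf[:5] = b"hello"          # write into the memory the kernel handed back
first = bytes(buf[:5])      # read it back
print(sys.platform, first)
buf.close()                 # return the memory to the operating system
```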
I think you are confusing things. A compiled program is in machine-readable format. When you run the program, the kernel allocates memory, CPU time, etc., and ensures that the program does not interfere with other programs. If the program requires access to hardware resources, disk, etc., the kernel handles it, so the kernel is always between the hardware and any software you run in user space.
If the program is interpreted, then an interpreter for that language converts the code to machine-readable form on the fly, and the kernel still provides the same functionality, such as access to hardware and making sure programs aren't doing anything funny like trying to access another program's memory.
The only thing that runs on "bare metal" is assembly language code, which is abstracted from the programmer by many layers in the OS and compiler. Generally speaking, applications are compiled for a specific OS and CPU architecture. They will not run on other OSes, at least not without a compatible framework in place (e.g. Mono on Linux).
Back in the day a lot of code used to be written on bare metal using macro assemblers, but that's pretty much unheard of on PCs today. (And there was even a time before macro assemblers.)

How to benchmark Linux threaded programs?

I'm trying to compare the performance of threaded programs (on Linux). Since the programs use different thread synchronization methods and different lock granularity, running the programs on a shared server or desktop would not be good, since the other tasks may interfere with the scheduling of my programs. I don't have dedicated hosts, so I thought that using qemu would be a good option.
What I want to know is:
Are there any alternatives for this task?
I suppose there is no way to reproduce the scheduling done by the guest Linux system on qemu, if I need to? (Suppose my program runs unusually slow or fast; I'd like to know if I can run it again while keeping exactly the same scheduling for its threads.) Or is there a way?

find out how many instructions made in the core

I am running an Intel hyper-threading system with Linux, and I would like to find out if there is a way to know how many instructions (actual work) the core (or the virtual core, if it can be done) executed over a period of time.
Is there any register that can tell me how many instructions were executed?
You can install OProfile (http://oprofile.sourceforge.net/).
When using it, you start sampling, then stop after a while.
You can then get a report of various CPU counters, one of which is the number of instructions.

multicore application cygwin

If I run a parallelized application (using e.g. OpenMP) on a Windows multicore machine within Cygwin, do I get the full multicore performance the Windows machine offers, or should I expect a significant speed reduction due to the Cygwin layer?
Any experiences?
I know this is an old question, but in light of my recent findings about a Cygwin bug affecting multithreaded apps on multicore CPUs (see my bug report on the Cygwin mailing list), I just want to point out that multithreaded applications on Cygwin are a no-go. In my case, a multithreaded application on a dual-core machine runs 8x slower than if you force it to run on a single core (by setting the CPU affinity mask).
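For readers unfamiliar with affinity masks, the sketch below shows the general idea on Linux using Python's `os.sched_getaffinity` and `os.sched_setaffinity`. This is an assumption-laden illustration, not the method used in the answer above: it does not cover Cygwin or native Windows (where a different API is involved), and the helper name `pin_to_single_cpu` is hypothetical.

```python
import os


def pin_to_single_cpu() -> set:
    """Restrict the caller to one of the CPUs it is currently allowed to use."""
    if not hasattr(os, "sched_setaffinity"):
        # These calls are not available on every platform; do nothing there.
        return set()
    allowed = os.sched_getaffinity(0)   # 0 refers to the caller
    target = {min(allowed)}             # pick the lowest-numbered allowed CPU
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


original = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else set()
pinned = pin_to_single_cpu()
print("allowed before:", sorted(original), "after:", sorted(pinned))

if original:
    # Restore the original mask so later work can use all CPUs again.
    os.sched_setaffinity(0, original)
```

On Linux the mask applies to the calling thread and is inherited by threads started afterwards, so pinning should happen before worker threads are created if the whole program is meant to stay on one core.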
