Multicore application performance under Cygwin

If I run a parallelized application (using e.g. OpenMP) on a Windows multicore machine within Cygwin, do I get the full multicore performance the Windows machine offers, or should I expect a significant speed reduction due to the Cygwin layer?
Any experiences?

I know this is an old question, but in light of my recent findings about a Cygwin bug affecting multithreaded apps on multicore CPUs (see my bug report on the Cygwin mailing list), I just want to point out that multithreaded applications on Cygwin are a no-go. In my case, a multithreaded application on a dual core runs 8x slower than if you force it to run on a single core (by setting the CPU affinity mask).
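For anyone wanting to reproduce that comparison, here is a minimal sketch (mine, not the original poster's) that pins the process to a single core via the Win32 affinity mask and times a placeholder OpenMP loop; it assumes a Cygwin or MinGW gcc with -fopenmp and the Win32 headers available.

```c
/* Sketch: pin the current process to core 0 on Windows and time a small
 * OpenMP workload, so single-core vs. multi-core runs can be compared.
 * The workload itself is only a placeholder. */
#include <windows.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Restrict the whole process to CPU 0 (affinity mask 0x1).
     * Comment this out to let the threads use all cores. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1))
        fprintf(stderr, "SetProcessAffinityMask failed\n");

    double t0 = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; ++i)
        sum += 1.0 / (i + 1);

    printf("threads=%d sum=%f time=%.3fs\n",
           omp_get_max_threads(), sum, omp_get_wtime() - t0);
    return 0;
}
```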

Related

Best way to simulate old, slow processor on modern hardware?

I really like the idea of running and optimizing my software on old hardware, because you can viscerally feel when things are slower (or faster!). The most obvious way to do this is to buy an old system and literally use it for development, but that would slow down my IDE, my compiler, and every other development task, which is less helpful and (possibly) unnecessary.
I want to be able to:
Run my application at various levels of performance, on demand
At the same time, run my IDE, debugger, compiler at full speed
On a single system
Nice to have:
Simulate real, specific old systems, with some accuracy
Similarly throttle memory speed, and size
Optionally run my build system slowly
Try QEMU in full system emulation mode, but keep in mind that it uses more CPU resources.
https://stuff.mit.edu/afs/sipb/project/phone-project/OldFiles/share/doc/qemu/qemu-doc.html
QEMU has two operating modes:
Full system emulation. In this mode, QEMU emulates a full system (for example a PC), including one or several processors and various peripherals. It can be used to launch different Operating Systems without rebooting the PC or to debug system code.
User mode emulation (Linux host only). In this mode, QEMU can launch Linux processes compiled for one CPU on another CPU.
The supported architectures can be seen here:
https://wiki.qemu.org/Documentation/Platforms

Multithreaded applications on different CPUs

Suppose, for example, there is an embedded application that runs on a single-core CPU, and that application is then ported to a multi-core CPU. Would that app run on a single core or on multiple cores?
To be more specific, I am interested in ARM CPUs (but not only) and in toolchain specifics, e.g. the standard C/C++ libraries.
The intention of this question is this: is it the CPU's responsibility to "decide" to execute on multiple cores, or is it that of the compiler toolchain, the developer, and the standard platform-specific libraries? And again, I am also interested in the tendencies of other systems out there.
There are plenty of applications and RTOSes (for example Linux) that run on different CPUs of the same architecture, so does that mean they are compiled differently?
Generally speaking, single-threaded code will always run on one core. To take advantage of multiple cores you need to have either multiple processes, multiple threads, or both.
There's nothing your compiler can do to help you here. This is an architectural consideration.
If you have multiple threads, for example, most multi-core systems will run them on whatever cores are available if the operating system you're running is properly compiled to support that. Running an OS that's been compiled single-core only will obviously limit your options here.
A single threaded program will run in one thread. It is theoretically possible for the thread to be scheduled to move to a different core, but the scheduler cannot turn a single thread into multiple threads and give you any parallel processing.
EDIT
I misunderstood your question. If there are multiple threads in the application, and that application is binary compatible with the new multicore CPU, the threads will indeed be scheduled to run on different CPUs, if the OS scheduler deems it appropriate.
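As a small illustration (not part of the original answer), the sketch below creates a few POSIX threads; the program itself never chooses a core, and whether the threads land on separate cores is entirely up to the OS scheduler.

```c
/* Sketch: the application only creates threads; the OS scheduler decides
 * which cores they run on. Nothing here "asks" for a particular core. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    double x = 0.0;
    for (long i = 0; i < 50000000L; ++i)   /* placeholder CPU-bound work */
        x += i * 0.5;
    printf("thread %ld done (x=%f)\n", id, x);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```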
Well, it all depends on whether the software wants to utilize the other cores or not (if present). Let's take the example of Linux on ARM's Cortex-A53.
Initially a vendor-provided boot loader runs, the FSBL (first-stage boot loader). It then passes control to ARM Trusted Firmware (ATF). ATF then runs U-Boot. All of these run on a single core. U-Boot then loads the Linux kernel and passes control to it. Linux initializes some things and looks at some options, first checking the bootargs for the smp or nosmp flags. If smp, it gets the number of CPUs assigned to it from the device tree and then, using SMC calls to ATF, starts the other cores and assigns work to them to provide a true multiprocessing environment. This is normally called load balancing, and in Linux it is mostly done in the fair.c file.
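To make the "it depends on the software" point concrete, here is a hedged userspace sketch of my own: it asks the Linux kernel how many cores are online and, optionally, pins the calling thread to one core. The scheduler handles core placement either way.

```c
/* Sketch: query the number of online cores and pin the calling thread to
 * core 0. The pinning is optional; by default the Linux scheduler spreads
 * runnable threads over whatever cores the kernel has brought online. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online CPUs: %ld\n", ncpus);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                       /* core 0 only */
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "pthread_setaffinity_np failed\n");

    return 0;
}
```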

virtual machine or dual boot when measuring code performance

I am trying to measure code performance (basically the speed-up when using threads). So far I have been using Cygwin on Windows, or Linux on a separate machine. Now I have the ability to set up a new system, and I am not sure whether I should dual boot (Windows and Ubuntu) or use a virtual machine.
My concern is whether I can measure reliable speed-ups and possibly other things (performance monitors) in a Linux virtual machine, or whether I have to boot natively into Linux.
Does anybody have an opinion?
If your "threading" relies heavily on scheduling, I won't recommend you to use VM. VM is just a normal process from the host OS's point of view, so the guest kernel and its scheduler will be affected by scheduling by the host kernel.
If your "threading" is more like parallel computation, I think it's OK to use VM.
For me, it is much safer to boot directly into the system and avoid using a VM in your case. Even without a VM it is already hard to get the same results twice in multithreading, because the system is also busy with OS tasks; having two OSes running at the same time, as with a VM, only increases the uncertainty of the results. For instance, running your tests 1000 times on a VM might lead to, let's say, 100 over-estimated timings, while it might be only 60 on a dedicated OS. It is your call whether that uncertainty is acceptable or not.
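Whichever setup you choose, a small repeatable harness helps quantify that scatter. The sketch below (the workload and thread counts are arbitrary choices of mine, not from the question) times the same OpenMP loop at several thread counts and prints the speed-up, so you can run it many times under both a native boot and a VM and compare the spread.

```c
/* Sketch: time the same workload at 1, 2 and 4 threads and print the
 * speed-up relative to the single-threaded run. Repeat the whole program
 * many times to see how much the timings scatter on a given setup. */
#include <omp.h>
#include <stdio.h>

static double work(void)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= 200000000L; ++i)
        sum += 1.0 / i;
    return sum;
}

int main(void)
{
    int counts[] = {1, 2, 4};
    double base = 0.0;
    for (int k = 0; k < 3; ++k) {
        omp_set_num_threads(counts[k]);
        double t0 = omp_get_wtime();
        work();
        double dt = omp_get_wtime() - t0;
        if (counts[k] == 1) base = dt;
        printf("threads=%d time=%.3fs speedup=%.2f\n",
               counts[k], dt, base / dt);
    }
    return 0;
}
```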

cilk++ on linux system

I had some problems with a Cilk++ program that works well on Windows but not on Linux:
on Windows, the execution time decreases as the number of threads increases,
but on Linux, the execution time increases as the number of threads increases.
I am using Ubuntu Linux, kernel 2.6.35-22-generic, x86_64 GNU/Linux.
I can't understand the source of the problem, so can someone help me please?
Without sources, there's no way to know. There may be a resource that has a per-thread implementation on Windows and a shared implementation on Linux.
I'd recommend using a performance analyzer like Intel's VTune/Amplifier to figure out where your application is spending its time.
- Barry Tannenbaum
Intel Cilk Plus Runtime Development
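For completeness, here is a hedged Cilk Plus sketch of my own (it assumes a compiler with Cilk Plus support, e.g. the Intel compiler or an older GCC with -fcilkplus) showing how the worker count is usually set at run time, which makes it easy to collect the time-vs-workers numbers on both systems before digging in with VTune.

```c
/* Sketch: a trivial cilk_for workload whose worker count is set at run time,
 * so execution time can be measured for 1, 2, 4, ... workers on each OS. */
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Number of workers from the command line, e.g. "./a.out 4". */
    if (argc > 1)
        __cilkrts_set_param("nworkers", argv[1]);

    long n = 100000000L;
    double *a = malloc(n * sizeof *a);
    if (!a)
        return 1;

    cilk_for (long i = 0; i < n; ++i)   /* parallel loop, placeholder work */
        a[i] = i * 0.5;

    printf("workers=%d a[42]=%f\n", __cilkrts_get_nworkers(), a[42]);
    free(a);
    return 0;
}
```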

How well does Valgrind handle threads and machine-level synchronization instructions?

I have a highly parallel Windows program that uses lots of threads, hand-coded machine synchronization instructions, and home-rolled parallel-safe storage allocators. Alas, the storage management has a hole (not a synchronization hole in the allocators, I'm pretty sure) and I'd like to find it.
Valgrind has been suggested as a good tool for finding storage management errors.
Any experience here with Valgrind used under these circumstances?
Valgrind does not run on Windows, but it does work with Windows programs running under Wine on Linux. If your program will run under Wine, it has a decent chance of working with valgrind. See winehq.org for details.
The latest version is pretty good at handling all the 32-bit x86 instructions. It can handle programs that create many threads; just don't expect them to run simultaneously under Valgrind. It will run only one thread at a time, as if on a single-core machine.
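As an illustration of the kind of storage-management hole Memcheck reports (this example is mine, not from the question or answer), an off-by-one heap overrun like the one below shows up as an "Invalid write" with a stack trace when the program is run under valgrind:

```c
/* Sketch: a classic off-by-one heap overrun. Under Valgrind's Memcheck this
 * is reported as an "Invalid write of size ..." pointing at the offending
 * line, even in a multithreaded program. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(16);
    memset(buf, 0, 17);   /* writes one byte past the end of the allocation */
    free(buf);
    return 0;
}
```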
