Do performance stats like Geekbench represent general multi-tasking performance? - multithreading

I am trying to compare how a dual-core 2.7 GHz i7 would perform against a quad-core 2.0 GHz i7 in a multitasking environment. The quad core scores around 9000 on Geekbench, while the dual core comes in at around 7500. At the same time, Geekbench explicitly states that its tests show the full performance potential of all the cores. However, in real-world, everyday use, almost none of the applications I would be running are multi-threaded (Ruby runtime, Java IDE, Windows VM on a Mac, app server).
This machine would serve as a web development machine. Which CPU would be most "snappy" in terms of response time in this use case?

Results of a benchmark have any practical meaning only if the benchmark very closely approximates your typical workload.
You should consider whether your typical development environment regularly calls for parallelism. For example, if I develop a C/C++/Java app, it's common for a change to a header file (or Java source) to cause several other files to be recompiled and a few binaries to be relinked. That's a highly parallel workload, and a many-core CPU may prove advantageous.
On the other hand, if I'm changing a few Python or JavaScript sources, I doubt I will create any parallel workload when I try to execute and test the changes.
However, these are theoretical considerations.
I don't think the speed of the machine is a bottleneck in any development effort. The human is.
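To put rough numbers on the parallelism question above, here is a minimal sketch based on Amdahl's law. It is a toy model, not a prediction: it assumes performance scales linearly with clock speed and ignores IPC, turbo boost, caches and memory bandwidth. The function names and the chosen parallel fractions are illustrative assumptions; only the clock speeds and core counts come from the question.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_throughput(clock_ghz: float, cores: int, parallel_fraction: float) -> float:
    """Crude score: clock speed times Amdahl speedup (assumed model, see lead-in)."""
    return clock_ghz * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        dual = relative_throughput(2.7, 2, p)  # dual core at 2.7 GHz
        quad = relative_throughput(2.0, 4, p)  # quad core at 2.0 GHz
        winner = "dual" if dual > quad else "quad"
        print(f"parallel fraction {p:.2f}: dual={dual:.2f} quad={quad:.2f} -> {winner}")
```

Under this simplified model the quad core only pulls ahead once roughly two thirds of the work runs in parallel, which is consistent with the advice to look at how parallel your own workload really is.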

Related

OS specific build performance in Java

We are currently evaluating our next-generation company-wide developer PC configuration and have noticed something really weird.
Our rather large monolith has, on our current configuration, a build time of approx. 4.5 minutes (no tests, just compile).
For our next-generation configuration we upgraded several components: a moderate increase in processor frequency and IPC, double the number of CPU cores, and a switch from a small SATA SSD to an NVMe SSD rated at >3 GB/s. The next-generation configuration also switches from Windows 7 to Windows 10.
When we ran the first tests, we saw an almost identical build time (4.3 minutes), far less of an improvement than we expected.
During our experiments we tried running the build from within a Linux virtual machine on the Windows host. On the old configuration (Windows 7) build time dropped from 4.5 to ~3.7 minutes; on the Windows 10 host it dropped from 4.3 to 2.3 minutes. We have ruled out things like virus scanning.
We were rather astonished by these results and have tried to find an explanation other than the almost-religious and insulting statements about different operating systems.
So the question is: what could we have done wrong in configuring the Windows machine such that it runs at almost half the speed of a Linux system virtualized on the very same Windows host? Especially as all the hardware advancements seem to be eaten up by the switch from Windows 7 to 10.
Another question is: how can we make the javac process use more cores? Right now, using HotSpot JDK 8, we see at most two cores really used by the build. I've read about sjavac, but that seems to be a rather experimental feature only available from OpenJDK 9 onward, right?
After almost a year of experimenting we came to the conclusion that NTFS is indeed the culprit. If you use an NTFS user partition with a Linux host, you get results roughly similar to an all-Windows setup.
We benchmarked Gradle builds, Eclipse internal builds, WildFly startup, and database-centered tests on multiple devices. All our benchmarks consistently showed a speedup of at least 100% when switching from Windows to Linux (in real-world benchmarks Windows sometimes takes 3x as long as Linux, and some artificial benchmarks had a speedup of 60!). Especially on notebooks we experienced much less noise, as the combined processor load of a complete build is substantially lower than with Windows.
Our conclusion was to switch from Windows to Linux, which we did over the course of the last year.
Regarding parallelisation, we realized the cause was a form of code entanglement. Resolving it helped Gradle and javac parallelise the build a lot (also have a look at Gradle composite builds).
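If you want to check the filesystem explanation on your own machines, a minimal sketch of a small-file microbenchmark follows. It assumes that a compile is dominated by creating and reading many small files, which is a guess on my part and not something measured in the post above. The function name, file count and file size are hypothetical; run it against a directory on each filesystem you want to compare.

```python
import os
import tempfile
import time


def small_file_benchmark(directory: str, count: int = 2000, size: int = 1024) -> float:
    """Create, read back, and delete `count` small files; return elapsed seconds."""
    payload = b"x" * size
    paths = []
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"f{i}.tmp")
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)
    for path in paths:
        with open(path, "rb") as handle:
            handle.read()
    for path in paths:
        os.remove(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The temporary directory lives wherever the OS puts temp files; pass a
    # directory on the partition under test to compare filesystems.
    with tempfile.TemporaryDirectory() as scratch:
        elapsed = small_file_benchmark(scratch)
        print(f"{elapsed:.3f} s for 2000 small files in {scratch}")
```

Results from a script like this are only indicative: OS caching, antivirus hooks and the storage device all influence the numbers, so repeat the run several times before drawing conclusions.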

firmware development on single core vs multi core processor

Say I'm developing firmware for a smart thermostat in someone's home. The current implementation is a multi-threaded solution running on a single-core processor (let's just say Cortex-M, since that's what I'm familiar with) and I'm using an off-the-shelf RTOS.
If I take that project and move/port it over to a dual/multi core processor, how does that work? Do I just tell the RTOS which threads should run on each core and the RTOS manages it all from there? Is there a certain amount of refactoring that needs to be done on each thread so that it works more efficiently in a multi core environment? Or does the RTOS just take whatever thread is in the READY state and run that task on a core with free time available?
Generally speaking, the fact that you're running on a multi-core machine shouldn't matter. It's up to the OS to schedule threads onto available cores. Of course, your RTOS needs to support the multi-core platform!
There's a gotcha: if your code doesn't handle concurrency properly, and especially if it doesn't handle memory barriers properly, you might run into bugs that were hidden by the fact that it all ran serially on one core. Once you toss a second core into the mix, any such bugs tend to surface, but usually they do it first during an important demo or after release. So design your code so that it will be concurrency-bug-free by construction.
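One way to get "concurrency-bug-free by construction" is to let threads communicate only through a thread-safe queue, so that no mutable state is shared. The sketch below illustrates the idea in Python rather than in C on an RTOS; real firmware would use the RTOS's own message-queue primitive instead. All names (sensor_task, control_task, run_thermostat) and the 20.0 degree setpoint are hypothetical.

```python
import queue
import threading


def sensor_task(readings: queue.Queue, samples: list) -> None:
    """Producer: publish each temperature sample, then a sentinel to signal the end."""
    for value in samples:
        readings.put(value)
    readings.put(None)


def control_task(readings: queue.Queue, setpoint: float, decisions: list) -> None:
    """Consumer: the only thread that ever writes to `decisions`."""
    while True:
        value = readings.get()
        if value is None:  # sentinel: the producer is done
            break
        decisions.append("heat_on" if value < setpoint else "heat_off")


def run_thermostat(samples: list, setpoint: float = 20.0) -> list:
    """Run both tasks to completion and return the control decisions in order."""
    readings = queue.Queue()
    decisions = []
    producer = threading.Thread(target=sensor_task, args=(readings, samples))
    consumer = threading.Thread(target=control_task, args=(readings, setpoint, decisions))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return decisions
```

Because each piece of data has exactly one owner at a time, the result does not depend on whether the two tasks run interleaved on one core or truly in parallel on two.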

Why so many applications allocate incredibly large amount of virtual memory while not using any of it?

I've been watching some weird phenomena in programming for quite some time, since overcommit is enabled by default on Linux systems.
It seems to me that pretty much every high-level application (e.g. an application written in a high-level programming language like Java, Python or C#, including some desktop applications written in C++ that use large libraries such as Qt) uses an insane amount of virtual memory. For example, it's normal for a web browser to allocate 20 GB of RAM while using only 300 MB of it, or for a desktop environment, a MySQL server, and pretty much every Java or Mono application to allocate tens of gigabytes of RAM.
Why is that happening? What is the point? Is there any benefit in this?
I noticed that when I disable overcommit in Linux on a desktop system that actually runs a lot of these applications, the system becomes unusable, as it doesn't even boot up properly.
Languages that run their code inside virtual machines (like Java (*), C# or Python) usually assign large amounts of (virtual) memory right at startup. Part of this is necessary for the virtual machine itself, part is pre-allocated to parcel out to the application inside the VM.
With languages executing under direct OS control (like C or C++), this is not necessary. You can write applications that dynamically use just the amount of memory they actually require. However, some applications / frameworks are still designed in such a way that they request a large chunk of memory from the operating system once, and then manage the memory themselves, in hopes of being more efficient about it than the OS.
There are problems with this:
It is not necessarily faster. Most operating systems are already quite smart about how they manage their memory. Rule #1 of optimization: measure, optimize, measure.
Not all operating systems have virtual memory. There are some quite capable ones out there that cannot run applications that are so "careless" in assuming that you can allocate lots & lots of "not real" memory without problems.
You already found out that if you turn your OS from "generous" to "strict", these memory hogs fall flat on their noses. ;-)
(*) Java, for example, cannot grow its VM beyond the maximum size fixed when it starts. That maximum is given as a parameter (-Xmxn). Thinking "better safe than sorry" leads to severe overallocations by certain people / applications.
These applications usually have their own method of memory management, which is optimized for their own usage and is more efficient than the default memory management provided by the system. So they allocate a huge memory block to skip or minimize the effect of the memory management provided by the system or libc.
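The gap between reserved and used memory described in these answers can be reproduced in a few lines. This is a minimal sketch, assuming Linux for the /proc/self/status readout (the mapping itself is portable); the function names and sizes are hypothetical. It maps a large anonymous region but writes to only a handful of pages, so virtual size grows far more than resident size.

```python
import mmap
import os

PAGE = mmap.PAGESIZE


def reserve_and_touch(reserve_bytes: int, pages_to_touch: int) -> mmap.mmap:
    """Map `reserve_bytes` of anonymous memory but write to only a few pages."""
    region = mmap.mmap(-1, reserve_bytes)  # anonymous mapping, not backed by a file
    for i in range(pages_to_touch):
        region[i * PAGE] = 1  # the first write to a page makes the OS back it with RAM
    return region


def linux_memory_kib() -> dict:
    """Return VmSize and VmRSS in KiB from /proc/self/status (Linux only)."""
    fields = {}
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith(("VmSize:", "VmRSS:")):
                name, value = line.split(":")
                fields[name] = int(value.split()[0])
    return fields


if __name__ == "__main__":
    on_linux = os.path.exists("/proc/self/status")
    before = linux_memory_kib() if on_linux else {}
    region = reserve_and_touch(1 << 30, pages_to_touch=16)  # reserve 1 GiB, touch 16 pages
    after = linux_memory_kib() if on_linux else {}
    print("before:", before)
    print("after: ", after)
    region.close()
```

With overcommit disabled, reservations like this count against the commit limit even if they are never touched, which fits the observation that a desktop full of such applications stops working.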

CPU usage of Oracle installed Database machine

I am using Oracle 11g and I have an application built on the Spring framework. When I host the database on a Sun Fire X4170 running Linux, the machine's CPU utilization is around 80-100%. However, when I move the same database to a Sun M3000 server running a Unix OS (supposedly a more powerful machine), application performance goes down and CPU utilization remains at 90-100%. I can't figure out whether it's the application that is causing this utilization or the database design.
I should add that the database is not relational; things are handled by the application.
Well you certainly can find some interesting opinions on the intertubes.
Oracle does not have a true server
architecture (others have it).
Rather than performing classic server
tasks, such as multi-threading,
caching of data pages, parallel
processing (split a query across many
devices) etc. within itself, it uses
the o/s to do all that. That means for
each user process (PL/SQL connection)
there is one unix process; 1000 users
means 1000 unix processes, all
competing for the same resources.
You might note that Oracle has had
a connection pooling architecture (multi-threaded server) since version 7 (1992).
a cache for data pages (known helpfully as the buffer cache) since forever
parallel query (splitting a query across many processes) since version 7.1 (1993)
splitting queries across multiple servers since OPS (version 6) or across distributed databases (version 5)
It's also noteworthy that even if everything said there were correct rather than incorrect, it wouldn't actually help you determine the root cause.
Especially noteworthy, because it uses
file system files (not raw
partitions), and the "caching" is
outside, it relies heavily on (and is
very sensitive to) the file system
cache that you have set up. likewise,
Oracle needs a massive amount of
memory for these processes.
Oracle certainly can use raw partitions, again dating back to the last millennium. Moreover, if you wish to cache within the database (using the buffer cache that PerformanceDBA has forgotten about) and bypass the filesystem cache, that feature is available on all current filesystems. Oracle also supplies its own combined filesystem/volume manager, ASM, which you can use if you wish.
Oracle is also rather well instrumented (and if you have access to DTrace, so is Solaris). It can certainly tell you which sessions, processes, etc. are using the CPU and what the application's time in the database is consumed by (down to individual block read times if you care), so it is very amenable to profiling. I'd recommend that you check out Thinking Clearly about Performance, available at http://www.method-r.com/downloads/cat_view/38-papers-and-articles and written by one of the top Oracle performance experts in the world. If you have access to the Oracle Diagnostics Pack, then checking out first ADDM reports and then AWR reports would be worthwhile.
Trying to avoid a flame war here.
I should probably have separated the "how to find out" part of my response more clearly from my replies to PerformanceDBA's comments about server architecture. I share Stephanie's suspicions about the Spring framework, but without properly scoped measurement evidence there is no point in blaming any particular attribute of the environment; that would just be bias. Fortunately, the instrumentation built into the Oracle kernel allows you to trace and then profile the slow sessions to determine exactly where the issue lies. So I would do the following:
1) Enable tracing for a representative session (you can use the dbms_monitor package for that).
2) Also gather an execution plan for the statement(s) involved, using the gather_plan_statistics hint.
3) Profile the trace file by time using an appropriate profiler (tkprof, orasrp, Method R Profiler).
Investigate the problem statements in order of their contribution to response time.
If you can't carry out the above, you can use ADDM and/or AWR if licensed, as I originally suggested, or Statspack if you are not licensed for the Diagnostics Pack. ADDM naturally concentrates on time consumers; I suggest that if you are forced down the Statspack route you do the same.
The M3000 is certainly a more powerful machine, but it is more suitable for true servers. The X4170 with hyper-threads is more suited for file servers.
I'm not so certain about that. Have any data to support that claim?
An M3000 has one SPARC64 VII processor with 4 cores (tech specs), while an X4170 has 1 or 2 Intel 5500 "Nehalem-EP" processors, each with 4 cores (tech specs). I would expect much more from even a single-processor Nehalem-EP system than from the M3000. Obviously the data will vary slightly with the workload, but I know where I'd put my money.

Testing performance of parallel programs on a single core machine

I would like to start playing with concurrency in the programs I write (mostly for fun), but I don't own a multi-core system and can't afford one any time soon. I run Linux. Is there a way to, for example with a virtual machine, compare the performance of a multi-threaded implementation of a program with a single-threaded version, without actually running it on hardware with multiple processors or cores?
That is, I would like to be able to implement parallel algorithms and be able to say that, yes, this multithreaded implementation is better-performing than the single-threaded.
Thanks
You cannot test multithreaded programs reliably on a single-core machine. Race conditions will show up very differently, or even be totally hidden, on a single-core machine. Performance will decrease, and so on.
If you want to LEARN how to program with multiple threads, you can take the first steps on a single-core machine (i.e. how the API works, etc.). But you'll have to test on a multicore machine, and it's very likely that you will see faults there that you don't see on a single-core machine.
Virtual machines are, in my experience, no help with this. They introduce new bugs that didn't show up before, but they CAN'T simulate real concurrency with multiple cores.
Depending on what you're benchmarking you might be able to use an Amazon EC2 node. It's not free, but it's cheaper than buying a computer.
If you have only one core/CPU and your algorithm is CPU-intensive, you will probably find that the multi-threaded program is actually slower than the single-threaded one. But if your program does I/O in one thread and CPU work in another, for example, then the multi-threaded program can be faster.
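The I/O case is easy to try even on a single core. The sketch below is a minimal illustration, assuming that time.sleep is an acceptable stand-in for a blocking I/O call; the function names and task counts are hypothetical. It runs the same waits one after another and then in separate threads.

```python
import threading
import time


def fake_io(seconds: float) -> None:
    """Stand-in for a blocking I/O call; the thread waits without using the CPU."""
    time.sleep(seconds)


def run_sequential(tasks: int, seconds: float) -> float:
    """Run the waits one after another and return the elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(tasks):
        fake_io(seconds)
    return time.perf_counter() - start


def run_threaded(tasks: int, seconds: float) -> float:
    """Run each wait in its own thread and return the elapsed time in seconds."""
    start = time.perf_counter()
    workers = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(tasks)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(4, 0.25):.2f} s")
    print(f"threaded:   {run_threaded(4, 0.25):.2f} s")
```

The threaded version finishes sooner because the waits overlap. This says nothing about CPU-bound work, where the advice above still applies: on one core the extra threads only add switching overhead.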
To observe effects other than potentially improved locality, you'll need hardware or a simulator that actually models the communication/interaction that occurs when the program runs in parallel. There's no magic to be had.

Resources