Severe multi-threaded memory bottleneck after reaching a specific number of cores - multithreading

We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks from 10 - 12 threads, then drops off a cliff and is soon performing work at about the same rate it was with about 4 threads. The drop-off is fairly steep and by 16 - 20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and as multiple processes running single threads-- the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it is a cache issue. The evidence is as follows:
CPU usage continues to climb from 50 to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GBs, so we should be well below the threshold in which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the Windows kernel goes from taking about 6% of the CPU time with 12 threads to 80-90% of the CPU time when using 24 threads. The vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?

I'm not sure that I understand the issues completely such that I can offer you a solution but from what you've explained I may have some alternative view points which may be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place in that it can help you decide whether a large maximum value should be researched or not. Standard deviation only needs to be calculated when it is read out: until then it can be stored as its components (n, sum x, sum x^2). Unless you're timing very short sequences you can omit the preceding synchronizing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.
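A minimal sketch of such a statistics structure, written in Python for brevity rather than C. The class and method names are illustrative assumptions, not the code this answer refers to; it stores only n, sum x and sum x^2 (plus min and max) and derives the standard deviation when it is read out.

```python
import math


class TimingStats:
    """Accumulates timing samples as (n, sum x, sum x^2) plus min and max."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, x):
        # Cheap per-sample work: two additions, one multiply, two compares.
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def mean(self):
        return self.sum_x / self.n

    def stddev(self):
        # Population standard deviation, computed only when read out.
        variance = self.sum_x2 / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error
```

The sum-of-squares form can lose precision when the samples are large and nearly equal, so it is best suited to quick triage of outliers rather than precise statistics.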
When I program multi-threaded I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which require unnecessary memory access. Avoiding unnecessary memory access usually means as much inline code as possible and as little OS access as possible. To me the OS is a great unknown in terms of how much memory work a call to it will generate, so I try to keep calls to it to a minimum. In the same manner, but usually with a smaller performance impact, I try to avoid calling application functions: if they must be called I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually both optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections, and though they are fast, local spinlocks are faster. Since critical sections perform additional processing, they prevent application processing from being performed during that time. This is the implementation:
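inline void spinlock_init (SPINLOCK *slp)
{ slp->lock_part = 0; }   /* body missing in the source; assumed to clear the lock */
inline char spinlock_failed (SPINLOCK *slp)
{
    return (char) __xchg (&slp->lock_part, 1);   /* nonzero: the lock was already held */
}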
Or more elaborate (but not overly so):
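inline char spinlock_failed (SPINLOCK *slp)
{
    if (__xchg (&slp->lock_part, 1) == 1) return 1;   /* already held: caller retries */
    slp->count_part = 1;   /* assumed: not in the source as extracted, but spinlock_leave expects a nonzero count */
    return 0;
}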
And to release
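inline void spinlock_leave (SPINLOCK *slp)
{ slp->lock_part = 0; }   /* simple version; body missing in the source, assumed */
/* Counting version, for use with the more elaborate spinlock_failed: */
inline void spinlock_leave (SPINLOCK *slp)
{
    if (slp->count_part == 0) __breakpoint ();   /* leave without a matching acquire */
    if (--slp->count_part == 0) slp->lock_part = 0;
}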
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
I'm also a big fan of IOCPs for their efficiency in handling IO events and threads but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.

To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.


Low 'Average Physical Core Utilization' according to VTune when using OpenMP, not sure what the bigger picture is

I have been optimizing a ray tracer, and to get a nice speed up, I used OpenMP generally like follows (C++):
Accelerator accelerator; // Has the data to make tracing way faster
Rays rays; // Makes the rays so they're ready to go
#pragma omp parallel for
for (int y = 0; y < window->height; y++) {
    for (int x = 0; x < window->width; x++) {
        Ray& ray = rays.get(x, y);
        // ... (the rest of the loop body is not shown in the source)
    }
}
I gained 4.85x performance on a 6 core/12 thread CPU. I thought I'd get more than that, maybe something like 6-8x... especially when this eats up >= 99% of the processing time of the application.
I want to find out where my performance bottleneck is, so I opened VTune and profiled. Note that I am new to profiling, so maybe this is normal but this is the graph I got:
In particular, this is the 2nd biggest time consumer:
where the 58% is the microarchitecture usage.
Trying to solve this on my own, I went looking for information on this, but the most I could find was on Intel's VTune wiki pages:
Average Physical Core Utilization
Metric Description
The metric shows average physical cores utilization by computations of the application. Spin and Overhead time are not counted. Ideal average CPU utilization is equal to the number of physical CPU cores.
I'm not sure what this is trying to tell me, which leads me to my question:
Is this normal for a result like this? Or is something going wrong somewhere? Is it okay to only see a 4.8x speedup (compared to a theoretical max of 12.0) for something that is embarrassingly parallel? While ray tracing itself can be unfriendly due to the rays bouncing everywhere, I have done what I can to compact the memory and be as cache friendly as possible, use libraries that utilize SIMD for calculations, done countless implementations from the literature to speed things up, and avoided branching as much as possible and do no recursion. I also parallelized the rays so that there's no false sharing AFAIK, since each row is done by one thread so there shouldn't be any cache line writing for any threads (especially since ray traversal is all const). Also the framebuffer is row major, so I was hoping false sharing wouldn't be an issue from that.
I do not know if a profiler will pick up the main loop that is threaded with OpenMP and this is an expected result, or if I have some kind of newbie mistake and I'm not getting the throughput that I want. I also checked that it spawns 12 threads, and OpenMP does.
I guess tl;dr, am I screwing up using OpenMP? From what I gathered, the average physical core utilization is supposed to be up near the average logical core utilization, but I almost certainly have no idea what I'm talking about.
IMHO you're doing it right, and you are overestimating the efficiency of parallel execution. You did not give details about the architecture you're using (CPU, memory, etc.), nor the code, but to put it simply, I suppose that beyond a 4.8x speed increase you're hitting the memory bandwidth limit, so RAM speed is your bottleneck.
As you said, ray tracing is not hard to run in parallel and you're doing it right, so if the CPU is not 100% busy my guess is your memory controller is.
Supposing you're tracing a model (triangles? voxels?) that is in RAM, your rays need to read bits of the model when checking for hits. You should check your maximum RAM bandwidth, then divide it by 12 (threads), then divide it by the number of rays per second... and you will find that even 40 GB/s is "not so much" when you trace a lot of rays. That's why GPUs are a better option for ray tracing.
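The division described above can be sketched in a few lines of Python. The 40 GB/s and 12 threads come from this answer; the rays-per-second figure is a made-up placeholder, so measure your own before drawing conclusions.

```python
def bytes_per_ray(bandwidth_bytes_per_s, threads, rays_per_s_per_thread):
    """Memory traffic each ray can afford if bandwidth is split evenly across threads."""
    per_thread = bandwidth_bytes_per_s / threads
    return per_thread / rays_per_s_per_thread


# 40 GB/s and 12 threads are from the answer; 10 million rays/s per thread is hypothetical.
budget = bytes_per_ray(40e9, 12, 10e6)
print(f"{budget:.0f} bytes of RAM traffic per ray")
```

If a ray touches more acceleration-structure nodes and primitives than that budget allows, and they miss in cache, the memory controller rather than the cores sets the pace.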
Long story short, I suggest you try to profile memory usage.

Can a hyper-threaded processor core execute two threads at the exact same time?

I'm having a hard time understanding hyper-threading. If the logical core doesn't actually exist, what's the point of using hyper-threading? The Wikipedia article states that:
For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible.
If the two logical cores share the same execution unit, that means one of the threads will have to be put on hold while the other executes, that being said, I don't understand how hyper-threading can be useful, since you're not actually introducing a new execution unit. I can't wrap my head around this
See my answer on a softwareengineering.SE question for some details about how modern CPUs find and exploit instruction-level parallelism (ILP) by running multiple instructions at once. (Including a block diagram of Intel Haswell's pipeline, and links to more CPU microarchitecture details). Also see Modern Microprocessors: A 90-Minute Guide!
You have a CPU with lots of execution units and a front-end that can keep them mostly supplied with work to do, but only under good conditions. Stalls like cache misses or branch mispredicts, or just limited parallelism (e.g. a loop that does one long chain of FP additions, bottlenecking on FP latency at one (scalar or SIMD) add per 4 or 5 clocks instead of one or two per clock) will result in throughput of much less than 4 instructions per cycle, and leave execution units idle.
The point of HT (and Simultaneous Multithreading (SMT) in general) is to keep those hungry execution units fed with work to do, even when running code with low ILP or lots of stalls (cache misses / branch mispredicts).
SMT only adds a bit of extra logic to the pipeline so it can keep track of two separate architectural contexts at the same time. So it costs a lot less die area and power than having twice or 4x as many full cores. (Knights Landing Xeon Phi runs 4 threads per core, mainstream Intel CPUs run 2. Some non-x86 chips run 8 threads per core, aimed at database-server type workloads.) But of course having to divide out-of-order execution resources between logical threads often means the throughput gain is significantly below 2x or 4x, often far below, and for some workloads is negative.
Also related: What is the difference between Hyperthreading and Multithreading? Does AMD Zen use either? - AMD's SMT is basically the same as Intel's, just not using the trademark "Hyperthreading" for it. See also the other links in my answer there for an intro with diagrams to what SMT is all about. (Many members of the Alpha EV8 design team were hired by Intel after DEC folded, and went on to implement SMT in Netburst (Pentium 4), which Intel branded Hyperthreading.)
Common misconceptions
Hyperthreading is not just optimized context switching. Simpler designs that switch to the other thread on a cache miss are possible, but HT is more advanced than that. (Switch-on-stall, or round-robin "barrel processor").
With two threads active, the front-end alternates between threads every cycle (in the fetch, decode, and issue/rename stages), but the out-of-order back-end can actually execute uops from both logical cores in the same cycle. The issue/rename stage is 4 uops wide on Intel before Ice Lake.
In pipeline stages that normally alternate, any time one thread is stalled, the other thread gets all the cycles in that stage. HT is much better than just fixed alternating, because one thread can get lots of work done while the other is recovering from a branch mispredict or waiting for a cache miss.
Note that up to 10 or 12 cache misses can be outstanding at once (from L1D cache in Intel CPUs: this is the number of LFBs (Line Fill Buffers)), and memory requests are pipelined. But if the address for the next load depends on an earlier load (e.g. pointer chasing through a tree or linked list), the CPU doesn't know where to load from and can't keep multiple requests in flight. So it is actually useful for both threads to be waiting on cache misses in parallel.
Some resources are statically partitioned when two threads are active, some are competitively shared. See this pdf of slides for some details. (For more details about how to actually optimize asm for Intel and AMD CPUs, see Agner Fog's microarchitecture PDF.)
When one logical core "sleeps" (i.e. the kernel runs a HLT instruction, or MWAIT to enter a deeper sleep), the physical core transitions to single-thread mode and lets the still-active logical core have all the resources (including the full ReOrder Buffer size, and other statically-partitioned resources), so its ability to find and exploit ILP in the single thread still running increases more than when the other thread is simply stalled on a cache miss.
BTW, some workloads actually run slower with HT. If your working set barely fits in L2 or L1D cache, then running two threads on the same core will lead to a lot more cache misses. For very well-tuned high-throughput code that can already keep the execution units saturated (like an optimized matrix multiply in high-performance computing), it can make sense to disable HT. Always benchmark.
On Skylake, I've found that video encoding (with x265 -preset slower, 1080p) is about 15% faster with 8 threads instead of 4, on my quad-core i7-6700k. I didn't actually disable HT for the 4-thread test, but Linux's scheduler is good at not bouncing threads around and running threads on separate physical cores when there are enough to go around. A 15% speedup is pretty good considering that x265 has a lot of hand-written asm and runs very high instructions-per-cycle even when it has a whole core to itself. (Slower presets like I used tend to be more CPU-bound than memory-bound.)

Local and Global size influence on program execution - OpenCl

After reading a lot of definitions regarding global work size and local work size I still don't really understand what they are and how they work.
I think that global work size determine how many times kernel function will be called, but local work size?
I thought that local work size determine how many threads are gonna be used in the same time in parallel, but am I really correct?
Is local size a number of threads executing one kernel program per one global size value? I mean when we have global size = 1 and local size = 1, then kernel function will be called one time and only one thread will be working on it.
But when we have Global Size = 4096 and local size (if allowed that high) is 1024 then we have 4096 calls of kernel function and each call have 1024 threads working on it at the same time? Am I correct?
Here is some example code i found:
and my another question is: how local size change influence that code?
As i see it is clearly working on global_id's, no local one's so is local size change to bigger one than lets say 1 will influence time spent executing that algorithm?
And when we would have for loop in that algorithm, is it changing anything then regarding local size influence? Do we need to use local_id's to see any difference when changing local size?
I tested that on few of my programs, and even when I used only global_id's changing local work size gave me significantly shorter executing times.
So how does it work? I don't get it.
Thank you in advance!
"I thought that local work size determine how many threads are gonna be used in the same time in parallel, but am I really correct?"
Correct, but it is per compute unit, not for the whole device. If there are more compute units than local thread groups, then the device is not fully used. When there are more thread groups than compute units but not an exact multiple, some compute units wait for others at the end. When both values are equal (or an exact multiple), then "how many times" is important to fully occupy all ALUs.
For example a 8-core cpu could define 8 compute units(maybe +8 more with hardware multithreads). But a GPU with similar price can have 20 to 64 compute units. Then, even within a single compute unit, many groups of threads can be "in-flight" which is not explicitly tuned but changed by resource usage per thread and per compute unit and maybe per gpu.
"how local size change influence that code? As i see it is clearly working on global_id's, no local one's so is local size change to bigger one than lets say 1 will influence time spent executing that algorithm?"
Vectorizable/parallelizable kernel code can benefit from distributing threads to the ALUs and SIMD units of a core, or the wider SIMD units of a GPU compute unit. For a CPU, 8 scalar instructions could be issued at the same time. For a GPU, it could be as large as thousands. So when you decrease local size to 1, you limit the width of parallel thread issue to 1 ALU, which cripples performance on many architectures. When you make local size too big, resources per thread fall and performance takes a hit. If you don't have any idea, the OpenCL API can choose a local size for you if you pass NULL for that parameter.
"And when we would have for loop in that algorithm, is it changing anything then regarding local size influence? Do we need to use local_id's to see any difference when changing local size?"
For old and statically scheduled architectures, loop unrolling is advised with an unroll step equal to the basic SIMD width. No, local id is just a query of a thread's id within its work group, so there is no need to query it if you don't need it.
"I tested that on few of my programs, and even when I used only global_id's changing local work size gave me significantly shorter executing times. So how does it work?"
If kernel needs insane resources, you could think of 1 thread per local group. If kernel doesn't need any resource except immediate values, you should make it maximum local value. Resource allocation per thread(because of kernel codes) is important. New architectures have load balancing so it may not matter in future if you let api choose the optimum value.
To keep all ALUs busy, the scheduler issues many threads per core: when one thread is waiting for a memory operation, another thread can do an ALU operation at the same time. This is good when resource usage is small. When you use 50% of all resources of a compute unit, it can have only 2 threads in flight. Threads share sharable resources such as L1 cache, local memory, and the register file.
Code such as c[i]=a[i]+b[i] for scalar floats is vectorizable. You can get better performance using float8, float16 and similar vector types if the compiler is not already doing it in the background. This way fewer threads are needed to accomplish all the work, and memory accesses are also faster. You can also add a loop in the kernel to decrease the number of threads even more, which is good for a CPU since less thread dispatching is needed between 2 data blocks. For a GPU, it may not matter.
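To illustrate how wider vector types reduce the number of work-items, here is a plain-Python emulation (not OpenCL). The function name and the `width` parameter are hypothetical; the inner loop merely stands in for what a single float8 or float16 operation would do in hardware.

```python
def vector_add(a, b, width=1):
    """Emulate c[i] = a[i] + b[i] where each work-item handles `width` elements."""
    assert len(a) == len(b) and len(a) % width == 0
    c = [0.0] * len(a)
    work_items = len(a) // width
    for gid in range(work_items):      # one iteration per global id
        base = gid * width
        for lane in range(width):      # stands in for one float8/float16 operation
            c[base + lane] = a[base + lane] + b[base + lane]
    return c, work_items
```

With 16 elements, `width=1` needs 16 work-items while `width=8` needs only 2; the result array is the same either way.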
Trivial example for a CPU:
4 core, local size = 10, global size = 100
core 1 and 2 have 3 thread groups each. Core 3 and 4 have only 2 thread groups.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
while instruction pipelining doesn't have much bubbles for cores 1 and 2, bubbles start after some time for cores 3 and 4 so they can be used for other jobs such as a second kernel running in parallel or operating system or some array copying. When you use all cores equally such as for 120 threads, then they finish more work per second but CPU cannot do array copies if kernels already using memory.(unless OS does preemption for other threads)
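The trivial CPU example can be reproduced with a short Python sketch. It assumes work-groups are dealt out round-robin, which is a simplification: real OpenCL runtimes schedule groups however they see fit. The function name is illustrative.

```python
def distribute_groups(global_size, local_size, compute_units):
    """Return the number of threads each compute unit receives.

    Assumes round-robin assignment of work-groups to compute units.
    """
    if global_size % local_size != 0:
        raise ValueError("this sketch needs global size to be a multiple of local size")
    groups = global_size // local_size
    threads = [0] * compute_units
    for g in range(groups):
        threads[g % compute_units] += local_size
    return threads


print(distribute_groups(100, 10, 4))  # the uneven case from the example
print(distribute_groups(120, 10, 4))  # an exact multiple: every core gets the same load
```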

Linux: CPU benchmark requiring longer time and different CPU utilization levels

For my research I need a CPU benchmark to do some experiments on my Ubuntu laptop (Ubuntu 15.10, Memory 7.7 GiB, Intel Core i7-4500U CPU @ 1.80GHz x 4, 64-bit). In an ideal world, I would like to have a benchmark satisfying the following:
1) The benchmark should be an official one rather than one I created myself, for transparency purposes.
2) The time needed to execute the benchmark on my laptop should be at least 5 minutes (the more the better).
3) The benchmark should result in different levels of CPU utilization throughout execution. For example, I don't want a benchmark which permanently keeps the CPU utilization level at around 100% - so I want a benchmark which will make the CPU utilization vary over time.
Especially points 2 and 3 are really key for my research. However, I couldn't find any suitable benchmarks so far. Benchmarks I found so far include: sysbench, CPU Fibonacci, CPU Blowfish, CPU Cryptofish, CPU N-Queens. However, all of them just need a couple of seconds to complete and the utilization level on my laptop is at 100% constantly.
Question: Does anyone know about a suitable benchmark for me? I am also happy to hear any other comments/questions you have. Thank you!
To choose a benchmark, you need to know exactly what you're trying to measure. Your question doesn't include that, so there's not much anyone can tell you without taking a wild guess.
If you're trying to measure how well Turbo clock speed works to make a power-limited CPU like your laptop run faster for bursty workloads (e.g. to compare Haswell against Skylake's new and improved power management), you could just run something trivial that's 1 second on, 2 seconds off, and count how many loop iterations it manages.
The duty cycle and cycle length should be benchmark parameters, so you can make plots. e.g. with very fast on/off cycles, Skylake's faster-reacting Turbo will ramp up faster and drop down to min power faster (leaving more headroom in the bank for the next burst).
The speaker in that talk (the lead architect for power management on Intel CPUs) says that Javascript benchmarks are actually bursty enough for Skylake's power management to give a measurable speedup, unlike most other benchmarks which just peg the CPU at 100% the whole time. So maybe have a look at Javascript benchmarks, if you want to use well-known off-the-shelf benchmarks.
If rolling your own, put a loop-carried dependency chain in the loop, preferably with something that's not too variable in latency across microarchitectures. A long chain of integer adds would work, and Fibonacci is a good way to stop the compiler from optimizing it away. Either pick a fixed iteration count that works well for current CPU speeds, or check the clock every 10M iterations.
Or set a timer that will fire after some time, and have it set a flag that you check inside the loop. (e.g. from a signal handler). Specifically, alarm(2) may be a good choice. Record how many iterations you did in this burst of work.
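A rough sketch of the roll-your-own approach, in Python to keep it short. It polls the clock every `check_every` iterations instead of using alarm(2), masks the Fibonacci values to 64 bits so each step stays fixed-cost, and takes the on/off times as parameters. All names are illustrative; interpreter overhead dominates here, so a real measurement would want the same structure in C.

```python
import time

MASK = (1 << 64) - 1  # keep values at a fixed 64-bit width


def burst(on_seconds, check_every=10_000):
    """Run a Fibonacci dependency chain for about on_seconds; return iterations done."""
    a, b = 0, 1
    iterations = 0
    deadline = time.perf_counter() + on_seconds
    while True:
        for _ in range(check_every):
            a, b = b, (a + b) & MASK  # each step depends on the previous one
        iterations += check_every
        if time.perf_counter() >= deadline:
            return iterations


def run(on_seconds, off_seconds, cycles):
    """Alternate busy and idle phases; return the iteration count of each burst."""
    counts = []
    for _ in range(cycles):
        counts.append(burst(on_seconds))
        time.sleep(off_seconds)
    return counts
```

For example, `run(1.0, 2.0, 100)` gives the 1 second on, 2 seconds off pattern for five minutes; plotting the per-burst counts against different duty cycles shows how quickly the clock ramps.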

Where is the point at which adding additional cores or CPUs doesn’t improve the performance at all?

"Adding a second core or CPU might increase the performance of your parallel program, but it is unlikely to double it. Likewise, a four-core machine is not going to execute your parallel program four times as quickly, in part because of the overhead and coordination described in the previous sections. However, the design of the computer hardware also limits its ability to scale. You can expect a significant improvement in performance, but it won't be 100 percent per additional core, and there will almost certainly be a point at which adding additional cores or CPUs doesn't improve the performance at all."
I read the paragraph above from a book. But I don't get the last sentence.
So, where is the point at which adding additional cores or CPUs doesn't improve the performance at all?
If you take a serial program and a parallel version of the same program then the parallel program has to do some operations that the serial program does not, specifically operations concerned with coordinating the operations of the multiple processors. These contribute to what is often called 'parallel overhead' -- additional work that a parallel program has to do. This is one of the factors that makes it difficult to get 2x speed-up on 2 processors, 4x on 4 or 32000x on 32000 processors.
If you examine the code of a parallel program you will often find segments which are serial, that is which only use one processor while the others are idle. There are some (fragments of) algorithms which are not parallelisable, and there are some operations which are often not parallelised but which could be: I/O operations for instance, to parallelise these you need some sort of parallel I/O system. This 'serial fraction' provides an irreducible minimum time for your computation. Amdahl's Law explains this, and that article provides a useful starting point for your further reading.
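As a quick illustration of how the serial fraction caps the speed-up, here is Amdahl's Law as a small Python function. The 5% serial fraction in the example is an arbitrary assumption chosen only to show the shape of the curve.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speed-up for a fixed-size problem (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# With a hypothetical 5% serial fraction the speed-up can never reach 1 / 0.05 = 20.
for p in (2, 4, 12, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
```

Note that this bound ignores parallel overhead entirely, so real programs usually fall short of it.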
Even when you do have a program which is well parallelised the scaling (ie the way speed-up changes as the number of processors increases) does not equal 1. For most parallel programs the size of the parallel overhead (or the amount of processor time which is devoted to operations which are only necessary for parallel computing) increases as some function of the number of processors. This often means that adding processors adds parallel overhead and at some point in the scaling of your program and jobs the increase in overhead cancels out (or even reverses) the increase in processor power. The article on Amdahl's Law also covers Gustafson's Law which is relevant here.
I've phrased this all in very general terms, no consideration of current processor and computer architectures; what I am describing are features of parallel computation (as currently understood) not of any particular program or computer.
I flat out disagree with @Daniel Pittman's assertion that these issues are of only theoretical concern. Some of us are working very hard to make our programs scale to very large numbers of processors (1000s). And almost all desktop and office development these days, and most mobile development too, targets multi-processor systems and using all those cores is a major concern.
Finally, to answer your question, at what point does adding processors no longer increase execution speed, now that is an architecture- and program-dependent question. Happily, it is one that is amenable to empirical investigation. Figuring out the scalability of parallel programs, and identifying ways of improving it, are a growing niche within the software engineering 'profession'.
@High Performance Mark is right. This happens when you are trying to solve a fixed-size problem in the fastest possible way, so that Amdahl's law applies. It does not (usually) happen when you are trying to solve a problem in a fixed time. In the latter case, you are willing to use the same amount of time to solve a problem
whose size is bigger;
whose size is exactly the same as before, but with a greater accuracy.
In this situation, Gustafson's law applies.
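For comparison with the fixed-size case, Gustafson's law gives the scaled speed-up when the problem grows with the machine. A minimal sketch, where `serial_fraction` is the share of the parallel run's time spent in serial code and the 5% value is an arbitrary assumption:

```python
def gustafson_speedup(serial_fraction, processors):
    """Scaled speed-up for a fixed-time problem (Gustafson's law)."""
    return processors - serial_fraction * (processors - 1)


# A hypothetical 5% serial share still allows near-linear scaled speed-up.
print(gustafson_speedup(0.05, 12))
```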
So, let's go back to fixed size problems.
In the speedup formula you can distinguish these components:
Inherently sequential computations: σ(n)
Potentially parallel computations: ϕ(n)
Overhead (Communication operations etc): κ(n,p)
and the speedup for p processors for a problem size n is bounded by
$$\psi(n,p) \le \frac{\sigma(n) + \phi(n)}{\sigma(n) + \phi(n)/p + \kappa(n,p)}$$
Adding processors reduces the computation time but increases the communication time (for message-passing algorithms; for shared-memory algorithms it increases the synchronization overhead etc.); if we continue adding more processors, at some point the communication time increase will be larger than the corresponding computation time decrease.
When this happens, the parallel execution time begins to increase.
Speedup is inversely proportional to execution time, so that its curve begins to decline.
For any fixed problem size, there is an optimum number of processors that minimizes the overall parallel execution time.
Here is how you can compute exactly (analytical solution in closed form) the point at which you get no benefit by adding additional processors (or cores if you prefer).
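A small numeric sketch of that optimum, taking the parallel execution time as σ + ϕ/p + κ and assuming, purely for illustration, that the overhead grows linearly with the processor count (κ = c·p). Under that assumed model the minimum is at p = sqrt(ϕ/c); the brute-force search below checks it. Real overhead functions depend on the algorithm and the machine.

```python
import math


def parallel_time(p, sigma, phi, c):
    """Execution time sigma + phi/p + kappa, with the assumed overhead kappa = c * p."""
    return sigma + phi / p + c * p


def best_processor_count(sigma, phi, c, max_p=100_000):
    """Brute-force the integer p that minimizes parallel_time."""
    return min(range(1, max_p + 1), key=lambda p: parallel_time(p, sigma, phi, c))


# Hypothetical workload: 10 units serial, 10000 units parallelizable, 1 unit overhead per processor.
print(best_processor_count(10.0, 10_000.0, 1.0, max_p=1000), math.sqrt(10_000.0 / 1.0))
```

Beyond that processor count the added overhead outweighs the reduced computation time, so the execution time starts rising again.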
The answer is, of course, "it depends", but in the current world of shared memory multi-processors the short version is "when traffic coordinating shared memory or other resources consumes all available bus bandwidth and/or CPU time".
That is a very theoretical problem, though. Almost nothing scales well enough to keep taking advantage of more cores, even at small numbers. Few applications benefit from 4, fewer from 8, and almost none from 64 cores today - well below any theoretical limitations on performance.
If we're talking x86, that architecture is more or less at its limits. At 3 GHz electricity travels 10 cm (actually somewhat less) per clock cycle, the die is about 1 cm square, and components have to be able to switch states in that single cycle (1/3000000000 of a second). The current manufacturing process (22nm) gives interconnections that are 88 (silicon) atoms wide (I may have misunderstood this). With this in mind you realize that there isn't that much more that can be done with physics here (how narrow can an interconnection be? 10 atoms? 20?). At the other end the manufacturer, to be able to market a device as "higher performing" than its predecessor, adds a core which theoretically doubles the processing power.
"Theoretically" is not actually completely true. Some specially written applications will subdivide a large problem into parts that are small enough to be contained inside a single core and its exclusive caches (L1 & L2). A part is given to the core and it processes for a significant amount of time without accessing the L3 cache or RAM (which it shares with other cores and therefore will be where collisions/bottlenecks will occur). Upon completion it writes its results to RAM and receives a new part of the problem to work on.
If a core spends 99% of its time doing internal processing and 1% reading from and writing to shared memory (L3 cache and RAM) you could have an additional 99 cores doing the same thing because, in the end, the limiting factor will be the number of accesses the shared memory is capable of. Given my example of 99:1 such an application could make efficient use of 100 cores.
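The 99:1 reasoning amounts to one division. In this sketch the function name is illustrative, and it assumes the shared memory system is the only bottleneck and that accesses from different cores never overlap.

```python
def max_useful_cores(private_time, shared_time):
    """Cores that can share one memory system before it is busy all the time."""
    return (private_time + shared_time) / shared_time


print(max_useful_cores(99, 1))  # the 99:1 split from the text
print(max_useful_cores(90, 10))  # a more memory-hungry split saturates much sooner
```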
With more common programs - office, ie, etc - the extra processing power available will hardly be noticed. Some parts of the programs may have smaller parts written to take advantage of multiple cores and if you know which ones you may notice that those parts of the programs are much faster.
The 3 GHz was used as an example because it works well with the speed of light, which is 300000000 meters/sec. I read recently that AMD's latest architecture was able to execute at 5 GHz, but this was with special coolers and, even then, it was slower (processed less) than an Intel i7 running at a significantly lower frequency.
It heavily depends on your program architecture/design. Adding cores improves parallel processing. If your program is not doing anything in parallel but only sequentially, adding cores would not improve its performance at all. It might improve other things though like framework internal processing (if you're using a framework).
So the more parallel processing is allowed in your program, the better it scales with more cores. But if your program has limits on parallel processing (by design or by the nature of the data), it will not scale indefinitely. It takes a lot of effort to make a program run on hundreds of cores, mainly because of growing overhead, resource locking and required data coordination. The most powerful supercomputers are indeed massively multi-core, but writing programs that can utilize them is a significant effort, and they can only show their power on inherently parallel tasks.
