Level of Parallelism present in multiple threads per core - multithreading

So i have been looking into some of the technologies that implement multiple threads per core (like intel's hyperthreading) and I am wondering whats the extent of parallelism in these kinds of technologies. Is it true parallelism or just more effective concurrency? It seems they still share the same execution units and core resources, basically seems like its just virtualizing the usage. So I am unsure how true parallelism could occur. And if this is the case then what is the benefit? You can achieve concurrency through effective thread context switching.

I'm no expert, but from what I've read (Long Duration Spin-wait Loops on Hyper-Threading Technology Enabled Intel Processors):
Each physical processor has two logical processors. The logical processors each have their own independent architectural state, but share nearly all other resources on the physical processor, such as caches, execution units, branch predictor, control logic and buses.
So, basically, if one logical processor is using a physical unit (e.g., FPU, the Floating-point unit), the other logical processor is allowed to use another resource (e.g., ALU, the arithmetic logic unit).
From what I've read, you can expect a performance increase of 15-20% best case scenario. I don't have any actual numbers, but don't expect the same level of performance increase as you'd expect from adding another physical processor.

So there are a lot of factors that determine the benefits present in Hyperthreading. First off since they are sharing resources there is obviously no true parallelism but their is some increase in concurrency depending on the type of processor.
There are three types of hardware threading. Fine grained which switches threads in a round robin fashion with the goal of increased throughput, at the cost of increased individual thread latency. Switching is done on a clock to clock basis. There is course grained which is more like a context switch, where the processor switching the thread when a stall or some sort of memory fetching occurs. Then there is simultaneous in which thread switching occurs in the same clock, meaning there is multiple thread data in the Reorder Buffer and Pipeline at the same time. They are depicted as follows.
Hyperthreading corresponds to the SMT in this diagram. And as seen, the effectiveness of the design depends primarily on one thing: how busy the pipeline is. In dynamically scheduled processors, where the goal is to keep the pipeline and execution units as busy as possible, the advantages see diminishing returns of around 0 to 5 percent from what I have seen. For statically scheduled processors, where the pipeline has a lot of stalls the benefits are much more prevalent and see gains of around 20 to 40% depending on the capabilities of the compiler reordering the instructions.

Related

Hardware Multithreading and Simultaneous Multithreading(SMT)

I'm reading Multithreading (computer architecture) - Wiki, aka hardware threading, and I'm trying to understand the second paragraph:
(p2): Where multiprocessing systems include multiple complete processing units in one or more cores, multithreading aims to increase utilization of a single core by using thread-level parallelism, as well as instruction-level parallelism.
while the link to thread-level parallelism says:
(Link): Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. This type of parallelism is found largely in applications written for commercial servers such as ...
which is not so useful... So I read task parallelism above, since I guess TLP is a subtype of it:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments. Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—across different processors.
Question: If thread-level parallelism is task parallelism, and task parallelism is for parallelization across multiple processors, how increase utilization of a single core by using thread-level parallelism work?
Guessing: I guess for TLP, it should mean across multiple logical processors, i.e. hardware threads in the perspective of OS, correct?
Another minor issue is that for my first link, Multithreading:
In computer architecture, multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to execute multiple processes or threads concurrently, supported by the operating system.
And in (p2) it aim to increase utilization of a single core by using thread-level parallelism? What a contradiction.
I don't think we should base off wiki definitions, the wording there is not accurate enough to merit searching for contradictions.
First, I would describe task parallelism as a form of parallelism inherent to some algorithm or problem, where there could be a functional decomposition into multiple tasks with different nature, that can run concurrently. Alternative forms of parallelism include for example spatial or data decomposition, where the problem can be broken into different parts of the data or the input layout (e.g., array ranges, matrix tiles, image parts...).
Thread-level parallelism is a different taxonomy, it is any form of parallelism that can be extracted for utilization by a multi-threaded system. It requires the decomposition to be coarse grained enough to allow the different threads to run independently (otherwise the synchronization overhead required would make it useless).
The alternative for that is for example ILP (instruction level parallelism) which is when a single thread context can extract parallelism within the code by running over a deep out-of-order machine that can schedule based on readiness. This allows more fine-grained parallelism and less programmer involvement usually, but limits the parallelism to the depth of the OOO window.
On a related topic - be careful not to confuse simultaneous execution and concurrent one.
Thread level parallelism can be used by extracting task-level parallelism or other forms of algorithm decomposition from the code. It can then be run on a system that is single-core (preemptive), or multi-threaded. The latter type can be achieved through multi-core systems, simultaneous multi-threading or both (common processors usually have many cores, and may of them support SMT on top of that).
I think my intuition should be correct, it is either:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments.
should be updated to:
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
to include the possibility of hyper threading.
No update, but task parallelism is focus on cross-processor parallelism while for TLP it should mean
Thread-level parallelism is a form of parallelization of computer code across multiple logical processors in parallel computing environments.
again, to include the possibility of hyper threading.
Useful resource:
https://en.wikipedia.org/wiki/Hyper-threading
https://www.youtube.com/watch?v=wnS50lJicXc
https://en.wikipedia.org/wiki/Simultaneous_multithreading
especially this line:
The name multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.).
So for the minor issue, see Concurrent computing - #Introduction - p1:
The concept of concurrent computing is frequently confused with the related but distinct concept of parallel computing,[2][3] although both can be described as "multiple processes executing during the same period of time". In parallel computing, execution occurs at the same physical instant: for example, on separate processors of a multi-processor machine, with the goal of speeding up computations—parallel computing is impossible on a (one-core) single processor, as only one computation can occur at any instant (during any single clock cycle).

What are all the different types of parallelism?

I am trying to understand more about parallelism, but I've noticed there are a lot of different terms out there and some seem to mean the same thing while others have a notable difference. So, what are all the different types of parallelism, how do they differ from each other, and do any have specific applications or purposes?
(To keep this more focused, I'm hoping for an answer that provides clarity to all the terminology associated with parallelism, including terms not listed below; technical comparisons between each different type would be nice, but will probably result in this question becoming off-topic - then again, I don't really know, hence the question).
Note:
this is not a question about concurrency and goes beyond the "simple" question: "what is parallelism?", although a clarifying definition might be warranted.
First, I have taken notice of the difference between parallelism and threading, but some of the differences between the following terms are still confusing.
To add clarity to my question here is a list of terms that I have found that are related to parallelism: parallel computing, parallel processing, multithreading, multiprocessing, multicore programming, Hyper-threading (Intel) 2, Simultaneous MultiThreading (SMT) 3, Switch-on-Event MultiThreading 3. (If possible, definitions or references to definitions for each of these terms would also be appreciated).
My very specific question: what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism? (and any other x-level parallelism)?
In a multi-core processor, can parallelism occur within a single core? Is that what Hyper-threading is, and does that require a single core having, for example, two ALU's that can be used in parallel?
Last one: is there a difference between hardware vs software parallelism, aside from the obvious distinction that one happens in hardware while the other in software?
Related resources:
- Process vs Thread,
- Parallelism on a GPU,
- Hyper-threading,
- Concurrency vs Parallelism,
- Hyper-threading and gaming.
Q:What is the difference betweenthread-level parallelism,instruction-level parallelism,and process-level parallelism?
While the subject matter is indeed immensely wide, I would try to have this view, even at a risk of making many opponents present their objections of simplifying the subject matter ( but StackOverflow format does not substitute other sources of complete reference, does it ? ):
A:the main difference is WHAT / WHO / HOW is responsible for keeping things to execute in true-[PARALLEL]
Instruction Level Parallelism - ILP - is the simplest case, the CPU-architecture has designed and "hardwired" this particular form of hardware-based parallelism. Having processors with ILP4 ( 4 instructions executed at once ), or having processors with per-instruction based width of this form of parallel-instruction execution, be it ILP2 for some instructions but ILP1 for some others, again the silicon architecture decides, what can happen indeed in parallel at the instruction level. Some awkward surprises may arise from further details, as memory-controller channels may block ILP-mode in cases, where REG/MEMORY uops will have to wait for a free channel to access the instructed MEMORY.
hardware-threads are the next level of granularity. Given a CPU-core is declared to support two hardware threads, these are the only streams-of-code execution, that may flow in parallel ( if no O/S request comes to instantiate and schedule another thread to get executed, mapped onto one of the available CPU-core hardware-threads ). From the user-perspective, there are O/S tools that permit one to explicitly "nail"-down a process-level-PID / thread-level-PID affinity onto a particular CPU-core(s) and thus limit or even eliminate any "disturbance", so as to move from a "just"-[CONCURRENT] flow of code-execution closer to a true-[PARALLEL]one.
We will knowingly skip all the crowds of threads, that are just a tool for latency-masking ( be it on the SIMT / SMX warp-wide GPU-scheduler, or the more relaxed, MIMT O/S-kernel driven multithreading )
software operated distributed-systems parallelism is the one that ought be mentioned for completeness, but it has the principally highest adverse costs from a need to invent, define, implement and operate the setup / coordination in software ( which all causes overheads to grow remarkably ), in the sense of the re-formulated Amdahl's Law right due to a need to somehow design and keep operational the non-native orchestration of both the distributed process execution and all the dataflow, that it is dependent on.
hardware-based true-[PARALLEL] systems are at the highest level of orchestration, where both the silicon ( like the InMOS' network of meshed Transputers ) and also the programming language ( like the InMOS' occam or occam-pi ) provide the carefully engineered, conceptually crafted true-[PARALLEL] code-execution.
- MIMT: Multiple Instruction Multiple Threads, a non-restricted thread-execution fabric / policy, where any thread may and does issue a different instruction to the processor for execution, as opposed to SIMT
- SIMT: Single Instruction Multiple Threads, typically a GPU Streaming Multiprocessor code-execution architecture- SMX: Streaming Multiprocessor eXecution unit, typically a GPU SIMT building block, onto which the GPU-kernel code-units could be directed ( addressed ) for being TaskQueeue-scheduled and later executed, according to the WARP-wide SIMT-code scheduler coordinated
what is the difference between thread-level parallelism, instruction-level parallelism, and process-level parallelism?
In 1, different CPU cores execute different streams of instructions.
In 2, single CPU core executes different instructions from a single instruction stream in parallel (these instructions are either consecutive instructions in the stream, or otherwise very close to each other).
3 is same as 1, the difference is cosmetic. It’s just the default settings about which memory pages are shared across threads and which aren’t. But these settings are user-adjustable with process creation flags, shared memory sections, dynamic libraries, and other system APIs, that’s why on the lower level, the difference between process and threads is not a big deal.
and any other x-level parallelism
Another important one is SIMD level parallelism. For this one, CPU applies same instruction to multiple operands stored in special wide registers. With SSE we have 128-bit wide registers, and we can e.g. multiply a vector of 4 single-precision floating-point numbers in one register by another 4 values in another register, making 4 products in parallel, with a single mulps instruction. ARM NEON is similar, also 128 bit registers, the instruction to multiply 4 floats by 4 floats is vmul.f32. AVX operates on 256-bit registers so it can multiply 8 floats at once, with a single vmulps instruction.
can parallelism occur within a single core?
Yes.
Is that what Hyper-threading is
Yes, also it’s what instruction-level parallelism is, and SIMD parallelism, too.
does that require a single core having, for example, two ALU's that can be used in parallel?
Modern CPUs have more than two per core but HT was introduced in P4 and it’s not a requirement. The profit from HT is not just loading multiple ALUs, it’s also using the core while a thread is waiting for data to arrive from caches or from system RAM. And also, using the core while it's stalled because of the data dependency between nearby instructions. HT allows a CPU core to compute something else on another hardware thread while it’s waiting, therefore improving ALU utilization. Without HT, the core would likely just sit and wait for hundreds cycles in case of RAM latency, or for dozens cycles in case of data dependency latency.
is there a difference between hardware vs software parallelism
When you have a single hardware thread and multiple OS threads that compute stuff, only 1 thread will be running at any given time. The rest of the threads will be waiting. The OS will periodically (often ~50-100Hz) switch which one’s running, with the goal to give all threads a fair slice of CPU time. You can call that software parallelism if you want, but I wouldn’t call such thing parallel at all.

Is duplication of state resources considered optimal for hyper-threading?

This question has an answer that says:
Hyper-threading duplicates internal resources to reduce context switch
time. Resources can be: Registers, arithmetic unit, cache.
Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?
Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even faster throughput?
Is duplication that researchers arrived at in some sense optimal, or is it just a reflection of current possibilities (transistor size, etc.)?
The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.
Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work instead of sitting idle waiting for cache misses, latency, and branch mispredictions. (See Modern Microprocessors
A 90-Minute Guide! for very useful background, and a section on SMT. Also this answer for more about how modern superscalar / out-of-order CPUs find and exploit instruction-level parallelism to run more than 1 instruction per clock.)
Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and it's mostly in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows how Sandybridge always partitioned the IDQ (decoded-uop queue that feeds the issue/rename stage), but IvyBridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads. For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers, that's why register renaming onto a big physical register file is done in the first place.)
Some things are statically partitioned, like the ROB, to stop one thread from filling up the back-end with work dependent on a cache-miss load.
Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.
Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads at faster than half speed each. (Usually significantly faster than half).
But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. It helps for a few single-thread workloads, but helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.
Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly the amount of execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.
HT doesn't help at all for code that achieves very high throughput with a single thread per physical core. For example, saturating the front-end bandwidth of 4 uops / clock cycle without ever stalling.
Or if your code only bottlenecks on a core's peak FMA throughput or something (keeping 10 FMAs in flight with 10 vector accumulators). It can even hurt for code that ends up slowing down a lot from extra cache misses caused by competing for space in the L1D and L2 caches with another thread. (And also the uop cache and L1I cache).
Saturating the FMAs and doing something with the results typically takes some instructions other than vfma... so high-throughput FP code is often close to saturating the front-end as well.
Agner Fog's microarch pdf says the same thing about very carefully tuned code not benefiting from HT, or even being hurt by it.
Paul Clayton's comments on the question also make some good points about SMT designs in general.
If you have different threads doing different things, SMT can still be helpful. e.g. high-throughput FP code sharing a core with a thread that does mostly integer work and stalls a lot on branch and cache misses could gain significant overall throughput. The low-throughput thread leaves most of the core unused most of the time, so running another thread that uses the other 80% of a core's front-end and back-end resources can be very good.

How is processor speed distributed across threads?

Objective:
I am trying to estimate how fast my code will execute when run concurrently in multiple threads.
Question 1)
If I know exactly how fast my code runs for a single request in one thread is their any way of estimating how fast it will run amongst multiple threads?
Question 2)
What impact, if any, does the presence of other threads effect the execution speed of each other thread?
My Situation:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time. Takes Half a second on 1 thread and I was worried how this will scale with multiple users performing the same query. Every user requests is handled by a separate thread so 100 simultaneous users will require 100 simultaneous threads. Each thread is sharing the same resource but read only. No writing. Is there any chance I could get each user to see roughly the same execution time?
Note: I know it will depend upon a number of factors but surely there must be some way of identifying whether or not your code will scale if you find it takes x amount of time for a single thread given x hardware. As final note I'd like to add I have limited experience with computer hardware architecture and how multi-threading works under the hood.
These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.
Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.
The way that most modern operating systems, including Windows, Unix/Linux, and MacOS, use the processor is by scheduling processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads that are waiting for processor time (this is a bit simplistic, as they all have some notions of process prioritization, so that high-criticality processes get pushed up the queue earlier than less important ones).
When a thread is using a processor core, it gets it all for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the process uses up its time slice, or requests some resource that isn't immediately available, it its turn at the processor core is ended, and the next scheduled task will begin. This tends to make pretty optimal use of the processor resources.
So what are the factors that determine how well a process will scale up?
What portion of its run time does a single process spend waiting for
I/O and user input?
Do multiple threads hit the same resources, or different ones?
How much communication has to happen between threads? Between individual threads and your processes main thread? This takes synchronization, and introduces waiting.
How "tight" are the hotspots of the active thread? Can the body of it fit into the processor's memory, or does the (much slower) bus memory have to be accessed?
As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
If I know exactly how fast my code runs for a single request in one thread is their any way of estimating how fast it will run amongst multiple threads?
No, you should determine it empirically.
What impact, if any, does the presence of other threads effect the execution speed of each other thread?
Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features which can increase the clock of a lone-busy CPU core to compensate for the all the idle cores. This sort of feature might confound your measurements and expectations about scaling.
Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.
I know it will depend upon a number of factors
Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.
but surely there must be some way of identifying whether or not your code will scale
Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:
disk I/O
network I/O
memory I/O
computation
My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user I can't assume it will be always take 2 seconds if say 100 users are simultaneously running the same operation correct?
If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.
If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.
EDIT given specifics:
I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses 1 at a time.
Your problem sounds memory+cache-bound. You should study the details of your target CPU/mem deployment or if you are designing it, opt for high memory throughput.
A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
Cache contention may be a factor in scaling as well.
Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
limited experience with computer hardware architecture and how multi-threading works under the hood.
Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

Why would multi threaded applications in general scale bad?

I am currently thinking of reasons why multi threaded applications may not scale well.
Two reasons I am aware of and that I have been fighting with are:
Communication between threads is not done well and slows down the speed
Number of cores on a chip and memory bandwith to the cpu do not increase proportionally. This leads to a slower memory bandwith per core the more cores on a chip are heavily used.
What else are problems?
For point 1), they are not necessarily 'not done well', but in most cases there are critical sections that processes/threads have to wait for each other, e.g. update some critical data. This is described well by Amdahl's law.
Another point I'd like to add is the scalability of the task itself. If the task (the input) is not scalable, then increasing processing power (cores/threads) cannot improve the whole throughput. For example, an application is to handle data flows, but there is a constraint that data packets from same flow can not be handled in parallel (due to ordering consideration), then the scalability will be limited by the number of flows.
In addition, the scalability of the algorithm is even more fundamental, considering the difference between O(1) and O(n) algorithms. Of course, maybe the topic here focus on scalability of processing power, rather than data size.
I think that, in (1), you've nailed one of most important factors that can negatively influence the performance of multithreaded apps. Esp. Google for 'false sharing'.
(2), however only affects a set of multithreaded apps - those that that run CPU-bound threads in parallel. If an app uses many threads that are I/O bound, (2) does not matter too much.
Looking at my box here, it has 100 processes and 1403 threads, CPU use 3%. Only 7 out of the 100 processes are single-threaded. Most of the apps, therefore, are multithreaded but I/O waiting.
My box would work reasonably well, at the moment, if it had only one core. Sure, hitting a link that winds up my browser would probably be a bit slower to bring up a complex page, but not much.
In the commonest case then, where apps are multithreaded to take avantage of the high I/O performance of preemptive multitaskers, apps scale very well indeed, even on a single-core CPU.
Try not to fall into the trap of thinking that preemptive multitasking OS are all about 'doing CPU-bound tasks in parallel' - they actually make this difficult by forcing the need for locking, synchro, signalling etc. It's much more about high-performance I/O, something that a cooperative scheduler is spectacularly bad at.
Many multi-threaded applications are built around the "one user one thread" concept which means that once a user or chore needs to be handled a thread is allocated to the task. Every extra thread increases the load on the scheduler leading up to the point where all processing is done trying to determine which thread should be run at this moment. Call this "scheduler saturation."
Windows (the multi-threaded engine, not 95/98/Me etc) has a mechanism called I/O Completion ports which recommend one thread per processor for best performance. IOCP-based applications are usually tremendously fast though, as always, the bottlenecks instead appear in other places such as running out of certain types of OS memory or waiting on the communications medium.
You can search for IOCP here at SO, it has its own tag.
I would add:
The more threads, the smaller their share of CPU cache. A typical modern CPU's might have 3 levels of cache: L1, L2 and L3. L1 might be private to that core, but L2 and L3 might be shared between cores on the die or something. So a single thread can use the entire L2 & L3, but if you have many threads then you get many more cache misses, depending on the profile of your algorithm.
See also:
many-core CPU's: Programming techniques to avoid disappointing scalability
It could be limited by the fixed maximum bandwidth of main memory, where your program has run out of the memory bandwidth, and however you make more thread can't create more available memory bandwidth. This is related to your specific application, whether is a memory bounded one or a compute bounded one, see roofline model.

Resources