In a single GPU such as P100 there are 56 SMs(Streaming Multiprocessors), and different SMs may have little correlation .I would like to know the application performance variation with different SMs.So it there any way to disable some SMs for a certain GPU. I know CPU offer the corresponding mechanisms but have get a good one for GPU yet.Thanks!
There are no CUDA-provided methods to disable a SM (streaming multiprocessor). With varying degrees of difficulty and behavior, some possibilities exist to try this using indirect methods:
Use CUDA MPS, and launch an application that "occupies" fully one or more SMs, by carefully controlling number of blocks launched and resource utilization of those blocks. With CUDA MPS, another application can run on the same GPU, and the kernels can run concurrently, assuming sufficient care is taken for it. This might allow for no direct modification of the application code under test (but an additional application launch is needed, as well as MPS). The kernel duration will need to be "long" so as to occupy the SMs while the application under test is running.
In your application code, effectively re-create the behavior listed in item 1 above by launching the "dummy" kernel from the same application as the code under test, and have the dummy kernel "occupy" one or more SMs. The application under test can then launch the desired kernel. This should allow for kernel concurrency without MPS.
In your application code, for the kernel under test itself, modify the kernel block scheduling behavior, probably using the smid special register via inline PTX, to cause the application kernel itself to only utilize certain SMs, effectively reducing the total number in use.
Related
thread-stop-preemption
//code to run
thread-start-preemption
a piece of code is running in a thread,
are atomic functions are available in user mode?
Linux doesn't offer very good behavior for real-time applications.
Unless your application really is real-time you should change your code to use normal synchronization primitives (e.g. mutexes, condition variables etc.)
But if you really think you need your thread not to be interrupted you might get away (but not really) with the real-time policies mentioned in sched(7) e.g. SCHED_FIFO. If you choose to go down that route you can influence a thread's scheduling using sched_setattr(2).
More warning
Before using this for anything with hard real-time constraints consider a vanilla Linux kernel itself is probably not the tool for the job: although the scheduler will try to keep your thread running I don't think it "guarantees" it.
I recently stumbled upon the question above but I am not sure if I understand what it is asking.
How would one avoid the use of scheduling policies?
I would think that there isn't any other way...
Scheduling policy has nothing to do with the resource allocation! Processes are scheduled basically, and hence allocated resources as such.
From "Resource allocation (computer)" description on Wikipedia :-
When the user opens any program this will be counted as a process, and
therefore requires the computer to allocate certain resources for it
to be able to run. Such resources could have access to a section of
the computer's memory, data in a device interface buffer, one or more
files, or the required amount of processing power.
I don't know how you got confused between them. All the process would, at a time or another, get scheduled at any point of time; unless the CPU is an unfair one.
EDIT :
How would one avoid the use of scheduling policies?
If there are more than one user-process to be executed, then one has to apply the scheduling policy so that the processes get executed in some order. There has to be a queue to hold all the processes. See a different case in BareMetal OS below.
Then, there is BareMetal OS which is single address space OS.
Multitasking on BareMetal is unusual for operating systems in this day
and age. BareMetal uses an internal work queue that all CPU cores
poll. A task added to the work queue will be processed by any
available CPU core in the system and will execute until completion,
which results in no context switch overhead.
So, BareMetal OS doesn't use any scheduling policy, it is based on polling of the work-queue by the cores.
I would like to run a single low latency task (for audio, ALSA/JACK) on a separate core with an embedded Linux system. Removing scheduler and other interrupts might be the key here.
There were several approaches I found so far, e.g. cpusets and an offline scheduler from 2009 (which unfortunately does not support user space tasks).
Is there a newer/more convenient way to achieve this?
Offline scheduler
The topic you are looking for is called "CPU affinity". There are two main aspects to the CPU affinity: affinity of processes and affinity of the interrupts.
To my (admittedly limited) knowledge:
The processes are assigned to CPUs using the taskset command. (The affinity is inherited by the child processes.)
The interrupts to CPU assignment on Linux can be manipulated using the /proc/irq/<n>/smp_affinity. To verify the effectiveness of the assignment, check the /proc/interrupts to see which CPUs serve which interrupts. See here.
In your particular case, you want to reserve a single CPU (aka core) for your critical application, for example CPU0. That means that all processes and interrupts should be assigned to all but the CPU0, using the affinity mask which has the bit 0 (== CPU0) cleared, e.g. 0xfffffffe. And your critical application would have the affinity mask of 0x1, meaning that it is allowed to run only on the CPU0.
Additionally, you might need to use the sched_setscheduler syscall in the application to set the scheduling to one of the real-time policies. That might improve the latencies of your application (but also can make worse).
Note that tuning the CPU affinity is not a trivial endeavor and clear-cut solutions are rare. You would need to test and experiment to make sure that the configuration can sustain the performance you need. For example, it is likely that your application would communicate with the other processes. If the communication is synchronous, and the other processes are slow to react (since they have limited CPU resources), that would in turn negatively impact performance of your critical application. Same applies to the interrupt(s) of the sound card.
Hope that helps.
The JMeter manual says
Your hardware's capabilities will limit the number of threads you can effectively run with JMeter. It will also depend on how fast your server is (a faster server makes JMeter work harder since it returns request quicker). The more JMeter works, the less accurate its timing information may become.
The question I want to ask is How many threads can I run from a single desktop machine and still get accurate enough results? However, I realize that's going to depend on what we define modern hardware as, or how fast my application/site is, etc.
So, the better (but harder to answer) question is, how to I profile JMeter to know when I've gone beyond the thread/user count that it's reasonable for a single machine to handle? Accurate deterministic methods are preferred, but anecdotal/rules-of-thumb are welcome.
I first suggest you follow best-practices for building JMeter test plans and running them:
http://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
http://jmeter.apache.org/usermanual/best-practices.html
Then once your test plan is built, baseline it on the JMeter machine:
Monitor CPU (don't exceed 50%), swap (ensure no swap in/out at all)
Check GC for no long pauses
And don't forget issues which make Test wrong can come from lot of factors:
Networks issue between injector and application
TCP stack issues on JMeter injector
Components between the Injector and Application (Firewall, Load Balancer ...)
I would like to start playing with concurrency in the programs I write (mostly for fun), but I don't own a multi-core system and can't afford one any time soon. I run linux. Is there a way to, for example with a Virtual Machine, compare the performance of a multi-threaded implementation of a program with a single-threaded version, without actually running it on hardware with multiple processors or cores?
That is, I would like to be able to implement parallel algorithms and be able to say that, yes, this multithreaded implementation is better-performing than the single-threaded.
Thanks
You can not test multithreaded programs reliably on a single core machine. Race conditions will show up very differently or even be totally hidden on a single core machine. The performance will decrease etc.
If you want to LEARN how to program multiple threads, you can do so on a single core machine for the first steps (i.e how works the API etc.). But you'll have to test on a multicore machine and its very likely that you will see faults on a multicore machine that you dont see on a single core machine.
Virtual machines are by my experience no help with this. They introduce new bugs, that didnt show up before, but they CANT simulate real concurrency with multiple cores.
Depending on what you're benchmarking you might be able to use an Amazon EC2 node. It's not free, but it's cheaper than buying a computer.
If you have only one core/cpu and your algorithm is cpu intensive, you will probably see multi-threaded program is actually slower than the single-threaded one. But if you have program use i/o in one thread and cpu in another for example, then you can see the multi-threaded program is faster.
To observe effects other than potentially improved locality, you'll need hardware or a simulator that actually models the communication/interaction that occurs when the program runs in parallel. There's no magic to be had.