I have an analysis that can be parallelized over a varying number of processes. It is expected to be both IO and CPU intensive (very high-throughput short-read DNA alignment, if anyone is curious).
The system running this is a 48 core linux server.
The question is how to determine the optimum number of processes such that total throughput is maximized. At some point the processes will presumably become IO bound such that adding more processes will be of no benefit and possibly detrimental.
Can I tell from standard system monitoring tools when that point has been reached?
Would the output of top (or maybe a different tool) enable me to distinguish between an IO-bound and a CPU-bound process? I am suspicious that a process blocked on IO might still show 100% CPU utilization.
When a process is blocked on IO, it isn't running, so no time is accounted against it. If there's another process that can run, then that will run instead; if there isn't, the time is counted as 'IO wait', which is accounted as a global statistic.
IO wait would be a useful thing to monitor. It shows up in top's header as the "wa" figure on the %Cpu(s) line. You can monitor it in more detail with tools like iostat and vmstat. Server Fault might be a better place to ask about that.
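If you want to log that figure over time rather than watch top interactively, a minimal sketch that samples the same counters (the aggregate "cpu" line in /proc/stat, whose fifth field is iowait) could look like this:

    import time

    def cpu_times():
        # aggregate "cpu" line of /proc/stat: user nice system idle iowait irq softirq steal ...
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    prev = cpu_times()
    while True:
        time.sleep(5)
        cur = cpu_times()
        delta = [c - p for c, p in zip(cur, prev)]
        print(f"iowait: {100.0 * delta[4] / sum(delta):.1f}%")  # index 4 == iowait
        prev = cur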
Even a single IO-bound process will rarely show high CPU utilization, because the operating system has scheduled its IO and is usually just waiting for it to complete. So top cannot accurately distinguish between an IO-bound process and a non-IO-bound process that merely periodically uses the CPU. In fact, a system horribly overloaded with IO-bound processes, barely able to accomplish anything, can exhibit very low CPU utilization.
Using only top, as a first pass, you can indeed merely keep adding threads/processes until CPU utilization levels off to determine the approximate configuration for a given machine.
You can use tools like iostat and vmstat to show how much time processes are spending blocked on I/O. There's generally no harm in adding more processes than you need, but the benefit decreases. You should measure throughput vs. processes as a measurement of overall efficiency.
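As a concrete way to make that measurement, a rough sketch along these lines might help; align_one() here is a hypothetical placeholder for whatever aligns a single chunk of reads, and the pool sizes are just examples:

    import time
    from multiprocessing import Pool

    def align_one(chunk):
        # placeholder for the real IO- and CPU-heavy alignment of one chunk
        pass

    def throughput(num_procs, chunks):
        start = time.time()
        with Pool(processes=num_procs) as pool:
            pool.map(align_one, chunks)
        return len(chunks) / (time.time() - start)  # chunks per second

    if __name__ == "__main__":
        chunks = list(range(96))  # the same fixed workload for every run
        for n in (4, 8, 16, 24, 32, 48, 64):
            print(f"{n:2d} processes: {throughput(n, chunks):.2f} chunks/s")

Throughput should climb roughly linearly at first and then flatten (or dip) once the disks become the bottleneck; that knee is the process count you want.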
A scheduler that approximates SRTF, like a multi-level feedback queue design, will tend to favor interactive programs that perform short CPU bursts. Linux's Completely Fair Scheduler sometimes does so, but since it has a different scheduling goal, it often will not. In which of the following scenarios is CFS likely to result in much worse performance for the interactive thread than an MLFQ-like scheduler that approximates SRTF?
1. running one interactive thread with short CPU bursts that, if running alone, would use very little CPU time and one very CPU-intensive thread that never does I/O
2. running one interactive thread with short CPU bursts that, if running alone, would use very little CPU time and one non-interactive thread with much longer CPU bursts that performs disk I/O frequently
3. running one interactive thread with frequent short CPU bursts that, if running alone, would use most of the available CPU time, and one very CPU-intensive thread that never does I/O
4. running one interactive thread with short CPU bursts and a very large number of CPU-intensive threads that never do I/O
The correct answers are 3 and 4.
Why are 3 & 4 correct? What's the difference between an interactive and a non-interactive thread?
In this context, an interactive thread is one that tends to spend most of its time waiting for I/O, only doing small amounts of computation in between. That is, it mostly responds quickly to inputs rather than doing longer computations.
More broadly speaking, when we speak of interactive programs, we usually mean ones that are primarily responding to some external input. A common scheduling goal is to provide programs like these with higher priority than normal programs to provide at least the appearance of better performance to users waiting for the machine to do something. When thinking about interactivity this way, exact definitions vary --- there are different notions of what counts as an "external input".
For answering this question in particular, we don't actually need to use any definition of "interactive". The reason the question specifies that one thread is interactive is to motivate the question --- this is a case where SRTF-like schedulers can do better than CFS by identifying interactive threads by their tendency to have short CPU bursts. Rather than relying on us saying the thread is "interactive", we can understand how the SRTF scheduling policy will work based on the CPU burst lengths, which we are told explicitly. We can understand how the CFS policy will apply by considering that it splits the CPU time approximately fairly between the available threads.
For 1 and 2:
since the interactive thread doesn't use much CPU time overall, it will tend to be run first by CFS, but it will also tend to be run first by SRTF since it has the shortest CPU bursts
For 3:
CFS will end up giving the interactive thread about half the available CPU time (fairly splitting CPU time between the two available threads), but under SRTF, it would always be run first (whenever it could run) because of its shorter CPU burst and would end up getting much more than half the time (since "running alone, [it] would use most of the available CPU")
For 4:
CFS will end up giving the interactive thread about 1/N of the available CPU time where N is the total number of threads and we are told that N is very large. Under SRTF, the thread would always run first, so it would almost certainly get more than the small sliver of CPU time that 1/N represents
--answer from my professor
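To put rough illustrative numbers on case 4 (these figures are made up for the sake of the example, not given in the question):

    N = 101              # 1 interactive thread + 100 CPU-bound threads
    needed = 0.05        # suppose the interactive thread only needs 5% of one CPU

    cfs_share = 1 / N    # CFS splits the CPU roughly fairly among runnable threads
    srtf_share = needed  # SRTF runs the shortest burst first, so the interactive
                         # thread is scheduled as soon as it becomes runnable
    print(f"CFS:  ~{cfs_share:.1%} of the CPU")   # ~1.0%
    print(f"SRTF: ~{srtf_share:.1%} of the CPU")  # ~5.0%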
I'm worried because, watching the disk LED and iotop, I see quite a lot of write activity every couple of seconds, mostly coming from Chromium's processes, on a completely idle system.
It doesn't make any sense at all to have such a high number of writes to disk, even less so with SSDs. Reads aren't a problem for me, partly because I have plenty of disk cache on my 20 GB RAM notebook.
The commit mount option (30 s by default) obviously isn't the solution. I tried increasing and even decreasing it and I still see a write every couple of seconds.
So is there a way to force not more than one write per arbitrary interval?
First, check that your Linux system is using the CFQ I/O scheduler. Then you can use ionice to control the I/O scheduling class and priority of a program.
It supports the following three scheduling classes (quoting from the man page):
Idle: A program running with idle io priority will only get disk time when no other program has asked for disk io for a defined grace period. The impact of idle io processes on normal system activity should be zero. This scheduling class does not take a priority argument.
Best effort: This is the default scheduling class for any process that hasn't asked for a specific io priority. Programs inherit the CPU nice setting for io priorities. This class takes a priority argument from 0-7, with a lower number being higher priority. Programs running at the same best effort priority are served in a round-robin fashion. This is usually recommended for most applications.
Real time: The RT scheduling class is given first access to the disk, regardless of what else is going on in the system. Thus the RT class needs to be used with some care, as it can starve other processes. As with the best effort class, 8 priority levels are defined denoting how big a time slice a given process will receive on each scheduling window. This should be avoided on a heavily loaded system.
ionice [options] command
ionice [options] -p PID
ionice -c1 -n0 -p PID
For limiting I/O more than this, I think you should make use of your SAN utilities.
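A small sketch of driving the same ionice command from a script, putting a process in the idle class so its writes only reach the disk when nothing else wants it (the browser binary name and the example PID are assumptions; adjust as needed):

    import subprocess

    # start a program in the idle IO class (-c 3): it only gets disk time
    # when no other program is asking for it
    subprocess.run(["ionice", "-c", "3", "chromium"], check=True)

    # or re-class an already running process by PID
    pid = "12345"  # hypothetical example PID
    subprocess.run(["ionice", "-c", "3", "-p", pid], check=True)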
Take a look at eatmydata (https://github.com/stewartsmith/libeatmydata).
It could be ok for you, but read all the docs and think twice before using it...
PSD (Profile Sync Daemon) is a solution specifically for Chromium and other browsers: https://wiki.archlinux.org/title/profile-sync-daemon
https://github.com/graysky2/profile-sync-daemon
There are a lot of articles that discuss the multi-core myth: that, in order to really benefit from multiple cores, one needs to write parallel algorithms. Many of them mention Amdahl's law.
Let's assume for simplicity that we have a desktop computer with a 4-core commodity CPU, and assume that the goal is to improve our application's performance as well as overall system performance.
I wonder how CPU cores are used to perform tasks.
Whether threads from a single process are allocated all cores
Or threads from different processes are scheduled to run on different cores.
If the latter is the case, then why is the myth even discussed? Won't multitasking OSes always benefit from multi-core CPUs, even if all the processes are single threaded? Are threads from the same process more likely to be scheduled at the same time on multiple cores?
What are some factors that matter? CPU cache maybe? Some application related maybe? Why?
Why would you ever want to use parallel libraries/algorithms? After all, CPU resources are shared between all running processes and there are always enough of them.
Is there an "active process" notion? i.e. process that gets most attention from the scheduler. If so, then how much more attention does this process usually get?
Whether threads from a single process are allocated all cores
Yes.
Or threads from different processes are scheduled to run on different cores.
Yes, that too.
If the latter is the case, then why is the myth even discussed? Won't multitasking OSes always benefit from multi-core CPUs, even if all the processes are single threaded?
To some extent, yes. But if that process is doing a lot of computation and the only one we care about at some particular time, the benefit will be pretty low.
On the other hand, it also means the process won't be as likely to be interrupted just because the OS has to do something like handle a disk interrupt, an arriving network packet, or something like that. Interrupting a process to handle some hardware task not only reduces the CPU time the process gets, but it also pollutes the CPU caches, causing the process to run more slowly when it resumes. So multi-core CPUs can allow a single-threaded process to command a core for a higher percentage of the time and in longer bursts.
Are threads from the same process more likely to be scheduled at the same time on multiple cores?
Typically no. Why would you want to do that? That would tend to degrade overall system performance, as threads from the same process are more likely to step on each other's toes. You want the system to get other processes' work done efficiently so you get the CPU back.
Is there an "active process" notion?
To some extent. Windows has precisely such a notion -- a "foreground process". Most OSes don't. But they do have a "dynamic priority boost" feature. Basically, if a process is sitting around doing nothing and then needs to do something, it is given some priority as a "reward". This allows a process that sits around waiting for work to be done to get its work done quickly and makes the system feel more interactive and responsive. It often makes little sense on servers, but it's helpful on desktops. Whether this is implemented on threads individually or on all the threads of a process as a group is implementation specific.
If you run separate processes or threads that don't need to interact with each other, then it is far better to have 4 cores than just 1.
As soon as the processes or threads need to share some data, you will incur the overhead of serializing access to the shared data.
A lot depends on how well an application is written to run on a multi-core CPU. In the worst case, trying to run an application on a 4-core CPU may be slower than running it on a single-core CPU; more likely, the increase in performance would be far less than 100%.
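Amdahl's law, which the question mentions, is what makes "far less than 100%" concrete: if a fraction p of the work can be parallelized, the best-case speedup on n cores is 1 / ((1 - p) + p / n). A quick worked calculation:

    def amdahl_speedup(p, n):
        # p: parallelizable fraction of the work, n: number of cores
        return 1.0 / ((1.0 - p) + p / n)

    for p in (0.5, 0.8, 0.95):
        print(f"p = {p:.0%}: speedup on 4 cores = {amdahl_speedup(p, 4):.2f}x")
    # p = 50%: speedup on 4 cores = 1.60x
    # p = 80%: speedup on 4 cores = 2.50x
    # p = 95%: speedup on 4 cores = 3.48x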
I've been using pstack (called in a loop periodically) as a substitute for a real profiling tool. I've noticed that even though top shows more than 85% CPU usage for that PID, pstack shows the PID being blocked on I/O more often than being CPU bound.
How is pstack implemented? Is there any reason why pstack would be more likely to catch the PID while it's actually blocked on I/O?
You say you're calling pstack periodically in a loop, i.e. in a separate process (B) from the one you are profiling (A). If they are running on a single core, then B is more likely to "wake up" when A is blocked.
Regardless, I would trigger pstack manually, on the theory that not many samples are needed. Rather the samples I do get need to be scrutinized, not just lumped together.
In general, it's good to take samples during I/O time as well as CPU time, because both I/O and CPU wastage can make your program slow.
If it somewhat inflates one or the other, that's fairly harmless, assuming your real goal is to precisely identify things to optimize, rather than just get precise measurements of fuzzy things like functions.
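A minimal sketch of that manual approach, keeping each sample on disk for scrutiny rather than lumping them together; it assumes pstack is installed and that you are allowed to ptrace the target process:

    import random
    import subprocess
    import sys
    import time

    pid = sys.argv[1]
    for i in range(20):  # a handful of samples is usually enough
        result = subprocess.run(["pstack", pid], capture_output=True, text=True)
        with open(f"sample_{i:02d}.txt", "w") as f:
            f.write(result.stdout)
        time.sleep(random.uniform(0.5, 2.0))  # jitter to avoid syncing with the target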
Earlier I asked about processing a data stream, and someone suggested putting the data in a queue and processing it on a different thread. If that was too slow, I should use multiple threads.
However, I'm using a system that has one core.
So my question is: why not up the priority of my app, so it gets more CPU time from the OS?
I'm writing a server based app and it will be the only big thing running on there.
What would be the pros and cons of putting the priority up? :)
If you have only one core, then the only way multi-threading can help you is if chunks of that work depend on something other than the CPU, so one thread can get some work done while another is waiting for data from a disk or network connection.
If your application has a GUI, then it can benefit from multi-threading in that while it would be no quicker to do the processing (slower in fact, though probably negligibly so if the task is very long), it can still react to user input in the meantime.
If you have two or more cores, then you can also gain in CPU-bound operations though doing so varies from trivial to impossible depending on just what that operation is. This is irrelevant to your case, but worth considering generally if code you write could later be run on a multi-core system.
Upping the priority is probably a bad idea though, especially if you have only one core (one advantage of multi-core systems is that people who up priorities can't do as much damage).
All threads have priorities which is a factor of both their process' priority and their priority within that process. A low-priority thread in a high priority process trumps a high-priority thread in a low-priority process.
The scheduler doles out CPU slices in a round-robin fashion to the highest priority threads that have work to do. If there are CPUs left over (which in your case means if there are zero threads at that priority that need to run), then it doles out slices to the next lowest priority, and so on.
Most of the time, most threads aren't doing much anyway, which can be seen from the fact that CPU usage on most systems is below the 100% mark most of the time (hyperthreading skews this: the internal scheduling within the cores means a hyperthreaded system can be fully saturated while appearing to run at as little as 70%). Anyway, generally stuff gets done, and a thread that suddenly has lots to do will do so at normal priority in pretty much the same time it would at a higher one.
However, while the benefit to that busy thread of higher priority is generally little or nothing, the detriment is great. Since it's the only thread that gets any CPU time, all other threads are stuck. All other processes therefore hang for a while. Eventually the scheduler notices that they've all been waiting for around 3 seconds, and fixes this by boosting them all to the highest priority and giving them larger slices than normal. Now we have a burst of activity as threads that got no time are all suddenly highest-priority threads that all want CPU time. There's a spurt of every thread except the high-priority one running, and the system is kept from keeling over, though there are likely still a lot of applications showing "Not Responding" in their title bars. It's far from ideal, but it is an effective way to deal with a thread of higher than usual priority grabbing the core for so long.
The threads gradually drop down in priority, and eventually we're back to the situation where the single higher priority thread is the only one that can work.
For extra fun, if our high priority thread in any way depended upon services provided by the lower priority threads, it would have ended up being stuck waiting on them. Hopefully in a way that made it block and stopped itself from doing any damage, but probably not.
In all, thread priorities are to be approached with great caution, and process priorities even more so. They're only really valid for threads that will yield quickly and are either essential to the workings of other threads (e.g. some OS processes will run at a higher priority, finaliser threads in .NET will be higher than the rest of the process, etc.) or where sub-millisecond delays can mess things up (some intensive media work requires this).
If you have multiple cores/processors in your system, upping the priority of a single threaded program will not improve your performance by much, because the other cores would still be unused.
The only way to take advantage of multiple processing units is to write your program using multiple threads/processes.
Having said this, setting your multithreaded application to very high priority may lead to some performance improvement, but I really never saw it to be significant, at least in my own tests.
Edit: I see now that you are using only one core. Basically your program will be able to run more often on the CPU than the rest of the processes that are of lower priority. This may bring you a marginal improvement, but not a dramatic one. Since we cannot know what other applications are running at the same time on your system, the golden rule here is to try it yourself with various priority levels and see what happens. It's the only valid way to see if things will be faster or not.
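One way to follow that "try it yourself" advice: run the same workload under different nice values while your usual background load is present and compare wall-clock times (the workload below is a hypothetical stand-in; negative nice values, i.e. higher priority, normally require root):

    import subprocess
    import sys
    import time

    # stand-in CPU-bound workload; replace with your real processing step
    workload = [sys.executable, "-c", "sum(i * i for i in range(50_000_000))"]

    for niceness in ("10", "0", "-10"):
        start = time.time()
        subprocess.run(["nice", "-n", niceness] + workload)
        print(f"nice {niceness:>3}: {time.time() - start:.2f}s")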
It all depends on why the data processing is slow.
If the data processing is slow because it is a genuinely CPU-intensive operation, then splitting it out into multiple threads on a single-core system is not going to get you any benefit. In this case, increasing the task priority would provide some benefit, assuming that there is (user) CPU time being used by other processes.
However, if the data processing operation is slow because of some non-CPU restriction (e.g. if it is I/O bound, or relying on another process), then:
Increasing the task priority is going to have negligible impact. Task priority won't affect I/O times, and if there is a dependency on another process on the system, you may actually harm performance.
Splitting the data processing out into multiple threads can allow the CPU-intensive areas to continue processing while waiting for the non-CPU-intensive (e.g. I/O) areas to complete; a sketch of that arrangement follows below.
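Here is that sketch (read_item() and process() are hypothetical placeholders for your own I/O and processing steps):

    import queue
    import threading

    def read_item():
        pass  # placeholder: blocking read from the network/disk, None at end of stream

    def process(item):
        pass  # placeholder: CPU-side processing of one item

    q = queue.Queue(maxsize=100)

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more data
                break
            process(item)
            q.task_done()

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = read_item()
        if item is None:
            break
        q.put(item)  # hand off; reading and processing now overlap
    q.put(None)      # tell the worker to finish

On one core this only wins when read_item() genuinely blocks on I/O; if both halves are CPU-bound, it just adds overhead, as noted above.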
Increasing the priority of a single-threaded process just gives you more (or bigger) time slices on the one core the process is running on. The core can still only do one thing at a time.
If you spin off a thread to handle the data processing, it can run on a different processor core (assuming a multi-core system), and it and your main thread are actually executing at the same time. Much more efficient.
If you use only one thread your server app will only be able to service one request at a time, no matter what its priority. If you use multiple threads you could service many at the same time.