Can heavy disk IO cause non-disk-IO threads to slow down?

I am trying to understand the performance ramifications for a system that is under extremely heavy disk IO load. Let's say the system is Linux-based.
Setup
Suppose I have three processes A, B, and C running on a Linux OS with a single disk.
Process A is just writing data to disk. For the sake of this thought experiment, process A is a C program that uses the POSIX open and write system calls. Assume it is writing so much that it causes heavy disk usage.
Process B is only making socket calls to some webserver, for example fetching the contents of google.com. The system calls involved include socket, connect, and write (note that write is used here on a socket file descriptor, not a file on disk).
Process C has two threads repeatedly locking the same mutex via pthreads calls.
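
For concreteness, here is a rough sketch of the three workloads in one listing. The question posits separate C programs; this Go version is purely illustrative, and every path and constant in it is made up.

package main

import (
	"net"
	"os"
	"sync"
)

// Process A: saturate the disk with a tight write loop.
func heavyWriter() {
	f, err := os.OpenFile("/tmp/junk", os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	buf := make([]byte, 1<<20) // 1 MiB per write, forever
	for {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
	}
}

// Process B: network-only work; write() goes to a socket, not a disk file.
func socketClient() {
	conn, err := net.Dial("tcp", "google.com:80")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.Write([]byte("GET / HTTP/1.0\r\nHost: google.com\r\n\r\n"))
}

// Process C: two threads contending on one mutex, no IO at all.
func contendMutex() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000000; j++ {
				mu.Lock()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
}

func main() {
	// In the thought experiment these are three separate processes;
	// here they are started side by side just to keep the sketch short.
	go heavyWriter()
	go socketClient()
	contendMutex()
}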
Performance Impact of Disk IO Contention
What will the performance impact be on processes B and C, given that process A is doing very heavy IO?
Will it take longer for process B to make its socket connection and perform its socket-related system calls, compared to a system that is not under heavy IO load?
Will the threads in process C take longer to acquire the mutex while the system is under heavy IO load, compared to no IO load?

Related

Multithreading does not help for IO-intensive tasks?

I need to copy a set of files, with sizes ranging from 1 MB to 700 MB. After copying each file, I need to validate its checksum against the corresponding entry in md5sum.txt.
I wanted to optimize this task, so I evaluated the performance of splitting the load among multiple threads. The results were not as expected: I assumed the time taken for copying and validation would decrease as the number of threads increased, but it actually increased.
I modified the ThreadPool source code shared in https://stackoverflow.com/a/22285532/1568395 to implement the thread pool.
The source code for the application can be found at https://github.com/saai63/ThreadPool
The results for various numbers of threads are not reproduced here; as described above, the time increased as more threads were added.
From my reading, the probable reason is that all the tasks are IO-bound, so all of the threads block on IO and cannot actually run in parallel, since the shared resource here is a single HDD. I also understand that the HDD controller tries to optimize disk access by reducing seek time. Disks love sequential access patterns, and concurrent accesses disrupt this pattern, hence the delay for large files.
Is this the only reason for the delay, or are there other factors? Why does the time increase as the number of threads increases?
IO is always much slower than the CPU. When multiple threads try to read from an IO device, what they usually achieve is a "bull rush" to the device that increases the "randomness" of the IO operations, making everything slower. Fewer threads have a greater chance of sequential operations, which are notoriously faster.
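
A crude way to see this effect for yourself (just a sketch, not a rigorous benchmark: "bigfile" is a placeholder, and you should drop the page cache between runs, e.g. with echo 3 > /proc/sys/vm/drop_caches, for honest numbers):

package main

import (
	"fmt"
	"os"
	"time"
)

const block = 1 << 20 // 1 MiB per read

// sequential reads the whole file front to back in a single goroutine.
func sequential(f *os.File, size int64) {
	buf := make([]byte, block)
	for off := int64(0); off < size; off += block {
		f.ReadAt(buf, off) // a short read at EOF is fine for this sketch
	}
}

// striped gives each of n goroutines its own contiguous region; each reader
// is individually sequential, but the disk sees their requests interleaved.
func striped(f *os.File, size int64, n int) {
	done := make(chan struct{})
	chunk := size / int64(n)
	for i := 0; i < n; i++ {
		go func(start int64) {
			buf := make([]byte, block)
			for off := start; off < start+chunk; off += block {
				f.ReadAt(buf, off)
			}
			done <- struct{}{}
		}(int64(i) * chunk)
	}
	for i := 0; i < n; i++ {
		<-done
	}
}

func main() {
	f, err := os.Open("bigfile") // placeholder name
	if err != nil {
		panic(err)
	}
	defer f.Close()
	st, _ := f.Stat()

	t := time.Now()
	sequential(f, st.Size())
	fmt.Println("1 sequential reader:", time.Since(t))

	t = time.Now()
	striped(f, st.Size(), 8)
	fmt.Println("8 striped readers:  ", time.Since(t))
}

On a spinning disk the striped version typically loses badly; on an SSD the gap shrinks or may even reverse.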
With multithreading you share the CPU among threads: the CPU is switched to another thread whenever the running thread enters some sort of waiting state.
Here you have an IO-bound task, and there is little point in making your program multithreaded, because all of the threads will be relying on a single IO device.
Even if you implement a multi-process solution (multiple processes on the same node), all the processes will be waiting on the same IO device and it won't give any performance improvement.
One solution would be to build some sort of multi-node setup with a shared disk that supports simultaneous multi-client access.
With this kind of approach you can divide your task among multiple nodes that access the same disk and perform the operation.
Edit:
I think the increase in time is because of the time the operating system takes to service multiple threads.
Switching the CPU and IO devices among threads takes more time as you increase the number of threads: a context switch is a compute-intensive task in itself, and you also lose CPU/IO cache performance every time you switch between threads.
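
In that spirit, here is a minimal single-threaded sketch of the sequential approach favoured above. It also folds the checksum into the copy pass via the standard library's io.MultiWriter, so each file is read from disk exactly once; the function name and file names are placeholders.

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
)

// copyWithMD5 copies src to dst and returns the MD5 of the data, computed
// during the same sequential pass over the file.
func copyWithMD5(src, dst string) (string, error) {
	in, err := os.Open(src)
	if err != nil {
		return "", err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return "", err
	}
	defer out.Close()

	h := md5.New()
	// Every block read from src is written both to the copy and to the
	// hash, so there is no second read pass to validate the checksum.
	if _, err := io.Copy(io.MultiWriter(out, h), in); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	sum, err := copyWithMD5("src.bin", "dst.bin") // placeholder names
	if err != nil {
		panic(err)
	}
	fmt.Println(sum) // compare against the matching entry in md5sum.txt
}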

There are many threads reserved while a golang application is running

The golang application is a tool that receives files by invoking a C library, saves them to disk, and reports the transfer state to a monitoring service over HTTP.
After a few transfers, I found about 70+ threads in existence alongside only a few goroutines.
I checked the C and Go source code and found no thread or goroutine leaks.
I used "dlv" to debug the application; here is the stack of one such thread:
(dlv) bt
0 0x000000000046df03 in runtime.futex
   at /home/vagrant/resource/go/src/runtime/sys_linux_amd64.s:388
1 0x0000000000437e92 in runtime.futexsleep
   at /home/vagrant/resource/go/src/runtime/os_linux.go:45
2 0x000000000041e042 in runtime.notesleep
   at /home/vagrant/resource/go/src/runtime/lock_futex.go:145
3 0x000000000044036d in runtime.stopm
   at /home/vagrant/resource/go/src/runtime/proc.go:1594
4 0x0000000000441178 in runtime.findrunnable
   at /home/vagrant/resource/go/src/runtime/proc.go:2021
5 0x0000000000441cec in runtime.schedule
   at /home/vagrant/resource/go/src/runtime/proc.go:2120
6 0x0000000000442063 in runtime.park_m
   at /home/vagrant/resource/go/src/runtime/proc.go:2183
7 0x0000000000469f1b in runtime.mcall
   at /home/vagrant/resource/go/src/runtime/asm_amd64.s:240
I don't know where these threads come from; maybe they are a thread pool of the golang runtime?
Could anyone look at this? Thank you very much!
The problem
The application calls into a C library to receive files and, after a few transfers, has about 70+ OS threads alive alongside only a few goroutines.
The cause
Each call into C (via cgo, or a syscall on Windows, etc.) is not really different from performing an OS system call as far as the Go scheduler is concerned.
What happens is this:
When a goroutine is being executed, it runs on an OS thread (this is sort of obvious, I fathom).
When it performs a syscall or calls C, that goroutine blocks (stops executing Go code).
The Go runtime scheduler watches the goroutines which got blocked; if at least a single "scheduler tick" (which currently — in Go 1.8 and 1.9 — is 20 µs) passes, the goroutine is still blocked, and there are other runnable goroutines, the scheduler creates another OS thread so the other goroutines can continue execution.
This behaviour might appear counter-intuitive at first, but without it, on, say, a two-CPU machine, you would only need to call two syscalls (such as reading or writing a file) in parallel from any two goroutines to block the rest of the active goroutines from doing their work.
In other words, the scheduler tries to keep up with Go's promise of always having up to GOMAXPROCS goroutines running if there are goroutines which want to run, where GOMAXPROCS defaults to the number of CPUs (cores) of the machine.
So, what happens is that if you have a reasonably high churn of C calls which complete slower than that single scheduler tick, you will have a growing pool of allocated OS threads.
Note that this is not bad in itself: sure, you will be allocating resources (on a typical commodity OS each thread has some 8 MiB of stack reserved, plus some bookkeeping data structures internal to the OS), but they are not wasted: these threads will get reused as soon as they are needed. Say, your next burst of such C calls will reuse the allocated threads.
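
You can watch the thread pool grow with a toy program (a Linux-only sketch; raw blocking syscalls stand in for slow C calls, and the threadcreate profile serves as a rough counter of created threads):

package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A raw blocking syscall pins this goroutine to its OS
			// thread for two seconds (time.Sleep would merely park
			// the goroutine and would not show the effect).
			ts := syscall.Timespec{Sec: 2}
			syscall.Nanosleep(&ts, nil)
		}()
	}
	wg.Wait()
	// Roughly the number of OS threads the runtime had to create.
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}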
The solution
Still, if you'd like to prevent that from happening, the common approach is to reasonably serialize your C calls.
A typical way to do this is to have a single "worker" goroutine which receives "tasks" — in the form of values of some type, usually a custom type created by you — over a channel and sends the results of their execution over another channel. The input channel may be buffered, effectively turning it into a queue.
If you'd still like to parallelize that work, you can have a pool of worker goroutines, all reading from the single input channel and writing to the single output channel.
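
A minimal sketch of that pattern; the task and result types, the callC stand-in, and the buffer sizes are all placeholders for whatever the real application uses:

package main

import "fmt"

type task struct{ path string }
type result struct {
	path string
	err  error
}

// callC stands in for the real cgo call.
func callC(t task) result {
	return result{path: t.path}
}

func main() {
	tasks := make(chan task, 64) // buffered: effectively a queue
	results := make(chan result)

	// A small, fixed pool of workers bounds how many C calls are in
	// flight at once, and hence how many OS threads they can occupy.
	const workers = 4
	for i := 0; i < workers; i++ {
		go func() {
			for t := range tasks {
				results <- callC(t)
			}
		}()
	}

	go func() {
		for _, p := range []string{"a", "b", "c"} {
			tasks <- task{path: p}
		}
		close(tasks)
	}()

	for i := 0; i < 3; i++ {
		r := <-results
		fmt.Println("done:", r.path, r.err)
	}
}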
But note that if those C calls spend most of their time doing disk I/O, and the files they read/write live on a filesystem backed by a single medium, you usually won't gain much from parallelizing unless that medium is blazingly fast — such as an SSD or an in-memory (RAM) disk.
So consider all the options and think through your design.

Uninterruptible write in Linux

According to an answer to this question (Why doing I/O in Linux is uninterruptible?), I/O on Linux is uninterruptible (the process sleeps in the uninterruptible state). But if I start a process, say a large dd on a file, and while it is running I forcefully unmount the filesystem that the file is on, the process gets killed. Ideally it should be in a hung state, because it is sleeping and in the uninterruptible (D) state.
"Uninterruptible" applies to the low-level read/write operations handled by the kernel. In C programming, these correspond broadly to the read() and write() calls in the C library. That a utility can be interrupted does not say much about whether I/O operations can be interrupted, because a single file operation in a utility might correspond to many low-level I/O operations.
In the case of dd, the default transfer block size is 512 bytes, so copying a large file might consist of many I/O operations. dd can be interrupted between these operations. I would expect the same to apply to most utilities that operate on files. If you can force them to work with huge data blocks (e.g., specify a gigabyte-size argument for bs= in dd) then you might be able to see that low-level I/O operations are uninterruptible.
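
To make the distinction concrete, here is a dd-like copy loop sketched in Go. Each individual write is the unit the kernel completes; the process only acts on a signal between blocks, which is why dd with a small bs= appears freely interruptible.

package main

import (
	"io"
	"os"
	"os/signal"
)

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)

	buf := make([]byte, 512) // dd's default block size
	for {
		// The signal is acted upon *between* IO operations.
		select {
		case <-sig:
			os.Exit(1)
		default:
		}
		n, err := os.Stdin.Read(buf)
		if n > 0 {
			os.Stdout.Write(buf[:n]) // one low-level write per block
		}
		if err == io.EOF {
			return
		}
		if err != nil {
			panic(err)
		}
	}
}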

Can other processes run during memory paging?

First off, take a single processor system with multiple processes running in pseudo-parallel. When a process triggers a page fault, will this force the CPU to stop executing all programs until the page is loaded from disk?
If so, does this change on a multi-core or multiprocessor system, or can the other processes continue to read and write to memory while the page fault is dealt with?
Thanks!
First, scheduling does not operate on processes but on threads. A page fault only suspends the thread incurring the fault (on Linux and Windows). That thread is descheduled, and the CPU is free to do other work.
At the level where the OS interfaces with the hardware, there is no synchronous IO anyway. It does not exist (at least with modern hardware). The OS does not sit in a tight spin loop waiting for the hardware to signal IO completion. Instead, the thread is descheduled until the IO completes (or the respective wait handle becomes signaled).
Yes, this is not a problem at all. Nobody in their right mind designs a multi-process OS that's unable to run multiple processes, nor would they arbitrarily block process A because B is waiting on a disk I/O.

What are the thread limitations when working on Linux compared to processes for network/IO-bound apps?

I've heard that under Linux on a multicore server it is impossible to reach top performance with just one process and multiple threads, because Linux has some limitations on IO, so that one process with 8 threads on an 8-core server might be slower than 8 processes.
Any comments? Are there other limitations which might slow the application down?
The application is a network C++ application, serving hundreds of clients, with some disk IO.
Update: I am concerned that there are more IO-related issues besides the locking I implement myself... Aren't there any issues with doing simultaneous network/disk IO in several threads?
Drawbacks of Threads
Threads:
Serialize on memory operations. That is, the kernel, and in turn the MMU, must service operations such as mmap() that perform page allocations.
Share the same file descriptor table. There is locking involved in making changes to and performing lookups in this table, which stores things like file offsets and other flags. Every system call that uses this table, such as open(), accept(), or fcntl(), must lock it to translate an fd to an internal file handle, and again when making changes.
Share some scheduling attributes. Processes are constantly evaluated to determine the load they're putting on the system and scheduled accordingly. Lots of threads imply a higher CPU load, which the scheduler typically dislikes, and it will increase the response time on events for that process (such as reading incoming data on a socket).
May share some writable memory. Any memory written to by multiple threads (especially slow if it requires fancy locking) will generate all kinds of cache-contention and convoying issues. For example, heap operations such as malloc() and free() operate on a global data structure (which can to some degree be worked around). There are other global structures too.
Share credentials; this might be an issue for service-type processes.
Share signal handling; these will interrupt the entire process while they're handled.
Processes or Threads?
If you want to make debugging easier, use threads.
If you are on Windows, use threads. (Processes are extremely heavyweight in Windows).
If stability is a huge concern, try to use processes. (One SIGSEGV/PIPE is all it takes...).
If threads aren't available, use processes. (Not so common now, but it did happen).
If your threads share resources that can't be used from multiple processes, use threads. (Or provide an IPC mechanism to allow communicating with the "owner" thread of the resource.)
If you use resources that are only available on a one-per-process basis (and you need one per context), obviously use processes.
If your processing contexts share absolutely nothing (such as a socket server that spawns and forgets connections as it accept()s them) and CPU is a bottleneck, use processes and single-threaded runtimes (which are devoid of all kinds of intense locking, such as on the heap and elsewhere).
One of the biggest differences between threads and processes is this: threads use software constructs to protect data structures; processes use hardware (which is significantly faster).
Links
pthreads(7)
About Processes and Threads (MSDN)
Threads vs. Processes
It really should make no difference; it is probably a question of design.
A multi-process app may have to do less locking but may use more memory, and sharing data between processes may be harder.
On the other hand, multi-process can be more robust: you can call exit() and quit a child safely, mostly without affecting the others.
It depends on how interdependent the clients are. I usually recommend the simplest solution.
