Can file copying be CPU-bound? - linux

As far as I know, the CPU is usually faster than any I/O device (the HDD, the network, even RAM), so when copying a file the bottleneck is usually I/O (right?).
If under some condition the I/O device is faster than the CPU (as in a virtual machine), is it possible to keep the CPU busy just moving data (from buffer to kernel space, from kernel space to user space)? And does the copy then become CPU-bound?

It depends on the program and the conditions where the program is run.
It is unlikely that a program copying data would be throttled by CPU speed. However, it can happen if, for example, the computer is running other CPU-intensive programs at a higher priority than the program doing the copy.
The most common bottleneck is the speed of the persistent storage medium (e.g. the hard drive).
Next comes the amount of RAM available.
Then the CPU being unavailable.
Only if an I/O device were so fast that it outperformed the CPU could the copy itself become CPU-bound. This is largely hypothetical, though, since the CPU does not usually perform the copy itself but commands other hardware to do so.
In real systems the bandwidth available to I/O devices is far lower than CPU and RAM bandwidth.
If the copy is done efficiently, copying data from RAM to the HDD should not stress the CPU.
Data in RAM (behind the Northbridge) can be copied to the HDD via the Southbridge.
If the copy is done inefficiently, a program could of course read every single byte with the CPU and write it back out.
Furthermore, as one can infer, the answer also depends on the hardware and architecture of the system.
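To make the "efficient" versus "inefficient" distinction above concrete, here is a minimal Python sketch, not taken from the answer. It assumes a Linux system where os.sendfile accepts two regular files; the function names and the 8 MiB test file are made up for illustration. The first function drags every byte through a user-space buffer, the second asks the kernel to move the data, and time.process_time() reports the CPU time each one consumed.

```python
import os
import tempfile
import time


def copy_with_loop(src, dst, chunk_size=64 * 1024):
    """Copy via user space: every byte passes through a buffer owned by this process."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)


def copy_with_sendfile(src, dst):
    """Ask the kernel to move the bytes; fall back to the loop where unsupported."""
    try:
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            remaining = os.fstat(fin.fileno()).st_size
            offset = 0
            while remaining > 0:
                sent = os.sendfile(fout.fileno(), fin.fileno(), offset, remaining)
                if sent == 0:
                    break
                offset += sent
                remaining -= sent
    except (AttributeError, OSError):
        # os.sendfile is missing (e.g. Windows) or rejects file-to-file copies.
        copy_with_loop(src, dst)


def cpu_seconds(copy, src, dst):
    """CPU time (user + system) this process spent inside one copy call."""
    start = time.process_time()
    copy(src, dst)
    return time.process_time() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(8 * 1024 * 1024))  # hypothetical 8 MiB test file
        for copy in (copy_with_loop, copy_with_sendfile):
            used = cpu_seconds(copy, src, os.path.join(tmp, copy.__name__))
            print(f"{copy.__name__}: {used:.4f} s of CPU time")
```

The numbers depend heavily on the machine, the file system and whether the file is already cached, so treat any difference as indicative only.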

Wrong answer, I am afraid, or at least not always correct.
If I copy a folder with some 50,000 files (of different sizes) in Windows Explorer, Task Manager reports that the copy is mostly CPU-bound (i.e. it reports low disk usage and very high CPU usage).


Using loopback for synchronous IPC when using NUMA architecture

(For a Linux platform) Is it feasible (from a performance point of view) to try to communicate (in a synchronous way) via loopback interface between processes on different NUMA nodes?
What about if the processes reside on the same NUMA node?
I know it's possible to bind a process's memory and/or set its CPU affinity to a node (using libnuma). I don't know if this is also true for the network interface.
Later edit: if the loopback interface is just a memory buffer used by the kernel, is there a way to make sure that buffer is on the same NUMA node, so that two processes can communicate without the cross-node overhead?
Network interfaces don't reside on a node; they're a device - virtual or real - shared across the whole machine. The loopback interface is just a memory buffer somewhere or other, and some kernel code. The code that runs to support that device is likely bouncing round the CPU cores, just like any other thread in the system.
You talk of NUMA nodes, and tagged the question with Linux. Linux doesn't run on pure NUMA architectures, it runs on SMP architectures. Modern CPUs from, say, Intel, AMD and ARM all synthesise an SMP hardware environment using separate cores, varying degrees of cache / memory interface unification, and high-speed serial links between cores or CPUs. Effectively, software running on top does not have to see the underlying NUMA architecture; it can behave as if it's running on a classical SMP architecture.
Intel / AMD / everyone else have done this because, back in the day, successful multiple CPU machines really were SMP; they had multiple CPUs all sharing the same memory bus, and had equal access to the RAM at the other end of the bus. Software got written to take advantage of that (Linux, Windows, etc).
Then the CPU manufacturers realised that SMP architectures suck so far as speed improvements are concerned. AMD blinked first, ditched SMP in favour of HyperTransport, and was successful. Intel persisted with pure SMP for longer, but soon gave up too and started using QPI between CPUs.
But to give the old software (Linux, Windows, etc) backward compatibility, the CPU designers had to create a synthetic SMP hardware environment on top of HyperTransport and QPI. In principle they might have, at that point in time, decided that SMP was dead and delivered us pure NUMA architectures. But that would likely have been commercial suicide; it would have taken coordination of the entire hardware and software industries to agree to go that way, and by then it was already far too late to rewrite everything from scratch.
Things like network sockets (including via the loopback interface), pipes and serial ports are not synchronous. They're stream carriers, and the sender and receiver are not synchronised by the act of transferring data. That is, the sender can write() data and think that that has completed, but the data is in reality still stuck in some network buffer somewhere and hasn't yet made it into the read() that the destination process will have to call to receive the data.
What Linux will do with processes and threads is endeavour to run them all at once, up to the limit of the number of CPU cores in the machine. By and large that will result in your processes running simultaneously on separate cores. I think Linux will also use knowledge of which physical CPU's memory holds the bulk of a process's data, and will try to run the process on that CPU; memory latency will be a tiny bit better that way.
If your processes try to communicate via socket, pipe or similar, it results in data being copied out of one process's memory space into a memory buffer controlled by the kernel (that's what write() is doing under the hood), and then being copied out of that into the receiving process's memory space (that's what read() does). Where that intermediate kernel buffer actually is doesn't really matter because the transactions taking place at the microelectronic level (below the SMP level) are pretty much the same regardless. Memory allocations and processes can be bound to specific CPU cores, but you can't influence whereabouts the kernel puts its memory buffers through which the exchanged data must pass.
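The write()-into-kernel-buffer, read()-out-of-it pattern described above can be observed in a few lines of Python. This is a minimal sketch under stated assumptions: it uses socket.socketpair() (a Unix-domain pair on Linux, not the loopback interface itself), a single process plays both sender and receiver, and the payload is assumed small enough to fit in the kernel's socket buffer so that the send does not block. The function name is made up for illustration.

```python
import socket


def send_then_receive(payload):
    """Send on one end, close it, and only afterwards read on the other end."""
    sender, receiver = socket.socketpair()
    try:
        # sendall() returns as soon as the kernel has copied the payload into
        # its own buffer; at this point nobody has called recv() yet.
        sender.sendall(payload)
        sender.close()  # the data outlives the sending end: it sits in the kernel

        received = bytearray()
        while True:
            chunk = receiver.recv(4096)  # second copy: kernel buffer -> our memory
            if not chunk:  # empty result means the sender closed its end
                break
            received.extend(chunk)
        return bytes(received)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    data = send_then_receive(b"hello from the other side")
    print(len(data), "bytes arrived after the sender had already finished")
```

Because the send completes before any receive is issued, the sender learns nothing about when (or whether) the receiver consumed the data; an application that needs synchronous behaviour has to add its own acknowledgement on top.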
Regarding memory and process core affinity: it's really, really hard to do this to any measurable benefit. The OSes are so good nowadays at understanding the behaviour of CPUs that it's almost always best to simply let the OS run your processes on whichever cores it chooses. Companies like Intel make large code contributions to the Linux project, specifically to ensure that Linux does this as well as possible on the latest and greatest chips.
Additions in the light of engaging comments!
By "pure NUMA" I really mean systems where one CPU core cannot directly address memory physically attached to another CPU core. Such systems include Transputers, and even the Cell processor found in the Sony PS3. These aren't SMP, there's nothing in the silicon that unifies the separate memories into a single address space, so the question of cache coherency doesn't come into it.
With Transputer systems the only way to access memory attached to another transputer was to have the application software send the data over via a serial link; what made it CSP was that the sending application would not finish sending until the receiving application had read the last byte.
For the Cell processor, there were 8 maths cores, each with 256 KB of RAM. That was the only RAM the maths cores could address. To use them the application had to move data and code into that 256 KB of RAM, tell the core to run, and then move the results out (possibly back out to main RAM, or onto another maths core).
There are some supercomputers today that aren't dissimilar to this. The K machine (Riken, Kobe in Japan) has an awful lot of cores, a very complex on-chip interconnect fabric, and OpenMPI is used by applications to move data around between nodes; nodes cannot directly address memory on other nodes.
The point is that on the PS3 it was up to application software to decide what data was in what memory and when, whereas modern x86 implementations from Intel and AMD make all data in all memories (no matter if they're shared via an L3 cache or are remote at the other end of a HyperTransport or QPI link) accessible from any core (that's what SMP means, after all).
The all-out performance of code written for the Cell processor was truly astounding for the watts and transistor count. The trouble was that in a world where programmers are trained to write for SMP environments, it takes a brain transplant to get to grips with one that isn't.
Newer languages like Rust and Go have reintroduced the concept of communicating sequential processes, which is all one had with Transputers back in the 1980s, early 1990s. CSP is almost ideal for multicore systems as the hardware does not need to implement an SMP environment. In principle this saves an awful lot of silicon.
CSP implemented on top of today's cache-coherent SMP chips in languages like C generally involves a thread writing data into a buffer, and that being copied into a buffer belonging to another thread. (Rust can do it a little differently because Rust knows about memory ownership, and so can transfer ownership instead of copying memory. I'm not familiar with Go; maybe it can do the same.)
Viewed at the microelectronic level, copying data from one buffer to another is not really any different to what happens if the data is shared by 2 cores instead of copied (especially in AMD's HyperTransport CPUs, where each has its own memory system). To share data, the remote core has to use HyperTransport to request data from another core's memory, plus more traffic to maintain cache coherency. That's about the same amount of HyperTransport traffic as if the data were copied from one core to the other, but in the copying case there's no subsequent cache coherency traffic.

When to consider a Linux kernel overloaded

I am currently working on a Linux system, and yesterday I noticed that the system was slow to answer my HTTP requests. I opened top and found this situation, in which memory seemed to be 95~99% in use.
Since the CPU load seems low and the swap file fairly free, I am wondering when I should consider a Linux system overloaded and when not. I know that Linux handles memory differently, right? Maybe this memory load is unrelated to the poor responsiveness of the HTTP server (I mean, the cause could be in the network layer or elsewhere, and not related to memory at all)?
Thank you.
The term "Linux kernel overloaded" doesn't really match reality. You overload a specific resource: for example, the HDD is overloaded, the CPU is overloaded, or RAM is full and you are swapping.
You should check all of these cases, not just CPU load and memory usage. What about iotop (maybe your HDD is overloaded?) or jnettop (network?).
In your case I suspect you are simply using too much RAM and have started swapping: there is 820 MB in swap already. Swapping means using the swap partition (usually on an HDD, but that depends on your configuration) as a kind of extension of RAM (similar to the Windows pagefile). Since HDDs are vastly slower than RAM, the system takes a big performance hit when this happens.
Another suspicious thing is the CPU usage of 23%. How many cores (including hyper-threading) does your system have? Is it possible that your application is not using threads? In that case a CPU usage of ~25% may actually mean that a single core is running at 100% (overloaded) while the 3 other cores sit idle. You would then have a single-process, single-threaded application saturating one core.
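The arithmetic behind that last suspicion is simple enough to write down. The sketch below is only an illustration: the function names and the 5-point tolerance are assumptions, and it presumes the reported figure is an average over all cores.

```python
def overall_cpu_percent(per_core_percent):
    """Average utilisation across cores, as a whole-machine figure would show it."""
    return sum(per_core_percent) / len(per_core_percent)


def looks_single_threaded(overall_percent, core_count, tolerance=5.0):
    """True if the overall figure is close to what one fully busy core would produce."""
    one_core_share = 100.0 / core_count
    return abs(overall_percent - one_core_share) <= tolerance


if __name__ == "__main__":
    # One saturated core and three idle ones on a 4-core machine.
    print(overall_cpu_percent([100.0, 0.0, 0.0, 0.0]))  # 25.0
    # The 23% from the question is consistent with that picture on 4 cores...
    print(looks_single_threaded(23.0, core_count=4))  # True
    # ...but not on 8 cores, where one busy core would show up as 12.5%.
    print(looks_single_threaded(23.0, core_count=8))  # False
```

A match is only a hint; looking at per-core utilisation directly (for example in top's per-CPU view) is what would confirm it.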

In non-DMA scenario, does a storage device/disk content go to CPU registers first and then to main memory during a disk read?

I am learning computer organization but struggling with the following concept. In non-DMA scenarios, do all disk reads follow the following sequence to get into main memory:
Disk storage surface -> Disk registers -> CPU registers -> Main memory
Similarly for writes, is the sequence:
Main memory -> CPU registers -> Disk registers -> Disk storage surface
(I know that in a DMA scenario, the CPU only initiates the transfer after which the content of the disks are transferred directly to main memory).
If yes, before DMA came along, was the above sequence a serious bottleneck, given that the total capacity of the CPU registers is much smaller than main memory or the disk? Or was it so fast that a human user wouldn't notice in non-DMA modes?
PS: Please bear with my rudimentary terminology, but I hope I conveyed what I want to ask.
Yes, what you describe is what happened in the bad old days with programmed-I/O instead of DMA.
For example, IDE disk-controller hardware used to be less well standardized, so the Linux drivers defaulted to programmed I/O (i.e. a copy loop using x86 IN instructions, since ATA predated memory-mapped I/O registers being common). For decent performance, you had to manually enable DMA in your boot scripts.
But before doing that, you had to check that manually enabling DMA didn't lead to lockups or, far worse, cause data corruption.
Re: memory-mapped files: they have nothing to do with how the data gets from disk into the page cache (or vice versa). mmap() just means your process's address space includes a shared mapping of the same pages that the OS is using to cache the file's contents.
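As a small illustration of the mmap() point, here is a hedged Python sketch using the standard mmap module. The helper name and the temporary file are made up for the example. Note that no read() call copies the data into a separate buffer first: slicing the mapping accesses the mapped pages directly, and how those pages were filled from disk (DMA or programmed I/O) is invisible at this level.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, length):
    """Return the first `length` bytes of a (non-empty) file through a mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            return mapping[:length]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")  # hypothetical file
        with open(path, "wb") as f:
            f.write(b"hello page cache")
        print(read_via_mmap(path, 5))
```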

lock contention in memory allocation - multi-threaded vs. multi-process

We have developed a big C++ application that is running satisfactorily at several sites on big Linux and Solaris boxes (up to 160 CPU cores or even more). It's a heavily multi-threaded (1000+ threads), single-process architecture, consuming huge amounts of memory (200 GB+). We are LD_PRELOADing Google Perftools' tcmalloc (or libumem/mtmalloc on Solaris) to avoid memory allocation performance bottlenecks, with generally good results. However, we are starting to see adverse effects of lock contention during memory allocation/deallocation on some bigger installations, especially after the process has been running for a while (which hints at aging/fragmentation effects in the allocator).
We are considering changing to a multi-process/shared memory architecture (the heavy allocation/deallocation will not happen in shared memory, rather on the regular heap).
So, finally, here's our question: can we assume that the virtual memory manager of modern Linux kernels is capable of efficiently handing out memory to hundreds of concurrent processes? Or do we have to expect running into the same kind of problems with memory allocation contention that we see in our single-process/multi-threading environment? I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager. Anyone have any actual experience or performance data comparing multi-threaded vs. multi-process memory allocation?
"I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager."
There is no reason to expect this. Unless your code is so badly designed that it constantly goes back to the OS to allocate memory, it won't make any significant difference. Your application should only need to go back to the OS's virtual memory manager when it needs more virtual memory, which should not occur significantly once the process reaches its stable size.
If you are constantly allocating and freeing all the way back to the OS, you should stop doing that. If you're not, then you can keep multiple pools of already-allocated memory that can be used by multiple threads without contention. And, as a benefit, your context switches will be cheaper because TLBs don't have to be flushed.
Only if you can't reduce the frequency of address space changes (for example, if you must map and unmap files) or if you have to change other shared resources (like file descriptors) should you look at multiprocess options.
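The "multiple pools of already-allocated memory" idea can be sketched in a few lines. The following Python class is purely illustrative (the application in question is C++, and the class name and buffer size are invented here): each thread keeps its own free list in thread-local storage, so acquiring and releasing a buffer never touches a lock shared with other threads. Allocators such as tcmalloc use per-thread caches in a conceptually similar way.

```python
import threading


class ThreadLocalBufferPool:
    """Hypothetical pool: one private free list of buffers per thread."""

    def __init__(self, buffer_size):
        self._buffer_size = buffer_size
        self._local = threading.local()

    def _free_list(self):
        # Each thread lazily gets its own list; no other thread can see it.
        free = getattr(self._local, "free", None)
        if free is None:
            free = self._local.free = []
        return free

    def acquire(self):
        """Reuse a buffer this thread released earlier, or allocate a fresh one."""
        free = self._free_list()
        return free.pop() if free else bytearray(self._buffer_size)

    def release(self, buf):
        """Return a buffer to the calling thread's own free list."""
        self._free_list().append(buf)


if __name__ == "__main__":
    pool = ThreadLocalBufferPool(buffer_size=4096)
    first = pool.acquire()
    pool.release(first)
    second = pool.acquire()
    print(second is first)  # the same buffer comes back without a new allocation
```

A real pool would also need a policy for buffers released on a different thread than the one that acquired them, and a cap on how much memory each thread may hoard; both are left out of this sketch.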

Limiting RAM usage during performance tests

I have to run some performance tests, to see how my programs work when the system runs out of RAM and the system starts thrashing. Ideally, I would be able to change the amount of RAM used by the system.
I have tried to boot my system (running Ubuntu 10.10) in single user mode with a limited amount of physical memory, but with the parameters I used (max_addr=300M, max_addr=314572800 or mem=300M) the system did not use my swap partition.
Is there a way to limit the amount of RAM used by the total system, while still using swap space?
The point is to measure the total running time of each program as a function of the input size. I am not trying to pinpoint performance problems, I am trying to compare algorithms, which means I need accuracy.
Write a simple C program that allocates a large amount of memory and then, in an infinite loop, keeps accessing that memory at random locations to try to keep it in main memory.
Now run this program (as one process or a few) so that you occupy enough memory to make the process you are testing start thrashing.
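The answer suggests C; the same idea is sketched here in Python to keep it short. Everything in it is an assumption made for illustration: the function names, the 4096-byte page size and the number of touches per round. Writing to one byte per randomly chosen page is what forces the kernel to keep (or bring back) that page in physical memory.

```python
import random

PAGE_SIZE = 4096  # assumed page size; typical on x86 Linux, not guaranteed


def allocate(megabytes):
    """Reserve a zero-filled block of the requested size."""
    return bytearray(megabytes * 1024 * 1024)


def touch_random_pages(buf, touches):
    """Write to one byte in each of `touches` randomly chosen pages."""
    pages = len(buf) // PAGE_SIZE
    for _ in range(touches):
        offset = random.randrange(pages) * PAGE_SIZE
        buf[offset] = (buf[offset] + 1) % 256


def hog(megabytes, rounds=None, touches_per_round=10_000):
    """Hold `megabytes` of memory and keep touching it; rounds=None runs forever."""
    buf = allocate(megabytes)
    done = 0
    while rounds is None or done < rounds:
        touch_random_pages(buf, touches_per_round)
        done += 1
    return len(buf)
```

Calling hog(2048), for instance, from one or more separate Python processes would try to keep about 2 GiB resident each until the processes are killed; the right size depends on how much RAM the machine has and how much pressure the test needs.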
