Can I get I/O information from MPI program

Can I get I/O information from MPI program - io

Recently, I tried to understand I/O performance tuning. I wanted to get useful I/O information, such as I/O read/write size, the number of aggregators, and so on. But I rarely knew about MPI communication. Can I get this information from the MPI program? Or can I get this information from HPC application source code?

Related

Logging all memory accesses of any executable/process in Linux

I have been looking for a way to log all memory accesses of a process/execution in Linux. I know there have been questions asked on this topic previously here like this
Logging memory access footprint of whole system in Linux
But I wanted to know if there is any non-instrumentation tool that performs this activity. I am not looking for QEMU/ VALGRIND for this purpose since it would be a bit slow and I want as little overhead as possible.
I looked at perf mem and PEBS events like cpu/mem-loads/pp for this purpose but I see that they will collect only sampled data and I actually wanted the trace of all the memory accesses without any sampling.
I wanted to know is there any possibility to collect all memory accesses without wasting too much on overhead by using a tool like QEMU. Is there any possibility to use PERF only but without samples so that I get all the memory access data ?
Is there any other tool out there that I am missing ? Or any other strategy that gives me all memory access data ?

It is just impossible both to have fastest possible run of Spec and all memory accesses (or cache misses) traced in this run (using in-system tracers). Do one run for timing and other run (longer,slower), or even recompiled binary for memory access tracing.
You may start from short and simple program (not the ref inputs of recent SpecCPU, or billion mem accesses in your big programs) and use perf linux tool (perf_events) to find acceptable ratio of memory requests recorded to all memory requests. There is perf mem tool or you may try some PEBS-enabled events of memory subsystem. PEBS is enabled by adding :p and :pp suffix to the perf event specifier perf record -e event:pp, where event is one of PEBS events. Also try pmu-tools ocperf.py for easier intel event name encoding and to find PEBS enabled events.
Try to find the real (maximum) overhead with different recording ratios (1% / 10% / 50%) on the memory performance tests. Check worst case of memory recording overhead at left part on the Arithmetic Intensity scale of [Roofline model](https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Typical tests from this part are: STREAM (BLAS1), RandomAccess (GUPS) and memlat are almost SpMV; many real tasks are usually not so left on the scale:
STREAM test (linear access to memory),
RandomAccess (GUPS) test
some memory latency test (memlat of 7z, lat_mem_rd of lmbench).
Do you want to trace every load/store commands or you only want to record requests that missed all (some) caches and were sent to main RAM memory of PC (to L3)?
Why you want no overhead and all memory accesses recorded? It is just impossible as every memory access have tracing of several bytes (the memory address, sometimes: instruction address) to be recorded to the same memory. So, having memory tracing enabled (more than 10% or memory access tracing) clearly will limit available memory bandwidth and the program will run slower. Even 1% tracing can be noted, but it effect (overhead) is smaller.
Your CPU E5-2620 v4 is Broadwell-EP 14nm so it may have also some earlier variant of the Intel PT: https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt https://github.com/01org/processor-trace and especially Andi Kleen's blog on pt: http://halobates.de/blog/p/410 "Cheat sheet for Intel Processor Trace with Linux perf and gdb"
PT support in hardware: Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
PS: Scholars who study SpecCPU for memory access worked with memory access dumps/traces, and dumps were generated slowly:
http://www.bu.edu/barc2015/abstracts/Karsli_BARC_2015.pdf - LLC misses recorded to offline analysis, no timing was recorded from tracing runs
http://users.ece.utexas.edu/~ljohn/teaching/382m-15/reading/gove.pdf - all load/stores instrumented by writing into additional huge tracing buffer to periodic (rare) online aggregation. Such instrumentation is from 2x slow or slower, especially for memory bandwidth / latency limited core.
http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf (by Aamer Jaleel of Intel Corporation, VSSAD) - Pin-based instrumentation - program code was modified and instrumented to write memory access metadata into buffer. Such instrumentation is from 2x slow or slower, especially for memory bandwidth / latency limited core. The paper lists and explains instrumentation overhead and Caveats:
Instrumentation Overhead: Instrumentation involves
injecting extra code dynamically or statically into the
target application. The additional code causes an
application to spend extra time in executing the original
application ... Additionally, for multi-threaded
applications, instrumentation can modify the ordering of
instructions executed between different threads of the
application. As a result, IDS with multi-threaded
applications comes at the lack of some fidelity
Lack of Speculation: Instrumentation only observes
instructions executed on the correct path of execution. As
a result, IDS may not be able to support wrong-path ...
User-level Traffic Only: Current binary instrumentation
tools only support user-level instrumentation. Thus,
applications that are kernel intensive are unsuitable for
user-level IDS.

FreeRTOS vs Linux against single event upsets

I am working on the on-board computer for a CubeSat. Our computer will be vulnerable to radiation, hence single event upsets, e.g. bit flips are likely to occur. Would a lighter, smaller OS like FreeRTOS bring more stability, robustness and a lower probability of failure over a full-blown Linux operating system?

The probability of a bit error in RAM is a function of time, memory size and radiation density, so a larger memory has a greater probability, and you can fit a FreeRTOS system in much less memory (like 10kb instead of 4Mb). However the usage rate of the smaller memory is likely much higher - i.e. in a FreeRTOS application, most of the code and data are accessed relatively frequently, while in a Linux deployment, much of it is redundant and if corrupted may never be accessed in any case.
However the question makes little sense for a number of reasons, such as:
The effect of a bit-flip event is entirely non-deterministic, any single event it may be benign or catastrophic. It is impossible to say that a system can tolerate 1 error when you don't know when or where the error will occur.
If your system can be implemented on FreeRTOS, why would you even consider Linux? They are chalk and cheese. If you need the extensive networking, filesystem, memory management, POSIX API and device support etc. provided by Linux, FreeRTOS is not suited to your application in any case, as you would have to add all that yourself from your own or additional third-party code. FreeRTOS is only a scheduling kernel, with threading, synchronisation and IPC support and little else. Conversely if you need hard real-time deterministic behaviour, Linux is unsuited to your application.
Where you might benefit from using an RTOS kernel like FreeRTOS is that it will execute from ROM which may be less prone to the bit-flipping cosmic ray issue - (although the availability of ECC/radiation hardened Flash memory may indicate otherwise). You still need RAM for R/W data, but at least the code itself will be robust. A typical FreeRTOS system might run in SRAM (possibly in on-chip RAM on a microcontroller) - I don't know whether low density SRAM is less prone to bit-flipping than high-density SDRAM, but I am willing to believe it is. It is also possible to source radiation hardened SRAM in any case.
The solution for a system using SDRAM in such an environment is to use ECC RAM which may largely overcome the problem of data corruption from radiation and non-deterministic system behaviour. However I would not imagine that even that would be sufficient for space or high-atmosphere applications.
In short the solution is not in the software, it has to be in the hardware, and the lengths you need to go to will depend on the radiation environment your system will be subjected to. However the selection of a small RTOS kernel allows the selection of hardware to be potentially much wider since it will run on a much wider range of architectures in much smaller memory, perform deterministically, respond to events in fewer cycles and is ROMable.

Linux: Sample programs to load CPU, Memory and Hard disk

I am doing a perform analysis on various Linux distributions. I want to measure the performance of Linux distributions in the below scenarios
1) High CPU utilization
2) High Memory utilization
3) High IO utilization
4) High CPU IO wait
I want to write C programs in order to achieve each of the scenarios, so that I can
run those programs individually or in combination to measure the performance.
I wrote some sample c programs to load CPU , but I need c programs to handle the other scenarios.
Any programming help will be greatly appreciated.

I am doing a perform analysis on various Linux distributions.
Unless you are very careful about what you are doing, you are unlikely to find the subtle performance differences between kernel versions and distributions in a meaningful way. From the level of just running programs, there is fairly little difference between distributions, except which Linux kernel version they use.
2) High Memory utilization
Your program needs to malloc() a bunch of memory - and then write to it. Some Linux distributions by default overcommit memory. Simply calling malloc() to create an array and writing to each element should be sufficient.
3) High IO utilization
Consider using fio instead of writing your own code here. If you do need to write your own code, then you'll need to decide a couple of things:
Random or sequential IO? It's less important with SSDs, but on magnetic drives the two cases have very different performance characteristics.
Reads or writes? Different storage subsystems may perform very differently with reads and writes.
Direct IO or buffered IO? Are you wanting to stress the whole end-to-end IO subsystem, or just the underlying storage. Flags like O_DIRECT and O_SYNC substantially change the way the kernel handles IO.
File system IO or block IO? Are you interested in testing the filesystem's performance with creating and deleting files, or just doing IO to a block file?
The simplest code you can write here just uses open() to create a large file, then uses rand() together with pread() and pwrite() to do random block IO in that file. If you want to test the filesystem, you need to call open() and unlink() a bunch of times.
IO benchmarking is a very subtle topic, which is why I would encourage you to stick with a well-understood tool like fio.
4) High CPU IO wait
Combining your IO loader with your CPU loader should lead to high IO wait. If you're stressing any IO subsystem, you'll get IO wait.

Programmatic resource monitoring per process in Linux

I want to know if there is an efficient solution to monitor a process resource consumption (cpu, memory, network bandwidth) in Linux. I want to write a daemon in C++ that does this monitoring for some given PIDs. From what I know, the classic solution is to periodically read the information from /proc, but this doesn't seem the most efficient way (it involves many system calls). For example to monitor the memory usage every second for 50 processes, I have to open, read and close 50 files (that means 150 system calls) every second from /proc. Not to mention the parsing involved when reading these files.
Another problem is the network bandwidth consumption: this cannot be easily computed for each process I want to monitor. The solution adopted by NetHogs involves a pretty high overhead in my opinion: it captures and analyzes every packet using libpcap, then for each packet the local port is determined and searched in /proc to find the corresponding process.
Do you know if there are more efficient alternatives to these methods presented or any libraries that deal with this problems?

/usr/src/linux/Documentation/accounting/taskstats.txt
Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.
Taskstats was designed for the following benefits:
efficiently provide statistics during lifetime of a task and on its exit
unified interface for multiple accounting subsystems
extensibility for use by future accounting patches
This interface lets you monitor CPU, memory, and I/O usage by processes of your choosing. You only need to set up and receive messages on a single socket.
This does not differentiate (for example) disk I/O versus network I/O. If that's important to you, you might go with a LD_PRELOAD interception library that tracks socket operations. Assuming that you can control the startup of the programs you wish to observe and that they won't do trickery behind your back, of course.
I can't think of any light-weight solutions if those still fail, but linux-audit can globally trace syscalls, which seems a fair bit more direct than re-capturing and analyzing your own network traffic.

Take a look at the linux trace toolkit (LTTng). It inserts tracepoints into the kernel and has some post processing to get some of the kind of statistics you're asking about. The trace files get large if you capture everything, but you can keep things manageable if you limit the types of events you arm.
http://lttng.org for more info...

Regarding network bandwidth: This Superuser answer describes processing /proc/net/tcp to collect network bandwidth usage.
I know that iptables can be used to do network accounting (see, e.g., LWN's, Linux.com's, or Shorewall's articles), but I don't see any practical way to do accounting that on a per-process basis.

i just came across this as i was looking for answers to the same thing. just a note - when using /proc filesystem, you do not have to close the file after each read. you can keep the file open and each time you do a read you will get new statistics... so, you shouldn't have the overhead of opening and closing each time you want to get the stats... i have this working in javascript on node.js if you want an example...

Reading /proc is ultimately the only way to monitor CPU and memory usage by individual processes without injecting your code into the kernel. If you look at top(1), you'll see reading lots of files in /proc is exactly what it does every second. All user-mode tools and libraries that retrive this sort of information have to get it from /proc.
As with network bandwidth usage, there are several approaches, which all more or less boil down to capturing all network traffic in and out of the box. You can also consider writing a special netfilter (iptables) module that does exactly the type of counting you need without the overhead of traffic capturing.

Kernel Scheduling for 1024 CPUs

Azul Systems has an appliance that supports thousands of cache coherent CPUs. I would love insight into what changes would need to occur to an operating system in order to schedule thousands of simultaneously running threads.

Scheduling thousands of threads is not a big deal, but scheduling them on hundreds of CPUs is. What you need, first and foremost, is very fine-grained locking, or, better yet, lock-free data structures and algorithms. You just can't afford to let 200 CPUs waiting while one CPU executes a critical section.

You're asking for possible changes to the OS, so I presume there's a significant engineering team behind this effort.
There are also a few pieces of clarififying info that would help define the problem parameters:
How much IPC (inter process communication) do you need?
Do they really have to be threads, or can they be processes?
If they're processes, is it okay if the have to talk to each other through sockets, and not by using shared memory?
What is the memory architecture? Are you straight SMP with 1024 cores, or is there some other NUMA (Non-Uniform Memory Architecture) or MMP going on here? What are your page tables like?
Knowing only the very smallest of info about Azul systems, I would guess that you have very little IPC, and that a simple "run one kernel per core" model might actually work out just fine. If processes need to talk to each other, then they can create sockets and transfer data that way. Does your hardware support this model? (You would likely end up needing one IP address per core as well, and at 1024 IP addrs, this might be troublesome, although they could all be NAT'd, and maybe it's not such a big deal). If course, this model would lead to some inefficiencies, like extra page tables, and a fair bit of RAM overhead, and may even not be supported by your hardware system.
Even if "1 kernel per core" doesn't work, you could likely run 1024/8 kernels, and be just fine, letting each kernel control 8 physical CPUs.
That said, if you wanted to run 1 thread per core in a traditional SMP machine with 1024 cores (and only a few physical CPUs) then I would expect that the old fashioned O(1) scheduler is what you'd want. It's likely that your CPU[0] will end up nearly 100% in kernel and doing interrupt handling, but that's just fine for this use case, unless you need more than 1 core to handle your workload.

Making Linux scale has been a long and ongoing project. The first multiprocessor capable Linux kernel had a single lock protecting the entire kernel (the Big Kernel Lock, BKL), which was simple, but limited scalability.
Subsequently the locking has been made more fine-grained, i.e. there are many locks (thousands?), each covering only a small portion of data. However, there are limits to how far this can be taken, as fine-grained locking tends to be complicated, and the locking overhead starts to eat up the performance benefit, especially considering that most multi-CPU Linux systems have relatively few CPU's.
Another thing, is that as far as possible the kernel uses per-cpu data structures. This is very important, as it avoids the cache coherency performance issues with shared data, and of course there is no locking overhead. E.g. every CPU runs its own process scheduler, requiring only occasional global synchronization.
Also, some algorithms are chosen with scalability in mind. E.g. some read-mostly data is protected by Read-Copy-Update (RCU) instead of traditional mutexes; this allows readers to proceed during a concurrent update.
As for memory, Linux tries hard to allocate memory from the same NUMA node as where the process is running. This provides better memory bandwidth and latency for the applications.

My uneducated guess would be that there is a run-queue per processor and a work-stealing algorithm when a processor is idle. I could see this working in an M:N model, where there is a single process per cpu and light-weight processes as the work items. This would then feel similar to a work-stealing threadpool, such as the one in Java-7's fork-join library.
If you really want to know, go pick up Solaris Internals or dig into the Solaris kernel code. I'm still reading Design & Impl of FreeBSD, with Solaris Internals being the next on my list, so all I can do is make wild guesses atm.

I am pretty sure that the SGI Altix we have at work, (which does ccNUMA) uses special hardware for cache coherency.
There is a huge overhead connected to hold 4mb cache per core coherent. It's unlikely to happen in software only.
in an array of 256 cpus you would need 768mb ram just to hold the cache-invalidation bits.
12mb cache / 128 bytes per cache line * 256² cores.

Modifying the OS is one thing, but using unchanged application code is a waste of hardware. When going over some limit (depending on the hardware), the effort to keep coherency and synchronization in order to execute generic code is simply too much. You can do it, but it will be very expensive.
From the OS side you'll need complex affinity model, i.e. not to jump CPUs just because yours is busy. Scheduling threads based on hardware topology - cooperating threads on CPUs that are "close" to minimize penalties. Simple work stealing is not a good solution, you must consider topology. One solution is hierarchical work stealing - steal work by distance, divide topology to sectors and try to steal from closest first.
Touching a bit the lock issue; you'll still use spin-locks nd such, but using totally different implementations. This is probably the most patented field in CS these days.
But, again, you will need to program specifically for such massive scale. Or you'll simply under-use it. No automatic "parallelizers" will do it for you.

The easiest way to do this is to bind each process/thread to a few CPUS, and then only those CPUs would have to compete for a lock on that thread. Obviously, there would need to be some way to move threads around to even out the load, but on a NUMA architecture, you have to minimize this as much as possible.

Even on dual-core intel systems, I'm pretty sure that Linux can already handle "Thousands" of threads with native posix threads.
(Glibc and the kernel both need to be configured to support this, however, but I believe most systems these days have that by default now).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string