I am doing a performance analysis on various Linux distributions. I want to measure their performance in the following scenarios:
1) High CPU utilization
2) High Memory utilization
3) High IO utilization
4) High CPU IO wait
I want to write C programs to exercise each of the scenarios, so that I can
run those programs individually or in combination to measure the performance.
I wrote some sample C programs to load the CPU, but I need C programs to handle the other scenarios.
Any programming help will be greatly appreciated.
I am doing a performance analysis on various Linux distributions.
Unless you are very careful about what you are doing, you are unlikely to find the subtle performance differences between kernel versions and distributions in a meaningful way. From the level of just running programs, there is fairly little difference between distributions, except which Linux kernel version they use.
2) High Memory utilization
Your program needs to malloc() a bunch of memory and then actually write to it. Many Linux distributions overcommit memory by default, so pages are not physically allocated until they are touched. Simply calling malloc() to create a large array and writing to each element should be sufficient.
3) High IO utilization
Consider using fio instead of writing your own code here. If you do need to write your own code, then you'll need to decide a couple of things:
Random or sequential IO? It's less important with SSDs, but on magnetic drives the two cases have very different performance characteristics.
Reads or writes? Different storage subsystems may perform very differently with reads and writes.
Direct IO or buffered IO? Do you want to stress the whole end-to-end IO subsystem, or just the underlying storage? Flags like O_DIRECT and O_SYNC substantially change the way the kernel handles IO.
File system IO or block IO? Are you interested in testing the filesystem's performance with creating and deleting files, or just doing IO to a block device?
The simplest code you can write here just uses open() to create a large file, then uses rand() together with pread() and pwrite() to do random block IO in that file. If you want to test the filesystem, you need to call open() and unlink() a bunch of times.
IO benchmarking is a very subtle topic, which is why I would encourage you to stick with a well-understood tool like fio.
4) High CPU IO wait
Combining your IO loader with your CPU loader should lead to high IO wait. If you're stressing any IO subsystem, you'll get IO wait.
Related
I am currently working on a Linux system, and yesterday I noticed that the system was slow answering my HTTP requests. I opened top and found this kind of situation, in which memory seemed to be 95~99% used.
Since the CPU load seems to be low and the swap file quite free, I am wondering when I should consider a Linux system overloaded and when not. I know that Linux has a different memory handling system, right? Maybe this memory load is not related to the poor responsiveness of the HTTP server (it could be related to the network layer or whatever, and not to memory at all)?
Thank you.
The term "overloaded Linux system" is a little bit misaligned with reality. You can overload a specific resource: the HDD is overloaded, the CPU is overloaded, RAM is full and you are swapping.
You should check all the cases, not just CPU load and memory usage. What about iotop (maybe your HDD is overloaded?) or jnettop (network?).
In your case I suspect you simply use too much RAM and have started swapping (820 MB in swap already). Swapping means using the swap partition (usually on the HDD, but it depends on your configuration) as a kind of extension of RAM (similar to the Windows pagefile). But since HDDs are orders of magnitude slower than RAM, the system takes a big performance hit in this case.
Another suspicious thing is the CPU usage of 23%. How many cores (including hyperthreading) does your system have? Is it possible that your application is not using threads? A CPU usage of ~25% could mean that a single core is running at 100% (overloaded) while the other three cores are idle (nothing to do), i.e. a single-process/single-thread application saturating one core.
I was trying to get performance numbers (simple 4K random read) using the fio tool with ioengine set to libaio.
I observed that when direct IO was disabled (direct=0), the IOPS fell drastically; when direct=1 was set, the IOPS were 50 times better!
Setup: fio run from a Linux client connected to a PCIe-based appliance over Fibre Channel.
Here is a snippet from my fio config file:
[global]
filename=/dev/dm-30
size=10G
runtime=300
time_based
group_reporting
[test]
rw=randread
bs=4k
iodepth=16
runtime=300
ioengine=libaio
refill_buffers
ioscheduler=noop
#direct=1
With this setup I observed the IOPS to be around 8000, and when I enabled direct=1 in the config file shown above, the IOPS jumped to 250K (which is realistic for the setup I am using)!
So my question is: does using buffered I/O with the libaio engine have any issues? Is it mandatory that if we use libaio, we stick to direct IO?
Per the docs on Kernel Asynchronous I/O (AIO) Support for Linux:
What Does Not Work?
AIO read and write on files opened without O_DIRECT (i.e. normal buffered filesystem AIO). On ext2, ext3, jfs, xfs and nfs, these do not return an explicit error, but quietly default to synchronous or rather non-AIO behaviour (i.e io_submit waits for I/O to complete in these cases). For most other filesystems, -EINVAL is reported.
In short, if you don't use O_DIRECT, AIO still "works" for many of the most common file systems, but becomes a slow form of synchronous I/O (you might as well have just used read/write and saved yourself a few system calls). The massive performance increase is the result of actually benefiting from asynchronous behavior.
So to answer the question in your title: Yes, libaio should only be used with unbuffered/O_DIRECT file descriptors if you expect to derive any benefit from it.
Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1TB+) mmapped file where I needed to read 1000+ chunks of about 1KB, each time in new, unpredictable places.
In the early days, when parameters like Wikipedia: Hard disk drive performance characteristics → Seek time were very expensive and thus very important, database vendors paid attention to the on-disk data representation and layout, as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with appearance of Solid-state drives (SSD) where the seek time is 0 (or at least constant) as there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, all big players on it.
So how to schedule your accesses optimally depends a lot on the use case (1000 concurrent hits is not a sufficient problem description). Buying some RAM is one of the options, and "how can the programmer communicate with the OS" should be one of the last questions, not the first.
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay on caches and later be committed to your storage devices when your OS sees fit.
That said, there are always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads because cache contention will make things even slower than one thread.
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) and perhaps (in a separate thread) readahead(2). If, instead of read(2)-ing, you use mmap(2) to map some file portion into virtual memory, you might also use madvise(2).
Of course, the file system does not usually guarantee that a sequential portion of a file is physically sequentially located on the disk (and even the disk firmware might reorder sectors). See picture in Ext2 wikipage, also relevant for Ext4. Some file systems might be better in that respect, and you could tune their block size (at mkfs time).
I would not recommend having thousands of threads (only at most a few dozens).
At last, it might be worth buying an SSD or some more RAM (for the file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed file library like GDBM or a database library like Sqlite (or a real database like PostgreSQL) might be worthwhile! Perhaps having fewer but bigger files could help.
BTW, you are mmap-ing and reading small chunks of 1 KB (smaller than the 4 KB page size). You could use madvise (if possible, in advance), but you should try to read larger chunks, since every file access will bring in at least a whole page.
You really should benchmark!
Based on your experience, have you gained performance boost from parallelizing disk I/O? I/O reads in particular
In my case, I thought having RAID 0 drives would allow at least two reads to run concurrently, but it is still slower than the serial approach.
Would you ever go for concurrent I/O reads? Why?
Try the same with two separate threads reading from two separate disks.
Preferably, the disks should be on separate controllers (and the threads should run on separate CPUs).
Basically, a RAID 0 array already parallelizes reads and writes and behaves as a single entity in that regard.
What you have tried is analogous to parallelizing a calculation on a single CPU machine.
Basically, when you have plenty of IO capacity and the process does not only do IO (i.e. it really spends time computing something).
Disks, by their physical nature, are pretty serial in their processing.
I remember a thread here, but unfortunately not the details.
IIRC the overall tone there was that reading one file per thread is an OK approach, but reading a single (possibly large) file with more than one thread is not.
A redundant array of independent disks (RAID) is a technology that provides increased storage reliability through redundancy. RAID 0 generally doesn't provide higher input/output speeds for concurrent access because any given file you want to access is striped across the two disks. If you have huge files, you might see some improvement in read access times with a RAID 0 configuration.
Higher levels of RAID provide redundancy, not necessarily performance.
Azul Systems has an appliance that supports thousands of cache coherent CPUs. I would love insight into what changes would need to occur to an operating system in order to schedule thousands of simultaneously running threads.
Scheduling thousands of threads is not a big deal, but scheduling them on hundreds of CPUs is. What you need, first and foremost, is very fine-grained locking, or, better yet, lock-free data structures and algorithms. You just can't afford to let 200 CPUs waiting while one CPU executes a critical section.
You're asking for possible changes to the OS, so I presume there's a significant engineering team behind this effort.
There are also a few pieces of clarifying info that would help define the problem parameters:
How much IPC (inter process communication) do you need?
Do they really have to be threads, or can they be processes?
If they're processes, is it okay if they have to talk to each other through sockets, and not by using shared memory?
What is the memory architecture? Are you straight SMP with 1024 cores, or is there some other NUMA (Non-Uniform Memory Architecture) or MMP going on here? What are your page tables like?
Knowing only the very smallest bit about Azul systems, I would guess that you have very little IPC, and that a simple "run one kernel per core" model might actually work out just fine. If processes need to talk to each other, they can create sockets and transfer data that way. Does your hardware support this model? (You would likely end up needing one IP address per core as well, and at 1024 IP addresses this might be troublesome, although they could all be NAT'd, so maybe it's not such a big deal.) Of course, this model would lead to some inefficiencies, like extra page tables and a fair bit of RAM overhead, and may not even be supported by your hardware system.
Even if "1 kernel per core" doesn't work, you could likely run 1024/8 kernels, and be just fine, letting each kernel control 8 physical CPUs.
That said, if you wanted to run 1 thread per core in a traditional SMP machine with 1024 cores (and only a few physical CPUs) then I would expect that the old fashioned O(1) scheduler is what you'd want. It's likely that your CPU[0] will end up nearly 100% in kernel and doing interrupt handling, but that's just fine for this use case, unless you need more than 1 core to handle your workload.
Making Linux scale has been a long and ongoing project. The first multiprocessor capable Linux kernel had a single lock protecting the entire kernel (the Big Kernel Lock, BKL), which was simple, but limited scalability.
Subsequently the locking has been made more fine-grained, i.e. there are many locks (thousands?), each covering only a small portion of data. However, there are limits to how far this can be taken, as fine-grained locking tends to be complicated, and the locking overhead starts to eat up the performance benefit, especially considering that most multi-CPU Linux systems have relatively few CPUs.
Another thing, is that as far as possible the kernel uses per-cpu data structures. This is very important, as it avoids the cache coherency performance issues with shared data, and of course there is no locking overhead. E.g. every CPU runs its own process scheduler, requiring only occasional global synchronization.
Also, some algorithms are chosen with scalability in mind. E.g. some read-mostly data is protected by Read-Copy-Update (RCU) instead of traditional mutexes; this allows readers to proceed during a concurrent update.
As for memory, Linux tries hard to allocate memory from the same NUMA node as where the process is running. This provides better memory bandwidth and latency for the applications.
My uneducated guess would be that there is a run-queue per processor and a work-stealing algorithm when a processor is idle. I could see this working in an M:N model, where there is a single process per cpu and light-weight processes as the work items. This would then feel similar to a work-stealing threadpool, such as the one in Java-7's fork-join library.
If you really want to know, go pick up Solaris Internals or dig into the Solaris kernel code. I'm still reading The Design and Implementation of the FreeBSD Operating System, with Solaris Internals being next on my list, so all I can do is make wild guesses at the moment.
I am pretty sure that the SGI Altix we have at work (which does ccNUMA) uses special hardware for cache coherency.
There is a huge overhead involved in keeping megabytes of cache per core coherent. It's unlikely to be done in software alone.
In an array of 256 CPUs you would need 768 MB of RAM just to hold the cache-invalidation bits:
12 MB cache / 128 bytes per cache line × 256² (one bit per line per core pair) ≈ 6.4 × 10⁹ bits = 768 MB.
Modifying the OS is one thing, but using unchanged application code is a waste of hardware. When going over some limit (depending on the hardware), the effort to keep coherency and synchronization in order to execute generic code is simply too much. You can do it, but it will be very expensive.
From the OS side you'll need complex affinity model, i.e. not to jump CPUs just because yours is busy. Scheduling threads based on hardware topology - cooperating threads on CPUs that are "close" to minimize penalties. Simple work stealing is not a good solution, you must consider topology. One solution is hierarchical work stealing - steal work by distance, divide topology to sectors and try to steal from closest first.
Touching a bit on the lock issue: you'll still use spin-locks and such, but with totally different implementations. This is probably the most patented field in CS these days.
But, again, you will need to program specifically for such massive scale. Or you'll simply under-use it. No automatic "parallelizers" will do it for you.
The easiest way to do this is to bind each process/thread to a few CPUS, and then only those CPUs would have to compete for a lock on that thread. Obviously, there would need to be some way to move threads around to even out the load, but on a NUMA architecture, you have to minimize this as much as possible.
Even on dual-core Intel systems, I'm pretty sure that Linux can already handle thousands of threads with native POSIX threads.
(Glibc and the kernel both need to be configured to support this, but I believe most systems these days have that by default.)