Programmatic resource monitoring per process in Linux

I want to know if there is an efficient solution for monitoring a process's resource consumption (CPU, memory, network bandwidth) in Linux. I want to write a daemon in C++ that does this monitoring for some given PIDs. From what I know, the classic solution is to periodically read the information from /proc, but this doesn't seem the most efficient way (it involves many system calls). For example, to monitor the memory usage every second for 50 processes, I have to open, read and close 50 files in /proc (that means 150 system calls) every second. Not to mention the parsing involved when reading these files.
Another problem is the network bandwidth consumption: this cannot be easily computed for each process I want to monitor. The solution adopted by NetHogs involves a pretty high overhead in my opinion: it captures and analyzes every packet using libpcap, then for each packet the local port is determined and searched in /proc to find the corresponding process.
Do you know if there are more efficient alternatives to these methods, or any libraries that deal with these problems?

/usr/src/linux/Documentation/accounting/taskstats.txt
Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.
Taskstats was designed for the following benefits:
efficiently provide statistics during lifetime of a task and on its exit
unified interface for multiple accounting subsystems
extensibility for use by future accounting patches
This interface lets you monitor CPU, memory, and I/O usage by processes of your choosing. You only need to set up and receive messages on a single socket.
This does not differentiate (for example) disk I/O versus network I/O. If that's important to you, you might go with an LD_PRELOAD interception library that tracks socket operations. Assuming that you can control the startup of the programs you wish to observe and that they won't do trickery behind your back, of course.
I can't think of any light-weight solutions if those still fail, but linux-audit can globally trace syscalls, which seems a fair bit more direct than re-capturing and analyzing your own network traffic.
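To illustrate the LD_PRELOAD idea, here is a minimal sketch that only counts bytes going through send() and recv(); a real shim would also wrap sendto(), sendmsg(), write() on sockets, and so on, and the library name here is made up:

    // netcount.cpp - build: g++ -shared -fPIC -o libnetcount.so netcount.cpp -ldl
    // Run the monitored program with: LD_PRELOAD=./libnetcount.so ./your_program
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE              // for RTLD_NEXT
    #endif
    #include <dlfcn.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <atomic>
    #include <cstdio>

    static std::atomic<unsigned long long> bytes_sent{0};
    static std::atomic<unsigned long long> bytes_received{0};

    typedef ssize_t (*send_fn)(int, const void*, size_t, int);
    typedef ssize_t (*recv_fn)(int, void*, size_t, int);

    extern "C" ssize_t send(int fd, const void* buf, size_t len, int flags) {
        static send_fn real_send = (send_fn)dlsym(RTLD_NEXT, "send");
        ssize_t n = real_send(fd, buf, len, flags);
        if (n > 0) bytes_sent += (unsigned long long)n;
        return n;
    }

    extern "C" ssize_t recv(int fd, void* buf, size_t len, int flags) {
        static recv_fn real_recv = (recv_fn)dlsym(RTLD_NEXT, "recv");
        ssize_t n = real_recv(fd, buf, len, flags);
        if (n > 0) bytes_received += (unsigned long long)n;
        return n;
    }

    // Print the totals when the monitored process exits.
    __attribute__((destructor)) static void report() {
        fprintf(stderr, "net bytes: sent=%llu received=%llu\n",
                bytes_sent.load(), bytes_received.load());
    }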

Take a look at the linux trace toolkit (LTTng). It inserts tracepoints into the kernel and has some post processing to get some of the kind of statistics you're asking about. The trace files get large if you capture everything, but you can keep things manageable if you limit the types of events you arm.
http://lttng.org for more info...

Regarding network bandwidth: This Superuser answer describes processing /proc/net/tcp to collect network bandwidth usage.
I know that iptables can be used to do network accounting (see, e.g., LWN's, Linux.com's, or Shorewall's articles), but I don't see any practical way to do that accounting on a per-process basis.
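If you do go the /proc/net/tcp route, the missing piece is attributing each socket to a process. A minimal sketch of that step, assuming the monitored processes are visible under /proc and the helper name is mine:

    // Collect the socket inode numbers owned by a given PID by scanning
    // /proc/<pid>/fd; each socket fd is a symlink of the form "socket:[12345]".
    // These inodes can then be matched against the "inode" column of
    // /proc/net/tcp (or /proc/net/udp) to attribute connections to the PID.
    #include <dirent.h>
    #include <unistd.h>
    #include <cstdio>
    #include <set>
    #include <string>

    std::set<unsigned long> socket_inodes_of(pid_t pid) {
        std::set<unsigned long> inodes;
        char dirpath[64];
        snprintf(dirpath, sizeof dirpath, "/proc/%d/fd", (int)pid);
        DIR* dir = opendir(dirpath);
        if (!dir) return inodes;                 // process gone, or no permission
        while (dirent* de = readdir(dir)) {
            std::string linkpath = std::string(dirpath) + "/" + de->d_name;
            char target[128];
            ssize_t n = readlink(linkpath.c_str(), target, sizeof target - 1);
            if (n <= 0) continue;                // ".", "..", or not a symlink
            target[n] = '\0';
            unsigned long inode;
            if (sscanf(target, "socket:[%lu]", &inode) == 1)
                inodes.insert(inode);
        }
        closedir(dir);
        return inodes;
    }

Note that this only tells you which connections belong to the process; getting actual per-connection byte counts still needs packet capture or netfilter/conntrack accounting.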

I just came across this as I was looking for answers to the same thing. Just a note: when using the /proc filesystem, you do not have to close the file after each read. You can keep the file open and, as long as you seek back to offset 0 (or use pread()), each read will return fresh statistics. So you shouldn't have the overhead of opening and closing each time you want to get the stats. I have this working in JavaScript on node.js if you want an example.
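In C++, a rough sketch of this approach using /proc/<pid>/statm (error handling kept minimal; one pread() system call per sample):

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
        char path[64];
        snprintf(path, sizeof path, "/proc/%s/statm", argv[1]);
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        for (;;) {
            char buf[256];
            // pread always starts at offset 0, so each call returns a fresh snapshot.
            ssize_t n = pread(fd, buf, sizeof buf - 1, 0);
            if (n <= 0) break;                   // the process has probably exited
            buf[n] = '\0';
            long size_pages, resident_pages;
            sscanf(buf, "%ld %ld", &size_pages, &resident_pages);
            long page_kb = sysconf(_SC_PAGESIZE) / 1024;
            printf("vsize=%ld kB rss=%ld kB\n",
                   size_pages * page_kb, resident_pages * page_kb);
            sleep(1);
        }
        close(fd);
        return 0;
    }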

Reading /proc is ultimately the only way to monitor CPU and memory usage by individual processes without injecting your code into the kernel. If you look at top(1), you'll see that reading lots of files in /proc is exactly what it does every second. All user-mode tools and libraries that retrieve this sort of information have to get it from /proc.
As for network bandwidth usage, there are several approaches, which all more or less boil down to capturing all network traffic in and out of the box. You can also consider writing a special netfilter (iptables) module that does exactly the type of counting you need, without the overhead of traffic capturing.

Related

How to make the OS schedule disk accesses optimally?

Suppose that a process needs to access the file system in many (1000+) places, and the order is not important to the program logic. However, the order obviously matters for performance if the file system is stored on a (spinning) hard disk.
How can the application programmer communicate to the OS that it should schedule the accesses optimally? Launching 1000+ threads does not seem practical. Does database management software accomplish this, and if so, then how?
Additional details: I had a large (1TB+) mmapped file where I needed to read 1000+ chunks of about 1KB, each time in new, unpredictable places.
In the early days, when parameters like Wikipedia: Hard disk drive performance characteristics → Seek time were very expensive and thus very important, database vendors paid attention to the on-disk data representation and layout, as can be seen e.g. in Oracle8i: Designing and Tuning for Performance → Tuning I/O.
The important optimization parameters changed with the appearance of solid-state drives (SSDs), where the seek time is 0 (or at least constant) as there is nothing to rotate. Some of the new parameters are addressed by Wikipedia: Solid-state drive (SSD) → optimized file systems.
But even those optimization parameters go away with the use of Wikipedia: In-memory databases. The list of vendors is pretty long, all big players on it.
So how to schedule your accesses optimally depends a lot on the use case (1000 concurrent hits is not a sufficient problem description). Buying some RAM is one of the options, and "how can the programmer communicate with the OS" will be one of the last questions to ask, not the first.
Files and their transactions are cached in various devices in your computer; RAM and the HD cache are the most usual places. The file system driver may also implement IO transaction queues, defragmentation, and error-correction logic that makes things complicated for the developer who wants to control every aspect of file access. This level of complexity is ultimately designed to provide integrity, security, performance, and coordination of file access across all processes of your system.
Optimization efforts should not interfere with the system's own caching and prediction algorithms, not just for IO but for all caches. Trying to second-guess your system is a waste of your time and your processors' time.
Most probably your IO operations and data will stay on caches and later be committed to your storage devices when your OS sees fit.
That said, there are always options like database suites, mmap, readahead mechanisms, and direct IO to your drive. You will need to invest time benchmarking any of your efforts. I advise against multiple IO threads because cache contention will make things even slower than one thread.
The kernel will already reorder the read/write requests (e.g. to fit the spin of a mechanical disk), if they come from various processes or threads. BTW, most of the reads & writes would go to the kernel file system cache, not to the disk.
You might consider using posix_fadvise(2) and perhaps (in a separate thread) readahead(2). If, instead of read(2)-ing, you use mmap(2) to map some file portion into virtual memory, you might also use madvise(2).
Of course, the file system does not usually guarantee that a sequential portion of a file is physically sequentially located on the disk (and even the disk firmware might reorder sectors). See picture in Ext2 wikipage, also relevant for Ext4. Some file systems might be better in that respect, and you could tune their block size (at mkfs time).
I would not recommend having thousands of threads (only at most a few dozens).
Lastly, it might be worth buying an SSD or some more RAM (for the file cache). See http://linuxatemyram.com/
Actual performance would depend a lot on the particular system and hardware.
Perhaps using an indexed file library like GDBM, or a database library like SQLite (or a real database like PostgreSQL), might be worthwhile! Having fewer but bigger files could also help.
BTW, you are mmap-ing and reading small chunks of 1 KB (smaller than the 4 KB page size). You could use madvise (if possible, in advance), but you should try to read larger chunks, since every file access will bring in at least a whole page.
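A rough sketch of that idea, assuming the scattered offsets are known slightly in advance (the file name and offset list are placeholders): hint every needed range with madvise(MADV_WILLNEED) in one pass, then actually touch the data in a second pass.

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const char* path = "bigfile.dat";            // placeholder for the large data file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);
        char* base = (char*)mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        std::vector<off_t> offsets = {/* ... 1000+ scattered, in-range positions ... */};
        long page = sysconf(_SC_PAGESIZE);

        // Pass 1: queue readahead hints for every page we will need.
        for (off_t off : offsets) {
            off_t aligned = off & ~(off_t)(page - 1);   // madvise wants page alignment
            madvise(base + aligned, 1024 + (off - aligned), MADV_WILLNEED);
        }
        // Pass 2: consume the data; most pages should already be in the page cache.
        unsigned long checksum = 0;
        for (off_t off : offsets)
            for (int i = 0; i < 1024; ++i) checksum += (unsigned char)base[off + i];

        printf("checksum=%lu\n", checksum);
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }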
You really should benchmark!

ioctl vs netlink vs memmap to communicate between kernel space and user space

We have some statistics from our custom hardware that are displayed whenever the user asks for them with a command in Linux user space. The current implementation uses the proc interface. As we added more statistics we ran into a problem: the statistics command had to be executed twice to get all of the data, because the proc interface was restricted to one page.
As mentioned above, the data transfer between kernel and user space is not time-critical, but the user may take decisions based on the data. Our requirements for this interface are that it should be capable of transferring amounts of data possibly greater than 8192 bytes, that the command should use minimal kernel resources (locks etc.), and that it should be quick.
Using ioctl could solve the issue, but since the command is not exactly controlling the device, just collecting some statistics, I am not sure whether it is a good mechanism to use on Linux. We are currently on a 3.4 kernel; I am not sure whether netlink is lossy in this version (in previous versions I came across issues where the socket starts to drop data when its queue becomes full). mmap is another option. Can anyone suggest which would be the best interface to use?
Kernel services can send information directly to user applications over netlink, while you'd have to explicitly poll the kernel with ioctl calls, a relatively expensive operation.
Netlink comms is very much asynchronous, with each side receiving messages at some point after the other side sends them. ioctls are purely synchronous: “Hey kernel, WAKE UP! I need you to process my request NOW! CHOP CHOP!”
Netlink supports multicast communications between the kernel and multiple user-space processes, while ioctls are strictly one-to-one.
Netlink messages can be lost for various reasons (e.g. out of memory), while ioctls are generally more reliable due to their immediate-processing nature.
So if you are asking the kernel for statistics from user space (your application), ioctl is the more reliable and easier option, while if you generate statistics in kernel space and want the kernel to push that data to user space (your application) on its own, you have to use netlink sockets.
You can do an ioctl _IO call (rather than _IOR, _IOW, or _IOWR). ioctls can be very useful for collecting information. You'll have a lot of flexibility this way, in that you can pass buffers or structs of different sizes to be filled with data.
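For example, the userspace side could look like the sketch below. Everything here, including the /dev/mystats device, the MYSTATS_* names and the stats_report layout, is hypothetical; the real definitions would live in a header shared with your driver, whose ioctl handler fills the struct with copy_to_user().

    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdint>

    struct stats_report {                 // must match the kernel-side definition
        uint64_t rx_packets;
        uint64_t tx_packets;
        uint64_t errors;
    };

    #define MYSTATS_MAGIC 'S'
    #define MYSTATS_GET   _IO(MYSTATS_MAGIC, 1)   // plain _IO: the driver treats arg as a pointer

    int main() {
        int fd = open("/dev/mystats", O_RDONLY);
        if (fd < 0) { perror("open /dev/mystats"); return 1; }

        stats_report rep{};
        if (ioctl(fd, MYSTATS_GET, &rep) < 0) {   // kernel fills the struct
            perror("ioctl");
            return 1;
        }
        printf("rx=%llu tx=%llu err=%llu\n",
               (unsigned long long)rep.rx_packets,
               (unsigned long long)rep.tx_packets,
               (unsigned long long)rep.errors);
        close(fd);
        return 0;
    }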

Determining cache misses for various filesystems

I've got a project for school where I have to find out how many cache misses a filesystem will have under heavy and light loads and on a multiple processor machine. After discussing this with my professor, I came up with a basic plan of execution:
Create a program which will bog down the filesystem and fill up the buffer cache.
Use a system benchmarking tool to record the number of cache misses.
Rinse and repeat with new conditions.
But being new to operating system design, I am unsure of how to proceed. So here are some points where I need some help:
What actions would an ideal program perform to fill up the buffer cache? Currently, the program that I've written reads and writes to several different files, x amount of times.
What tools are there that record the number of cache misses? I have looked into oprofile but I don't think it monitors the filesystem's buffer cache. But I have found this list which looks promising.
Will other running processes affect these benchmarks?
Thanks for your help!
1) If you are trying to test your filesystem performance, throw in several threads that are manipulating large amounts of file metadata alongside your I/O threads. Also, when doing I/O in several parallel threads, mix threads doing large-sized transfers and threads doing small-sized transfers. Many filesystems will coalesce small I/O operations into larger requests that the physical drive can handle in a more time-efficient manner, and mixing I/O of various sizes may help fill up the cache faster (since it has to buffer the coalesced I/O); see the sketch after this list.
2) Be careful with that list of tools, many look like they are designed to operate on raw devices and not through the filesystem layer (so the results you'd get might not represent what you think they do). If you are looking for a tool to benchmark a particular filesystem, your best bet may be to check with the development team for that filesystem. They can most likely point you to the tool that they used to benchmark their FS during development, even if it is a custom tool developed internally.
3) Yes, anything else that is running and might access the filesystem under test can potentially impact your results. You may want to create a separate filesystem to use only for this test and turn off any background scans that might try to access it while you are running your tests.
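Here is a rough sketch of the kind of load generator described in point 1); the file names, sizes and thread counts are arbitrary knobs, not recommendations (compile with -pthread):

    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <string>
    #include <thread>
    #include <vector>

    static void big_writer(int id) {
        std::string name = "big_" + std::to_string(id) + ".dat";
        int fd = open(name.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return;
        std::vector<char> block(1 << 20, 'A');            // 1 MB blocks
        for (int i = 0; i < 256; ++i)                     // ~256 MB per thread
            if (write(fd, block.data(), block.size()) < 0) break;
        close(fd);
    }

    static void small_writer(int id) {
        std::string name = "small_" + std::to_string(id) + ".dat";
        int fd = open(name.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return;
        char block[512];
        memset(block, 'B', sizeof block);
        for (int i = 0; i < 100000; ++i)                  // many tiny writes
            if (write(fd, block, sizeof block) < 0) break;
        close(fd);
    }

    static void metadata_churn() {
        struct stat st;
        for (int i = 0; i < 100000; ++i)                  // hammer the metadata caches
            stat(("small_" + std::to_string(i % 4) + ".dat").c_str(), &st);
    }

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i) threads.emplace_back(big_writer, i);
        for (int i = 0; i < 4; ++i) threads.emplace_back(small_writer, i);
        threads.emplace_back(metadata_churn);
        for (auto& t : threads) t.join();
        return 0;
    }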
That is an interesting question. Maybe I can give you a partial answer.
You should be aware that Linux has multiple caches related to file systems, which may need different tools:
Inode cache
Dentry cache
Block cache
One way is to calculate (guess?) how much block level traffic your operations should generate, and then measure the real block operations (reads, writes, seeks) with blktrace.
I am not aware of any way to read the cache miss statistics of the inode and dentry caches. I would really like to be told that I am wrong here.
The hard way is to annotate the inode cache and dentry cache with your own counters, but these caches are pretty deep kernel code.

What is the ideal & fastest way to communicate between kernel and user space?

I know that information exchange can happen via the following interfaces between the kernel and user space programs:
system calls
ioctls
/proc & /sys
netlink
I want to find out
Whether I have missed any other interface.
Which one of them is the fastest way to exchange large amounts of data?
(and if there is any document/mail/explanation supporting such a claim that I can refer to)
Which one is the recommended way to communicate? (I think it's netlink, but I would still love to hear opinions.)
The fastest way to exchange vast amounts of data is memory mapping. The mmap call can be used on a device file, and the corresponding kernel driver can then decide to map kernel memory into the user address space. A good example of this is the Video For Linux drivers, and I suppose the frame buffer driver works the same way. For a good explanation of how the V4L2 driver works, you have:
The lwn.net article about streaming I/O
The V4L2 spec itself
You can't beat memory mapping for large amounts of data, because there is no memcpy-like operation involved; the physical underlying memory is effectively shared between kernel and userspace. Of course, as with all shared memory mechanisms, you have to provide some synchronisation so that the kernel and userspace don't both think they have ownership at the same time.
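For a sense of what the userspace side looks like, here is a sketch against a hypothetical /dev/mydriver that exports a 1 MB buffer; the kernel side, which implements the driver's mmap handler with remap_pfn_range() or similar, is not shown.

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdint>

    int main() {
        const size_t kWindow = 1 << 20;              // must match what the driver exports
        int fd = open("/dev/mydriver", O_RDWR);      // hypothetical device node
        if (fd < 0) { perror("open"); return 1; }

        void* p = mmap(nullptr, kWindow, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // Data written by the kernel appears here without any copy_to_user();
        // some explicit synchronisation (ioctl, poll, eventfd, ...) is still needed
        // so both sides agree on who owns which part of the buffer at any time.
        volatile uint32_t* ring = (uint32_t*)p;
        printf("first word from the kernel buffer: %u\n", ring[0]);

        munmap(p, kWindow);
        close(fd);
        return 0;
    }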
Shared memory between kernel and userspace is doable.
http://kerneltrap.org/node/14326
For instructions/examples.
You can also use a named pipe, which is pretty fast.
All this really depends on what data you are sharing, whether it is concurrently accessed, and how the data is structured. Plain system calls may be enough for simple data.
Linux kernel /proc FIFO/pipe might also help.
good luck
You may also consider relay (formerly relayfs):
"Basically relayfs is just a bunch of per-cpu kernel buffers that can be efficiently written into from kernel code. These buffers are represented as files which can be mmap'ed and directly read from in user space. The purpose of this setup is to provide the simplest possible mechanism allowing potentially large amounts of data to be logged in the kernel and 'relayed' to user space."
http://relayfs.sourceforge.net/
You can obviously do shared memory with copy_from_user etc.; you can easily set up a character device driver (basically all you have to do is create a file_operations structure), but this is by far not the fastest way.
I have no benchmarks, but system calls on modern systems should be the fastest. My reasoning is that this is what has been optimized the most. It used to be that to get from user mode to kernel mode one had to raise an interrupt, which would go to the interrupt table (an array), locate the interrupt handler (0x80), and then switch to kernel mode. This was really slow; then came the sysenter instruction, which basically makes this transition really fast. Without going into details, sysenter loads CS:EIP immediately from dedicated registers, and the switch is quite fast.
Shared memory, on the contrary, requires writing to and reading from memory, which is far more expensive than reading from a register.
Here is a possible compilation of all the available interfaces, although in some ways they overlap one another (e.g., sockets ultimately go through system calls too):
Procfs
Sysfs
Configfs
Debugfs
Sysctl
devfs (eg, Character Devices)
TCP/UDP Sockets
Netlink Sockets
Ioctl
Kernel System Calls
Signals
Mmap
As for shared memory: I've found that even with NUMA, two threads running on two different cores and communicating through shared memory still require reads and writes through the L3 cache, which, if you are lucky (both cores on one socket), is about 2x slower than a syscall, and if not (cores on different sockets), is about 5x or more slower than a syscall. I think the syscall's hardware mechanism helps here.

Bursty writes to SD/USB stalling my time-critical apps on embedded Linux

I'm working on an embedded Linux project that interfaces an ARM9 to a hardware video encoder chip, and writes the video out to SD card or USB stick. The software architecture involves a kernel driver that reads data into a pool of buffers, and a userland app that writes the data to a file on the mounted removable device.
I am finding that above a certain data rate (around 750kbyte/sec) I start to see the userland video-writing app stalling for maybe half a second, about every 5 seconds. This is enough to cause the kernel driver to run out of buffers - and even if I could increase the number of buffers, the video data has to be synchronised (ideally within 40ms) with other things that are going on in real time. Between these 5 second "lag spikes", the writes complete well within 40ms (as far as the app is concerned - I appreciate they're buffered by the OS)
I think this lag spike is to do with the way Linux flushes data out to disk. I note that pdflush is designed to wake up every 5 s, and my understanding is that this is what does the writing. As soon as the stall is over, the userland app is able to quickly service and write the backlog of buffers (that didn't overflow).
I think the device I'm writing to has reasonable ultimate throughput: copying a 15MB file from a memory fs and waiting for sync to complete (and the usb stick's light to stop flashing) gave me a write speed of around 2.7MBytes/sec.
I'm looking for two kinds of clues:
How can I stop the bursty writing from stalling my app - perhaps process priorities, realtime patches, or tuning the filesystem code to write continuously rather than burstily?
How can I make my app(s) aware of what is going on with the filesystem in terms of write backlog and throughput to the card/stick? I have the ability to change the video bitrate in the hardware codec on the fly which would be much better than dropping frames, or imposing an artificial cap on maximum allowed bitrate.
Some more info: this is a 200MHz ARM9 currently running a Montavista 2.6.10-based kernel.
Updates:
Mounting the filesystem with the sync option makes throughput much too poor.
The removable media is FAT/FAT32 formatted and must stay that way, as the purpose of the design is that the media can be plugged into any Windows PC and read.
Regularly calling sync() or fsync(), say every second, causes regular stalls and unacceptably poor throughput.
I am using write() and open(O_WRONLY | O_CREAT | O_TRUNC) rather than fopen() etc.
I can't immediately find anything online about the mentioned "Linux realtime filesystems". Links?
I hope this makes sense. First embedded Linux question on stackoverflow? :)
For the record, there turned out to be two main aspects that seem to have eliminated the problem in all but the most extreme cases. This system is still in development and hasn't been thoroughly torture-tested yet but is working fairly well (touch wood).
The big win came from making the userland writer app multi-threaded. It is the calls to write() that block sometimes: other processes and threads still run. So long as I have a thread servicing the device driver and updating frame counts and other data to synchronise with other apps that are running, the data can be buffered and written out a few seconds later without breaking any deadlines. I tried a simple ping-pong double buffer first but that wasn't enough; small buffers would be overwhelmed and big ones just caused bigger pauses while the filesystem digested the writes. A pool of 10 1MB buffers queued between threads is working well now.
The other aspect is keeping an eye on ultimate write throughput to physical media. For this I am watching the Dirty: stat reported by /proc/meminfo. I have some rough and ready code to throttle the encoder if Dirty: climbs above a certain threshold, which seems to vaguely work. More testing and tuning is needed later. Fortunately I have lots of RAM (128M) to play with, giving me a few seconds to see my backlog building up and throttle down smoothly.
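For reference, the Dirty: check is only a few lines; in this sketch the 16 MB threshold is an arbitrary example value that would need tuning against the real card:

    #include <cstdio>

    // Returns the "Dirty:" value from /proc/meminfo in kB, or -1 on error.
    long dirty_kb() {
        FILE* f = fopen("/proc/meminfo", "r");
        if (!f) return -1;
        char line[128];
        long kb = -1;
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "Dirty: %ld kB", &kb) == 1) break;
        }
        fclose(f);
        return kb;
    }

    bool should_throttle_encoder() {
        const long kThresholdKb = 16 * 1024;   // example threshold: 16 MB of dirty pages
        long kb = dirty_kb();
        return kb >= 0 && kb > kThresholdKb;
    }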
I'll try to remember to pop back and update this answer if I find I need to do anything else to deal with this issue. Thanks to the other answerers.
I'll throw out some suggestions; advice is cheap.
Make sure you are using a lower-level API for writing to the disk: don't use user-mode buffered functions like fopen, fread and fwrite; use the lower-level functions open, read and write.
Pass the O_SYNC flag when you open the file; this will cause each write operation to block until it is written to disk, which will remove the bursty behavior of your writes... at the expense of each write being slower.
If you are doing reads/ioctls from a device to grab a chunk of video data, you may want to consider allocating a shared memory region between the application and kernel; otherwise you are getting hit with a bunch of copy_to_user calls when transferring video data buffers from kernel space to user space.
You may need to validate that your USB flash device is fast enough with sustained transfers to write the data.
Just a couple thoughts, hope this helps.
Here is some information about tuning pdflush for write-heavy operations.
Sounds like you're looking for linux realtime filesystems. Be sure to search Google et al for that.
XFS has a realtime option, though I haven't played with it.
hdparm might let you turn off the caching altogether.
Tuning the filesystem options (turn off all the extra unneeded file attributes) might reduce what you need to flush, thus speeding the flush. I doubt that'd help much, though.
But my suggestion would be to avoid using the stick as a filesystem at all and instead use it as a raw device. Stuff data on it like you would using 'dd'. Then elsewhere read that raw data and write it out after baking.
Of course, I don't know if that's an option for you.
As a debugging aid, you could use strace to see which operations are taking time.
There might be some surprising things with FAT/FAT32.
Do you write into a single file, or into multiple files?
You can make a reading thread that maintains a pool of video buffers ready to be written, held in a queue.
When a frame is received, it is added to the queue, and the writing thread is signaled.
Shared data:
empty_buffer_queue
ready_buffer_queue
video_data_ready_semaphore
Reading thread:
buf = get_buffer()
buffer_to_write = buf_dequeue(empty_buffer_queue)
memcpy(buffer_to_write, buf)
buf_enqueue(buffer_to_write, ready_buffer_queue)
sem_post(video_data_ready_semaphore)
Writing thread:
sem_wait(video_data_ready_semaphore)
buffer_to_write = buf_dequeue(ready_buffer_queue)
write_buffer(buffer_to_write)
buf_enqueue(buffer_to_write, empty_buffer_queue)
If your writing thread is blocked waiting for the kernel, this could work.
However, if you are blocked inside kernel space, then there is not much you can do, except look for a more recent kernel than your 2.6.10.
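For completeness, here is a runnable C++ translation of the pseudocode above, with a condition variable standing in for the semaphore; the buffer count/size and the capture_frame()/write_frame() placeholders are mine (compile with -pthread):

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Buffer { std::vector<char> data; };

    static std::queue<Buffer*> empty_q, ready_q;
    static std::mutex mtx;
    static std::condition_variable data_ready;

    static void capture_frame(Buffer*) { /* read a frame from the driver */ }
    static void write_frame(const Buffer*) { /* write() to the file on the stick */ }

    static void reader_thread() {
        for (;;) {
            Buffer* buf;
            {   std::unique_lock<std::mutex> lk(mtx);
                if (empty_q.empty()) {            // real code would drop the frame or wait
                    lk.unlock();
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                    continue;
                }
                buf = empty_q.front(); empty_q.pop();
            }
            capture_frame(buf);                   // fill the buffer outside the lock
            {   std::lock_guard<std::mutex> lk(mtx);
                ready_q.push(buf);
            }
            data_ready.notify_one();
        }
    }

    static void writer_thread() {
        for (;;) {
            Buffer* buf;
            {   std::unique_lock<std::mutex> lk(mtx);
                data_ready.wait(lk, [] { return !ready_q.empty(); });
                buf = ready_q.front(); ready_q.pop();
            }
            write_frame(buf);                     // the slow, possibly-stalling part
            std::lock_guard<std::mutex> lk(mtx);
            empty_q.push(buf);
        }
    }

    int main() {
        std::vector<Buffer> pool(10);             // 10 x 1 MB, as in the answer above
        for (auto& b : pool) { b.data.resize(1 << 20); empty_q.push(&b); }
        std::thread r(reader_thread), w(writer_thread);
        r.join(); w.join();                       // runs until killed; a real app needs a shutdown flag
    }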
Without knowing more about your particular circumstances, I can only offer the following guesses:
Try using fsync()/sync() to force the kernel to flush data to the storage device more frequently. It sounds like the kernel buffers all your writes and then ties up the bus or otherwise stalls your system while performing the actual write. With careful calls to fsync() you can try to schedule writes over the system bus in a more fine grained way.
It might make sense to structure the application in such a way that the encoding/capture (you didn't mention video capture, so I'm making an assumption here - you might want to add more information) task runs in its own thread and buffers its output in userland - then, a second thread can handle writing to the device. This will give you a smoothing buffer to allow the encoder to always finish its writes without blocking.
One thing that sounds suspicious is that you only see this problem at a certain data rate - if this really was a buffering issue, I'd expect the problem to happen less frequently at lower data rates, but I'd still expect to see this issue.
In any case, more information might prove useful. What's your system's architecture? (In very general terms.)
Given the additional information you provided, it sounds like the device's throughput is rather poor for small writes and frequent flushes. If you're sure that for larger writes you can get sufficient throughput (and I'm not sure that's the case, but the file system might be doing something stupid, like updating the FAT after every write), then having an encoding thread pipe data to a writing thread, with sufficient buffering in the writing thread to avoid stalls, should help. I've used shared memory ring buffers in the past to implement this kind of scheme, but any IPC mechanism that allows the writer to write to the I/O process without stalling unless the buffer is full should do the trick.
A useful Linux function and alternative to sync or fsync is sync_file_range. This lets you schedule data for writing without waiting for the in-kernel buffer system to get around to it.
To avoid long pauses, make sure your IO queue (for example: /sys/block/hda/queue/nr_requests) is large enough. That queue is where data goes in between being flushed from memory and arriving on disk.
Note that sync_file_range isn't portable, and is only available in kernels 2.6.17 and later.
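Usage is straightforward; in this sketch each chunk's writeback is kicked off right after the write(), so dirty data never piles up into one large burst (the file name and chunk size are arbitrary):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE               // sync_file_range() is Linux-specific
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        int fd = open("video.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const size_t kChunk = 1 << 20;               // 1 MB per write
        std::vector<char> buf(kChunk, 0);
        off_t offset = 0;

        for (int i = 0; i < 64; ++i) {               // stand-in for the capture loop
            if (write(fd, buf.data(), buf.size()) != (ssize_t)buf.size()) break;
            // Ask the kernel to start writing this range to the device now,
            // without blocking the way fsync() would.
            sync_file_range(fd, offset, kChunk, SYNC_FILE_RANGE_WRITE);
            offset += kChunk;
        }
        close(fd);
        return 0;
    }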
I've been told that after the host sends a command, MMC and SD cards "must respond within 0 to 8 bytes".
However, the spec allows these cards to respond with "busy" until they have finished the operation, and apparently there is no limit to how long a card can claim to be busy (please, please tell me if there is such a limit).
I see that some low-cost flash chips such as the M25P80 have a guaranteed "maximum single-sector erase time" of 3 seconds, although typically it "only" requires 0.6 seconds.
That 0.6 seconds sounds suspiciously similar to your "stalling for maybe half a second".
I suspect the tradeoff between cheap, slow flash chips and expensive, fast flash chips has something to do with the wide variation in USB flash drive results:
http://www.testfreaks.com/blog/information/16gb-usb-drive-comparison-17-drives-compared/
http://www.tomshardware.com/reviews/data-transfer-run,1037-10.html
I've heard rumors that every time a flash sector is erased and then re-programmed, it takes a little bit longer than the last time.
So if you have a time-critical application, you may need to (a) test your SD cards and USB sticks to make sure they meet the minimum latency, bandwidth, etc. required by your application, and (b) periodically re-test or pre-emptively replace these memory devices.
Well, the obvious first: have you tried explicitly telling the file to flush? I also think there might be some ioctl you can use to do it, but I honestly haven't done much C/POSIX file programming.
Seeing as you're on a Linux kernel, you should be able to tune and rebuild the kernel to something that suits your needs better, e.g. much more frequent but also smaller flushes to the permanent storage.
A quick check in my man pages finds this:
SYNC(2) Linux Programmer’s Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync(): _BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
ERRORS
This function is always successful.
Doing your own flush()ing sounds right to me - you want to be in control, not leave it to the vagaries of the generic buffer layer.
This may be obvious, but make sure you're not calling write() too often - make sure every write() has enough data to be written to make the syscall overhead worth it. Also, in the other direction, don't call it too seldom, or it'll block for long enough to cause a problem.
On a more difficult-to-reimplement track, have you tried switching to asynchronous i/o? Using aio you could fire off a write and hand it one set of buffers while you're sucking video data into the other set, and when the write finishes you switch sets of buffers.
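A sketch of that double-buffering scheme with POSIX AIO (link with -lrt on older glibc); fill_with_video_data() is a placeholder for pulling data from the capture driver while the previous buffer is still in flight:

    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstring>
    #include <cstdio>

    static void fill_with_video_data(char* buf, size_t len) { memset(buf, 0, len); }

    int main() {
        int fd = open("video.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const size_t kBuf = 1 << 20;
        static char bufs[2][1 << 20];                // two buffers, used alternately
        off_t offset = 0;
        aiocb cb;
        bool in_flight = false;

        for (int frame = 0; frame < 64; ++frame) {
            int cur = frame & 1;
            fill_with_video_data(bufs[cur], kBuf);   // capture into the idle buffer

            if (in_flight) {                         // wait for the previous write to land
                while (aio_error(&cb) == EINPROGRESS)
                    usleep(1000);
                if (aio_return(&cb) < 0) perror("aio_write");
            }
            memset(&cb, 0, sizeof cb);
            cb.aio_fildes = fd;
            cb.aio_buf    = bufs[cur];
            cb.aio_nbytes = kBuf;
            cb.aio_offset = offset;
            if (aio_write(&cb) == 0) { in_flight = true; offset += kBuf; }
        }
        if (in_flight) {                             // drain the last write
            while (aio_error(&cb) == EINPROGRESS) usleep(1000);
            aio_return(&cb);
        }
        close(fd);
        return 0;
    }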
