Is it possible to grab process memory using ftrace? - linux

I have two applications: one writes requests to and reads responses from the stdin/stdout of the other. I must not modify the applications, but I have root permission. I need to intercept the requests and responses and timestamp certain messages as precisely as possible.
Currently I'm using ptrace: I trace the read and write syscalls on fd=0 and fd=1 and grab the buffers from /proc/<pid>/mem, but the overhead is too big and the resulting timestamps are too imprecise for us. I'm trying to use ftrace instead, but I cannot read from /proc/<pid>/mem, because ftrace doesn't stop the traced application.
It seems ftrace only gives me function arguments and registers, and I cannot find out how to grab the buffer behind a pointer passed as an argument. Is that even possible?
Could you suggest another approach for my problem?
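For reference, a minimal sketch of the ptrace-based approach described above (x86-64 assumed; a real tracer would also distinguish syscall-entry from syscall-exit stops, since e.g. a read() buffer is only valid on exit). Every stop costs context switches, which is where the overhead comes from:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        pid_t pid = (pid_t)atoi(argv[1]);

        ptrace(PTRACE_ATTACH, pid, NULL, NULL);
        waitpid(pid, NULL, 0);

        char path[64];
        snprintf(path, sizeof path, "/proc/%d/mem", (int)pid);
        int mem = open(path, O_RDONLY);

        for (;;) {
            /* resume until the next syscall entry/exit, stopping the tracee */
            ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
            if (waitpid(pid, NULL, 0) < 0)
                break;

            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);

            /* x86-64: orig_rax = syscall nr (0 = read, 1 = write),
               rdi = fd, rsi = buffer pointer, rdx = length */
            if ((regs.orig_rax == 0 || regs.orig_rax == 1) && regs.rdi <= 1) {
                char buf[4096];
                size_t len = regs.rdx < sizeof buf ? regs.rdx : sizeof buf;
                ssize_t n = pread(mem, buf, len, (off_t)regs.rsi);
                (void)n; /* timestamp and record buf[0..n) here */
            }
        }
        return 0;
    }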

Related

How can I know when data is written to disk?

We'd like to measure the I/O time from an application by instrumenting the read() and write() routines on a Linux system. However, the calls to write() return very fast. According to my OS man page for write (man 2 write):
NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
(Linux manual as of 2013-01-27)
So we understand that write() initiates an asynchronous operation that will flush the data to disk at some later point.
So the question is: is there a way to know when the data (even if it has been grouped for caching purposes) is actually being written to disk -- preferably, when that process starts and when it ends?
EDIT1: We're particularly interested in measuring the application's behavior, and we'd like to avoid changing its semantics by altering the parameters to open() -- adding O_SYNC -- or by injecting calls to sync(). If you change the application's semantics, you can no longer say anything about the behavior of the original application.
You could open the file with O_SYNC, which in theory means that write() won't return until the data is written to disk. Whether that covers the file data alone or the metadata as well depends on the file system and how it is mounted. Note that this changes how your application really works, though.
If you're really interested in handling the actual I/O to storage yourself (are you a database?), then O_DIRECT gives you that control. Again, this is a change in behaviour and imposes additional constraints on your application. It may or may not be what you need.
You really appear to be asking about benchmarking real performance, so the real question is what you want to know. Since a real system does so much caching, the "instant" return from write() is "real" in the sense that it reflects the delays your application actually experiences. If you're looking for I/O throughput, you might be better off looking at higher-level system statistics.
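A minimal sketch of the O_SYNC variant (the file name is hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* With O_SYNC, write() does not return until the data has been
           transferred to the underlying device. */
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096] = {0};
        if (write(fd, buf, sizeof buf) < 0)  /* now includes the physical I/O */
            perror("write");
        close(fd);
        return 0;
    }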
You basically can't know when the data is really written to disk, and the actual disk write may happen a long time (typically a few minutes) after your process has terminated. Also, the disk itself has some cache inside its controller. Be happy with that, since your system's page cache is then very effective (and makes your Linux system feel fast).
You might consider calling the sync(2) system call, but you often should not: it can be slow, and it still doesn't guarantee any writing -- it often just asks the kernel to flush its buffers later.
On a given open file descriptor, you could consider fsync(2). As Joe answered, you might pass O_SYNC to open, but that would slow down the system.
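A sketch of the fsync(2) approach: write() itself still returns quickly, but fsync() blocks until the kernel has flushed the file's dirty pages, so timing the fsync() call gives you a measurable point at which the data has reached the device (file name hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096] = {0};
        write(fd, buf, sizeof buf); /* returns once the page cache has the data */

        if (fsync(fd) < 0)          /* blocks until the data is on the device */
            perror("fsync");
        close(fd);
        return 0;
    }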
I strongly suggest (for performance reasons) trusting your kernel's page-cache management and avoiding any manual disk flushes. See also the related posix_fadvise(2) & madvise(2) system calls.
If you benchmark a program, run it several times and decide what matters most to you: an average of the measured times (perhaps excluding the best and/or worst of them), or the worst or best of them. The point is that the I/O time (or the CPU time, or the elapsed real time) of an application is inherently ambiguous, so you should explain your benchmarking process when publishing benchmark results.
You can refer to this link; it might help you: Flush Data to disk.
As far as writing to disk is concerned, it is unpredictable -- there is no definitive way of telling when it happens. But you can make sure the data has been written to disk by calling sync.

Does v4l2 support multi-map?

I'm trying to share frames (images) that I receive from a USB camera (Logitech C270) between two processes, so that I can avoid a memcpy. I'm using the memory-mapped streaming I/O method described here, and I can successfully get frames from the camera after using v4l2_mmap. However, I have another process (for image processing) which has to use the image buffers after the dequeue and then signal the first process to queue the buffer again.
Searching online, I found that opening a video device multiple times is allowed, but when I try to map (I tried both v4l2_mmap and plain mmap) in the second process after a successful v4l2_open, I get an EINVAL error.
I found this PDF, which talks about implementing multi-map in v4l2 (not official), and was wondering whether this has been implemented. I have also tried the user-pointer streaming I/O method, whose documentation explicitly states that shared memory can be used, but I get EINVAL when I request buffers (according to the documentation on linuxtv.org, this means the camera doesn't support user-pointer streaming I/O).
Note: I want to keep the code modular, hence two processes. If this is not possible, doing all the work in a single process (multiple threads & a global frame buffer) is still an option.
Using the standard shared-memory calls is not possible, as the two processes have to map the video device file (/dev/video0), and I cannot place it under /dev/shm.
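For context, a minimal sketch of the memory-mapped streaming setup from the question (error handling trimmed). A second process repeating the mmap against its own open() of /dev/video0 is the step that fails with EINVAL:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        int fd = open("/dev/video0", O_RDWR);

        /* ask the driver for 4 mmap-able capture buffers */
        struct v4l2_requestbuffers req;
        memset(&req, 0, sizeof req);
        req.count  = 4;
        req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        req.memory = V4L2_MEMORY_MMAP;
        ioctl(fd, VIDIOC_REQBUFS, &req);

        /* query and map the first buffer */
        struct v4l2_buffer buf;
        memset(&buf, 0, sizeof buf);
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index  = 0;
        ioctl(fd, VIDIOC_QUERYBUF, &buf);

        void *start = mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, buf.m.offset);
        (void)start; /* frames land here after VIDIOC_QBUF/VIDIOC_DQBUF */
        return 0;
    }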
The main problem with multi-consumer mmap is that it needs to be implemented on the device-driver side. That is: even if some devices support multi-map, others might not.
So unless you control which camera is used with your application, you will eventually come across one that does not support it, in which case your application would not work.
So in any case, your application should provide a way to handle non-multi-map devices.
Btw, you do not need multiple processes to keep your code modular.
Multiple processes have their merits (e.g. privilege separation, crash resilience, ...), but they might also encourage code duplication...
This may not be relevant now, but: you don't need the full multi-consumer machinery to do this. I have used Python to hand off processing of the mmap buffers to multiple processes (Python multi-threading only allows one thread to execute at a time).
If you're running multi-threaded, then worker threads can pick up a buffer and process it independently when triggered by the master thread.
Since the code relies on Python's multiprocessing support, I won't post it here; it wouldn't make sense in other languages.

Is it possible to read the instruction pointer of a thread without stopping the tracee?

I am considering writing an application-specific sampling-based profiler on Linux. The ptrace API, if I understand the man page correctly, relies on kernel instrumentation that stops the tracee whenever certain events happen.
Is there a way to read the instruction pointer of a thread (from another thread on another core) without stopping the process?
First, the instruction pointer alone is useless for profiling, no matter how application-specific.
Look at the second answer on this post for a discussion of all the related issues.
Second, to get any useful information out of a thread, you do have to stop it long enough to read the information; then it can start up again.
(Note that this is what happens whenever the thread services an interrupt of any kind.)
Don't assume you need a large number of samples (or that your sampling has to be fast for that reason).
That's a long-standing, widely accepted idea (and taught, by people who should know better), and it has no foundation, statistical or otherwise.
(Academics might want to look here.)
Third, take a look at lsstack.
If you want to write your own profiler, it would be a good code base to start from.

Sharing stdout among multiple threads/processes

I have a Linux program (the language doesn't matter) which prints its log to stdout.
The log IS needed for monitoring the process.
Now I'm going to parallelize it by fork'ing or using threads.
The problem: the resulting stdout will contain an unreadable mix of unrelated lines...
And finally, the question: how would you restructure the output logic for parallel processes?
Sorry for answering myself...
The definitive solution was to use the GNU parallel utility.
It replaces the well-known xargs utility, but runs the commands in parallel, separating the output into groups.
So I just left my simple one-process, one-thread utility as-is and piped its invocation through parallel like this:
generate-argument-list | parallel [options] my-utility
Depending on parallel's options, this can produce nicely grouped output for multiple calls of my-utility.
If you're in C++, I'd consider using Pantheios, or its derivative Boost::Log, or have a look at Logging In C++ : Part 2.
If you're in another language, then file locking around the I/O operations is probably the way to go (see File Locks). You can achieve the same results using semaphores or any other process-synchronization mechanism, but for me file locks are the easiest.
You could also consider using syslog if this monitoring is meant to be system-wide.
Another approach, which we use, is to delegate logging to a dedicated logger thread. All other threads wishing to log send their messages to the logger thread. This method gives you flexibility, as the formatting of logs is done in a single place and can be made configurable. If you don't want to worry about locks, you can use sockets for the message passing.
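A minimal sketch of that logger-thread idea, assuming pthreads and a fixed-size ring buffer (overflow handling omitted for brevity):

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define QSIZE  128
    #define MSGLEN 256

    static char queue[QSIZE][MSGLEN];
    static int head, tail;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

    /* called by any worker thread */
    void log_enqueue(const char *msg)
    {
        pthread_mutex_lock(&qlock);
        snprintf(queue[tail], MSGLEN, "%s", msg);
        tail = (tail + 1) % QSIZE;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    /* the single logger thread: formatting lives in one place */
    void *logger_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (head == tail)
                pthread_cond_wait(&qcond, &qlock);
            char msg[MSGLEN];
            memcpy(msg, queue[head], MSGLEN);
            head = (head + 1) % QSIZE;
            pthread_mutex_unlock(&qlock);
            printf("%s\n", msg); /* single writer: no interleaved lines */
        }
        return NULL;
    }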
If it's multithreaded, then you'll need to protect printing/writing to the stdout log with a mutex. The most common way to do this on Linux in C/C++ is with a pthread_mutex. Additionally, if it's C++, Boost has synchronization primitives that could be used.
To implement it, you should probably encapsulate all the logging in one function or object, and lock and unlock the mutex internally.
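For instance, a minimal sketch of such an encapsulated logging function (pthreads assumed):

    #include <pthread.h>
    #include <stdarg.h>
    #include <stdio.h>

    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

    /* every thread logs through this; the mutex keeps lines intact */
    void log_line(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        pthread_mutex_lock(&log_lock);
        vprintf(fmt, ap);  /* the whole line is emitted under the lock */
        putchar('\n');
        fflush(stdout);
        pthread_mutex_unlock(&log_lock);
        va_end(ap);
    }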
If the blocking cost of logging becomes prohibitive, you could consider buffering the log messages (in the aforementioned object or function) and only writing to stdout when the buffer is full. You'll still need mutex protection for the buffer, but buffering will be faster than writing to stdout.
If each thread has its own logging messages, the threads will still need to share the same mutex for writing to stdout. In this case it would probably be best for each thread to buffer its respective log messages and only write to stdout when its buffer is full, thus only acquiring the mutex when actually writing to stdout.

Faster forking of large processes on Linux?

What's the fastest, best way on modern Linux of achieving the same effect as a fork-execve combo from a large process?
My problem is that the forking process is ~500 MB big, and a simple benchmarking test achieves only about 50 forks/s from that process (cf. ~1600 forks/s from a minimally sized process), which is too slow for the intended application.
Some googling turns up vfork as having been invented as the solution to this problem... but also warnings not to use it. Modern Linux seems to have acquired the related clone and posix_spawn calls; are these likely to help? What's the modern replacement for vfork?
I'm using 64-bit Debian Lenny on an i7 (the project could move to Squeeze if posix_spawn would help).
On Linux, you can use posix_spawn(2) with the POSIX_SPAWN_USEVFORK flag to avoid the overhead of copying page tables when forking from a large process.
See Minimizing Memory Usage for Creating Application Subprocesses for a good summary of posix_spawn(2), its advantages and some examples.
To take advantage of vfork(2), make sure you #define _GNU_SOURCE before #include <spawn.h> and then simply call posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK).
I can confirm that this works on Debian Lenny, and provides a massive speed-up when forking from a large process.
benchmarking the various spawns over 1000 runs at 100M RSS:

                              user      system     total      real
    fspawn (fork/exec):       0.100000  15.460000  40.570000  (41.366389)
    pspawn (posix_spawn):     0.010000   0.010000   0.540000  ( 0.970577)
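A minimal, self-contained sketch of the recipe above (the command run is hypothetical):

    #define _GNU_SOURCE  /* exposes POSIX_SPAWN_USEVFORK */
    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void)
    {
        posix_spawnattr_t attr;
        posix_spawnattr_init(&attr);
        posix_spawnattr_setflags(&attr, POSIX_SPAWN_USEVFORK);

        char *const argv[] = { "/bin/date", NULL };
        pid_t pid;
        int err = posix_spawn(&pid, argv[0], NULL, &attr, argv, environ);
        posix_spawnattr_destroy(&attr);
        if (err != 0) { fprintf(stderr, "posix_spawn: %d\n", err); return 1; }

        waitpid(pid, NULL, 0);
        return 0;
    }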
Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this post about using huge-page support to improve fork performance.
Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.
Did you actually measure how much time forks take? Quoting the page you linked,
Linux never had this problem; because Linux used copy-on-write semantics internally, Linux only copies pages when they changed (actually, there are still some tables that have to be copied; in most circumstances their overhead is not significant)
So the raw number of forks per second doesn't really show how big the overhead will be. You should measure the time consumed by the forks you actually perform (this is generic advice), not benchmark maximum fork throughput.
But if you really find that forking a large process is slow, you may spawn a small ancillary process early on, pipe the master process to its input, and have it receive commands to exec. The small process will fork and exec these commands.
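A sketch of that ancillary-process scheme, assuming newline-delimited command lines sent over a pipe:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pipefd[2];
        pipe(pipefd);

        if (fork() == 0) {            /* helper, forked while we are small */
            close(pipefd[1]);
            FILE *in = fdopen(pipefd[0], "r");
            char line[1024];
            while (fgets(line, sizeof line, in)) {
                line[strcspn(line, "\n")] = '\0';
                if (fork() == 0) {    /* cheap: the helper's RSS stays tiny */
                    execl("/bin/sh", "sh", "-c", line, (char *)NULL);
                    _exit(127);
                }
                wait(NULL);
            }
            _exit(0);
        }

        close(pipefd[0]);
        /* ... the master can now grow to ~500 MB without penalty ... */
        dprintf(pipefd[1], "date\n");  /* ask the helper to run a command */
        close(pipefd[1]);
        wait(NULL);
        return 0;
    }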
posix_spawn()
This function, as far as I understand, is implemented via fork/exec on desktop systems. However, on embedded systems (particularly those without an MMU on board), processes are spawned via a syscall whose interface is posix_spawn or a similar function. Quoting the informative section of the POSIX standard describing posix_spawn:
Swapping is generally too slow for a realtime environment.
Dynamic address translation is not available everywhere that POSIX might be useful.
Processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services.
Thus, POSIX needs process creation and file execution primitives that can be efficiently implemented without address translation or other MMU services.
I don't think that you will benefit from this function on desktop if your goal is to minimize time consumption.
If you know the number of subprocesses ahead of time, it might be reasonable to pre-fork your application on startup and then distribute the execv information via a pipe. Alternatively, if there is some sort of "lull" in your program, it might be reasonable to fork a subprocess or two ahead of time for quick turnaround later. Neither of these options directly solves the problem, but if either approach suits your app, it might let you side-step the issue.
I've come across this blog post: http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
pid = clone(fn, stack_aligned, CLONE_VM | SIGCHLD, arg);
Excerpt:
The system call clone() comes to the rescue. Using clone() we create a child process which has the following features:
The child runs in the same memory space as the parent. This means that no memory structures are copied when the child process is created. As a result of this, any change to any non-stack variable made by the child is visible to the parent process. This is similar to threads, and therefore completely different from fork(), and also very dangerous – we don’t want the child to mess up the parent.
The child starts from an entry function which is called right after the child is created. This is like threads, and unlike fork().
The child has a separate stack space, which is similar to threads and fork(), but entirely different from vfork().
The most important part: this thread-like child process can call exec().
In a nutshell, by calling clone in the way shown above, we create a child process which is very similar to a thread but still can call exec().
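A runnable sketch of that clone() call (glibc on Linux assumed; the stack pointer passed is the top of the allocation, since the stack grows downwards on x86):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child_fn(void *arg)
    {
        (void)arg;
        /* the thread-like child immediately replaces itself via exec */
        execl("/bin/date", "date", (char *)NULL);
        _exit(127); /* only reached if exec fails */
    }

    int main(void)
    {
        const size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);
        if (!stack) { perror("malloc"); return 1; }

        /* CLONE_VM: share the address space; SIGCHLD: reap like a child */
        pid_t pid = clone(child_fn, stack + stack_size,
                          CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }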
However I think it may still be subject to the setuid problem:
http://ewontfix.com/7/ "setuid and vfork"
Now we get to the worst of it. Threads and vfork allow you to get in a
situation where two processes are both sharing memory space and
running at the same time. Now, what happens if another thread in the
parent calls setuid (or any other privilege-affecting function)? You
end up with two processes with different privilege levels running in a
shared address space. And this is A Bad Thing.
Consider for example a multi-threaded server daemon, running initially
as root, that’s using posix_spawn, implemented naively with vfork, to
run an external command. It doesn’t care if this command runs as root
or with low privileges, since it’s a fixed command line with fixed
environment and can’t do anything harmful. (As a stupid example, let’s
say it’s running date as an external command because the programmer
couldn’t figure out how to use strftime.)
Since it doesn’t care, it calls setuid in another thread without any
synchronization against running the external program, with the intent
to drop down to a normal user and execute user-provided code (perhaps
a script or dlopen-obtained module) as that user. Unfortunately, it
just gave that user permission to mmap new code over top of the
running posix_spawn code, or to change the strings posix_spawn is
passing to exec in the child. Whoops.
