How to capture all generated events with ftrace without any loss - linux

I'm currently doing some experiments, and I need to record all the events generated during a normal stress-ng execution cycle like this: /usr/bin/stress-ng -c 80 -t 30 --times --exec 50 --exec-ops 50, specifically the ones related to exec (sched:sched_process_exec and syscalls:sys_enter_execve).
Unfortunately, when analysing the trace file, I find some processes that didn't generate any sys_execve event but were still captured by sched_process_exec, which to me makes no sense.
This happens even though no events were lost (in the trace file header, entries-in-buffer and entries-written are equal, and trace-cmd doesn't warn about lost events).
Given this situation I can't understand why this happens, and the only explanation I can come up with is that these events are simply not being recorded. Any help would be appreciated.
Here's an example of the trace file I get, for reference.
To be clear about what I mean, lines like these should be the norm:
stress-ng-1748 [001] .... 19573.548553: sys_execve(filename: 7ffe7a791720, argv: 7ffe7a791700, envp: 7ffe7a7916f8)
stress-ng-1748 [001] .... 19573.548707: sched_process_exec: filename=/usr/bin/stress-ng pid=1748 old_pid=1748
A process which generated both the sys_execve event and the sched_process_exec event.
Whereas this one:
stress-ng-1780 [005] .... 19573.598398: sched_process_exec: filename=/usr/bin/stress-ng pid=1780 old_pid=1780
which is the last line of the file in the link, is an example of a process with no associated sys_execve event.
Bonus question: I'd also need to record the equivalent fork event (namely syscalls:sys_enter_fork) from a stress-ng run with fork-ops (or something equivalent), but I haven't been able to do so, either with trace-cmd or manually through ftrace. I've read around the internet that there are some special cases when dealing with forking processes, but I couldn't work out what to do to record this particular event.
Any help on this matter would be appreciated as well.

I solved this problem by also capturing the event syscalls:sys_enter_execve. Between the two of them I was able to capture every instance of exec being called.
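For reference (not part of the original setup), here is a minimal C sketch of enabling those same two tracepoints directly through the tracefs interface, assuming tracefs is mounted at /sys/kernel/tracing (older systems expose it under /sys/kernel/debug/tracing) and that the program runs as root. trace_write is a made-up helper, and the buffer_size_kb value is just an example knob for reducing dropped events:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: write a string into a tracefs control file. */
    static void trace_write(const char *file, const char *value)
    {
        char path[256];
        snprintf(path, sizeof(path), "/sys/kernel/tracing/%s", file);

        FILE *f = fopen(path, "w");   /* opening "trace" for writing also clears it */
        if (!f) { perror(path); exit(1); }
        fputs(value, f);
        fclose(f);
    }

    int main(void)
    {
        trace_write("trace", "\n");               /* clear the ring buffer             */
        trace_write("buffer_size_kb", "65536");   /* bigger per-CPU buffer, fewer drops */
        trace_write("set_event",
                    "sched:sched_process_exec syscalls:sys_enter_execve\n");
        trace_write("tracing_on", "1");           /* start recording                    */

        /* ... run stress-ng here, then read /sys/kernel/tracing/trace ... */
        return 0;
    }

After the run, the recorded events can be read back from /sys/kernel/tracing/trace, the same file whose header carries the entries-in-buffer/entries-written counters mentioned above.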

Related

Monitoring Process Syscalls in Live Environment

I've been working on a project for a little while, and the first step is building a library of syscall traces for processes. Essentially, what I'm trying to do is have a system wherein every time a process requests an OS service via a syscall, the relevant information about the event (calling process, time, syscall name) gets logged to a file.
Theoretically, this sounds like a simple enough thing to do; however, implementing it is becoming more of a pain as time goes on. I suppose the main thing causing issues for me is not really knowing where to start the implementation.
Initially, I thought that this could all be handled by adding a few lines of code to the kernel entry point, but after digging through entry_64.S for a little while, I came to the conclusion that there must be an easier way. The next idea I had was to overwrite all the services pointed to by sys_call_table with my own service that did the logging and then called the original service. But it turns out there are some difficulties with this method on Linux kernel 5.4.18, since sys_call_table is no longer exported. And even when recompiling the kernel so that sys_call_table is exported, the table sits in write-protected memory. Lastly, I've been experimenting with auditd. Specifically, I followed this link, but it doesn't seem to be working (when I executed the kill command there was a corresponding result in ausearch only about 50% of the time, going by timestamps).
I'm getting a little burned out by all these dead-ends, and am really hoping to finally have this first stage in my project up and running. Does anyone have any pointers as to what I should try?
Solution: BPFTrace was exactly what I was looking for.
I used BPFTrace to log every time the kernel began execution of a syscall (excluding those initiated by BPFTrace itself)
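The bpftrace program itself isn't shown here, but for comparison, a per-process (not system-wide) way to get a similar log in plain C is ptrace with PTRACE_SYSCALL. This is only a rough sketch for x86_64, printing the syscall number rather than its name, and it is not the BPFTrace solution the answer describes:

    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) { fprintf(stderr, "usage: %s command [args...]\n", argv[0]); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
            execvp(argv[1], &argv[1]);
            perror("execvp");
            return 127;
        }

        int status;
        waitpid(pid, &status, 0);                    /* child stops after its execvp */

        int entering = 1;
        while (!WIFEXITED(status)) {
            ptrace(PTRACE_SYSCALL, pid, NULL, NULL); /* run until next syscall boundary */
            waitpid(pid, &status, 0);

            if (WIFSTOPPED(status) && entering) {
                struct user_regs_struct regs;
                ptrace(PTRACE_GETREGS, pid, NULL, &regs);
                /* orig_rax holds the syscall number on x86_64 */
                printf("%ld pid=%d syscall=%lld\n",
                       (long)time(NULL), (int)pid, (long long)regs.orig_rax);
            }
            entering = !entering;
        }
        return 0;
    }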

Is it OK to use a non-zero return code for a process that executed successfully?

I'm implementing a simple job scheduler, which spawns a new process for every job to run. When a job exits, I'd like it to report the number of actions executed to the scheduler.
The simplest way I could find is to exit with the number of actions as the return code. The process would, for example, exit with return code 3 for "3 actions executed".
But since the convention (AFAIK) is to use return code 0 when a process exits successfully and any other value when there was an error, would this approach risk creating any problems?
Note: the child process is not an executable script, but a fork of the parent, so not accessible from the outside world.
What you are looking for is inter-process communication - and there are plenty of ways to do it:
Sockets
Shared memory
Pipes
Exclusive file descriptors (to some extent; rather go for something else if you can)
...
The return-code convention is not something a regular programmer should dare to violate.
The only risk is confusing a calling script. What you describe makes sense, since what you really want is the count. As Joe said, use negative values for failures, and you should consider including a --help option that explains the return values ... so you can figure out what this code is doing when you try to use it next month.
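To make the trade-off concrete, here is a small sketch (not the asker's scheduler) of what the parent sees when the count is passed back as the exit status; WEXITSTATUS only carries the low 8 bits, so counts above 255 wrap around:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            int actions_executed = 3;   /* pretend the job ran 3 actions */
            exit(actions_executed);     /* report the count as the exit status */
        }

        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            /* Only the low 8 bits survive: a count of 300 would read back as 44. */
            printf("child reported %d actions\n", WEXITSTATUS(status));
        return 0;
    }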
I would use logs for it: log the number of actions executed to the scheduler. This way you can also log datetimes and other extra info.
I would not change the return convention...
If the scheduler spawns a child and you are the one writing it, you could also open a pipe per child, or a named pipe, or maybe Unix domain sockets, and use that for inter-process communication, writing the number of processed jobs there.
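A minimal sketch of that pipe-per-child idea, with made-up variable names and message format: the child reports its count through the pipe and keeps the conventional exit status 0 for success.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];                      /* fds[0]: read end, fds[1]: write end */
        if (pipe(fds) == -1) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                  /* child: the job */
            close(fds[0]);
            int actions_executed = 42;
            char msg[32];
            int len = snprintf(msg, sizeof(msg), "%d\n", actions_executed);
            write(fds[1], msg, len);     /* report the count through the pipe */
            close(fds[1]);
            _exit(0);                    /* exit status stays 0 = success */
        }

        close(fds[1]);                   /* parent: the scheduler */
        char buf[32] = {0};
        read(fds[0], buf, sizeof(buf) - 1);
        close(fds[0]);
        waitpid(pid, NULL, 0);
        printf("job executed %d actions\n", atoi(buf));
        return 0;
    }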
I would stick with conventions, namely returning 0 for success, especially if your program is visible to or usable by other people; in any case, document those decisions well.
Anyway, apart from conventions, there are also standards.

Logging to a non-blocking named pipe?

I have a question, and I couldn't find help anywhere on Stack Overflow or the web.
I have a program (celery distributed task queue) and I have multiple instances (workers) each having a logfile (celery_worker1.log, celery_worker2.log).
The important errors are stored to a database, but I like to tail these logs from time to time when running new operations to make sure everything is ok (the loglevel is lower).
My problem: these logs are taking a lot of disk space.
What I would like to do: be able to "watch" the logs (tail -f) only when I need it, without them taking a lot of space.
My ideas until now:
outputting logs to stdout, not to a file: not possible here since I have many workers outputting to different files, but I want to tail them all at once (tail -f celery_worker*.log)
using logrotate: it is an "OK" solution for me. I don't want this to be a daily task, but I'd rather not set up a per-minute crontab for it either; what's more, the server is not mine, so that would mean some work on the sysadmin side
using named pipes: it looked good at first sight, but I didn't know that named pipes (Linux FIFOs) were blocking. Hence, when I don't tail -f ALL of the pipes at the same time, or when I just quit my tail, the write operations from the logger are blocked.
Is there a way to have a non-blocking named pipe, which would just throw to stdout when tailed, and throw to /dev/null when not?
Or are there technical difficulties to such a type of pipe? If there are, what are they?
Thank you for your answers!
Have each worker log to stdout, but connect each stdout to a utility that automatically spools and rotates logs based on size or time. multilog and svlogd are examples of such. For those programs, you'd merely tail the "current" log file.
You're right that logrotate is not quite the right solution for the problem you have.
Named pipes won't work as you want. At best, your writers could fill up their pipes and then discard subsequent logs, which is the inverse of the behavior you want.
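To make that concrete, here is a sketch of the best a writer can do with a plain FIFO (the path and helper name are made up, and the FIFO is assumed to already exist via mkfifo): open it with O_NONBLOCK so the open fails with ENXIO when nobody is tailing, and drop lines when the write would block.

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical helper: try to push one log line into the FIFO,
     * dropping it if no reader is attached or the pipe is full. */
    static void log_line(const char *fifo_path, const char *line)
    {
        int fd = open(fifo_path, O_WRONLY | O_NONBLOCK);
        if (fd == -1)
            return;                 /* ENXIO: nobody is tailing, drop the line */

        if (write(fd, line, strlen(line)) == -1) {
            /* EAGAIN (pipe full) or EPIPE (reader vanished): drop the line */
        }
        close(fd);
    }

    int main(void)
    {
        signal(SIGPIPE, SIG_IGN);   /* a vanished reader must not kill the logger */
        log_line("/tmp/celery_worker1.fifo", "worker1: task started\n");
        return 0;
    }

As the answer above says, this discards logs whenever nothing is reading, which is the opposite of what the question actually wants.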
You could try a shared memory object (see man shm_overview), or perhaps a number of them. You would need to organise them as circular buffers so they store the last N KB of your log, and whenever you read one with a reader it outputs everything to your console. This approach is adopted by BusyBox's syslogd/logread suite (see logread.c).

popen & status of a pipe

Let's say I spawn a process PO through popen (READ ONLY) from a process PA. I then pclose() the pipe on PA's side.
On PO's side, how do I determine if stdout is still available without executing a write()?
Note that I have tried catching SIGPIPE on PO's side to no avail.
UPDATED: I tried using fstat(1, &buf) without success.
UPDATED: The reason I need to detect this condition from PO's side is that I do not have access to PO's PID from PA (and hence can't kill it). Furthermore, I'd like PO to be more robust in the face of failures of PA, i.e. to exit by itself.
RESOLUTION: I went ahead and used socketpair, fork. Trying to control a process through popen turned out to be a nightmare (to me at least). A big thanks to everyone who contributed!
Mmm... pclose() is supposed to wait for PO to finish before closing the pipe. In the meantime, PO can keep writing to its end of the pipe, at least up to 4096 bytes (ulimit -p times 512), then it should simply block.
Perhaps you will have to switch for pipe()/fork()/dup2()/close() if you want more control. If this is what you want, let me know and I'll post some code.
PA is the consumer of information (Hence it does popen() , and pclose() ).
PO is the provider, and hence the server - in this case it only knows that it is writing to stdout, but cannot tell what stdout is bound to. So in this case PO is not supposed to know too much about stdout.
The EOF detection should happen at the PA program.
Could you post a few more details about why you need to do it in PO?
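For what it's worth, if PO's stdout really is a pipe, it can check for a vanished reader without writing: poll(2) reports POLLERR on the write end of a pipe once all readers have closed it. A minimal sketch (the surrounding loop is just illustrative):

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns 1 if the read end of the pipe behind stdout is gone. */
    static int stdout_reader_gone(void)
    {
        struct pollfd pfd = { .fd = STDOUT_FILENO, .events = 0 };
        if (poll(&pfd, 1, 0) < 0)
            return 0;                       /* poll itself failed, assume alive */
        return (pfd.revents & POLLERR) != 0;
    }

    int main(void)
    {
        while (!stdout_reader_gone()) {
            /* ... produce output ... */
            sleep(1);
        }
        fprintf(stderr, "PO: stdout reader went away, exiting\n");
        return 0;
    }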

Linux iNotify one shot and event mask problem

I'm trying to use inotify on Linux RHEL 5, kernel 2.6.18, glibc 2.5-18. I did not define the event as one-shot, but for some reason it behaves as if I did. The impact is that I have to re-add the watch after each event. Has anyone ever used inotify? Another problem is that the mask returned in the event object contains only one flag: IN_ONE_SHOT.
Write the smallest example you can and test that. If it demonstrates the behaviour you are talking about then add it to your question. If it behaves normally then add a little more of your code and test again. Keep repeating until you have reproduced the error or you have your code working. Often I find that building a toy program tells me exactly what I am doing wrong that I could not see in a larger program.
It is probable that inotify is implicitly deleting the watch because the file is being deleted. The behaviour is subtly referred to by the manual page (see the section on the IN_IGNORED event). You can check if this is happening by checking if the flag IN_IGNORED is set in the inotify_event populated by your call to read.
See also inotify delete_self when modifying and saving a file for why the file may be deleted without your knowledge or action during what you think is just a modification.
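A minimal sketch of the IN_IGNORED check suggested above (the watched path is just an example): when the flag shows up, the kernel has dropped the watch, typically because the file was deleted or replaced, so the watch has to be re-added.

    #include <stdio.h>
    #include <sys/inotify.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/tmp/watched.txt";   /* example path */
        int fd = inotify_init();
        int wd = inotify_add_watch(fd, path, IN_MODIFY | IN_CLOSE_WRITE);
        if (fd == -1 || wd == -1) { perror("inotify"); return 1; }

        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                break;
            for (char *p = buf; p < buf + len; ) {
                struct inotify_event *ev = (struct inotify_event *)p;

                if (ev->mask & IN_IGNORED) {
                    /* The kernel dropped the watch (e.g. the file was deleted
                     * or replaced by the editor); re-add it to keep watching. */
                    wd = inotify_add_watch(fd, path, IN_MODIFY | IN_CLOSE_WRITE);
                } else {
                    printf("event mask=0x%x on wd=%d\n", ev->mask, ev->wd);
                }
                p += sizeof(struct inotify_event) + ev->len;
            }
        }
        return 0;
    }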
