Kernel panic seems to be unrelated to the changes - linux

I made changes in sched.c in Linux kernel 2.4 (homework), and now the system goes into kernel panic. The strange thing is that it seems to pass A LOT of boot checks and initializations, and panics only at the very end, showing the following stack trace:
update_process_times
do_timer
timer_interrupt
handle_IRQ_event
do_IRQ
call_do_IRQ
do_wp_page
handle_mm_fault
do_page_fault
do_sigaction
sys_rt_sigaction
do_page_fault
error_code
And the error is: "In interrupt handler - not syncing"
I know it's hard to tell without any code, but can anybody make an educated guess to point me in the right direction?

I can give you my own personal mantra when debugging kernel problems: "it's always your fault."
I often see issues due to overwriting memory outside where I'm working -- if I feed hardware an incorrect address for DMA for example. You may be screwing up a lock somehow; that seems possible in this case if you are seeing a timeout: a forgotten locked lock is causing a timeout to occur due to a hang.
To me, a panic in update_process_times might suggest a problem with the task struct pointer... but I really have no idea.
Keep in mind that things in the kernel often go wrong long before a failure occurs, so a wrong bit anywhere in your code may be to blame, even if it doesn't seem like it should have an effect. If you can, I recommend incrementally adding or removing your code and checking for the problem to see if you can isolate it.

Related

Monitoring Process Syscalls in Live Environment

I've been working on a project for a little while, and the first step is building a library of syscall traces for processes. Essentially, what I'm trying to do is have a system wherein every time a process requests an OS service via a syscall, relevant information about the event (calling process, time, syscall name) gets logged to a file.
Theoretically, this sounds like a simple enough thing to do. However, implementing it is becoming more of a pain as time goes on. I suppose the main thing that's causing issues for me is not knowing where to start the implementation.
Initially, I thought that this could all be handled by adding a few lines of code to the kernel entry point, but after digging through entry_64.S for a little while, I came to the conclusion that there must be an easier way. The next idea I had was to overwrite all the services pointed to by sys_call_table with my own service that did the logging and then called the original service. But it turns out there are some difficulties with this method on Linux kernel 5.4.18, because sys_call_table is no longer exported. And even when recompiling the kernel so that sys_call_table is exported, the table is in a memory-protected location. Lastly, I've been experimenting with auditd. Specifically, I followed this link, but it doesn't seem to be working (when I executed the kill command, there was a corresponding result in ausearch only about 50% of the time, based on timestamps).
I'm getting a little burned out by all these dead-ends, and am really hoping to finally have this first stage in my project up and running. Does anyone have any pointers as to what I should try?
Solution: BPFTrace was exactly what I was looking for.
I used BPFTrace to log every time the kernel began execution of a syscall (excluding those initiated by BPFTrace itself).

What is the difference between panic and process::exit

As per the title, what is the difference between these two and when should I consider using one over the other?
There may or may not be a difference, depending on which panic strategy is configured (the panic setting in Cargo.toml). Depending on whether it is set to unwind or abort, different things will happen:
With unwind, this will (as the name suggests) unwind the stack. With this, in particular, it is possible to get a full stack trace.
With abort, you will only get the last callee.
process::exit(), on the other hand, is a "clean" exit - you will not get a last callee, and you'll get a regular process exit status.
Due to this, you'll ideally want to keep to the following:
For planned shutdowns, use exit(). Do note that a known error is considered a planned shutdown.
For unplanned shutdowns (i.e. exceptional failures), consider panic!(): you benefit from being able to get a stack trace when this happens, and the failure case should be exceptional enough that it is effectively unaccounted for and stems from an unplanned scenario.
Afaik, a panic is never supposed to happen in a released program. It gives information for developers, but nothing user-friendly. I'd say "use it for errors that should not happen in prod". Behind the scenes there is probably something like an exit(101);
exit just terminates your process with the code you give to it. An exit(0) should mean "Everything is okay".

Does any real program check for file close errors?

I have never seen a real use for checking whether a file was closed correctly. I mean, if it didn't close, then what? You have nothing smart to do. Besides, I'm not sure there's a real-world use case where none of the writes/reads/flushes fail and only the close does.
Does anyone actually use the return value of close?
From the close(2) man page:
Not checking the return value of close() is a common but nevertheless serious
programming error. It is quite possible that errors on a previous write(2)
operation are first reported at the final close(). Not checking the return
value when closing the file may lead to silent loss of data. This can
especially be observed with NFS and with disk quota.
And if you use signals in your application, close() may be interrupted (EINTR).
EDIT: That said, I seldom bother unless I'm prepared to handle such cases and write code that has to be 100% fool-proof.
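For what it's worth, here is a minimal sketch of what checking close() can look like in practice. The helper name write_file_checked is made up for illustration, and the comment about not retrying close() describes Linux behaviour; other systems may differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from any library): write `len` bytes to `path`
 * and report any error, including one that only surfaces at close().
 * Returns 0 on success, or -1 with errno set. */
int write_file_checked(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written: retry */
            int saved = errno;
            close(fd);          /* best effort; report the write error instead */
            errno = saved;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Deferred write errors (e.g. NFS, disk quota) may first show up here.
     * On Linux the descriptor is released even when close() fails, so the
     * call is not retried. */
    if (close(fd) < 0)
        return -1;
    return 0;
}
```

A caller would treat a -1 from this helper the same way as a failed write: the data may not have been saved, so report it rather than silently continuing.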

How do you safely read memory in Unix (or at least Linux)?

I want to read a byte of memory but don't know whether the memory is actually readable. You can do it under OS X with the vm_read function, and under Windows with ReadProcessMemory or with __try/__except. Under Linux I believe I can use ptrace, but only if the process is not already being debugged.
FYI, the reason I want to do this is that I'm writing state-dumping code for an exception handler, and it helps the user a lot to see what various memory values are, or for that matter to know that they were invalid.
If you don't have to be fast, which is typical in code handling state dump/exception handlers, then you can put your own signal handler in place before the access attempt, and restore it after. Incredibly painful and slow, but it is done.
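A rough sketch of the signal-handler approach, assuming a single-threaded probe. The names try_read_byte, probe_fault and probe_env are made up for this example. It installs temporary SIGSEGV/SIGBUS handlers, attempts the read, and jumps back out with siglongjmp if the access faults.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Jump target for the fault handler. Shared state, so this sketch is
 * not thread-safe: only one probe may run at a time. */
static sigjmp_buf probe_env;

static void probe_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* abandon the faulting read */
}

/* Hypothetical helper: try to read one byte at `addr`.
 * Returns 1 and stores the byte in *out if the read succeeded,
 * 0 if the read faulted, -1 if the handlers could not be installed. */
int try_read_byte(const volatile unsigned char *addr, unsigned char *out)
{
    struct sigaction sa, old_segv, old_bus;
    int ok;

    sa.sa_handler = probe_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    if (sigaction(SIGSEGV, &sa, &old_segv) != 0)
        return -1;
    if (sigaction(SIGBUS, &sa, &old_bus) != 0) {
        sigaction(SIGSEGV, &old_segv, NULL);
        return -1;
    }

    /* savemask = 1 so the signal mask is restored after siglongjmp. */
    if (sigsetjmp(probe_env, 1) == 0) {
        *out = *addr;           /* may raise SIGSEGV or SIGBUS */
        ok = 1;
    } else {
        ok = 0;                 /* we got here via the fault handler */
    }

    /* Put the previous handlers back. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}
```

One caveat if this is called from inside a crash handler: SIGSEGV is normally blocked while its own handler runs, and a fault on a blocked SIGSEGV kills the process, so the signal would have to be unblocked first.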
Another approach is to parse the contents of /proc/<pid>/maps to build a map of the memory once, then on each access decide whether the address is inside a mapped region of the process or not.
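A minimal sketch of the maps-based check on Linux, using /proc/self/maps for the current process. The function name address_is_readable is made up for this example. For brevity it rescans the file on every call instead of building the map once; a cached map is faster but goes stale whenever mappings change.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if `addr` lies inside a mapping of the
 * current process that has read permission, 0 if it does not, and -1 if
 * /proc/self/maps could not be opened. */
int address_is_readable(const void *addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    int found = 0;

    while (fgets(line, sizeof line, f) != NULL) {
        uintmax_t start, end;
        char perms[5];

        /* Each entry begins with "start-end perms", addresses in hex.
         * Lines that do not parse (e.g. the tail of an overlong line)
         * are skipped. */
        if (sscanf(line, "%jx-%jx %4s", &start, &end, perms) != 3)
            continue;

        if (target >= start && target < end) {
            found = (perms[0] == 'r');  /* first flag is the read bit */
            break;
        }
    }

    fclose(f);
    return found;
}
```

Keep in mind the answer is only a snapshot: another thread can unmap the region between the check and the actual read.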
If it were me, I'd try to find something that already does this and re-implement or copy the code directly if licence met my needs. It's painful to write from scratch, and nice to have something that can give support traces with symbol resolution.

pthread_mutex_lock return not tested

I'm really wondering why source code that calls pthread_mutex_lock never tests its return value, which is defined in the pthread documentation.
Even in books, the examples don't test whether the lock failed; the code just takes the lock.
Is there a reason I missed for leaving it untested?
Basically, the only “interesting” error is EINVAL, which in most programs will only happen because of memory corruption, or, as I know from my own painful experience, during program shutdown after destructors have already destroyed some mutexes. The way I see it, the only reasonable response to such an error is to abort the program, which on the other hand is very inconvenient if the errors occur precisely because the program is already shutting down. Of course, this can be solved, but it’s not at all that simple, and not much is gained by it for most programs.
First off, I think "all source code" and "never test" are too strong. I think "some" and "often" would be more accurate.
In books, error checking code is often omitted for clarity of exposition.
As to real-world code, I guess the answer has to be that it is perceived that the likelihood of failure is very low. Whether this is a good assumption is debatable.
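For illustration, a small sketch of what testing the return value can look like. The names mutex_lock_or_abort and relock_error_demo are made up for this example. The first function aborts on any failure, in line with the "only reasonable response is to abort" view above. The second uses an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), which is an assumption that goes beyond the default mutexes discussed here, simply because it gives a failure that is easy to reproduce.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: lock the mutex or abort with a diagnostic.
 * pthread functions return the error number directly; they do not set errno. */
void mutex_lock_or_abort(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(rc));
        abort();
    }
}

/* Demonstration: with an error-checking mutex, locking it a second time
 * from the owning thread fails instead of deadlocking.
 * Returns the error number of the second lock attempt, or -1 if the
 * mutex could not be set up. */
int relock_error_demo(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    mutex_lock_or_abort(&m);            /* first lock succeeds */
    int rc = pthread_mutex_lock(&m);    /* second lock by the owner fails */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

On an error-checking mutex the second attempt reports EDEADLK. With a default mutex the same mistake is undefined behaviour and typically just hangs, which is one more reason the return value rarely tells you much in ordinary code.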
