I just finished writing a Linux security module which verifies the integrity of executable files at the start of their execution (using digital signatures). Now I want to dig a little deeper and check the files' integrity at run-time (i.e. check them periodically, since I am mostly dealing with processes that get started and run forever...) so that an attacker cannot change the file within main memory without being detected (at least after some time).
The problem here is that I have absolutely no clue how to check the file's current memory image. My authentication method mentioned above makes use of an mmap hook which gets called whenever a file is mmapped before its execution, but as far as I know the LSM framework does not provide tools for periodic checks.
So my question: Are there any hints on how I should start this? How can I read a memory image and check its integrity?
Thank you

I understand what you're trying to do, but I'm really worried that this may be a security feature that gives you a warm fuzzy feeling for no good reason; and those are the most dangerous kinds of security features to have. (Another example of this might be the LSM sitting right next to yours, SELinux. Although I think I'm in the minority on this opinion...)
The program data of a process is not the only thing that affects its behavior. Stack overflows, where malicious code is written into the stack and jumped into, make integrity checking of the original program text moot. Not to mention the fact that an attacker can use the original unchanged program text to his advantage.
Also, there are probably some performance issues you'll run into if you are constantly computing DSA inside the kernel. And you're adding that much more to the long list of privileged kernel code that could possibly be exploited later on.
In any case, to address the question: You can possibly write a kernel module that instantiates a kernel thread that, on a timer, hops through each process and checks its integrity. This can be done by using the page tables for each process, mapping in the read only pages, and integrity checking them. This may not work, though, as each memory page probably needs to have its own signature, unless you concatenate them all together somehow.
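The per-page signature concern can be illustrated in user space. This is only a minimal sketch, assuming SHA-256 digests stand in for the signatures and a 4096-byte page size; `page_digests`, `combined_digest` and `changed_pages` are hypothetical helper names, and an in-kernel version would walk the page tables rather than slice a byte string:

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size; real systems may differ


def page_digests(image: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per page of the image."""
    return [
        hashlib.sha256(image[off:off + page_size]).digest()
        for off in range(0, len(image), page_size)
    ]


def combined_digest(digests: list[bytes]) -> bytes:
    """Hash the concatenated per-page digests into a single reference value."""
    return hashlib.sha256(b"".join(digests)).digest()


def changed_pages(reference: list[bytes], image: bytes) -> list[int]:
    """Return the indices of pages whose digest no longer matches the reference."""
    current = page_digests(image)
    return [i for i, (a, b) in enumerate(zip(reference, current)) if a != b]
```

Keeping the per-page digests lets a sweep report which page changed, while the combined digest is the "concatenate them all together" variant: one value to sign, at the cost of needing every page present to verify it.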
A good thing to note is that shared libraries only need to be integrity checked once per sweep, since the same pages are mapped into all the processes that use them. It takes sophistication to implement this though, so maybe put this in the "nice-to-have" section of your design.
If you disagree with my rationale that this may not be a good idea, I'd be very interested in your thoughts. I ran into this idea at work a while ago, and it would be nice to bring fresh ideas to our discussion.


Vulkan abort problematic commands?

I have an application where multiple threads would be rendering different parts of a world. However it may occur that one of those threads could submit a highly problematic, or even malicious, command to Vulkan.
Is there any way to preemptively check a command for issues and catch it being problematic? Or to let it attempt to execute, but then by some means determine that it is problematic and abort it? All the while not corrupting or wrecking legitimate commands that were submitted from other threads.
I know the obvious solution is "don't submit malicious commands!" but without explaining everything, the gist of this is to try and create a kind of graphics sandbox.
The Vulkan run-time assumes well-formed input; there isn't any error checking (that's left to the validation layers), so without validation you could get rendering corruption or driver crashes.
You can get some limited protection against GPU-side buffer overruns using robustBufferAccess, but it only catches a tiny subset of the problems.
Beyond that the only real solution is to rely on host process isolation, and put each content provider into a separate process on the host OS with a unique rendering context.
Even with that you can get trivial denial-of-service (shader with a very long running and/or infinite loop), which the API doesn't really give you any means to control. You'd be reliant on the privileged GPU driver timing out the process and killing it.

Multi threading analysis techniques

Does anyone know of any analysis techniques that can be used to design/debug thread locking and unlocking sequences? Essentially a technique (like a truth table) I can use to prove that my sequence of locks won't deadlock.
This is not the sort of problem that programming by trial and error works well in.
My particular problem is a read write lock - but I ask this in the general sense. I believe it would be a useful technique to learn if one exists.
I have tried a causal graph in which I have boxes and arrows that I can use to follow the flow of control, and that has solved 80% of my problem. But I am still getting occasional deadlocks under stress testing when one thread sneaks through the "gap between instructions", if that makes any sense.
To summarize; what I need is some way of representing the problem so that I can formally analyze the overlap of mutex locks.
Bad news, I'm afraid. There are no techniques that I know of that can "prove" that a system that uses locks to control access to shared memory is correct. By "prove" I mean demonstrating analytically that a program won't deadlock, livelock, etc.; with locks you cannot.
The problem is that threads run asynchronously. As soon as you start having a sensible number of threads and shared resources, the number of possible sequences of events (e.g. locking/unlocking shared resources) is astronomically high and you cannot model / analyse each and every one of them.
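To put a number on "astronomically high": if each thread's own steps stay in order, the number of possible interleavings is a multinomial coefficient. A minimal sketch (the function name `interleavings` is just for illustration, and it counts orderings of atomic steps only, ignoring which ones a lock would actually forbid):

```python
from math import factorial


def interleavings(steps_per_thread: list[int]) -> int:
    """Count orderings of all steps that preserve each thread's internal order."""
    total = factorial(sum(steps_per_thread))
    for steps in steps_per_thread:
        total //= factorial(steps)  # exact: the running value stays an integer
    return total


print(interleavings([2, 2]))  # 6
print(interleavings([10, 10, 10, 10]))
```

Two threads of two steps each already give 6 orderings; four threads of ten steps each give more than 10^21.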
For this reason Communicating Sequential Processes was developed by Tony Hoare, way back in 1978. It is a development of the Actor model which itself goes a long way to resolving the problem.
Actor and CSP
Briefly, in the Actor model data is not communicated via shared memory with a lock. Instead a copy is sent down a communications channel of some sort (e.g. a socket, or pipe) between two threads. This means that you're never locking memory. In effect all memory is private to threads, with copies of it being sent as and when required to other threads. It's a very 'object orientated' thing; private data (thread-owned memory), public interface (messages emitted and received on communications channels). It's also very scalable - pipes can become sockets, threads can become processes on other computers.
The CSP model is just like that, except that the communications channel won't accept a message unless the receiving end is ready to read it.
This addition is crucial - it means that a system design can be analysed algebraically. Indeed Tony Hoare formulated a process calculus for CSP. The Wikipedia page on CSP cites use of this to prove an eCommerce system's design.
So if one is developing a strict CSP system, it is possible to prove analytically that it cannot deadlock, etc.
Real World Experience
I've done many a CSP (or CSP-ish) system, and it's always been good. Instead of doing the maths I've used intuition to help me avoid problems. In effect CSP ensures that if I've gone and built a system that can deadlock, it will deadlock every time. So at least I find it in development, not 2 years later when some network link gets a bit busier than normal.
Real World Options
For Actor model programming there's a lot of options. ZeroMQ, nanomsg, Microsoft's .NET Data Flow library.
They're all pretty good, and with care you can make a system that'll be pretty good. I like ZeroMQ and nanomsg a lot - they make it trivial to split a bunch of threads up into separate processes on separate computers and you've not changed the architecture at all. If absolute performance isn't essential coupling these two up with, for example, Google Protocol Buffers makes for a really tidy system with huge options for incorporating different OSes, languages and systems into your design.
I suspect that MS's DataFlow library for .NET moves ownership of references to the data around instead of copying it. That ought to make it pretty performant (though I've not actually tried it to see).
CSP is a bit harder to come by. You can nearly make ZeroMQ and DataFlow into CSP by setting message buffer lengths. Unfortunately you cannot set the buffer length to zero (which is what would make it CSP). MS's documentation even talks about the benefits to system robustness achieved by setting the queue length to 1.
You can synthesize CSP on top of Actor by having flows of synchronisation messages across the links. This is annoying to have to implement.
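As a rough illustration of that synchronisation-message idea, here is a minimal sketch in Python using only the standard library. `RendezvousChannel` is a hypothetical name, and the "acknowledgement" is simply `queue.Queue`'s own `task_done()` / `join()` pair; it is not how ZeroMQ or DataFlow do anything:

```python
import queue
import threading


class RendezvousChannel:
    """CSP-style channel: send() returns only after a receiver has taken the item."""

    def __init__(self) -> None:
        self._items: queue.Queue = queue.Queue(maxsize=1)
        self._send_lock = threading.Lock()  # only one sender in flight at a time

    def send(self, item) -> None:
        with self._send_lock:
            self._items.put(item)
            self._items.join()  # block until the receiver acknowledges

    def receive(self):
        item = self._items.get()
        self._items.task_done()  # the acknowledgement that releases the sender
        return item


def demo() -> list[str]:
    """Run one producer thread against a receiver and record what happened."""
    events: list[str] = []
    channel = RendezvousChannel()

    def producer() -> None:
        channel.send("hello")
        events.append("send returned")

    worker = threading.Thread(target=producer)
    worker.start()
    events.append("received " + channel.receive())
    worker.join()
    return events
```

A plain `queue.Queue()` on its own behaves like the Actor-style mailbox: `put()` returns immediately. The extra acknowledgement is what makes the sender wait for the receiver.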
I've quite often spun up my own comms framework to get a CSP environment.
There's libraries for Java I think, don't know how actively developed they are.
However as you have existing code written around locked shared memory it'll be a tough job to adapt your code. So....
Kernel Shark
If you're on Linux and your kernel has FTRACE compiled in you can use Kernel Shark to see what has happened in your system. Similarly with DTRACE on Solaris, WindView on VxWorks, TATL on MCOS.
What you do is run your system until it stops, and then very quickly preserve the FTRACE log (it gets overwritten in a circular buffer by the OS). You can then see graphically what has happened (turn on Kernel Shark's process view), which may give clues as to what did what and when.
This helps you diagnose your application's deadlock, which may lead you towards getting things right, but ultimately you can never prove that it is correct this way. That doesn't stop you having a Eureka moment where you now know in your bones that you've got it right.
I know of no equivalent of FTRACE / Kernel Shark for Windows.
For a broad range of multithreading tasks, we can draw a graph which reflects the order in which resources are locked. If that graph has cycles, deadlock is possible. If there are no cycles, deadlock can never occur.
For example, consider the Dining Philosophers problem. If each philosopher takes the left fork first and then the right fork, the lock-order graph is a ring connecting all the forks. Deadlock is very possible in this situation. However, if one of the philosophers changes his order, the ring becomes a line and deadlock can never occur. If all the philosophers change their order and all take the right fork first, the graph forms a ring again and deadlock is possible once more.
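The check described above can be sketched as a depth-first search for cycles in the lock-order graph. This is an illustrative sketch only; `has_cycle` and `philosopher_graph` are hypothetical helpers, and an edge A -> B means "some thread acquires B while holding A":

```python
def has_cycle(graph: dict[str, list[str]]) -> bool:
    """Return True if the directed lock-order graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    colour = {node: WHITE for node in nodes}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in graph.get(node, []):
            if colour[nxt] == GREY:  # back edge: we returned to our own path
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


def philosopher_graph(count: int, right_first: set[int]) -> dict[str, list[str]]:
    """Lock-order edges; philosophers listed in right_first take the right fork first."""
    graph: dict[str, list[str]] = {}
    for i in range(count):
        left, right = f"fork{i}", f"fork{(i + 1) % count}"
        first, second = (right, left) if i in right_first else (left, right)
        graph.setdefault(first, []).append(second)
    return graph


print(has_cycle(philosopher_graph(5, set())))            # True: everyone left-first
print(has_cycle(philosopher_graph(5, {4})))              # False: one philosopher reversed
print(has_cycle(philosopher_graph(5, {0, 1, 2, 3, 4})))  # True: everyone right-first
```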

How can I know when data is written to disk?

We'd like to measure the I/O time from an application by instrumenting the read() and write() routines on a Linux system. However, the calls to write() return very fast. According to my OS man page for write (man 2 write):
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
Linux manual as of 2013-01-27
so we understand that the write() call initiates an asynchronous call that at some point will flush the data to disk.
So the question is, is there a way to know when the data (even if it has been grouped for caching purposes) is being actually written into disk? -- preferably, when that process starts and ends?
EDIT1 We're particularly interested in measuring the application's behavior, and we'd like to avoid changing the semantics of the application by changing the parameters to open() -- adding O_SYNC -- or injecting calls to sync(). If you change the application's semantics, you can no longer say anything about the behavior of the original application.
You could open the file with O_SYNC, which in theory means that write won't return until the data is written to disk. Though what gets written, data or metadata, is dependent on the file system and how it is mounted. This does change how your application really works, though.
If you're really interested in handling actual I/O to storage yourself (are you a database?) then O_DIRECT leaves you control. Again this is a change in behaviour and imposes additional constraints on your application. It may be what you need, may not.
You really appear to be asking about benchmarking real performance, so the real question is what you want to know. Since a real system does so much caching, the "instant" return from the write is "real" in the sense of what delays on your application actually are. If you're looking for I/O throughput you might be better looking at higher level system statistics.
You basically can't know when the data is really written to disk, and the actual disk write may happen a long time (typically, a few minutes) after your process has terminated. Also, your disk itself has some cache (inside the disk controller). Be happy with that, since your system's page cache is then very effective (and makes your Linux system feel fast).
You might consider calling the sync(2) system call, but you often should not: it can be slow, and it still doesn't guarantee that anything has been written, since it often just asks the kernel to flush buffers later.
On a given opened file descriptor, you could consider fsync(2). As Joe answered, you might pass O_SYNC to open, but that would slow down the system.
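To see the gap between write() returning and the data being flushed, the two calls can be timed separately. A minimal sketch in Python (`time_write_and_fsync` is a hypothetical helper; note that adding fsync does change the application's behaviour, so this only illustrates the difference and is not a way to measure the unmodified program):

```python
import os
import tempfile
import time


def time_write_and_fsync(payload: bytes) -> tuple[float, float]:
    """Return (seconds spent in write, seconds spent in fsync) for one payload."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        written = 0
        while written < len(payload):  # os.write may write fewer bytes than asked
            written += os.write(fd, payload[written:])
        after_write = time.perf_counter()
        os.fsync(fd)  # ask the kernel to push this file's dirty pages to the device
        after_fsync = time.perf_counter()
    finally:
        os.close(fd)
        os.unlink(path)
    return after_write - start, after_fsync - after_write


if __name__ == "__main__":
    write_s, fsync_s = time_write_and_fsync(b"x" * (8 * 1024 * 1024))
    print(f"write: {write_s:.6f}s  fsync: {fsync_s:.6f}s")
```

The numbers depend heavily on the file system behind the temporary directory; on a RAM-backed one the fsync may be nearly free.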
I strongly suggest (for performance reasons) trusting your kernel's page cache management and not forcing any disk flush manually. See also the related posix_fadvise(2) & madvise(2) system calls.
If you benchmark some program, run it several times (and decide what matters most to you: an average of the measured times, perhaps excluding the best and/or worst of them, or the worst or the best of them). The point is that the I/O time (or the CPU time, or the elapsed real time) of an application is something very ambiguous. You probably want to explain your benchmarking process when publishing benchmark results.
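The "run it several times" advice might look like this in practice. A minimal sketch, assuming wall-clock time is what you care about; `benchmark` is a hypothetical helper and the choice of seven runs is arbitrary:

```python
import statistics
import time
from typing import Callable


def benchmark(func: Callable[[], object], runs: int = 7) -> dict[str, float]:
    """Time func several times and summarise the elapsed wall-clock seconds."""
    if runs < 3:
        raise ValueError("need at least 3 runs to drop the best and the worst")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "best": samples[0],
        "worst": samples[-1],
        "mean": statistics.mean(samples),
        # average with the single best and single worst run excluded
        "trimmed_mean": statistics.mean(samples[1:-1]),
    }
```

Reporting all four figures, along with the number of runs, goes some way towards explaining the benchmarking process.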
You can refer to this link. It might help you.
Flush Data to disk
As far as writing to disk is concerned it is unpredictable. There is no definitive way of telling it. But you can make sure that data is written to disk by calling sync.

Linux/C: how to trace the accesses on a number of variables

I'm trying to profile some existing C code that uses large structs with many members, with the goal of refactoring it into a smaller cache-friendly core struct containing the most frequently-accessed members and a pointer to the colder data.
I want to come up with a way of monitoring the app for a few hours in a few use-cases and produce a report of how often each member in an instance of the struct was accessed.
The x86 debug registers would be ideal, but unfortunately I can only watch 4 addresses simultaneously and I need many more.
I was thinking I could temporarily make each member occupy a whole page of its own, mark all the pages as not-accessible, then set up a segfault handler to record each access before somehow (and this is the tricky bit) recovering and allowing the app to continue. None of the memory being monitored is passed to a syscall, so there wouldn't be any issue with syscalls failing due to unreadable args. Is there a way to use the handler to temporarily make the page accessible, perform the faulting instruction, reprotect the page, then return?
Failing this, is there a more sensible way of recording accesses to many addresses? Something in valgrind maybe? Thanks
I was thinking I could temporarily make each member occupy a whole page of its own,
This only works for heap-allocated objects, and is what Electric Fence uses. In the past I've found the Electric Fence overhead so great that it's not usable for anything but toy programs.
Failing this, is there a more sensible way of recording accesses to many addresses? Something in valgrind maybe?
This is possible by writing a custom Valgrind tool, but that is a complicated proposition.
A better approach may be to use the Pin tool instead.

Is work_queue thread safe?

Looking at workqueue.c, it appears as though the only parts that are locked properly are those between the publicly exposed APIs and the internal thread that runs. There seem to be some things outside the critical section that (to my untrained eye) could be unsafe?
Am I correct or way off base?
I presume you are talking about workqueue.c in the Linux kernel?
In that case, only use the public API, assume it is thread-safe, and report any issues you see to Ingo Molnar. Note that most kernel developers are seriously smart and will not repeat the "big lock" mistake ever again: not everything is run under a mutex because not everything needs to be. Tricks like RCU (read-copy-update) also reduce the amount of locking needed.
And everything without a lock will perform a lot better.
