How to get the last process that modified a particular file? - node.js

Hi,
Say I have a file called something.txt. I would like to find the most recent program to modify it, specifically the full path to that program (e.g. /usr/bin/nano). I only need to worry about files modified while my program is running, so I can add an event listener at program startup and find out what modified the file while my program is running.
Thanks!

auditd on Linux can log which processes access or modify a given file.
See the following URL: xmodulo.com/how-to-monitor-file-access-on-linux.html

Something like this generally isn't going to be possible for arbitrary processes. If these aren't arbitrary processes, then you could use some sort of network bus (e.g. redis) to publish "write" messages. Otherwise, your only other option would be to implement your own filesystem using FUSE. Even with FUSE, though, you may not always have access to the pid, depending on who/what is writing to the file and the security setup of your OS.

Related

Is there a way to make a bash script process messages that have been sent to it using the write command

Is there a way to make a bash script process messages that have been sent to it using the "write" command? So for example, if a user wants to activate a feature in my script, could I make it so that they can send the script a command using the write command?
One possible method I thought of was to configure logging for a screen session and then have the bash script parse the log file, but I'm not sure whether there is a simpler or more efficient way to tackle this.
EDIT: As an alternative solution I was thinking I could use a named pipe. I'm worried that it would break, though, if the /tmp partition fills up completely (I'm not sure whether this would affect write as well). I'm going to be running this script on a shared box, and every once in a while someone fills up the /tmp partition completely and just leaves it like that until people start complaining.
Hmm, you are trying to bend a rather limited unix command into doing something it was not designed for. From the man page (emphasis mine):
The write utility allows you to communicate with other users, by copying
lines from your terminal to theirs
That means write is intended to copy lines directly to terminals. As soon as you say "I will dump the terminal output with screen and then parse the dump file", you lose the simplicity of write (and you also need disk space, with the added problem of removing old lines from a sequential file).
Worse, since your script lives on its own, it could (should?) be a daemon attached to no terminal.
So if I have correctly understood your question, your requirements are:
a script that does some tasks and should be able to respond to asynchronous requests. Common mechanisms are named pipes or network/unix domain sockets; less common are files dropped in a dedicated folder, with an optional signal for immediate processing. Appending lines to a shared sequential file, while possible, is uncommon because of the access-synchronization problem.
a simple and user-friendly way for users to pass requests. OK, write is nice for that part, but much too hard to interface with, IMHO.
If you do not want to spend time on that part and would rather use standard tools, I would recommend the mail system. It is trivial to alias a mail address to a program that will be called with the mail message as input. But I am not sure it is worth it, because the user could just as well call the program directly with the request as input or as a command-line parameter.
So the client part could be simply a program that:
creates a temporary file in a dedicated folder (mkstemp is your friend in C or C++, or mktemp in shell - but beware of race conditions)
writes the request to that file
optionally sends a signal to a PID, provided the script writes its own PID to a dedicated file on startup
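To make that concrete, here is a minimal sketch of such a client in C. The spool directory /tmp/myscript/, the PID file /tmp/myscript.pid and the choice of SIGUSR1 are all assumptions made for the example, not part of the answer above:
/* Hypothetical client: drop a request file into a spool directory and poke
   the daemon. The paths and the signal below are assumptions for this sketch. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s request-text\n", argv[0]); return 1; }

    /* 1. create a temporary file in a dedicated folder (folder must already exist) */
    char path[] = "/tmp/myscript/request.XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) { perror("mkstemp"); return 1; }

    /* 2. write the request to that file */
    if (write(fd, argv[1], strlen(argv[1])) < 0) { perror("write"); return 1; }
    close(fd);

    /* 3. optionally signal the daemon, which wrote its PID at startup */
    FILE *pf = fopen("/tmp/myscript.pid", "r");
    if (pf) {
        int pid;
        if (fscanf(pf, "%d", &pid) == 1)
            kill((pid_t)pid, SIGUSR1);
        fclose(pf);
    }
    return 0;
}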

"find" command cannot detect files added during execution

Stackoverflow has saved my life on countless occasions over the years. Now, it's time for me to post my first question ever, the answer to which I have been unable to find so far.
I have a tool (language/implementation is irrelevant) which accepts a text file as input. This text file (let's call it file_list.txt) contains a long list of file paths, one per line. The tool then iterates over the lines in file_list.txt and does something with every file path. This needs to be done continuously and file_list.txt needs to always contain the latest file paths because users continuously upload or delete files from the share being monitored. To achieve this, I have set up a cron job which calls a script. First the script calls the find utility with the search parameters required and pipes the output to a temporary file. When the file is fully populated, it is moved to file_list.txt. Then, once this is done, the tool is invoked with file_list.txt as an input parameter.
So far, so good. The share being monitored is VERY LARGE (~60 TB) and the find command takes around 5 hours to execute. This is not a problem since we have multiple overlapping find commands running in parallel (triggered once per hour). The entire setup runs on a compute farm, so CPU utilization, etc. is also not an issue.
The problem arises in the lag time for file detection. Ideally, I want a user to add a file and I want one of the already running, overlapping find commands to detect this file within a matter of minutes. However, I have noticed that none of the already-running find commands will detect this file. Only a find command started AFTER this file was added will detect it. This means that generally, I need to wait around 5 hours for a newly added file to be detected. This leads me to believe that the find utility somehow acts on a "cached" version of the share state when it was triggered. Is this true? Can anyone confirm this? And if so, what can I do to improve the detection lag?
Please let me know if further clarification is required. I am happy to provide any further details.
To summarize: you have a gigantic filesystem volume (60 TB) which contains a huge number of files, and you use find(1) to name a large number of those files and put those names into a text file for analysis. You have discovered that files are not listed if they are created after find(1) was started but before it finished.
I think the best solution is to stop thinking of this as a batch job, and do it "online" using inotify(7). You can use the inotify API to be immediately informed of changes to your filesystem, including new files being created. There is of course the original C API, as well as the excellent pyinotify.
With inotify, you can start a watcher program once and leave it running continuously (under a supervisor if needed for restarts). The operating system can then notify you whenever a relevant filesystem event occurs, and you can respond immediately rather than waiting for the next scan.
The one downside for your use case might be that the watcher program does need to run on a machine which has the filesystem mounted locally. But the overall compute resources required are probably much less than your current approach of repeated linear scans.
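For example, a bare-bones watcher using the C API might look like the following. It watches a single directory (the path /data/share is just a placeholder); for a large tree you would add watches recursively as directories are discovered:
/* Minimal inotify sketch: report files created or written in one directory. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int fd = inotify_init1(0);
    if (fd < 0) { perror("inotify_init1"); return 1; }

    /* placeholder path; IN_CLOSE_WRITE fires when a writer finishes a file */
    if (inotify_add_watch(fd, "/data/share",
                          IN_CREATE | IN_MOVED_TO | IN_CLOSE_WRITE) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) break;
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->len)                      /* name is set for events on children */
                printf("event 0x%x on %s\n", (unsigned)ev->mask, ev->name);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    return 0;
}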
Executing find commands and piping the output to temporary files might work up to a certain scale, but it is far from optimal. If you want a less resource-intensive, more reactive solution, I would recommend reimplementing your software using the inotify interface:
The inotify API provides a mechanism for monitoring filesystem events.
Inotify can be used to monitor individual files, or to monitor
directories. When a directory is monitored, inotify will return
events for the directory itself, and for files inside the directory.
So an event will be raised for each file change or each file being added.
Note that you can then keep an internal list of files up to date, which only needs to be changed when you get an event.

How to create temporary files on linux that will automatically clean up after themselves no matter what?

I want to create a temporary file on Linux while making sure that the file will disappear after my program has terminated, even if it got killed or someone performs a hard reboot at the wrong moment. Does tmpfile() handle all this for me?
You seem preoccupied with the idea that files might get left behind somehow because of some race condition, but I don't see an explanation of why this is a concern.
"A race condition occurs when a program doesn't work as it's supposed to because of an unexpected ordering of events that produces contention over the same resource."
I was assuming from your comments on other answers that your concern was specifically about a deadlock, which is a result of trying to remediate a race condition (contention over the shared resource). It is still not clear what your concern is; calling tmpfile() and having the program exit abnormally before that function gets to call unlink() is the least of your worries if your application is really that fragile.
Given that there isn't any mention of concurrency, threading, or other processes sharing the file descriptor to this temp file, I still don't see the possibility of a race condition - maybe the concept of an incomplete logical transaction, but that can be detected and cleaned up.
The correct way to make absolutely sure that any allocated filesystem resources are cleaned up is not solely on exit of an application but also on start-up. All my server code makes sure that everything is cleaned up from a previous run before it starts and makes itself available.
Put your temp files in a sub-dir in /tmp and make sure your application cleans this sub-dir on startup and on normal shutdown. You can wrap your app start-up with a shell script that detects abnormal (kill -9) shutdown based on PID existence and also does the clean-up activities.
If you don't want to use tmpfile(), you can unlink() your file immediately after creating it. It will stay open, and its space stays allocated, until it is closed.
On a hard reboot an fsck might be needed to recover the space, but since that is always the case, it is no special drawback of this approach.
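A quick sketch of that trick (the name template is arbitrary):
/* Create a file, then unlink it immediately: the name disappears, but the
   open descriptor keeps the storage allocated until it is closed or the
   process dies - even via kill -9. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/scratch.XXXXXX";    /* arbitrary template */
    int fd = mkstemp(path);
    if (fd < 0) { perror("mkstemp"); return 1; }

    if (unlink(path) < 0)                   /* remove the name right away */
        perror("unlink");

    const char msg[] = "temporary data\n";
    write(fd, msg, sizeof(msg) - 1);        /* keep using fd as usual */
    close(fd);                              /* space is reclaimed here at the latest */
    return 0;
}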
According to the tmpfile() man page:
The file will be automatically deleted when it is closed or the
program terminates.
I have not tested it, but it seems it should do what you want.
Moreover:
The default location, if TMPDIR is not set, is /tmp.
So after a reboot, /tmp will be empty.
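Usage is about as simple as it gets; a trivial sketch:
/* tmpfile() returns a FILE* whose backing file goes away on fclose()
   or on process termination. */
#include <stdio.h>

int main(void)
{
    FILE *fp = tmpfile();
    if (!fp) { perror("tmpfile"); return 1; }
    fputs("scratch data\n", fp);
    rewind(fp);
    /* ... read it back, etc. ... */
    fclose(fp);   /* the file is gone; it also disappears if the process is killed */
    return 0;
}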
EDIT: Yes
I checked the tmpfile() source, and it does indeed use glglgl's trick: it unlinks the file immediately after creating it.
Original:
I would say no. Getting killed should be handled, but I would assume that after a hard reboot (e.g. due to a power outage) the file can still be there. That depends on your Linux distribution and its settings, though.
If the temp file is created on a ramdisk, it is gone (some distributions use a RAM-based tmpfs for temporary files).
Or if your environment has a policy for /tmp, the file could also be gone (maybe not instantly, but there are often policies like "remove all files in /tmp that have not been accessed within one month"). But it could also sit on a standard file system where no such rules are enforced; in that case the file would stay.
The customary approach is to set up a signal handler to clean up if the program is interrupted. This will not handle kill -9 or a physical reboot, which can't be trapped. Create temporary files in /tmp, which is normally cleaned out when the system boots. All that remains then is to teach people not to use kill -9 when they don't need to, but that appears to be an uphill battle.
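A sketch of that customary approach; the file name template is just an example:
/* Clean up the temp file on normal exit and on SIGINT/SIGTERM.
   kill -9 and a hard reboot still cannot be caught, as noted above. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static char tmppath[] = "/tmp/mytool.XXXXXX";   /* example template */

static void cleanup(void) { unlink(tmppath); }

static void on_signal(int sig)
{
    unlink(tmppath);          /* unlink() is async-signal-safe */
    _exit(128 + sig);
}

int main(void)
{
    int fd = mkstemp(tmppath);
    if (fd < 0) { perror("mkstemp"); return 1; }

    atexit(cleanup);              /* normal termination */
    signal(SIGINT, on_signal);    /* Ctrl-C */
    signal(SIGTERM, on_signal);   /* plain kill */

    /* ... work with fd ... */
    close(fd);
    return 0;
}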
On Linux, the mktemp command works.

How to design a filewatcher /directory watcher in VC++?

I am new to VC++ and programming. I have a task in which I am supposed to design a file watcher in VC++.
The problem goes this way:
I have to monitor some log files continuously; whenever a particular log file gets deleted (the deletion is done by some other program), I have to open a text file and write some data and a timestamp into it.
How do I go about it? Please help!!
First, you need to set up a system to monitor for file events in that folder.
To get started, take a look at FindFirstChangeNotification().
You'll basically get a waitable handle from that.
Then, were it me, I'd have a thread that waits on that handle. Each time the notification fires, the thread resumes, queries for the change details (which file), performs the needed actions, and goes back to sleeping on that handle.
You'll need an additional semaphore or event to interrupt this worker thread and wake it so that you can tell it to quit. Simple to do: have your thread's main loop call WaitForMultipleObjects on the "wake up" handle and the FindFirstChangeNotification handle. When you wake up, check which one signaled you, then either process the file change or quit.
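A bare-bones sketch of that worker loop in plain Win32 C; the directory name is a placeholder, and I use an event rather than a semaphore for the wake-up handle:
/* Wait on either a quit event or a directory change notification.
   FindFirstChangeNotification only says "something changed"; finding out
   which file still requires a directory scan or ReadDirectoryChangesW. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hQuit = CreateEvent(NULL, TRUE, FALSE, NULL);  /* another thread calls SetEvent(hQuit) to stop us */
    HANDLE hChange = FindFirstChangeNotification(
        TEXT("C:\\logs"), FALSE,              /* placeholder directory, no subtree */
        FILE_NOTIFY_CHANGE_FILE_NAME);        /* fires on create/delete/rename */
    if (hChange == INVALID_HANDLE_VALUE) return 1;

    HANDLE handles[2] = { hQuit, hChange };
    for (;;) {
        DWORD r = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
        if (r == WAIT_OBJECT_0) break;                    /* told to quit */
        if (r == WAIT_OBJECT_0 + 1) {
            printf("something changed in the watched directory\n");
            /* check whether the log file is still there, write the timestamp, ... */
            if (!FindNextChangeNotification(hChange)) break;
        }
    }
    FindCloseChangeNotification(hChange);
    CloseHandle(hQuit);
    return 0;
}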
MFC has a slightly different way of handling it, but to do this using the Win32 API what you'd typically do is use the Directory Management Functions to set up a change notification handle for the directory the file lives in. Then you can wait on the handle, and when something happens inside that directory your wait completes, and you can check whether it was a change to the file that you care about.
Look at the docs for FindFirstChangeNotification and ReadDirectoryChangesW for more information.
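If you need the name of the file that was deleted, ReadDirectoryChangesW reports it; a rough synchronous sketch (the directory is again a placeholder):
/* Blocking ReadDirectoryChangesW loop that prints deleted file names. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hDir = CreateFileW(L"C:\\logs", FILE_LIST_DIRECTORY,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (hDir == INVALID_HANDLE_VALUE) return 1;

    DWORD buf[2048];   /* DWORD-aligned buffer for FILE_NOTIFY_INFORMATION records */
    DWORD bytes;
    while (ReadDirectoryChangesW(hDir, buf, sizeof(buf), FALSE,
                                 FILE_NOTIFY_CHANGE_FILE_NAME, &bytes, NULL, NULL)) {
        FILE_NOTIFY_INFORMATION *fni = (FILE_NOTIFY_INFORMATION *)buf;
        for (;;) {
            if (fni->Action == FILE_ACTION_REMOVED)
                wprintf(L"deleted: %.*ls\n",
                        (int)(fni->FileNameLength / sizeof(WCHAR)), fni->FileName);
            if (fni->NextEntryOffset == 0) break;
            fni = (FILE_NOTIFY_INFORMATION *)((BYTE *)fni + fni->NextEntryOffset);
        }
    }
    CloseHandle(hDir);
    return 0;
}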
Try Windows Management Instrumentation (WMI) if you have enough privileges. AFAIK it is also the most efficient way to handle filesystem events.
Handle or query the __InstanceDeletionEvent, __InstanceModificationEvent, or __InstanceCreationEvent classes for deletion, modification, or creation events respectively, and filter for the files and target path that you want.
Take a look at the WMI Reference/C++ invocation.
For a full-scale example take a look at codeproject querying example.
I strongly recommend you consider using the implementation here. The underlying API is not 100% reliable, but this code does a good job of wrapping it. If your filesystem traffic is local and not too frequent, it should work well for you.

Automatically adjusting process priorities under Linux

I'm trying to write a program that automatically sets process priorities based on a configuration file (basically path - priority pairs).
I thought the best solution would be a kernel module that replaces the execve() system call. Too bad, the system call table isn't exported in kernel versions > 2.6.0, so it's not possible to replace system calls without really ugly hacks.
I do not want to do the following:
-Replace binaries with shell scripts that start and renice the actual binaries.
-Patch/recompile my stock Ubuntu kernel
-Do ugly hacks like reading kernel executable memory and guessing the syscall table location
-Polling of running processes
I really want to be:
-Able to control the priority of any process based on its executable path and a configuration file. Rules apply to any user.
Does anyone of you have any ideas on how to complete this task?
If you've settled for a polling solution, most of the features you want to implement already exist in the Automatic Nice Daemon. You can configure nice levels for processes based on process name, user and group. It's even possible to adjust process priorities dynamically based on how much CPU time it has used so far.
Sometimes polling is a necessity, and even more optimal in the end -- believe it or not. It depends on a lot of variables.
If the polling overhead is low enough, its simplicity far outweighs the added complexity, cost, and RISK of developing your own kernel hooks to get notified of the changes you need. That said, when hooks or notification events are available, or can be easily injected, they should certainly be used if the situation calls for them.
This is classic programmer 'perfection' thinking. As engineers, we strive for perfection. This is the real world, though, and sometimes compromises must be made. Ironically, the more perfect solution may be the less efficient one in some cases.
I develop a similar 'process and process priority optimization automation' tool for Windows called Process Lasso (not an advertisement, it's free). I had a similar choice to make and ended up with a hybrid solution. Kernel-mode hooks are available for certain process-related events in Windows (creation and destruction), but they not only aren't exposed at user mode, they also aren't helpful for monitoring other process metrics. I don't think any OS is going to natively inform you of every change to every process metric. The overhead of that many different hooks might be much greater than simple polling.
Lastly, considering the HIGH frequency of process changes, it may be better to handle all changes at once (polling at an interval) than via notification events/hooks, which may have to be processed many more times per second.
You are RIGHT to stay away from scripts. Why? Because they are slow(er). Of course, the Linux scheduler does a fairly good job of handling CPU-bound threads by downgrading their priority and rewarding (upgrading) the priority of I/O-bound threads, so even under high load a script should stay responsive, I guess.
There's another point of attack you might consider: replace the system's dynamic linker with a modified one which applies your logic. (See this paper for some nice examples of what's possible from the largely neglected art of linker hacking).
Where this approach will have problems is with purely statically linked binaries. I doubt there's much on a modern system which actually doesn't link something dynamically (things like busybox-static being the obvious exception, although you might regard the ability to get a minimal shell outside of your control as a feature when it all goes horribly wrong), so this may not be a big deal. On the other hand, if the priority policies are intended to bring some order to an overloaded shared multi-user system, then you might see smart users preparing statically linked versions of apps to avoid linker-imposed priorities.
Sure, just iterate through /proc/nnn/exe to get the pathname of the running image. Only use the entries with slashes; the others are kernel processes.
Check to see if you have already processed that one, otherwise look up the new priority in your configuration file and use renice(8) to tweak its priority.
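A crude sketch of that scan in C, using setpriority(2) rather than shelling out to renice(8); the target path and nice value below stand in for entries from your configuration file:
/* Walk /proc, resolve each process's executable, and renice matches. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    const char *target = "/usr/bin/foo";   /* placeholder: one path from the config */
    int niceval = 10;                      /* placeholder priority */

    DIR *proc = opendir("/proc");
    if (!proc) { perror("/proc"); return 1; }

    struct dirent *de;
    while ((de = readdir(proc)) != NULL) {
        char *end;
        long pid = strtol(de->d_name, &end, 10);
        if (*end != '\0' || pid <= 0) continue;     /* not a PID directory */

        char link[64], exe[4096];
        snprintf(link, sizeof(link), "/proc/%ld/exe", pid);
        ssize_t n = readlink(link, exe, sizeof(exe) - 1);
        if (n <= 0) continue;                       /* kernel thread, or no permission */
        exe[n] = '\0';

        if (strcmp(exe, target) == 0 &&
            setpriority(PRIO_PROCESS, (id_t)pid, niceval) < 0)
            perror("setpriority");
    }
    closedir(proc);
    return 0;
}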
If you want to do it as a kernel module then you could look into making your own binary loader. See the following kernel source files for examples:
$KERNEL_SOURCE/fs/binfmt_elf.c
$KERNEL_SOURCE/fs/binfmt_misc.c
$KERNEL_SOURCE/fs/binfmt_script.c
They can give you a first idea where to start.
You could just modify the ELF loader to check for an additional section in ELF files and when found use its content for changing scheduling priorities. You then would not even need to manage separate configuration files, but simply add a new section to every ELF executable you want to manage this way and you are done. See objcopy/objdump of the binutils tools for how to add new sections to ELF files.
Does anyone of you have any ideas on how to complete this task?
As an idea, consider using apparmor in complain-mode. That would log certain messages to syslog, which you could listen to.
If the processes in question are started by executing an executable file with a known path, you can use the inotify mechanism to watch for events on that file. Executing it will trigger IN_OPEN and IN_ACCESS events.
Unfortunately, this won't tell you which process caused the event to trigger, but you can then check which /proc/*/exe symlinks point to the executable file in question and renice those process IDs.
E.g. here is a crude implementation in Perl using Linux::Inotify2 (which, on Ubuntu, is provided by the liblinux-inotify2-perl package):
perl -MLinux::Inotify2 -e '
use warnings;
use strict;
my $x = shift(@ARGV);                  # file to watch (first argument)
my $w = new Linux::Inotify2;
$w->watch($x, IN_ACCESS, sub           # fires whenever the watched file is accessed
{
    for (glob("/proc/*/exe"))
    {
        # find processes whose executable is the watched file
        if (-r $_ && readlink($_) eq $x && m#^/proc/(\d+)/#)
        {
            system(@ARGV, $1)          # run the remaining arguments with the PID appended
        }
    }
});
1 while $w->poll
' /bin/ls renice
You can of course save the Perl code to a file, say onexecuting, prepend a first line #!/usr/bin/env perl, make the file executable, put it on your $PATH, and from then on use onexecuting /bin/ls renice.
Then you can use this utility as a basis for implementing various policies for renicing executables (or doing other things).
