Stackoverflow has saved my life on countless occasions over the years. Now, it's time for me to post my first question ever, the answer to which I have been unable to find so far.
I have a tool (language/implementation is irrelevant) which accepts a text file as input. This text file (let's call it file_list.txt) contains a long list of file paths, one per line. The tool then iterates over the lines in file_list.txt and does something with every file path. This needs to be done continuously and file_list.txt needs to always contain the latest file paths because users continuously upload or delete files from the share being monitored. To achieve this, I have set up a cron job which calls a script. First the script calls the find utility with the search parameters required and pipes the output to a temporary file. When the file is fully populated, it is moved to file_list.txt. Then, once this is done, the tool is invoked with file_list.txt as an input parameter.
So far, so good. The share being monitored is VERY LARGE (~60 TB) and the find command takes around 5 hours to execute. This is not a problem since we have multiple overlapping find commands running in parallel (triggered once per hour). The entire setup runs on a compute farm, so CPU utilization, etc. is also not an issue.
The problem arises in the lag time for file detection. Ideally, I want a user to add a file and I want one of the already running, overlapping find commands to detect this file within a matter of minutes. However, I have noticed that none of the already-running find commands will detect this file. Only a find command started AFTER this file was added will detect it. This means that generally, I need to wait around 5 hours for a newly added file to be detected. This leads me to believe that the find utility somehow acts on a "cached" version of the share state when it was triggered. Is this true? Can anyone confirm this? And if so, what can I do to improve the detection lag?
Please let me know if further clarificaion is required. I am happy to provide any further details.
To summarize: you have a gigantic filesystem volume (60 TB) which contains a huge number of files, and you use find(1) to name a large number of those files and put those names into a text file for analysis. You have discovered that files are not listed if they are created after find(1) was started but before it finished.
I think the best solution is to stop thinking of this as a batch job, and do it "online" using inotify(7). You can use the inotify API to be immediately informed of changes to your filesystem, including new files being created. There is of course the original C API, as well as the excellent pyinotify.
With inotify, you can start a watcher program once and leave it running continuously (under a supervisor if needed for restarts). The operating system can then notify you whenever a relevant filesystem event occurs, and you can respond immediately rather than waiting for the next scan.
The one downside for your use case might be that the watcher program does need to run on a machine which has the filesystem mounted locally. But the overall compute resources required are probably much less than your current approach of repeated linear scans.
executing find commands and piping the output to temporary files might work up to a certain scale, but is far from optimal. If you want a less resource intensive, more reactive solution, I would recommend considering to reimplement your software using the inotify interface:
The inotify API provides a mechanism for monitoring filesystem events.
Inotify can be used to monitor individual files, or to monitor
directories. When a directory is monitored, inotify will return
events for the directory itself, and for files inside the directory.
So an event will be raised for each file change; or file being added.
Note that you can then keep an internal list of files up to date which only needs to be changed when you get a event.
Related
I just want to store some values while running shell script ,
scenario : if im running shell script it will do some operation and it will store the results/activity done.
then again I'm running the same script I should identify these are executed and you can continue from here . some what I need . how to do that? can we use .lock file or else any other best ways are there?
I just want to store some values while running shell script , how to do that? can we use .lock file or else any other best ways are there?
.lock files are by convention used to identify running services and I would therefor vote against it.
It just sounds like you want to keep track of your progress.
If you do not mind the data being erased post reboot I'd suggest you simple use /tmp for that (this remains in memory), do mind that if we are talking very large amounts this will drain your available mem.
Without knowing your use case it's hard to tell you what is the best solution.
But I would suggest writing an empty file that just indicates that your script is in progress(very similar to lock behaviour) and a second file that just keeps track of what items you processed.
Then just loop over the items and skip until you hit a 'new' item.
If we are talking very large amounts you should consider using a local database or database server.
I am writing some gross middleware - basically, I have some old code that needs to open 100,000 files for reading only, expecting them all to be in one folder. It never writes. It is multiprocess so it can try to open ~30 files at the same time. The old way, I would have to actually copy the files into that folder (or use links, NFS, etc.). Worth noting I have no ability to change this old code - its just a binary.
I have some new, fancy code that can retrieve a file almost instantly. I want to tie these things together, so when the old code tries to open the file, it is actually, in real time, running the new code.
So I thought of mkfifo and inotifywait. Instead of a folder of 100,000 files, I can make a folder of 100,000 named pipes. So far so good. The legacy code goes to open the files, not knowing that they are indeed named pipes. The problem is, I don't know what order the legacy code is going to open the files (nice, right?). So I would like to TRIGGER the named pipe WRITE (from my fancy new code) when the legacy code goes in for the read. I can't spawn 100,000 writes and have them all block. So I thought hey - inotifywait makes sense. Every time the legacy goes to open the pipe, it triggers a read event, which can then be used to spawn the pipe writer in the background. The problem is.. inotifywait doesn't trigger the read event until AFTER the writer has been spawned!
Any ideas of how to solve this? Basically - I want to intercept a file open, block for a couple hundred ms while I retrieve the contents of the file, then return that contents. Ideally I don't have to create a custom FUSE filesystem to do this.. its just a read-only file open. The problem is this needs to run fast and in parallel.. and I don't know which files are going to be opened in what order. Gotta be a quick and dirty way!
Thanks in advance for everyone's time.
I work on Linux. How to know that a gzip file is ready? I have a server that polls files in directory /dir/. There is an another, independent process that gzip files to /dir/. How can my server know that file is ready?
There is no ready-made solution for this. Looking at the last modification timestamp of the file (mtime) is not reliable because writes could delayed if the system is overloaded (or the input to the gzip operation is not ready), or the generating process may stop writing because it has crashed.
Usually, when applications need to do something like this, they write the temporary file under a different name, following a specific pattern. The reading process recognizes the temporary files and skips them, assuming that they are still a work in process and incomplete. Once the writer is finished, it renames the file to its final name (which is an atomic operation), and only then, the reader picks it up. This approach became popular with Dan Bernstein's maildir format:
Using maildir format
In maildir, a different directory is used for staging, but the general principle is the same.
It is also possible to use lock files and POSIX advisory locking, but they lead to more complexity. However, in some cases, they can be employed in such a way that busy waiting/polling/periodic probing is not necessary.
I'm currently monitoring files in node.js using fs.watch. The problem I have is for example, let say I copy a 1gig file into a folder I'm watching. The moment the file starts copying I get a notification about the file. If I start reading it immediately I end up with bad data since the file has not finished copying. For example a zip file has it's table of contents at the end but I'd end up reading it before it's table of contents has been written.
Off the top of my head I could setup some task to call fs.stat on the file every N seconds and only try to read it when the stats stop changing. That would work but it seems not ideal as I'd like my app to be as responsive as possible and calling stat on a bunch of files every second seems heavy as well as calling stat every 5 or 10 seconds seems unresponsive.
Is there some more robust way to get notified when a file has finished being modified?
So I did a project last year which required doing "file watching". There is a better library out there than fs.watch. Check out npm chokidar.
https://www.npmjs.com/package/chokidar
Underneath it uses fs.watch, but wraps better improvements around it.
There is a property called awaitWriteFinish. Really it's doing some polling on the file to determine whether or not the file is finished writing. I used it and it really works great.
Setting this property will allow you to work against that file, always ensuring that the file has been completely written. And you don't need to go off and implement your own method of determining if the file is complete. Should save a bunch of time.
Aside from that, I don't believe you can really get away from polling with regard to determining if a file is finished writing. Chokidar is still polling, it's just that you don't need to write the logic to do it. And you can configure the polling interval if CPU utilization is deemed to be too high.
Edit: Would also like to add, to just give it a shot and see how it works. I get you want it as responsive as possible... But having something working is better than having something not working at all. It might be that even with a polling solution it's not even an issue for you. If it's deemed a performance problem, then go address it at that time and seek a "better" solution.
Ηi,
Say I have a file called something.txt. I would like to find the most recent program to modify it, specifically the full path to said program (eg. /usr/bin/nano). I only need to worry about files modified while my program is running, so I can add an event listener at program startup and find out what program modified it when my program was running.
Thanks!
auditd in Linux could perform actions regarding file modifications
See the following URI xmodulo.com/how-to-monitor-file-access-on-linux.html
Something like this generally isn't going to be possible for arbitrary processes. If these aren't arbitrary processes, then you could use some sort of network bus (e.g. redis) to publish "write" messages. Otherwise your only other bet would be to implement your own filesystem using FUSE. Even with FUSE though, you may not always have access to the pid depending on who/what is writing to the file and the security setup of your OS.