Daemon for file watching / reporting in the whole UNIX OS

Daemon for file watching / reporting in the whole UNIX OS - linux

I have to write a Unix/Linux daemon, which should watch for particular set of files (e.g. *.log) in any of the file directories, across various locations and report it to me. Then I have to read all the newly modified files and then I have to process them and push grepped data into Elasticsearch.
Any suggestion on how this can be achieved?
I tried various Perl modules (e.g. File::ChangeNotify, File::Monitor) but for these I need to specify the directories, which I don't want: I need the list of files to be dynamically generated and I also need the content.
Is there any method that I can call OS system calls for file creation and then read the newly generated/modified file?

Not as easy as it sounds unfortunately. You have hooks to inotify (on some platforms) that let you trigger an event on a particular inode changing.
But for wider scope changing, you're really talking about audit and accounting tracking - this isn't a small topic though - not a lot of people do auditing, and there's a reason for that. It's complicated and very platform specific (even different versions of Linux do it differently). Your favourite search engine should be able to help you find answers relevant to your platform.
It may be simpler to run a scheduled task in cron - but not too frequently, because spinning a filesystem like that is dirty - along with File::Find or similar to just run a search occasionally.

Related

Mirroring files from one partition to another on the second disk without RAID1

I am looking for a program that would allow me to mirror one partition to another disk (something like RAID1) for Linux. It doesn't have to be a windowed application, it can be a console application, I just want what is in one place to be mirrored to another.
It would be nice if it were possible to mirror a specific folder that I would care for instead of copying everything from the given partition.
I was looking on the internet, but it's hard to find something that would give such opportunities, hence the idea to ask such a question.
I do not want to make fake RAID on Linux or hardware RAID because I read that if the motherboard fails then it is best to have the same second one to recover data.
I will be grateful for every suggestion :)

You can check my script "CopyDirFile" written in bash, which is located on github.
You can perform a replication (mirroring) task of any source folder to another destination folder (deleting a file in the source folder means deleting it in the destination folder).
The script also allows you to create copy tasks (deleted files in the source folder will not be deleted in the target folder).
The tasks are executed in background at a specified time, not all the time, frequency is set by the user when creating the task.
You can also set the task to start automatically when the user logs on.
All the necessary information can be found in the README file in repository.
If I understood you correctly, I think it meets your requirements.

Linux has standard support for software RAID: mdraid.
It allows you to bundle two disk devices into a RAID 1 device (among other things); you then create a filesystem on top of that device.
LVM offers another way to do software RAID; it doesn't seem to be very popular, but it's certainly supported.
(If your system supports hardware RAID, on the motherboard or with a separate RAID controller, Linux can use that, too, but that doesn't seem to be what you're asking here.)

Create a cache file which is automatically deleted when partition is nearly full in libc (all file systems) or ext4?

When writing software which needs to cache data on disk, is there a way in libc, or a way which is specific to a certain file system (such as ext4), to create a file and flag it as suitable to be deleted automatically (by the kernel) if the partition becomes almost full?
There’s something similar for memory pages: madvise(…, MADV_FREE).
Some systems achieve this by writing a daemon which monitors the partition fullness, and which manually deletes certain pre-determined paths once it exceeds a certain fill level. I’d like to avoid this if possible, as it’s not very scalable: each application would have to notify the daemon of new cache paths as they are created, which may be frequently. If this were in-kernel, a single flag could be held on each inode indicating whether it’s a cache file.
Having a standardised daemon for this would be acceptable as well. At the moment it seems like different major systems integrators all invent their own.

You can use crontab job, and look for specific file extension and delete it. You can even filter based on time and leave the files created in last n minutes.
If you are ok with this, let me know, I will add more details here.

Node .fs Working with a HUGE Directory

Picture a directory with a ton of files. As a rough gauge of magnitude I think the most that we've seen so far is a couple of million but it could technically go another order higher. Using node, I would like to read files from this directory, process them (upload them, basically), and then move them out of the directory. Pretty simple. New files are constantly being added while the application is running, and my job (like a man on a sinking ship holding a bucket) is to empty this directory as fast as it's being filled.
So what are my options? fs.readdir is not ideal, it loads all of the filenames into memory which becomes a problem at this kind of scale. Especially as new files are being added all the time and so it would require repeated calls. (As an aside for anybody referring to this in the future, there is something being proposed to address this whole issue which may or may not have been realised within your timeline.)
I've looked at the myriad of fs drop-ins (graceful-fs, chokadir, readdirp, etc), none of which have this particular use-case within their remit.
I've also come across a couple of people suggesting that this can be handled with child_process, and there's a wrapper called inotifywait which tasks itself with exactly what I am asking but I really don't understand how this addresses the underlying problem, especially at this scale.
I'm wondering if what I really need to do is find a way to just get the first file (or, realistically, batch of files) from the directory without having the overhead of reading the entire directory structure into memory. Some sort of stream that could be terminated after a certain number of files had been read? I know Go has a parameter for reading the first n files from a directory but I can't find a node equivalent, has anybody here come across one or have any interesting ideas? Left-field solutions more than welcome at this point!

You can use your operation system listing file command, and stream the result into NodeJS.
For example in Linux:
var cp=require('child_process')
var stdout=cp.exec('ls').stdout
stdout.on('data',function(a){
console.log(a)
});0
RunKit: https://runkit.com/aminanadav/57da243180f3bb140059a31d

How to check if a file is opened in Linux?

The thing is, I want to track if a user tries to open a file on a shared account. I'm looking for any record/technique that helps me know if the concerned file is opened, at run time.
I want to create a script which monitors if the file is open, and if it is, I want it to send an alert to a particular email address. The file I'm thinking of is a regular file.
I tried using lsof | grep filename for checking if a file is open in gedit, but the command doesn't return anything.
Actually, I'm trying this for a pet project, and thus the question.

The command lsof -t filename shows the IDs of all processes that have the particular file opened. lsof -t filename | wc -w gives you the number of processes currently accessing the file.

The fact that a file has been read into an editor like gedit does not mean that the file is still open. The editor most likely opens the file, reads its contents and then closes the file. After you have edited the file you have the choice to overwrite the existing file or save as another file.

You could (in addition of other answers) use the Linux-specific inotify(7) facilities.
I am understanding that you want to track one (or a few) particular given file, with a fixed file path (actually a given i-node). E.g. you would want to track when /var/run/foobar is accessed or modified, and do something when that happens
In particular, you might want to install and use incrond(8) and configure it thru incrontab(5)
If you want to run a script when some given file (on a native local, e.g. Ext4, BTRS, ... but not NFS file system) is accessed or modified, use inotify incrond is exactly done for that purpose.
PS. AFAIK, inotify don't work well for remote network files, e.g. NFS filesystems (in particular when another NFS client machine is modifying a file).
If the files you are fond of are somehow source files, you might be interested by revision control systems (like git) or builder systems (like GNU make); in a certain way these tools are related to file modification.
You could also have the particular file system sits in some FUSE filesystem, and write your own FUSE daemon.
If you can restrict and modify the programs accessing the file, you might want to use advisory locking, e.g. flock(2), lockf(3).
Perhaps the data sitting in the file should be in some database (e.g. sqlite or a real DBMS like PostGreSQL ou MongoDB). ACID properties are important ....
Notice that the filesystem and the mount options may matter a lot.
You might want to use the stat(1) command.
It is difficult to help more without understanding the real use case and the motivation. You should avoid some XY problem
Probably, the workflow is wrong (having a shared file between several users able to write it), and you should approach the overall issue in some other way. For a pet project I would at least recommend using some advisory lock, and access & modify the information only thru your own programs (perhaps setuid) using flock (this excludes ordinary editors like gedit or commands like cat ...). However, your implicit use case seems to be well suited for a DBMS approach (a database does not have to contain a lot of data, it might be tiny), or some index locked file like GDBM library is handling.
Remember that on POSIX systems and Linux, several processes can access (and even modify) the same file simultaneously (unless you use some locking or synchronization).
Reading the Advanced Linux Programming book (freely available) would give you a broader picture (but it does not mention inotify which appeared aften the book was written).

You can use ls -lrt, it displays the last RW operations in the shell. Then you can conclude whether the file is opened or not. Make sure that you are in the exact directory.

Symbolic link to latest file in a folder

I have a program which requires the path to various files. The files live in different folders and are constantly updated, at irregular intervals.
When the files are updated, they change name, so, for instance, in the folder dir1 I have fv01 and fv02. Later on the day someone adds fv02_v1; the day after someone adds fv03 and so on. In other words, I always have an updated file but with different name.
I want to create a symbolic link in my "run" folder to these files, such that said link always points to the latest file created.
I can do this in Python or Bash, but I was wondering what is out there, as this is hardly an uncommon problem.
How would you go about it?
Thank you.
Juan
PS. My operating system is Linux. I currently have a simple daemon (Python) that looks every once in a while (refreshes every minute) for the latest file. Seems kind of an overkill to me.

Unless there is some compelling reason that you have left unstated (e.g. thousands of files in the directory) just do it the way you suggest with a script sorting the files by modification time. There is no secret method that I am aware of.
You could write a daemon using inotify to monitor your directories and immediately set your links but that seems like overkill.
Edit: I just saw your edit. Since you have the daemon already, inotify might not be such a bad idea. It would be somewhat more efficient than constantly querying since the OS will tell you when something in your directories has changed.
I don't know python well enough to point you to anything specific but there must exist a wrapper for inotify.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string