How to know which instance deleted my file on a Linux server?

I have a workflow pipeline that periodically generates data files on a Linux server, and a cleanup service that removes data files older than a week.
However, I sometimes find that a newly generated data file, definitely less than a week old, has gone missing. I'm not sure whether this is a logic bug in the cleanup service or whether another program deleted it, and I currently have no idea how to investigate. Is there a way to log all file deletions together with the ID and name of the process responsible?
Thanks in advance.

Related

Hybris catalog cronjob sync is not working

The first time we run the sync cronjob (product/content), it runs properly and creates a media dump in the admin tab.
From then on, each run is reported as successful, but no sync actually happens.
When I go back and clear the media dump from the admin tab, the job works again and creates a new media dump.
So every time, I am forced to clear the media dump manually to make the sync job work.
Please advise.

CatalogVersionSyncJob is designed to run only once per instance. So if we create a sync job instance via ImpEx/HMC, it works the first time, but on the second execution it does not pick up any new or modified items and nothing is synced. This means the system needs a new instance for each sync execution!
If we execute the catalog sync from the Catalog Management Tool (HMC/backoffice), it internally creates a new instance of the selected sync job each time, which is why it works there.
To solve this, write a custom job that does the same thing HMC/backoffice does internally: create a new instance, assign the sync job, and execute it.
For more information, refer to configure-catalog-sync-cronjob-Hybris

I've encountered this issue, and the workaround was to create another CronJob that removes those media dumps before the sync runs.
At a high level, we have a CompositeCronJob that does two things in sequence (there are actually more, but two are enough for the sake of this issue):
1. Remove the media dump from the Sync CronJob
2. Run the Sync CronJob

Archiving files with Python

I wrote this script that runs several executable updates in a shared network folder. Several separate machines must run these updates.
I would like to archive these updates once they have been run. However, as you may see, there is a dilemma: if the first machine runs an update and archives the executable, the rest of the connected machines won't run it, as it will no longer appear in the cwd. Any ideas?

It took me a while to understand what you meant by "archiving", but you probably mean moving the files to another folder on a network shared mount. Also, the title should definitely be changed; I accidentally marked it as OK in the Review Triage system.
You probably want to assign an ID to each machine, then have each of them create a new file once it finishes the installation (e.g. an empty finished1.txt for the PC with ID 1, finished2.txt for PC 2, etc.). Then one "master" PC should periodically scan for these files and, once it finds all the ones it expects, delete/move/archive the installers. It may be a good idea to add a timeout to the script on the master PC, so that you get notified in some way when one of the PCs gets stuck.
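A minimal Python sketch of the "master" side of that idea, under some assumptions: the marker files follow the finished<ID>.txt naming from the example, the installers are the *.exe files at the top level of the share, and the function names, timeout, and polling interval are all made up for illustration.

```python
import shutil
import tempfile
import time
from pathlib import Path


def pending_machines(share, machine_ids):
    """Return the IDs whose marker file (finished<ID>.txt) is still missing."""
    return [m for m in machine_ids if not (share / f"finished{m}.txt").exists()]


def archive_when_done(share, archive, machine_ids, timeout_s=3600.0, poll_s=30.0):
    """Poll until every machine has reported, then move the installers.

    Returns True if the installers were archived, False if the timeout
    expired first (this is where a real script would send a notification).
    """
    deadline = time.monotonic() + timeout_s
    while pending_machines(share, machine_ids):
        if time.monotonic() >= deadline:
            print("timed out waiting for:", pending_machines(share, machine_ids))
            return False
        time.sleep(poll_s)
    archive.mkdir(parents=True, exist_ok=True)
    # Materialise the listing first so files are not moved mid-iteration.
    for exe in sorted(share.glob("*.exe")):
        shutil.move(str(exe), str(archive / exe.name))
    return True


if __name__ == "__main__":
    # Demo in a throwaway directory standing in for the network share.
    share = Path(tempfile.mkdtemp())
    (share / "update.exe").write_bytes(b"")
    for machine in (1, 2):
        (share / f"finished{machine}.txt").touch()  # each PC does this when done
    print(archive_when_done(share, share / "archive", [1, 2], timeout_s=5, poll_s=0.1))
```

Each worker PC only needs to create its own marker file after a successful run, so the workers never have to coordinate with each other.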

Automated processing of large text file(s)

The scenario is as follows: A large text file is put somewhere. At a certain time of day (or manually, or after x number of files), a virtual machine with BizTalk installed should start automatically to process these files. Then the files should be put in some output location and the VM should be shut down. I don't know how long it takes to process these files.
What is the best way to build such a solution? Preferably, the solution should be reusable for similar scenarios in the future.
I was thinking of Logic Apps for the workflow, blob storage or FTP for input/output of the files, an API App for starting/shutting down the VM. Can Azure Functions be used in some way?
EDIT:
I also asked the question elsewhere, see link.
https://social.msdn.microsoft.com/Forums/en-US/19a69fe7-8e61-4b94-a3e7-b21c4c925195/automated-processing-of-large-text-files?forum=azurelogicapps

Just create an Azure Runbook with a Schedule and make it check for specific files in a Storage Account. If they exist, it starts the VM and waits until the files are gone (meaning BizTalk has processed them, deleted them, and put the results where they belong). Once the files are gone, the Runbook stops the VM.

How to restore the snapshot of a file system with service online?

I'm now implementing a file system backup and restore program under Linux. The requirement is that all operations must be performed online.
My problem is that the program currently has no knowledge of the state of the files to be restored.
So it is possible that a file is being edited by another application when the restore occurs, in which case modifications to the file may be overwritten by the backup.
One solution I can come up with is testing whether the file is opened by other applications before restoring it, and postponing the restore until the file is closed. However, to test the open state of a file, I think I would have to traverse the /proc file system, i.e. check every running process and get its list of open files, which is time-consuming.
Is there a better or classic solution to this problem? Any hints will be highly appreciated.
Thank you and best regards.
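For reference, the /proc traversal described in the question could look roughly like the Python sketch below. It is only an illustration of that approach, not a recommendation: the function name is made up, it only sees processes whose /proc/<pid>/fd directory the caller may read, and a file can still be opened right after the check returns.

```python
import os
import tempfile


def pids_with_file_open(path):
    """Return PIDs that currently hold `path` open, by walking /proc/<pid>/fd."""
    target = os.path.realpath(path)
    holders = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(f"{fd_dir}/{fd}") == target:
                    holders.append(int(entry))
                    break
            except OSError:
                continue  # descriptor was closed while we were looking
    return holders


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as fh:
        # Our own process holds the temporary file open at this point.
        print(os.getpid() in pids_with_file_open(fh.name))
```

The cost grows with the number of processes and descriptors on the machine, which is the overhead the question is worried about.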

You can try listening for filesystem changes and checking whether or not you are responsible for them. The inotify framework is there to help with this kind of task. inotify is a userland API; see Wikipedia: http://en.wikipedia.org/wiki/Inotify

Process text files ftp'ed into a set of directories in a hosted server

The situation is as follows:
A series of remote workstations collect field data and upload it to a server via FTP. The data is sent as a CSV file, which is stored in a unique directory for each workstation on the FTP server.
Each workstation sends a new update every 10 minutes, causing the previous data to be overwritten. We would like to somehow concatenate or store this data automatically. The workstation's processing is limited and cannot be extended, as it's an embedded system.
One suggestion offered was to run a cronjob on the FTP server; however, since it's shared hosting, the terms of service only allow cronjobs at 30-minute intervals. Given the number of workstations uploading and the 10-minute interval between uploads, it looks like the cronjob's 30-minute limit between calls might be a problem.
Is there any other approach that might be suggested? The available server-side scripting languages are perl, php and python.
Upgrading to a dedicated server might be necessary, but I'd still like to get input on how to solve this problem in the most elegant manner.

Most modern Linux systems support inotify, which lets your process know when the contents of a directory have changed, so you don't even need to poll.
Edit: With regard to the comment below from Mark Baker:
"Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files."
That will happen with the inotify watch you set on the directory level - the way to make sure you then don't pick up the partial file is to set a further inotify watch on the new file and look for the IN_CLOSE event so that you know the file has been written to completely.
Once your process has seen this, you can delete the inotify watch on this new file, and process it at your leisure.
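As a rough illustration, here is a Python sketch that calls inotify through ctypes, so it needs no third-party package. It makes one simplification compared with the description above: instead of adding a second watch per new file, it asks for IN_CLOSE_WRITE (and IN_MOVED_TO) on the directory itself, which inotify also reports for files inside a watched directory. The function names are made up, and the code is Linux-only.

```python
import ctypes
import os
import struct
import tempfile

IN_CLOSE_WRITE = 0x00000008  # a file opened for writing was closed
IN_MOVED_TO = 0x00000080     # a file was renamed into the watched directory
EVENT_HEADER = struct.Struct("iIII")  # wd, mask, cookie, length of name field

# Load the C library already linked into the interpreter (Linux only).
libc = ctypes.CDLL(None, use_errno=True)


def open_watch(directory):
    """Return an inotify descriptor watching `directory` for completed files."""
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    wd = libc.inotify_add_watch(fd, os.fsencode(directory),
                                IN_CLOSE_WRITE | IN_MOVED_TO)
    if wd < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    return fd


def read_completed(fd):
    """Block until at least one event arrives; return the affected file names."""
    buf = os.read(fd, 4096)
    names, offset = [], 0
    while offset < len(buf):
        _wd, _mask, _cookie, name_len = EVENT_HEADER.unpack_from(buf, offset)
        offset += EVENT_HEADER.size
        raw_name = buf[offset:offset + name_len]  # NUL-padded file name
        names.append(os.fsdecode(raw_name.rstrip(b"\0")))
        offset += name_len
    return names


if __name__ == "__main__":
    inbox = tempfile.mkdtemp()  # stand-in for a workstation's upload directory
    fd = open_watch(inbox)
    with open(os.path.join(inbox, "data.csv"), "w") as fh:
        fh.write("a,b\n")
    print(read_completed(fd))  # names of files that were written and then closed
    os.close(fd)
```

Whether the FTP server closes the file exactly once per upload depends on the server, so it is worth checking that behaviour before relying on the event.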

You might consider a persistent daemon that keeps polling the target directories:
    # Pseudocode: grab_lockfile(), new_files() and process_new_files() are placeholders.
    grab_lockfile() or exit();
    while (1) {
        if (new_files()) {
            process_new_files();
        }
        sleep(60);
    }
Then your cron job can just try to start the daemon every 30 minutes. If the daemon can't grab the lockfile, it just dies, so there's no worry about multiple daemons running.
Another approach to consider would be to submit the files via HTTP POST and then process them via a CGI. This way, you guarantee that they've been dealt with properly at the time of submission.
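Since Python is one of the languages available on the server, the lockfile daemon above could be sketched as follows. This is only one possible shape: it assumes flock() works on the host's filesystem, it remembers processed files in memory (so a restart reprocesses everything), and run_daemon, max_polls and the other names are invented for the example.

```python
import fcntl
import os
import time


def grab_lockfile(path):
    """Return an open, exclusively locked file, or None if another holder exists."""
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def new_files(directory, seen):
    """Return names in `directory` not seen before, and remember them."""
    fresh = sorted(set(os.listdir(directory)) - seen)
    seen.update(fresh)
    return fresh


def run_daemon(directory, lock_path, process, interval_s=60.0, max_polls=None):
    """Poll `directory` until `max_polls` passes are done (forever if None).

    Returns False immediately if another instance already holds the lock.
    """
    lock = grab_lockfile(lock_path)
    if lock is None:
        return False
    seen, polls = set(), 0
    try:
        while max_polls is None or polls < max_polls:
            for name in new_files(directory, seen):
                process(os.path.join(directory, name))
            polls += 1
            time.sleep(interval_s)
    finally:
        lock.close()  # closing the file releases the flock
    return True
```

The cron entry would then just start the script every 30 minutes; if an instance is already running, run_daemon() returns False and the new process exits. The lock file should live outside the watched directory so it is not picked up as a data file.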

The 30 minute limitation is pretty silly really. Starting processes in Linux is not an expensive operation, so if all you're doing is checking for new files there's no good reason not to do it more often than that. We have cron jobs that run every minute and they don't have any noticeable effect on performance. However, I realise it's not your rule and if you're going to stick with that hosting provider you don't have a choice.
You'll need a long-running daemon of some kind. The easy way is to just poll regularly, and that's probably what I'd do. A better option is inotify, which notifies you as soon as a file is created.
You can use inotify from perl with Linux::Inotify, or from python with pyinotify.
Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files.
With polling it's less likely you'll see partial files, but it will happen eventually and will be a nasty hard-to-reproduce bug when it does happen, so better to deal with the problem now.
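If you do poll, one common heuristic for skipping partial files is to only process a file once its size and modification time have stopped changing. The sketch below shows that idea in Python; it is a heuristic rather than a guarantee (a stalled upload looks the same as a finished one), and the function names and the 5-second default are arbitrary choices for the example.

```python
import os
import time


def snapshot(path):
    """Return the (size, mtime) pair used to detect ongoing writes."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def is_settled(path, settle_s=5.0):
    """Heuristic: True if size and mtime are unchanged after `settle_s` seconds."""
    before = snapshot(path)
    time.sleep(settle_s)
    return snapshot(path) == before
```

A polling loop would call is_settled() on each candidate file and leave any file that fails the check for the next pass.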

If you're looking to stay with your existing FTP server setup, then I'd advise using something like inotify or a daemonized process to watch the upload directories. If you're OK with moving to a different FTP server, you might take a look at pyftpdlib, which is a Python FTP server library.
I've been a part of the dev team for pyftpdlib for a while, and one of the more common requests was for a way to "process" files once they've finished uploading. Because of that we created an on_file_received() callback method that's triggered on completion of an upload (see issue #79 on our issue tracker for details).
If you're comfortable in Python then it might work out well for you to run pyftpdlib as your FTP server and run your processing code from the callback method. Note that pyftpdlib is asynchronous and not multi-threaded, so your callback method can't be blocking. If you need to run long-running tasks I would recommend a separate Python process or thread be used for the actual processing work.
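The hand-off to a separate thread can be sketched with the standard library alone. The code below deliberately does not import pyftpdlib: the inner on_file_received() function is a stand-in for the handler method of the same name, and start_worker and make_callback are invented helper names. The point is only that the callback enqueues the path and returns immediately, while a worker thread does the slow part.

```python
import queue
import threading


def start_worker(work_queue, process):
    """Start a background thread that runs `process(path)` for each queued path.

    Putting None on the queue tells the worker to stop.
    """
    def run():
        while True:
            path = work_queue.get()
            try:
                if path is None:
                    return
                process(path)
            finally:
                work_queue.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def make_callback(work_queue):
    """Build a non-blocking callback that only enqueues the uploaded file's path."""
    def on_file_received(path):
        work_queue.put(path)  # returns at once; the worker does the real work
    return on_file_received
```

In a real server, the handler's on_file_received() method would call the function returned by make_callback() with the path it receives.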
