Distributing a payload to multiple cron jobs - Linux

I have a shell script, say data.sh. To execute this script I pass it a single argument, say Table_1.
I have a test file which I get as the result of a different script.
This test file contains more than 1000 arguments to pass to the script, one per line.
The file looks like this:
Table_1
Table_2
Table_3
Table_4
and so on.
Now I want to run the script for all of these arguments in parallel.
I am doing this with cron jobs.
First I split the test file into 20 parts using the split command in Linux.
split -l $(($(wc -l < test )/20 + 1)) test
This divides the test file into 20 parts named xaa, xab, xac and so on.
Then I run the cron jobs:
* * * * * while IFS=',' read a;do /home/XXXX/data.sh $a;done < /home/xxxx/xaa
* * * * * while IFS=',' read a;do /home/XXXX/data.sh $a;done < /home/xxxx/xab
and so on.
As this involves a lot of manual work, I would like to do it dynamically.
Here is what I want to achieve:
1) As soon as I get the test file, I would like it to be split automatically into, say, 20 files and stored in a particular place.
2) Then I would like to schedule the cron jobs for every day at 5 AM, passing the 20 files as arguments to the script.
What is the best way to implement this? Any answer with an explanation will be appreciated.

Here is what you could do. Create two cron jobs:
file_splitter.sh -> splits the file and stores the pieces in a particular directory
file_processor.sh -> picks up one file at a time from the directory above, does a read loop, and calls data.sh. Removes the file after successful processing.
Schedule file_splitter.sh to run ahead of file_processor.sh.
If you want to achieve further parallelism, you can make file_splitter.sh write the split files into multiple directories with a few files in each. Let's say they are called sub1, sub2, and so on. Then you can schedule multiple instances of file_processor.sh and pass the subdirectory name as an argument. Since the split files are stored in separate directories, you can ensure that only one job processes the files in a particular subdirectory.
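A minimal sketch of the two scripts, assuming placeholder paths such as /home/xxxx/test and /home/xxxx/pending (adjust to your layout):

#!/bin/bash
# file_splitter.sh -- split the incoming test file into 20 pieces
INFILE=/home/xxxx/test            # file produced by the other script (assumed path)
OUTDIR=/home/xxxx/pending         # directory file_processor.sh watches (assumed path)
mkdir -p "$OUTDIR"
# one extra line per piece so 20 pieces always cover the whole file
split -l $(( $(wc -l < "$INFILE") / 20 + 1 )) "$INFILE" "$OUTDIR/part_"

#!/bin/bash
# file_processor.sh -- read each split file in the given directory,
# call data.sh for every line, and remove the file once it is done
DIR=${1:-/home/xxxx/pending}      # optionally pass a subdirectory as the argument
for f in "$DIR"/part_*; do
    [ -e "$f" ] || exit 0         # no split files left
    while IFS= read -r table; do
        /home/XXXX/data.sh "$table"
    done < "$f"
    rm -f "$f"                    # remove the file after successful processing
done

The crontab entries could then look like this, with the splitter running a few minutes ahead of the processor:
0 5 * * * /home/xxxx/file_splitter.sh
10 5 * * * /home/xxxx/file_processor.sh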
It's better to keep the cron command as simple as possible.
* * * * * /path/to/file_processor.sh
is better than
* * * * * while IFS=',' read a;do /home/XXXX/data.sh $a;done < /home/xxxx/xab
Makes sense?
I wrote a post about how to manage cron jobs effectively. You may want to take a look at it:
Managing log files created by cron jobs

Related

How to run one particular function in a shell script on a daily basis

I am creating a script which has two functions (funcA, funcB). funcA collects the entire server login log and stores it in file1; funcB captures the login logs on a daily basis. I would like to call funcB daily so that it appends its logs to file1, adding the incremental values. How can I achieve this?
Use crontab -e and add:
00 00 * * * /path/to/script.sh
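That entry runs the whole script. For cron to run only funcB, the script needs a way to select a function by name. A minimal sketch, with placeholder function bodies and a dispatcher that is an addition, not part of the original script:

#!/bin/bash
# script.sh -- run the function named on the command line
funcA() {
    : # ... build file1 with the full server login log ...
}
funcB() {
    : # ... append today's login entries to file1 ...
}
case "$1" in
    funcA|funcB) "$1" ;;
    *) echo "usage: $0 funcA|funcB" >&2; exit 1 ;;
esac

The crontab entry then names the function to run:
00 00 * * * /path/to/script.sh funcB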

How to run a Python script between particular times every single day (on Linux)?

I am looking for a way to run a Python script at particular times of day and then have it automatically terminated at another time of day. Ideally, I would like this not to be handled within the script itself.
For example, I would want the script to start at 08:00 and end at 10:00, then start again at 11:30 and terminate at 15:00, and I would need this to happen automatically every day.
I have browsed through many suggestions online, and many of them suggested using cron; however, as far as I can see, cron does not natively offer the functionality of automatically terminating an application.
Others have suggested using cron to start the application at a particular time and then using another cron entry to create a "terminate" file that the program checks for at every loop iteration; if the file is present, the Python script terminates via sys.exit() or something similar. However, this seems quite janky and more of a workaround than a real solution.
You may use Jobber. You will be able to start scripts whenever you want and for the time you want.
Warning: Jobber is not free. You can try it for free though.
Here is the link to Jobber's website.
You could write a script that creates a lockfile with cron (https://unix.stackexchange.com/questions/12815/what-are-pid-and-lock-files-for), use the lockfile to know what the process ID is, and then terminate the process with that ID using cron as well.
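A rough sketch of that idea, with placeholder names (start_job.sh, stop_job.sh, /tmp/myscript.pid, the interpreter and script paths) not taken from the question: a start wrapper records the PID, and a stop wrapper kills it.

#!/bin/bash
# start_job.sh -- launch the Python script and record its PID
/usr/bin/python3 /path/to/script.py &
echo $! > /tmp/myscript.pid

#!/bin/bash
# stop_job.sh -- terminate the process recorded in the PID file
if [ -f /tmp/myscript.pid ]; then
    kill "$(cat /tmp/myscript.pid)"
    rm -f /tmp/myscript.pid
fi

The crontab then matches the times from the question:
0 8 * * * /path/to/start_job.sh
0 10 * * * /path/to/stop_job.sh
30 11 * * * /path/to/start_job.sh
0 15 * * * /path/to/stop_job.sh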
After you have determined that the process name is uniquely identifiable, you could do something like this (which indeed also uses cron).
0 8 * * * /path/to/unique_name.py& ( echo "pkill unique_name.py" | at 10:00 )
30 11 * * * /path/to/unique_name.py& ( echo "pkill unique_name.py" | at 15:00 )
Edit 1:
And "name safe" versions (using kill).
0 8 * * * /path/to/unique_name.py& echo kill $! | at 10:00
30 11 * * * /path/to/unique_name.py& echo kill $! | at 15:00

Find out ID of 'at' job from within it

When I schedule a job with 'at' it is assigned an id, viz:
job 44 at 2014-01-28 17:30
When that job runs I would like to get at that ID from within it. This is on CentOS, FWIW. I have established that no environment variable contains the ID. When the Perl code in that job runs I would like it to be able to print the job ID (44 in this example).
Yes, I know that atq shows an = next to jobs that are executing, but there might be more than one of those at a time.
I could do something like pass a unique argument to the job when scheduling it, capture the ID, save both to a file somewhere, and read that from the job. That's a lot of work I'd rather not go to if I don't have to, and it seems like this should be simple, but I'm drawing a blank.
What follows was figured out by reading the sources of at-3.14. The way at puts the job id and the run time into the file name should be similar for any version, but I haven't checked this.
To begin with, at encodes the job id and the time when a particular job should be run into the name of the file describing the job. The file name has the format aJJJJJTTTTTTTT, where JJJJJ is a 5-character hexadecimal string, the job id, and TTTTTTTT is an 8-character hexadecimal string, the time when the job should be run, stored as seconds since the epoch.
At jobs are run by feeding a job description file as the standard input to sh -c. Fortunately the Linux kernel provides a symbolic link, /proc/self/fd/0, which will point to the standard input of the process currently being executed (play with ls -l /proc/self/fd/0 in case you need to assure yourself that this indeed is so).
The file describing a job has already been deleted by the time the job is run. However, the file is still available to the kernel because it was duplicated with dup(2) before being used as the job's standard input. So we are actually resolving a symbolic link to a file name which is no longer visible. In the Perl script at the end we need to take this into account, as readlink will return something like /foo/bar/baz (deleted) instead of /foo/bar/baz, and we're interested in just the file name, which has all the information we need.
The reason the symbolic link points to a deleted file is that the at daemon unlinks the original before executing the job. Unlinking is done only after creating a copy, a hard link, whose name begins with = instead of a. With this the at daemon tries to ensure there will be only one copy of a job running: the daemon will not execle(2), i.e. it will bail out, should the link(2) fail. Because the original file has been opened with open(2) and duplicated with dup(2), and because the = hard link still points to it, the inode is still there for the kernel to use.
After a fairly long and possibly confusing introduction, here is how to put it all together:
#!/usr/bin/perl
use strict;
use warnings;

# The standard input of an at job is the (now deleted) job file;
# /proc/self/fd/0 points at it.
my $job_file = readlink("/proc/self/fd/0");

# Strip the " (deleted)" suffix readlink adds for an unlinked file.
if (index($job_file, " ") > 0) {
    $job_file = substr($job_file, 0, index($job_file, " "));
}

# Keep only the file name: aJJJJJTTTTTTTT.
my $tmp = substr($job_file, rindex($job_file, "/") + 1);

# The five hex digits after the leading "a" are the job id.
$tmp =~ s/^a([0-9a-f]{5})[0-9a-f]+/$1/;
my $job_id = hex($tmp);

if ($job_id > 0) {
    printf("My AT job id is %d.\n", $job_id);
}

Can anyone tell what this cronjob does?

I am learning about cron jobs and I found this piece of code in one project which fetches records from Twitter.
The code goes like this:
#0 * * * * cp /vold/www/Abcd/log/twitter_feed_item_aggregator.log vold/www/Abcd/log/twitter_feed_item_aggregator.log.backup; > /vold/www/Abcd/log/twitter_feed_item_aggregator.log
Can anyone explain what this piece of code does?
Hm... Copies a twitter aggregator log each hour, and then clears it.
This part, 0 * * * *, means 'at minute 0 of every hour'. Minute 0 is when a new hour starts.
This part cp /vold/www/Abcd/log/twitter_feed_item_aggregator.log vold/www/Abcd/log/twitter_feed_item_aggregator.log.backup obviously copies the log to a backup.
This part, > /vold/www/Abcd/log/twitter_feed_item_aggregator.log, redirects the output of no command to the file, thus truncating (clearing) it.
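In other words, a redirection with no command truncates the target file; you can see the same effect at an sh/bash prompt (demo.log is just a scratch file for illustration):

$ echo "some log lines" > demo.log
$ wc -c < demo.log
15
$ > demo.log
$ wc -c < demo.log
0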
The hash at the start of the line comments out the line so it does nothing. Without that it would do as #playcat says.

Have cron wait for job to finish before re-launching

I have a cron job that executes every second minute; it usually runs in seconds, but sometimes for several minutes.
I need cron to not execute the command if it's already running when the next minute comes.
The line looks like this
*/1 * * * * cmd
I have tried this:
* * * * * ID=job1 FREQ=1m AFTER=job1 cmd
but with no success.
Is it possible to solve this with cron, or do I have to implement locking?
You can make a temp file called inProgress (or whatever) and store it in a standard place, and use it to tell the next job whether it should run or not.
What if the flow of the job goes like this (sketched below):
1) Check for the standard inProgress file.
2) If it exists, quit.
3) Else, create the inProgress file.
4) Do the work.
5) Delete the inProgress file.
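A minimal wrapper for that flow, with a placeholder lock path and cmd standing in for the real command:

#!/bin/bash
# run_once.sh -- skip this run if the previous one is still in progress
LOCK=/tmp/job1.inProgress         # placeholder path
if [ -e "$LOCK" ]; then
    exit 0                        # inProgress file exists: quit
fi
touch "$LOCK"                     # mark the job as in progress
trap 'rm -f "$LOCK"' EXIT         # delete the flag even if cmd fails
cmd                               # do the work

and in the crontab:
* * * * * /path/to/run_once.sh

Note that this check-then-create has a small race window; if flock(1) is available, something like flock -n /var/lock/job1.lock cmd gives the same guarantee more robustly.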
