I am trying to run a simple program under SGE. However, I get very different results from the same code depending on whether I run it from an interactive session (qrsh) or submit it via qsub. Most of the time the code simply fails to run under qsub, without any warning or error.
Is there any way to set up an interactive session within a batch submission (running qrsh within qsub)?
qsub test.sh
-V
-cwd
source MYPATH
qrsh
MYCODEHERE
I am not sure that what you ask is possible. I can think of two reasons why you might be observing different results.
1) Environment differences between cluster nodes.
2) Incomplete outputs: maybe the code runs into an edge case (not enough memory, etc.) and exits silently.
Not exactly what you asked for but just trying to help.
You could submit a parallel job and then use qrsh -inherit <hostname> <command> to run a command under qrsh. Unfortunately, Grid Engine limits the number of times you can call qrsh -inherit to either the number of slots in the job or one less (depending on the job_is_first_task setting of the PE).
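A minimal sketch of the qrsh -inherit approach, assuming the job was submitted to a parallel environment so that Grid Engine sets $PE_HOSTFILE with one line per host and the hostname in the first column. The function name run_on_each_host and the QRSH override variable are hypothetical, introduced only so the loop can be dry-run outside a cluster.

```shell
#!/bin/sh
# Sketch only: run one command on every host granted to a parallel SGE job.
# QRSH is a hypothetical override so the loop can be tested without a cluster.
QRSH=${QRSH:-qrsh}

run_on_each_host() {
    hostfile=$1
    shift
    # The first column of each hostfile line is the hostname; ignore the rest.
    while read -r host _rest; do
        [ -n "$host" ] || continue
        # </dev/null keeps the remote command from consuming the hostfile on stdin.
        "$QRSH" -inherit "$host" "$@" </dev/null
    done <"$hostfile"
}

# Assumed usage inside the job script:
#   run_on_each_host "$PE_HOSTFILE" ./MYCODEHERE
```

Each loop iteration is one qrsh -inherit call, so it counts against the per-job limit described above.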
However, it is likely that the problems are caused by differences between the environment you get with qrsh and the one qsub provides by default. If you are selecting the shell that interprets your job script in the traditional Unix way (putting #!/bin/bash or similar as the first line of the job script), you could try adding -l to that line to make it a login shell (#!/bin/bash -l), which is likely to be more similar to what you get with qrsh.
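To check whether the environment really differs, one option is to snapshot it in both settings and diff the results. This is a sketch under assumptions: dump_env and the file names are hypothetical, and the job script is assumed to start with #!/bin/bash (with or without -l).

```shell
# Sketch only: snapshot the environment so a qsub run and a qrsh run can be compared.
# dump_env and the file names below are hypothetical.
dump_env() {
    # Sort so that two snapshots line up under diff.
    env | sort >"$1"
}

# In the job script submitted with qsub (assumed usage):
#   dump_env "$HOME/env.qsub.txt"
# In the interactive qrsh session:
#   dump_env "$HOME/env.qrsh.txt"
# Then compare the two:
#   diff "$HOME/env.qsub.txt" "$HOME/env.qrsh.txt"
```

Variables such as PATH or LD_LIBRARY_PATH that appear in only one snapshot are the first candidates to look at.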
I'm writing a Bash script to measure the time, CPU and memory usage of commands given as input to the script.
I implemented the time measurement using the date command, but I have trouble measuring the CPU and memory usage of a single command. I know that I can use the top command, but it only shows me processes that are currently running.
My issue is that when I run the script with a command as input, I don't know in advance which PID will be assigned to that command. If I want to measure a near-instant command such as whoami, I cannot find it in the output of top, even if I pipe that output through a filter.
For commands that take longer I would like to calculate an average, but for commands like whoami, ls or similar near-instant commands, I have no idea how to get the CPU and memory usage for that specific instant.
Thank you in advance!
NOTICE: Feedback on how this question can be improved would be great, as I am still learning. I understand there is no code; I left it out because I am confident it does not need fixing. I have researched online a great deal and cannot find an answer. My script works as it should when I change the parameters to produce fewer outputs, so I know it works. I have debugged the script and got no errors. When the parameters are changed to produce more outputs and the script runs for hours, it stops. My goal with the question below is to determine whether Linux will time out a long-running process (or something related) and, if so, how that can be resolved.
I am running a shell script that has several for loops which do the following:
- Goes through existing files and copies data into a newly saved/named file
- Makes changes to the data in each file
- Submits these files (which number in the thousands) to another system
The script is very basic (beginner here), but so long as I don't give it too much to generate, it works as it should. However, if I want it to loop through all possible cases, which means generating tens of thousands of files, then after a certain amount of time the shell script just stops running.
I have more than enough hard drive storage to support all the files being created. One thing to note however is that during the part where files are being submitted, if the machine they are submitted to is full at that moment in time, the shell script I'm running will have to pause where it is and wait for the other machine to clear. This process works for a certain amount of time but eventually the shell script stops running and won't continue.
Is there a way to make it continue or prevent it from stopping? I pressed Ctrl+Z to suspend the script and then ran fg to resume it, but it still does nothing. I check the status with ls -la to see whether the file size is increasing, and it is not, although top/ps says the script is still running.
Assuming that you are using 'Bash' for your script - most likely, you are running out of 'system resources' for your shell session. Also most likely, the manner in which your script works is causing the issue. Without seeing your script it will be difficult to provide additional guidance, however, you can check several items at the 'system level' that may assist you, i.e.
- Review the system logs for errors about your process or about 'system resources'.
- Check your docs: man ulimit (or 'man bash' and search for 'ulimit').
- Consider removing 'deep nesting' (if present); instead, create work sets where step one builds the 'data' needed for the next step. That is, if possible, instead of:
step 1 (all files) ## guessing this is what you are doing
step 2 (all files)
step 3 (all files)
try each step for each file, something like:
for MY_FILE in ${FILE_LIST}
do
    step_1 "${MY_FILE}"
    step_2 "${MY_FILE}"
    step_3 "${MY_FILE}"
done
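A slightly fuller sketch of the per-file idea, which also records failures instead of letting the script stop silently. The step functions here are hypothetical placeholders (copy, modify, and a stand-in for "submit"); the real steps are not shown in the question.

```shell
#!/bin/sh
# Sketch only: run every step for one file before moving on to the next file,
# and record failures instead of stopping silently.
# step_1..step_3 are hypothetical placeholders for the real work.
step_1() { cp "$1" "$1.new"; }                # copy data into a new file
step_2() { echo "# processed" >>"$1.new"; }   # change the data
step_3() { [ -s "$1.new" ]; }                 # stand-in for "submit"

process_files() {
    failed=0
    for f in "$@"; do
        if step_1 "$f" && step_2 "$f" && step_3 "$f"; then
            echo "ok: $f"
        else
            echo "FAILED: $f" >&2
            failed=$((failed + 1))
        fi
    done
    # The exit status reports whether anything failed.
    [ "$failed" -eq 0 ]
}
```

Redirecting stderr to a log file when running this gives a list of the files that did not make it through, which is a starting point for finding out where a long run stalls.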
:)
Dale
I am new to SLURM. My problem is that I have a multi-stage job, which needs to be run on a cluster, whose jobs are managed by SLURM. Specifically I want to schedule a job which:
1) Grabs N nodes
2) Installs software on all of them
3) Once all nodes have finished the installation successfully, creates a database instance on the nodes
4) Loads the database
5) Once loading has finished successfully, runs a set of queries for benchmarking purposes
6) Drops the database and returns the nodes
Each step could be run using a separate bash script; while the execution of the scripts and transitions between stages are coordinated by a master node.
Using SLURM, I know how to allocate nodes and call a single command or script on each of them (which runs as a stand-alone job on each node). But as soon as the command is done (or the called script has finished) on a node, that node returns to the pool of free resources and is no longer allocated to my job. The use case above involves several stages/scripts and needs coordination between them.
I am wondering what the correct way is to design/run a set of scripts for such a use case, using SLURM. Any suggestion or example would be extremely helpful, and highly appreciated.
You simply need to encapsulate all your scripts into a single one for submission:
#!/bin/bash
#SBATCH --nodes=4 --exclusive
# Setting Bash to exit whenever a command exits with a non-zero status.
set -e
set -o pipefail
echo "Installing software on each of $SLURM_NODELIST"
srun ./install.sh
echo "Creating database instance"
./createDBInstance.sh "$SLURM_NODELIST"
echo "Loading DB"
./loadDB.sh params
echo Benchmarking
./benchmarks.sh params
echo Done.
You'll need to fill in the blanks... Make sure that your scripts follow the convention of exiting with a non-zero status on error.
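With set -e the batch script stops at the first failing stage, but the output does not say which stage that was. A small helper can make this explicit. This is only a sketch: run_stage is a hypothetical name, not part of SLURM.

```shell
#!/bin/bash
# Sketch only: name each stage and report which one failed.
# run_stage is a hypothetical helper; the commented calls mirror the stages above.
run_stage() {
    name=$1
    shift
    echo "[stage] $name: starting"
    if "$@"; then
        echo "[stage] $name: done"
    else
        status=$?
        echo "[stage] $name: failed with exit status $status" >&2
        return "$status"
    fi
}

# Assumed usage inside the batch script:
#   run_stage install   srun ./install.sh                        || exit 1
#   run_stage create-db ./createDBInstance.sh "$SLURM_NODELIST"  || exit 1
```

Because the whole script runs inside one allocation, the nodes stay reserved between stages and are released only when the script exits.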
I have a Cron Job scheduled to execute a command a few times a day. There are cases where the cron job isn't needed but will automatically run. If that happens the following error message shows:
PM2 [ERROR] Script already launched, add -f option to force re execution
Note: The Cron Job runs PM2 in reference to a script.
Is there any negative effect to having the cron job run even if the script is already running?
Please provide detailed information or references, not just your opinion.
Avoid spurious error messages by writing a wrapper script that is run from cron instead. Inside the wrapper script, only run your job if it is not already running, which you can check by querying the process table.
Assuming ksh, here's a snippet (I'm a tad rusty so syntax may need to be tweaked):
# pgrep exits with status 0 when at least one matching process exists
if pgrep -f MY_PROGRAM > /dev/null; then
    # log that it is already running
    :
else
    # run your program
    :
fi
Not sure what detailed information or references there could be for a situation like this. It's not like someone commissions a study to look at this.
Assuming your command is intelligent enough to only allow one execution at a time (which appears to be the case judging by the error message you posted) then the only ill effect is a few CPU clock cycles (I think).
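An alternative to querying the process table, and not what the snippet above does, is a lock directory: mkdir either creates the directory or fails, so only one invocation can hold the lock at a time. A minimal sketch, where run_once, the lock path and the exit code 75 are all hypothetical choices:

```shell
#!/bin/sh
# Sketch only: run a command unless another invocation already holds the lock.
# run_once, the lock path and the exit code 75 are hypothetical choices.
run_once() {
    lock_dir=$1
    shift
    # mkdir fails if the directory already exists, which makes it usable as a lock.
    if mkdir "$lock_dir" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lock_dir"
        return "$status"
    fi
    echo "already running, skipping: $*" >&2
    return 75
}

# Assumed usage from cron:
#   run_once /tmp/my_job.lock /path/to/start_my_job.sh
```

One caveat of this approach: if the wrapper is killed before rmdir runs, the stale lock directory has to be removed by hand.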
I've been troubleshooting this issue for about a week and I am nowhere, so I wanted to reach out for some help.
I have a Perl script that I execute from the command line, usually like this:
nohup ./script.pl --param arg --param2 arg2 &
I usually have about ten of these running at once to process the same type of data from different sources (that is specified through parameters). The script works fine and I can see logs for everything in nohup.out and monitor status via ps output. This script also uses a sql database to track status of various tasks, so I can track finishes of certain sources.
However, that was too much work, so I wrote a wrapper script to execute the script automatically and that is where I am running into problems. I want something exactly the same as I have, but automatic.
The getwork.pl script runs ps and parses the output to find out how many other processes are running. If the count is below the configured threshold, it queries the database for the most out-of-date source and kicks off the script.
The problem is that the kicked-off jobs aren't running properly: sometimes they terminate without any error messages, and sometimes they just hang and sit idle until I kill them.
The getwork script queries SQL and builds the entire execution command via SQL concatenation, so in the query I am doing something like CONCAT('nohup ./script.pl --arg ',param1,' --arg2 ',param2,' &') to get the command string.
I've tried everything to get these kicked off. I've tried using system(), but again, some jobs kick off, some don't, sometimes it gets stuck, and sometimes jobs start and then die within a minute. If I take the exact command I used to start the job and run it in bash, it works fine.
I've tried to also open a pipe to the command like
open my $ca, "| $command" or die ($!);
print $ca $command;
close $ca;
That works just about as well as everything else I've tried. The getwork script used to be executed through cron every 30 minutes, but I scrapped that because I needed another shell wrapper script, so now there is an infinite loop in the getwork script that executes a function every 30 minutes.
I've also tried many variations of the execution command, including redirecting output to different files, etc... nothing seems to be consistent. Any help would be much appreciated, because I am truly stuck here....
EDIT:
Also, I've tried to add separate logging within each script, so that each one starts a new log file named with its PID ($$). There was a bunch of weirdness there too: all the log files would get created, but then some of the processes would be running and writing to their file, others would just have an empty text file, and some would have only one or two log entries. Sometimes the process would still be running and just not doing anything; other times it would die with nothing in the log. Running the command directly in the shell myself always works, though.
Thanks in advance
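One thing worth ruling out, offered as an assumption rather than a diagnosis: a job started from a long-running parent inherits that parent's stdin, stdout and stderr, which behaves differently from an interactive shell. The sketch below starts a job with all three redirected explicitly. launch_detached and LAUNCHED_PID are hypothetical names, and the log path is illustrative.

```shell
#!/bin/sh
# Sketch only: start a job detached from the launcher's stdin/stdout/stderr.
# launch_detached and LAUNCHED_PID are hypothetical names.
launch_detached() {
    log=$1
    shift
    # stdin from /dev/null, stdout and stderr to a per-job log,
    # nohup so the job ignores SIGHUP.
    nohup "$@" >"$log" 2>&1 </dev/null &
    LAUNCHED_PID=$!
}

# Assumed usage:
#   launch_detached "logs/source42.log" ./script.pl --param arg --param2 arg2
#   echo "started PID $LAUNCHED_PID"
```

Passing the command as separate arguments, rather than as one string containing nohup and &, also avoids depending on how the string is re-parsed by a shell.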
You need some kind of job management framework.
One of the biggest is Gearman: http://www.slideshare.net/andy.sh/gearman-and-perl