Snakemake slurm output file redirect to new directory - slurm

I'm putting together a snakemake slurm workflow and am having trouble with my working directory becoming cluttered with slurm output files. I would like my workflow to, at a minimum, direct these files to a 'slurm' directory inside my working directory. I currently have my workflow set up as follows:
config.yaml:
reads:
  1:
  2:
samples:
  15FL1-2: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS/data/15FL1-2
  15Fl1-4: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS/data/15Fl1-4
cluster.yaml:
localrules: all
__default__:
  time: 0:5:0
  mem: 1G
  output: _{rule}_{wildcards.sample}_%A.slurm
fastqc_raw:
  job_name: sm_fastqc_raw
  time: 0:10:0
  mem: 1G
  output: slurm/_{rule}_{wildcards.sample}_{wildcards.read}_%A.slurm
Snakefile:
configfile: "config.yaml"
workdir: config["work"]

rule all:
    input:
        expand("analysis/fastqc_raw/{sample}_R{read}_fastqc.html", sample=config["samples"], read=config["reads"])

rule clean:
    shell:
        "rm -rf analysis logs"

rule fastqc_raw:
    input:
        'data/{sample}_R{read}.fastq.gz'
    output:
        'analysis/fastqc_raw/{sample}_R{read}_fastqc.html'
    log:
        err = 'logs/fastqc_raw/{sample}_R{read}.err',
        out = 'logs/fastqc_raw/{sample}_R{read}.out'
    shell:
        """
        fastqc {input} --noextract --outdir 'analysis/fastqc_raw' 2> {log.err} > {log.out}
        """
I then call with:
snakemake --jobs 4 --cluster-config cluster.yaml --cluster "sbatch --mem={cluster.mem} --time={cluster.time} --job-name={cluster.job_name} --output={cluster.output}"
This does not work, as the slurm directory does not already exist. I don't want to manually make this directory before running my snakemake command, as that will not scale. Things I've tried, after reading every related question, are:
1) simply trying to capture all the output via the log within the rule, and setting cluster.output='/dev/null'. This doesn't work: the info in the slurm output isn't captured, because it isn't output of the rule exactly, it's info on the job.
2) forcing the directory to be created by adding a dummy log:
log:
    err = 'logs/fastqc_raw/{sample}_R{read}.err',
    out = 'logs/fastqc_raw/{sample}_R{read}.out',
    jobOut = 'slurm/out.err'
I think this doesn't work because sbatch tries to find the slurm folder before the rule is ever executed.
3) allowing the files to be made in the working directory, and adding bash code to the end of the rule to move the files into a slurm directory. I believe this doesn't work because it tries to move the files before the job has finished writing to the slurm output.
Any further ideas or tricks?

You should be able to suppress these outputs by calling sbatch with --output=/dev/null --error=/dev/null. Something like this:
snakemake ... --cluster "sbatch --output=/dev/null --error=/dev/null ..."
If you want the files to go to a directory of your choosing you can of course change the call to reflect that:
snakemake ... --cluster "sbatch --output=/home/Ensa/slurmout/%j.out --error=/home/Ensa/slurmout/%j.out ..."
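For example, a hedged combination of the original invocation from the question with this answer's redirect (the /home/Ensa/slurmout path is just the illustration above and has to exist already):
snakemake --jobs 4 --cluster-config cluster.yaml \
  --cluster "sbatch --mem={cluster.mem} --time={cluster.time} --job-name={cluster.job_name} --output=/home/Ensa/slurmout/%j.out --error=/home/Ensa/slurmout/%j.err"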

So this is how I solved the issue (there's probably a better way, and if so, I hope someone will correct me). Personally I will go to great lengths to avoid hard-coding anything. I use a snakemake profile and an sbatch script.
First, I make a snakemake profile that contains a line like this:
cluster: "sbatch --output=slurm_out/slurm-%j.out --mem={resources.mem_mb} -c {resources.cpus} -J {rule}_{wildcards} --mail-type=FAIL --mail-user=me#me.edu"
You can see the --output parameter to redirect the slurm output files to a subdirectory called slurm_out in the current working directory. But AFAIK, slurm can't create that directory if it doesn't exist. So...
Next I make a small wrapper script whose only job is to make the subdirectory and then call sbatch to submit the workflow script. This "wrapper" looks like:
#!/bin/bash
mkdir -p ./slurm_out
sbatch snake_submit.sbatch
And finally, the snake_submit.sbatch looks like:
#!/bin/bash
ml snakemake
snakemake --profile <myprofile>
In this case both the wrapper and the sbatch script that it calls will have their slurm out files in the current working directory. I prefer it that way because it's easier for me to locate them. But I think you could easily re-direct by adding another #SBATCH --output parameter to the snake_submit.sbatch script (but not the wrapper, then it's turtles all the way down, you know?).
I hope that makes sense.
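For reference, a minimal sketch of what such a profile could look like on disk. The profile name myprofile, the jobs limit, and the resource keys are placeholders or assumptions, not part of the original answer:
# create the profile directory and the slurm output directory
mkdir -p ~/.config/snakemake/myprofile slurm_out
# drop the cluster line from above into the profile's config.yaml
cat > ~/.config/snakemake/myprofile/config.yaml <<'EOF'
cluster: "sbatch --output=slurm_out/slurm-%j.out --mem={resources.mem_mb} -c {resources.cpus} -J {rule}_{wildcards}"
jobs: 4
EOF
# run the workflow through the profile
snakemake --profile myprofile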

Related

Use more than one core in bash

I have a Linux tool that (greatly simplifying) cuts out the sequences specified in an IlluminaSeq file. I have 32 files to churn through. One file is processed in about 5 hours. I have a CentOS server with 128 cores.
I've found a few solutions, but each one works in a way that only uses one core. The last one seems to fire off 32 nohups, but it still pushes everything through one core.
My question is: does anyone have any idea how to use the server's potential? Basically every file can be processed independently; there are no relations between them.
This is the current version of the script, and I don't know why it only uses one core. I wrote it with the help of advice here on Stack and found on the Internet:
#!/bin/bash
FILES=/home/daw/raw/*
count=0
for f in $FILES
do
base=${f##*/}
echo "process $f file..."
nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" $f &
(( count ++ ))
if (( count = 31 )); then
wait
count=0
fi
done
To explain: FILES is a list of files from the raw folder.
The "core" line that executes nohup: the first path is the path to the tool, the -a path is the path to the file with the adapter patterns to cut, and -o saves the output under the same file name as the input with OUT prepended. The last parameter is the input file to be processed.
Here is the tool's readme:
https://github.com/vsbuffalo/scythe
Does anybody know how you can handle it?
P.S. I also tried moving the nohup before the count, but it still uses one core. There are no limits set on the server.
IMHO, the most likely solution is GNU Parallel, so you can run up to, say, 64 jobs in parallel with something like this:
parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*
This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes, which is better than waiting potentially 4.9 hours for all 32 of your jobs to finish before starting the last one, which takes a further 5 hours after that. Note that I arbitrarily chose 64 jobs here; if you don't specify otherwise, GNU Parallel will run 1 job per CPU core you have.
Useful additional parameters are:
parallel --bar ... gives a progress bar
parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything
If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:
parallel -S server1,server2,server3 ...
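Putting those pieces together with the paths from the question (a hedged example; {/} is GNU Parallel's basename replacement string, mirroring the OUT$base naming in the original loop):
# preview the commands without running anything
parallel --dry-run /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*
# then run for real, one job per core by default, with a progress bar
parallel --bar /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*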

Redirect to file mysteriously does nothing in Bash

Background
I have a script. Its purpose is to generate config files for various system services from templates whenever my gateway acquires a new IP from my ISP. This process includes making successive edits with sed to replace $[template] strings in my custom templates with the correct information.
And to do that I've created a small function designed to take input from stdin, redirect it to a temporary file passed as an argument, and then move that file to replace the destination (and also, often, source) config file. The "edit-in-place dance", if you will.
I created a simple test script with the problematic function:
#!/bin/bash
inplace_dance() {
    read -r -d '' data
    printf '%s' "${data}" > "${1}~"
    mv "${1}~" "${1}"
}
# ATTN: ls is only being used to generate input for testing. It is not being parsed.
ls -l ~/ | inplace_dance ~/test.out
Unfortunately, this works. So it's not the function itself. I also tried it with my custom logging utility (see "complications" below):
#!/bin/bash
. /usr/local/lib/logging.bash
log_identifier='test'
log_console='on'
inplace_dance() {
    read -r -d '' data
    printf '%s' "${data}" > "${1}~"
    mv "${1}~" "${1}"
}
# ATTN: ls is only being used to generate input for testing. It is not being parsed.
bashlog 'notice' $(ls -l ~/ | inplace_dance '/home/wolferz/test.out')
This also works.
The Problem
In its original context, however, it does not work. I've confirmed that ${data} gets set just fine, and that ${1} contains the correct filename. What fails is the second line of the function. I've confirmed printf is being run (see "Additional Info - Without The Redirect" below)... but the file its output is being redirected to is never created.
And I've been over the code a dozen-dozen times (not an exaggeration) and have yet to identify the cause. So, in desperation, I'm going to post it here and hope some kind soul will wade through my code and maybe spot the problem that I'm missing. I'd also happily take advice on improvements/replacements to my logging utility (in the hopes of tracking down the problem) or further troubleshooting steps.
Here is the original context. The important lines are 106-110, 136-140, 144-147, and 151-155
Additional Info
☛ PATH/Environment
The PATH is PATH=/usr/local/sbin:/usr/local/bin:/usr/bin. I believe this is being inherited from systemd (systemd=>dhcpcd.service=>dhcpcd=>dhcpcd-run-hooks=>dhcpcd.exit-hook).
dhcpcd-run-hooks (see "Complications" below) does clear the environment (keeping the above PATH) when it runs. Thus, I've added an example of the environment the script runs in to the "original context" gist. In this case, the environment when $reason == 'BOUND'. This is output by printenv | sort at the end of execution (and thus should show the final state of the environment).
NOTE: Be aware this is Arch Linux and the absence of /bin, /sbin, and /usr/sbin in the PATH is normal (they are just symlinks to /usr/bin anyway).
☛ Return Code
Inserting echo $? after the second line of the function gives me a return code of "0". This is true both with the redirect in line 2 and without (just the printf).
☛ Without The Redirect
Without the redirect, in the original context, the second line of the function prints the contents of ${data} to stdout (which is then captured by bashlog()) exactly as expected.
⚠️ Execute Instead of Source.
Turns out that $0 was /usr/lib/dhcpcd/dhcpcd-run-hooks rather than my script. Apparently dhcpcd-run-hooks doesn't execute the script... it sources it. I made some changes to line 196 to fix this.
♔ Aaaaaand that seems to have fixed all problems. ♔
I'm trying to confirm that was the silver bullet now... I didn't notice it was working till I had made several other changes as well. If I can confirm it I'll submit an answer.
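For anyone chasing something similar, here is a small hedged check a bash script can use to tell whether it is being sourced rather than executed, which is essentially what $0 gave away here:
#!/bin/bash
# when a file is sourced, BASH_SOURCE[0] names this file while $0 belongs to the caller
if [[ "${BASH_SOURCE[0]}" != "$0" ]]; then
    echo "sourced by: $0"
else
    echo "executed directly"
fi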
Complications
What complicates matters quite a bit is that its original context is an /etc/dhcpcd.exit-hook script. dhcpcd-run-hooks appears to eat all stderr and stdout, which makes troubleshooting... unpleasant. I've implemented my own logging utility to capture the output of commands in the script and pass it to journald, but it's not helping in this case. Either no error is being generated or, somehow, the error is not getting captured by my logging utility. The script is running as root and there is no mandatory access control installed, so it shouldn't be a permissions issue.

abyss-pe: variables to assemble multiple genomes with one command

How do I rewrite the following to properly replace the variable with my genomeID? (I have it working with this method in the Spades and Masurca assemblers, so it's something about Abyss that doesn't like this approach and I need a work-around.)
I am trying to run abyss on a cluster server but am running into trouble with how abyss-pe is reading my variable input:
my submit file loads a script for each genome listed in a .txt file
my script writes in the genome name throughout the script
the abyss assembly fumbles the variable replacement
Input.sub:
queue genomeID from genomelisttest.txt
Input.sh:
#!/bin/bash
genomeID=$1
cp /mnt/gluster/harrow2/trim_output/${genomeID}_trim.tar.gz ./
tar -xzf ${genomeID}_trim.tar.gz
rm ${genomeID}_trim.tar.gz
for k in `seq 86 10 126`; do
    mkdir k$k
    abyss-pe -C k$k name=${genomeID} k=$k lib='pe1 pe2' pe1='../${genomeID}_trim/${genomeID}_L1_1.fq.gz ../${genomeID}_trim/${genomeID}_L1_2.fq.gz' pe2='../${genomeID}_trim/${genomeID}_L2_1.fq.gz ../${genomeID}_trim/${genomeID}_L2_2.fq.gz'
done
Error that I get:
`../enome_trim/enome_L1_1.fq.gz': No such file or directory
This is where "enome" should be replaced with a five-digit genomeID. The substitution happens properly in the earlier part of the script, right up to the point where abyss comes in.
pe1='../'"$genomeID"'_trim/'"$genomeID"'_L1_1.fq.gz ...'
I added a single quote before and after the variable, closing and reopening the single-quoted string, so the shell expands $genomeID before abyss-pe sees the path.
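In other words, a hedged version of the full abyss-pe call with the quoting fixed; double quotes let the shell expand ${genomeID} before abyss-pe (a make wrapper) ever sees the paths:
abyss-pe -C k$k name="${genomeID}" k=$k lib='pe1 pe2' \
    pe1="../${genomeID}_trim/${genomeID}_L1_1.fq.gz ../${genomeID}_trim/${genomeID}_L1_2.fq.gz" \
    pe2="../${genomeID}_trim/${genomeID}_L2_1.fq.gz ../${genomeID}_trim/${genomeID}_L2_2.fq.gz"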

How do I use Nagios to monitor a log file that generates a random ID

This is the log file that I want to monitor:
/test/James-2018-11-16_15215125111115-16.15.41.111-appserver0.log
I want Nagios to read this log file so I can monitor it for a specific string.
The issue is with 15215125111115; this is the random ID that gets generated.
Here is my script where Nagios checks for the logfile path:
Variables:
HOSTNAMEIP=$(/bin/hostname -i)
DATE=$(date +%F)
..
CHECK=$(/usr/lib64/nagios/plugins/check_logfiles/check_logfiles
--tag='failorder' --logfile=/test/james-${date +"%F"}_-${HOSTNAMEIP}-appserver0.log
....
I am getting the following output in Nagios:
could not find logfile /test/James-2018-11-16_-16.15.41.111-appserver0.log
This number (15215125111115) is always generated randomly and I don't know how to get Nagios to match it. Is there a way to add a variable for this or something? I tried adding an asterisk "*" but that didn't work.
Any ideas would be much appreciated.
Try something like this:
--tag failorder --type rotating::uniform --logfile /test/dummy \
--rotation "James-$(date +"%F")_\d+-${HOSTNAMEIP}-appserver0.log"
If you add a "-v" you can see what happens inside. The type rotating::uniform tells check_logfiles that the rotation scheme makes no difference between the current log and rotated archives regarding the filename (you frequently find something like xyz.<...>.log). What check_logfiles does is look into the directory where the logfiles are supposed to be. From /test/dummy it only uses the directory part. Then it takes all the files inside /test and compares the filenames with the --rotation argument. The files which match are sorted by modification time, so check_logfiles knows which of the files in question was updated most recently, and the newest is considered to be the current logfile. Inside this file check_logfiles searches for the criticalpattern.
Gerhard
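Assembled with the variables from the question, a hedged sketch of the full check could look like this (it assumes your check_logfiles build accepts --criticalpattern on the command line; the pattern itself is a placeholder):
HOSTNAMEIP=$(/bin/hostname -i)
DATE=$(date +%F)
CHECK=$(/usr/lib64/nagios/plugins/check_logfiles/check_logfiles \
    --tag=failorder --type rotating::uniform \
    --logfile=/test/dummy \
    --rotation="James-${DATE}_\d+-${HOSTNAMEIP}-appserver0.log" \
    --criticalpattern='YOUR_STRING_HERE')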

How can I write own cloud-config in cloud-init?

cloud-init is a powerful way to inject user-data into a VM instance, and its existing modules provide lots of possibilities.
To make it easier to use, though, I want to define my own tag like the coreos one below; see details in running coreos in openstack:
#cloud-config
coreos:
  etcd:
    # generate a new token for each unique cluster from https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    # multi-region and multi-cloud deployments need to use $public_ipv4
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
So I could have something like the below, using my own defined tag/config myapp:
#cloud-config
myapp:
  admin: admin
  database: 192.168.2.3
I am new to cloud-init. Is this called a module? The documentation at http://cloudinit.readthedocs.org/en/latest/topics/modules.html is empty.
Can you provide some information describing how I can write my own module?
You need to write a "cc" module in a suitable directory, and modify a few configurations. It is not terribly easy, but certainly doable (we use it a lot).
Find the directory for cloudconfig modules. On Amazon Linux, this is /usr/lib/python2.6/site-packages/cloudinit/config/, but the directory location differs in different cloud init versions and distributions. The easiest way to find this is to find a file named cc_mounts.py.
Add a new file there, in your case cc_myapp.py. Copy some existing script as a base to know what to write there. The important function is def handle(name,cfg,cloud,log,args): which is basically the entrypoint for your script.
Implement your logic. The cfg parameter has a python object which is the parsed YAML config file. So for your case you would do something like:
myapp = cfg.get('myapp')
admin = myapp.get('admin')
database = myapp.get('database')
Ensure your script gets called by cloud-init. If your distribution uses the standard cloud-init setup, just adding the file might work. Otherwise you might need to add it to /etc/cloud/cloud.cfg.d/defaults.cfg or directly to /etc/cloud/cloud.cfg. There are keys called cloud_init_modules, cloud_config_modules, etc. which correspond to different parts of the init process where you can get your script run. If this does not work straight out of the box, you'll probably have to do a bit of investigation to find out how the modules are called on your system. For example, Amazon Linux used to have a hardcoded list of modules inside the init.d script, ignoring any lists specified in configuration files.
Also note that by default your script will run only once per instance, meaning that rerunning cloud-init will not run your script again. You need to either mark the script as being per boot by setting frequency to always in the configuration file listing your module, or remove the marker file saying that the script has run, which lives somewhere under /var/lib/cloud like in /var/lib/cloud/instances/i-86ecc763/sem/config_mounts.
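To illustrate that last point, a hedged sketch of forcing the module to run again while testing; cc_myapp is the hypothetical module from this answer and the semaphore path follows the layout described above:
# remove the per-instance semaphore for the hypothetical myapp module
sudo rm -f /var/lib/cloud/instances/*/sem/config_myapp*
# then re-run the config stage so cloud-init executes the config modules again
sudo cloud-init modules --mode config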
Let me paste my notes for you.
Config: after installing cloud-init in the VM, if you want to be able to log in as root with a password, do the simple config below.
Modify /etc/cloud/cloud.cfg like below:
users:
  - default
disable_root: 0
ssh_pwauth: 1
Note: ssh_pwauth will modify PasswordAuthentication in sshd_config automatically; 1 means yes.
Usage:
The behavior of cloud-init can be configured using user data. User data can be supplied by the user when the instance is started (user data is limited to 16K).
Mainly there are several ways to do this (tested):
user-data script
$ cat myscript.sh
#!/bin/sh
echo "Hello World. The time is now $(date -R)!" | tee /root/output.txt
When starting the instance, add the parameter --user-data myscript.sh, and the instance will run the script once during startup (and only once).
cloud config syntax:
It is YAML-based, see http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/files/head:/doc/examples/
run script
#cloud-config
runcmd:
- [ ls, -l, / ]
- [ sh, -xc, "echo $(date) ': hello world!'" ]
- [ sh, -c, echo "=========hello world'=========" ]
- ls -l /root
- [ wget, "http://slashdot.org", -O, /tmp/index.html ]
change hostname, password
#cloud-config
chpasswd:
  list: |
    root:123456
  expire: False
ssh_pwauth: True
hostname: test
include format
Run URL scripts: it will download the URL scripts and execute them in sequence; this can help to manage the scripts centrally.
#include
http://hostname/script1
http://hostname/scrpt2
