Slurm not running job generated by Snakemake because "file, directory or other parameter too long"

I have a snakemake rule that has 630k input file dependencies. This rule concatenates the files with an R script. The R script doesn't take any input files on the command line; it grabs them from within the script itself. When I run this on our HPC via Slurm, I get the following error message:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 4950
Job stats:
job                  count    min threads    max threads
------------------   -------  -------------  -------------
all_targets                1              1              1
simA_pool_clusters         1              1              1
total                      2              1              1
Select jobs to execute...
[Sun Feb 12 13:30:05 2023]
rule simA_pool_clusters:
input: workflow/scripts/simA_pool_clusters.R, data/sim_a/s1_1000_1.nofilter.deseq.bray.clusters.tsv, data/sim_a/s1_1000_1.nofilter.deseq.euclidean.clusters.tsv, [snip...]
output: data/simulation_cluster_accuracy.tsv
jobid: 194145
reason: Missing output files: data/simulation_cluster_accuracy.tsv
resources: mem_mb=2000, disk_mb=1000, tmpdir=<TBD>, cores=1, partition=standard, time_min=120, job_name=rare
sbatch: error: Batch job submission failed: Pathname of a file, directory or other parameter too long
Error submitting jobscript (exit code 1):
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-02-11T093752.352072.snakemake.log
Here is the snakemake rule:
rule simA_pool_clusters:
    input:
        R="workflow/scripts/simA_pool_clusters.R",
        tsv=expand("data/sim_a/{frac}_{depth}_{rep}.{preproc}.{transform}.{distance}.clusters.tsv",
                   frac = fracs, depth = depths, rep = reps, preproc = preprocs,
                   transform = transforms, distance = distances)
    conda:
        "envs/nr-modern.yml"
    output:
        "data/simulation_cluster_accuracy.tsv"
    shell:
        """
        {input.R}
        """
The input.tsv expands to 630k small files; I've shortened the list for ease of posting. The input.R is an executable R script with a shebang line; as I said, it finds the *clusters.tsv files through its own logic.
I'm wondering whether Snakemake is sending the entire value of input.tsv to Slurm rather than just the R script. Any suggestions to try before I resort to running the R script manually outside of Snakemake?

The R script doesn't take any input files but will grab them from within the R script
So you have an R script with 630k file paths hardcoded in it? If so, this link says the Slurm batch script has a 4 MB limit and you are probably over it, in which case it is a limitation of Slurm rather than Snakemake.
If the above is about right, you could write the 630k file paths to a text file first (one rule), and then use that file as the input to another rule.
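A rough sketch of that idea in Snakemake (the rule name simA_list_clusters and the list file path are invented here, and the expand() lists are assumed to be the same ones used in your rule); marking the listing rule as local keeps the 630k-path list from ever passing through sbatch:
localrules: simA_list_clusters  # run the listing step locally, not via sbatch

rule simA_list_clusters:
    input:
        tsv=expand("data/sim_a/{frac}_{depth}_{rep}.{preproc}.{transform}.{distance}.clusters.tsv",
                   frac=fracs, depth=depths, rep=reps, preproc=preprocs,
                   transform=transforms, distance=distances)
    output:
        "data/sim_a/cluster_files.txt"
    run:
        # write one path per line; no shell command or job script ever sees the full list
        with open(output[0], "w") as fh:
            fh.write("\n".join(input.tsv) + "\n")

rule simA_pool_clusters:
    input:
        R="workflow/scripts/simA_pool_clusters.R",
        file_list="data/sim_a/cluster_files.txt"
    output:
        "data/simulation_cluster_accuracy.tsv"
    shell:
        "{input.R} {input.file_list}"
The pooling job then has only two input files, so its job script should stay far below the limit, and the R script could read the paths from cluster_files.txt instead of globbing them itself.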
Loosely related: I wonder whether the file system is going to struggle with moving around so many files. Perhaps you could consider some refactoring...

Related

Execute a subprocess that takes an input file and writes the output to a file

I am using a third-party C++ program to generate intermediate results for the Python program that I am working on. The terminal command that I use is as follows, and it works fine.
./ukb/src/ukb_wsd --ppr_w2w -K ukb/scripts/wn30g.bin -D ukb/scripts/wn30_dict.txt ../data/glass_ukb_input2.txt > ../data/glass_ukb_output2w2.txt
If I break it down into smaller pieces:
./ukb/src/ukb_wsd - executable program
--ppr_w2w - one of the options/switches
-K ukb/scripts/wn30g.bin - parameter K indicates that the next item is a file (network file)
-D ukb/scripts/wn30_dict.txt - parameter D indicates that the next item is a file (dictionary file)
../data/glass_ukb_input2.txt - input file
> - shell command to write the output to a file
../data/glass_ukb_output2w2.txt - output file
The above works fine for one instance. I am trying to do this for around 70000 items (input files), so I found a way using the subprocess module in Python. The body of the Python function that I created looks like this:
with open('../data/glass_ukb_input2.txt', 'r') as input, open('../data/glass_ukb_output2w2w_subproc.txt', 'w') as output:
    subprocess.run(['./ukb/src/ukb_wsd', '--ppr_w2w', '-K', 'ukb/scripts/wn30g.bin', '-D', 'ukb/scripts/wn30_dict.txt'],
                   stdin=input,
                   stdout=output)
When I execute the function, it gives an error as follows (this error is no longer there; see the EDIT below):
...
STDOUT = subprocess.STDOUT
AttributeError: module 'subprocess' has no attribute 'STDOUT'
Can anyone shed some light on solving this problem?
EDIT
The error was due to a file named subprocess.py in the source directory, which masked Python's subprocess module. Once it was removed, there was no error.
But the program could not identify the input file given via stdin. I am thinking it has to do with having 3 input files. Is there a way to provide more than one input file?
EDIT 2
This problem is now solved with the current approach:
subprocess.run('./ukb/src/ukb_wsd --ppr_w2w -K ukb/scripts/wn30g.bin -D ukb/scripts/wn30_dict.txt ../data/glass_ukb_input2.txt > ../data/glass_ukb_output2w2w_subproc.txt',shell=True)
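For reference, a similar call can keep the list form (avoiding shell=True) by passing the input file as a positional argument, as in the original terminal command, and redirecting only stdout; this is just a sketch reusing the paths from above:
import subprocess

# Sketch: the input file is passed as a positional argument, so only stdout
# needs to be redirected to the output file.
with open('../data/glass_ukb_output2w2w_subproc.txt', 'w') as output:
    subprocess.run(['./ukb/src/ukb_wsd', '--ppr_w2w',
                    '-K', 'ukb/scripts/wn30g.bin',
                    '-D', 'ukb/scripts/wn30_dict.txt',
                    '../data/glass_ukb_input2.txt'],
                   stdout=output, check=True)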

snakemake wildcard in input files

I am very new to snakemake and I am trying to create a merged.fastq for each sample. The following is my Snakefile:
configfile: "config.yaml"
print(config['samples'])
print(config['ss_files'])
print(config['pass_files'])
rule all:
input:
expand("{sample}/data/genome_assembly/medaka/medaka.fasta", sample=config["samples"]),
expand("{pass_file}", pass_file=config["pass_files"]),
expand("{ss_file}", ss_file=config["ss_files"])
rule merge_fastq:
input:
directory("{pass_file}")
output:
"{sample}/data/merged.fastq.gz"
wildcard_constraints:
id="*.fastq.gz"
shell:
"cat {input}/{id} > {output}"
where 'samples' is a list of sample names and 'pass_files' is a list of directory paths to the fastq_pass folders, which contain small fastq files.
I am trying to merge the small fastq files into a large merged.fastq for each sample.
I am getting the following error:
Wildcards in input files cannot be determined from output files:
'pass_file'
Each wildcard in the input section must have a corresponding wildcard (with the same name) in the output section. That is how Snakemake works: when Snakemake constructs the DAG of jobs and finds that it needs a certain file, it looks at the output section of each rule and checks whether that rule can produce the required file. This is how Snakemake assigns concrete values to the wildcards in the output section. Every wildcard in the other sections must match one of the wildcards in the output; that is how the input gets concrete filenames.
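For example, with a made-up rule like the one below, requesting the file results/A.sorted.bam matches the output pattern, fixes the wildcard sample to A, and that value is then substituted into the input, which becomes data/A.bam:
# Hypothetical illustration: the wildcard value comes from matching the requested
# file against the output pattern, and is then reused to build the input filename.
rule sort_bam:
    input:
        "data/{sample}.bam"
    output:
        "results/{sample}.sorted.bam"
    shell:
        "samtools sort {input} > {output}"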
Now let's look at your rule merge_fastq:
rule merge_fastq:
    input:
        directory("{pass_file}")
    output:
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"
The only wildcard that can get a value is {sample}; {pass_file} and {id} are dangling.
As I see it, you are trying to merge files that are not known at design time. Take a look at dynamic files, checkpoints, and using a function in the input (sketched at the end of this answer).
The rest of your Snakefile is hard to understand. For example, I don't see how you specify the files that match this pattern: "{sample}/data/merged.fastq.gz".
Update:
Let's say I have a directory (/home/other_computer/jobs/data/<sample_name>/*.fastq.gz) which is my input, and the output is (/result/merged/<sample_name>/merged.fastq.gz). What I tried is having the first path as input: {"pass_files"} (this comes from my config file) and output: "result/merged/{sample}/merged.fastq.gz"
First, let's simplify the task a little and replace the {pass_file} with a hardcoded path. You have two degrees of freedom: the <sample_name> and the unknown files in the /home/other_computer/jobs/data/<sample_name>/ folder. The <sample_name> is a good candidate for becoming a wildcard, as this name can be derived from the target file. The unknown number of *.fastq.gz files doesn't even require any Snakemake constructs, as it can be expressed with a shell glob.
rule merge_fastq:
    output:
        "/result/merged/{sample_name}/merged.fastq.gz"
    shell:
        "cat /home/other_computer/jobs/data/{wildcards.sample_name}/*.fastq.gz > {output}"

Snakemake slurm ouput file redirect to new directory

I'm putting together a snakemake slurm workflow and am having trouble with my working directory becoming cluttered with slurm output files. I would like my workflow to, at a minimum, direct these files to a 'slurm' directory inside my working directory. I currently have my workflow set up as follows:
config.yaml:
reads:
  1:
  2:
samples:
  15FL1-2: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS/data/15FL1-2
  15Fl1-4: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS/data/15Fl1-4
cluster.yaml:
localrules: all
__default__:
  time: 0:5:0
  mem: 1G
  output: _{rule}_{wildcards.sample}_%A.slurm
fastqc_raw:
  job_name: sm_fastqc_raw
  time: 0:10:0
  mem: 1G
  output: slurm/_{rule}_{wildcards.sample}_{wildcards.read}_%A.slurm
Snakefile:
configfile: "config.yaml"
workdir: config["work"]
rule all:
input:
expand("analysis/fastqc_raw/{sample}_R{read}_fastqc.html", sample=config["samples"],read=config["reads"])
rule clean:
shell:
"rm -rf analysis logs"
rule fastqc_raw:
input:
'data/{sample}_R{read}.fastq.gz'
output:
'analysis/fastqc_raw/{sample}_R{read}_fastqc.html'
log:
err = 'logs/fastqc_raw/{sample}_R{read}.out',
out = 'logs/fastqc_raw/{sample}_R{read}.err'
shell:
"""
fastqc {input} --noextract --outdir 'analysis/fastqc_raw' 2> {log.err} > {log.out}
"""
I then call with:
snakemake --jobs 4 --cluster-config cluster.yaml --cluster "sbatch --mem={cluster.mem} --time={cluster.time} --job-name={cluster.job_name} --output={cluster.output}"
This does not work, as the slurm directory does not already exist. I don't want to manually create it before running my snakemake command; that will not scale. Things I've tried, after reading every related question, are:
1) Simply trying to capture all the output via the log within the rule, and setting cluster.output='/dev/null'. This doesn't work: the info in the slurm output isn't captured, as it's not really output of the rule, it's info on the job.
2) Forcing the directory to be created by adding a dummy log:
log:
    err = 'logs/fastqc_raw/{sample}_R{read}.out',
    out = 'logs/fastqc_raw/{sample}_R{read}.err',
    jobOut = 'slurm/out.err'
I think this doesn't work because sbatch tries to find the slurm folder before the rule is run.
3) Allowing the files to be made in the working directory, and adding bash code to the end of the rule to move the files into a slurm directory. I believe this doesn't work because it tries to move the files before the job has finished writing to the slurm output.
Any further ideas or tricks?
You should be able to suppress these outputs by calling sbatch with --output=/dev/null --error=/dev/null. Something like this:
snakemake ... --cluster "sbatch --output=/dev/null --error=/dev/null ..."
If you want the files to go to a directory of your choosing you can of course change the call to reflect that:
snakemake ... --cluster "sbatch --output=/home/Ensa/slurmout/%j.out --error=/home/Ensa/slurmout/%j.out ..."
So this is how I solved the issue (there's probably a better way, and if so, I hope someone will correct me). Personally I will go to great lengths to avoid hard-coding anything. I use a snakemake profile and an sbatch script.
First, I make a snakemake profile that contains a line like this:
cluster: "sbatch --output=slurm_out/slurm-%j.out --mem={resources.mem_mb} -c {resources.cpus} -J {rule}_{wildcards} --mail-type=FAIL --mail-user=me#me.edu"
You can see the --output parameter to redirect the slurm output files to a subdirectory called slurm_out in the current working directory. But AFAIK, slurm can't create that directory if it doesn't exist. So...
Next I make a small wrapper script whose only job is to make the subdirectory and then submit the workflow script with sbatch. This "wrapper" looks like:
#!/bin/bash
mkdir -p ./slurm_out
sbatch snake_submit.sbatch
And finally, the snake_submit.sbatch looks like:
#!/bin/bash
ml snakemake
snakemake --profile <myprofile>
In this case both the wrapper and the sbatch script that it calls will have their slurm out files in the current working directory. I prefer it that way because it's easier for me to locate them. But I think you could easily redirect them by adding an #SBATCH --output parameter to the snake_submit.sbatch script (but not to the wrapper, then it's turtles all the way down, you know?).
I hope that makes sense.

Bash for loop stops after first iteration

I have the following bash code. The for loop takes values from two successive locations of an array, then creates a corresponding directory on a cluster, where it creates a .cpki file and runs it. Unfortunately, this code stops working after the first iteration.
declare -a CT
CT=(2 0 -1 -2)
len=${#CT[@]}
for ((i=0;i<len;i++)); do
    a=${CT[$i]}
    b=${CT[$((i+1))]}
    input=${job_type}_${a}_${b}
    WorkDir=/scratch/$USER/${input}.${JOB_ID}  # Directory in a cluster
    mkdir -p $WorkDir
    cd $WorkDir  # Go to cluster
    ...
    Code that creates ${filename}.cpki file using a and b
    ...
    $MPIRUN -np $NSLOTS $CP2K -i "${filename}.cpki" > "${filename}.cpko"
done
Please check whether the len variable is changed unintentionally in the "Code that creates ${filename}.cpki file using a and b" part of the script.

String method tutorial in abinit - no output file produced

I'm trying to follow this tutorial in abinit: https://docs.abinit.org/tutorial/paral_images/
When trying to run abinit for any of the tstring files, no output file is produced. For instance, I copy the files tstring_01.in and tstring.files into the subdirectory work_paral_string, edit tstring.files with the appropriate file names, and run the command mpirun -n 20 abinit < tstring.files > log 2> err. No error message is shown, but no output is produced either. (The expected output file would be tstring_01.out.)
Any suggestions?
