Snakemake refuses to unpack input function when rule A is a dependency of rule B, but accepts it when rule A is the final rule

I have a Snakemake workflow for a metagenomics project. At one point in the workflow, I map DNA sequencing reads (either single-end or paired-end) to metagenome assemblies made earlier by the same workflow. Following the Snakemake manual, I wrote an input function so that a single rule can map both single-end and paired-end reads, like so:
import os.path

def get_binning_reads(wildcards):
    pathpe = ("data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed_paired.R1.fastq.gz")
    pathse = ("data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed.fastq.gz")
    if os.path.isfile(pathpe):
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed_paired.R{PE}.fastq.gz", PE=[1, 2], binningsignal=wildcards.binningsignal)}
    elif os.path.isfile(pathse):
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed.fastq.gz", binningsignal=wildcards.binningsignal)}

rule backmap_bwa_mem:
    input:
        unpack(get_binning_reads),
        index=expand("data/assembly_{{assemblytype}}/{{hostcode}}/scaffolds_bwa_index/scaffolds.{ext}", ext=['bwt', 'pac', 'ann', 'sa', 'amb'])
    params:
        lambda w: expand("data/assembly_{assemblytype}/{hostcode}/scaffolds_bwa_index/scaffolds", assemblytype=w.assemblytype, hostcode=w.hostcode)
    output:
        "data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam"
    threads: 100
    log:
        stdout="logs/bwa_backmap_samtools_{assemblytype}_{hostcode}.stdout",
        samstderr="logs/bwa_backmap_samtools_{assemblytype}_{hostcode}.stdout",
        stderr="logs/bwa_backmap_{assemblytype}_{hostcode}.stderr"
    shell:
        "bwa mem -t {threads} {params} {input.reads} 2> {log.stderr} | samtools view -@ 12 -b -o {output} 2> {log.samstderr} > {log.stdout}"
When I make an arbitrary 'all-rule' like this, the workflow runs successfully.
rule allbackmapped:
    input:
        expand("data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam", binningsignal=BINNINGSIGNALS, assemblytype=ASSEMBLYTYPES, hostcode=HOSTCODES)
However, when the files created by this rule are required by subsequent rules, like so:
rule backmap_samtools_sort:
    input:
        "data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam"
    output:
        "data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.sorted.bam"
    threads: 6
    resources:
        mem_mb=5000
    shell:
        "samtools sort -@ {threads} -m {resources.mem_mb}M -o {output} {input}"
rule allsorted:
    input:
        expand("data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.sorted.bam", binningsignal=BINNINGSIGNALS, assemblytype=ASSEMBLYTYPES, hostcode=HOSTCODES)
the workflow exits with this error:
WorkflowError in line 416 of /stor/azolla_metagenome/Azolla_genus_metagenome/Snakefile:
Can only use unpack() on list and dict
To me, this error suggests the input function of the former rule is faulty. That, however, seems not to be the case, for the rule ran successfully when no subsequent processing was queued.
The entire project is hosted on GitHub; both the full Snakefile and a GitHub issue about this problem are available there.
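For what it's worth, one detail stands out (an observation, not a confirmed diagnosis): if neither pathpe nor pathse exists when Snakemake evaluates the input function, which can happen once downstream rules force the DAG to be built before the trimmed files are on disk, get_binning_reads falls through both branches and implicitly returns None, and unpack() then fails with exactly this "Can only use unpack() on list and dict" message. A defensive sketch of the function that at least makes the failure explicit:

import os.path

def get_binning_reads(wildcards):
    base = "data/sequencing_binning_signals/" + wildcards.binningsignal
    if os.path.isfile(base + ".trimmed_paired.R1.fastq.gz"):
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed_paired.R{PE}.fastq.gz",
                                PE=[1, 2], binningsignal=wildcards.binningsignal)}
    elif os.path.isfile(base + ".trimmed.fastq.gz"):
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed.fastq.gz",
                                binningsignal=wildcards.binningsignal)}
    else:
        # Neither trimmed file exists yet at DAG-construction time; fail loudly
        # instead of silently returning None, which unpack() rejects.
        raise FileNotFoundError("no trimmed reads found for " + wildcards.binningsignal)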

Related

SED style Multi address in Python?

I have an app that parses multiple Cisco show tech files. These files contain the output of multiple router commands in a structured way; let me show you a snippet of a show tech output:
`show clock`
20:20:50.771 UTC Wed Sep 07 2022
Time source is NTP
`show callhome`
callhome disabled
Callhome Information:
<SNIPPET>
`show module`
Mod Ports Module-Type Model Status
--- ----- ------------------------------------- --------------------- ---------
1 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
2 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
3 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
4 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
21 0 Fabric Module N9K-C9504-FM-R ok
22 0 Fabric Module N9K-C9504-FM-R ok
23 0 Fabric Module N9K-C9504-FM-R ok
<SNIPPET>
My app currently uses both sed and Python scripts to parse these files. I use sed to scan the show tech file looking for a specific command's output; once I find it, I stop sed. This way I don't need to read the whole file (these files can get very big). This is a snippet of my sed script:
sed -E -n '/`show running-config`|`show running`|`show running config`/{
p
:loop
n
p
/`show/q
b loop
}' $1/$file
As you can see, I am using a multi-address range in sed. My question, specifically, is: how can I achieve something similar in Python? I have tried multiple combinations of the DOTALL and MULTILINE flags, but I can't get the result I'm expecting; for example, I can get a match for the command I'm looking for, but the Python regex won't stop at the next command after the first match — it runs on to the end of the file.
I am looking for something like this
sed -n '/`show clock`/,/`show/p'
I would like the regex to stop parsing the file and print the results immediately after seeing `show again. Hope that makes sense; thank you all for reading and for your help.
You can use nested loops.
import re

def process_file(filename):
    with open(filename) as f:
        for line in f:
            if re.search(r'`show running-config`|`show running`|`show running config`', line):
                print(line)
                for line1 in f:
                    print(line1)
                    if re.search(r'`show', line1):
                        return
The inner for loop will start from the next line after the one processed by the outer loop.
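This works because the open file object is a single iterator shared by both loops: the inner loop resumes exactly where the outer loop paused, rather than restarting from the top of the file. A tiny self-contained demonstration of the same behavior, using a plain list iterator instead of a file:

it = iter(['a', 'b', 'c', 'd'])
for item in it:
    print('outer:', item)        # consumes 'a'
    for inner in it:             # resumes at 'b', not at the start
        print('inner:', inner)   # consumes 'b', 'c', 'd'
    # the inner loop exhausted the iterator, so the outer loop ends here too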
You can also do it with a single loop using a flag variable.
import re

def process_file(filename):
    in_show = False
    with open(filename) as f:
        for line in f:
            if re.search(r'`show running-config`|`show running`|`show running config`', line):
                in_show = True
                print(line)
                continue  # skip the terminator check for the opening line itself
            if in_show:
                print(line)
                if re.search(r'`show', line):
                    return

snakemake allocates memory twice

I am noticing that all my rules request memory twice, once at a lower maximum than what I requested (mem_mb) and then at what I actually requested (mem_gb). If I run the rules as localrules they do run faster. How can I make sure the default settings do not interfere?
resources: mem_mb=100, disk_mb=8620, tmpdir=/tmp/pop071.54835, partition=h24, qos=normal, mem_gb=100, time=120:00:00
The rules are as follows:
rule bwa_mem2_mem:
    input:
        R1 = "data/results/qc/{species}.{population}.{individual}_1.fq.gz",
        R2 = "data/results/qc/{species}.{population}.{individual}_2.fq.gz",
        R1_unp = "data/results/qc/{species}.{population}.{individual}_1_unp.fq.gz",
        R2_unp = "data/results/qc/{species}.{population}.{individual}_2_unp.fq.gz",
        idx = "data/results/genome/genome",
        ref = "data/results/genome/genome.fa"
    output:
        bam = "data/results/mapped_reads/{species}.{population}.{individual}.bam",
    log:
        bwa = "logs/bwa_mem2/{species}.{population}.{individual}.log",
        sam = "logs/samtools_view/{species}.{population}.{individual}.log",
    benchmark:
        "benchmark/bwa_mem2_mem/{species}.{population}.{individual}.tsv"
    resources:
        time = parameters["bwa_mem2"]["time"],
        mem_gb = parameters["bwa_mem2"]["mem_gb"],
    params:
        extra = parameters["bwa_mem2"]["extra"],
        tag = compose_rg_tag,
    threads:
        parameters["bwa_mem2"]["threads"]
    shell:
        "bwa-mem2 mem -t {threads} -R '{params.tag}' {params.extra} {input.idx} {input.R1} {input.R2} | "
        "samtools sort -l 9 -o {output.bam} --reference {input.ref} --output-fmt CRAM -@ {threads} /dev/stdin 2> {log.sam}"
and the config is:
cluster:
  mkdir -p logs/{rule} && # change the log file to logs/slurm/{rule}
  sbatch
    --partition={resources.partition}
    --time={resources.time}
    --qos={resources.qos}
    --cpus-per-task={threads}
    --mem={resources.mem_gb}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
    --parsable # Required to pass job IDs to scancel
default-resources:
  - partition=h24
  - qos=normal
  - mem_gb=100
  - time="04:00:00"
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True # Required to run with local conda environment
cluster-status: status-sacct.sh # Required to monitor the status of the submitted jobs
cluster-cancel: scancel # Required to cancel the jobs with Ctrl + C
cluster-cancel-nargs: 50
Cheers,
Angel
Right now there are two separate memory resource requirements:
mem_mb
mem_gb
From the perspective of Snakemake these are different resources, so both will be passed to the cluster. A quick fix is to use the same units everywhere; e.g., if the rule really requires only 100 MB, then the default resource should be changed to:
default-resources:
  - partition=h24
  - qos=normal
  - mem_mb=100
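For completeness, a sketch of the rule side with everything in one unit (the sample paths, numbers, and placeholder shell command below are illustrative, not from the original workflow; a bare number given to sbatch's --mem is read as megabytes, so the cluster line would become --mem={resources.mem_mb}):

rule bwa_mem2_mem:
    input:
        "data/results/qc/sample_1.fq.gz"
    output:
        "data/results/mapped_reads/sample.bam"
    resources:
        mem_mb = 100 * 1000   # formerly mem_gb=100; a single memory resource, in MB
    shell:
        "cp {input} {output}"  # placeholder command for the sketch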

--latency-wait Error MissingOutputException

I am working with Snakemake to build a pipeline for my work.
My input files are (example FASTQ files):
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-1_R1.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-1_R2.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-2_R1.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-2_R2.fastq.gz
I have written the following code for trimming the data using fastp:
"""
Author: Dave Amir
Affiliation: St Lukes
Aim: A simple Snakemake workflow to process paired-end stranded RNA-Seq.
Date: 11 June 2015
Run: snakemake -s snakefile
Latest modification:
- todo
"""
# This should be placed in the Snakefile.
##-----------------------------------------------##
## Working directory ##
## Adapt to your needs ##
##-----------------------------------------------##
BASE_DIR = "/project/ateeq/PROJECT"
WDIR = BASE_DIR + "/snakemake-example"
workdir: WDIR
#message("The current working directory is " + WDIR)
##--------------------------------------------------------------------------------------##
## Variables declaration
## Declaring some variables used by topHat and other tools...
## (GTF file, INDEX, chromosome length)
##--------------------------------------------------------------------------------------##
INDEX = BASE_DIR + "/ref_files/hg19/assembly/"
GTF = BASE_DIR + "/hg19/Hg19_CTAT_resource_lib/ref_annot.gtf"
CHR = BASE_DIR + "/static/humanhg19.annot.csv"
FASTA = BASE_DIR + "/ref_files/hg19/assembly/hg19.fasta"
##--------------------------------------------------------------------------------------##
## The list of samples to be processed
##--------------------------------------------------------------------------------------##
SAMPLES, = glob_wildcards("/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz")
NB_SAMPLES = len(SAMPLES)
for smp in SAMPLES:
message:("Sample " + smp + " will be processed")
##--------------------------------------------------------------------------------------##
## Our First rule - sample trimming
##--------------------------------------------------------------------------------------##
rule final:
input: expand("/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", smp=SAMPLES)
rule trimming:
input: fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
output: fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq", rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
threads: 30
message: """--- Trimming."""
shell: """
fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
"""
When I run the pipeline, it shows the following error:
MissingOutputException in line 51 of /project/ateeq/PROJECT/snakemake-example/snakefile:
Missing files after 5 seconds:
/project/ateeq/PROJECT/snakemake-example/trimmed/H667-1_R1_trimmed.fastq
/project/ateeq/PROJECT/snakemake-example/trimmed/H667-1_R2_trimmed.fastq
/project/ateeq/PROJECT/snakemake-example/report/H667-1.html
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Though I have increased the latency wait period to 2000 seconds, it still ends up throwing the error:
snakemake -s snakefile -j 30 --latency-wait 2000
I am using Snakemake version 5.4.5 and Python 3.6.8. Please let me know where I am going wrong; it would be a great help to me.
Thanks for the kind help,
Sincerely,
Dave
I don't know if this is due to the copy & paste but the indentation is wrong:
rule trimming:
    input: fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
    output: fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq", rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
    threads: 30
    message: """--- Trimming."""
    shell: """
        fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>> {input.fwd}.log
        """
Shouldn't it be:
rule trimming:
    input:
        fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",
        rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
    output:
        fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq",
        rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq",
        rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
    threads: 30
    message: """--- Trimming."""
    shell: """
        fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>> {input.fwd}.log
        """

AC_ARG_ENABLE in an m4_foreach_w loop: no help string

I wish to generate a lot of --enable-*/--disable-* options by something like:
COMPONENTS([a b c], [yes])
where the second argument is the default value of the automatic enable_* variable. My first attempt was to write an AC_ARG_ENABLE(...) within an m4_foreach_w, but so far, I'm only getting the first component to appear in the ./configure --help output.
If I add hand-written AC_ARG_ENABLEs, they work as usual.
Regardless, the --enable-*/--disable-* options work as they should; just the help text is missing.
Here's the full code to reproduce the problem:
AC_INIT([foo], 1.0)
AM_INIT_AUTOMAKE([foreign])

AC_DEFUN([COMPONENTS],
[
m4_foreach_w([component], [$1], [
  AS_ECHO(["Processing [component] component with default enable=$2"])
  AC_ARG_ENABLE([component],
    [AS_HELP_STRING([--enable-[]component], [component] component)],
    ,
    [enable_[]AS_TR_SH([component])=$2]
  )
])

AC_ARG_ENABLE([x],
  [AS_HELP_STRING([--enable-[]x], [component x])],
  ,
  [enable_[]AS_TR_SH([x])=$2]
)
AC_ARG_ENABLE([y],
  [AS_HELP_STRING([--enable-[]y], [component y])],
  ,
  [enable_[]AS_TR_SH([y])=$2]
)
])

COMPONENTS([a b c], [yes])

for var in a b c x y; do
  echo -n "\$enable_$var="
  eval echo "\$enable_$var"
done

AC_CONFIG_FILES(Makefile)
AC_OUTPUT
And an empty Makefile.am. To verify that the options work:
$ ./configure --disable-a --disable-b --disable-d --disable-x
configure: WARNING: unrecognized options: --disable-d
...
Processing component a with default enable=yes
Processing component b with default enable=yes
Processing component c with default enable=yes
$enable_a=no
$enable_b=no
$enable_c=yes
$enable_x=no
$enable_y=yes
After I poked around in autoconf sources, I figured out this has to do with the m4_divert_once call in the implementation of AC_ARG_ENABLE:
# AC_ARG_ENABLE(FEATURE, HELP-STRING, [ACTION-IF-TRUE], [ACTION-IF-FALSE])
# ------------------------------------------------------------------------
AC_DEFUN([AC_ARG_ENABLE],
[AC_PROVIDE_IFELSE([AC_PRESERVE_HELP_ORDER],
[],
[m4_divert_once([HELP_ENABLE], [[
Optional Features:
  --disable-option-checking  ignore unrecognized --enable/--with options
  --disable-FEATURE       do not include FEATURE (same as --enable-FEATURE=no)
  --enable-FEATURE[=ARG]  include FEATURE [ARG=yes]]])])dnl
m4_divert_once([HELP_ENABLE], [$2])dnl
_AC_ENABLE_IF([enable], [$1], [$3], [$4])dnl
])# AC_ARG_ENABLE

# m4_divert_once(DIVERSION-NAME, CONTENT)
# ---------------------------------------
# Output CONTENT into DIVERSION-NAME once, if not already there.
# An end of line is appended for free to CONTENT.
m4_define([m4_divert_once],
[m4_expand_once([m4_divert_text([$1], [$2])])])
I'm guessing that the HELP-STRING argument is remembered in its unexpanded form, so it is added just once for all components. Manually expanding the AS_HELP_STRING does what I want:
AC_DEFUN([COMPONENTS],
[
m4_foreach_w([comp], [$1], [
  AS_ECHO(["Processing component 'comp' with default enable=$2"])
  AC_ARG_ENABLE([comp],
    m4_expand([AS_HELP_STRING([--enable-comp], enable component comp)]),
    ,
    [enable_[]AS_TR_SH([comp])=$2]
  )
])
])

COMPONENTS([a b c x y], [yes])
I couldn't find a way to properly quote component so that it appears as a literal string after being used as the loop variable in m4_foreach_w, so I just renamed it to comp to spare myself the trouble.

Force lshosts command to return megabytes for "maxmem" and "maxswp" parameters

When I type "lshosts" I am given:
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
server1 X86_64 Intel_EM 60.0 12 191.9G 159.7G Yes ()
server2 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
server3 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
I am trying to have lshosts return maxmem and maxswp as megabytes, not gigabytes. I am trying to send Xilinx ISE jobs to my LSF cluster; however, the software expects integer megabyte values for maxmem and maxswp. From debugging, it appears that the software grabs these parameters using the lshosts command.
I have already checked that my lsf.conf file contains:
LSF_UNIT_FOR_LIMITS=MB
I have tried searching the IBM Knowledge Base, but to no avail.
Do you use a specific command to specify maxmem and maxswp units within the lsf.conf, lsf.shared, or other config files?
Or does LSF simply return the most practical unit?
Any way to override this?
LSF_UNIT_FOR_LIMITS should work, if you completely drained the cluster of all running, pending, and finished jobs. According to the docs, MB is the default, so I'm surprised.
That said, you can use something like this to transform the results:
$ cat to_mb.awk
function to_mb(s) {
    # The last character is the unit letter: K=1, M=2, G=3 in the index string.
    e = index("KMG", substr(s, length(s)))
    # The numeric part is everything before the unit letter.
    m = substr(s, 1, length(s) - 1)
    # Scale to megabytes: K -> 10^-3, M -> 10^0, G -> 10^3.
    return m * 10^((e - 2) * 3)
}

{ print $1 " " to_mb($6) " " to_mb($7) }
$ lshosts | tail -n +2 | awk -f to_mb.awk
server1 191900 159700
server2 191900 191200
server3 191900 191200
The to_mb function should also handle 'K' or 'M' units, should those pop up.
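If the consumer of these values is a script anyway, the same conversion is easy to express in Python; here is a sketch mirroring the awk logic above (the bare-number fallback is an assumption, not documented lshosts behavior):

def to_mb(s):
    """Convert an lshosts size such as '191.9G' or '159.7M' to megabytes."""
    scale = {'K': 1e-3, 'M': 1.0, 'G': 1e3}
    if s and s[-1] in scale:
        return float(s[:-1]) * scale[s[-1]]
    return float(s)  # assume a bare number is already in megabytes

print(round(to_mb('191.9G')))  # 191900
print(round(to_mb('159.7G')))  # 159700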
If LSF_UNIT_FOR_LIMITS is defined in lsf.conf, lshosts will always print the output as a floating point number, and in some versions of LSF the parameter is defined as 'KB' in lsf.conf upon installation.
Try searching for any definitions of the parameter in lsf.conf and commenting them all out so that the parameter is left undefined; I think in that case lshosts defaults to printing an integer number of megabytes.
(Don't ask me why it works this way)
