snakemake allocates memory twice - resources

I am noticing that all my rules request memory twice, one at a lower maximum than what I requested (mem_mb) and then what I actually requested (mem_gb). If I run the rules as localrules they do run faster. How can I make sure the default settings do not interfere?
resources: mem_mb=100, disk_mb=8620, tmpdir=/tmp/pop071.54835, partition=h24, qos=normal, mem_gb=100, time=120:00:00
The rules are as follows:
rule bwa_mem2_mem:
input:
R1 = "data/results/qc/{species}.{population}.{individual}_1.fq.gz",
R2 = "data/results/qc/{species}.{population}.{individual}_2.fq.gz",
R1_unp = "data/results/qc/{species}.{population}.{individual}_1_unp.fq.gz",
R2_unp = "data/results/qc/{species}.{population}.{individual}_2_unp.fq.gz",
idx= "data/results/genome/genome",
ref = "data/results/genome/genome.fa"
output:
bam = "data/results/mapped_reads/{species}.{population}.{individual}.bam",
log:
bwa ="logs/bwa_mem2/{species}.{population}.{individual}.log",
sam ="logs/samtools_view/{species}.{population}.{individual}.log",
benchmark:
"benchmark/bwa_mem2_mem/{species}.{population}.{individual}.tsv",
resources:
time = parameters["bwa_mem2"]["time"],
mem_gb = parameters["bwa_mem2"]["mem_gb"],
params:
extra = parameters["bwa_mem2"]["extra"],
tag = compose_rg_tag,
threads:
parameters["bwa_mem2"]["threads"],
shell:
"bwa-mem2 mem -t {threads} -R '{params.tag}' {params.extra} {input.idx} {input.R1} {input.R2} | "
"samtools sort -l 9 -o {output.bam} --reference {input.ref} --output-fmt CRAM -# {threads} /dev/stdin 2> {log.sam}"
and the config is:
cluster:
mkdir -p logs/{rule} && # change the log file to logs/slurm/{rule}
sbatch
--partition={resources.partition}
--time={resources.time}
--qos={resources.qos}
--cpus-per-task={threads}
--mem={resources.mem_gb}
--job-name=smk-{rule}-{wildcards}
--output=logs/{rule}/{rule}-{wildcards}-%j.out
--parsable # Required to pass job IDs to scancel
default-resources:
- partition=h24
- qos=normal
- mem_gb=100
- time="04:00:00"
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True # Required to run with local conda enviroment
cluster-status: status-sacct.sh # Required to monitor the status of the submitted jobs
cluster-cancel: scancel # Required to cancel the jobs with Ctrl + C
cluster-cancel-nargs: 50
Cheers,
Angel

Right now there are two separate memory resource requirements:
mem_mb
mem_gb
From the perspective of snakemake these are different, so both will be passed to the cluster. A quick fix is to use the same units, e.g. if the resource really requires only 100 mb, then the default resource should be changed to:
default-resources:
- partition=h24
- qos=normal
- mem_mb=100

Related

Nextflow with Azure Batch - Cannot find a matching VM image

While trying to set up Nextflow with Azure Batch (NF-Core), I am getting following error. I tried this on multiple workflows (sarek, ataseq etc.) I get the same error -
N E X T F L O W ~ version 22.04.0
Pulling nf-core/atacseq ...
downloaded from https://github.com/nf-core/atacseq.git
Launching `https://github.com/nf-core/atacseq` [rhl6d5529] DSL1 - revision: 1b3a832db5 [1.2.1]
Downloading plugin nf-azure#0.13.1
----------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/atacseq v1.2.1
----------------------------------------------------
Run Name : rhl6d5529
Data Type : Paired-End
Design File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/design.csv
Genome : Not supplied
Fasta File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genome.fa
GTF File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genes.gtf
Mitochondrial Contig : MT
MACS2 Genome Size : 1.2E+7
Min Consensus Reps : 1
MACS2 Narrow Peaks : No
MACS2 Broad Cutoff : 0.1
Trim R1 : 0 bp
Trim R2 : 0 bp
Trim 3' R1 : 0 bp
Trim 3' R2 : 0 bp
NextSeq Trim : 0 bp
Fingerprint Bins : 100
Save Genome Index : No
Max Resources : 6 GB memory, 2 cpus, 12h time per job
Container : docker - nfcore/atacseq:1.2.1
Output Dir : ./results
Launch Dir : /
Working Dir : /nextflow/atacseq/rhl6d5529
Script Dir : /.nextflow/assets/nf-core/atacseq
User : root
Config Profile : test,azurebatch
Config Description : Minimal test dataset to check pipeline function
Config Contact : Venkat Malladi (#vsmalladi)
Config URL : https://azure.microsoft.com/services/batch/
----------------------------------------------------
Uploading local `bin` scripts folder to az://nextflow/atacseq/rhl6d5529/tmp/66/bd55d79e42999df38ba04a81c3aa04/bin
[- ] process > CHECK_DESIGN -
[- ] process > CHECK_DESIGN [ 0%] 0 of 1
[- ] process > CHECK_DESIGN [ 0%] 0 of 1
Error executing process > 'CHECK_DESIGN (design.csv)'
Caused by:
Cannot find a matching VM image with publisher=microsoft-azure-batch; offer=centos-container; OS type=linux; verification type=verified
[58/55b7f7] process > CHECK_DESIGN (design.csv) [100%] 1 of 1, failed: 1
Error executing process > 'CHECK_DESIGN (design.csv)'
Caused by:
Cannot find a matching VM image with publisher=microsoft-azure-batch; offer=centos-container; OS type=linux; verification type=verified
I tried looking into the source code of nextflow. I found the error to be in AzBatchService.groovy (line number below).
https://github.com/nextflow-io/nextflow/blob/0e593e6ab82880810d8139a4fe6e3c47ff69a531/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L442
I did some further digging in my Azure Batch account instance. Basically, I wanted to confirm if the list of supported images being received from the Azure Batch account has the one that is required for this pipeline. I could confirm that the server did indeed respond with the required image -
What could be the issue here? I remember running the exact same pipeline a few weeks back and it did work a few times. Am I missing something?
Just had another look through the Azure Cloud docs and think this might be relevant:
By default, Nextflow creates CentOS 8-based pool nodes, but this
behavior can be customised in the pool configuration. Below the
configurations for image reference/SKU combinations to select two
popular systems.
Ubuntu 20.04:
sku = "batch.node.ubuntu 20.04"
offer = "ubuntu-server-container"
publisher = "microsoft-azure-batch"
CentOS 8 (default):
sku = "batch.node.centos 8"
offer = "centos-container"
publisher = "microsoft-azure-batch"
I think the issue here is a mismatched nodeAgentSkuId. Nextflow is expecting a CentOS 8 node agent SKU, but you have a CentOS 7 SKU. If it's not possible to change the nodeAgentSkuId somehow, the node agent SKU that Nextflow uses should be able to be overridden by adding this to your nextflow.config:
azure.batch.pools.<name>.sku = 'batch.node.centos 7'
Where <name> is the pool identifier:
azure.batch.pools.<name>.sku
Specify the ID of the Compute Node agent SKU which the pool identified with <name> supports (default: batch.node.centos 8, requires nf-azure#0.11.0).
https://www.nextflow.io/docs/edge/azure.html#advanced-settings

--latency-wait Error MissingOutputException

I am working on snakemake to make a pipeline for my work.
My input files are (Example fastq files)
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-1_R1.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-1_R2.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-2_R1.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-2_R2.fastq.gz
I have written the following code for trimming the data using FastP
"""
Author: Dave Amir
Affiliation: St Lukes
Aim: A simple Snakemake workflow to process paired-end stranded RNA-Seq.
Date: 11 June 2015
Run: snakemake -s snakefile
Latest modification:
- todo
"""
# This should be placed in the Snakefile.
##-----------------------------------------------##
## Working directory ##
## Adapt to your needs ##
##-----------------------------------------------##
BASE_DIR = "/project/ateeq/PROJECT"
WDIR = BASE_DIR + "/snakemake-example"
workdir: WDIR
#message("The current working directory is " + WDIR)
##--------------------------------------------------------------------------------------##
## Variables declaration
## Declaring some variables used by topHat and other tools...
## (GTF file, INDEX, chromosome length)
##--------------------------------------------------------------------------------------##
INDEX = BASE_DIR + "/ref_files/hg19/assembly/"
GTF = BASE_DIR + "/hg19/Hg19_CTAT_resource_lib/ref_annot.gtf"
CHR = BASE_DIR + "/static/humanhg19.annot.csv"
FASTA = BASE_DIR + "/ref_files/hg19/assembly/hg19.fasta"
##--------------------------------------------------------------------------------------##
## The list of samples to be processed
##--------------------------------------------------------------------------------------##
SAMPLES, = glob_wildcards("/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz")
NB_SAMPLES = len(SAMPLES)
for smp in SAMPLES:
message:("Sample " + smp + " will be processed")
##--------------------------------------------------------------------------------------##
## Our First rule - sample trimming
##--------------------------------------------------------------------------------------##
rule final:
input: expand("/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", smp=SAMPLES)
rule trimming:
input: fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
output: fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq", rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
threads: 30
message: """--- Trimming."""
shell: """
fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
"""
when I am running the pipeline it shows the following error
MissingOutputException in line 51 of /project/ateeq/PROJECT/snakemake-example/snakefile:
Missing files after 5 seconds:
/project/ateeq/PROJECT/snakemake-example/trimmed/H667-1_R1_trimmed.fastq
/project/ateeq/PROJECT/snakemake-example/trimmed/H667-1_R2_trimmed.fastq
/project/ateeq/PROJECT/snakemake-example/report/H667-1.html
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Though I have increased the latency wait period for 2000 sec it still ends up throwing an error
snakemake -s snakefile -j 30 --latency-wait 2000
I am using snakemake version 5.4.5 and Python 3.6.8. Please let me know where I am going wrong, it would be a great help to me
Thanks for kind Help,
Sincerely,
Dave
I don't know if this is due to the copy & paste but the indentation is wrong:
rule trimming:
input: fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
output: fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq", rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
threads: 30
message: """--- Trimming."""
shell: """
fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
"""
Shouldn't it be:
rule trimming:
input:
fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",
rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
output:
fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq",
rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq",
rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
threads: 30
message: """--- Trimming."""
shell: """
fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
"""

Snakemake refuses to unpack input function when rule A is a dependency of rule B, but accepts it when rule A is the final rule

I have a snakemake workflow for a metagenomics project. At a point in the workflow, I map DNA sequencing reads (either single or paired-end) to metagenome assemblies made by the same workflow. I made an input function conform the Snakemake manual to map both single end and paired end reads with one rule. like so
import os.path
def get_binning_reads(wildcards):
pathpe=("data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed_paired.R1.fastq.gz")
pathse=("data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed.fastq.gz")
if os.path.isfile(pathpe) == True :
return {'reads' : expand("data/sequencing_binning_signals/{binningsignal}.trimmed_paired.R{PE}.fastq.gz", PE=[1,2],binningsignal=wildcards.binningsignal) }
elif os.path.isfile(pathse) == True :
return {'reads' : expand("data/sequencing_binning_signals/{binningsignal}.trimmed.fastq.gz", binningsignal=wildcards.binningsignal) }
rule backmap_bwa_mem:
input:
unpack(get_binning_reads),
index=expand("data/assembly_{{assemblytype}}/{{hostcode}}/scaffolds_bwa_index/scaffolds.{ext}",ext=['bwt','pac','ann','sa','amb'])
params:
lambda w: expand("data/assembly_{assemblytype}/{hostcode}/scaffolds_bwa_index/scaffolds",assemblytype=w.assemblytype,hostcode=w.hostcode)
output:
"data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam"
threads: 100
log:
stdout="logs/bwa_backmap_samtools_{assemblytype}_{hostcode}.stdout",
samstderr="logs/bwa_backmap_samtools_{assemblytype}_{hostcode}.stdout",
stderr="logs/bwa_backmap_{assemblytype}_{hostcode}.stderr"
shell:
"bwa mem -t {threads} {params} {input.reads} 2> {log.stderr} | samtools view -# 12 -b -o {output} 2> {log.samstderr} > {log.stdout}"
When I make an arbitrary 'all-rule' like this, the workflow runs successfully.
rule allbackmapped:
input:
expand("data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam", binningsignal=BINNINGSIGNALS,assemblytype=ASSEMBLYTYPES,hostcode=HOSTCODES)
However, when the files created by this rule are required for subsequent rules like so:
rule backmap_samtools_sort:
input:
"data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam"
output:
"data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.sorted.bam"
threads: 6
resources:
mem_mb=5000
shell:
"samtools sort -# {threads} -m {mem_mb}M -o {output} {input}"
rule allsorted:
input:
expand("data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.sorted.bam",binningsignal=BINNINGSIGNALS,assemblytype=ASSEMBLYTYPES,hostcode=HOSTCODES)
The workflow closes with this error
WorkflowError in line 416 of
/stor/azolla_metagenome/Azolla_genus_metagenome/Snakefile: Can only
use unpack() on list and dict
To me, this error suggests the input function for the former rule is faulty. This however, seems not to be the case for it ran successfully when no subsequent processing was queued.
The entire project is hosted on github. The entire Snakefile and a github issue.

Mysql seconds_behind master very high

Hi we have mysql master slave replication, master is mysql 5.6 and slave is mysql 5.7, seconds behind master is 245000, how I make it catch up faster. Right now it is taking more than 6 hours to copy 100 000 seconds.
My slave ram is 128 GB. Below is my my.cnf
[mysqld]
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
innodb_buffer_pool_size = 110G
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
# These are commonly set, remove the # and set as required.
basedir = /usr/local/mysql
datadir = /disk1/mysqldata
port = 3306
#server_id = 3
socket = /var/run/mysqld/mysqld.sock
user=mysql
log_error = /var/log/mysql/error.log
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
join_buffer_size = 256M
sort_buffer_size = 128M
read_rnd_buffer_size = 2M
#copied from old config
#key_buffer = 16M
max_allowed_packet = 256M
thread_stack = 192K
thread_cache_size = 8
query_cache_limit = 1M
#disabling query_cache_size and type, for replication purpose, need to enable it when going live
query_cache_size = 0
#query_cache_size = 64M
#query_cache_type = 1
query_cache_type = OFF
#GroupBy
sql_mode=STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
#sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
enforce-gtid-consistency
gtid-mode = ON
log_slave_updates=0
slave_transaction_retries = 100
#replication related changes
server-id = 2
relay-log = /disk1/mysqllog/mysql-relay-bin.log
log_bin = /disk1/mysqllog/binlog/mysql-bin.log
binlog_do_db = brandmanagement
#replicate_wild_do_table=brandmanagement.%
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data\_recent
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data
replicate-wild-ignore-table=brandmanagement.t\_fb\_rt\_data
replicate-wild-ignore-table=brandmanagement.t\_keyword\_tweets
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data\_old
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data\_new
binlog_format=row
report-host=10.125.133.220
report-port=3306
#sync-master-info=1
read-only=1
net_read_timeout = 7200
net_write_timeout = 7200
innodb_flush_log_at_trx_commit = 2
sync_binlog=0
sync_relay_log_info=0
max_relay_log_size=268435456
Lots of possible solutions. But I'll go with the simplest one. Have you got enough network bandwidth to send all changes over the network? You're using "row" binlog, which may be good in case of random, unindexed updates. But if you're changing a lot of data using indexes only, then "mixed" binlog may be better.

Force lshosts command to return megabytes for "maxmem" and "maxswp" parameters

When I type "lshosts" I am given:
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
server1 X86_64 Intel_EM 60.0 12 191.9G 159.7G Yes ()
server2 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
server3 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
I am trying to return maxmem and maxswp as megabytes, not gigabytes when lshosts is called. I am trying to send Xilinx ISE jobs to my LSF, however the software expects integer, megabyte values for maxmem and maxswp. By doing debugging, it appears that the software grabs these parameters using the lshosts command.
I have already checked in my lsf.conf file that:
LSF_UNIT_FOR_LIMTS=MB
I have tried searching the IBM Knowledge Base, but to no avail.
Do you use a specific command to specify maxmem and maxswp units within the lsf.conf, lsf.shared, or other config files?
Or does LSF force return the most practical unit?
Any way to override this?
LSF_UNIT_FOR_LIMITS should work, if you completely drained the cluster of all running, pending, and finished jobs. According to the docs, MB is the default, so I'm surprised.
That said, you can use something like this to transform the results:
$ cat to_mb.awk
function to_mb(s) {
e = index("KMG", substr(s, length(s)))
m = substr(s, 0, length(s) - 1)
return m * 10^((e-2) * 3)
}
{ print $1 " " to_mb($6) " " to_mb($7) }
$ lshosts | tail -n +2 | awk -f to_mb.awk
server1 191900 159700
server2 191900 191200
server3 191900 191200
The to_mb function should also handle 'K' or 'M' units, should those pop up.
If LSF_UNIT_FOR_LIMITS is defined in lsf.conf, lshosts will always print the output as a floating point number, and in some versions of LSF the parameter is defined as 'KB' in lsf.conf upon installation.
Try searching for any definitions of the parameter in lsf.conf and commenting them all out so that the parameter is left undefined, I think in that case it defaults to printing it out as an integer in megabytes.
(Don't ask me why it works this way)

Resources