How can I get qstat to give me full job names?
I know that qstat -r gives detailed information about the task, but its output is too verbose and includes the resource requirements.
The qstat -r output is like:
131806 0.25001 tumor_foca ajalali qw 09/29/2014 15:49:41 1 2-100:1
    Full jobname:     tumor_focality-TCGA-THCA-ratboost_linear_svc
    Hard Resources:   distribution=wheezy (0.000000)
                      h_rt=72000 (0.000000)
                      mem_free=15G (0.000000)
                      h_vmem=15G (0.000000)
                      h_stack=256M (0.000000)
    Soft Resources:
131807 0.25001 vital_stat ajalali qw 09/29/2014 15:49:41 1 2-100:1
    Full jobname:     vital_status-TCGA-LGG-ratboost_linear_svc
    Hard Resources:   distribution=wheezy (0.000000)
                      h_rt=72000 (0.000000)
                      mem_free=15G (0.000000)
                      h_vmem=15G (0.000000)
                      h_stack=256M (0.000000)
    Soft Resources:
Right now my only option is to grep the output as I need:
$ qstat -r | grep "Full jobname" -B1
--
131806 0.25001 tumor_foca ajalali qw 09/29/2014 15:49:41 1 2-100:1
Full jobname: tumor_focality-TCGA-THCA-ratboost_linear_svc
--
131807 0.25001 vital_stat ajalali qw 09/29/2014 15:49:41 1 2-100:1
Full jobname: vital_status-TCGA-LGG-ratboost_linear_svc
Is there a better way to get nicer output?
This one is a bit messy, but it works as a simple solution to keep in the command history. All standard tools. The output is pretty much the same as what you get from a normal qstat call, but you won't get the headers:
One-liner:
qstat -xml | tr '\n' ' ' | sed 's#<job_list[^>]*>#\n#g' \
| sed 's#<[^>]*>##g' | grep " " | column -t
Description of commands:
List jobs as XML:
qstat -xml
Remove all newlines:
tr '\n' ' '
Add newline before each job entry in the list:
sed 's#<job_list[^>]*>#\n#g'
Remove all XML stuff:
sed 's#<[^>]*>##g'
Hack to add newline at the end:
grep " "
Columnize:
column -t
Example output
351996 0.50502 ProjectA_XXXXXXXXX_XXXX_XXXXXX user123 r 2015-06-25T15:38:41 xxxxx-sim01#xxxxxx02.xxxxx.xxx 1
351997 0.50502 ProjectA_XXX_XXXX_XXX user123 r 2015-06-25T15:39:26 xxxxx-sim01#xxxxxx23.xxxxx.xxx 1
351998 0.50502 ProjectA_XXXXXXXXXXXXX_XXXX_XXXX user123 r 2015-06-25T15:40:26 xxxxx-sim01#xxxxxx14.xxxxx.xxx 1
351999 0.50502 ProjectA_XXXXXXXXXXXXXXXXX_XXXX_XXXX user123 r 2015-06-25T15:42:11 xxxxx-sim01#xxxxxx19.xxxxx.xxx 1
352001 0.50502 ProjectA_XXXXXXXXXXXXXXXXXXXXXXX_XXXX_XXXX user123 r 2015-06-25T15:42:11 xxxxx-sim01#xxxxxx11.xxxxx.xxx 1
352008 0.50501 runXXXX69 usr1 r 2015-06-25T15:49:04 xxxxx-sim01#xxxxxx17.xxxxx.xxx 1
352009 0.50501 runXXXX70 usr1 r 2015-06-25T15:49:04 xxxxx-sim01#xxxxxx01.xxxxx.xxx 1
352010 0.50501 runXXXX71 usr1 r 2015-06-25T15:49:04 xxxxx-sim01#xxxxxx06.xxxxx.xxx 1
352011 0.50501 runXXXX72 usr1 r 2015-06-25T15:49:04 xxxxx-sim01#xxxxxx21.xxxxx.xxx 1
352012 0.50501 runXXXX73 usr1 r 2015-06-25T15:49:04 xxxxx-sim01#xxxxxx13.xxxxx.xxx 1
352013 0.50501 runXXXX74 usr1 r 2015-06-25T15:49:04 xxxxx-sim01#xxxxxx11.xxxxx.xxx 1
Maybe an easier solution: set SGE_LONG_JOB_NAMES to -1, and qstat will figure out the size of the name column:
export SGE_LONG_JOB_NAMES=-1
qstat -u username
Works for me.
Cheers!
This script works pretty well. It looks like it is from Cambridge: http://www.hep.ph.ic.ac.uk/~dbauer/grid/myqstat.py
For Python 3:
#!/usr/bin/python
import xml.dom.minidom
import os
import sys
import string
f=os.popen('qstat -u \* -xml -r')
dom=xml.dom.minidom.parse(f)
jobs=dom.getElementsByTagName('job_info')
run=jobs[0]
runjobs=run.getElementsByTagName('job_list')
def fakeqstat(joblist):
    for r in joblist:
        try:
            jobname=r.getElementsByTagName('JB_name')[0].childNodes[0].data
            jobown=r.getElementsByTagName('JB_owner')[0].childNodes[0].data
            jobstate=r.getElementsByTagName('state')[0].childNodes[0].data
            jobnum=r.getElementsByTagName('JB_job_number')[0].childNodes[0].data
            jobtime='not set'
            if(jobstate=='r'):
                jobtime=r.getElementsByTagName('JAT_start_time')[0].childNodes[0].data
            elif(jobstate=='dt'):
                jobtime=r.getElementsByTagName('JAT_start_time')[0].childNodes[0].data
            else:
                jobtime=r.getElementsByTagName('JB_submission_time')[0].childNodes[0].data
            print(jobnum, '\t', jobown.ljust(16), '\t', jobname.ljust(16),'\t', jobstate,'\t',jobtime)
        except Exception as e:
            print(e)

fakeqstat(runjobs)
For Python 2:
#!/usr/bin/python
import xml.dom.minidom
import os
import sys
import string
#import re
f=os.popen('qstat -u \* -xml -r')
dom=xml.dom.minidom.parse(f)
jobs=dom.getElementsByTagName('job_info')
run=jobs[0]
runjobs=run.getElementsByTagName('job_list')
def fakeqstat(joblist):
    for r in joblist:
        jobname=r.getElementsByTagName('JB_name')[0].childNodes[0].data
        jobown=r.getElementsByTagName('JB_owner')[0].childNodes[0].data
        jobstate=r.getElementsByTagName('state')[0].childNodes[0].data
        jobnum=r.getElementsByTagName('JB_job_number')[0].childNodes[0].data
        jobtime='not set'
        if(jobstate=='r'):
            jobtime=r.getElementsByTagName('JAT_start_time')[0].childNodes[0].data
        elif(jobstate=='dt'):
            jobtime=r.getElementsByTagName('JAT_start_time')[0].childNodes[0].data
        else:
            jobtime=r.getElementsByTagName('JB_submission_time')[0].childNodes[0].data
        print jobnum, '\t', jobown.ljust(16), '\t', jobname.ljust(16),'\t', jobstate,'\t',jobtime

fakeqstat(runjobs)
I am currently writing my own qstat wrapper in order to get a clean, useful and customizable output.
Here is the GitHub repository. The project has grown too much for the code to be pasted in this message.
It comes with an installer and should work without any problem with both Python 2.7 and 3 (the installation script makes the necessary modifications if needed). qjobs -h provides some help on the available options. I will write more complete documentation in the coming days on the GitHub wiki.
I will update this message as often as possible to keep it in line with the current state of the project. Please feel free to comment here (or on GitHub) to ask for features or report problems.
In the near future, I will try to add a fully interactive mode to browse the job list more easily. Of course, the classic text output will still be available (it can be useful for e-mailing the output, or for a quick check of the pending/running jobs).
Example output
Command qjobs gives:
5599109 short_name r 2015-06-25 10:27:39 queue1
5599110 jobName r 2015-06-25 10:35:39 queue2
5599111 a_long_job_name qw 2015-06-25 10:40:39
5599112 foo qw 2015-06-25 10:40:39
5599113 bar qw 2015-06-25 10:40:39
5599114 baz qw 2015-06-25 10:40:39
5599115 beer qw 2015-06-25 10:40:39
tot: 7
r: 2 qw: 5
Command qjobs -o gives:
tot: 7
r: 2 qw: 5
Command qjobs -o inek -t gives the following (e is the elapsed time since the start/submission time, and its format is customizable using Python's Format Specification Mini-Language; k is the complete queue name, with domain; a short stdlib illustration of the formatting follows the output below):
5598985 SpongeBob 522:02 (21.75 days) queue1#node23.domain.fake
5598987 ping_java 521:47 (21.74 days) queue1#node39.domain.fake
5598988 run3.14 521:46 (21.74 days) queue2#node40.domain.fake
5598990 strange_job_42 521:42 (21.74 days) queue3#node36.domain.fake
5598991 coffee-maker 521:39 (21.74 days) queue2#node34.domain.fake
5598992 dumbtask 521:29 (21.73 days) queue1#node14.domain.fake
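The remark above about the Format Specification Mini-Language refers to a standard Python feature; the following minimal, self-contained snippet (not taken from qjobs itself, and with an invented elapsed time) shows how an elapsed-time string in the same shape as the column above can be produced with it:

# Toy example of Python's Format Specification Mini-Language (stdlib only).
# 'hours' is a made-up elapsed time; the output mimics "522:02 (21.75 days)".
hours = 522 + 2 / 60
h, m = int(hours), round((hours % 1) * 60)
print('{:d}:{:02d} ({:.2f} days)'.format(h, m, hours / 24))
# prints: 522:02 (21.75 days)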
qjobs -i gives a complete list of the available 'items'. Each of these items is available as:
a column in the output (with -o ITEMS);
a criterion to count the jobs by and produce totals, with -t (e.g. -t s counts by state, as in the first two examples);
a criterion to sort the jobs with -s; the default is -s ips, meaning that the job list is sorted by ID, then by priority and finally by state before being printed.
The result of qjobs -i is:
i: job id
p: job priority
n: job name
o: job owner
s: job state
t: job start/submission time
e: elapsed time since start/submission
q: queue name without domain
d: queue domain
k: queue name with domain
r: requested queue(s)
l: number of slots used
Thanks to JLT for the nice, simple code. I've expanded it a bit to fit my needs and make it look nicer.
Sample Output:
Job ID Job Name Owner Status
------ ------------------------------------ ------ ------
201716 AtacSilN100400K mtsige R
201771 IsoOnGrap400K mtsige R
202067 AtacOnSilica400K mtsige R
202100 AtacGrapN100400K mtsige R
202135 AtacOnSilc400K mtsige R
202145 AtacOnGrap400K mtsige R
202152 AtacOnGraphN3360K mtsige R
202161 AtacticSilicaN10 mtsige R
202163 AtacGrapN10 mtsige R
202169 AtacSilcN10 mtsige R
202192 wallpmma07 am110 R
202193 wallpmma03 am110 R
202194 att03wpm_95solps am110 R
202202 AtacticSilicaN3 mtsige R
203260 8test18_trop_2p ico R
203359 parseAll_Bob/Sub951By50/Cyl20A_atom1 oge1 R
203360 parseAll_Bob/Sub951By50/Cyl30A_atom1 oge1 R
203361 parseAll_Bob/Sub951By50/Cyl30A_atom2 oge1 R
Code:
#!/opt/bin/python3
import os
import xml.etree.ElementTree as ET

# Fields
fields = ['Job_Id', 'Job_Name', 'Job_Owner', 'job_state']
names = ['Job ID', 'Job Name', 'Owner', 'Status']

# Get job info
f = os.popen('qstat -x')
tree = ET.parse(f)
root = tree.getroot()
n_fields = len(fields)
jobs = [[job.find(field).text for field in fields] for job in root]
max_lengths = [len(name) for name in names]
sep = ' '

# Identify max character length per field
for j in jobs:
    for i in range(n_fields):
        # Chop off anything after and including '#' or '.' from all fields
        if j[i].find('#') > 0:
            j[i] = j[i][:j[i].find('#')]
        if j[i].find('.') > 0:
            j[i] = j[i][:j[i].find('.')]
        if len(j[i]) > max_lengths[i]:
            max_lengths[i] = len(j[i])

# Field names
for i in range(n_fields):
    print('{s:^{length}}'.format(s=names[i], length=max_lengths[i]), end=sep)
print()

# Dashes
for i in range(n_fields):
    print('-' * max_lengths[i], end=sep)
print()

# Jobs
for j in jobs:
    for i in range(n_fields):
        if j[i].find('#') > 0:
            j[i] = j[i][:j[i].find('#')]
        print('{s:<{length}}'.format(s=j[i], length=max_lengths[i]), end=sep)
    print()
If you just want the names:
qstat -f | grep 'Job_Name'
Example of output:
Job_Name = File.output
Job_Name = file.out
For me the script from Physical Chemist didn't work, so I wrote a very simple script using the xml.etree.ElementTree module, which I regard as somewhat easier than xml.dom.minidom.
import os
import xml.etree.ElementTree as ET
f = os.popen('qstat -x')
tree = ET.parse(f)
root = tree.getroot()
print "Job_Id walltime state nodes Job_Name"
print "------ -------- ----- --------------- --------------------------"
for job in root:
    print job.find('Job_Id').text, " ",
    print job.find('resources_used').find('walltime').text, " ",
    print job.find('job_state').text, " ",
    print job.find('Resource_List').find('nodes').text, " ",
    print job.find('Job_Name').text
A poor KISS solution:
qstat -xml -f -u \* | fgrep JB_name | wc -l
Python code (using xmltodict and pandas):
import xmltodict
import subprocess as sp
import pandas as pd
qstat_xml = sp.check_output(['qstat','--xml'], stderr=sp.STDOUT) # read xml
stat_dict = xmltodict.parse(qstat_xml) # convert to dict
job_list = stat_dict['Data']['Job'] # select job_list
job_df = pd.DataFrame(job_list) # convert to dataframe
print('columns', job_df.columns) # print available columns
column_list = ['Job_Id', 'Job_Name']
selection_df = job_df[column_list] # select columns
print(selection_df)
Related
I have an app that parses multiple Cisco show tech files. These files contain the output of multiple router commands in a structured way; let me show you a snippet of a show tech output:
`show clock`
20:20:50.771 UTC Wed Sep 07 2022
Time source is NTP
`show callhome`
callhome disabled
Callhome Information:
<SNIPET>
`show module`
Mod Ports Module-Type Model Status
--- ----- ------------------------------------- --------------------- ---------
1 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
2 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
3 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
4 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
21 0 Fabric Module N9K-C9504-FM-R ok
22 0 Fabric Module N9K-C9504-FM-R ok
23 0 Fabric Module N9K-C9504-FM-R ok
<SNIPET>
My app currently uses both SED and Python scripts to parse these files. I use SED to scan the show tech file looking for a specific command's output; once I find it, I stop SED. This way I don't need to read the whole file (these can get to be very big files). This is a snippet of my SED script:
sed -E -n '/`show running-config`|`show running`|`show running config`/{
p
:loop
n
p
/`show/q
b loop
}' $1/$file
As you can see, I am using a multi-address range in SED. My question, specifically, is: how can I achieve something similar in Python? I have tried multiple combinations of the DOTALL and MULTILINE flags, but I can't get the result I'm expecting. For example, I can get a match for the command I'm looking for, but the Python regex won't stop until the end of the file after the first match.
I am looking for something like this
sed -n '/`show clock`/,/`show/p'
I would like the regex match to stop parsing the file and print the results immediately after seeing `show again. I hope that makes sense; thank you all for reading and for your help.
You can use nested loops.
import re

def process_file(filename):
    with open(filename) as f:
        for line in f:
            if re.search(r'`show running-config`|`show running`|`show running config`', line):
                print(line)
                for line1 in f:
                    print(line1)
                    if re.search(r'`show', line1):
                        return
The inner for loop will start from the next line after the one processed by the outer loop.
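As a small, self-contained illustration of that point (added here, not part of the original answer), two for-loops over the same file object share a single iterator, so the inner loop resumes on the line after the one the outer loop just consumed:

import io

# A fake "file" standing in for a show tech output; the marker lines are invented.
f = io.StringIO("`show clock`\ntime info\n`show module`\nmodule info\n")

for line in f:
    if line.startswith("`show clock`"):
        for rest in f:                    # resumes on the line after the match
            print(rest, end="")
            if rest.startswith("`show"):  # stop at the next command marker
                break
        break
# prints:
# time info
# `show module`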
You can also do it with a single loop using a flag variable.
import re

def process_file(filename):
    in_show = False
    with open(filename) as f:
        for line in f:
            if re.search(r'`show running-config`|`show running`|`show running config`', line):
                in_show = True
                print(line)
                continue  # don't let the opening marker trigger the stop check below
            if in_show:
                print(line)
                if re.search(r'`show', line):
                    return
I am noticing that all my rules request memory twice: once at a lower maximum than what I requested (mem_mb), and once at what I actually requested (mem_gb). If I run the rules as localrules, they do run faster. How can I make sure the default settings do not interfere?
resources: mem_mb=100, disk_mb=8620, tmpdir=/tmp/pop071.54835, partition=h24, qos=normal, mem_gb=100, time=120:00:00
The rules are as follows:
rule bwa_mem2_mem:
    input:
        R1 = "data/results/qc/{species}.{population}.{individual}_1.fq.gz",
        R2 = "data/results/qc/{species}.{population}.{individual}_2.fq.gz",
        R1_unp = "data/results/qc/{species}.{population}.{individual}_1_unp.fq.gz",
        R2_unp = "data/results/qc/{species}.{population}.{individual}_2_unp.fq.gz",
        idx = "data/results/genome/genome",
        ref = "data/results/genome/genome.fa"
    output:
        bam = "data/results/mapped_reads/{species}.{population}.{individual}.bam",
    log:
        bwa = "logs/bwa_mem2/{species}.{population}.{individual}.log",
        sam = "logs/samtools_view/{species}.{population}.{individual}.log",
    benchmark:
        "benchmark/bwa_mem2_mem/{species}.{population}.{individual}.tsv",
    resources:
        time = parameters["bwa_mem2"]["time"],
        mem_gb = parameters["bwa_mem2"]["mem_gb"],
    params:
        extra = parameters["bwa_mem2"]["extra"],
        tag = compose_rg_tag,
    threads:
        parameters["bwa_mem2"]["threads"],
    shell:
        "bwa-mem2 mem -t {threads} -R '{params.tag}' {params.extra} {input.idx} {input.R1} {input.R2} | "
        "samtools sort -l 9 -o {output.bam} --reference {input.ref} --output-fmt CRAM -# {threads} /dev/stdin 2> {log.sam}"
and the config is:
cluster:
  mkdir -p logs/{rule} && # change the log file to logs/slurm/{rule}
  sbatch
    --partition={resources.partition}
    --time={resources.time}
    --qos={resources.qos}
    --cpus-per-task={threads}
    --mem={resources.mem_gb}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
    --parsable # Required to pass job IDs to scancel
default-resources:
  - partition=h24
  - qos=normal
  - mem_gb=100
  - time="04:00:00"
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True # Required to run with local conda environment
cluster-status: status-sacct.sh # Required to monitor the status of the submitted jobs
cluster-cancel: scancel # Required to cancel the jobs with Ctrl + C
cluster-cancel-nargs: 50
Cheers,
Angel
Right now there are two separate memory resource requirements:
mem_mb
mem_gb
From the perspective of Snakemake these are different resources, so both will be passed to the cluster. A quick fix is to use the same units; e.g. if the rule really requires only 100 MB, then the default resources should be changed to:
default-resources:
  - partition=h24
  - qos=normal
  - mem_mb=100
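For the two requests to collapse into one, the rule itself presumably also has to declare its memory as mem_mb rather than mem_gb, and the cluster profile would then pass --mem={resources.mem_mb} instead of --mem={resources.mem_gb}. A minimal sketch, assuming mem_mb is the single memory resource; the rule name, output path and 15000 MB value are illustrative and not taken from the question:

# Sketch of a rule that declares memory once, in MB, so only one memory
# request is generated and forwarded to Slurm via --mem={resources.mem_mb}.
rule example_consistent_memory:
    output:
        "results/example.txt"      # hypothetical output
    resources:
        mem_mb=15000,              # illustrative value, replacing mem_gb
        time="04:00:00"
    shell:
        "echo done > {output}"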
I am trying to write Chinese characters to a CSV file based on their Unicode code points found in a text file in unicode.org/Public/zipped/13.0.0/Unihan.zip. For instance, one example character is U+9109.
In the example below I can get the correct output by hard coding the value (line 8), but keep getting it wrong with every permutation I've tried at generating the bytes from the code point (lines 14-16).
I'm running this in Python 3.8.3 on a Debian-based Linux distro.
Minimal working (broken) example:
 1 #!/usr/bin/env python3
 2
 3 def main():
 4
 5     output = open("test.csv", "wb")
 6
 7     # Hardcoded values work just fine
 8     output.write('\u9109'.encode("utf-8"))
 9
10     # Comma separation
11     output.write(','.encode("utf-8"))
12
13     # Problem is here
14     codepoint = '9109'
15     u_str = '\\' + 'u' + codepoint
16     output.write(u_str.encode("utf-8"))
17
18     # End with newline
19     output.write('\n'.encode("utf-8"))
20
21     output.close()
22
23 if __name__ == "__main__":
24     main()
Executing and viewing results:
example $
example $./test.py
example $
example $cat test.csv
鄉,\u9109
example $
The expected output would look like this (Chinese character occurring on both sides of the comma):
example $
example $./test.py
example $cat test.csv
鄉,鄉
example $
chr is used to convert integers to code points in Python 3. Your code could use:
output.write(chr(0x9109).encode("utf-8"))
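Since the code points in the Unihan data arrive as hex strings such as '9109' rather than integers, they presumably need an int(..., 16) conversion first; a small sketch reusing the names from the question:

codepoint = '9109'                  # hex string as read from the data file
char = chr(int(codepoint, 16))      # int('9109', 16) == 0x9109, so char == '鄉'
output.write(char.encode("utf-8"))  # 'output' is the binary file object from the question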
But if you specify the encoding in open() instead of using binary mode, you don't have to manually encode everything. print to a file handles the newline for you as well.
with open("test.txt",'w',encoding='utf-8') as output:
for i in range(0x4e00,0x4e10):
print(f'U+{i:04X} {chr(i)}',file=output)
Output:
U+4E00 一
U+4E01 丁
U+4E02 丂
U+4E03 七
U+4E04 丄
U+4E05 丅
U+4E06 丆
U+4E07 万
U+4E08 丈
U+4E09 三
U+4E0A 上
U+4E0B 下
U+4E0C 丌
U+4E0D 不
U+4E0E 与
U+4E0F 丏
I am working with Snakemake to build a pipeline for my work.
My input files are (example FASTQ files):
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-1_R1.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-1_R2.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-2_R1.fastq.gz
/project/ateeq/PROJECT/snakemake-example/raw_data/H667-2_R2.fastq.gz
I have written the following code for trimming the data using fastp:
"""
Author: Dave Amir
Affiliation: St Lukes
Aim: A simple Snakemake workflow to process paired-end stranded RNA-Seq.
Date: 11 June 2015
Run: snakemake -s snakefile
Latest modification:
- todo
"""
# This should be placed in the Snakefile.
##-----------------------------------------------##
## Working directory ##
## Adapt to your needs ##
##-----------------------------------------------##
BASE_DIR = "/project/ateeq/PROJECT"
WDIR = BASE_DIR + "/snakemake-example"
workdir: WDIR
#message("The current working directory is " + WDIR)
##--------------------------------------------------------------------------------------##
## Variables declaration
## Declaring some variables used by topHat and other tools...
## (GTF file, INDEX, chromosome length)
##--------------------------------------------------------------------------------------##
INDEX = BASE_DIR + "/ref_files/hg19/assembly/"
GTF = BASE_DIR + "/hg19/Hg19_CTAT_resource_lib/ref_annot.gtf"
CHR = BASE_DIR + "/static/humanhg19.annot.csv"
FASTA = BASE_DIR + "/ref_files/hg19/assembly/hg19.fasta"
##--------------------------------------------------------------------------------------##
## The list of samples to be processed
##--------------------------------------------------------------------------------------##
SAMPLES, = glob_wildcards("/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz")
NB_SAMPLES = len(SAMPLES)
for smp in SAMPLES:
message:("Sample " + smp + " will be processed")
##--------------------------------------------------------------------------------------##
## Our First rule - sample trimming
##--------------------------------------------------------------------------------------##
rule final:
input: expand("/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", smp=SAMPLES)
rule trimming:
input: fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
output: fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq", rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
threads: 30
message: """--- Trimming."""
shell: """
fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
"""
When I run the pipeline, it shows the following error:
MissingOutputException in line 51 of /project/ateeq/PROJECT/snakemake-example/snakefile:
Missing files after 5 seconds:
/project/ateeq/PROJECT/snakemake-example/trimmed/H667-1_R1_trimmed.fastq
/project/ateeq/PROJECT/snakemake-example/trimmed/H667-1_R2_trimmed.fastq
/project/ateeq/PROJECT/snakemake-example/report/H667-1.html
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Though I have increased the latency-wait period to 2000 seconds, it still ends up throwing the error:
snakemake -s snakefile -j 30 --latency-wait 2000
I am using Snakemake version 5.4.5 and Python 3.6.8. Please let me know where I am going wrong; it would be a great help to me.
Thanks for the kind help,
Sincerely,
Dave
I don't know if this is due to the copy & paste but the indentation is wrong:
rule trimming:
input: fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
output: fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq", rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq", rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
threads: 30
message: """--- Trimming."""
shell: """
fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
"""
Shouldn't it be:
rule trimming:
    input:
        fwd="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R1.fastq.gz",
        rev="/project/ateeq/PROJECT/snakemake-example/raw_data/{smp}_R2.fastq.gz"
    output:
        fwd="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R1_trimmed.fastq",
        rev="/project/ateeq/PROJECT/snakemake-example/trimmed/{smp}_R2_trimmed.fastq",
        rep="/project/ateeq/PROJECT/snakemake-example/report/{smp}.html"
    threads: 30
    message: """--- Trimming."""
    shell: """
        fastp -i {input.fwd} -I {input.rev} -o {output.fwd} -O {output.rev} --detect_adapter_for_pe --disable_length_filtering --correction --qualified_quality_phred 30 --thread 16 --html {output.rep} --report_title "Fastq Quality Control Report" &>>{input.fwd}.log
        """
I have a Snakemake workflow for a metagenomics project. At one point in the workflow, I map DNA sequencing reads (either single- or paired-end) to metagenome assemblies made by the same workflow. I made an input function, following the Snakemake manual, to map both single-end and paired-end reads with one rule, like so:
import os.path

def get_binning_reads(wildcards):
    pathpe = ("data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed_paired.R1.fastq.gz")
    pathse = ("data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed.fastq.gz")
    if os.path.isfile(pathpe) == True:
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed_paired.R{PE}.fastq.gz", PE=[1,2], binningsignal=wildcards.binningsignal)}
    elif os.path.isfile(pathse) == True:
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed.fastq.gz", binningsignal=wildcards.binningsignal)}
rule backmap_bwa_mem:
    input:
        unpack(get_binning_reads),
        index=expand("data/assembly_{{assemblytype}}/{{hostcode}}/scaffolds_bwa_index/scaffolds.{ext}", ext=['bwt','pac','ann','sa','amb'])
    params:
        lambda w: expand("data/assembly_{assemblytype}/{hostcode}/scaffolds_bwa_index/scaffolds", assemblytype=w.assemblytype, hostcode=w.hostcode)
    output:
        "data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam"
    threads: 100
    log:
        stdout="logs/bwa_backmap_samtools_{assemblytype}_{hostcode}.stdout",
        samstderr="logs/bwa_backmap_samtools_{assemblytype}_{hostcode}.stdout",
        stderr="logs/bwa_backmap_{assemblytype}_{hostcode}.stderr"
    shell:
        "bwa mem -t {threads} {params} {input.reads} 2> {log.stderr} | samtools view -# 12 -b -o {output} 2> {log.samstderr} > {log.stdout}"
When I make an arbitrary 'all-rule' like this, the workflow runs successfully.
rule allbackmapped:
    input:
        expand("data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam", binningsignal=BINNINGSIGNALS, assemblytype=ASSEMBLYTYPES, hostcode=HOSTCODES)
However, when the files created by this rule are required for subsequent rules like so:
rule backmap_samtools_sort:
    input:
        "data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.bam"
    output:
        "data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.sorted.bam"
    threads: 6
    resources:
        mem_mb=5000
    shell:
        "samtools sort -# {threads} -m {mem_mb}M -o {output} {input}"

rule allsorted:
    input:
        expand("data/assembly_{assemblytype}_binningsignals/{hostcode}/{binningsignal}.sorted.bam", binningsignal=BINNINGSIGNALS, assemblytype=ASSEMBLYTYPES, hostcode=HOSTCODES)
The workflow exits with this error:
WorkflowError in line 416 of /stor/azolla_metagenome/Azolla_genus_metagenome/Snakefile:
Can only use unpack() on list and dict
To me, this error suggests that the input function for the former rule is faulty. That, however, seems not to be the case, for it ran successfully when no subsequent processing was queued.
The entire project is hosted on GitHub; see the entire Snakefile and a GitHub issue.
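One observation added here, not part of the original post: unpack() raises exactly this error when the input function returns something other than a list or dict, which the function above does whenever neither file exists, since neither branch is taken and it implicitly returns None. A sketch of a variant that always returns a dict or fails loudly; the final else branch and its message are a hypothetical addition:

import os.path

def get_binning_reads(wildcards):
    pathpe = "data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed_paired.R1.fastq.gz"
    pathse = "data/sequencing_binning_signals/" + wildcards.binningsignal + ".trimmed.fastq.gz"
    if os.path.isfile(pathpe):
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed_paired.R{PE}.fastq.gz",
                                PE=[1, 2], binningsignal=wildcards.binningsignal)}
    elif os.path.isfile(pathse):
        return {'reads': expand("data/sequencing_binning_signals/{binningsignal}.trimmed.fastq.gz",
                                binningsignal=wildcards.binningsignal)}
    else:
        # Returning None here is what triggers "Can only use unpack() on list and dict";
        # raise instead so the real problem is visible.
        raise FileNotFoundError("No trimmed reads found for binning signal " + wildcards.binningsignal)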