I am writing a Snakemake workflow that will run a bioinformatics pipeline for several input samples. The input files (two per analysis, one containing the partial string match R1 and the other R2) start with a common pattern and end with the extension .fastq.gz. Eventually I want to perform multiple operations, but for this example I just want to align the fastq reads against a reference genome using bwa mem. So for this example my input file is NIPT-N2002394-LL_S19_R1_001.fastq.gz and I want to generate NIPT-N2002394-LL.bam (see the code below for the directories where the input and output live).
My config.yaml file looks like so:
# Run_ID
run: "200311_A00154_0454_AHHHKMDRXX"
# Base directory: the analysis directory from which I will fetch the samples
bd: "/nexusb/nipt/"
# Define the prefix
# will be used to subset the folders in bd
prefix: "NIPT"
# Reference:
ref: "/nexus/bhinckel/19/ONT_projects/PGD_breakpoint/ref_hg19_local/hg19_chr1-y.fasta"
And below is my snakefile
import os
import re
#############
# config file
#############
configfile: "config.yaml"
#######################################
# Parsing variables from config.yaml
#######################################
RUN = config['run']
BD = config['bd']
PREFIX = config['prefix']
FQDIR = f'/nexusb/Novaseq/{RUN}/Unaligned/'
BASEDIR = BD + RUN
SAMPLES = [sample for sample in os.listdir(BASEDIR) if sample.startswith(PREFIX)]
# explanation: in BASEDIR I have multiple subdirectories. The names of the subdirectories starting with PREFIX become the elements of the list SAMPLES, which will eventually be my {sample} wildcard
#############
# RULES
#############
rule all:
input:
expand("aligned/{sample}.bam", sample = SAMPLES)
rule bwa_map:
input:
REF = config['ref'],
R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
output:
"aligned/{sample}.bam"
shell:
"bwa mem {input.REF} {input.R1} {input.R2}| samtools view -Sb - > {output}"
But I am getting:
Building DAG of jobs...
WildcardError in line 55 of /nexusb/nipt/200311_A00154_0454_AHHHKMDRXX/testMetrics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
's'
This happens when calling snakemake -np.
I believe my error lies in the definitions of R1 and R2 in the input directive. I find it puzzling because, according to the official documentation, Snakemake should interpret any wildcard as the regex .+. But it is not doing that for sample NIPT-PearlPPlasma-05-PPx, whose R1 and R2 should be NIPT-PearlPPlasma-05-PPx_S5_R1_001.fastq.gz and NIPT-PearlPPlasma-05-PPx_S5_R2_001.fastq.gz, respectively.
Take another look at the Snakemake tutorial on how input is inferred from output. Anyway, I think the problem lies in this piece of code:
output:
expand("aligned/{sample}.bam", sample = SAMPLES)
And needs to be changed into
output:
"aligned/{sample}.bam"
What you had didn't work because expand("aligned/{sample}.bam", sample = SAMPLES) is evaluated up front into a concrete list like ["aligned/sample0.bam", "aligned/sample1.bam"]. When you remove the expand, you only give a "description" of what the output should look like, and thus Snakemake can infer the wildcards and the input.
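To illustrate with hypothetical sample names:
# expand("aligned/{sample}.bam", sample=["NIPT-A", "NIPT-B"])
#   -> ["aligned/NIPT-A.bam", "aligned/NIPT-B.bam"]   # already concrete, nothing left to infer
# "aligned/{sample}.bam"
#   -> a pattern: requesting aligned/NIPT-A.bam sets wildcards.sample to "NIPT-A"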
edit:
It's difficult to test since I don't have the actual files, but you should do something like this. Note that it won't work if multiple S-thingies exist for the same sample.
def get_reads(wildcards):
R1 = FQDIR + f"{wildcards.sample}_S{{s}}_R1_001.fastq.gz"
R2 = FQDIR + f"{wildcards.sample}_S{{s}}_R2_001.fastq.gz"
globbed = glob_wildcards(R1)
R1, R2 = expand([R1, R2], s=globbed.s)
return {"R1": R1, "R2": R2}
rule bwa_map:
input:
unpack(get_reads),
REF = config['ref']
output:
"aligned/{sample}.bam"
shell:
"bwa mem {input.REF} {input.R1} {input.R2}| samtools view -Sb - > {output}"
The problem is here:
rule bwa_map:
input:
REF = config['ref'],
R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
output:
"aligned/{sample}.bam"
Your output clearly defines a pattern where {sample} is a wildcard. When Snakemake builds the DAG and finds that another rule requires a file matching this pattern, it assigns a concrete value to wildcards.sample. At that moment all the inputs must be resolvable, but you are introducing one more level of indirection: the wildcard {s}, which is not defined by the output.
The value of {s} must be inferable from the output. If you can determine it at design time, substitute it with the concrete values; otherwise you may use the checkpoint feature of Snakemake.
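For example, the design-time option could look roughly like this (a sketch, untested against your data; it assumes exactly one R1/R2 pair per sample in FQDIR):
import glob  # os and re are already imported at the top of the Snakefile

# resolve the S-number for each sample once, when the Snakefile is parsed
S_NUMBER = {}
for sample in SAMPLES:
    hits = glob.glob(FQDIR + f"{sample}_S*_R1_001.fastq.gz")
    if hits:
        S_NUMBER[sample] = re.search(r"_S(\d+)_R1_001", os.path.basename(hits[0])).group(1)

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUMBER[wc.sample]}_R1_001.fastq.gz",
        R2 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUMBER[wc.sample]}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"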
Related
I have a Nextflow script that runs a couple of processes on a single vcf file. The file is named 'bos_taurus.vcf' and is located at /input_files/bos_taurus.vcf. The directory input_files/ also contains another file, 'sacharomyces_cerevisea.vcf'. I would like my Nextflow script to process both files. I was trying to use a glob pattern like ch_1 = channel.fromPath("/input_files/*.vcf"), but sadly I can't find a working solution. Any help would be really appreciated.
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// here I tried to use globbing
params.input_files = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/input_files/*.vcf"
params.results_dir = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/results"
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
// how can I make this process work on two files simultaneously
process FILTERING {
publishDir("${params.results_dir}/after_filtering", mode: 'copy')
input:
path(input_files)
output:
path("*")
script:
"""
vcftools --vcf ${input_files} --mac 1 --minQ 20 --recode --recode-INFO-all --out after_filtering.vcf
"""
}
Note that if your VCF files are actually bgzip compressed and tabix indexed, you could instead use the fromFilePairs factory method to create your input channel. For example:
params.vcf_files = "./input_files/*.vcf.gz{,.tbi}"
params.results_dir = "./results"
process FILTERING {
tag { sample }
publishDir("${params.results_dir}/after_filtering", mode: 'copy')
input:
tuple val(sample), path(indexed_vcf)
output:
tuple val(sample), path("${sample}.filtered.vcf")
"""
vcftools \\
--gzvcf "${indexed_vcf.first()}" \\
--mac 1 \\
--minQ 20 \\
--recode \\
--recode-INFO-all \\
--out "${sample}.filtered.vcf"
"""
}
workflow {
vcf_files = Channel.fromFilePairs( params.vcf_files, checkIfExists: true )
FILTERING( vcf_files ).view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [thirsty_torricelli] DSL2 - revision: 8f69ad5638
executor > local (3)
[7d/dacad6] process > FILTERING (C) [100%] 3 of 3 ✔
[A, /path/to/work/84/f9f00097bcd2b012d3a5e105b9d828/A.filtered.vcf]
[B, /path/to/work/cb/9f6f78213f0943013990d30dbb9337/B.filtered.vcf]
[C, /path/to/work/7d/dacad693f06025a6301c33fd03157b/C.filtered.vcf]
Note that BCFtools is actively maintained and is intended as a replacement for VCFtools. In a production pipeline, BCFtools should be preferred.
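For reference, a rough BCFtools equivalent of the vcftools filter above might look like the following (just a sketch: --min-ac 1:minor and the QUAL>=20 expression are my approximations of --mac 1 and --minQ 20, and the file names are placeholders):
bcftools view --min-ac 1:minor -i 'QUAL>=20' -Oz -o sample.filtered.vcf.gz sample.vcf.gz
bcftools index -t sample.filtered.vcf.gz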
Here is a little example for starters. First, you should give each sample a unique output name. Currently, after_filtering.vcf is hardcoded, so the outputs will overwrite each other once copied to the publishDir. You can derive a unique name from the file's baseName, as below, and store it permanently in the input file channel, the first element being the sample name and the second one the actual file. I made an example process that just runs head on the vcf; you can adapt it to what you actually need.
#! /usr/bin/env nextflow
nextflow.enable.dsl = 2
params.input_files = "/Users/atpoint/vcf/*.vcf"
params.results_dir = "/Users/atpoint/vcf/"
// A channel that contains a map with sample name and the file itself
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
.map { it -> [it.baseName, it] }
// An example process just head-ing the vcf
process VcfHead {
publishDir("${params.results_dir}/after_filtering", mode: 'copy')
input:
tuple val(name), path(vcf_in)
output:
path("*_head.vcf")
script:
"""
head -n 1 $vcf_in > ${name}_head.vcf
"""
}
// Run it
workflow {
VcfHead(file_channel)
}
The file_channel channel looks like this if you add a .view() to it:
[one, /Users/atpoint/vcf/one.vcf]
[two, /Users/atpoint/vcf/two.vcf]
I have a process that generates a value. I want to forward this value into a value output channel, but I cannot seem to get it working in one "go": I always have to write the value to a file in the output and then define a new channel from the first:
process calculate{
input:
file div from json_ch.collect()
path "metadata.csv" from meta_ch
output:
file "dir/file.txt" into inter_ch
script:
"""
echo ${div} > alljsons.txt
mkdir dir
python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
"""
}
ch = inter_ch.map{file(it).text}
ch.view()
How do I fix this?
thanks!
best, t.
If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the 'inter_ch' channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the goal is to reduce the number of channels created, consider switching to the new DSL 2. This won't let you avoid writing your calculated result to a file, but it may let you avoid an intermediary channel.
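For example, a minimal DSL 2 sketch (assuming your 'calculate' process is ported to DSL 2 with output: path "dir/file.txt", and that json_ch and meta_ch are defined as in your script) could chain the map directly onto the process call:
workflow {
    calculate( json_ch.collect(), meta_ch )   // the call returns the process's output channel
        .map { it.text }                      // read the emitted file's contents
        .view()
}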
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope and value expressions. For example:
process foo {
input:
file fasta from 'dummy'
output:
val x into var_channel
val 'BB11' into str_channel
val "${fasta.baseName}.out" into exp_channel
script:
x = fasta.name
"""
cat $x > file
"""
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic sugar to your shell script at runtime, such that an output file is still created:
Contents of test.nf:
process test {
output:
env myval into out_ch
script:
'''
myval=$(calc.py)
'''
}
out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env
I am a student studying angr for the first time.
I'm looking at the code at this URL:
https://github.com/Dvd848/CTFs/blob/master/2020_GoogleCTF/Beginner.md
import angr
import claripy
FLAG_LEN = 15
STDIN_FD = 0
base_addr = 0x100000 # To match addresses to Ghidra
proj = angr.Project("./a.out", main_opts={'base_addr': base_addr})
flag_chars = [claripy.BVS('flag_%d' % i, 8) for i in range(FLAG_LEN)]
flag = claripy.Concat( *flag_chars + [claripy.BVV(b'\n')]) # Add \n for scanf() to accept the input
state = proj.factory.full_init_state(
args=['./a.out'],
add_options=angr.options.unicorn,
stdin=flag,
)
# Add constraints that all characters are printable
for k in flag_chars:
state.solver.add(k >= ord('!'))
state.solver.add(k <= ord('~'))
simgr = proj.factory.simulation_manager(state)
find_addr = 0x101124 # SUCCESS
avoid_addr = 0x10110d # FAILURE
simgr.explore(find=find_addr, avoid=avoid_addr)
if (len(simgr.found) > 0):
for found in simgr.found:
print(found.posix.dumps(STDIN_FD))
https://github.com/google/google-ctf/tree/master/2020/quals/reversing-beginner/attachments
This is the solution to the Google CTF Beginner challenge.
But the above code does not work; it doesn't give me the answer.
I want to know why the code is not working.
When I execute the code, the output is empty.
I run the code with python3 on Ubuntu 20.04 in WSL2.
Thank you.
I believe this script isn't printing anything because angr fails to find a solution and then exits. You can prove this by appending the following to your script:
else:
raise Exception('Could not find the solution')
If the exception is raised, angr did not find a valid solution.
As for why it doesn't work: this code looks like it was copied and pasted from a few different sources, so it's fairly convoluted.
For example, the way the flag symbol is passed to stdin is not ideal. By default, stdin is a SimPackets, so it's best to keep it that way.
The following script solves the challenge; I have commented it to help you understand. You will notice that changing stdin=angr.SimPackets(name='stdin', content=[(flag, 15)]) to stdin=flag will cause the script to fail, for the reason mentioned above.
import angr
import claripy
base = 0x400000 # Default angr base
project = angr.Project("./a.out")
flag = claripy.BVS("flag", 15 * 8) # length is expected in bits here
initial_state = project.factory.full_init_state(
stdin=angr.SimPackets(name='stdin', content=[(flag, 15)]), # provide symbol and length (in bytes)
add_options ={
angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS
}
)
# constrain flag to common alphanumeric / punctuation characters
[initial_state.solver.add(byte >= 0x20, byte <= 0x7f) for byte in flag.chop(8)]
sim = project.factory.simgr(initial_state)
sim.explore(
find=lambda s: b"SUCCESS" in s.posix.dumps(1), # search for a state with this result
avoid=lambda s: b"FAILURE" in s.posix.dumps(1) # states that meet this constraint will be added to the avoid stash
)
if sim.found:
solution_state = sim.found[0]
print(f"[+] Success! Solution is: {solution_state.posix.dumps(0)}") # dump whatever was sent to stdin to reach this state
else:
raise Exception('Could not find the solution') # Tell us if angr failed to find a solution state
A bit of trivia: there are actually multiple 'solutions' that the program would accept, though I guess the CTF flag server only accepts one.
❯ echo -ne 'CTF{\x00\xe0MD\x17\xd1\x93\x1b\x00n)' | ./a.out
Flag: SUCCESS
I tried to create a Snakemake file to run the sortmeRNA pipeline:
SAMPLES = ['test']
READS=["R1", "R2"]
rule all:
input: expand("Clean/4.Unmerge/{exp}.non_rRNA_{read}.fastq", exp = SAMPLES, read = READS)
rule unzip:
input:
fq = "trimmed/{exp}.{read}.trimd.fastq.gz"
output:
ofq = "Clean/1.Unzipped/{exp}.{read}.trimd.fastq"
shell: "gzip -dkc < {input.fq} > {output.ofq}"
rule merge_paired:
input:
read1 = "Clean/1.Unzipped/{exp}.R1.trimd.fastq",
read2 = "Clean/1.Unzipped/{exp}.R2.trimd.fastq"
output:
il = "Clean/2.interleaved/{exp}.il.trimd.fastq"
shell: "merge-paired-reads.sh {input.read1} {input.read2} {output.il}"
rule sortmeRNA:
input:
ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
output:
reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
params:
silvabac = "rRNA_databases/silva-bac-16s-id90.fasta,index/silva-bac-16s-db:rRNA_databases/silva-bac-23s-id98.fasta,index/silva-bac-23s-db",
silvaarc = "rRNA_databases/silva-arc-16s-id95.fasta,index/silva-arc-16s-db:rRNA_databases/silva-arc-23s-id98.fasta,index/silva-arc-23s-db",
silvaeuk = "rRNA_databases/silva-euk-18s-id95.fasta,index/silva-euk-18s-db:rRNA_databases/silva-euk-28s-id98.fasta,index/silva-euk-28s-db",
rfam = "rRNA_databases/rfam-5s-database-id98.fasta,index/rfam-5s-db:rRNA_databases/rfam-5.8s-database-id98.fasta,index/rfam-5.8s-db",
acc = "--num_alignments 1 --fastx --log -a 20 -m 64000 --paired_in -v"
log:
"Clean/sortmeRNAlogs/{exp}_sortmeRNA.log"
shell:'''
sortmerna --ref {params.silvabac}:{params.silvaarc}:{params.silvaeuk}:{params.rfam} --reads {input.ilfq} --aligned {output.reads_rRNA} --other {output.non_rRNA} {params.acc}
'''
rule unmerge_paired:
input:
inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq"
output:
R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
shell:"unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"
This worked fine! But for one sample it produced ~53 GB of output. I have 90 samples to run and cannot afford that much disk space. I tried marking the outputs of rules unzip, merge_paired and sortmeRNA as temp(), but then unmerge_paired raises a "Missing input files" exception.
I also tried adding a rule_remove to delete all those intermediate directories, but it is not executed as the last rule; it runs somewhere in the middle and raises an error again. Is there an efficient way to do this?
The error that occurs is:
MissingInputException in line 45 of sortmeRNA_pipeline_memv2.0.snakefile:
Missing input files for rule unmerge_paired:
Clean/3.sorted/test_reads_nonRNA.fastq
Also please note that rule sortmeRNA takes a string (an output prefix) and produces a string.fastq file, which is then the input of rule unmerge_paired.
Thanks.
For Snakemake to connect the input of one rule to the output of another, they need to be identical. The way you describe the output of sortmeRNA and the input of unmerge_paired does not work, whether you put temp() around it or not.
rule sortmeRNA:
input:
ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
output:
reads_rRNA = temp("Clean/3.sorted/{exp}_reads_rRNA.fastq"),
non_rRNA = temp("Clean/3.sorted/{exp}_reads_nonRNA.fastq")
params:
reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
shell:
'''
sortmerna --aligned {params.reads_rRNA} --other {params.non_rRNA} ...
'''
rule unmerge_paired:
input:
inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq" # or rules.sortmeRNA.output.non_rRNA
output:
R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
shell:
"unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"
I removed everything that isn't necessary for understanding what is going on; you will obviously have to put those parts back. I changed the output of sortmeRNA to the files the rule actually produces (and made them temp()). I also added two params, which are the same as the outputs but without the .fastq extension, since sortmerna appends that extension itself, as you noted.
I am struggling with command-line parsing and argparse: how to handle global options, subcommands, and optional params for those subcommands.
I'm writing a python3 wrapper around python-libvirt to manage my VMs. The wrapper will handle creation, removal, stop/start, snapshots, etc.
A partial list of the options follows, showing the different ways params can be passed to my script:
# Connection option for all commands:
# ---
# vmman.py [-c hypervisor] (defaults to qemu:///system)
# Generic VM commands:
# ---
# vmman.py show : list all vms, with their state
# vmman.py {up|down|reboot|rm} domain : boots, shuts down, reboots or deletes the domain
# Snapshot management:
# ---
# vmman.py lssnap domain : list snapshots attached to domain
# vmman.py snaprev domain [snapsname] : reverts domain to latest snapshot or to snapname
# Resource management:
# ---
# vmman.py domain resdel [disk name] [net iface]
And then some code used to test the first subcommand:
def setConnectionString(args):
print('Arg = %s' % args.cstring)
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
parserConnect = subparsers.add_parser('ConnectionURI')
parserConnect.set_defaults(func=setConnectionString)
parserConnect.add_argument('-c', '--connect', dest='host')
args = parser.parse_args()
args.func(args)
print("COMPLETED")
Now, the argparse docs on docs.python.org are dense and a bit confusing for a Python newbie like me... I would have expected the output to be something like:
`Arg = oslo`
What I get is:
[10:21:40|jfgratton#bergen:kvmman.py]: ./argstest.py -c oslo
usage: argstest.py [-h] {ConnectionURI} ...
argstest.py: error: invalid choice: 'connectionURI' (choose from 'ConnectionURI')
I'm obviously missing something, and I'm only testing the one I thought would be the easiest of the lot (the global param); I haven't even figured out yet how to handle optional subparams and all.
Your error output lists 'connectionURI' with a lowercase 'c' as an invalid choice, while it also says "choose from 'ConnectionURI'" with a capital 'C'.
Fix: call your test with the subcommand name spelled exactly as it was registered, and pass the host via -c:
./argstest.py ConnectionURI -c oslo
Also note that because you set dest='host', the parsed value ends up in args.host, so setConnectionString needs to print args.host rather than args.cstring.
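Putting those pieces together, a corrected version of your test could look like this (a sketch that keeps your names and only changes what is needed to make it run):
import argparse

def setConnectionString(args):
    print('Arg = %s' % args.host)  # dest='host', so the parsed value lives in args.host

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()

parserConnect = subparsers.add_parser('ConnectionURI')
parserConnect.set_defaults(func=setConnectionString)
parserConnect.add_argument('-c', '--connect', dest='host', default='qemu:///system')

args = parser.parse_args()
args.func(args)
print("COMPLETED")
Invoked as ./argstest.py ConnectionURI -c oslo this prints Arg = oslo followed by COMPLETED.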
Maybe you should start simple (without subparsers) and build from there:
import argparse
def setConnectionString(hostname):
print('Arg = {}'.format(hostname))
parser = argparse.ArgumentParser(description='python3 wrapper around python-libvirt to manage VMs')
parser.add_argument('hostname')
args = parser.parse_args()
setConnectionString(args.hostname)
print("COMPLETED")
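With this simpler version, the invocation and output would be (assuming the file is still called argstest.py):
$ ./argstest.py oslo
Arg = oslo
COMPLETED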