How to forward the value of a variable created in the script to a value output channel in Nextflow? - groovy

I have a process that generates a value. I want to forward this value into a value output channel, but I can't seem to get it working in one "go" - I always have to write the value to an output file and then define a new channel from the first:
process calculate {
    input:
    file div from json_ch.collect()
    path "metadata.csv" from meta_ch

    output:
    file "dir/file.txt" into inter_ch

    script:
    """
    echo ${div} > alljsons.txt
    mkdir dir
    python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}
ch = inter_ch.map{file(it).text}
ch.view()
How do I fix this?
Thanks!
Best, t.

If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the inter_ch channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the desire is simply to reduce the number of channels created, consider switching to the new DSL 2. This won't let you avoid writing your calculated result to a file, but it may let you avoid the intermediary channel.
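For illustration, a minimal DSL 2 sketch might look like the following (it reuses json_ch, meta_ch and calculate.py from your example, so treat it as an untested outline). The result is still written to a file, but the map can be chained directly onto the process call, so no intermediary channel needs to be named:

nextflow.enable.dsl = 2

process calculate {
    input:
    path jsons
    path 'metadata.csv'

    output:
    path 'dir/file.txt'

    script:
    """
    echo ${jsons} > alljsons.txt
    mkdir dir
    python3 ${baseDir}/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}

workflow {
    calculate( json_ch.collect(), meta_ch )
        .map { it.text }
        .view()
}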
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope, and value expressions. For example:
process foo {
    input:
    file fasta from 'dummy'

    output:
    val x into var_channel
    val 'BB11' into str_channel
    val "${fasta.baseName}.out" into exp_channel

    script:
    x = fasta.name
    """
    cat $x > file
    """
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic sugar to your shell script at runtime, so an output file (.command.env, shown below) is still created behind the scenes:
Contents of test.nf:
process test {
    output:
    env myval into out_ch

    script:
    '''
    myval=$(calc.py)
    '''
}

out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env

Related

Nextflow script to process all files in given directory

I have a nextflow script that runs a couple of processes on a single vcf file. The name of the file is 'bos_taurus.vcf' and it is located in the directory /input_files/bos_taurus.vcf. The directory input_files/ also contains another file, 'sacharomyces_cerevisea.vcf'. I would like my nextflow script to process both files. I was trying to use a glob pattern like ch_1 = channel.fromPath("/input_files/*.vcf"), but sadly I can't find a working solution. Any help would be really appreciated.
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// here I tried to use globbing
params.input_files = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/input_files/*.vcf"
params.results_dir = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/results"

file_channel = Channel.fromPath( params.input_files, checkIfExists: true )

// how can I make this process work on two files simultaneously
process FILTERING {
    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    path(input_files)

    output:
    path("*")

    script:
    """
    vcftools --vcf ${input_files} --mac 1 --minQ 20 --recode --recode-INFO-all --out after_filtering.vcf
    """
}
Note that if your VCF files are actually bgzip compressed and tabix indexed, you could instead use the fromFilePairs factory method to create your input channel. For example:
params.vcf_files = "./input_files/*.vcf.gz{,.tbi}"
params.results_dir = "./results"

process FILTERING {
    tag { sample }

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(sample), path(indexed_vcf)

    output:
    tuple val(sample), path("${sample}.filtered.vcf")

    """
    vcftools \\
        --vcf "${indexed_vcf.first()}" \\
        --mac 1 \\
        --minQ 20 \\
        --recode \\
        --recode-INFO-all \\
        --out "${sample}.filtered.vcf"
    """
}

workflow {
    vcf_files = Channel.fromFilePairs( params.vcf_files, checkIfExists: true )

    FILTERING( vcf_files ).view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [thirsty_torricelli] DSL2 - revision: 8f69ad5638
executor > local (3)
[7d/dacad6] process > FILTERING (C) [100%] 3 of 3 ✔
[A, /path/to/work/84/f9f00097bcd2b012d3a5e105b9d828/A.filtered.vcf]
[B, /path/to/work/cb/9f6f78213f0943013990d30dbb9337/B.filtered.vcf]
[C, /path/to/work/7d/dacad693f06025a6301c33fd03157b/C.filtered.vcf]
Note that BCFtools is actively maintained and is intended as a replacement for VCFtools. In a production pipeline, BCFtools should be preferred.
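As a rough, untested sketch (the option mapping is only approximate and should be checked against your actual filtering criteria), the vcftools command above might translate to something like this with BCFtools:

# keep sites with a minor allele count of at least 1 and QUAL of at least 20
bcftools view \
    --min-ac 1:minor \
    -e 'QUAL<20' \
    -O v \
    -o "${sample}.filtered.vcf" \
    "${indexed_vcf.first()}"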
Here is a little example for starters. First, you should give each sample a unique output name; currently after_filtering.vcf is hardcoded, so the outputs would overwrite each other once copied to the publishDir. You can do that with baseName, as below, and carry the result along in the input file channel, the first element being the sample name and the second the actual file. I made an example process that just runs head on the vcf; adapt it to what you actually need.
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

params.input_files = "/Users/atpoint/vcf/*.vcf"
params.results_dir = "/Users/atpoint/vcf/"

// A channel that emits a tuple of sample name and the file itself
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
                      .map { it -> [it.baseName, it] }

// An example process that just runs head on the vcf
process VcfHead {
    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(name), path(vcf_in)

    output:
    path("*_head.vcf")

    script:
    """
    head -n 1 $vcf_in > ${name}_head.vcf
    """
}

// Run it
workflow {
    VcfHead(file_channel)
}
The file_channel channel looks like this if you add a .view() to it:
[one, /Users/atpoint/vcf/one.vcf]
[two, /Users/atpoint/vcf/two.vcf]

Nextflow capture output file by partial pattern

I've got a Nextflow process that looks like:
process my_app {
    publishDir "${outdir}/my_app", mode: params.publish_dir_mode

    input:
    path input_bam
    path input_bai
    val output_bam
    val max_mem
    val threads
    val container_home
    val outdir

    output:
    tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), emit: tuple_ch

    shell:
    '''
    my_script.sh \
        !{input_bam} \
        !{output_bam} \
        !{max_mem} \
        !{threads}

    output_prefix=$(echo !{output_bam} | sed "s#.bam##")
    '''
}
This process only outputs the .bam and .bai files, but my_script.sh also creates .vcf files that are not being published to the output directory.
I tried the following in order to retrieve the files created by the script, but without success:
output:
tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), path("${output_prefix}.*.vcf"), emit: mt_validation_simulation_tuple_ch
but in logs I can see:
Error executing process caused by:
Missing output file(s) `null.*.vcf` expected by process `my_app_wf:my_app`
What am I missing? Could you help me? Thank you in advance!
The problem is that output_prefix has only been defined inside the shell block. If all you need for your output prefix is the file's basename (without the extension), you can just use a regular script block and read the file attributes directly. Note that variables defined in the script block (but outside the command string) are global (within the process scope) unless they're defined using the def keyword:
process my_app {
    ...

    output:
    tuple val(output_prefix), path("${output_bam}{,.bai}"), path("${output_prefix}.*.vcf")

    script:
    output_prefix = output_bam.baseName

    """
    my_script.sh \\
        "${input_bam}" \\
        "${output_bam}" \\
        "${max_mem}" \\
        "${threads}"
    """
}
If the process itself creates the BAM (and its index), it might even be possible to refactor away the multiple input channels by supplying an output prefix up front. Usually this makes more sense, but I don't have enough details to say one way or the other. The following might suffice as an example; you may need or prefer to combine or change the output declaration(s) to suit, but hopefully you get the idea:
params.publish_dir = './results'
params.publish_mode = 'copy'

process my_app {
    publishDir "${params.publish_dir}/my_app", mode: params.publish_mode

    cpus 1
    memory 1.GB

    input:
    tuple val(prefix), path(indexed_bam)

    output:
    tuple val(prefix), path("${prefix}.bam{,.bai}"), emit: bam_files
    tuple val(prefix), path("${prefix}.*.vcf"), emit: vcf_files

    """
    my_script.sh \\
        "${indexed_bam.first()}" \\
        "${prefix}.bam" \\
        "${task.memory.toGiga()}G" \\
        "${task.cpus}"
    """
}
Note that indexed_bam expects the files as a tuple in the form (bam, bai), since indexed_bam.first() is used to select the BAM.
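For example, an input channel in that shape could be built with the fromFilePairs factory method. This is a sketch only, assuming the index sits next to the BAM as <name>.bam.bai; adjust the glob (and the hypothetical ./data path) to your layout:

params.bam_files = "./data/*.bam{,.bai}"

workflow {
    indexed_bams = Channel.fromFilePairs( params.bam_files, checkIfExists: true )

    my_app( indexed_bams )

    my_app.out.vcf_files.view()
}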

snakemake - replacing wildcards in input directive by anonymous function

I am writing a snakemake that will run a bioinformatics pipeline for several input samples. These input files (two for each analysis, one with the partial string match R1 and the second with the partial string match R2) start with a pattern and end with the extension .fastq.gz. Eventually I want to perform multiple operations, though, for this example I just want to align the fastq reads against a reference genome using bwa mem. So for this example my input file is NIPT-N2002394-LL_S19_R1_001.fastq.gz and I want to generate NIPT-N2002394-LL.bam (see code below specifying the directories where input and output are).
My config.yaml file looks like so:
# Run_ID
run: "200311_A00154_0454_AHHHKMDRXX"
# Base directory: the analysis directory from which I will fetch the samples
bd: "/nexusb/nipt/"
# Define the prefix
# will be used to subset the folders in bd
prefix: "NIPT"
# Reference:
ref: "/nexus/bhinckel/19/ONT_projects/PGD_breakpoint/ref_hg19_local/hg19_chr1-y.fasta"
And below is my snakefile
import os
import re

#############
# config file
#############
configfile: "config.yaml"

#######################################
# Parsing variables from config.yaml
#######################################
RUN = config['run']
BD = config['bd']
PREFIX = config['prefix']

FQDIR = f'/nexusb/Novaseq/{RUN}/Unaligned/'
BASEDIR = BD + RUN

SAMPLES = [sample for sample in os.listdir(BASEDIR) if sample.startswith(PREFIX)]
# explanation: in BASEDIR I have multiple subdirectories. The names of the subdirectories starting with PREFIX will be the name of the elements I want to have in the list SAMPLES, which eventually shall be my {sample} wildcard

#############
# RULES
#############
rule all:
    input:
        expand("aligned/{sample}.bam", sample = SAMPLES)

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"
But I am getting:
Building DAG of jobs...
WildcardError in line 55 of /nexusb/nipt/200311_A00154_0454_AHHHKMDRXX/testMetrics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
's'
When calling snakemake -np
I believe my error lies in the definitions of R1 and R2 in the input directive. I find it puzzling because according to the official documentation snakemake should interpret any wildcard as the regex .+. But it is not doing that for sample NIPT-PearlPPlasma-05-PPx, whose R1 and R2 should be NIPT-PearlPPlasma-05-PPx_S5_R1_001.fastq.gz and NIPT-PearlPPlasma-05-PPx_S5_R2_001.fastq.gz, respectively.
Take another look at the Snakemake tutorial on how input is inferred from output; anyway, I think the problem lies in this piece of code:
output:
    expand("aligned/{sample}.bam", sample = SAMPLES)
which needs to be changed into:
output:
    "aligned/{sample}.bam"
What you had didn't work because expand("aligned/{sample}.bam", sample = SAMPLES) expands to a concrete list, something like ["aligned/sample0.bam", "aligned/sample1.bam"]. When you remove the expand, you only give a "description" of what the output should look like, and thus Snakemake can infer the wildcards and the input.
Edit:
It's difficult to test since I don't have the actual files, but you should do something like the following. Note that it won't work if multiple S-numbers exist for the same sample.
def get_reads(wildcards):
    R1 = FQDIR + f"{wildcards.sample}_S{{s}}_R1_001.fastq.gz"
    R2 = FQDIR + f"{wildcards.sample}_S{{s}}_R2_001.fastq.gz"
    globbed = glob_wildcards(R1)
    R1, R2 = expand([R1, R2], s=globbed.s)
    return {"R1": R1, "R2": R2}

rule bwa_map:
    input:
        unpack(get_reads),
        REF = config['ref']
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"
The problem is here:
rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
Your output clearly defines a pattern where {sample} is a wildcard. When Snakemake builds the DAG and finds that another rule requires a file matching this pattern, it assigns a concrete value to the sample wildcard. At that moment all the inputs must be defined, but you are introducing one more level of indirection: the wildcard {s}, which is not defined.
The value of {s} must be inferable from the output. If you can do it at design time, substitute it with the concrete values; otherwise you may use the checkpoints feature of Snakemake.
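As a sketch of the 'design time' option (assuming a single S-number per sample and the file layout from the question), the sample-to-S-number mapping can be resolved once when the Snakefile is parsed and then looked up via input functions:

# resolve the S-number for every sample once, at parse time
globbed = glob_wildcards(FQDIR + "{sample}_S{s}_R1_001.fastq.gz")
S_NUM = dict(zip(globbed.sample, globbed.s))

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUM[wc.sample]}_R1_001.fastq.gz",
        R2 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUM[wc.sample]}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"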

execute snakemake rule as last rule

I tried to create a snakemake file to run sortmeRNA pipeline:
SAMPLES = ['test']
READS = ["R1", "R2"]

rule all:
    input:
        expand("Clean/4.Unmerge/{exp}.non_rRNA_{read}.fastq", exp = SAMPLES, read = READS)

rule unzip:
    input:
        fq = "trimmed/{exp}.{read}.trimd.fastq.gz"
    output:
        ofq = "Clean/1.Unzipped/{exp}.{read}.trimd.fastq"
    shell:
        "gzip -dkc < {input.fq} > {output.ofq}"

rule merge_paired:
    input:
        read1 = "Clean/1.Unzipped/{exp}.R1.trimd.fastq",
        read2 = "Clean/1.Unzipped/{exp}.R2.trimd.fastq"
    output:
        il = "Clean/2.interleaved/{exp}.il.trimd.fastq"
    shell:
        "merge-paired-reads.sh {input.read1} {input.read2} {output.il}"

rule sortmeRNA:
    input:
        ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
    output:
        reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
        non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
    params:
        silvabac = "rRNA_databases/silva-bac-16s-id90.fasta,index/silva-bac-16s-db:rRNA_databases/silva-bac-23s-id98.fasta,index/silva-bac-23s-db",
        silvaarc = "rRNA_databases/silva-arc-16s-id95.fasta,index/silva-arc-16s-db:rRNA_databases/silva-arc-23s-id98.fasta,index/silva-arc-23s-db",
        silvaeuk = "rRNA_databases/silva-euk-18s-id95.fasta,index/silva-euk-18s-db:rRNA_databases/silva-euk-28s-id98.fasta,index/silva-euk-28s-db",
        rfam = "rRNA_databases/rfam-5s-database-id98.fasta,index/rfam-5s-db:rRNA_databases/rfam-5.8s-database-id98.fasta,index/rfam-5.8s-db",
        acc = "--num_alignments 1 --fastx --log -a 20 -m 64000 --paired_in -v"
    log:
        "Clean/sortmeRNAlogs/{exp}_sortmeRNA.log"
    shell:
        '''
        sortmerna --ref {params.silvabac}:{params.silvaarc}:{params.silvaeuk}:{params.rfam} --reads {input.ilfq} --aligned {output.reads_rRNA} --other {output.non_rRNA} {params.acc}
        '''

rule unmerge_paired:
    input:
        inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq"
    output:
        R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
        R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
    shell:
        "unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"
This worked fine! But for one sample it produced output totalling ~53 GB. I have 90 samples to run and cannot afford that much disk space. I tried marking the outputs of rules unzip, merge_paired and sortmeRNA as temp(), but then unmerge_paired raises a "Missing input files" exception.
I also tried adding a rule_remove to delete all those intermediate directories, but it is not executed as the last rule; it runs somewhere in the middle and raises an error again. Is there a more efficient way to do this?
The error that occurs is:
MissingInputException in line 45 of sortmeRNA_pipeline_memv2.0.snakefile:
Missing input files for rule unmerge_paired:
Clean/3.sorted/test_reads_nonRNA.fastq
Also please note that rule sortmeRNA takes a string (an output prefix) in its output directive and produces <string>.fastq, which is then the input of rule unmerge_paired!
Thanks.
For Snakemake to connect the input of one rule to the output of another, they need to be identical. The way you have described them, the output of sortmeRNA and the input of unmerge_paired do not match, whether you put temp() around them or not.
rule sortmeRNA:
    input:
        ilfq = "Clean/2.interleaved/{exp}.il.trimd.fastq"
    output:
        reads_rRNA = temp("Clean/3.sorted/{exp}_reads_rRNA.fastq"),
        non_rRNA = temp("Clean/3.sorted/{exp}_reads_nonRNA.fastq")
    params:
        reads_rRNA = "Clean/3.sorted/{exp}_reads_rRNA",
        non_rRNA = "Clean/3.sorted/{exp}_reads_nonRNA"
    shell:
        '''
        sortmerna --aligned {params.reads_rRNA} --other {params.non_rRNA} ...
        '''

rule unmerge_paired:
    input:
        inun = "Clean/3.sorted/{exp}_reads_nonRNA.fastq"  # or rules.sortmeRNA.output.non_rRNA
    output:
        R1 = "Clean/4.Unmerge/{exp}.non_rRNA_R1.fastq",
        R2 = "Clean/4.Unmerge/{exp}.non_rRNA_R2.fastq"
    shell:
        "unmerge-paired-reads.sh {input.inun} {output.R1} {output.R2}"
I removed everything that isn't necessary to understand what is going on; you will have to put that back, obviously. I changed the output of sortmeRNA to the actual output files of the rule (and made them temp()). I also added two params, which are the same as the outputs but without the .fastq extension, to pass to sortmerna as output prefixes.
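With the filenames matching up, the same temp() treatment can then be applied to the other intermediate rules, so their files are deleted as soon as every consumer has run. A sketch for the unzip rule from the question (merge_paired would be handled the same way):

rule unzip:
    input:
        fq = "trimmed/{exp}.{read}.trimd.fastq.gz"
    output:
        # temp() removes the unzipped fastq once merge_paired has consumed it
        ofq = temp("Clean/1.Unzipped/{exp}.{read}.trimd.fastq")
    shell:
        "gzip -dkc < {input.fq} > {output.ofq}"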

Use groovy script output as input for another groovy script

I'll apologise in advance - I'm new to Groovy. The problem I have is that I have 3 Groovy scripts which perform different functions, and I need to call them from my main Groovy script, using the output from script 1 as input for script 2, and script 2's output as input for script 3.
I've tried the following code:
script = new GroovyShell(binding)
script.run(new File("script1.groovy"), "--p", "$var" ) | script.run(new File("script2.groovy"), "<", "$var" )
When I run the above code the first script runs successfully but the 2nd doesn't run at all.
Script 1 takes an int as a parameter using the "--p", "$var" code. This runs successfully in the main script using: script.run(new File("script1.groovy"), "--p", "$var" ) - Script 1's output is an xml file.
When I run script.run(new File("script2.groovy"), "<", "$var" ) on its own in the main groovy script nothing happens and the system hangs.
I can run script 2 from the command line using groovy script2.groovy < input_file and it works fine.
Any help would be greatly appreciated.
You cannot pass < as an argument to the script, as redirection is handled by the shell when you run things from the command line.
Redirecting the output of one script into another is notoriously difficult, and basically relies on changing System.out for the duration of each script (and hoping that nothing else in the JVM prints and messes up your data).
Better to use Java processes, like the following:
Given these 3 scripts:
script1.groovy
// For each argument
args.each {
    // Wrap it in xml and write it out
    println "<woo>$it</woo>"
}

linelength.groovy

// read input
System.in.eachLine { line ->
    // Write out the number of chars in each line
    println line.length()
}

pretty.groovy

// For each line print out a nice report
int index = 1
System.in.eachLine { line ->
    println "Line $index contains $line chars (including the <woo></woo> bit)"
    index++
}
We can then write something like this to get a new groovy process to run each in turn, and pipe the outputs into each other (using the overloaded or operator on Process):
def s1 = 'groovy script1.groovy arg1 andarg2'.execute()
def s2 = 'groovy linelength.groovy'.execute()
def s3 = 'groovy pretty.groovy'.execute()
// pipe the output of process1 to process2, and the output
// of process2 to process3
s1 | s2 | s3
s3.waitForProcessOutput( System.out, System.err )
Which prints out:
Line 1 contains 15 chars (including the <woo></woo> bit)
Line 2 contains 18 chars (including the <woo></woo> bit)
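Applied to your three scripts, the same process-based piping might look something like this (a sketch only; it assumes script3.groovy reads from standard input, and reuses the --p argument from your question):

def p1 = ['groovy', 'script1.groovy', '--p', "$var"].execute()
def p2 = 'groovy script2.groovy'.execute()
def p3 = 'groovy script3.groovy'.execute()

// pipe script1's output into script2, and script2's output into script3
p1 | p2 | p3
p3.waitForProcessOutput(System.out, System.err)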
An alternative is to do the redirection yourself within a single JVM, buffering the first script's output and feeding it back in as the second script's input:
//store standard I/O
PrintStream systemOut = System.out
InputStream systemIn = System.in
//Buffer for exchanging data between scripts
ByteArrayOutputStream buffer = new ByteArrayOutputStream()
PrintStream out = new PrintStream(buffer)
//Redirecting "out" of 1st stream to buffer
System.out = out
//RUN 1st script
evaluate("println 'hello'")
out.flush()
//Redirecting buffer to "in" of 2nd script
System.in = new ByteArrayInputStream(buffer.toByteArray())
//set standard "out"
System.out = systemOut
//RUN 2nd script
evaluate("println 'message from the first script: ' + new Scanner(System.in).next()")
//set standard "in"
System.in = systemIn
result is: 'message from the first script: hello'
