I've got a Nextflow process that looks like:
process my_app {
publishDir "${outdir}/my_app", mode: params.publish_dir_mode
input:
path input_bam
path input_bai
val output_bam
val max_mem
val threads
val container_home
val outdir
output:
tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), emit: tuple_ch
shell:
'''
my_script.sh \
!{input_bam} \
!{output_bam} \
!{max_mem} \
!{threads}
output_prefix=$(echo !{output_bam} | sed "s#.bam##")
'''
}
This process only creates the two files (.bam and .bai), but my_script.sh also creates .vcf files that are not being published to the output directory.
I tried the following in order to retrieve the files created by the script, but without success:
output:
tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), path("${output_prefix}.*.vcf"), emit: mt_validation_simulation_tuple_ch
but in logs I can see:
Error executing process caused by:
Missing output file(s) `null.*.vcf` expected by process `my_app_wf:my_app`
What am I missing? Could you help me? Thank you in advance!
The problem is that output_prefix is only defined inside the shell block. If all you need for your output prefix is the file's basename (without extension), you can just use a regular script block and read the file's attributes directly. Note that variables defined in the script block (but outside the command string) are global (within the process scope) unless they're defined using the def keyword:
process my_app {
...
output:
tuple val(output_prefix), path("${output_bam}{,.bai}"), path("${output_prefix}.*.vcf")
script:
output_prefix = output_bam.baseName
"""
my_script.sh \\
"${input_bam}" \\
"${output_bam}" \\
"${max_mem}" \\
"${threads}"
"""
}
If the process creates the BAM (and index), it might even be possible to refactor away the multiple input channels, provided an output prefix can be supplied up front. Usually this makes more sense, but I don't have enough details to say one way or the other. The following might suffice as an example; you may need/prefer to combine or change the output declaration(s) to suit, but hopefully you get the idea:
params.publish_dir = './results'
params.publish_mode = 'copy'
process my_app {
publishDir "${params.publish_dir}/my_app", mode: params.publish_mode
cpus 1
memory 1.GB
input:
tuple val(prefix), path(indexed_bam)
output:
tuple val(prefix), path("${prefix}.bam{,.bai}"), emit: bam_files
tuple val(prefix), path("${prefix}.*.vcf"), emit: vcf_files
"""
my_script.sh \\
"${indexed_bam.first()}" \\
"${prefix}.bam" \\
"${task.memory.toGiga()}G" \\
"${task.cpus}"
"""
}
Note that the indexed_bam input expects a tuple in the form: tuple(bam, bai)
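A channel in that shape can be created, for example, with the fromFilePairs factory method if the indexed BAMs already exist on disk. A minimal sketch, assuming a hypothetical params.bam_files glob:
params.bam_files = './data/*.bam{,.bai}'
workflow {
    indexed_bams = Channel.fromFilePairs( params.bam_files, checkIfExists: true )
    my_app( indexed_bams )
}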
I have a Nextflow script that runs a couple of processes on a single VCF file. The name of the file is 'bos_taurus.vcf' and it is located in the directory /input_files/bos_taurus.vcf. The directory input_files/ also contains another file, 'sacharomyces_cerevisea.vcf'. I would like my Nextflow script to process both files. I was trying to use a glob pattern like ch_1 = channel.fromPath("/input_files/*.vcf"), but sadly I can't find a working solution. Any help would be really appreciated.
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// here I tried to use globbing
params.input_files = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/input_files/*.vcf"
params.results_dir = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/results"
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
// how can I make this process work on two files simultaneously
process FILTERING {
publishDir("${params.results_dir}/after_filtering", mode: 'copy')
input:
path(input_files)
output:
path("*")
script:
"""
vcftools --vcf ${input_files} --mac 1 --minQ 20 --recode --recode-INFO-all --out after_filtering.vcf
"""
}
Note that if your VCF files are actually bgzip compressed and tabix indexed, you could instead use the fromFilePairs factory method to create your input channel. For example:
params.vcf_files = "./input_files/*.vcf.gz{,.tbi}"
params.results_dir = "./results"
process FILTERING {
tag { sample }
publishDir("${params.results_dir}/after_filtering", mode: 'copy')
input:
tuple val(sample), path(indexed_vcf)
output:
tuple val(sample), path("${sample}.filtered.vcf")
"""
vcftools \\
--vcf "${indexed_vcf.first()}" \\
--mac 1 \\
--minQ 20 \\
--recode \\
--recode-INFO-all \\
--out "${sample}.filtered.vcf"
"""
}
workflow {
vcf_files = Channel.fromFilePairs( params.vcf_files, checkIfExists: true )
FILTERING( vcf_files ).view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [thirsty_torricelli] DSL2 - revision: 8f69ad5638
executor > local (3)
[7d/dacad6] process > FILTERING (C) [100%] 3 of 3 ✔
[A, /path/to/work/84/f9f00097bcd2b012d3a5e105b9d828/A.filtered.vcf]
[B, /path/to/work/cb/9f6f78213f0943013990d30dbb9337/B.filtered.vcf]
[C, /path/to/work/7d/dacad693f06025a6301c33fd03157b/C.filtered.vcf]
Note that BCFtools is actively maintained and is intended as a replacement for VCFtools. In a production pipeline, BCFtools should be preferred.
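For example, a roughly equivalent filtering step using BCFtools might look something like the sketch below. The -i expression is an assumption on my part (intended to mirror --mac 1 --minQ 20 above), so double-check it against the BCFtools documentation:
process FILTERING {
    tag { sample }
    publishDir("${params.results_dir}/after_filtering", mode: 'copy')
    input:
    tuple val(sample), path(indexed_vcf)
    output:
    tuple val(sample), path("${sample}.filtered.vcf.gz")
    """
    bcftools view \\
        -i 'MAC>=1 && QUAL>=20' \\
        -Oz \\
        -o "${sample}.filtered.vcf.gz" \\
        "${indexed_vcf.first()}"
    """
}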
Here is a little example for starters. First, you should specify a unique output name in each process. Currently, after_filtering.vcf is hardcoded, so the outputs will overwrite each other once copied to the publishDir. You can do that with the baseName file attribute, as below, and store it permanently in the input file channel, with the first element being the sample name and the second one the actual file. I made an example process that just runs head on the VCF; you can then adapt it as needed for what you actually need.
#! /usr/bin/env nextflow
nextflow.enable.dsl = 2
params.input_files = "/Users/atpoint/vcf/*.vcf"
params.results_dir = "/Users/atpoint/vcf/"
// A channel that contains a map with sample name and the file itself
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
.map { it -> [it.baseName, it] }
// An example process just head-ing the vcf
process VcfHead {
publishDir("${params.results_dir}/after_filtering", mode: 'copy')
input:
tuple val(name), path(vcf_in)
output:
path("*_head.vcf")
script:
"""
head -n 1 $vcf_in > ${name}_head.vcf
"""
}
// Run it
workflow {
VcfHead(file_channel)
}
The file_channel channel looks like this if you add a .view() to it:
[one, /Users/atpoint/vcf/one.vcf]
[two, /Users/atpoint/vcf/two.vcf]
New to NextFlow here, and struggling with some basic concepts. I'm in the process of converting a set of bash scripts from a previous publication into a NextFlow workflow.
I'm converting a simple bash script (included below for convenience) that did some basic prep work and submitted a new job to the cluster scheduler for each iteration.
Ultimate question: What is the most NextFlow-like way to incorporate this script into a NextFlow workflow (preferably using the new DSL2 schema)?
Possible subquestion: Is it possible to emit a list of lists based on bash variables? I've seen ways to pass lists from workflows into processes, but not out of a process. I could print each set of parameters to a file and then emit that file, but that doesn't seem very NextFlow-like.
I would really appreciate any guidance on how to incorporate the following bash script into a NextFlow workflow. I have added comments and indicate the four variables that I need to emit as a set of parameters.
Thanks!
# Input variables. I know how to take these in.
GVCF_DIR=$1
GATK_bed=$2
RESULT_DIR=$3
CAMO_MASK_REF_PREFIX=$4
GATK_JAR=$5
# For each directory
for dir in ${GVCF_DIR}/*
do
# Do some some basic prep work defining
# variables and setting up results directory
ploidy=$(basename $dir)
repeat=$((${ploidy##*_} / 2))
result_dir="${RESULT_DIR}/genotyped_by_region/${ploidy}" # Needs to be emitted
mkdir -p $result_dir
# Create a new file with a list of files. This file
# will be used as input to the downstream NextFlow process
gvcf_list="${ploidy}.gvcfs.list" # Needs to be emitted
find $dir -name "*.g.vcf" > $gvcf_list
REF="${CAMO_MASK_REF_PREFIX}.${ploidy}.fa" # Needs to be emitted
# For each line in the $GATK_bed file where
# column 5 == repeat (defined above), submit
# a new job to the scheduler with that region.
awk "\$5 == $repeat {print \$1\":\"\$2\"-\"\$3}" $GATK_bed | \
while read region # Needs to be emitted
do
qsub combine_and_genotype.ogs \
$gvcf_list \
$region \
$result_dir \
$REF \
$GATK_JAR
done
done
What is the most NextFlow-like way to incorporate this script into a NextFlow workflow
In some cases, it is possible to incorporate, 'as-is', third-party scripts that do not need to be compiled, simply by making them executable and moving them into a folder called 'bin' in the root directory of your project repository. Nextflow automatically adds this folder to the $PATH in the execution environment.
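For example, with a project layout along these lines (my_task.sh is a hypothetical script name), the executable can then be called by name from any process script:
main.nf
bin/
    my_task.sh    # chmod +x
process run_my_task {
    input:
    path(infile)
    output:
    path('result.txt')
    """
    my_task.sh "${infile}" > result.txt
    """
}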
However, some scripts do not lend themselves to inclusion in this manner. This is especially the case if the objective is to produce a portable and reproducible workflow, which is how I interpret "the most Nextflow-like way". The objective ultimately becomes how to run each process step in isolation. Given your example, below is my take on this:
nextflow.enable.dsl=2
params.GVCF_DIRECTORY = './path/to/directories'
params.GATK_BED_FILE = './path/to/file.bed'
params.CAMO_MASK_REF_PREFIX = 'someprefix'
params.publish_dir = './results'
process combine_and_genotype {
publishDir "${params.publish_dir}/${dirname}"
container 'quay.io/biocontainers/gatk4:4.2.4.1--hdfd78af_0'
cpus 1
memory 40.GB
input:
tuple val(dirname), val(region_string), path(ref_fasta), path(gvcf_files)
output:
tuple val(dirname), val(region_string), path("full_cohort.combined.${region}.g.vcf")
script:
region = region_string.replaceAll(':', '_')
def avail_mem = task.memory ? task.memory.toGiga() : 0
def Xmx = avail_mem >= 8 ? "-Xmx${avail_mem - 1}G" : ''
def Xms = avail_mem >= 8 ? "-Xms${avail_mem.intdiv(2)}G" : ''
"""
cat << __EOF__ > "${dirname}.gvcf.list"
${gvcf_files.join('\n'+' '*4)}
__EOF__
gatk \\
--java-options "${Xmx} ${Xms} -XX:+UseSerialGC" \\
CombineGVCFs \\
-R "${ref_fasta}" \\
-L "${region_string}" \\
-O "full_cohort.combined.${region}.g.vcf" \\
-V "${dirname}.gvcf.list"
gatk \\
--java-options "${Xmx} ${Xms} -XX:+UseSerialGC" \\
GenotypeGVCFs \\
-R "${ref_fasta}" \\
-L "${region_string}" \\
-O "full_cohort.combined.${region}.vcf" \\
-V "full_cohort.combined.${region}.g.vcf" \\
-A GenotypeSummaries
"""
}
workflow {
GVCF_DIRECTORY = file( params.GVCF_DIRECTORY )
GATK_BED_FILE = file( params.GATK_BED_FILE )
Channel.fromPath( params.GATK_BED_FILE ) \
| splitCsv(sep: '\t') \
| map { row ->
tuple( row[4].toInteger(), "${row[0]}:${row[1]}-${row[2]}" )
} \
| set { regions }
Channel.fromPath( "${GVCF_DIRECTORY.toString()}/**/*.g.vcf" ) \
| map { tuple( GVCF_DIRECTORY.relativize(it).subpath(0,1).name, it ) } \
| groupTuple() \
| map { dirname, gvcf_files ->
def ploidy = dirname.replaceFirst(/^.*_/, "").toInteger()
def repeat = ploidy.intdiv(2)
def ref_fasta = file( "${params.CAMO_MASK_REF_PREFIX}.${dirname}.fa" )
tuple( repeat, dirname, ref_fasta, gvcf_files )
} \
| combine( regions, by: 0 ) \
| map { repeat, dirname, ref_fasta, gvcf_files, region ->
tuple( dirname, region, ref_fasta, gvcf_files )
} \
| combine_and_genotype
}
From the GATK docs, I couldn't actually see where the variant inputs could be a list of files. Maybe this feature was only available using an older GATK. The code above is untested.
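If it turns out that your GATK version does not accept a .list file for -V, one workaround (a sketch, not tested) is to expand the GVCFs into repeated -V arguments inside the script block:
script:
def vcf_args = gvcf_files.collect { "-V '${it}'" }.join(' ')
"""
gatk \\
    --java-options "${Xmx} ${Xms} -XX:+UseSerialGC" \\
    CombineGVCFs \\
    -R "${ref_fasta}" \\
    -L "${region_string}" \\
    ${vcf_args} \\
    -O "full_cohort.combined.${region}.g.vcf"
"""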
Also, you will need to ensure your code is indented using four spaces. The above will throw an error if tab indentation is used, or if you indent using a different number of spaces.
I have a process that generates a value. I want to forward this value into a value output channel, but I cannot seem to get it working in one "go" - I always have to write a file to the output and then define a new channel from the first:
process calculate{
input:
file div from json_ch.collect()
path "metadata.csv" from meta_ch
output:
file "dir/file.txt" into inter_ch
script:
"""
echo ${div} > alljsons.txt
mkdir dir
python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
"""
}
ch = inter_ch.map{file(it).text}
ch.view()
how do I fix this?
thanks!
best, t.
If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the 'inter_ch' channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the desire is to reduce the number of channels created, consider instead switching to the new DSL 2. This won't let you avoid writing your calculated result to a file, but it might mean you can avoid an intermediary channel.
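For instance, once the process is ported to DSL 2 (inputs and outputs declared without from/into), the mapping can be chained straight onto the process output inside the workflow block, with no named intermediary channel; a minimal sketch:
workflow {
    calculate( json_ch.collect(), meta_ch )
    calculate.out.map { it.text }.view()
}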
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope and value expressions. For example:
process foo {
input:
file fasta from 'dummy'
output:
val x into var_channel
val 'BB11' into str_channel
val "${fasta.baseName}.out" into exp_channel
script:
x = fasta.name
"""
cat $x > file
"""
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic sugar to your shell script at runtime, such that an output file is still created:
Contents of test.nf:
process test {
output:
env myval into out_ch
script:
'''
myval=$(calc.py)
'''
}
out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env
Is it possible to add multiple commands using karate.fork()? I tried adding the commands using ; or && separators, but the second command doesn't seem to be getting executed.
I am trying to cd to a particular directory before executing bash on a shell script.
* def command =
"""
function(line) {
var proc = karate.fork({ redirectErrorStream: false, useShell: true, line: line });
proc.waitSync();
karate.set('sysOut', proc.sysOut);
karate.set('sysErr', proc.sysErr);
karate.set('exitCode', proc.exitCode);
}
"""
* call command('cd ../testDirectory ; bash example.sh')
Note that instead of line, args (an array of command-line arguments) is supported, so try that as well, e.g. something like:
karate.fork({ args: ['cd', 'foo;', 'bash', 'example.sh'] })
But yes, this may need some investigation. You can always try putting all the commands in a single batch file, which should work.
Would be good if you can try the 1.0 RC since some improvements may have been added: https://github.com/intuit/karate/wiki/1.0-upgrade-guide
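If you go the batch-file route mentioned above, a sketch might look like this (wrapper.sh is a hypothetical name; adjust the path to wherever the script actually lives):
Contents of wrapper.sh (chmod +x):
#!/bin/bash
cd ../testDirectory
bash example.sh
Then reuse the same helper:
* call command('bash wrapper.sh')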
Say I have two processes.
Channel
.fromFilePairs("${params.dir}/{SPB_50k_exome_seq,FE_50k_exome_seq}.{bed,bim,fam}",size:3) {
file -> file.baseName
}
.filter { key, files -> key in params.pops }
.set { plink_data }
process pling_1 {
publishDir "${params.outputDir}/filtered"
input:
set pop, file(pl_files) from plink_data
output:
file "${pop}_filtered.{bed,fam,bim}" into pling1_results
script:
output_file = "${pop}_filtered"
base = pl_files[0].baseName
"""
plink2 \
--bfile $pop \
--hwe 0.00001 \
--make-bed \
--out ${output_file} \
"""
}
process pling_2 {
publishDir "${params.outputDir}/filtered_vcf"
input:
set file(bed), file(bim), file(fam) from pling1_results.collect()
file(fam1) from fam_for_plink2
output:
file("${base}.vcf.gz") into pling2_results
script:
base = bed.baseName
output_file = "${base}"
"""
plink2 \
--bfile $base \
--keep-fam ${params.fam}/50k_exome_seq_filtered_for_VEP_ID.txt \
--recode vcf-iid bgz --out ${output_file}
"""
}
The result of the pling_1 process is two lists of elements:
[/work/SPB_50k_exome_seq.bed, /work/SPB_50k_exome_seq.bim, /work/SPB_50k_exome_seq.fam]
[/work/FE_50k_exome_seq.bed, /work/FE_50k_exome_seq.bim, /work/FE_50k_exome_seq.fam]
Therefore, in pling_2 I am not able to process SPB_50k_exome_seq and FE_50k_exome_seq in one go. The base = bed.baseName is only picking up SPB_50k_exome_seq and omitting FE_50k_exome_seq from the second list. In this case, how can I pass both SPB_50k_exome_seq and FE_50k_exome_seq to the pling_2 process?
Any help or suggestions are much appreciated.
Thanks
After a lot of experimenting, I found the solution. Hopefully it will help someone who is looking for a solution.
The reason for the problem was that the process pling_1 produces many files as output, all of which are emitted into the same channel. Therefore, you need to break up and re-group those files into a tuple of the expected format.
For that, I used the following channel manipulation, where you can use operators like combine and flatten.
pling1_results
.collect()
.flatten()
.map { file -> tuple(file.baseName, file)}
.groupTuple(by: 0)
.map { input -> tuple(input[0], input[1][0], input[1][1], input[1][2])}
.set { pl1 }
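If you add a .view() to pl1, each emitted tuple should look something like this (using the file names from the example above; the exact order of the grouped files depends on the order in which they reach groupTuple):
[SPB_50k_exome_seq, /work/SPB_50k_exome_seq.bed, /work/SPB_50k_exome_seq.bim, /work/SPB_50k_exome_seq.fam]
[FE_50k_exome_seq, /work/FE_50k_exome_seq.bed, /work/FE_50k_exome_seq.bim, /work/FE_50k_exome_seq.fam]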
Then use the channel pl1 as input for the next process. In addition to that, you should pass all the items through the channel into the process as a single input, like the following:
set val(pop1),file(bed), file(bim), file(fam),file(fam1) from pl1.combine(fam_for_plink2)
With these two changes, the workflow now runs for each sample/input in the tuple, instead of just one.
Thanks