Nextflow script to process all files in a given directory - groovy

I have a Nextflow script that runs a couple of processes on a single VCF file. The file is named 'bos_taurus.vcf' and is located at /input_files/bos_taurus.vcf. The directory input_files/ also contains another file, 'sacharomyces_cerevisea.vcf'. I would like my Nextflow script to process both files. I was trying to use a glob pattern like ch_1 = channel.fromPath("/input_files/*.vcf"), but sadly I can't find a working solution. Any help would be really appreciated.
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// here I tried to use globbing
params.input_files = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/input_files/*.vcf"
params.results_dir = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/results"

file_channel = Channel.fromPath( params.input_files, checkIfExists: true )

// how can I make this process work on two files simultaneously?
process FILTERING {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    path(input_files)

    output:
    path("*")

    script:
    """
    vcftools --vcf ${input_files} --mac 1 --minQ 20 --recode --recode-INFO-all --out after_filtering.vcf
    """
}

Note that if your VCF files are actually bgzip compressed and tabix indexed, you could instead use the fromFilePairs factory method to create your input channel. For example:
params.vcf_files = "./input_files/*.vcf.gz{,.tbi}"
params.results_dir = "./results"

process FILTERING {

    tag { sample }

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(sample), path(indexed_vcf)

    output:
    tuple val(sample), path("${sample}.filtered.vcf")

    """
    vcftools \\
        --vcf "${indexed_vcf.first()}" \\
        --mac 1 \\
        --minQ 20 \\
        --recode \\
        --recode-INFO-all \\
        --out "${sample}.filtered.vcf"
    """
}

workflow {

    vcf_files = Channel.fromFilePairs( params.vcf_files, checkIfExists: true )

    FILTERING( vcf_files ).view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [thirsty_torricelli] DSL2 - revision: 8f69ad5638
executor > local (3)
[7d/dacad6] process > FILTERING (C) [100%] 3 of 3 ✔
[A, /path/to/work/84/f9f00097bcd2b012d3a5e105b9d828/A.filtered.vcf]
[B, /path/to/work/cb/9f6f78213f0943013990d30dbb9337/B.filtered.vcf]
[C, /path/to/work/7d/dacad693f06025a6301c33fd03157b/C.filtered.vcf]
Note that BCFtools is actively maintained and is intended as a replacement for VCFtools. In a production pipeline, BCFtools should be preferred.
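For reference, a rough and untested sketch of what an equivalent BCFtools-based step might look like is shown below; the --min-ac 1:minor and QUAL<20 expressions are meant to approximate --mac 1 and --minQ 20 above, so verify the semantics against your data before relying on them:

process FILTERING_BCFTOOLS {

    tag { sample }

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(sample), path(indexed_vcf)

    output:
    tuple val(sample), path("${sample}.filtered.vcf.gz")

    """
    bcftools view \\
        --min-ac 1:minor \\
        -e 'QUAL<20' \\
        -Oz \\
        -o "${sample}.filtered.vcf.gz" \\
        "${indexed_vcf.first()}"
    """
}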

Here is a little example for starters. First, you should specify a unique output name in each process. Currently, after_filtering.vcf is hardcoded, so the outputs will overwrite each other once copied to the publishDir. You can do that with baseName, as below, and store it in the input file channel, the first element being the sample name and the second the actual file. I made an example process that just runs head on the vcf, which you can then adapt to what you actually need.
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

params.input_files = "/Users/atpoint/vcf/*.vcf"
params.results_dir = "/Users/atpoint/vcf/"

// A channel that emits [sample name, file] pairs
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
                      .map { it -> [it.baseName, it] }

// An example process just head-ing the vcf
process VcfHead {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(name), path(vcf_in)

    output:
    path("*_head.vcf")

    script:
    """
    head -n 1 $vcf_in > ${name}_head.vcf
    """
}

// Run it
workflow {
    VcfHead(file_channel)
}
The file_channel channel looks like this if you add a .view() to it:
[one, /Users/atpoint/vcf/one.vcf]
[two, /Users/atpoint/vcf/two.vcf]

Related

Nextflow capture output file by partial pattern

I've got a Nextflow process that looks like:
process my_app {

    publishDir "${outdir}/my_app", mode: params.publish_dir_mode

    input:
    path input_bam
    path input_bai
    val output_bam
    val max_mem
    val threads
    val container_home
    val outdir

    output:
    tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), emit: tuple_ch

    shell:
    '''
    my_script.sh \
        !{input_bam} \
        !{output_bam} \
        !{max_mem} \
        !{threads}

    output_prefix=$(echo !{output_bam} | sed "s#.bam##")
    '''
}
This process only captures the two .bam/.bai files, but my_script.sh also creates other .vcf files that are not being published to the output directory.
I tried the following in order to retrieve the files created by the script, but without success:
output:
tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), path("${output_prefix}.*.vcf"), emit: mt_validation_simulation_tuple_ch
but in the logs I can see:
Error executing process caused by:
Missing output file(s) `null.*.vcf` expected by process `my_app_wf:my_app`
What am I missing? Could you help me? Thank you in advance!
The problem is that the output_prefix has only been defined inside of the shell block. If all you need for your output prefix is the file's basename (without extension), you can just use a regular script block to check file attributes. Note that variables defined in the script block (but outside the command string) are global (within the process scope) unless they're defined using the def keyword:
process my_app {

    ...

    output:
    tuple val(output_prefix), path("${output_bam}{,.bai}"), path("${output_prefix}.*.vcf")

    script:
    output_prefix = output_bam.baseName

    """
    my_script.sh \\
        "${input_bam}" \\
        "${output_bam}" \\
        "${max_mem}" \\
        "${threads}"
    """
}
If the process creates the BAM (and index) it might even be possible to refactor away the multiple input channels if an output prefix can be supplied up front. Usually this makes more sense, but I don't have enough details to say one way or the other. The following might suffice as an example; you may need/prefer to combine/change the output declaration(s) to suit, but hopefully you get the idea:
params.publish_dir = './results'
params.publish_mode = 'copy'

process my_app {

    publishDir "${params.publish_dir}/my_app", mode: params.publish_mode

    cpus 1
    memory 1.GB

    input:
    tuple val(prefix), path(indexed_bam)

    output:
    tuple val(prefix), path("${prefix}.bam{,.bai}"), emit: bam_files
    tuple val(prefix), path("${prefix}.*.vcf"), emit: vcf_files

    """
    my_script.sh \\
        "${indexed_bam.first()}" \\
        "${prefix}.bam" \\
        "${task.memory.toGiga()}G" \\
        "${task.cpus}"
    """
}
Note that indexed_bam here expects a tuple in the form: tuple( bam, bai )
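If you don't already have such a channel, one way to build it (a minimal sketch; params.bam_files is a hypothetical parameter, and it assumes each BAM sits next to its index and shares the same prefix) is again fromFilePairs:

params.bam_files = './path/to/*.bam{,.bai}'

workflow {

    indexed_bams = Channel.fromFilePairs( params.bam_files, checkIfExists: true )

    // emits items like: [ 'sampleA', [ sampleA.bam, sampleA.bam.bai ] ]
    my_app( indexed_bams )
}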

The most Nextflow-like (DSL2) way to incorporate a former bash scheduler submission script into a Nextflow workflow

New to Nextflow here, and struggling with some basic concepts. I'm in the process of converting a set of bash scripts from a previous publication into a Nextflow workflow.
I'm converting a simple bash script (included below for convenience) that did some basic prep work and submitted a new job to the cluster scheduler for each iteration.
Ultimate question: What is the most NextFlow-like way to incorporate this script into a NextFlow workflow (preferably using the new DSL2 schema)?
Possible subquestion: Is it possible to emit a list of lists based on bash variables? I've seen ways to pass lists from workflows into processes, but not out of a process. I could print each set of parameters to a file and then emit that file, but that doesn't seem very Nextflow-like.
I would really appreciate any guidance on how to incorporate the following bash script into a Nextflow workflow. I have added comments and indicated the four variables that I need to emit as a set of parameters.
Thanks!
# Input variables. I know how to take these in.
GVCF_DIR=$1
GATK_bed=$2
RESULT_DIR=$3
CAMO_MASK_REF_PREFIX=$4
GATK_JAR=$5

# For each directory
for dir in ${GVCF_DIR}/*
do
    # Do some basic prep work defining
    # variables and setting up the results directory
    ploidy=$(basename $dir)
    repeat=$((${ploidy##*_} / 2))
    result_dir="${RESULT_DIR}/genotyped_by_region/${ploidy}"   # Needs to be emitted
    mkdir -p $result_dir

    # Create a new file with a list of files. This file
    # will be used as input to the downstream Nextflow process
    gvcf_list="${ploidy}.gvcfs.list"   # Needs to be emitted
    find $dir -name "*.g.vcf" > $gvcf_list

    REF="${CAMO_MASK_REF_PREFIX}.${ploidy}.fa"   # Needs to be emitted

    # For each line in the $GATK_bed file where
    # column 5 == repeat (defined above), submit
    # a new job to the scheduler with that region.
    awk "\$5 == $repeat {print \$1\":\"\$2\"-\"\$3}" $GATK_bed | \
    while read region   # Needs to be emitted
    do
        qsub combine_and_genotype.ogs \
            $gvcf_list \
            $region \
            $result_dir \
            $REF \
            $GATK_JAR
    done
done
What is the most Nextflow-like way to incorporate this script into a Nextflow workflow?
In some cases, it is possible to incorporate third-party scripts "as-is" (i.e. those that do not need to be compiled) by making them executable and moving them into a folder called 'bin' in the root directory of your project repository. Nextflow automatically adds this folder to the $PATH in the execution environment.
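As a minimal sketch of that first approach (the script and parameter names below are hypothetical): an executable placed in bin/ can be invoked from a process like any other command on the PATH:

// Project layout (hypothetical):
//   main.nf
//   bin/combine_and_genotype.sh   (chmod +x)

process combine_and_genotype {

    input:
    tuple val(dirname), path(gvcf_list), val(region), path(ref_fasta)

    script:
    """
    combine_and_genotype.sh "${gvcf_list}" "${region}" "${ref_fasta}"
    """
}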
However, some scripts do not lend themselves to inclusion in this manner. This is especially the case if the objective is to produce a portable and reproducible workflow, which is how I interpret "the most Nextflow-like way". The objective ultimately becomes how to run each process step in isolation. Given your example, below is my take on this:
nextflow.enable.dsl=2

params.GVCF_DIRECTORY = './path/to/directories'
params.GATK_BED_FILE = './path/to/file.bed'
params.CAMO_MASK_REF_PREFIX = 'someprefix'
params.publish_dir = './results'

process combine_and_genotype {

    publishDir "${params.publish_dir}/${dirname}"

    container 'quay.io/biocontainers/gatk4:4.2.4.1--hdfd78af_0'

    cpus 1
    memory 40.GB

    input:
    tuple val(dirname), val(region_string), path(ref_fasta), path(gvcf_files)

    output:
    tuple val(dirname), val(region_string), path("full_cohort.combined.${region}.g.vcf")

    script:
    region = region_string.replaceAll(':', '_')

    def avail_mem = task.memory ? task.memory.toGiga() : 0
    def Xmx = avail_mem >= 8 ? "-Xmx${avail_mem - 1}G" : ''
    def Xms = avail_mem >= 8 ? "-Xms${avail_mem.intdiv(2)}G" : ''

    """
    cat << __EOF__ > "${dirname}.gvcf.list"
    ${gvcf_files.join('\n'+' '*4)}
    __EOF__

    gatk \\
        --java-options "${Xmx} ${Xms} -XX:+UseSerialGC" \\
        CombineGVCFs \\
        -R "${ref_fasta}" \\
        -L "${region_string}" \\
        -O "full_cohort.combined.${region}.g.vcf" \\
        -V "${dirname}.gvcf.list"

    gatk \\
        --java-options "${Xmx} ${Xms} -XX:+UseSerialGC" \\
        GenotypeGVCFs \\
        -R "${ref_fasta}" \\
        -L "${region_string}" \\
        -O "full_cohort.combined.${region}.vcf" \\
        -V "full_cohort.combined.${region}.g.vcf" \\
        -A GenotypeSummaries
    """
}

workflow {

    GVCF_DIRECTORY = file( params.GVCF_DIRECTORY )
    GATK_BED_FILE = file( params.GATK_BED_FILE )

    Channel.fromPath( params.GATK_BED_FILE ) \
        | splitCsv(sep: '\t') \
        | map { row ->
            tuple( row[4].toInteger(), "${row[0]}:${row[1]}-${row[2]}" )
        } \
        | set { regions }

    Channel.fromPath( "${GVCF_DIRECTORY.toString()}/**/*.g.vcf" ) \
        | map { tuple( GVCF_DIRECTORY.relativize(it).subpath(0,1).name, it ) } \
        | groupTuple() \
        | map { dirname, gvcf_files ->
            def ploidy = dirname.replaceFirst(/^.*_/, "").toInteger()
            def repeat = ploidy.intdiv(2)
            def ref_fasta = file( "${params.CAMO_MASK_REF_PREFIX}.${dirname}.fa" )

            tuple( repeat, dirname, ref_fasta, gvcf_files )
        } \
        | combine( regions, by: 0 ) \
        | map { repeat, dirname, ref_fasta, gvcf_files, region ->
            tuple( dirname, region, ref_fasta, gvcf_files )
        } \
        | combine_and_genotype
}
From the GATK docs, I couldn't actually see where the variant inputs could be a list of files. Maybe this feature was only available using an older GATK. The code above is untested.
Also, you will need to ensure your script block is indented using four spaces. The above will throw an error if tab indentation is used, or if you indent using a different number of spaces.
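For completeness, a hypothetical invocation overriding the params above might look like this (all paths are placeholders):

nextflow run main.nf \
    --GVCF_DIRECTORY '/path/to/gvcf_directories' \
    --GATK_BED_FILE '/path/to/regions.bed' \
    --CAMO_MASK_REF_PREFIX '/path/to/camo_refs/someprefix'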

How to forward the value of a variable created in the script in Nextflow to a value output channel?

I have a process that generates a value. I want to forward this value into a value output channel, but I cannot seem to get it working in one "go" - I always have to write a file to the output and then define a new channel from the first:
process calculate {

    input:
    file div from json_ch.collect()
    path "metadata.csv" from meta_ch

    output:
    file "dir/file.txt" into inter_ch

    script:
    """
    echo ${div} > alljsons.txt
    mkdir dir
    python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}

ch = inter_ch.map { file(it).text }
ch.view()
How do I fix this?
Thanks!
Best, t.
If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the 'inter_ch' channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the desire is to reduce the number of channels created, consider instead switching to the new DSL 2. This won't let you avoid writing your calculated result to a file, but it might mean you can avoid an intermediary channel.
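For instance, a minimal DSL2 sketch (assuming the process definition above is converted by dropping the from/into keywords) would let you chain the map directly onto the process output:

workflow {
    calculate( json_ch.collect(), meta_ch )
        .map { it.text }
        .view()
}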
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope and value expressions. For example:
process foo {

    input:
    file fasta from 'dummy'

    output:
    val x into var_channel
    val 'BB11' into str_channel
    val "${fasta.baseName}.out" into exp_channel

    script:
    x = fasta.name
    """
    cat $x > file
    """
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic sugar to your shell script at runtime, such that an output file is still created:
Contents of test.nf:
process test {

    output:
    env myval into out_ch

    script:
    '''
    myval=$(calc.py)
    '''
}
out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env
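For reference, a minimal (untested) DSL2 sketch of the same env approach simply drops the into declaration and views the process output from a workflow block:

process test {

    output:
    env myval

    script:
    '''
    myval=$(calc.py)
    '''
}

workflow {
    test().view()
}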

How to pass outputs from two different lists to a consecutive process in Nextflow

Say I have two processes.
Channel
    .fromFilePairs("${params.dir}/{SPB_50k_exome_seq,FE_50k_exome_seq}.{bed,bim,fam}", size: 3) {
        file -> file.baseName
    }
    .filter { key, files -> key in params.pops }
    .set { plink_data }

process pling_1 {

    publishDir "${params.outputDir}/filtered"

    input:
    set pop, file(pl_files) from plink_data

    output:
    file "${pop}_filtered.{bed,fam,bim}" into pling1_results

    script:
    output_file = "${pop}_filtered"
    base = pl_files[0].baseName
    """
    plink2 \
        --bfile $pop \
        --hwe 0.00001 \
        --make-bed \
        --out ${output_file} \
    """
}

process pling_2 {

    publishDir "${params.outputDir}/filtered_vcf"

    input:
    set file(bed), file(bim), file(fam) from pling1_results.collect()
    file(fam1) from fam_for_plink2

    output:
    file("${base}.vcf.gz") into pling2_results

    script:
    base = bed.baseName
    output_file = "${base}"
    """
    plink2 \
        --bfile $base \
        --keep-fam ${params.fam}/50k_exome_seq_filtered_for_VEP_ID.txt \
        --recode vcf-iid bgz --out ${output_file}
    """
}
The result of the pling_1 process is two lists of elements:
[/work/SPB_50k_exome_seq.bed, /work/SPB_50k_exome_seq.bim,/work/SPB_50k_exome_seq.fam]
[/work/FE_50k_exome_seq.bed, /work/FE_50k_exome_seq.bim,/work/FE_50k_exome_seq.fam]
Therefore, in pling_2 I am not able to process SPB_50k_exome_seq and FE_50k_exome_seq in one go. The base = bed.baseName is only taking SPB_50k_exome_seq and omitting FE_50k_exome_seq from the second list. In this case, how can I pass both SPB_50k_exome_seq and FE_50k_exome_seq to the pling_2 process?
Any help or suggestions are much appreciated.
Thanks
After a lot of experimenting, I found the solution. Hopefully this helps someone who is looking for one.
The reason for the problem was that the process pling_1 produces several files as output, all of which are emitted into the same channel. Therefore, you need to break up and re-group those files into tuples of the expected format.
For that, I used the following channel definition, which uses operators like collect, flatten and groupTuple.
pling1_results
    .collect()
    .flatten()
    .map { file -> tuple(file.baseName, file) }
    .groupTuple(by: 0)
    .map { input -> tuple(input[0], input[1][0], input[1][1], input[1][2]) }
    .set { pl1 }
Then use the channel pl1 as input for the next process. In addition, you should pass all the items into the process as a single input, like the following:
set val(pop1), file(bed), file(bim), file(fam), file(fam1) from pl1.combine(fam_for_plink2)
With these two changes, the workflow now runs for each sample/input in the tuple instead of just one.
Thanks
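A further alternative worth noting (a sketch only, not part of the answer above): if pling_1 emits the sample key together with its files as a tuple, the flatten/groupTuple regrouping isn't needed at all, because every item stays keyed per sample:

process pling_1 {

    publishDir "${params.outputDir}/filtered"

    input:
    set pop, file(pl_files) from plink_data

    output:
    // keep the sample key with the files
    set val(pop), file("${pop}_filtered.{bed,bim,fam}") into pling1_results

    script:
    """
    plink2 --bfile $pop --hwe 0.00001 --make-bed --out ${pop}_filtered
    """
}

// pling_2 can then declare a single keyed input, e.g.:
// set val(pop), file(pl_files), file(fam1) from pling1_results.combine(fam_for_plink2)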

snakemake - replacing wildcards in input directive by anonymous function

I am writing a Snakemake workflow that will run a bioinformatics pipeline for several input samples. These input files (two for each analysis, one with the partial string match R1 and the second with the partial string match R2) start with a pattern and end with the extension .fastq.gz. Eventually I want to perform multiple operations, but for this example I just want to align the fastq reads against a reference genome using bwa mem. So for this example my input file is NIPT-N2002394-LL_S19_R1_001.fastq.gz and I want to generate NIPT-N2002394-LL.bam (see the code below specifying the directories where the input and output are).
My config.yaml file looks like so:
# Run_ID
run: "200311_A00154_0454_AHHHKMDRXX"
# Base directory: the analysis directory from which I will fetch the samples
bd: "/nexusb/nipt/"
# Define the prefix
# will be used to subset the folders in bd
prefix: "NIPT"
# Reference:
ref: "/nexus/bhinckel/19/ONT_projects/PGD_breakpoint/ref_hg19_local/hg19_chr1-y.fasta"
And below is my Snakefile:
import os
import re

#############
# config file
#############
configfile: "config.yaml"

#######################################
# Parsing variables from config.yaml
#######################################
RUN = config['run']
BD = config['bd']
PREFIX = config['prefix']
FQDIR = f'/nexusb/Novaseq/{RUN}/Unaligned/'
BASEDIR = BD + RUN

SAMPLES = [sample for sample in os.listdir(BASEDIR) if sample.startswith(PREFIX)]
# explanation: in BASEDIR I have multiple subdirectories. The names of the subdirectories
# starting with PREFIX will be the name of the elements I want to have in the list SAMPLES,
# which eventually shall be my {sample} wildcard

#############
# RULES
#############
rule all:
    input:
        expand("aligned/{sample}.bam", sample = SAMPLES)

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"
But I am getting:
Building DAG of jobs...
WildcardError in line 55 of /nexusb/nipt/200311_A00154_0454_AHHHKMDRXX/testMetrics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
's'
When calling snakemake -np
I believe my error lies in the definitions of R1 and R2 in the input directive. I find it puzzling because according to the official documentation snakemake should interpret any wildcard as the regex .+. But it is not doing that for sample NIPT-PearlPPlasma-05-PPx, whose R1 and R2 should be NIPT-PearlPPlasma-05-PPx_S5_R1_001.fastq.gz and NIPT-PearlPPlasma-05-PPx_S5_R2_001.fastq.gz, respectively.
Take another look at the Snakemake tutorial on how input is inferred from output; anyway, I think the problem lies in this piece of code:
output:
    expand("aligned/{sample}.bam", sample = SAMPLES)
And needs to be changed into
output:
    "aligned/{sample}.bam"
What you had didn't work because expand("aligned/{sample}.bam", sample = SAMPLES) basically becomes a concrete list like ["aligned/sample0.bam", "aligned/sample1.bam"]. When you remove the expand, you only give a "description" of what the output should look like, and thus Snakemake can infer the wildcards and the input.
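To make that concrete, expand() simply substitutes the given values into the pattern, so with two (hypothetical) sample names it produces a fixed list rather than a pattern:

expand("aligned/{sample}.bam", sample=["NIPT-A", "NIPT-B"])
# -> ["aligned/NIPT-A.bam", "aligned/NIPT-B.bam"]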
edit:
It's difficult to test since I don't have the actual files, but you should do something like this. It won't work if multiple S-thingies exist.
def get_reads(wildcards):
    R1 = FQDIR + f"{wildcards.sample}_S{{s}}_R1_001.fastq.gz"
    R2 = FQDIR + f"{wildcards.sample}_S{{s}}_R2_001.fastq.gz"
    globbed = glob_wildcards(R1)
    R1, R2 = expand([R1, R2], s=globbed.s)
    return {"R1": R1, "R2": R2}

rule bwa_map:
    input:
        unpack(get_reads),
        REF = config['ref']
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"
The problem is here:
rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
Your output clearly defines a pattern where {sample} is a wildcard. When Snakemake builds the DAG and finds that another rule requires a file matching this pattern, it sets a concrete value for wildcards.sample. At that moment all the inputs must be defined, but you are introducing one more level of indirection: the wildcard {s}, which is not defined.
The value of {s} has to be inferable from the output. If you can resolve it at design time, substitute it with the concrete values; otherwise you may use the checkpoint feature of Snakemake.
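As a sketch of the design-time option (untested, and assuming exactly one S-number exists per sample), {s} can be resolved once at parse time with glob_wildcards and then looked up in an input function:

# Map each sample to its S-number once, at parse time
globbed = glob_wildcards(FQDIR + "{sample}_S{s}_R1_001.fastq.gz")
S_NUMBER = dict(zip(globbed.sample, globbed.s))

rule bwa_map:
    input:
        REF = config['ref'],
        R1 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUMBER[wc.sample]}_R1_001.fastq.gz",
        R2 = lambda wc: FQDIR + f"{wc.sample}_S{S_NUMBER[wc.sample]}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2} | samtools view -Sb - > {output}"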
