Multiple inputs and outputs in a single rule Snakemake file

I am getting started with Snakemake and I have a very basic question whose answer I couldn't find in the Snakemake tutorial.
I want to create a single-rule Snakefile to download multiple files on Linux one by one.
'expand' cannot be used in the output because the files need to be downloaded one by one, and wildcards cannot be used because it is the target rule.
The only way that comes to mind is something like this, which doesn't work properly. I cannot figure out how to send the downloaded items to a specific directory with specific names such as 'downloaded_files.dwn' using {output}, so that they can be used in later steps:
links=[link1,link2,link3,....]
rule download:
    output:
        "outdir/{downloaded_file}.dwn"
    params:
        shellCallFile='callscript',
    run:
        callString=''
        for item in links:
            callString+='wget str(item) -O '+{output}+'\n'
        call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
        call(callString, shell=True)
I would appreciate any hint on how this should be solved and on which part of Snakemake I didn't understand well.

Here is a commented example that could help you solve your problem:
# Create some way of associating output files with links
# The output file names will be built from the keys: "chain_{key}.gz"
# One could probably directly use output file names as keys
links = {
    "1" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}
rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
        # Note that we don't need to use {output} in the "run" or "shell" part.
        # This list will be used if we later add rules
        # that use the files generated by the present rule.
        expand("outdir/chain_{n}.gz", n=links.keys())
    run:
        # The sort is there to ensure the files are in the 1, 2, 3 order.
        # We could use an OrderedDict if we wanted an arbitrary order.
        for link_num in sorted(links.keys()):
            shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
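To see which file list the expand(...) call in the output section stands for, the same list can be built in plain Python. This is only an illustration of the resulting value, not Snakemake's own implementation, and expand_chain_names is a hypothetical helper name:

```python
# Plain-Python illustration of the list that
# expand("outdir/chain_{n}.gz", n=links.keys()) evaluates to.
# expand_chain_names is a hypothetical helper, not part of Snakemake.

links = {
    "1": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}


def expand_chain_names(pattern, keys):
    """Fill the {n} placeholder of pattern once per key, in sorted order."""
    return [pattern.format(n=key) for key in sorted(keys)]


outputs = expand_chain_names("outdir/chain_{n}.gz", links.keys())
print(outputs)
```

Because the rule declares all three files as its output, Snakemake treats the whole download as a single job.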
And here is another way of doing it, which uses arbitrary names for the downloaded files and uses output (although a bit artificially):
links = [
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")]
rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
        ["outdir/{f}".format(f=filename) for (filename, _) in links]
    run:
        for i in range(len(links)):
            # output is a list, so we can access its items by index
            shell("wget {link} -O {chain_file}".format(
                link=links[i][1], chain_file=output[i]))
        # using a direct loop over the pairs (filename, link)
        # could be considered "cleaner"
        # for (filename, link) in links:
        #     shell("wget {link} -O outdir/{filename}".format(
        #         link=link, filename=filename))
An example where the three downloads can be done in parallel using snakemake -j 3:
# To use os.path.join,
# which is more robust than manually writing the separator.
import os
# Association between output files and source links
links = {
    "foo_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}
# Make this association accessible via a function of wildcards
def chainfile2link(wildcards):
    return links[wildcards.chainfile]
# First rule will drive the rest of the workflow
rule all:
    input:
        # expand generates the list of the final files we want
        expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())
rule download:
    output:
        # We inform snakemake what this rule will generate
        os.path.join("outdir", "{chainfile}")
    params:
        # using a function of wildcards in params
        link = chainfile2link,
    shell:
        """
        wget {params.link} -O {output}
        """

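The lookup done by chainfile2link in the parallel example can be tried outside Snakemake. The sketch below is an illustration only: it uses a SimpleNamespace as a stand-in for the wildcards object that Snakemake would pass in, which is an assumption made for the demonstration.

```python
from types import SimpleNamespace

# Same association between output files and source links as in the example.
links = {
    "foo_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz",
}


def chainfile2link(wildcards):
    return links[wildcards.chainfile]


# Stand-in (assumption) for the wildcards Snakemake derives when the
# target "outdir/foo_chain.gz" is matched against "outdir/{chainfile}".
fake_wildcards = SimpleNamespace(chainfile="foo_chain.gz")

link = chainfile2link(fake_wildcards)
command = "wget {link} -O outdir/{chainfile}".format(
    link=link, chainfile=fake_wildcards.chainfile)
print(command)
```

Each requested file gets its own wildcard value and therefore its own job, which is what lets snakemake -j 3 run the three downloads side by side.
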
Related

Using regex and cp: cannot stat

I am trying to copy files over from an old file structure where data are stored in folders with improper names to a new (better) structure, but as there are 671 participants I need to copy, I want to use regex in order to streamline it (each participant has files saved in the same format). However, I keep getting a cp: cannot stat error message saying that no file/directory exists. I had assumed this meant that I had missed a / or put "" in the wrong location but I cannot see anything in the code that would suggest it.
My code is as follows (I have added a lot of comments so that other collaborators can understand it):
#!/bin/bash
# This code below copies the initial .nii file.
# These data are copied into my Trial Participant folders.
# Create a variable called parent_folder1 that describes initial mask directory e.g. folders for each participant which contains the files.
parent_folder1="/this/path/here/contains/Trial_Participants"
# The original folders are named according to ClinicalID_scandate_randomdigits e.g. folder 1234567890_20000101_987654.
# The destination folders are named according to TrialIDNumber e.g. LBC100001.
# The .nii files are saved under TrialIDNumber_1_ICV.nii.gz e.g. LBC1000001_1_ICV.nii.gz.
# These files need copied over from their directories into the Trial Participant folders, using the for loop function.
# The * symbol is used as a wildcard.
for i in $(ls -1d "${parent_folder1}"/*_20*); do
    lbc=$(ls ${i}/finalMasks/*ICV* | sed 's/^.*\///'); lbc=${lbc:0:9}
    cp "${parent_folder1}/${i}"/finalMasks/*_1_ICV.nii.gz /this/path/is/the/destination/path/${lbc}/
done
# This code uses regular expression to find the initial ICV file.
# ls asks for a list, -1 makes each new folder on a new line, d is for directory.
# *_20* refers to the name of the folders. The * covers the ClinicalID, _20* refers to the scan date and random digits.
# I have no idea what the | sed 's/^.*\///' does, but I think it strips the path.
# lbc=${lbc:0:9} is used to keep the ID numbers.
# cp copies the files that are named under TrialIDNumber(replaced by *)_1_ICV.nii.gz to the destination under the respective folder.
So after a bit of fooling around, I changed the code a lot (took out sed as it confuses me), and came up with this that worked. Thanks to those who commented!
# Create a variable called parent_folder1 that describes initial mask directory.
parent_folder1="/original/path/here"
# Iterate over directories in parent_folder1
for i in $(ls -1d "${parent_folder1}"/*_20*); do
    # Extract the base name of the file in the finalMasks directory
    lbc=$(basename $(ls "${i}"/finalMasks/*ICV*))
    # Extract the LBC number from the file name
    lbc=${lbc:0:9}
    # Copy the file to the specific folder
    cp "${i}"/finalMasks/${lbc}_1_ICV.nii.gz /destination/path/here/${lbc}/
done
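The working version differs from the first attempt mainly in the cp line: ls -1d "${parent_folder1}"/*_20* already prints full paths, so prefixing ${i} with ${parent_folder1} again produced a path that does not exist, hence "cannot stat". For comparison, here is a minimal Python sketch of the same copy. The function name copy_icv_masks and the id_length parameter are hypothetical, and unlike the shell version it creates missing destination folders, which is an assumption about what is wanted:

```python
import shutil
from pathlib import Path


def copy_icv_masks(source_root, dest_root, id_length=9):
    """Copy each *_1_ICV.nii.gz file into a folder named after its trial ID.

    Returns the list of destination paths that were written.
    """
    copied = []
    # Path.glob yields paths that already include source_root,
    # so source_root must not be prepended a second time.
    for scan_dir in sorted(Path(source_root).glob("*_20*")):
        for mask in sorted((scan_dir / "finalMasks").glob("*_1_ICV.nii.gz")):
            trial_id = mask.name[:id_length]  # same idea as ${lbc:0:9}
            target_dir = Path(dest_root) / trial_id
            target_dir.mkdir(parents=True, exist_ok=True)
            copied.append(Path(shutil.copy2(mask, target_dir)))
    return copied
```

It would be called as copy_icv_masks("/original/path/here", "/destination/path/here").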

Snakemake: create multiple wildcards for the same argument

I am trying to run GenotypeGVCFs on many vcf files. The command line requires every single vcf file to be listed, as in:
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R my.fasta \
    -V bob.vcf \
    -V smith.vcf \
    -V kelly.vcf \
    -o {output.out}
How can I do this in Snakemake? This is my code, but I do not know how to create wildcards for -V.
workdir: "/path/to/workdir/"
SAMPLES=["bob","smith","kelly"]
print (SAMPLES)
rule all:
    input:
        "all_genotyped.vcf"
rule genotype_GVCFs:
    input:
        lambda w: "-V" + expand("{sample}.vcf", sample=SAMPLES)
    params:
        ref="my.fasta"
    output:
        out="all_genotyped.vcf"
    shell:
        """
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R {params.ref} {input} -o {output.out}
        """
You are putting the cart before the horse. Wildcards are needed for rule generalization: you can define a pattern for a rule where wildcards are used to define generic parts. In your example there are no patterns: everything is defined by the value of SAMPLES. This is not a recommended way to use Snakemake; the pipeline should be defined by the filesystem: which files are present on your disk.
By the way, your code will not work, as the input shall define the list of filenames, while in your example you are (incorrectly) trying to define the strings like "-V filename".
So, you have the output: "all_genotyped.vcf". You have the input: ["bob.vcf", "smith.vcf", "kelly.vcf"]. You don't even need to use a lambda here, as the input doesn't depend on any wildcard. So, you have:
rule genotype_GVCFs:
    input:
        expand("{sample}.vcf", sample=SAMPLES)
    output:
        "all_genotyped.vcf"
    ...
Actually, you don't even need an input section. If you know for sure that the files from the SAMPLES list exist, you may skip it.
The values for -V can be defined in params:
rule genotype_GVCFs:
    #input:
    #    expand("{sample}.vcf", sample=SAMPLES)
    output:
        "all_genotyped.vcf"
    params:
        ref = "my.fasta",
        vcf = expand("-V {sample}.vcf", sample=SAMPLES)
    shell:
        """
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R {params.ref} {params.vcf} -o {output}
        """
This should solve your issue, but I would advise you to rethink your solution. Hard-coding the SAMPLES list is a code smell. Alternatively: do you really need Snakemake if you have all dependencies defined already?
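As a quick check of what the repeated -V arguments should look like, the command line can be assembled in plain Python. This is a sketch under the assumption that every sample file is named <sample>.vcf; variant_args is a hypothetical helper name:

```python
import shlex

SAMPLES = ["bob", "smith", "kelly"]


def variant_args(samples, suffix=".vcf"):
    """Return ['-V', 'bob.vcf', '-V', 'smith.vcf', ...] for the given samples."""
    args = []
    for sample in samples:
        args.extend(["-V", sample + suffix])
    return args


command = (
    ["java", "-jar", "GenomeAnalysisTK.jar", "-T", "GenotypeGVCFs", "-R", "my.fasta"]
    + variant_args(SAMPLES)
    + ["-o", "all_genotyped.vcf"]
)
print(shlex.join(command))
```

The printed line has one -V flag per sample file, which is the shape the GenotypeGVCFs call in the question asks for.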

snakemake wildcard in input files

I am very new to snakemake and I am trying to create a merged.fastq for each sample. Following is my Snakefile.
configfile: "config.yaml"
print(config['samples'])
print(config['ss_files'])
print(config['pass_files'])
rule all:
    input:
        expand("{sample}/data/genome_assembly/medaka/medaka.fasta", sample=config["samples"]),
        expand("{pass_file}", pass_file=config["pass_files"]),
        expand("{ss_file}", ss_file=config["ss_files"])
rule merge_fastq:
    input:
        directory("{pass_file}")
    output:
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"
where, 'samples' is a list of sample names,
'pass_files' is a list of directory path to fastq_pass folder which contains small fastq files
I am trying to merge small fastq files to a large merged.fastq for each sample.
I am getting the following,
Wildcards in input files cannot be determined from output files:
'pass_file'
as the error.
Each wildcard in the input section must have a corresponding wildcard (with the same name) in the output section. That is how Snakemake works: when Snakemake tries to construct the DAG of jobs and finds that it needs a certain file, it looks at the output section of each rule and checks whether this rule can produce the required file. This is how Snakemake assigns values to the wildcards in the output section. Every wildcard in the other sections must match one of the wildcards in the output, and that is how the input gets concrete filenames.
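The matching step described here can be imitated in a few lines of Python. This is a simplified sketch, not Snakemake's actual code: match_wildcards is a hypothetical helper, and it assumes every wildcard matches any non-empty string.

```python
import re


def match_wildcards(pattern, target):
    """Return a dict of wildcard values if target fits pattern, else None."""
    regex = ""
    position = 0
    for placeholder in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text between wildcards must match exactly.
        regex += re.escape(pattern[position:placeholder.start()])
        # Each {name} becomes a named group matching any non-empty text.
        regex += "(?P<{}>.+)".format(placeholder.group(1))
        position = placeholder.end()
    regex += re.escape(pattern[position:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


# The output pattern of a rule gives values to its wildcards ...
output_pattern = "{sample}/data/merged.fastq.gz"
values = match_wildcards(output_pattern, "sampleA/data/merged.fastq.gz")
print(values)

# ... and any input wildcard without a value cannot be resolved.
input_pattern = "{pass_file}"
unresolved = set(re.findall(r"\{(\w+)\}", input_pattern)) - set(values)
print(unresolved)
```

The second print shows pass_file left over, which corresponds to the wildcard named in the error message from the question.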
Now let's look at your rule merge_fastq:
rule merge_fastq:
    input:
        directory("{pass_file}")
    output:
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"
The only wildcard that can get its value is the {sample}. The {pass_file} and {id} are dangling.
As I see it, you are trying to merge files that are not known at design time. Take a look at dynamic files, checkpoints, and using a function in the input.
The rest of your Snakefile is hard to understand. For example I don't see how you specify the files that match this pattern: "{sample}/data/merged.fastq.gz".
Update:
Let's say I have a
directory(/home/other_computer/jobs/data/<sample_name>/*.fastq.gz)
which is my input and output is
(/result/merged/<sample_name>/merged.fastq.gz). What I tried is having
the first path as input: {"pass_files"} (this comes from my config
file) and output : "result/merged/{sample}/merged.fastq.gz"
First, let's simplify the task a little bit and replace the {pass_file} with the hardcoded path. You have 2 degrees of freedom: the <sample_name> and the unknown files in the /home/other_computer/jobs/data/<sample_name>/ folder. The <sample_name> is a good candidate for becoming a wildcard, as this name can be derived from the target file. The unknown number of files *.fastq.gz doesn't even require any Snakemake constructs as this can be expressed using a shell command.
rule merge_fastq:
    output:
        "/result/merged/{sample_name}/merged.fastq.gz"
    shell:
        "cat /home/other_computer/jobs/data/{wildcards.sample_name}/*.fastq.gz > {output}"

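What the cat command in that rule does can also be written in Python, since gzip files may be joined byte for byte and still decompress as one stream. The sketch below is only an illustration; merge_fastq is a hypothetical function name and the paths in the usage line are the ones from the answer.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq(sample_dir, merged_path):
    """Concatenate every *.fastq.gz in sample_dir into merged_path.

    The files are joined as raw bytes, like `cat a.gz b.gz > merged.gz`.
    Returns the list of input files in the order they were appended.
    """
    parts = sorted(Path(sample_dir).glob("*.fastq.gz"))
    merged_path = Path(merged_path)
    merged_path.parent.mkdir(parents=True, exist_ok=True)
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
    return parts
```

For one sample it could be called as merge_fastq("/home/other_computer/jobs/data/sampleA", "/result/merged/sampleA/merged.fastq.gz"), where sampleA is a made-up sample name.
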
How to apply Praat script to an audio file?

I'm trying to change the formants of an audio file with Praat in Colab. I found a script that does that, its code, and the code for calculating formants. I installed Praat:
!sudo apt-get update -y -qqq --fix-missing && apt-get install -y -qqq praat > /dev/null
!wget -qqq http://www.praatvocaltoolkit.com/downloads/plugin_VocalToolkit.zip
!unzip -qqq /content/plugin_VocalToolkit.zip > /dev/null
with open('/content/script.praat', 'w') as f:
    f.write(r"""writeInfoLine: preferencesDirectory$""")
!praat /content/script.praat
/root/.praat-dir
!mv /content/plugin_VocalToolkit/* /root/.praat-dir
!praat --version
Praat 6.0.37 (February 3 2018)
How can I apply this script to multiple wav files without UI, using linux command line or python?
The general answer
You don't. You run a script, and it's entirely up to the script how it works, what objects it works on, where those objects are fetched, how they are fetched, etc.
So you always have to look at how to apply a specific script, and that always entails figuring out how that script wants its input, and how to get to that point.
The specific answer
The page for the script you want says
This command [does something on] each selected Sound
so the first thing will be to open the files you want and select them.
Let's assume you'll be working with a small enough number of sounds to open them all in one go. If you are working on a lot of sound files, or files that are too large to hold in memory, you'll have to batch the job into smaller chunks.
One way to do this would be with a wrapper script that opened your files, selected them, and executed the other script you want:
# Get a list of all your files
files = Create Strings as file list: "list", "/some/path/*.wav"
total_files = Get number of strings
# Open each of them
for i to total_files
    selectObject: files
    filename$ = Get string: i
    sounds[i] = Read from file: "/some/path/" + filename$
endfor
# Clear the selection
nocheck selectObject(undefined)
# Add each sound to your selection
for i to total_files
    plusObject: sounds[i]
endfor
# Run your script
runScript: path_to_script$, ...
# where the ... is the list of arguments your script expects
# In your specific case, it would be something like the call below.
# Its arguments are, in order:
#   500, 1500, 2500, 0, 0   new F1, F2, F3, F4, and F5 means
#   5500                    max formant
#   "yes"                   process only voiced parts
#   "yes"                   retrieve intensity contour
runScript: preferencesDirectory$ + "/plugin_VocalToolkit/changeformants.praat",
    ... 500, 1500, 2500, 0, 0, 5500, "yes", "yes"
# Do something with whatever the script gives you
My Praat is pretty rusty, but this should at least give you an idea of what to do (disclaimer: I haven't run any of the above, but the concepts should be fine).
With that "wrapper" script stored somewhere, you can then execute it from the command line:
$ praat /path/to/wrapper.praat
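If there are too many sounds to open in one go, as noted earlier, the file list can be split into batches on the Python side first. The helper below is a minimal sketch with a hypothetical name (batches) and made-up file names; how each batch is then handed to the wrapper script is left open, since that depends on how the script takes its input.

```python
def batches(paths, size):
    """Yield consecutive slices of paths, each holding at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


wav_files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]  # hypothetical names
for batch in batches(wav_files, 2):
    print(batch)
```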

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.
(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:
Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)
Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)
Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that has the same name as the initial file, but with ".txt" at the end.
Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.
for sequence in ./sequences/{.,}*;
do
    echo "$sequence";
    blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
    cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
done
When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".
If I can provide any further information to clarify anything, please let me know.
If you're only interested in *.fa files I would limit your input to only those matching files like this:
for sequence in sequences/*.fa;
do
I can suggest the following improvements:
for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do"
do
    # ${variable%something} is $variable with the trailing
    # string "something" removed
    # basename path/to/file is the name of the file
    # without the full path
    # $(some command) allows you to use the result of the command as a string
    # Combining the above, we can form a string based on our fasta file
    # This string can be useful to name stuff in a clean manner later
    sequence_name=$(basename ${fasta_file%.fa})
    echo ${sequence_name}
    # Create a directory for the results for this sequence
    # -p option avoids a failure in case the directory already exists
    mkdir -p ${sequence_name}
    # Define the name of the file for the results
    # (including our previously created directory in its path)
    blast_results=${sequence_name}/${sequence_name}_blast.txt
    blastp -query ${fasta_file} -db database.faa \
        -out ${blast_results} \
        -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits
    top_hits=${sequence_name}/${sequence_name}_top_hits.txt
    # alternatively, using "%"
    #top_hits=${blast_results%_blast.txt}_top_hits.txt
    # No need to cat: awk can take a file as argument
    awk '/hits found/{getline;print}' ${blast_results} \
        | grep -v "#" > ${top_hits}
done
I made more intermediate variables, with (hopefully) meaningful names.
I used \ to escape line ends and allow putting commands in several lines.
I hope this improves code readability.
I haven't tested. There may be typos.
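For readers more comfortable with Python, the two pieces of logic in the loop (building the per-sequence file names, and the awk/grep filter) can be sketched as follows. The function names result_paths and top_hits are hypothetical, and the sample lines in the usage part are made up to resemble tabular BLAST output with comment lines:

```python
from pathlib import Path


def result_paths(fasta_file):
    """Return (blast_results, top_hits) paths inside a per-sequence folder."""
    name = Path(fasta_file).name
    if name.endswith(".fa"):
        name = name[:-len(".fa")]  # same effect as ${fasta_file%.fa}
    folder = Path(name)
    return folder / (name + "_blast.txt"), folder / (name + "_top_hits.txt")


def top_hits(lines):
    """Keep the line after each 'hits found' line, unless it contains '#'."""
    kept = []
    iterator = iter(lines)
    for line in iterator:
        if "hits found" in line:
            # Like awk's getline: consume the next line; if there is none,
            # awk prints the current line instead.
            following = next(iterator, line)
            if "#" not in following:  # same filter as grep -v "#"
                kept.append(following)
    return kept


sample = [
    "# BLASTP",
    "# 2 hits found",
    "q1\ts1\t99.0",
    "q1\ts2\t80.0",
    "# 0 hits found",
    "# BLAST processed 1 queries",
]
print(result_paths("./sequences/abc.fa"))
print(top_hits(sample))
```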
You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using
mkdir 'folder_name'
then you need to redirect your -o outputs to those files, something like this
'command' -o /path/to/output/folder
To help you test this script out, you can run each line one by one to test them. You need to make sure each line works by itself before combining.
One last thing: be careful with your use of semicolons. It should look something like this:
for filename in *.fa; do 'command'; done
