snakemake wildcard in input files - python-3.x

I am very new to snakemake and I am trying to create a merged.fastq for each sample. Following is my Snakefile.
configfile: "config.yaml"
print(config['samples'])
print(config['ss_files'])
print(config['pass_files'])
rule all:
input:
expand("{sample}/data/genome_assembly/medaka/medaka.fasta", sample=config["samples"]),
expand("{pass_file}", pass_file=config["pass_files"]),
expand("{ss_file}", ss_file=config["ss_files"])
rule merge_fastq:
input:
directory("{pass_file}")
output:
"{sample}/data/merged.fastq.gz"
wildcard_constraints:
id="*.fastq.gz"
shell:
"cat {input}/{id} > {output}"
where, 'samples' is a list of sample names,
'pass_files' is a list of directory path to fastq_pass folder which contains small fastq files
I am trying to merge small fastq files to a large merged.fastq for each sample.
I am getting the following,
Wildcards in input files cannot be determined from output files:
'pass_file'
as the error.

Each wildcard in the input section shall have a corresponding wildcard (with the same name) in the output section. That is how Snakemake works: when the Snakemake tries to constract the DAG of jobs and finds that it needs a certain file, it looks at the output section for each rule and checks if this rule can produce the required file. This is the way how Snakemake assigns certain values to the wildcard in the output section. Every wildcard in other sections shall match one of the wildcards in the output, and that is how the input gets concrete filenames.
Now let's look at your rule merge_fastq:
rule merge_fastq:
input:
directory("{pass_file}")
output:
"{sample}/data/merged.fastq.gz"
wildcard_constraints:
id="*.fastq.gz"
shell:
"cat {input}/{id} > {output}"
The only wildcard that can get its value is the {sample}. The {pass_file} and {id} are dangling.
As I see, you are trying to merge the files that are not known on the design time. Take a look at the dynamic files, checkpoint and using a function in the input.
The rest of your Snakefile is hard to understand. For example I don't see how you specify the files that match this pattern: "{sample}/data/merged.fastq.gz".
Update:
Lets say, I have a
directory(/home/other_computer/jobs/data/<sample_name>/*.fastq.gz)
which is my input and output is
(/result/merged/<sample_name>/merged.fastq.gz). What I tried is having
the first path as input: {"pass_files"} (this comes from my config
file) and output : "result/merged/{sample}/merged.fastq.gz"
First, let's simplify the task a little bit and replace the {pass_file} with the hardcoded path. You have 2 degrees of freedom: the <sample_name> and the unknown files in the /home/other_computer/jobs/data/<sample_name>/ folder. The <sample_name> is a good candidate for becoming a wildcard, as this name can be derived from the target file. The unknown number of files *.fastq.gz doesn't even require any Snakemake constructs as this can be expressed using a shell command.
rule merge_fastq:
output:
"/result/merged/{sample_name}/merged.fastq.gz"
shell:
"cat /home/other_computer/jobs/data/{sample_name}/*.fastq.gz > {output}"

Related

How to remove some number of files using a wild card before adding stuff back to them?

I have a set of .txt files named my_file_1.txt, my_file_2.txt, ..., my_file_n.txt where n is finite integer. As my python code is running (in a directory with path ~/simulation/some_code), it is adding data into these files using the following for loop:
for realization in np.arange(1, n+1):
# Identifying the file_path
some_name = 'output/{}/Info/{}/{}/parameter_{:.3f}/my_file_{}.txt'.format(size, name, status, value, realization)
# do some stuff
with open(some_name, "a") as filename:
print('{}'.format(some_list), file=filename)
filename.close()
However, to begin with, these files are not empty and need to be emptied. To do so, I am running the following line ahead of time (in the home directory ~/ which is two levels up with respect to the directory of the above code) to make sure files are empty before being modified:
os.system('> output/{}/Info/{}/{}/parameter_{:.3f}/my_file_*.txt'.format(size, name, status, value))
While I expected to see * symbol should act as a wild card to empty all similar text files above, it seems that files are accumulating from previous data instead of removing initial data and adding the stuff above. Am I using * wild card incorrectly? Is this problem fixable without changing the path of my codes?
Your understanding of the wildcard is correct. The mistake is with redirection. By default you only redirect (>) to one output. You can use the program tee to redirect std out to multiple other files like this:
(echo -n, echoes nothing)
echo -n | tee *
The pipe | passes the stdout of echo -n to the stdin of tee.
Then the wildcard will expand to all files in the directory.
echo -n | tee my_file_1.txt, my_file_2.txt, ..., my_file_n.txt

grep empty output file

I made a shell script the purpose of which is to find files that don't contain a particular string, then display the first line that isn't empty or otherwise useless. My script works well in the console, but for some reason when I try to direct the output to a .txt file, it comes out empty.
Here's my script:
#!/bin/bash
# takes user input.
echo "Input substance:"
read substance
echo "Listing media without $substance:"
cd media
# finds names of files that don't feature the substance given, then puts them inside an array.
searchresult=($(grep -L "$substance" *))
# iterates the array and prints the first line of each - contains both the number and the medium name.
# however, some files start with "Microorganisms" and the actual number and name feature after several empty lines
# the script checks for that occurence - and prints the first line that doesnt match these criteria.
for i in "${searchresult[#]}"
do
grep -m 1 -v "Microorganisms\|^$" $i
done >> output.txt
I've tried moving the >>output.txt to right after the grep line inside the loop, tried switching >> to > and 2>&1, tried using tee. No go.
I'm honestly feeling utterly stuck as to what the issue could be. I'm sure there's something I'm missing, but I'm nowhere near good enough with this to notice. I would very much appreciate any help.
EDIT: Added files to better illustrate what I'm working with. Sample inputs I tried: Glucose, Yeast extract, Agar. Link to files [140kB] - the folder was unzipped beforehand.
The script was given full permissions to execute. I don't think the output is being rewritten because even if I don't iterate and just run a single line of the loop, the file is empty.

Multiple inputs and outputs in a single rule Snakemake file

I am getting started with Snakemake and I have a very basic question which I couldnt find the answer in snakemake tutorial.
I want to create a single rule snakefile to download multiple files in linux one by one.
The 'expand' can not be used in the output because the files need to be downloaded one by one and wildcards can not be used because it is the target rule.
The only way comes to my mind is something like this which doesnt work properly. I can not figure out how to send the downloaded items to specific directory with specific names such as 'downloaded_files.dwn' using {output} to be used in later steps:
links=[link1,link2,link3,....]
rule download:
output:
"outdir/{downloaded_file}.dwn"
params:
shellCallFile='callscript',
run:
callString=''
for item in links:
callString+='wget str(item) -O '+{output}+'\n'
call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
call(callString, shell=True)
I appreciate any hint on how this should be solved and which part of snakemake I didnt understand well.
Here is a commented example that could help you solve your problem:
# Create some way of associating output files with links
# The output file names will be built from the keys: "chain_{key}.gz"
# One could probably directly use output file names as keys
links = {
"1" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
"2" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
"3" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}
rule download:
output:
# We inform snakemake that this rule will generate
# the following list of files:
# ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
# Note that we don't need to use {output} in the "run" or "shell" part.
# This list will be used if we later add rules
# that use the files generated by the present rule.
expand("outdir/chain_{n}.gz", n=links.keys())
run:
# The sort is there to ensure the files are in the 1, 2, 3 order.
# We could use an OrderedDict if we wanted an arbitrary order.
for link_num in sorted(links.keys()):
shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
And here is another way of doing, that uses arbitrary names for the downloaded files and uses output (although a bit artificially):
links = [
("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")]
rule download:
output:
# We inform snakemake that this rule will generate
# the following list of files:
# ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
["outdir/{f}".format(f=filename) for (filename, _) in links]
run:
for i in range(len(links)):
# output is a list, so we can access its items by index
shell("wget {link} -O {chain_file}".format(
link=links[i][1], chain_file=output[i]))
# using a direct loop over the pairs (filename, link)
# could be considered "cleaner"
# for (filename, link) in links:
# shell("wget {link} -0 outdir/{filename}".format(
# link=link, filename=filename))
An example where the three downloads can be done in parallel using snakemake -j 3:
# To use os.path.join,
# which is more robust than manually writing the separator.
import os
# Association between output files and source links
links = {
"foo_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
"bar_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
"baz_chain.gz" : "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}
# Make this association accessible via a function of wildcards
def chainfile2link(wildcards):
return links[wildcards.chainfile]
# First rule will drive the rest of the workflow
rule all:
input:
# expand generates the list of the final files we want
expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())
rule download:
output:
# We inform snakemake what this rule will generate
os.path.join("outdir", "{chainfile}")
params:
# using a function of wildcards in params
link = chainfile2link,
shell:
"""
wget {params.link} -O {output}
"""

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.
(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:
Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)
Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)
Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that is has the same name as the initial file, but with ".txt" at the end.
Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.
for sequence in ./sequences/{.,}*;
do
echo "$sequence";
blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
done
When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".
If I can provide any further information to clarify anything, please let me know.
If you're only interested in *.fa files I would limit your input to only those matching files like this:
for sequence in sequences/*.fa;
do
I can propose you the following improvements:
for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do"
do
# ${variable%something} is the part of $variable
# before the string "something"
# basename path/to/file is the name of the file
# without the full path
# $(some command) allows you to use the result of the command as a string
# Combining the above, we can form a string based on our fasta file
# This string can be useful to name stuff in a clean manner later
sequence_name=$(basename ${fasta_file%.fa})
echo ${sequence_name}
# Create a directory for the results for this sequence
# -p option avoids a failure in case the directory already exists
mkdir -p ${sequence_name}
# Define the name of the file for the results
# (including our previously created directory in its path)
blast_results=${sequence_name}/${sequence_name}_blast.txt
blastp -query ${fasta_file} -db database.faa \
-out ${blast_results} \
-evalue 1e-10 -outfmt 7
# Define a file name for the top hits
top_hits=${sequence_name}/${sequence_name}_top_hits.txt
# alternatively, using "%"
#top_hits=${blast_results%_blast.txt}_top_hits.txt
# No need to cat: awk can take a file as argument
awk '/hits found/{getline;print}' ${blast_results} \
| grep -v "#" > ${sequence_name}_top_hits.txt
done
I made more intermediate variables, with (hopefully) meaningful names.
I used \ to escape line ends and allow putting commands in several lines.
I hope this improves code readability.
I haven't tested. There may be typos.
You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using
mkdir 'folder_name'
then you need to redirect your -o outputs to those files, something like this
'command' -o /path/to/output/folder
To help you test this script out, you can run each line one by one to test them. You need to make sure each line works by itself before combining.
One last thing, be careful with your use of colons, it should look something like this:
for filename in *.fa; do 'command'; done

Obtaining file names from directory in Bash

I am trying to create a zsh script to test my project. The teacher supplied us with some input files and expected output files. I need to diff the output files from myExecutable with the expected output files.
Question: Does $iF contain a string in the following code or some kind of bash reference to the file?
#!/bin/bash
inputFiles=~/project/tests/input/*
outputFiles=~/project/tests/output
for iF in $inputFiles
do
./myExecutable $iF > $outputFiles/$iF.out
done
Note:
Any tips in fulfilling my objectives would be nice. I am new to shell scripting and I am using the following websites to quickly write the script (since I have to focus on the project development and not wasting time on extra stuff):
Grammar for bash language
Begginer guide for bash
As your code is, $iF contains full path of file as a string.
N.B: Don't use for iF in $inputFiles
use for iF in ~/project/tests/input/* instead. Otherwise your code will fail if path contains spaces or newlines.
If you need to diff the files you can do another for loop on your output files. Grab just the file name with the basename command and then put that all together in a diff and output to a ".diff" file using the ">" operator to redirect standard out.
Then diff each one with the expected file, something like:
expectedOutput=~/<some path here>
diffFiles=~/<some path>
for oF in ~/project/tests/output/* ; do
file=`basename ${oF}`
diff $oF "${expectedOutput}/${file}" > "${diffFiles}/${file}.diff"
done

Resources