Multiple named inputs in Snakefile - python-3.x

I want to make a pipeline that looks like this:

1. For each dataset, extract some features.
2. Make a unique list of all features.
3. Extract the unique list from all the original datasets.

Here is a basic example of where I am:
input_dict = {"data1": "/path/to/data1", "data2": "/path/to/data2"}

rule all:
    input:
        expand('data/{dataset}.processed', dataset=input_dict.keys())

rule extract_master:
    output:
        'data/{dataset}.processed'
    input:
        master = rules.master_list.output,
        dataset = lambda wildcards: input_dict[wildcards.dataset]
    shell:
        "./extract_master.py --input {input.dataset} --out {output} --master {input.master}"

rule master_list:
    output:
        'data/master.txt'
    input:
        expand('data/{dataset}.chunk', dataset=input_dict.keys())
    shell:
        './master_list.py --input {input} --output {output}'

rule get_chunk:
    input:
        lambda wildcards: input_dict[wildcards.dataset]
    output:
        'data/{dataset}.chunk'
    shell:
        "./get_chunk.py --input {input} --output {output}"
I get an error:

'Rules' object has no attribute 'master_list'

I don't know how to specify two named inputs where each input is not a simple string. If there is syntax I can use in the input section of the extract_master rule to fix this, that would be great. Otherwise, any thoughts on a better approach would be gladly received.

Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows to resolve dependencies that are ambiguous when using filenames.
Source
That is, in your example, rule master_list should be defined before rule extract_master.
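A minimal sketch of that reordering, reusing the rules from the question:

```snakemake
# master_list is now defined first, so `rules.master_list.output`
# already exists when extract_master is parsed.
rule master_list:
    output:
        'data/master.txt'
    input:
        expand('data/{dataset}.chunk', dataset=input_dict.keys())
    shell:
        './master_list.py --input {input} --output {output}'

rule extract_master:
    output:
        'data/{dataset}.processed'
    input:
        master = rules.master_list.output,
        dataset = lambda wildcards: input_dict[wildcards.dataset]
    shell:
        "./extract_master.py --input {input.dataset} --out {output} --master {input.master}"
```

Alternatively, writing the plain filename 'data/master.txt' as the master input avoids depending on rule definition order altogether.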

Related

Snakemake: create multiple wildcards for the same argument

I am trying to run GenotypeGVCFs on many VCF files. The command line wants every single VCF file to be listed, as in:

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R my.fasta \
    -V bob.vcf \
    -V smith.vcf \
    -V kelly.vcf \
    -o {output.out}

How can I do this in Snakemake? This is my code, but I do not know how to create a wildcard for -V.
workdir: "/path/to/workdir/"

SAMPLES = ["bob", "smith", "kelly"]
print(SAMPLES)

rule all:
    input:
        "all_genotyped.vcf"

rule genotype_GVCFs:
    input:
        lambda w: "-V" + expand("{sample}.vcf", sample=SAMPLES)
    params:
        ref="my.fasta"
    output:
        out="all_genotyped.vcf"
    shell:
        """
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R {params.ref} {input} -o {output.out}
        """
You are putting the cart before the horse. Wildcards are needed for rule generalization: you can define a pattern for a rule where wildcards are used to define generic parts. In your example there are no patterns: everything is defined by the value of SAMPLES. This is not a recommended way to use Snakemake; the pipeline should be defined by the filesystem: which files are present on your disk.
By the way, your code will not work, as the input shall define the list of filenames, while in your example you are (incorrectly) trying to define the strings like "-V filename".
So, you have the output: "all_genotyped.vcf". You have the input: ["bob.vcf", "smith.vcf", "kelly.vcf"]. You don't even need to use a lambda here, as the input doesn't depend on any wildcard. So, you have:
rule genotype_GVCFs:
    input:
        expand("{sample}.vcf", sample=SAMPLES)
    output:
        "all_genotyped.vcf"
    ...
Actually, you don't even need the input section. If you know for sure that the files from the SAMPLES list exist, you may skip it.
The values for -V can be defined in params:
rule genotype_GVCFs:
    # input:
    #     expand("{sample}.vcf", sample=SAMPLES)
    output:
        "all_genotyped.vcf"
    params:
        ref = "my.fasta",
        vcf = expand("-V {sample}.vcf", sample=SAMPLES)
    shell:
        """
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R {params.ref} {params.vcf} -o {output}
        """
This should solve your issue, but I would advise you to rethink your solution. The use of the SAMPLES list smells. Alternatively: do you really need Snakemake if you have all the dependencies defined already?
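For clarity, here is a pure-Python sketch of what Snakemake's expand() does with a single keyword argument (the real expand also accepts several patterns and several keywords, taking their product):

```python
import itertools

def expand(pattern, **wildcards):
    """Toy re-implementation of snakemake's expand() for illustration:
    substitute every combination of the given wildcard values into the pattern."""
    keys = list(wildcards)
    combos = itertools.product(*(wildcards[k] for k in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]

SAMPLES = ["bob", "smith", "kelly"]

# The list snakemake would put in {params.vcf}; joined with spaces in the
# shell command it becomes "-V bob.vcf -V smith.vcf -V kelly.vcf".
print(expand("-V {sample}.vcf", sample=SAMPLES))
# → ['-V bob.vcf', '-V smith.vcf', '-V kelly.vcf']
```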

snakemake wildcard in input files

I am very new to snakemake and I am trying to create a merged.fastq for each sample. Following is my Snakefile.
configfile: "config.yaml"

print(config['samples'])
print(config['ss_files'])
print(config['pass_files'])

rule all:
    input:
        expand("{sample}/data/genome_assembly/medaka/medaka.fasta", sample=config["samples"]),
        expand("{pass_file}", pass_file=config["pass_files"]),
        expand("{ss_file}", ss_file=config["ss_files"])

rule merge_fastq:
    input:
        directory("{pass_file}")
    output:
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"
where 'samples' is a list of sample names and 'pass_files' is a list of directory paths to fastq_pass folders, each of which contains small fastq files.
I am trying to merge the small fastq files into one large merged.fastq per sample.
I am getting the following error:

Wildcards in input files cannot be determined from output files:
'pass_file'
Each wildcard in the input section shall have a corresponding wildcard (with the same name) in the output section. That is how Snakemake works: when Snakemake tries to construct the DAG of jobs and finds that it needs a certain file, it looks at the output section of each rule and checks whether that rule can produce the required file. That is how Snakemake assigns concrete values to the wildcards in the output section. Every wildcard in the other sections shall match one of the wildcards in the output, and that is how the input gets concrete filenames.
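As a toy illustration (not Snakemake's actual implementation), matching a requested file against an output pattern to fix the wildcard values can be sketched with a regular expression:

```python
import re

def match_output_pattern(pattern, filename):
    """Toy sketch of how a target file fixes wildcard values:
    turn each '{name}' placeholder into a named regex group and match."""
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern)
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None

# The requested target determines the value of {sample}:
print(match_output_pattern("{sample}/data/merged.fastq.gz",
                           "sampleA/data/merged.fastq.gz"))
# → {'sample': 'sampleA'}
```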
Now let's look at your rule merge_fastq:
rule merge_fastq:
    input:
        directory("{pass_file}")
    output:
        "{sample}/data/merged.fastq.gz"
    wildcard_constraints:
        id="*.fastq.gz"
    shell:
        "cat {input}/{id} > {output}"
The only wildcard that can get its value is {sample}. The {pass_file} and {id} are dangling.
As I see it, you are trying to merge files that are not known at design time. Take a look at dynamic files, checkpoints, and using a function in the input.
The rest of your Snakefile is hard to understand. For example, I don't see how you specify the files that match this pattern: "{sample}/data/merged.fastq.gz".
Update:

Let's say I have a directory (/home/other_computer/jobs/data/<sample_name>/*.fastq.gz) which is my input, and the output is (/result/merged/<sample_name>/merged.fastq.gz). What I tried is having the first path as input: {"pass_files"} (this comes from my config file) and output: "result/merged/{sample}/merged.fastq.gz"
First, let's simplify the task a little bit and replace the {pass_file} with the hardcoded path. You have 2 degrees of freedom: the <sample_name> and the unknown files in the /home/other_computer/jobs/data/<sample_name>/ folder. The <sample_name> is a good candidate for becoming a wildcard, as this name can be derived from the target file. The unknown number of files *.fastq.gz doesn't even require any Snakemake constructs as this can be expressed using a shell command.
rule merge_fastq:
    output:
        "/result/merged/{sample_name}/merged.fastq.gz"
    shell:
        "cat /home/other_computer/jobs/data/{wildcards.sample_name}/*.fastq.gz > {output}"

CLI arguments that take their own arguments with Clap

My program takes several file names as command line arguments, for example:
./myProgram -F file1 file2
This simple case works fine with Clap, in fact it's the doc example of Arg::multiple().
However, I'd also like each file to take arguments of its own, which changes the behavior of that particular file. Simplified example:
./myProgram -F --name file1 --format csv --priority 2 -F --name file2 --priority 1
Here, file1 has a higher priority and a different format from file2.
Simply using Arg::multiple() no longer works, since the file-specific arguments (format, priority) get parsed as independent arguments, with no way to know which file they belong to.
Arg::allow_hyphen_values() seems to get me part of the way there. But it just parses every occurrence of --name, file1, --format etc as a value to the -F option, with no way to know which --priority argument belongs to which file. I thought about using a different syntax for the file specific arguments and parsing those manually, but with this restriction, I can't even do that.
Is there any way to do this with Clap?
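One pragmatic workaround (a hypothetical sketch of manual pre-parsing, not a Clap feature): split the raw arguments into per-file groups on each -F occurrence, then hand each group to its own parser. The function name and grouping scheme below are assumptions for illustration:

```rust
// Hypothetical pre-parser: collect the arguments following each "-F"
// into separate groups. Each group could then be parsed by its own
// clap::Command (not shown here).
fn group_file_args(args: &[&str]) -> Vec<Vec<String>> {
    let mut groups: Vec<Vec<String>> = Vec::new();
    for arg in args {
        if *arg == "-F" {
            // Start a new group for the next file's arguments.
            groups.push(Vec::new());
        } else if let Some(current) = groups.last_mut() {
            current.push(arg.to_string());
        }
    }
    groups
}

fn main() {
    let argv = ["-F", "--name", "file1", "--format", "csv", "--priority", "2",
                "-F", "--name", "file2", "--priority", "1"];
    // Yields two groups: one per file, each keeping its own
    // --format/--priority flags together.
    println!("{:?}", group_file_args(&argv));
}
```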

Register a variable output with Ansible CLI / Ad-Hoc

Can I register the output of a task? Is there an argument to the ansible command for that?
This is my command:
ansible all -m ios_command -a"commands='show run'" -i Resources/Inventory/hosts
I need this, because the output is a dictionary and I only need the value for one key. If this is not possible, is there a way to save the value of that key to a file?
I have found that you can convert Ansible output to JSON when executing playbooks by prefixing the ansible-playbook command with ANSIBLE_STDOUT_CALLBACK=json. Example:

ANSIBLE_STDOUT_CALLBACK=json ansible-playbook Resources/.Scripts/.Users.yml

This will give you a large output, because it also shows each host's facts, but it will have a key for each host on each task.
This method is not possible with the ansible command, but its output is similar to JSON. It just shows "10.20.30.111 | SUCCESS =>" before the main bracket.
Source
Set the following in your ansible.cfg under the [defaults] group
bin_ansible_callbacks=True
Then as #D_Esc mentioned, you can use
ANSIBLE_STDOUT_CALLBACK=json ansible all -m ios_command -a"commands='show run'" -i Resources/Inventory/hosts
and can get the json output which you can try to parse.
I have not found a way to register the output to a variable using ad-hoc commands
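Since the ad-hoc output only differs from JSON by that host prefix, one can strip the prefix and parse the rest. A hypothetical sketch (the exact prefix format may vary between Ansible versions):

```python
import json

def parse_adhoc_output(text):
    """Strip the 'host | STATUS =>' prefix that ad-hoc mode prints
    before the JSON body, then parse the remainder as JSON."""
    prefix, sep, body = text.partition("=>")
    if not sep:
        raise ValueError("no 'host | STATUS =>' prefix found")
    host = prefix.split("|")[0].strip()
    return host, json.loads(body)

sample = '10.20.30.111 | SUCCESS => {"changed": false, "stdout_lines": ["show run output"]}'
host, data = parse_adhoc_output(sample)
# Pick out just the key you need from the result dictionary:
print(host, data["stdout_lines"][0])
```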

Multiple inputs and outputs in a single rule Snakemake file

I am getting started with Snakemake and I have a very basic question which I couldn't find the answer to in the Snakemake tutorial.
I want to create a single-rule Snakefile to download multiple files in Linux one by one.
'expand' cannot be used in the output because the files need to be downloaded one by one, and wildcards cannot be used because it is the target rule.
The only way that comes to my mind is something like this, which doesn't work properly. I cannot figure out how to send the downloaded items to a specific directory with specific names such as 'downloaded_files.dwn' using {output}, to be used in later steps:
links = [link1, link2, link3, ....]

rule download:
    output:
        "outdir/{downloaded_file}.dwn"
    params:
        shellCallFile='callscript',
    run:
        callString = ''
        for item in links:
            callString += 'wget str(item) -O ' + {output} + '\n'
        call('echo "' + callString + '\n" >> ' + params.shellCallFile, shell=True)
        call(callString, shell=True)
I appreciate any hint on how this should be solved and which part of Snakemake I didn't understand well.
Here is a commented example that could help you solve your problem:
# Create some way of associating output files with links.
# The output file names will be built from the keys: "chain_{key}.gz"
# One could probably directly use output file names as keys.
links = {
    "1": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "2": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "3": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}

rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/chain_1.gz", "outdir/chain_2.gz", "outdir/chain_3.gz"]
        # Note that we don't need to use {output} in the "run" or "shell" part.
        # This list will be used if we later add rules
        # that use the files generated by the present rule.
        expand("outdir/chain_{n}.gz", n=links.keys())
    run:
        # The sort is there to ensure the files are in the 1, 2, 3 order.
        # We could use an OrderedDict if we wanted an arbitrary order.
        for link_num in sorted(links.keys()):
            shell("wget {link} -O outdir/chain_{n}.gz".format(link=links[link_num], n=link_num))
And here is another way of doing it, which uses arbitrary names for the downloaded files and uses output (although a bit artificially):

links = [
    ("foo_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz"),
    ("bar_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz"),
    ("baz_chain.gz", "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz")]

rule download:
    output:
        # We inform snakemake that this rule will generate
        # the following list of files:
        # ["outdir/foo_chain.gz", "outdir/bar_chain.gz", "outdir/baz_chain.gz"]
        ["outdir/{f}".format(f=filename) for (filename, _) in links]
    run:
        for i in range(len(links)):
            # output is a list, so we can access its items by index
            shell("wget {link} -O {chain_file}".format(
                link=links[i][1], chain_file=output[i]))
        # Using a direct loop over the pairs (filename, link)
        # could be considered "cleaner":
        # for (filename, link) in links:
        #     shell("wget {link} -O outdir/{filename}".format(
        #         link=link, filename=filename))
An example where the three downloads can be done in parallel using snakemake -j 3:
# To use os.path.join,
# which is more robust than manually writing the separator.
import os

# Association between output files and source links
links = {
    "foo_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAptMan1.over.chain.gz",
    "bar_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToAquChr2.over.chain.gz",
    "baz_chain.gz": "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToBisBis1.over.chain.gz"}

# Make this association accessible via a function of wildcards
def chainfile2link(wildcards):
    return links[wildcards.chainfile]

# First rule will drive the rest of the workflow
rule all:
    input:
        # expand generates the list of the final files we want
        expand(os.path.join("outdir", "{chainfile}"), chainfile=links.keys())

rule download:
    output:
        # We inform snakemake what this rule will generate
        os.path.join("outdir", "{chainfile}")
    params:
        # using a function of wildcards in params
        link = chainfile2link,
    shell:
        """
        wget {params.link} -O {output}
        """
