Snakemake: Parameter as wildcard used in parallel script runs - python-3.x

I'm fairly new to snakemake and inherited a kind of huge worflow that consists in a sequence of 17 rules that run in serial.
Each rule takes outputs from the previous rules and uses them to run a python script. Everything has worked great so far except that now I'm trying to improve the worflow since some of the rules can be run in parallel.
A rough example of what I'm trying to achieve, my understanding is that wildcards should allow me to solve this.
grid = [ 10 , 20 ]
rule all:
input:
expand("path/to/C/{grid}/file_C" ,grid = grid)
rule process_A:
input:
path_A = "path/to/A/file_A"
path_B = "path/to/B/{grid}/file_B" # A rule further in the worflow could need a file from a previous rule saved with this structure
params:
grid = lambda wc: wc.get(grid)
output:
path_C = "path/to/C/{grid}/file_C"
script:
"script_A.py"
And inside the script I retrieve the grid size parameter:
grid = snakemake.params.grid
In the end the whole rule process_A should be rerun with grid = 10 and with grid = 20 and save each result to a folder whose path depends on grid also.
I know there are several things wrong with this, but I can't seem to find were to start from to figure this out. The error I'm getting now is:
name 'params' is not defined
Any help as to where to start from?

It would be useful to post the error stack trace of name 'params' is not defined to know exactly what is causing it. For now...
And inside the script I retrieve the grid size parameter:
grid = snakemake.params.grid
I suspect you are mixing the script directive with the shell directive. Probably you want something like:
rule process_A:
input: ...
output: ...
params: ...
script:
"script_A.py"
inside script_A.py snakemake will replace snakemake.params.grid with the actual param value.
Alternatively, write a standalone python script that parses command line arguments and you execute like any other program using the shell directive. (I tend to prefer this solution as it makes things more explicit and easier to debug but it also means more boiler-plate code to write a standalone script).

Related

Attempting to append all content into file, last iteration is the only one filling text document

I'm trying to Create a file and append all the content being calculated into that file, but when I run the script the very last iteration is written inside the file and nothing else.
My code is on pastebin, it's too long, and I feel like you would have to see exactly how the iteration is happening.
Try to summarize it, Go through an array of model numbers, if the model number matches call the function that calculates that MAC_ADDRESS, when done calculating store all the content inside a the file.
I have tried two possible routes and both have failed, giving the same result. There is no error in the code (it runs) but it just doesn't store the content into the file properly there should be 97 different APs and it's storing only 1.
The difference between the first and second attempt,
1 attempt) I open/create file in the beginning of the script and close at the very end.
2 attempt) I open/create file and close per-iteration.
First Attempt:
https://pastebin.com/jCpLGMCK
#Beginning of code
File = open("All_Possibilities.txt", "a+")
#End of code
File.close()
Second Attempt:
https://pastebin.com/cVrXQaAT
#Per function
File = open("All_Possibilities.txt", "a+")
#per function
File.close()
If I'm not suppose to reference other websites, please let me know and I'll just paste the code in his post.
Rather than close(), please use with:
with open('All_Possibilities.txt', 'a') as file_out:
file_out.write('some text\n')
The documentation explains that you don't need + to append writes to a file.
You may want to add some debugging console print() statements, or use a debugger like pdb, to verify that the write() statement actually ran, and that the variable you were writing actually contained the text you thought it did.
You have several loops that could be a one-liner using readlines().
Please do this:
$ pip install flake8
$ flake8 *.py
That is, please run the flake8 lint utility against your source code,
and follow the advice that it offers you.
In particular, it would be much better to name your identifier file than to name it File.
The initial capital letter means something to humans reading your code -- it is
used when naming classes, rather than local variables. Good luck!

How to run a string from an input file as python code?

I am creating something along the likes of a text adventure game. I have a .yaml file that is my input. This file looks something like this
node_type:
action
title:
Do some stuff
info:
This does some stuff and things
script:
'print("hello world")
print(ret_val)
foo.bar(True)
ret_val = (foo.bar() == True)
if (thing):
print(thing)
print(ret_val)
'
My end goal is to have my python program run the script portion of the yaml file exactly as if it had been copy pasted into the main code. (I know there are about ten bazillion security reasons I should not be running user input like this, but I am the only one writing these nodes, and the only one using this program so I'm mostly just ignoring this fact...)
Currently my attempt goes like this: I load my yaml file as a dict using pyyaml
node = yaml.safe_load(file.yaml)
Then I'm trying to use exec to run my code and hitting a lot of problems, I can't run if statements, I simply get a syntax error, and I can't get any sort of return value from my code. I've tried this as a work around:
def main()
ret_val = "test";
thing = exec(node['script'], globals(),locals())
print(ret_val)
which when run with the above .yaml file prints
>> hello world
>> test
>> True
>> test
for some reason not actually modifying any of my main variables even though I fed them to exec.
Is there any way for me to work around these issues or is there an all together better way to be doing this?
One way of doing this would be to parse the code out and save it to a .py file, from which it can be imported dynamically, for example by importlib.
You might want to encapsulate parsed code into a function, which you can then easily call to invoke your action. Also, it would make sense to specify some default imports there.

Using nohup to help run a loop of python code while disconnecting from ssh

I'm looking for help running a python script that takes some time to run.
It is a long running process that takes about 2hours per test observation. For example, these observations could be the 50 states of the usa.
I dont want to baby sit this process all day - I'd like to kick it off then drive home from work - or have it run while I'm sleeping.
Since this a loop - I would need to call one python script that loops through my code going over each of the 50 states - and a 2nd that runs my actual code that does things.
I've heard of NOHUP, but I have very limited knowledge. I saw nohup ipython mypython.py but then when I google I get alot of other people chiming in with other methods and so I don't know what is the ideal approach for someone like me. Additionally, I am essentially looking to run this as a loop - so don't know how that complicates things.
Please give me something simple and easier to understand. I don't know linux all that well or I wouldn't be asking as this seems like a common sort of command/activity...
Basic example of my code:
Two files: code_file.py and loop_file.py
Code_file.py does all the work. Loop file just passes in the list of things to run the stuff for.
code_file.py
output = each_state + ' needs some help!'
print output
loop_file.py
states = ['AL','CO','CA','NY','VA','TX']
for each_state in states:
code_file.py
Regarding the loop - I have also heard that I can't pass in parameters or something via nohup? I can fix this part within my python code....for example reading from a CSV in my code and deleting the current record from that CSV file and then re-writing it out...that way I can always select the top record in the CSV file for my loop (the list of states)
May be you could modify your loop_file.py like this:
import os
states = ['AL','CO','CA','NY','VA','TX']
for each_state in states:
os.system("python /dir_of_your_code/code_file.py")
Then in a shell, you could run the loop_file.py with:
nohup python loop_file.py & # & is not necessary, it just redirect all output of the file to a file named nohup.out instead of printing it on screen.

Organize code in unix bash scripting

I am used to object oriented programming. Now, I have just started learning unix bash scripting via linux.
I have a unix script with me. I wanted to break it down into "modules" or preferably programs similar to "more", "ls", etc., and then use pipes to link all my programs together. E.g., "some input" myProg1 | myProg2 | myProg3.
I want to organize my code and make it look neater, instead of all in one script. Also, it will be easy to do testing and development.
Is it possible to do this, especially as a newbie ?
There are a few things you could take a look at, for example the usage of aliases in bash and storing them in either bashrc or a seperate file called by bashrc
that will make running commands easier..
take a look here for expanding commands into aliases (simple aliases are easy)
You can also look into using functions in your code (lots of bash scripts in above link's home folder to make sense of functions browse this site :) which has much better examples...
Take a look here for some piping tails into script
pipe tail output into another script
The thing with bash is its flexibility, so for example if something starts to get too messy for bash you could always write a perl/Java any lang and then call this from within your bash script, capture its output and do something else..
Unsure why all the pipes anyways here is something that may be of help:
./example.sh 20
function one starts with 20
In function 2 20 + 10 = 30
Function three returns 10 + 10 = 40
------------------------------------------------
------------------------------------------------
Local function variables global:
Result2: 30 - Result3: 40 - value2: 10 - value1: 20
The script:
example.sh
#!/bin/bash
input=$1;
source ./shared.sh
one
echo "------------------------------------------------"
echo "------------------------------------------------"
echo "Local function variables global:"
echo "Result2: $result2 - Result3: $result3 - value2: $value2 - value1: $value1"
shared.sh
function one() {
value1=$input
echo "function one starts with $value1"
two;
}
function two() {
value2=10;
result2=$(expr $value1 + $value2)
echo "In function 2 $value1 + $value2 = $result2"
three;
}
function three() {
local value3=10;
result3=$(expr $value2 + $result2;)
echo "Function three returns $value2 + $value3 = $result3"
}
I think the pipes you mean can actually be functions and each function can call one another.. and then you give the script the value which it passes through the functions..
bash is pretty flexible about passing values around, so long as the function being called before has the variable the next function being called by it can reuse it or it can be called from main program
I also split out the functions which can be sourced by another script to carry out the same functions
E2A Thanks for the upvote, I have also decided to include this link
http://tldp.org/LDP/abs/html/sample-bashrc.html
There is an awesome .bashrc to be reused, it has a lot of functions which will also give some insight into how to simplify a lot of daily repetitive commands such as that require piping, an alias can be written to do all of them for you..
You can do one thing.
Just as a C program can be divided into a header file and a source file for reducing complexity, you can divide your bash script into two scripts - a header and a main script but with some differences.
Header file - This will contain all the common variables defined and functions defined which will be used by your main script.
Your script - This will only contain function calls and other logic.You need to use "source <"header-file path">" in your script at starting to get all the functions and variables declared in the header available to your script.
Shell scripts have standard input and output like any other program on Unix, so you can use them in pipes. Splitting your scripts is a good solution because you can later use them in pipes with other commands.
I organize my Bash projects in the following way :
Each command is put in its own file
Reusable functions are kept in a library file which is just a classic script with only functions
All files are in the same directory, so commands can find the library with $(dirname $0)/library
Configuration is stored in another file as environment variables
To keep things clear, you should not use global variables to communicate between functions and main program.
I prepare a template for scripts with the following parts prepared :
Header with name and copyright
Read configuration with source
Load library with source
Check parameters
Function to display help, which is called if asked for or if parameters are wrong
My best advice is : always write the help function, as the next person who will need it is ... yourself !
To install your project you simply copy all files, and explain what to configure in the configuration file.

Perl program structure for parsing

I've got question about program architecture.
Say you've got 100 different log files with different formats and you need to parse and put that info into an SQL database.
My view of it is like:
use general config file like:
program1->name1("apache",/var/log/apache.log) (modulename,path to logfile1)
program2->name2("exim",/var/log/exim.log) (modulename,path to logfile2)
....
sqldb->configuration
use something like a module (1 file per program) type1.module (regexp, logstructure(somevariables), sql(tables and functions))
fork or thread processes (don't know what is better on Linux now) for different programs.
So question is, is my view of this correct? I should use one module per program (web/MTA/iptablat)
or there is some better way? I think some regexps would be the same, like date/time/ip/url. What to do with that? Or what have I missed?
example: mta exim4 mainlog
2011-04-28 13:16:24 1QFOGm-0005nQ-Ig
<= exim#mydomain.org.ua** H=localhost
(exim.mydomain.org.ua)
[127.0.0.1]:51127 I=[127.0.0.1]:465
P=esmtpsa
X=TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32
CV=no A=plain_server:spam S=763
id=1303985784.4db93e788cb5c#mydomain.org.ua T="test" from
<exim#exim.mydomain.org.ua> for
test#domain.ua
everything that is bold is already parsed and will be putted into sqldb.incoming table. now im having structure in perl to hold every parsed variable like $exim->{timstamp} or $exim->{host}->{ip}
my program will do something like tail -f /file and parse it line by line
Flexability: let say i want to add supprot to apache server (just timestamp userip and file downloaded). all i need to know what logfile to parse, what regexp shoud be and what sql structure should be. So im planning to have this like a module. just fork or thread main process with parameters(logfile,filetype). Maybe further i would add some options what not to parse (maybe some log level is low and you just dont see mutch there)
I would do it like this:
Create a config file that is formatted like this: appname:logpath:logformatname
Create a collection of Perl class that inherit from a base parser class.
Write a script which loads the config file and then loops over its contents, passing each iteration to its appropriate handler object.
If you want an example of steps 1 and 2, we have one on our project. See MT::FileMgr and MT::FileMgr::* here.
The log-monitoring tool wots could do a lot of the heavy lifting for you here. It runs as a daemon, watching as many log files as you could want, running any combination of perl regexes over them and executing something when matches are found.
I would be inclined to modify wots itself (which its licence freely allows) to support a database write method - have a look at its existing handle_* methods.
Most of the hard work has already been done for you, and you can tackle the interesting bits.
I think File::Tail is a nice fit.
You can make an array of File::Tail objects and poll them with select like this:
while (1) {
($nfound,$timeleft,#pending)=
File::Tail::select(undef,undef,undef,$timeout,#files);
unless ($nfound) {
# timeout - do something else here, if you need to
} else {
foreach (#pending) {
# here you can handle log messages depending on filename
print $_->{"input"}." (".localtime(time).") ".$_->read;
}
(from perl File::Tail doc)

Resources