Is there a clean way to run a Nextflow (DSL2-based) sub-workflow on a single node? - SLURM

I have a pipeline defined in a sub-workflow:
workflow pipeline {
    main:
    task1()
    task2(task1.out)
    ...
}
that gets called by the main workflow for different inputs:
workflow {
    main:
    params.in = "$baseDir/input"
    input_ch = Channel.fromPath(params.in + "/*.*")
    pipeline(input_ch)
}
It's easy to add a label to a process to define its resources. What I want to do is add such a label to my (whole) sub-workflow. The label is supposed to force the sub-workflow to execute as if it were a single process (meaning that it does not submit a new SLURM job for every process, but only one per sub-workflow invocation), like:
- pipeline for input 1 on node 1 as one SLURM job
- pipeline for input 2 on node 2 as one SLURM job
- ...
- pipeline for input n on node n as one SLURM job
- main: collect all the outputs
I did it as above by calling Nextflow inside of Nextflow:
process nextflowofnextflows {

    label 'use_whatever_is_needed'

    input:
    file input1

    output:
    path "$out/*.*", emit: output

    script:
    """
    nextflow run pipeline.nf --in=${input1}
    """
}
... but this is super unclean and I'd like to do it in a clean way.
What's the best practice to achieve the described behavior?
Best, t.
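For reference, the per-process labels mentioned in the question are typically wired to concrete resources in nextflow.config when running under the SLURM executor. A minimal sketch (the label name is taken from the question; the resource values are placeholders):

process {
    executor = 'slurm'

    withLabel: use_whatever_is_needed {
        cpus           = 16
        memory         = '64 GB'
        time           = '12h'
        clusterOptions = '--nodes=1'   // extra sbatch options, if needed
    }
}

What the question asks for is the equivalent of such a label applied to a whole sub-workflow, so that each pipeline(...) invocation becomes a single SLURM job.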

Related

Nextflow script to process all files in given directory

I have a nextflow script that runs a couple of processes on a single vcf file. The name of the file is 'bos_taurus.vcf' and it is located in the directory /input_files/bos_taurus.vcf. The directory input_files/ contains also another file 'sacharomyces_cerevisea.vcf'. I would like my nextflow script to process both files. I was trying to use a glob pattern like ch_1 = channel.fromPath("/input_files/*.vcf"), but sadly I can't find a working solution. Any help would be really appreciated.
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// here I tried to use globbing
params.input_files = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/input_files/*.vcf"
params.results_dir = "/mnt/c/Users/Lenovo/Desktop/STUDIA/BIOINFORMATYKA/SEMESTR_V/PRACOWNIA_INFORMATYCZNA/nextflow/projekt/results"

file_channel = Channel.fromPath( params.input_files, checkIfExists: true )

// how can I make this process work on two files simultaneously
process FILTERING {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    path(input_files)

    output:
    path("*")

    script:
    """
    vcftools --vcf ${input_files} --mac 1 --minQ 20 --recode --recode-INFO-all --out after_filtering.vcf
    """
}
Note that if your VCF files are actually bgzip compressed and tabix indexed, you could instead use the fromFilePairs factory method to create your input channel. For example:
params.vcf_files = "./input_files/*.vcf.gz{,.tbi}"
params.results_dir = "./results"

process FILTERING {

    tag { sample }

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(sample), path(indexed_vcf)

    output:
    tuple val(sample), path("${sample}.filtered.vcf")

    """
    vcftools \\
        --vcf "${indexed_vcf.first()}" \\
        --mac 1 \\
        --minQ 20 \\
        --recode \\
        --recode-INFO-all \\
        --out "${sample}.filtered.vcf"
    """
}

workflow {
    vcf_files = Channel.fromFilePairs( params.vcf_files, checkIfExists: true )

    FILTERING( vcf_files ).view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [thirsty_torricelli] DSL2 - revision: 8f69ad5638
executor > local (3)
[7d/dacad6] process > FILTERING (C) [100%] 3 of 3 ✔
[A, /path/to/work/84/f9f00097bcd2b012d3a5e105b9d828/A.filtered.vcf]
[B, /path/to/work/cb/9f6f78213f0943013990d30dbb9337/B.filtered.vcf]
[C, /path/to/work/7d/dacad693f06025a6301c33fd03157b/C.filtered.vcf]
Note that BCFtools is actively maintained and is intended as a replacement for VCFtools. In a production pipeline, BCFtools should be preferred.
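As a rough sketch of what that swap might look like, the vcftools command in the FILTERING process above could be replaced with a bcftools view call along these lines. Treat the flags as an approximate translation (for example, bcftools --min-ac counts non-reference alleles by default, while vcftools --mac counts minor alleles), not a verified drop-in:

script:
"""
bcftools view \\
    --min-ac 1 \\
    -e 'QUAL<20' \\
    -o "${sample}.filtered.vcf" \\
    "${indexed_vcf.first()}"
"""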
Here is a little example for starters. First, you should specify a unique output name in each process. Currently, after_filtering.vcf is hardcoded, so the outputs will overwrite each other once copied to the publishDir. You can do that with the file's baseName as below, storing it permanently in the input file channel, with the first element being the sample name and the second the actual file. I made an example process that just runs head on the VCF; you can adapt it as needed.
#! /usr/bin/env nextflow

nextflow.enable.dsl = 2

params.input_files = "/Users/atpoint/vcf/*.vcf"
params.results_dir = "/Users/atpoint/vcf/"

// A channel that contains a map with sample name and the file itself
file_channel = Channel.fromPath( params.input_files, checkIfExists: true )
                      .map { it -> [it.baseName, it] }

// An example process just head-ing the vcf
process VcfHead {

    publishDir("${params.results_dir}/after_filtering", mode: 'copy')

    input:
    tuple val(name), path(vcf_in)

    output:
    path("*_head.vcf")

    script:
    """
    head -n 1 $vcf_in > ${name}_head.vcf
    """
}

// Run it
workflow {
    VcfHead(file_channel)
}
The file_channel channel looks like this if you add a .view() to it:
[one, /Users/atpoint/vcf/one.vcf]
[two, /Users/atpoint/vcf/two.vcf]

How to forward the value of a variable created in the script block of a Nextflow process to a value output channel?

I have a process that generates a value. I want to forward this value into a value output channel, but I cannot seem to get it working in one "go": I always have to write a file to the output and then define a new channel from the first one:
process calculate {

    input:
    file div from json_ch.collect()
    path "metadata.csv" from meta_ch

    output:
    file "dir/file.txt" into inter_ch

    script:
    """
    echo ${div} > alljsons.txt
    mkdir dir
    python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}
ch = inter_ch.map{file(it).text}
ch.view()
How do I fix this?
Thanks!
Best, t.
If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the 'inter_ch' channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the desire is to reduce the number of channels created, consider switching to the new DSL2 syntax. This won't let you avoid writing your calculated result to a file, but it may let you avoid an intermediary channel, as sketched below.
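A DSL2 version of the process might look roughly like this sketch (json_ch and meta_ch are assumed to be the same channels as in the question); the calculated text can then be read straight off the process output without naming an intermediary channel:

nextflow.enable.dsl=2

process calculate {

    input:
    path div
    path 'metadata.csv'

    output:
    path 'dir/file.txt'

    script:
    """
    echo ${div} > alljsons.txt
    mkdir dir
    python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}

workflow {
    // json_ch and meta_ch assumed to be created here, as in the original script
    calculate( json_ch.collect(), meta_ch )
    calculate.out.map { it.text }.view()
}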
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope and value expressions. For example:
process foo {

    input:
    file fasta from 'dummy'

    output:
    val x into var_channel
    val 'BB11' into str_channel
    val "${fasta.baseName}.out" into exp_channel

    script:
    x = fasta.name
    """
    cat $x > file
    """
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic sugar to your shell script at runtime, such that an output file is still created:
Contents of test.nf:
process test {

    output:
    env myval into out_ch

    script:
    '''
    myval=$(calc.py)
    '''
}

out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env

How to use the Ansible Python API to run an Ansible task

I would like to use the Ansible 2.9.9 Python API to fetch a config file from the servers in my hosts file and parse it to JSON format.
I don't know how to call an existing Ansible task using the Python API.
Going by the Ansible API documentation, how can I integrate my Ansible task with the sample code below?
Sample.py
#!/usr/bin/env python

import json
import shutil

from ansible.module_utils.common.collections import ImmutableDict
from ansible.parsing.dataloader import DataLoader
from ansible.vars.manager import VariableManager
from ansible.inventory.manager import InventoryManager
from ansible.playbook.play import Play
from ansible.executor.task_queue_manager import TaskQueueManager
from ansible.plugins.callback import CallbackBase
from ansible import context
import ansible.constants as C


class ResultCallback(CallbackBase):
    """A sample callback plugin used for performing an action as results come in.

    If you want to collect all results into a single object for processing at
    the end of the execution, look into utilizing the ``json`` callback plugin
    or writing your own custom callback plugin.
    """

    def v2_runner_on_ok(self, result, **kwargs):
        """Print a json representation of the result.

        This method could store the result in an instance attribute for retrieval later.
        """
        host = result._host
        print(json.dumps({host.name: result._result}, indent=4))


# since the API is constructed for CLI it expects certain options to always be set in the context object
context.CLIARGS = ImmutableDict(connection='local', module_path=['/to/mymodules'], forks=10, become=None,
                                become_method=None, become_user=None, check=False, diff=False)

# initialize needed objects
loader = DataLoader()  # Takes care of finding and reading yaml, json and ini files
passwords = dict(vault_pass='secret')

# Instantiate our ResultCallback for handling results as they come in.
# Ansible expects this to be one of its main display outlets.
results_callback = ResultCallback()

# create inventory, use path to host config file as source or hosts in a comma separated string
inventory = InventoryManager(loader=loader, sources='localhost,')

# variable manager takes care of merging all the different sources to give you
# a unified view of variables available in each context
variable_manager = VariableManager(loader=loader, inventory=inventory)

# create data structure that represents our play, including tasks;
# this is basically what our YAML loader does internally.
play_source = dict(
    name="Ansible Play",
    hosts='localhost',
    gather_facts='no',
    tasks=[
        dict(action=dict(module='shell', args='ls'), register='shell_out'),
        dict(action=dict(module='debug', args=dict(msg='{{shell_out.stdout}}')))
    ]
)

# Create play object; playbook objects use .load instead of init or new methods.
# This will also automatically create the task objects from the info provided in play_source.
play = Play().load(play_source, variable_manager=variable_manager, loader=loader)

# Run it - instantiate the task queue manager, which takes care of forking and
# setting up all objects to iterate over the host list and tasks
tqm = None
try:
    tqm = TaskQueueManager(
        inventory=inventory,
        variable_manager=variable_manager,
        loader=loader,
        passwords=passwords,
        stdout_callback=results_callback,  # Use our custom callback instead of the ``default`` callback plugin, which prints to stdout
    )
    result = tqm.run(play)  # most interesting data for a play is actually sent to the callback's methods
finally:
    # we always need to cleanup child procs and the structures we use to communicate with them
    if tqm is not None:
        tqm.cleanup()

    # Remove ansible tmpdir
    shutil.rmtree(C.DEFAULT_LOCAL_TMP, True)
sum.yml (generates a summary file for each host):
- hosts: staging
  tasks:
    - name: pt_mysql_sum
      shell: PTDEST=/tmp/collected;mkdir -p $PTDEST;cd /tmp;wget percona.com/get/pt-mysql-summary;chmod +x pt*;./pt-mysql-summary -- --user=adm --password=***** > $PTDEST/pt-mysql-summary.txt;cat $PTDEST/pt-mysql-summary.out;
      register: result
      environment:
        http_proxy: http://proxy.example.com:8080
        https_proxy: https://proxy.example.com:8080

    - name: ansible_result
      debug: var=result.stdout_lines

    - name: fetch_log
      fetch:
        src: /tmp/collected/pt-mysql-summary.txt
        dest: /tmp/collected/pt-mysql-summary-{{ inventory_hostname }}.txt
        flat: yes
hosts file
[staging]
vm1 ansible_ssh_host=10.40.50.41 ansible_ssh_user=testuser ansible_ssh_pass=*****
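For what it's worth, the tasks in sum.yml can be written in the same play_source dictionary format that Sample.py already uses. A rough sketch follows; the module names, arguments and proxy settings are copied from sum.yml above, the play name is made up, and the inventory source would need to point at the real hosts file instead of 'localhost,':

# Sketch: sum.yml expressed as a play_source dict for Sample.py above.
# Note that Sample.py sets connection='local' in context.CLIARGS; for the
# remote 'staging' hosts an SSH-based connection would be needed instead.
play_source = dict(
    name="pt_mysql_sum play",   # hypothetical play name
    hosts='staging',
    gather_facts='no',
    tasks=[
        dict(
            name='pt_mysql_sum',
            action=dict(
                module='shell',
                args='PTDEST=/tmp/collected;mkdir -p $PTDEST;cd /tmp;'
                     'wget percona.com/get/pt-mysql-summary;chmod +x pt*;'
                     './pt-mysql-summary -- --user=adm --password=***** > $PTDEST/pt-mysql-summary.txt;'
                     'cat $PTDEST/pt-mysql-summary.out;'
            ),
            register='result',
            environment=dict(
                http_proxy='http://proxy.example.com:8080',
                https_proxy='https://proxy.example.com:8080'
            )
        ),
        dict(
            name='ansible_result',
            action=dict(module='debug', args=dict(var='result.stdout_lines'))
        ),
        dict(
            name='fetch_log',
            action=dict(
                module='fetch',
                args=dict(
                    src='/tmp/collected/pt-mysql-summary.txt',
                    dest='/tmp/collected/pt-mysql-summary-{{ inventory_hostname }}.txt',
                    flat='yes'
                )
            )
        ),
    ]
)

# and point the inventory at the real hosts file:
# inventory = InventoryManager(loader=loader, sources='hosts')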

Is there any way to clean up a Jenkins WorkflowJob workspace with a Groovy script via the Jenkins script console?

Why this type of question again?
This question seems to have been asked multiple times, but all the answers are irrelevant for Jenkins Pipeline jobs (plugin: workflow-job).
Situation
I am migrating a bunch of old freestyle jobs from an old standalone Jenkins server to a distributed Jenkins environment, and I've decided to convert them to Jenkins Pipeline jobs (I can't use Blue Ocean for it, as the SCM is SVN).
Anyway, for some of the jobs it is not desired to clean up their workspaces, as they are sort of sanity/verification jobs, and because the SVN checkout and the built artifacts are large (2 GB in 20K files, just deleting it is slow).
However, I do occasionally (ad hoc) need to delete the workspace of such jobs.
And I don't want to do it by:
- modifying the Jenkinsfile and committing it to SVN
- "replaying" a pipeline run with modifications
And I don't have r/w access to the file system on that slave node (which would be the easiest thing to do).
Googling
A quick search on the internet turned up dozens of results [1, 2, 3, 4, ...] on how to call cleanWS() from within a Groovy script run from the Jenkins script console.
Unfortunately, none of them tries to delete the workspace of a true org.jenkinsci.plugins.workflow.job.WorkflowJob item instance of a job.
My Groovy attempt to cleanWS
Based on the answers gathered from the internet, I started my Groovy clean-up script, which can be executed from the Jenkins script console <Jenkins:port/script>:
import hudson.model.*
import com.cloudbees.hudson.*
import com.cloudbees.hudson.plugins.*
import com.cloudbees.hudson.plugins.folder.*
import org.jenkinsci.plugins.workflow.job.*

//jobsToRetrieve = ["aFolder/aJobInFolder","topLevelJob"]
jobsToRetrieve = ["Sandbox/PipelineTests/SamplePipeline"]

enumerateItems(Hudson.instance.items)

def enumerateItems(items) {
    items.each { item ->
        println("===============::: GENERAL INFO::: =======================")
        println(" item: " + item)
        println(" item FN: " + item.fullName)
        println(" item.getClass " + item.getClass())
        println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
        if (!(item instanceof Folder)) {
            jobName = item.getFullDisplayName()
            println(" :::jobname::: " + jobName)
            if (jobsToRetrieve.contains(item.getFullName())) {
                if (item instanceof WorkflowJob) {
                    println("XXXXXXXXXXXXX--- THIS IS THE JOB --- XXXXXXXXXXXXXXXXXXXXX")
                    println(" item.workspace: " + item.WORKSPACE)
                    println("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
                    println(" following methods ain't implemented for WorkflowJob type of Item\nand it will blow out.")
                    //see https://javadoc.jenkins.io/hudson/model/FreeStyleProject.html
                    println(" customWS: " + item.getCustomWorkspace())
                    println(" WS:" + item.getWorkspace())
                    item.doDoWipeOutWorkspace()
                }
            }
        } else {
            println(" :::foldername::: " + item.displayName)
            enumerateItems(((Folder) item).getItems())
        }
        println("==========================================================")
    }
}
Results (kind of expected, but disappointing)
As you can see, my script is going to explode on the calls to:
item.getCustomWorkspace()
item.getWorkspace()
item.doDoWipeOutWorkspace()
with MissingMethodException
groovy.lang.MissingMethodException: No signature of method: org.jenkinsci.plugins.workflow.job.WorkflowJob.doDoWipeOutWorkspace() is applicable for argument types: () values: []
at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:58)
at org.codehaus.groovy.runtime.callsite.PojoMetaClassSite.call(PojoMetaClassSite.java:49)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:117)
at Script1$_enumerateItems_closure1.doCall(Script1.groovy:33)
Simply because those methods aren't available for this type of item, only for hudson.model.FreeStyleProject.
Question: How, then, can I delete the Pipeline job's workspace?
There is another Jenkins plugin, Workspace Cleanup, which is typically used within a Jenkinsfile by calling cleanWs() inside a stage() {}, but I couldn't figure out how to use it from outside the Jenkinsfile (e.g. from my Groovy script run in the Jenkins script console).
Is this a bug/enhancement request for the Jenkins Pipeline job plugin?
Or is there any other way to cast the item to something that gives access to the desired functionality?
Alright, after investigating this more, googling even more, and listening to your ideas (this answer by Daniel Spilker particularly inspired me), I have achieved what I wanted, which is:
to independently CLEAN UP a Pipeline job's WORKSPACE via the Jenkins script console
(using only means available in Jenkins, without messing with the job configuration, updating the Jenkinsfile, or replaying).
The code is not surprisingly difficult, and for a manual demonstration it looks like this:
Jenkins jenkins = Jenkins.instance
Job item = jenkins.getItemByFullName('Sandbox/PipelineTests/SamplePipeline')

println("RootDir: " + item.getRootDir())

for (Node node in jenkins.nodes) {
    // Make sure slave is online
    if (!node.toComputer().online) {
        println "Node '$node.nodeName' is currently offline - skipping workspace cleanup"
        continue
    }

    println "Node '$node.nodeName' is online - performing cleanup:"

    // Do some cleanup
    FilePath wrksp = node.getWorkspaceFor(item)
    println("WRKSP " + wrksp)
    println("ls " + wrksp.list())
    println("Free space " + wrksp.getFreeDiskSpace())
    println("===== PERFORMING CLEAN UP!!! =====")
    wrksp.deleteContents()
    println("ls now " + wrksp.list())
    println("Free space now " + wrksp.getFreeDiskSpace())
}
Its output, if your job is found, looks like:
Result
RootDir: /var/lib/jenkins/jobs/Sandbox/jobs/PipelineTests/jobs/SamplePipeline
....
.... other node's output noise
....
Node 'mcs-ubuntu-chch' is online - performing cleanup:
WRKSP /var/lib/jenkins/workspace/Sandbox/PipelineTests/SamplePipeline
ls [/var/lib/jenkins/workspace/Sandbox/PipelineTests/SamplePipeline/README.md, /var/lib/jenkins/workspace/Sandbox/PipelineTests/SamplePipeline/.git]
Free space 3494574714880
===== PERFORMING CLEAN UP!!! =====
ls now []
Free space now 3494574919680
Mission completed:)
References
Mainly Jenkins javadoc
https://javadoc.jenkins.io/hudson/model/Node.html
https://javadoc.jenkins.io/hudson/model/TopLevelItem.html
https://javadoc.jenkins.io/hudson/model/Item.html
https://javadoc.jenkins.io/hudson/FilePath.html
This is not the most beautiful way, but you could just execute the OS command:
def isWin = Jenkins.instance.windows
def cmd = isWin ? "rmdir /s /q $workspace" : "rm -rf $workspace"
cmd.execute()
If you only need this once, or are not dealing with multiple operating systems, you can instead shorten the code to the respective command:
"rm -rf $workspace".execute()

Scheduling a task at multiple timings (with different parameters) using Celery Beat, but the task runs only once (with random parameters)

What I am trying to achieve
Write a scheduler that uses a database to schedule similar tasks at different timings.
For this I am using Celery Beat; the code snippet below should give an idea:
try:
    reader = MongoReader()
except:
    raise

try:
    tasks = reader.get_scheduled_tasks()
except:
    raise

celerybeat_schedule = dict()
for task in tasks:
    celerybeat_schedule[task["task_id"]] = dict()
    celerybeat_schedule[task["task_id"]]["task"] = task["task_name"]
    celerybeat_schedule[task["task_id"]]["args"] = (task,)
    celerybeat_schedule[task["task_id"]]["schedule"] = get_task_schedule(task)

app.conf.update(BROKER_URL=rabbit_mq_endpoint, CELERY_TASK_SERIALIZER='json',
                CELERY_ACCEPT_CONTENT=['json'], CELERYBEAT_SCHEDULE=celerybeat_schedule)
So these are the three steps:
- reading all tasks from the datastore
- creating the celerybeat_schedule dictionary, populated with each task's properties: task_name (the method that will run), parameters (the data to pass to the method), and schedule (when to run)
- updating the Celery configuration with this schedule
Expected scenario
Given that all entries run the same Celery task (which just prints), share the same schedule (run every 5 minutes), and differ only in the parameter specifying what to print, let's say the DB has:
task name , parameter , schedule
regular_print , Hi , {"minutes" : 5}
regular_print , Hello , {"minutes" : 5}
regular_print , Bye , {"minutes" : 5}
I expect all three of these to be printed every 5 minutes.
What happens
Only one of Hi, Hello, Bye prints (possibly at random, certainly not in sequence).
Please help,
Thanks a lot in advance :)
I was able to resolve this using version 4 of Celery. Below is a sample similar to what worked for me; you can also find this in the Celery documentation for version 4.
import os
from datetime import timedelta

from celery import Celery

# taking address and user-pass from the environment (you can use direct values instead)
ex_host_queue = os.environ["EX_HOST_QUEUE"]
ex_port_queue = os.environ["EX_PORT_QUEUE"]
ex_user_queue = os.environ["EX_USERID_QUEUE"]
ex_pass_queue = os.environ["EX_PASSWORD_QUEUE"]

broker = "amqp://" + ex_user_queue + ":" + ex_pass_queue + "@" + ex_host_queue + ":" + ex_port_queue + "//"

# celery initialization
app = Celery(__name__, backend=broker, broker=broker)
app.conf.task_default_queue = 'scheduler_queue'
app.conf.update(
    task_serializer='json',
    accept_content=['json'],  # Ignore other content
    result_serializer='json'
)

task = {"task_id": 1, "a": 10, "b": 20}

# method to update the scheduler
def add_scheduled_task(task):
    print("scheduling task")
    del task["_id"]
    print("adding task_id")
    name = task["task_name"]
    # register one periodic task per entry, keyed by its task_id
    app.add_periodic_task(timedelta(minutes=1), scheduler_task.s(task), name=task["task_id"])

@app.task(name='scheduler_task')
def scheduler_task(data):
    print(str(data["a"] + data["b"]))
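To tie this back to the database-driven setup in the question, the helper above would then be called once per entry read from the datastore, for example (MongoReader and get_scheduled_tasks come from the question; the rest is the standard add_periodic_task API shown above):

# Sketch: register one periodic task per datastore entry, each with its own
# parameters and its own name (task["task_id"]), so entries don't overwrite
# each other.
reader = MongoReader()
for task in reader.get_scheduled_tasks():
    add_scheduled_task(task)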
