Scheduling a task at multiple timings (with different parameters) using celery beat, but the task runs only once (with random parameters) - python-3.x

What I am trying to achieve
Write a scheduler that uses a database to schedule similar tasks at different timings.
For this I am using celery beat; the code snippet below gives an idea:
try:
    reader = MongoReader()
except:
    raise
try:
    tasks = reader.get_scheduled_tasks()
except:
    raise
celerybeat_schedule = dict()
for task in tasks:
    celerybeat_schedule[task["task_id"]] = dict()
    celerybeat_schedule[task["task_id"]]["task"] = task["task_name"]
    celerybeat_schedule[task["task_id"]]["args"] = (task,)
    celerybeat_schedule[task["task_id"]]["schedule"] = get_task_schedule(task)
app.conf.update(BROKER_URL=rabbit_mq_endpoint,
                CELERY_TASK_SERIALIZER='json',
                CELERY_ACCEPT_CONTENT=['json'],
                CELERYBEAT_SCHEDULE=celerybeat_schedule)
So these are three steps:
- reading all tasks from the datastore
- creating a dictionary, the celery beat schedule, populated with every task's properties: task_name (the method that will run), args (the data to pass to the method), and schedule (when to run it)
- updating the celery configuration with this dictionary
Expected scenario
Given that all entries run the same celery task (which just prints), share the same schedule of every 5 minutes, and differ only in the parameter saying what to print, say the db has:
task name , parameter , schedule
regular_print , Hi , {"minutes" : 5}
regular_print , Hello , {"minutes" : 5}
regular_print , Bye , {"minutes" : 5}
I expect all three to print every 5 minutes.
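For reference, with rows like these, the loop shown earlier should end up building a beat schedule shaped roughly like this (a sketch: the task_id keys and the timedelta returned by get_task_schedule are assumptions, since neither is shown above):
from datetime import timedelta

# each row must end up under its own unique task_id key,
# otherwise later rows overwrite earlier ones in the dict
celerybeat_schedule = {
    "task_hi":    {"task": "regular_print", "args": ({"parameter": "Hi"},),    "schedule": timedelta(minutes=5)},
    "task_hello": {"task": "regular_print", "args": ({"parameter": "Hello"},), "schedule": timedelta(minutes=5)},
    "task_bye":   {"task": "regular_print", "args": ({"parameter": "Bye"},),   "schedule": timedelta(minutes=5)},
}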
What happens
Only one of Hi, Hello, Bye prints (apparently at random, certainly not in sequence).
Please help,
Thanks a lot in advance :)

I was able to resolve this using Celery version 4. A sample similar to what worked for me is below; something like it can also be found in the Celery 4 documentation.
import os
from datetime import timedelta
from celery import Celery

# taking address and user/pass from the environment (you can hard-code the values instead)
ex_host_queue = os.environ["EX_HOST_QUEUE"]
ex_port_queue = os.environ["EX_PORT_QUEUE"]
ex_user_queue = os.environ["EX_USERID_QUEUE"]
ex_pass_queue = os.environ["EX_PASSWORD_QUEUE"]
broker = "amqp://" + ex_user_queue + ":" + ex_pass_queue + "@" + ex_host_queue + ":" + ex_port_queue + "//"

# celery initialization
app = Celery(__name__, backend=broker, broker=broker)
app.conf.task_default_queue = 'scheduler_queue'
app.conf.update(
    task_serializer='json',
    accept_content=['json'],  # ignore other content
    result_serializer='json'
)

task = {"task_id": 1, "a": 10, "b": 20}

# method to update the scheduler
def add_scheduled_task(task):
    print("scheduling task")
    task.pop("_id", None)  # drop Mongo's _id if the task came from the database
    print("adding task_id")
    app.add_periodic_task(timedelta(minutes=1), scheduler_task.s(task), name=task["task_id"])

@app.task(name='scheduler_task')
def scheduler_task(data):
    print(str(data["a"] + data["b"]))
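For reference, in Celery 4 the documented place to register periodic tasks at startup is the on_after_configure signal; a minimal sketch of wiring the helper above into it (looping over a MongoReader as in the question is an assumption here):
@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # assumption: tasks come from the same MongoReader used in the question
    for task in MongoReader().get_scheduled_tasks():
        add_scheduled_task(task)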

Related

Run databricks job from notebook

I want to know if it is possible to run a Databricks job from a notebook using code, and how to do it.
I have a job with multiple tasks and many contributors, and we have a job created to execute it all. Now we want to run the job from a notebook, both to test new features without creating a new task in the job and to run the job multiple times in a loop, for example:
for i in [1,2,3]:
run job with parameter i
Regards
What you need to do is the following:
Install the databricksapi: %pip install databricksapi==1.8.1
Create your job and return an output. You can do that by exiting the notebook like this:
import json
dbutils.notebook.exit(json.dumps({"result": f"{_result}"}))
If you want to pass a DataFrame, you have to pass it as a JSON dump too; there is some official Databricks documentation about that, check it out.
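For example, a minimal sketch of returning a small Spark DataFrame that way (the df variable and the conversion through toPandas are illustrative assumptions):
import json

# assuming `df` is a small Spark DataFrame; convert it to plain Python
# structures first so it can be serialized by json.dumps
payload = df.toPandas().to_dict(orient="records")
dbutils.notebook.exit(json.dumps({"result": payload}))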
Get the job id, you will need it later. You can get it from the job details in Databricks.
In the executor notebook you can use the following code:
import json
import time
from databricksapi import Jobs  # provided by the databricksapi package installed above

def run_ks_job_and_return_output(params):
    context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
    url = context['extraContext']['api_url']
    token = context['extraContext']['api_token']
    jobs_instance = Jobs.Jobs(url, token)  # initialize a jobs_instance
    runs_job_id = jobs_instance.runJob(****************, 'notebook', params)  # **** is the job id
    run_is_not_completed = True
    while run_is_not_completed:
        current_run = [run for run in jobs_instance.runsList('completed')['runs']
                       if run['run_id'] == runs_job_id['run_id']
                       and run['number_in_job'] == runs_job_id['number_in_job']]
        if len(current_run) == 0:
            time.sleep(30)
        else:
            run_is_not_completed = False
            current_run = current_run[0]
            print(f"Result state: {current_run['state']['result_state']}, "
                  f"you can check the resulting output in the following link: {current_run['run_page_url']}")
            note_output = jobs_instance.runsGetOutput(runs_job_id['run_id'])['notebook_output']
            return note_output

run_ks_job_and_return_output({'parm1': 'george',
                              'variable': "values1"})
If you want to run the job many times in parallel you can do the following (first be sure that you have increased the max concurrent runs in the job settings):
from multiprocessing.pool import ThreadPool

pool = ThreadPool(1000)
results = pool.map(lambda j: run_ks_job_and_return_output({'table': 'george',
                                                           'variable': "values1",
                                                           'j': j}),
                   [str(x) for x in range(2, len(snapshots_list))])
There is also the possibility of saving the whole HTML output, but maybe you are not interested in that. In any case, I will answer that in another post on StackOverflow.
Hope it helps.
You can use the following steps:
Note-01:
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")
result = dbutils.widgets.get("foo") + "-" + dbutils.widgets.get("foo2")

def display():
    print("Function Display: " + result)

dbutils.notebook.exit(result)
Note-02:
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    dbutils.notebook.run("Note-01 path", 60, {"foo": x, "foo2": 'Azure'})
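Since dbutils.notebook.run returns whatever the child notebook passed to dbutils.notebook.exit, the loop in Note-02 can also collect the results (a small sketch):
outputs = []
for x in thislist:
    # the return value is the string passed to dbutils.notebook.exit in Note-01
    returned = dbutils.notebook.run("Note-01 path", 60, {"foo": x, "foo2": 'Azure'})
    outputs.append(returned)
print(outputs)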

Multiprocess : Persistent Pool?

I have code like the one below:
def expensive(self, c, v):
    .....

def inner_loop(self, c, collector):
    self.db.query('SELECT ...', (c,))
    for v in self.db.cursor.fetchall():
        collector.append(self.expensive(c, v))

def method(self):
    # create a Pool
    # join the Pool ??
    self.db.query('SELECT ...')
    for c in self.db.cursor.fetchall():
        collector = []
        # RUN the whole cycle in parallel in separate processes
        self.inner_loop(c, collector)
        # do stuff with the collector
    #! close the pool ?
Both the outer and the inner loop are thousands of steps...
I think I understand how to run a Pool of a couple of processes; all the examples I found show more or less that.
But in my case I need to launch a persistent Pool and then feed it the data (c-values). Once an inner-loop process has finished, I have to supply the next available c-value, keep the processes running, and collect the results.
How do I do that?
A clunky idea I have is:
def method(self):
    ws = 4
    with Pool(processes=ws) as pool:
        cs = []
        for i, c in enumerate(..):
            cs.append(c)
            if i % ws == 0:
                res = [pool.apply(self.inner_loop, (c)) for i in range(ws)]
                cs = []
                collector.append(res)
Will this keep the same pool running, i.e. not launch a new process every time?
Do I need the 'if i % ws == 0' part, or can I use imap() or map_async() and the Pool object will block the loop when the available workers are exhausted and continue when some are freed?
Yes, the way that multiprocessing.Pool works is:
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue.
So simply submitting all your work to the pool via imap should be sufficient:
with Pool(processes=4) as pool:
    initial_results = db.fetchall("SELECT c FROM outer")
    results = [pool.imap(self.inner_loop, (c,)) for c in initial_results]
That said, if you really are doing this to fetch things from the DB, it may make more sense to move more processing down into that layer (bring the computation to the data rather than bringing the data to the computation).
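A self-contained sketch of that pattern, with stand-ins for the database rows and the expensive work: the same four worker processes stay alive for the whole run, and imap_unordered hands back each result as soon as a worker finishes it.
from multiprocessing import Pool

def expensive(c):
    # stand-in for the real per-row computation
    return c * c

if __name__ == "__main__":
    cs = range(1000)  # stand-in for the rows returned by the outer SELECT
    collector = []
    with Pool(processes=4) as pool:
        # no new process is launched per item; the pool's workers are reused
        for res in pool.imap_unordered(expensive, cs):
            collector.append(res)
    print(len(collector), "results collected")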

Schedule jobs dynamically with Flask APScheduler

I'm trying the code given at advanced.py with the following modification for adding new jobs.
I created a route for adding a new job
@app.route('/add', methods=['POST'])
def add_to_schedule():
    data = request.get_json()
    print(data)
    d = {
        'id': 'job' + str(random.randint(0, 100)),
        'func': 'flask_aps_code:job1',
        'args': (random.randint(200, 300), random.randint(200, 300)),
        'trigger': 'interval',
        'seconds': 10
    }
    print(app.config['JOBS'])
    scheduler.add_job(id='job' + str(random.randint(0, 100)), func=job1)
    # app.config.from_object(app.config['JOBS'].append(d))
    return str(app.config['JOBS']), 200
I tried adding the jobs to config['JOBS'] as well as via scheduler.add_job, but none of my new jobs get executed. Additionally, my first scheduled job doesn't get executed until I do a Ctrl+C in the terminal, after which the first scheduled job seems to execute, twice. What am I missing?
Edit: Seemingly the job running twice is because of Flask reloading, so ignore that.

how to get celery tasks id

I set up a periodic task using celery beat. The task runs and I can see the result in the console.
I want to have a python script that collects the results returned by the tasks.
I could do it like this:
# client.py
from cfg_celery import app

task_id = '337fef7e-68a6-47b3-a16f-1015be50b0bc'
try:
    x = app.AsyncResult(task_id)
    print(x.get())
except:
    print('some error')
Anyway, as you can see, for this test I had to copy the task_id printed to the celery beat console (so to speak) and hard-code it in my script. Obviously this is not going to work in real production.
I hacked around it by setting the task_id in the celery config file:
# cfg_celery.py
app = Celery('celery_config',
             broker='redis://localhost:6379/0',
             include=['taskos'],
             backend='redis'
             )
app.conf.beat_schedule = {
    'something': {
        'task': 'tasks.add',
        'schedule': 10.0,
        'args': (16, 54),
        'options': {'task_id': "my_custom_id"},
    }
}
This way I can read it like this:
# client.py
from cfg_celery import app

task_id = 'my_custom_id'
try:
    x = app.AsyncResult(task_id)
    print(x.get())
except:
    print('some error')
The problem with this approach is that I lose the previous results (previous to the call of client.py).
Is there some way I can read a list of the task_id's in the celery backend?
If I have more than one periodic tasks, can I get a list of task_id's from each periodic task?
Can I use app.tasks.key() to accomplish this, how?
P.S.: I'm not a native English speaker, and I'm new to celery, so please be nice if I got some terminology wrong.
OK. I am not sure if nobody answered this because it is difficult or because my question is too dumb.
Anyway, what I wanted to do is get the results of my celery-beat tasks from another python process.
Within the same process there was no problem: I could access the task id and everything was easy from there on. But from another process I didn't find a way to retrieve a list of the finished tasks.
I tried python-RQ (it is nice), but when I saw that I couldn't do that with RQ either, I came to understand that I had to manually make use of redis' storage capabilities. So I got what I wanted by doing this:
- Use 'bind=True' to be able to introspect from within the task function.
- Once I have the result of the function, I write it to a list in redis (with a small trick to limit the size of that list).
- Now, from an independent process, I can connect to the same redis server and retrieve the results stored in that list.
My files ended up being like this:
cfg_celery.py : here I define the way the tasks are going to be called.
# cfg_celery.py
from celery import Celery

appo = Celery('celery_config',
              broker='redis://localhost:6379/0',
              include=['taskos'],
              backend='redis'
              )
'''
urlea is decorated as a periodic_task, so there is no need to register it here.
But since add needs args, I register it manually to pass them along.
'''
appo.conf.beat_schedule = {
    'q_loco': {
        'task': 'taskos.add',
        'schedule': 10.0,
        'args': (16, 54),
        # 'options': {'task_id': "lcura"},
    }
}
taskos.py : these are the tasks.
# taskos.py
from cfg_celery import appo
from celery.decorators import periodic_task
from redis import Redis
from datetime import timedelta
import requests, time

rds = Redis()

@appo.task(bind=True)
def add(self, a, b):
    # result of the operation. very dummy.
    result = a + b
    # storing in redis
    r = (self.request.id, time.time(), result)
    rds.lpush('my_results', str(r))  # store as a plain string so redis accepts it
    # for this test I want to have at most 5 results stored in redis
    long = rds.llen('my_results')
    while long > 5:
        x = rds.rpop('my_results')
        print('popping out', x)
        long = rds.llen('my_results')
    time.sleep(1)
    return a + b

@periodic_task(run_every=20)
def urlea(url='https://www.fullstackpython.com/'):
    inicio = time.time()
    R = dict()
    try:
        resp = requests.get(url)
        R['vato'] = url + " = " + str(resp.status_code * 10)
        R['num palabras'] = len(resp.text.split())
    except:
        R['vato'] = None
        R['num palabras'] = 0
    print('u {} : {}'.format(url, time.time() - inicio))
    time.sleep(0.8)  # trick so the difference shows more clearly
    return R
consumer.py : the independent process that can get the results.
# consumer.py
from redis import Redis

nombre_lista = 'my_results'
rds = Redis()
tamaño = rds.llen(nombre_lista)
ultimos_resultados = list()
for i in range(tamaño):
    ultimos_resultados.append(rds.rpop(nombre_lista))
print(ultimos_resultados)
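Note that rpop removes each result as it is read, so consumer.py empties the list. If you only want to peek at the stored results without consuming them, redis lrange is a non-destructive alternative (a small sketch):
# consumer_peek.py (hypothetical variant of consumer.py)
from redis import Redis

rds = Redis()
# lrange returns the whole list without removing anything from redis
ultimos_resultados = rds.lrange('my_results', 0, -1)
print(ultimos_resultados)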
I am relatively new to programming and I hope that this answer can help noobs like me. If I got something wrong feel free to make the corrections as necessary.

Grinder multiple post url test not working properly

I am running Grinder to test a POST URI with 10 different JSON bodies. The response times given by Grinder are not right. Tests with an individual JSON body give a reasonable response time, but the script with 10 different JSON bodies gives a very high response time and a very low TPS. I am using 1 agent with 5 worker processes and 15 threads. Can someone help me figure out where the problem might be?
The script I am using is:
from net.grinder.script.Grinder import grinder
from net.grinder.script import Test
from net.grinder.plugin.http import HTTPRequest
from HTTPClient import NVPair
from java.io import FileInputStream

test1 = Test(1, "Request resource")
request1 = HTTPRequest()
# response1 = HTTPResponse()
test1.record(request1)
log = grinder.logger.info

class TestRunner:
    def __call__(self):
        # request1.setDataFromFile("ReqBody.txt")
        payload1 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req.txt")
        payload2 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req2.txt")
        payload3 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req3.txt")
        payload4 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req4.txt")
        payload5 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req5.txt")
        payload6 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req6.txt")
        payload7 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req7.txt")
        payload8 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req8.txt")
        payload9 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req9.txt")
        payload10 = FileInputStream("/home/ubuntu/grinder-3.11/scripts/Req10.txt")
        headersPost = [NVPair('Content-Type', ' application/json')]
        # request1.setHeaders(headersPost)
        myload = [payload1, payload2, payload3, payload4, payload5,
                  payload6, payload7, payload8, payload9, payload10]
        for f in myload:
            result1 = request1.POST("http://XX.XX.XX.XX:8080/api/USblocks/ORG101/1/N/", f, headersPost)
            log(result1.toString())
The first step for you is to run it with 1 thread, 1 process, and 1 agent; I expect it will run properly then.
Since a for loop is used, every thread runs all ten request bodies on each run.
I think what you want, and what should be done, is that each thread sends one request.
You can move the request out to a global method and use something like the thread number to pick which body to send. That also means removing the for loop, since you then control how many times the script is executed, or simply leave it running for a duration.
Running 10 threads in parallel is a good number to start with; then you can slowly see how many requests are actually getting accepted.
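A rough sketch of that idea, reusing the names from the script above (grinder.getThreadNumber() comes from the Grinder script API; the file-name list just mirrors the ten Req*.txt files):
# each thread always posts the same body, chosen by its thread number
payload_names = ["Req.txt"] + ["Req%d.txt" % i for i in range(2, 11)]

class TestRunner:
    def __call__(self):
        name = payload_names[grinder.getThreadNumber() % len(payload_names)]
        payload = FileInputStream("/home/ubuntu/grinder-3.11/scripts/" + name)
        headersPost = [NVPair('Content-Type', 'application/json')]
        result1 = request1.POST("http://XX.XX.XX.XX:8080/api/USblocks/ORG101/1/N/", payload, headersPost)
        log(result1.toString())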
