Luigi task methods execution order - python-3.5

What is the order in which Luigi executes a task's methods (run, output, requires)? I understand requires is called first to validate the task DAG, but shouldn't output be called only after run()?
I'm actually trying to wait for a Kafka message in run and, based on that, trigger a bunch of other tasks and return a LocalTarget. Like this:
def run(self):
    for message in self.consumer:
        self.metadata_key = str(message.value, 'utf-8')
        self.path = os.path.join(settings.LUIGI_OUTPUT_PATH, self.metadata_key, self.batch_id)
        if not os.path.exists(self.path):
            os.mkdir(self.path)
        with self.conn.cursor() as cursor:
            all_accounts = cursor.execute('select domainname from tblaccountinfo;')
            for each in all_accounts:
                open(os.path.join(self.path, each), 'w').close()

def output(self):
    return LocalTarget(self.path)
However, I get an error saying:
Exception: path or is_tmp must be set
at the return LocalTarget(self.path) line. Why does Luigi try to execute the output() method before run() has finished?

When you run a pipeline (i.e. one or more tasks), Luigi first checks whether each task's output targets already exist, and if not, schedules the task to run.
How does Luigi know which targets it must check? It gets them by calling your task's output() method.

It is not about the execution order. Before putting a task into pending status, Luigi calls output() to check whether the file you want to create already exists, so any variables used there must already be resolved. Here you are using self.path, which only gets created inside run(); that is why you get the error.
Either build the path on the class itself (so output() can compute it on its own) and consume it in run(), or create the target in output() and consume it from run(), e.g.:
self.output().open('w').close()
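As a minimal sketch of the first option (not from the original thread, and assuming the metadata key and batch id can be passed in as task parameters rather than read from Kafka inside run()):

import os
import luigi
from luigi import LocalTarget

class WaitForKafkaBatch(luigi.Task):     # hypothetical task name
    output_root = luigi.Parameter()      # e.g. settings.LUIGI_OUTPUT_PATH
    metadata_key = luigi.Parameter()     # assumed to be known before the task runs
    batch_id = luigi.Parameter()

    @property
    def path(self):
        # Built purely from parameters, so output() can be called before run().
        return os.path.join(self.output_root, self.metadata_key, self.batch_id)

    def output(self):
        # Marker file inside the batch directory ("_SUCCESS" is just a convention here).
        return LocalTarget(os.path.join(self.path, '_SUCCESS'))

    def run(self):
        os.makedirs(self.path, exist_ok=True)
        # ... consume the Kafka message, create the per-account files, etc. ...
        with self.output().open('w') as marker:
            marker.write('done')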

Related

How to give dynamic soft_time_limit for celery task

I need to add a dynamic soft_time_limit to my task, so that the task is executed with a limit decided at call time rather than hard-coded. Is there any way to define this?
@APP.task(acks_late=True, soft_time_limit=10000, trail=True, bind=True)
def execute_fun(self, data):
    try:
        do_work()
    except Exception as error:
        print('---error---', error)
In the function above, I don't want to hard-code the soft_time_limit in the decorator; it should be taken as a dynamic time limit. How can I achieve this?
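No answer was recorded for this question here, but as a hedged sketch: Celery lets you pass execution options such as soft_time_limit per call through apply_async, so the limit can be chosen by the caller instead of being fixed in the decorator. The broker URL, limit value and arguments below are purely illustrative:

from celery import Celery

APP = Celery('tasks', broker='redis://localhost:6379/0')   # illustrative broker URL

def do_work(data):
    ...   # placeholder for the real work from the question

@APP.task(acks_late=True, trail=True, bind=True)   # note: no soft_time_limit baked in
def execute_fun(self, data):
    try:
        do_work(data)
    except Exception as error:
        print('---error---', error)

# The caller picks the limit at call time; 300 seconds is just an example value.
execute_fun.apply_async(args=[{'some': 'data'}], soft_time_limit=300)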

Locust - How do I define multiple task sets for the same user?

Please consider the following code:
class Task1(TaskSet):
    @task
    def task1_method(self):
        pass

class Task2(TaskSet):
    @task
    def task2_method(self):
        pass

class UserBehaviour(TaskSet):
    tasks = [Task1, Task2]

class LoggedInUser(HttpUser):
    host = "http://localhost"
    wait_time = between(1, 5)
    tasks = [UserBehaviour]
When I execute the code above with just one user, Task2.task2_method never gets executed, only the method from Task1.
What can I do to make sure the code from both task sets gets executed for the same user?
I would like to do it this way because I want to separate the tasks into different files for better project organization. If that is not possible, how can I define tasks in different files in a way that gives me a set of tasks for each of my application's modules?
I think I got it. To solve the problem I had to add a method at the end of each task set that stops its execution:
def stop(self):
    self.interrupt()
In addition to that, I had to change the inherited class to SequentialTaskSet so all tasks get executed in order.
This is the full code:
class Task1(SequentialTaskSet):
    @task
    def task1_method(self):
        pass

    @task
    def stop(self):
        self.interrupt()

class Task2(SequentialTaskSet):
    @task
    def task2_method(self):
        pass

    @task
    def stop(self):
        self.interrupt()

class UserBehaviour(SequentialTaskSet):
    tasks = [Task1, Task2]

class LoggedInUser(HttpUser):
    host = "http://localhost"
    wait_time = between(1, 5)
    tasks = [UserBehaviour]
Everything seems to be working fine now.
At first I thought this was a bug, but it is actually working as intended (although I don't really understand why it was implemented that way):
One important thing to know about TaskSets is that they will never
stop executing their tasks, and hand over execution back to their
parent User/TaskSet, by themselves. This has to be done by the
developer by calling the TaskSet.interrupt() method.
https://docs.locust.io/en/stable/writing-a-locustfile.html#interrupting-a-taskset
I would solve this issue with inheritance: define a base TaskSet or User class that holds the common tasks, then subclass it, adding the user-type-specific tasks/code.
If you define a base User class, remember to set abstract = True if you don't want Locust to run that user as well.
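A minimal sketch of that inheritance approach (class names and endpoints are illustrative, not from the thread):

from locust import HttpUser, task, between

class BaseUser(HttpUser):
    abstract = True                      # Locust will not run this user directly
    host = "http://localhost"
    wait_time = between(1, 5)

    @task
    def common_task(self):
        self.client.get("/health")       # shared, illustrative endpoint

class ModuleAUser(BaseUser):
    @task
    def module_a_task(self):
        self.client.get("/module-a")     # illustrative endpoint

class ModuleBUser(BaseUser):
    @task
    def module_b_task(self):
        self.client.get("/module-b")     # illustrative endpoint

Each user class (and its tasks) can live in its own file and simply import BaseUser.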

Want to use TriggerDagRunOperator in Airflow to trigger many sub-DAGs using only a main DAG with BashOperator (sub-DAG operator)

I am unable to understand the concept of the payload in Airflow with TriggerDagRunOperator. Please help me understand this term in a very easy way.
The TriggerDagRunOperator triggers a DAG run for a specified dag_id. It needs a trigger_dag_id of type string and a python_callable param, which is a reference to a Python function that is called with the context object and a placeholder object obj for your callable to fill and return if you want a DagRun created. This obj object has run_id and payload attributes that you can modify in your function.
The run_id should be a unique identifier for that DAG run, and the payload has to be a picklable object that will be made available to your tasks while executing that DAG run. Your function header should look like def foo(context, dag_run_obj):
Picklable simply means it can be serialized by the pickle module. For a basic understanding of this, see what can be pickled and unpickled?. The pickle protocol provides more details and shows how classes can customize the process.
Reference: https://github.com/apache/airflow/blob/d313d8d24b1969be9154b555dd91466a2489e1c7/airflow/operators/dagrun_operator.py#L37
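Based on that (Airflow 1.x-era) API, a minimal sketch might look like the following; the DAG ids, dates and payload content are illustrative, and note that newer Airflow versions dropped python_callable in favour of a conf argument:

from datetime import datetime
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator

def conditionally_trigger(context, dag_run_obj):
    # Fill the placeholder object described above and return it
    # so that a DagRun is actually created for the target DAG.
    dag_run_obj.payload = {'message': 'hello from the main dag'}   # must be picklable
    return dag_run_obj

with DAG('main_dag', start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    trigger = TriggerDagRunOperator(
        task_id='trigger_sub_dag',
        trigger_dag_id='sub_dag',        # dag_id of the DAG to trigger
        python_callable=conditionally_trigger,
    )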

APScheduler and Pyramid (Python)

I'm trying to use the wonderful APScheduler in a Pyramid API. The idea is to have a background job run regularly, while we still query the API for the result from time to time. Basically I use the job in a class like this:
class MyClass(object):
    def __init__(self):
        self.current_result = 0
        scheduler = BackgroundScheduler()
        scheduler.start()
        scheduler.add_job(self.my_job, "interval", id="foo", seconds=5)

    def my_job(self):
        print("i'm updating result")
        self.current_result += 1
And outside of this class (a service for me), the API has a POST endpoint that returns the MyClass instance's current result:
class MyApi(object):
    def __init__(self):
        self.my_class = MyClass()

    @view_config(request_method='POST')
    def my_post(self):
        return self.my_class.current_result
When everything runs, I see the prints and the value being incremented inside the service. But current_result stays at 0 when fetched from the POST endpoint.
From what I know of threading, I guess the update is not applied to the same my_class object but to a copy passed to the thread.
One solution I see would be to update the variable in a shared intermediate (on disk, or in a database), but I wondered whether it is possible to do this in memory.
I manage to do exactly this in a regular script, or with one script and a very simple Flask API (no class for the API there), but I can't get this logic to work inside the Pyramid API.
It must be linked to some internal of Pyramid spawning my API endpoint on a different thread, but I can't pin down the problem.
Thanks!
=== EDIT ===
I have tried several things to solve the issue. First, the instance of MyClass is initialized in another script, following a container pattern. That container is by default included in all MyApi instances of the Pyramid app and is supposed to hold all global variables linked to my project.
I also define a global instance of MyClass just to be sure, and print its current_result value to compare:
global_my_class = MyClass()

class MyApi(object):
    def __init__(self):
        pass

    @view_config(request_method='POST')
    def my_post(self):
        print(global_my_class.current_result)
        return self.container.my_class.current_result
Using the debugger, I checked that MyClass is only instantiated twice during the API execution (once for the global variable, once inside the container).
So what I see in the logs are two values of current_result being incremented, but on each call to my_post I only get 0s.
An instance of a view class only lives for the duration of the request: the request comes in, a view class is created, produces the result, and is disposed. As such, each instance of your view gets a new copy of MyClass(), separate from the previous requests.
As a very simple solution you may try defining a global instance, which will be shared process-wide:
my_class = MyClass()

class MyApi(object):
    @view_config(request_method='POST')
    def my_post(self):
        return my_class.current_result
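If a module-level global feels too implicit, another common Pyramid pattern (a sketch, not part of the original answer; MyClass is the class from the question) is to create the shared instance once at configuration time, hang it off the registry, and reach it through the request:

from pyramid.config import Configurator
from pyramid.view import view_config

def main(global_config, **settings):
    config = Configurator(settings=settings)
    config.registry.my_class = MyClass()      # created once per process
    config.scan()
    return config.make_wsgi_app()

@view_config(request_method='POST', renderer='json')
def my_post(request):
    # Same instance on every request, so the scheduler's updates are visible.
    return request.registry.my_class.current_result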

How to use the same context manager across different methods?

I am trying to implement a class which uses a Python context manager.
Though I understand the general concept of __enter__ and __exit__, I don't see how to use the same context manager across multiple code blocks.
For example, take the case below:
@contextmanager
def backupContext(input):
    try:
        yield xyz
    finally:
        revert(xyz)

class do_something:
    def __init__(self):
        self.context = contextVal

    def doResourceOperation_1(self):
        with backupContext(self.context) as context:
            do_what_you_want_1(context)

    def doResourceOperation_2(self):
        with backupContext(self.context) as context:
            do_what_you_want_2(context)
I am invoking the context manager twice. Suppose I want to do it only once, during __init__, use the same context manager object for all my operations, and then, when the object is deleted, perform the revert operation: how should I go about it?
Should I call __enter__ and __exit__ manually instead of using the with statement?
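No answer was recorded here, but one common way to hold a single context open across methods (a sketch using the standard library's contextlib.ExitStack instead of calling __enter__/__exit__ by hand; backupContext and the do_what_you_want_* helpers are the question's placeholders):

from contextlib import ExitStack

class DoSomething:
    def __init__(self, context_val):
        self._stack = ExitStack()
        # Enter the context once; it stays open until close() is called.
        self.context = self._stack.enter_context(backupContext(context_val))

    def do_resource_operation_1(self):
        do_what_you_want_1(self.context)

    def do_resource_operation_2(self):
        do_what_you_want_2(self.context)

    def close(self):
        # Runs backupContext's cleanup (the finally/revert part).
        self._stack.close()

Relying on __del__ for the cleanup is fragile, so an explicit close() (or making DoSomething itself a context manager) is usually preferred over tying the revert to object deletion.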
