Viewflow Django - How do you deprecate a step? - django-viewflow

I recently removed a step in my viewflow.
Now I am getting 500 errors from coerce_to_related_instance(task, task.flow_task.flow_class.task_class) with the error 'NoneType' object has no attribute 'flow_class'.
class TaskIterable(ModelIterable):
    def __iter__(self):
        base_iterator = super(TaskIterable, self).__iter__()
        if getattr(self.queryset, '_coerced', False):
            for task in base_iterator:
                if isinstance(task, self.queryset.model):
                    print(task)
                    task = coerce_to_related_instance(task, task.flow_task.flow_class.task_class)
                yield task
        else:
            for task in base_iterator:
                yield task
I understand this happens because the task can no longer be mapped to a valid flow node, since the old step has been removed.
What are my options?
Keep the old step around so that it can still be mapped?
Run a SQL script to update all 'flow_task' references?
Something else?

Yep, generally, in order to keep node type information and the associated task detail views, you need to keep unconnected flow nodes in your flow class.
The usual flow-update scenario is to drop the incoming connection but leave the node with its .Next(...), which allows users to complete existing tasks on that node.
If that is not possible, flow node task references can be updated in a data migration:
http://docs.viewflow.io/viewflow_core.html#flow-migration
The PRO version contains a special Obsolete node that allows dropping outdated nodes; detail views for those tasks are then served by the Obsolete node's detail views.
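If existing task rows must keep loading while you migrate, a defensive guard around the coercion avoids the AttributeError for rows whose node no longer exists. This is a minimal sketch, not viewflow API: safe_coerce and the coerce callable are hypothetical names standing in for the call in the question's iterator.

```python
# Hypothetical guard: skip coercion for tasks whose node was removed
# from the flow class (task.flow_task resolves to None).
def safe_coerce(task, coerce):
    flow_task = getattr(task, "flow_task", None)
    if flow_task is None:
        # Node no longer exists in the flow class; return the base instance.
        return task
    return coerce(task, flow_task.flow_class.task_class)
```

Note this only papers over the lookup so listings don't 500; the real fix is still keeping the node in the flow class or migrating the flow_task references.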

Where to check the pool stats of FAIR scheduler in Spark Web UI

I see my Spark application is using FAIR scheduler:
But I can't confirm whether it is using two pools I set up (pool1, pool2). Here is a thread function I implemented in PySpark which is called twice - one with "pool1" and the other with "pool2".
def do_job(f1, f2, id, pool_name, format="json"):
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    ...
I thought the "Stages" menu is supposed to show the pool info but I don't see it. Does that mean the pools are not set up correctly or am I looking at the wrong place?
I am using PySpark 3.3.0 on top of EMR 6.9.0
You can confirm it in the Spark UI. In my article I created three pools (module1, module2, module3) based on certain logic, with each job assigned to a specific pool; when FAIR scheduling is active, the Stages page of the Spark UI shows which pool each stage ran in.
Note: please see the verification steps in the article.
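As a quick local sanity check, independent of Spark itself, you can parse the allocation file that spark.scheduler.allocation.file points at and confirm that pool1 and pool2 are actually declared. This is a sketch; the XML content below is a hypothetical example of such a file, not your actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fairscheduler.xml declaring the two pools from the question.
FAIR_SCHEDULER_XML = """<?xml version="1.0"?>
<allocations>
  <pool name="pool1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="pool2">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
"""

def defined_pools(xml_text):
    """Return the pool names declared in a fairscheduler.xml document."""
    root = ET.fromstring(xml_text)
    return [pool.get("name") for pool in root.findall("pool")]

print(defined_pools(FAIR_SCHEDULER_XML))  # ['pool1', 'pool2']
```

If a pool name passed to setLocalProperty is not declared in the file, Spark silently falls back to a default pool, which is a common reason the expected pools never appear in the UI.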

Celery Process a Task and Items within a Task

I am new to Celery, and I would like advice on how best to use Celery to accomplish the following.
Suppose I have ten large datasets. I realize that I can use Celery to do work on each dataset by submitting ten tasks. But suppose that each dataset consists of 1,000,000+ text documents stored in a NoSQL database (Elasticsearch in my case). The work is performed at the document level. The work could be anything - maybe counting words.
For a given dataset, I need to start the dataset-level task. The task should read documents from the data store. Then workers should process the documents - a document-level task.
How can I do this, given that the task is defined at the dataset level, not the document level? I am trying to move away from using a JoinableQueue to store documents and submit them for work with multiprocessing.
I have read that it is possible to use multiple queues in Celery, but it is not clear to me whether that is the best approach.
Let's see if this helps. You can define a workflow, add tasks to it, and then run the whole thing after building up your tasks. Plain Python functions can return task signatures that you add to Celery primitives (chain, group, chord, etc.); see the Celery canvas documentation for more info. For example, let's say you have two tasks that process documents for a given dataset:
def some_task(documents):
    return dummy_task.si(documents)

def some_other_task(documents):
    return dummy_task.si(documents)

@celery.task()
def dummy_task(*args, **kwargs):
    return True
You can then provide a task that generates the subtasks like so:
@celery.task()
def dataset_workflow(*args, **kwargs):
    datasets = get_datasets(*args, **kwargs)
    workflows = []
    for dataset in datasets:
        documents = get_documents(dataset)
        workflow = chain(some_task(documents), some_other_task(documents))
        workflows.append(workflow)
    run_workflows = chain(*workflows).apply_async()
Keep in mind that generating a lot of tasks can consume a lot of memory on the Celery workers, so throttling or breaking the task generation up may be needed as you start to scale your workloads.
Additionally, you can put the document-level tasks on a different queue than your workflow task if needed, based on resource constraints etc.
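The throttling advice above can be sketched without Celery at all: batch the document IDs so each document-level task handles a slice instead of one task per document. The chunked helper below is a hypothetical illustration, not Celery API.

```python
def chunked(ids, size):
    """Split a list of document IDs into fixed-size batches; each batch
    would become one Celery task signature instead of one task per document."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

batches = chunked(list(range(10)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With 1,000,000+ documents, batching like this turns a million task messages into a few thousand, which keeps broker and worker memory manageable.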

Submitting multiple runs to the same node on AzureML

I want to perform hyperparameter search using AzureML. My models are small (around 1GB) thus I would like to run multiple models on the same GPU/node to save costs but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there but if this is done the run fails due to an MPI error that reads as if the cluster is configured to only allow one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run to one GPU and additionally providing a MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my train script to train multiple models in parallel, s.t. each run wraps multiple trainings. I would like to avoid this option though, because then logging and checkpoint writes become increasingly messy and it would require a large refactor of the train pipeline. Also this functionality seems so basic that I hope there is a way to do this gracefully. Any ideas?
Use the Run.create_children method, which starts child runs that are "local" to the parent run and don't need separate authentication.
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes used for a hyperparameter tuning run, so there is one execution per node.
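To pack several small trainings onto one node, the single submitted run can split the search space itself and report each configuration through a child run created with Run.create_children. Here is a minimal, Azure-free sketch of the splitting step only; partition_grid and the grid values are hypothetical, and the actual child-run creation is omitted.

```python
from itertools import product

def partition_grid(grid, runs_per_node):
    """Expand a hyperparameter grid and split it into groups; each group is
    trained sequentially inside one parent run (one node/GPU), with one
    child run logged per configuration."""
    combos = [dict(zip(grid, values)) for values in product(*grid.values())]
    return [combos[i:i + runs_per_node] for i in range(0, len(combos), runs_per_node)]

grid = {"lr": [1e-3, 1e-4], "batch_size": [32, 64]}
groups = partition_grid(grid, runs_per_node=2)
# 4 combinations split into 2 groups of 2
```

Each group would then be handled by one submitted ScriptRunConfig job, so two nodes cover the whole grid instead of four.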
For serving, a single deployed service can load multiple model versions in init(); the score function then uses a particular model version depending on the request's parameters. Or use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs

Netsuite Map Reduce yielding

I read in the documentation that soft governance limits cause map/reduce scripts to yield and reschedule. My problem is that I cannot see where the docs explain what happens during the yield. Is getInputData called again to regather the same data set to be mapped, or is the initial data set persisted somewhere, with already-mapped and already-reduced records excluded from processing?
With yielding, the getInputData stage is not called again. From the docs:
If a job monopolizes a processor for too long, the system can naturally finish the job after the current map or reduce function has completed. In this case, the system creates a new job to continue executing remaining key/value pairs. Based on its priority and submission timestamp, the new job either starts right after the original job has finished, or it starts later, to allow higher-priority jobs processing other scripts to execute. For more details, see SuiteScript 2.0 Map/Reduce Yielding.
This is different from server restarts or interruptions, however.

How to debug a slow PySpark application

There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use the Spark UI to see the execution plan of your jobs and the time spent in each phase, then optimize your operations using those statistics. Here is a very good presentation about monitoring Spark apps using the Spark UI: https://youtu.be/mVP9sZ6K__Y (Spark Summit Europe 2016, by Jacek Laskowski)
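Besides the UI, you can force evaluation of individual transformations and time them yourself, for example by following each candidate transformation with an action such as count(). A small generic timing helper, sketched below; the DataFrame calls in the comment are illustrative, not from the question's code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f}s")

# Usage against a hypothetical DataFrame: the count() action forces the
# lazy filter to actually execute, so the measured time is meaningful.
# with timed("filter + count"):
#     df.filter(df.value > 0).count()
```

Keep in mind that the forcing action does work of its own (and caching can distort repeat runs), so treat these numbers as relative comparisons between variants rather than absolute costs.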
Any job troubleshooting should have the below steps.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From a Hadoop admin perspective, basic troubleshooting for a long-running Spark job: go to RM > Application ID.
a) Check for AM & non-AM preemption. This can happen if more memory than required is assigned to either the driver or the executors, which can then get preempted by a high-priority job/YARN queue.
b) Click on the AppMaster URL and review the environment variables.
c) Check the Jobs section and review the event timeline. Check whether executors start immediately after the driver or take time.
d) If the driver process is taking time, see whether collect()/collectAsList() is running on the driver, as these methods tend to take time: they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If there is no issue in the event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any data skewness issue.
f) If all tasks are complete and the Spark job is still running, go to the Executor page > driver process thread dump > search for the driver, and look at which operation the driver is working on. NameNode operation methods you may see there (if any) include:
getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()
