I am new to Celery, and I would like advice on how best to use Celery to accomplish the following.
Suppose I have ten large datasets. I realize that I can use Celery to do work on each dataset by submitting ten tasks. But suppose that each dataset consists of 1,000,000+ text documents stored in a NoSQL database (Elasticsearch in my case). The work is performed at the document level. The work could be anything - maybe counting words.
For a given dataset, I need to start the dataset-level task. The task should read documents from the data store. Then workers should process the documents - a document-level task.
How can I do this, given that the task is defined at the dataset level, not the document level? I am trying to move away from using a JoinableQueue to store documents and submit them for work with multiprocessing.
I have read that it is possible to use multiple queues in Celery, but it is not clear to me that that is the best approach.
Let's see if this helps. You can define a workflow, add tasks to it, and then run the whole thing after building it up. You can have normal Python methods return task signatures that can be added into Celery primitives (chain, group, chord, etc.). See here for more info. For example, let's say you have two tasks that process documents for a given dataset:
def some_task(documents):
    return dummy_task.si(documents)

def some_other_task(documents):
    return dummy_task.si(documents)

@celery.task(bind=True)
def dummy_task(self, *args, **kwargs):
    return True
You can then provide a task that generates the subtasks like so:
@celery.task()
def dataset_workflow(*args, **kwargs):
    datasets = get_datasets(*args, **kwargs)
    workflows = []
    for dataset in datasets:
        documents = get_documents(dataset)
        workflow = chain(some_task(documents), some_other_task(documents))
        workflows.append(workflow)
    run_workflows = chain(*workflows).apply_async()
Keep in mind that generating a lot of tasks can consume a lot of memory on the Celery workers, so throttling or breaking the task generation up might be needed as you start to scale your workloads (see the sketch below).
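For example, here is a minimal sketch of throttling the document-level work with Celery's chunks primitive instead of creating one task per document; the broker URL, task name, and document IDs below are assumptions, not part of the original setup:

from celery import Celery

celery = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker

@celery.task()
def count_words(doc_id):
    # hypothetical document-level work, e.g. fetch the document and count its words
    return doc_id

doc_ids = range(1_000_000)
# 1,000,000 documents become 1,000 tasks of 1,000 documents each, which keeps the
# number of queued messages (and worker memory) far lower than one task per document.
count_words.chunks(((i,) for i in doc_ids), 1000).apply_async()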
Additionally, you can have the document-level tasks on a different queue than your workflow task if needed, based on resource constraints etc.
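A hedged sketch of what that routing could look like; the queue names and the module path "tasks" are assumptions:

from celery import Celery

celery = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker

# Send the document-level task to its own queue so it does not compete with the
# workflow-level task for workers.
celery.conf.task_routes = {
    "tasks.process_document": {"queue": "documents"},
    "tasks.dataset_workflow": {"queue": "workflow"},
}

# Workers can then be dedicated per queue, for example:
#   celery -A tasks worker -Q workflow
#   celery -A tasks worker -Q documents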
I want to perform hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
from azureml.core import Experiment, ScriptRunConfig

experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    compute_target="gpu_cluster",
    environment="env_name",
    arguments=["--args args"],
)
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster is configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error shown above.
I guess I could always rewrite my train script to train multiple models in parallel, so that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes then become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which will start child runs that are “local” to the parent run and don’t need authentication.
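For example, a rough sketch of that idea inside train.py; the train_one_model helper and the hyperparameter values are hypothetical and not from the original question:

from azureml.core import Run

parent_run = Run.get_context()
# One child run per model that will share this node/GPU.
children = parent_run.create_children(count=3)

for child, params in zip(children, [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]):
    metric = train_one_model(**params)  # assumed user-defined training function
    child.log("val_metric", metric)
    child.complete()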
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
You can keep a single deployed service but load multiple model versions in init(); the score function then, depending on a request parameter, uses a particular model version to score.
or with the new ML Endpoints (Preview).
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
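For the single-service, multiple-model-version approach above, here is a hedged sketch of what the scoring script could look like; the model name, versions, and request payload shape are assumptions:

import json

import joblib
from azureml.core.model import Model

models = {}

def init():
    global models
    # Load several registered versions of the same model once, at service start.
    for version in (1, 2, 3):
        path = Model.get_model_path("my_model", version=version)
        models[version] = joblib.load(path)

def run(raw_data):
    payload = json.loads(raw_data)
    version = payload.get("model_version", 3)  # the caller picks the version
    prediction = models[version].predict(payload["data"])
    return prediction.tolist()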
We are a skills-based development company that creates competitions. The players of a competition can upload photos and rank each other's photos to earn points. One of the key requirements is to update the competition leaderboard regularly to keep the players interested. We are looking at a fan-out and fan-in architecture to implement the leaderboard. A typical workflow is attached.
From our analysis, Durable Functions seems to be the best option.
However, we have the following constraints:
Each competition has about 500 players
A player will be ranking up to 500 photos
I have been trying to read through the documentation but could not find anything on the scalability of this approach with Durable Functions. Any comments or insights are highly appreciated.
You can find the performance targets for Durable Functions here: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#performance-targets
Parallel activity execution (fan-out): 100 activities per second, per instance
Parallel response processing (fan-in): 150 responses per second, per instance
If you run on an Azure Functions Consumption plan, the scale controller will scale out to more instances as more messages appear in the work-item queue. This is the queue used to start activities (the ones you would use to calculate a single player's score).
You can also improve fan-in performance by doing what they say in the docs:
Unlike fan-out, fan-in operations are limited to a single VM. If your application uses the fan-out, fan-in pattern and you are concerned about fan-in performance, consider sub-dividing the activity function fan-out across multiple sub-orchestrations.
So you'd have:
Main orchestrator
  Batch 0 sub-orchestrator
    Activity for user 0 in batch 0
    Activity for user 1 in batch 0
    ...
  Batch 1 sub-orchestrator
    Activity for user 0 in batch 1
    Activity for user 1 in batch 1
    ...
  ...
The reason this kind of sub-orchestrator batching is faster is that the orchestrator's history table gains more and more rows as activities complete, and the orchestrator has to load all of those rows every time a result comes in. By capping how many rows any single orchestration can accumulate, you keep that replay cost bounded and get the best performance.
TL;DR: I think the fan-out will scale well, but you may want to do sub-orchestrator batching to improve fan-in performance.
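As a rough illustration only, here is what the sub-orchestrator batching could look like with the Python Durable Functions SDK; the orchestrator, sub-orchestrator, and activity names are placeholders, and a real Functions app would register each function in its own folder with the usual bindings:

import azure.durable_functions as df

# In a real Functions app each orchestrator lives in its own function folder and is
# exported via df.Orchestrator.create; both are shown here side by side for brevity.

def main_orchestrator(context: df.DurableOrchestrationContext):
    players = context.get_input()  # e.g. a list of ~500 player ids
    batch_size = 50
    batches = [players[i:i + batch_size] for i in range(0, len(players), batch_size)]

    # Fan out: one sub-orchestrator per batch keeps each orchestration's history small.
    results = yield context.task_all([
        context.call_sub_orchestrator("ScoreBatchOrchestrator", batch)
        for batch in batches
    ])

    # Fan in: flatten the per-batch scores into one leaderboard update.
    return [score for batch_scores in results for score in batch_scores]

def score_batch_orchestrator(context: df.DurableOrchestrationContext):
    batch = context.get_input()
    # Fan out: one activity per player in this batch.
    scores = yield context.task_all([
        context.call_activity("CalculatePlayerScore", player)
        for player in batch
    ])
    return scores

main = df.Orchestrator.create(main_orchestrator)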
Problem
I have a purchase service that users can use to buy/rent digital assets like games, media, movies... When a purchase event happens, I create a job and schedule it to run at the calculated expiry date to remove the key for that asset.
Everything works, but it would be better if I could group the jobs that will run in the same Agenda db scan into one batch job that removes multiple keys.
This would significantly reduce the number of db read/write/delete operations on both the keys and agenda collections. It would also free memory most of the time, since instead of storing 100+ jobs to run in a scan, Agenda would store just one job that removes 100+ keys.
Research
The closest feature I found in the Agenda repo is unique(), which lets you find and modify an existing job that matches the fields defined in unique(). If it could concatenate new jobs onto the existing job, that would solve my case.
Implementation
Before diving in and modifying the package, I want to check whether people have already solved this problem and have some thoughts to share.
Another solution, without touching the package, is to create an in-memory dictionary that accumulates jobs for a specific db scan with this strategy:
dict = {}

// if a key expires at 1597202228, put its job into the slot for that scan:
dict = {
  1597300000: [jobA]
}

// another key expires at 1597202238, so its job goes into the same slot:
dict = {
  1597300000: [jobA, jobB]
}

// the latch conditions for flushing batches into agenda:
// if dict_size == dict_allocated_memory, put the whole dict into the db.
// if a batch_size == batch_limit, put that batch into the db and remove it from the dict.
// if a batch is going to expire before the next db scan, put it into the db
// (it may be empty or have only a few jobs) and remove it from the dict.
I am supposed to update 2-3 fields in a database table that has tens of millions of records. I am doing the operation in a .NET application in batches of 100K (recursively), updating the table with regular ADO.NET code that executes stored procedures. Done this way, the process is estimated to take 30 hours (probably because of IO and server round trips), and I have to do it in just 4.
Would DataAdapter.Update be any faster? Any suggestions on improving speed are greatly appreciated.
Well, I also had the same problem, and we solved it by using threading.
First, make a list of DataSets with 1,000 or 10,000 records in each.
Then create a pooled task for each one by calling:
Task tsk = Task.Factory.StartNew(() => function_Name(listObject));
Define function_Name and perform the batch update there using a DataAdapter.
You will see a huge difference; the pooled tasks run concurrently.
I have TPL (Task Parallel Library) code for executing a loop in parallel in C# in a class library project using .NET 4.0. I am new to TPL in C# and have the following questions.
CODE Background:
In the code that appears just after the questions, I am getting all unprocessed batches and then processing each batch one at a time. Each batch can be processed independently since there are no dependencies between batches, but for each batch the sequence of steps is very important when processing it.
My questions are:
Will using Parallel.ForEach be advisable in this scenario, where the number of batches, and therefore the number of iterations, could be very small or very large (like 10,000 batches)? I am afraid that with too many batches, using parallelism might cause more harm than good in this case.
When using Parallel.ForEach, is the sequence of steps in the ProcessBatch method guaranteed to execute in the order step 1, step 2, step 3, and then step 4?
public void ProcessBatches() {
    List<Batch> batches = ABC.Data.GetUnprocessesBatches();
    Parallel.ForEach(batches, batch => {
        ProcessBatch(batch);
    });
}

public void ProcessBatch(Batch batch) {
    // step 1
    ABC.Data.UpdateHistory(batch);
    // step 2
    ABC.Data.AssignNewRegions(batch);
    // step 3
    UpdateStatus(batch);
    // step 4
    RemoveBatchFromQueue(batch);
}
UPDATE 1:
From the accepted answer, the number of iterations is not an issue even when it is large. In fact, according to the article Potential Pitfalls in Data and Task Parallelism, performance improvements from parallelism are likely when there are many iterations; for fewer iterations, a parallel loop is not going to provide any benefit over a sequential/synchronous loop.
So it seems that having a large number of iterations in the loop is the best situation for using Parallel.ForEach.
The basic rule of thumb is that parallel loops that have few iterations and fast user delegates are unlikely to speed up much.
Parallel.ForEach will use the appropriate number of threads for the hardware you are running on, so you don't need to worry about too many batches causing harm.
The steps will run in order for each batch. ProcessBatch will get called on different threads for different batches, but for each batch the steps will execute in the order they are defined in that method.