I'm running dask locally using the distributed scheduler on my machine with 8 cores. On initialization I see:
Which looks correct, but I'm confused by the task stream in the diagnostics (shown below):
I was expecting 8 rows corresponding to the 8 workers/cores, is that incorrect?
Thanks
AJ
I've added the code I'm running:
import dask.dataframe as dd
from dask.distributed import Client, progress
client = Client()
progress(client)
# load datasets
trd = (dd.read_csv('trade_201811*.csv', compression='gzip',
                   blocksize=None, dtype={'Notional': 'float64'})
       .assign(timestamp=lambda x: dd.to_datetime(x.timestamp.str.replace('D', 'T')))
       .set_index('timestamp', sorted=True))
Each line corresponds to a single thread. Some more sophisticated Dask operations start up additional threads. This happens particularly when tasks launch other tasks, which is especially common in machine learning workloads.
My guess is that you're using one of the following approaches:
dask.distributed.get_client or dask.distributed.worker_client
Scikit-Learn's Joblib
Dask-ML
If so, the behavior that you're seeing is normal. The task stream plot will look a little odd, yes, but hopefully it is still interpretable.
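As a rough illustration of a task launching other tasks, here is a minimal sketch using worker_client. The function names square, sum_of_squares and run_demo are made up for this example, and the small in-process cluster is only an assumption to keep it self-contained. While sum_of_squares waits inside worker_client, it steps out of the worker's thread pool, so the worker can start another thread, which is the kind of thing that shows up as an extra row in the task stream.

```python
from dask.distributed import Client, worker_client


def square(x):
    return x * x


def sum_of_squares(n):
    # This function runs as a task on a worker. worker_client() lets it
    # submit further tasks; while it waits, it leaves the worker's thread
    # pool so the worker is free to start another thread.
    with worker_client() as client:
        futures = client.map(square, range(n))
        return sum(client.gather(futures))


def run_demo(n=4):
    # Assumption: a tiny in-process cluster, used only to keep the sketch small.
    with Client(processes=False, n_workers=1, threads_per_worker=2,
                dashboard_address=None) as client:
        return client.submit(sum_of_squares, n).result()
```

Calling run_demo(4) submits one outer task, which in turn submits four inner tasks and adds up their results.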
I am currently using the multiprocessing module to parallelize iterations as shown in this example. The thing is that this way I would be using only 1 worker and its cores, but not using all the workers available. Also I'm not able to parallelize experiments (I'm running several experiments, and several iterations for each experiment).
This code is taking too long to run, and my understanding is that the runtime could be greatly reduced using PySpark. My Spark knowledge is very limited and I don't know how to translate this code to use it with Spark.
All the functions and classes used here are written in pure Python (with numpy and pandas).
import concurrent.futures
import multiprocessing as mp

def process_simulation(experiment):
    number_of_workers = mp.cpu_count()
    with concurrent.futures.ProcessPoolExecutor(max_workers=number_of_workers) as executor:
        # Pass the function itself to map, not the result of calling it
        results = list(executor.map(Simulation.simulation_steps, iterations_generator()))
    # Collect every simulation result on the experiment
    experiment.simulations = []
    for result in results:
        experiment.simulations.append(result)
For context, Experiment and Simulation are classes (there's no inheritance). One experiment needs multiple simulations to be completed.
Thank you!
You can use Fugue to bring this type of logic to PySpark with a minimal wrapper. The only thing is you need to start with a DataFrame of inputs, and then you can do something like:
from fugue import transform
transform(input_df, Simulation.simulation_steps, schema=<your output schema here>, partition={"how": "per_row"}, engine="spark")
I can help get it into this shape if I have more details about what the logic is. It may just need one wrapper function. (contact info in bio).
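To make the idea of starting from a table of inputs more concrete, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in: experiment_ids, iterations and run_one are invented for illustration, a list of dicts plays the role of the input DataFrame, and a local thread pool plays the role of the Spark engine. The point is only that flattening experiments and iterations into one row per unit of work lets both levels run in parallel.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs; the real experiments and iterations are not shown in the question.
experiment_ids = ["exp_a", "exp_b"]
iterations = range(3)

# One row per (experiment, iteration) pair. With Spark this would be the input DataFrame.
input_rows = [
    {"experiment": e, "iteration": i}
    for e, i in itertools.product(experiment_ids, iterations)
]


def run_one(row):
    # Hypothetical wrapper: takes one input row and returns one output row.
    # A real version would call the simulation logic here.
    return {**row, "result": row["iteration"] ** 2}


# Local stand-in for the distributed engine: every row is an independent unit of work.
with ThreadPoolExecutor(max_workers=4) as pool:
    output_rows = list(pool.map(run_one, input_rows))
```

The output rows can then be grouped by the experiment key to rebuild each experiment's list of simulations.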
I have a project in which joblib works well on one computer: it sends functions to different cores effectively.
Now I have been asked to do the same thing on a Databricks cluster. I've tried this many ways today, but the problem is that the jobs do not spread out one per compute node. I've got 4 executors and I set n_jobs=6, but when I send 4 jobs through, some of them pile up on the same node, leaving other nodes unused. Here's a picture of the Databricks Spark UI:
Sometimes when I try this, I get 1 job running on a node by itself and all of the rest piled up on one node.
In the joblib and joblibspark docs, I see the parameter batch_size, which specifies how many tasks are sent to a given node. Even when I set that to 1, I get the same problem: unused nodes.
from joblib import Parallel, delayed
from joblibspark import register_spark
register_spark()
output = Parallel(backend="spark", n_jobs=6,
                  verbose=config.JOBLIB_VERBOSE, batch_size=1)(
    delayed(fit_one)
    (x, model_data=model_data, dlmodel=dlmodel,
     outdir=outdir, frac=sample_p,
     score_type=score_type,
     save=save,
     verbose=verbose) for x in ZZ)
I've hacked at this all day, trying various backends and combinations of settings. What am I missing?
I am working with a large CSV (~60GB; ~250M rows) with Dask in Jupyter.
The first thing I want to do with the DF after loading it is to concatenate two string columns. I can do so successfully, but I noticed that cell execution time does not seem to decrease with higher worker counts (I tried 5, 10, and 20 on a machine with 64 logical cores). If anything, every five or so workers seem to add an extra minute to execution time.
Meanwhile, the progress bar of Dask's dashboard suggests that the task scales well with worker count. With 5 workers the task finishes (according to the dashboard) in about 10-15 minutes. With 20 workers the task stream suggests completion in roughly 3-5 minutes. But cell execution time remains around 25 minutes: in the 5-worker case the cell appears to hang for an extra 10-15 minutes after the stream has finished, and in the 20-worker case for 20-22 more minutes, with no evidence of worker activity as far as I can see.
This is the code that I'm running:
import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=20)
client = Client(cluster)
df = dd.read_csv('df_name.csv', dtype={'col1': 'object', 'col2': 'object'})
with ProgressBar():
    df["col_merged"] = df["col3"]+df["col4"]
    df = df.compute()
Python version: 3.9.1
Dask version: 2021.06.2
What am I missing? Could this simply be overhead from having Dask to coordinate several workers?
To add to @SultanOrazbayev's answer, the specific thing that takes time after the tasks have all finished is copying data from the workers into your client process to assemble the single in-memory dataframe that you have asked for. This is not a "task", as all the computing has already happened, and it does not parallelise well, because the client is a single thread pulling data from the workers.
As with the comment above: if you want to achieve parallelism, you need to load the data in the workers (which dd.read_csv does) and act on it in the workers to get your result. You should only .compute() relatively small things. Conversely, if your data fits comfortably into memory, there was probably nothing to be gained by having Dask involved at all; just use pandas.
Running
df = df.compute()
will attempt to load all 250M rows into memory. Even if this is feasible on your machine, you will still spend a lot of time because each worker has to send its chunk, so there will be a lot of data transfer...
The core idea is to bring into memory only the results of the reduced calculations, and distribute the workload among the workers until then.
I have already used the multiprocessing package to use up all the CPUs on one node. I would like to use another 10 nodes to complete my job, so I need 10 * 10 task threads for the calculation. Is there some example code? I found the post "How to use multiple nodes/cores on a cluster with parellelized Python code", but I am still confused, for instance about the implemented interface:
job_task = perform_job(job_params, nodes, cups)
Any suggestion is greatly appreciated.
I noticed that a task of a Dask graph can be executed several times by different workers.
I also see this log message in the scheduler console (I don't know whether it is related to resilience):
"WARNING - Lost connection to ... while sending result: Stream is closed"
Is there a way to prevent Dask from executing the same task twice on different workers?
Note that I'm using:
dask 0.15.0
distributed 1.15.1
Thx
Bertrand
The short answer is "no".
Dask reserves the right to call your function many times. This might occur if a worker goes down, or if Dask does some load balancing and moves tasks around the cluster just as they have started running.
However you can significantly reduce the likelihood of a task running multiple times by turning off work stealing:
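def turn_off_stealing(dask_scheduler):
    # Stop the periodic callback that drives work stealing
    dask_scheduler.extensions['stealing']._pc.stop()

# The dask_scheduler argument is supplied by run_on_scheduler
client.run_on_scheduler(turn_off_stealing)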
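Since a repeated run can only be made less likely, one way to cope is to write tasks so that running them twice is harmless. The sketch below is not part of Dask: write_result_idempotently is a hypothetical helper, and the assumption is that each task has a unique key and writes its output to a file. The output path is derived from the key and the file is replaced atomically, so a duplicate run overwrites the same file instead of creating a second or partial one.

```python
import os
import tempfile


def write_result_idempotently(out_dir, task_key, payload):
    """Write payload to a file named after task_key; safe to run more than once."""
    final_path = os.path.join(out_dir, f"{task_key}.txt")
    # Write to a temporary file first, then rename it into place, so a
    # duplicate or interrupted run never leaves a half-written result.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    os.replace(tmp_path, final_path)
    return final_path
```

Calling the helper twice with the same task_key leaves exactly one file in out_dir with the same contents.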