Parallel Python with joblibspark: how to evenly distribute jobs? - python-3.x

I have a project in which joblib works well on one computer: it sends functions to different cores effectively.
Now I have an assignment to do the same thing on a Databricks cluster. I've tried this many ways today, but the problem is that the jobs do not spread out one per compute node. I've got 4 executors and I set n_jobs=6, but when I send 4 jobs through, some of them pile up on the same node, leaving other nodes unused. Here's a picture of the Databricks Spark UI:
Sometimes when I try this, I get 1 job running on a node by itself and all of the rest piled up on one node.
In the joblib and joblibspark docs, I see the batch_size parameter, which specifies how many tasks are sent to a given node. Even when I set that to 1, I get the same problem: nodes sit unused.
from joblib import Parallel, delayed
from joblibspark import register_spark

register_spark()

output = Parallel(backend="spark", n_jobs=6,
                  verbose=config.JOBLIB_VERBOSE, batch_size=1)(
    delayed(fit_one)(x, model_data=model_data, dlmodel=dlmodel,
                     outdir=outdir, frac=sample_p,
                     score_type=score_type,
                     save=save,
                     verbose=verbose)
    for x in ZZ)
I've hacked at this all day, trying various backends and combinations of settings. What am I missing?
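For reference, a minimal sketch of one way to sanity-check n_jobs against the parallelism the cluster actually exposes (assuming a Databricks notebook where spark is predefined; these names are not from the question's own code):

# Minimal sketch: inspect how many task slots the cluster exposes before choosing n_jobs.
# Assumes a Databricks notebook where `spark` is the predefined SparkSession.
sc = spark.sparkContext
print("defaultParallelism:", sc.defaultParallelism)
print("executor instances:", sc.getConf().get("spark.executor.instances", "not set"))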

Related

Dask DF operation takes a long time after 100% progress in Dask dashboard

I am working with a large CSV (~60GB; ~250M rows) with Dask in Jupyter.
The first thing I want to do with the DF after loading it is to concatenate two string columns. I can do so successfully, but I noticed that cell execution time does not seem to decrease with higher worker counts (I tried 5, 10, and 20 on a machine with 64 logical cores). If anything, every five or so workers seem to add an extra minute to execution time.
Meanwhile, the progress bar of Dask's dashboard suggests that the task scales well with worker count. At 5 workers the task finishes (according to the dashboard) in about 10-15 min. At 20 workers the task stream visualisation suggests completion in roughly 3-5 min. But cell execution time remains around 25 min; i.e. in the 5-worker case the cell appears to hang for an extra 10-15 min. after the stream has finished, and in the 20-worker case for 20-22 more min., with no evidence of worker activity as far as I can see.
This is the code that I'm running:
import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=20)
client = Client(cluster)

df = dd.read_csv('df_name.csv', dtype={'col1': 'object', 'col2': 'object'})

with ProgressBar():
    df["col_merged"] = df["col3"] + df["col4"]
    df = df.compute()
Python version: 3.9.1
Dask version: 2021.06.2
What am I missing? Could this simply be overhead from having Dask coordinate several workers?
To add to @SultanOrazbayev's answer: the specific thing that takes time after the tasks have all finished is copying data from the workers into your client process to assemble the single in-memory dataframe that you asked for. This is not a "task", as all the computing has already happened, and it does not parallelise well, because the client is a single thread pulling data from the workers.
As with the comment above: if you want to achieve parallelism, you need to load the data in the workers (which dd.read_csv does) and act on it in the workers to get your result, and only .compute() relatively small things. Conversely, if your data fits comfortably into memory, there was probably nothing to be gained by involving Dask at all; just use pandas.
Running
df = df.compute()
will attempt to load all 250M rows into memory. Even if this is feasible on your machine, you will still spend a lot of time, because each worker has to send its chunk to the client, so there will be a lot of data transfer...
The core idea is to bring into memory only the results of the reduced calculations, and distribute the workload among the workers until then.
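A minimal sketch of that idea, reusing the (hypothetical) column names from the question; either writing the result from the workers or pulling back only a small reduction avoids funnelling 250M rows through the single-threaded client:

import dask.dataframe as dd

df = dd.read_csv('df_name.csv', dtype={'col1': 'object', 'col2': 'object'})
df["col_merged"] = df["col3"] + df["col4"]

# Option 1: let the workers write the full result to disk in parallel
# (assumes pyarrow or fastparquet is installed)
df.to_parquet('df_merged.parquet')

# Option 2: only pull small, reduced results back to the client
n_rows = len(df)        # a single number crosses the wire
preview = df.head(10)   # a tiny pandas frame crosses the wire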

Understanding Dask's Task Stream

I'm running dask locally using the distributed scheduler on my machine with 8 cores. On initialization I see:
Which looks correct, but I'm confused by the task stream in the diagnostics (shown below):
I was expecting 8 rows corresponding to the 8 workers/cores; is that incorrect?
Thanks
AJ
I've added the code I'm running:
import dask.dataframe as dd
from dask.distributed import Client, progress

client = Client()
progress(client)

# load datasets
trd = (dd.read_csv('trade_201811*.csv', compression='gzip',
                   blocksize=None, dtype={'Notional': 'float64'})
       .assign(timestamp=lambda x: dd.to_datetime(x.timestamp.str.replace('D', 'T')))
       .set_index('timestamp', sorted=True))
Each line corresponds to a single thread. Some more sophisticated Dask operations will start additional threads; this happens particularly when tasks launch other tasks, which is common especially in machine learning workloads.
My guess is that you're using one of the following approaches:
dask.distributed.get_client or dask.distributed.worker_client
Scikit-Learn's Joblib
Dask-ML
If so, the behavior that you're seeing is normal. The task stream plot will look a little odd, yes, but hopefully it is still interpretable.
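For illustration, a minimal sketch of the "tasks launching tasks" pattern via worker_client (the recursive fib function is purely hypothetical); each nested task can occupy an extra thread, which is what produces the extra rows in the task stream:

from dask.distributed import Client, worker_client

def fib(n):
    # A task that submits more tasks from inside a worker;
    # worker_client() gives the running task a handle back to the scheduler.
    if n < 2:
        return n
    with worker_client() as client:
        a = client.submit(fib, n - 1)
        b = client.submit(fib, n - 2)
        return a.result() + b.result()

client = Client()
print(client.submit(fib, 10).result())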

SLURM job taking up entire node when using just one GPU

I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU, and we have 4 GPUs per node. However, once a job is running, it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, each using one GPU?
My script looks like this:
#SLURM --gres=gpu:1
#SLURM --ntasks-per-node 1
#SLURM -p ghp-queue
myprog.exe
I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:
PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE
After that, I was able to run four jobs with --gres=gpu:1, and each one took a different GPU (a fifth job is queued, as expected).

dask processes tasks twice

I noticed that tasks of a dask graph can be executed several times by different workers.
I also see this log in the scheduler console (I don't know if it can be related to resilience):
"WARNING - Lost connection to ... while sending result: Stream is closed"
Is there a way to prevent dask from executing the same task twice on different workers?
Note that I'm using:
dask 0.15.0
distributed 1.15.1
Thx
Bertrand
The short answer is "no".
Dask reserves the right to call your function many times. This might occur if a worker goes down, or if Dask does some load balancing and moves tasks around the cluster just as they have started.
However, you can significantly reduce the likelihood of a task running multiple times by turning off work stealing:
def turn_off_stealing(dask_scheduler):
    dask_scheduler.extensions['stealing']._pc.stop()

client.run(turn_off_stealing)
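As a possible alternative (an assumption on my part; check it against the distributed version you are running), newer releases expose work stealing as a configuration flag, so the same effect can be sketched without reaching into scheduler internals:

import dask
from dask.distributed import Client

# Assumption: the "distributed.scheduler.work-stealing" config key exists in your
# version of distributed; set it before the scheduler/cluster is created.
dask.config.set({"distributed.scheduler.work-stealing": False})
client = Client()  # a local cluster started after this picks up the setting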

Spark cores & tasks concurrency

I have a very basic question about Spark. I usually run Spark jobs using 50 cores. While viewing the job progress, most of the time it shows 50 processes running in parallel (as it is supposed to), but sometimes it shows only 2 or 4 Spark processes running in parallel, like this:
[Stage 8:================================> (297 + 2) / 500]
The RDDs being processed are repartitioned into more than 100 partitions, so that shouldn't be an issue.
I have an observation, though. I've noticed that most of the time this happens, the data locality in the Spark UI shows NODE_LOCAL, while at other times, when all 50 processes are running, some of the processes show RACK_LOCAL.
This makes me suspect that it happens because the data is cached on the same node before processing to avoid network overhead, and this slows down further processing.
If this is the case, what's the way to avoid it? And if this isn't the case, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, a good place to start is to check whether the Spark instance is configured well. There is a great Cloudera blog post about it.
However, if the problem isn't with the configuration (as was the case for me), then the problem is somewhere within your code. The issue is that sometimes, for various reasons (skewed joins, uneven partitions in data sources, etc.), the RDD you are working on gets a lot of data in 2-3 partitions while the rest of the partitions have very little data.
In order to reduce the data shuffle across the network, Spark tries to have each executor process the data residing locally on that node. So 2-3 executors keep working for a long time, while the rest of the executors finish their data in a few milliseconds. That's why I was experiencing the issue described in the question above.
The way to debug this problem is, first of all, to check the partition sizes of your RDD. If one or a few partitions are much bigger than the others, the next step is to look at the records in the large partitions, so you know, especially in the case of skewed joins, which key is getting skewed. I've written a small function to debug this:
from itertools import islice

def check_skewness(df):
    sampled_rdd = df.sample(False, 0.01).rdd.cache()  # take just a 1% sample for fast processing
    # count the records in each partition
    l = sampled_rdd.mapPartitionsWithIndex(lambda x, it: [(x, sum(1 for _ in it))]).collect()
    max_part = max(l, key=lambda item: item[1])
    min_part = min(l, key=lambda item: item[1])
    if max_part[1] / min_part[1] > 5:  # if the difference is greater than 5 times
        print('Partitions Skewed: Largest Partition', max_part, 'Smallest Partition', min_part,
              '\nSample Content of the largest Partition:\n')
        print(sampled_rdd.mapPartitionsWithIndex(
            lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
    else:
        print('No Skewness: Largest Partition', max_part, 'Smallest Partition', min_part)
It gives me the smallest and largest partition sizes, and if the difference between the two is more than 5 times, it prints 5 elements of the largest partition, which should give you a rough idea of what's going on.
Once you have figured out that the problem is a skewed partition, you can find a way to get rid of the skewed key, or you can repartition your dataframe, which will force the data to be distributed equally. You'll then see that all the executors work for roughly equal time, you'll see far fewer of the dreaded OOM errors, and processing will be significantly faster too.
These are just my two cents as a Spark novice; I hope Spark experts can add more to this issue, as I think a lot of newbies in the Spark world face this kind of problem far too often.
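To make the repartitioning step concrete, a minimal sketch (df and 'join_key' are placeholders, not names from the question):

# Minimal sketch: spread a skewed DataFrame more evenly before the heavy stage.
# `df` and 'join_key' are placeholders for your own data.

# Simple fix: hash-repartition into more, roughly equal partitions.
df = df.repartition(200)

# If the skew comes from a join key, repartitioning by that key keeps related
# records together while still splitting the data across executors.
df = df.repartition(200, 'join_key')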
