Why isn't a very big Spark stage using all available executors?

I am running a Spark job with some very big stages (e.g. >20k tasks), using 1k to 2k executors.
In some cases, a stage will appear to run unstably: many available executors become idle over time, despite still being in the middle of a stage with many unfinished tasks. From the user perspective, it appears that tasks are finishing, but executors that have finished a given task do not get a new task assigned to them. As a result, the stage takes longer than it should, and a lot of executor CPU-hours are being wasted on idling. This seems to mostly (only?) happen during input stages, where data is being read from HDFS.
Example Spark stderr log during an unstable period -- notice that the number of running tasks decreases over time until it almost reaches zero, then suddenly jumps back up to >1k running tasks:
[Stage 0:==============================> (17979 + 1070) / 28504]
[Stage 0:==============================> (18042 + 1019) / 28504]
[Stage 0:===============================> (18140 + 921) / 28504]
[Stage 0:===============================> (18222 + 842) / 28504]
[Stage 0:===============================> (18263 + 803) / 28504]
[Stage 0:===============================> (18282 + 786) / 28504]
[Stage 0:===============================> (18320 + 751) / 28504]
[Stage 0:===============================> (18566 + 508) / 28504]
[Stage 0:================================> (18791 + 284) / 28504]
[Stage 0:================================> (18897 + 176) / 28504]
[Stage 0:================================> (18940 + 134) / 28504]
[Stage 0:================================> (18972 + 107) / 28504]
[Stage 0:=================================> (19035 + 47) / 28504]
[Stage 0:=================================> (19067 + 17) / 28504]
[Stage 0:================================> (19075 + 1070) / 28504]
[Stage 0:================================> (19107 + 1039) / 28504]
[Stage 0:================================> (19165 + 982) / 28504]
[Stage 0:=================================> (19212 + 937) / 28504]
[Stage 0:=================================> (19251 + 899) / 28504]
[Stage 0:=================================> (19355 + 831) / 28504]
[Stage 0:=================================> (19481 + 708) / 28504]
This is what the stderr looks like when a stage is running stably -- the number of running tasks remains roughly constant, because new tasks are assigned to executors as they finish their previous tasks:
[Stage 1:===================> (11599 + 2043) / 28504]
[Stage 1:===================> (11620 + 2042) / 28504]
[Stage 1:===================> (11656 + 2044) / 28504]
[Stage 1:===================> (11692 + 2045) / 28504]
[Stage 1:===================> (11714 + 2045) / 28504]
[Stage 1:===================> (11741 + 2047) / 28504]
[Stage 1:===================> (11771 + 2047) / 28504]
[Stage 1:===================> (11818 + 2047) / 28504]
Under what circumstances would this happen, and how can I avoid this behavior?
NB: I am using dynamic allocation, but I'm pretty sure this is unrelated to the problem -- e.g., during an unstable period, the Spark Application Master UI shows that the expected number of executors are "Active", but they are not running any "Active Tasks."

I've seen behavior like this from Spark when the amount of time taken per task is very low. For some reason, the scheduler seems to assume that the job will complete faster without the extra distribution overhead, since each task completes so quickly.
A couple of things to try:
Try .coalesce() to reduce the number of partitions, so that each partition takes longer to run (granted, this could introduce a shuffle step and may increase overall job time; you'll have to experiment).
Tweak the spark.locality.wait* settings (see the Spark configuration docs). If each task takes less than the default wait time of 3s, then perhaps the scheduler is just trying to keep the existing slots full and never has a chance to allocate more slots.
I've yet to track down exactly what causes this issue, so these are only speculations and hunches based on observations in my own (much smaller) cluster.
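For reference, here is a minimal PySpark sketch of both suggestions; the input path and the target partition count are placeholders, and the right values depend on your job:
from pyspark.sql import SparkSession

# Lower the locality wait so free executors are given tasks instead of idling
# while the scheduler waits (default 3s per locality level) for a data-local
# slot to open up. "0s" disables the wait entirely.
spark = (SparkSession.builder
         .config("spark.locality.wait", "1s")
         .getOrCreate())

# Coalesce to fewer, larger partitions so each task runs long enough to
# amortize the scheduling overhead. 5000 is a placeholder partition count.
df = spark.read.text("hdfs:///path/to/input")
df = df.coalesce(5000)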

Related

Ray spilling objects seem to be accumulating

I am using Ray to parallelize some computations, but it seems to be accumulating spillage.
I don't mind it spilling objects to my hard drive, but I do mind if it means using 130+ GiB to process about 1.6 GiB of simulations.
Below is a trace of what is happening:
Number of steps: 55 (9,091 simulations each)
0%
(raylet) Spilled 3702 MiB, 12 objects, write throughput 661 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 5542 MiB, 17 objects, write throughput 737 MiB/s.
2%
(raylet) Spilled 9883 MiB, 33 objects, write throughput 849 MiB/s.
5%
(raylet) Spilled 16704 MiB, 58 objects, write throughput 997 MiB/s.
13%
(raylet) Spilled 32903 MiB, 124 objects, write throughput 784 MiB/s.
29%
(raylet) Spilled 66027 MiB, 268 objects, write throughput 661 MiB/s.
53%
(raylet) Spilled 131920 MiB, 524 objects, write throughput 461 MiB/s.
60%
And here is the code I am running:
def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    results = ray.get([get_res_ray.remote(simulations=simulations)
                       for simulations in simulations_per_load])
    return np.vstack(results)

MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        results = get_res_parallel(simulations=step_simulations)
        expected_result += results.mean(axis=0)
    print(f"\r100%")
    return expected_result / num_steps
Running on a Mac M1 with 16 GiB of RAM, Ray 2.0.0 and Python 3.9.13.
Question
Given my code, is this normal behavior?
What can I do to resolve this problem? Force garbage collection?
Do you know the expected size of the array returned by get_res_ray?
Ray will spill objects returned by remote tasks as well as objects passed to remote tasks, so in this case there are two possible places that can cause memory pressure:
The ObjectRefs returned by get_res_ray.remote
The simulations passed to get_res_ray.remote. Since these are large, Ray will automatically put these in the local object store to reduce the size of the task definition.
Spilling may be expected if the combined size of these objects is greater than 30% of the RAM on your machine (this is the default size of Ray's object store). Increasing the object store size is not suggested, since that can cause memory pressure on the functions instead.
You can try to process fewer things in each iteration and/or release ObjectRefs sooner. In particular, try to release the results from the previous iteration as soon as possible, so that Ray can GC the objects for you. You can do this by calling del results once you're done using them.
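For example, a minimal sketch of the loop from the question with the reference dropped each iteration (illustrative only):
for n in range(num_steps):
    step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
    results = get_res_parallel(simulations=step_simulations)
    expected_result += results.mean(axis=0)
    # Drop the reference as soon as the mean has been accumulated, so Ray can
    # garbage-collect the underlying objects before the next iteration.
    del results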
Here's a full suggestion that does the same thing by feeding the array results into another task instead of getting them on the driver. This is usually a better approach because it avoids adding memory pressure on the driver, and you're less likely to accidentally pin results in the driver's memory.
@ray.remote
def mean(*arrays):
    return np.vstack(arrays).mean(axis=0)

def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    # Use the * syntax in Python to unpack the ObjectRefs as function arguments.
    result = mean.remote(*[get_res_ray.remote(simulations=simulations)
                           for simulations in simulations_per_load])
    # We never have the result arrays stored in driver's memory.
    return ray.get(result)

MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        expected_result += get_res_parallel(simulations=step_simulations)
    print(f"\r100%")
    return expected_result / num_steps

Having Two QThreadPool in an Application

I have a ShortTask and a LongTask. I don't want threads running LongTask to affect the execution speed of threads running ShortTask. My PC has 8 threads; if I create 2 thread pools, each with a maximum of 4 threads, will those threads be isolated from each other automatically?
QThreadPool* shortPool = new QThreadPool(this);
shortPool->setMaxThreadCount(QThread::idealThreadCount() / 2);
QThreadPool* longPool = new QThreadPool(this);
longPool->setMaxThreadCount(QThread::idealThreadCount() / 2);

What does pcpu signify and why multiply by 1000?

I was reading about calculating the cpu usage of a process.
seconds = utime / Hertz
total_time = utime + stime
IF include_dead_children
total_time = total_time + cutime + cstime
ENDIF
seconds = uptime - starttime / Hertz
pcpu = (total_time * 1000 / Hertz) / seconds
print: "%CPU" pcpu / 10 "." pcpu % 10
What I don't get is this: by 'seconds' the algorithm seems to mean the time the computer spent doing operations other than, and before, the process of interest, since uptime is the time the computer has been operational and starttime is when our [interested] process started.
Then why are we dividing total_time by seconds [the time the computer spent doing something else] to get pcpu? It doesn't make sense.
The standard meanings of the variables:
# Name Description
14 utime CPU time spent in user code, measured in jiffies
15 stime CPU time spent in kernel code, measured in jiffies
16 cutime CPU time spent in user code, including time from children
17 cstime CPU time spent in kernel code, including time from children
22 starttime Time when the process started, measured in jiffies
/proc/uptime :The uptime of the system (seconds), and the amount of time spent in idle process (seconds).
Hertz :Number of clock ticks per second
Now that you've provided what each of the variables represents, here are some comments on the pseudo-code:
seconds = utime / Hertz
The above line is pointless, as the new value of seconds is never used before it's overwritten a few lines later.
total_time = utime + stime
Total running time (user + system) of the process, in jiffies, since both utime and stime are.
IF include_dead_children
total_time = total_time + cutime + cstime
ENDIF
This should probably just say total_time = cutime + cstime, since the definitions seem to indicate that, e.g. cutime already includes utime, plus the time spent by children in user mode. So, as written, this overstates the value by including the contribution from this process twice. Or, the definition is wrong... Regardless, the total_time is still in jiffies.
seconds = uptime - starttime / Hertz
uptime is already in seconds; starttime / Hertz converts starttime from jiffies to seconds, so seconds becomes essentially "the time in seconds since this process was started".
pcpu = (total_time * 1000 / Hertz) / seconds
total_time is still in jiffies, so total_time / Hertz converts that to seconds, which is the number of CPU seconds consumed by the process. That divided by seconds would give the scaled CPU-usage percentage since process start if it were a floating point operation. Since it isn't, it's scaled by 1000 to give a resolution of 1/10%. The scaling is forced to be done early by the use of parentheses, to preserve accuracy.
print: "%CPU" pcpu / 10 "." pcpu % 10
And this undoes the scaling by finding the quotient and the remainder when dividing pcpu by 10, and printing those values in a format that looks like a floating-point value.
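To make the fixed-point arithmetic concrete, here is a small Python sketch with made-up numbers (Hertz = 100 is typical on Linux; the other values are purely illustrative):
Hertz = 100                        # clock ticks (jiffies) per second
utime, stime = 1500, 500           # jiffies spent in user and kernel code
starttime = 200_000                # jiffies after boot when the process started
uptime = 3_000                     # seconds the system has been up

total_time = utime + stime                      # 2000 jiffies of CPU used
seconds = uptime - starttime // Hertz           # 3000 - 2000 = 1000 s since process start
pcpu = (total_time * 1000 // Hertz) // seconds  # 20000 // 1000 = 20, i.e. 2.0%
print(f"%CPU {pcpu // 10}.{pcpu % 10}")         # prints: %CPU 2.0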

How many Pages do a certain number of Bytes amount to?

Given a system that supports a certain page size of X KB (a power of 2), and a certain number of bytes Y (which may or may not be a multiple of X): is there a macro that will give me the "ceil" of the number of pages that Y bytes amount to?
Thanks,
vj
Not sure if there is such a macro, but you can easily write your own using the PAGE_SIZE and PAGE_SHIFT definitions from asm/page.h:
#define NUM_PAGES(y) (((y) + PAGE_SIZE - 1) >> PAGE_SHIFT)
or
#define NUM_PAGES(y) (((y) + PAGE_SIZE - 1) / PAGE_SIZE)
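As a quick sanity check of the rounding, a small Python sketch assuming 4 KiB pages:
PAGE_SIZE = 4096   # assumed 4 KiB pages
PAGE_SHIFT = 12    # log2(PAGE_SIZE)

def num_pages(y):
    # Same ceiling arithmetic as the macro above.
    return (y + PAGE_SIZE - 1) >> PAGE_SHIFT

print(num_pages(1), num_pages(4096), num_pages(4097), num_pages(10_000))  # 1 1 2 3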

Big performance loss with NodeJS loops on Amazon EC2 server

I am running an Amazon EC2 M1 General Purpose Small (1 core x 1 unit) 64-bit instance.
I've established that the Amazon instance is on average about half as fast as the computer I'm working on when running a single-threaded Node.js app:
var time = Date.now();
var fibo = function(n) {
    return n == 0 ? 0 :
           n == 1 ? 1 :
           fibo(n-1) + fibo(n-2);
};
console.log("fibo(40) = " + fibo(40));
time = Date.now() - time;
console.log("Took: " + time + " ms");
localhost:
fibo(40) = 102334155
Took: 1447 ms
Amazon EC2:
fibo(40) = 102334155
Took: 3148 ms
Now, in my bigger app, I iterate over a big 5 MB JSON object (5 MB being the formatted and indented file size; I assume this is smaller internally) using 6 nested for loops (and for the sake of argument, please assume with me that this is necessary), and I end up with about 9,000,000 iterations.
On localhost, this takes about 8 seconds. On Amazon EC2, this takes about 46 seconds.
I expected this to be 20 seconds at most.
What causes Amazon to lag so much more?
I know V8 is highly optimized.
Is JavaScript/V8/Node.js perhaps using optimizations that aren't compatible with virtual machines (like the EC2 instance)?
And similarly
Any special kind of code optimizations recommended for crazy stuff like this?
