Hi I am using have an rdd containing tuple of arrays, i.e. of type
RDD[(Array[Int], Array[Int])]
rdd = sc.parallelize(Array( (Array(1, 2, 3), Array(3,4, 5))
(Array(5, 6, 7), Array(4,5, 6))
....
))
and I am trying to do the following :
rdd.flatMap{ case (arr1, arr2) =>
(for(i <- arr1; j <- arr2) yield (i, j) )
}
And I noticed that as I increase the sizes of the arrays from 500 to 5000, the runtime increase from several minutes to about 10 minutes,
however, if I increase the sizes of the arrays from 5K to 6K, the runtime of this operations increase to several hours.
So I am wondering why I am getting such a big increase in runtime from 5K to 6K, while from 1k to 5k runtime increase smoothly?
I am suspecting that may be the memory limit of map task is reached, and disk operations are involved, resulting in the long runtime, but the sizes is not really big, since I allocated 14G memory and 8 cores to Spark in local mode.
Related
I have an RDD as follows:
rdd
.filter { case (_, record) => predicates.forall(_.accept(record)) }
.toDS()
.cache()
It basically filters down an RDD after applying a predicate.
The issue I have is this... Some of my data set RDDs are massive and predicates may be empty meaning that we attempt to cache an entire data set.
Instead what I'd like to do is always limit the size of the data set before I cache it.
I've tried placing a limit as follows:
dataSet
.filter { case (_, record) => predicates.forall(_.accept(record)) }
.limit(10000)
.toDS()
.cache()
but I get OOM errors. It looks to me like the partitions are being overloaded before the limit is applied.
Therefore I'm wondering if there is some way for the limit to be applied to the partitions. So effectively filtering would be paused once we reach the limit.
Scaling out further isn't an option as these data sets are too big
You should likely look into sampling the rdd. If you provide a consistent seed you will get a consistent result. You likely don't want "withReplace". This will run faster than using limit. Sample does work on the entire data but filters as it goes reducing the data set.
RDD.sample(withReplacement, fraction, seed=None)
Parameters:
withReplacement - bool can elements be sampled multiple times
(replaced when sampled out)
fraction - float expected size of the sample as a fraction of this RDD’s
size without replacement: probability that each element is chosen;
fraction must be [0, 1] with replacement: expected number of times
each element is chosen; fraction must be >= 0
seed - int, optional seed for the random number generation
Relevant code links (rdd.sample), (subclass that does actual work work.)
Documentation says this:
If maxsize is set to None, the LRU feature is disabled and the cache can grow without bound. The LRU feature performs best when maxsize is a power-of-two.
Would anyone happen to know where does this "power-of-two" come from? I am guessing it has something to do with the implementation.
Where the size effect arises
The lru_cache() code exercises its underlying dictionary in an atypical way. While maintaining total constant size, cache misses delete the oldest item and insert a new item.
The power-of-two suggestion is an artifact of how this delete-and-insert pattern interacts with the underlying dictionary implementation.
How dictionaries work
Table sizes are a power of two.
Deleted keys are replaced with dummy entries.
New keys can sometimes reuse the dummy slot, sometimes not.
Repeated delete-and-inserts with different keys will fill-up the table with dummy entries.
An O(N) resize operation runs when the table is two-thirds full.
Since the number of active entries remains constant, a resize operation doesn't actually change the table size.
The only effect of the resize is to clear the accumulated dummy entries.
Performance implications
A dict with 2**n entries has the most available space for dummy entries, so the O(n) resizes occur less often.
Also, sparse dictionaries have fewer hash collisions than mostly full dictionaries. Collisions degrade dictionary performance.
When it matters
The lru_cache() only updates the dictionary when there is a cache miss. Also, when there is a miss, the wrapped function is called. So, the effect of resizes would only matter if there are high proportion of misses and if the wrapped function is very cheap.
Far more important than giving the maxsize a power-of-two is using the largest reasonable maxsize. Bigger caches have more cache hits — that's where the big wins come from.
Simulation
Once an lru_cache() is full and the first resize has occurred, the dictionary settles into a steady state and will never get larger. Here, we simulate what happens next as new dummy entries are added and periodic resizes clear them away.
steady_state_dict_size = 2 ** 7 # always a power of two
def simulate_lru_cache(lru_maxsize, events=1_000_000):
'Count resize operations as dummy keys are added'
resize_point = steady_state_dict_size * 2 // 3
assert lru_maxsize < resize_point
dummies = 0
resizes = 0
for i in range(events):
dummies += 1
filled = lru_maxsize + dummies
if filled >= resize_point:
dummies = 0
resizes += 1
work = resizes * lru_maxsize # resizing is O(n)
work_per_event = work / events
print(lru_maxsize, '-->', resizes, work_per_event)
Here is an excerpt of the output:
for maxsize in range(42, 85):
simulate_lru_cache(maxsize)
42 --> 23255 0.97671
43 --> 23809 1.023787
44 --> 24390 1.07316
45 --> 25000 1.125
46 --> 25641 1.179486
...
80 --> 200000 16.0
81 --> 250000 20.25
82 --> 333333 27.333306
83 --> 500000 41.5
84 --> 1000000 84.0
What this shows is that the cache does significantly less work when maxsize is as far away as possible from the resize_point.
History
The effect was minimal in Python3.2, when dictionaries grew by 4 x active_entries when resizing.
The effect became catastrophic when the growth rate was lowered to 2 x active entries.
Later a compromise was reached, setting the growth rate to 3 x used. That significantly mitigated the issue by giving us a larger steady state size by default.
A power-of-two maxsize is still the optimum setting, giving the least work for a given steady state dictionary size, but it no longer matters as much as it did in Python3.2.
Hope this helps clear up your understanding. :-)
TL;DR - this is an optimization that doesn't have much effect at small lru_cache sizes, but (see Raymond's reply) has a larger effect as your lru_cache size gets bigger.
So this piqued my interest and I decided to see if this was actually true.
First I went and read the source for the LRU cache. The implementation for cpython is here: https://github.com/python/cpython/blob/master/Lib/functools.py#L723 and I didn't see anything that jumped out to me as something that would operate better based on powers of two.
So, I wrote a short python program to make LRU caches of various sizes and then exercise those caches several times. Here's the code:
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
def run_test(i):
# We create a new decorated perform_calc
#lru_cache(maxsize=i)
def perform_calc(input):
return input * 3.1415
# let's run the test 5 times (so that we exercise the caching)
for j in range(5):
# Calculate the value for a range larger than our largest cache
for k in range(2000):
perform_calc(k)
for t in range(10):
print (t)
values = defaultdict(list)
for i in range(1,1025):
start = time.perf_counter()
run_test(i)
t = time.perf_counter() - start
values[i].append(t)
for k,v in values.items():
print(f"{k}\t{mean(v)}")
I ran this on a macbook pro under light load with python 3.7.7.
Here's the results:
https://docs.google.com/spreadsheets/d/1LqZHbpEL_l704w-PjZvjJ7nzDI1lx8k39GRdm3YGS6c/preview?usp=sharing
The random spikes are probably due to GC pauses or system interrupts.
At this point I realized that my code always generated cache misses, and never cache hits. What happens if we run the same thing, but always hit the cache?
I replaced the inner loop with:
# let's run the test 5 times (so that we exercise the caching)
for j in range(5):
# Only ever create cache hits
for k in range(i):
perform_calc(k)
The data for this is in the same spreadsheet as above, second tab.
Let's see:
Hmm, but we don't really care about most of these numbers. Also, we're not doing the same amount of work for each test, so the timing doesn't seem useful.
What if we run it for just 2^n 2^n + 1, and 2^n - 1. Since this speeds things up, we'll average it out over 100 tests, instead of just 10.
We'll also generate a large random list to run on, since that way we'll expect to have some cache hits and cache misses.
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
import random
rands = list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128))
random.shuffle(rands)
def run_test(i):
# We create a new decorated perform_calc
#lru_cache(maxsize=i)
def perform_calc(input):
return input * 3.1415
# let's run the test 5 times (so that we exercise the caching)
for j in range(5):
for k in rands:
perform_calc(k)
for t in range(100):
print (t)
values = defaultdict(list)
# Interesting numbers, and how many random elements to generate
for i in [15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, 256, 257, 511, 512, 513, 1023, 1024, 1025]:
start = time.perf_counter()
run_test(i)
t = time.perf_counter() - start
values[i].append(t)
for k,v in values.items():
print(f"{k}\t{mean(v)}")
Data for this is in the third tab of the spreadsheet above.
Here's a graph of the average time per element / lru cache size:
Time, of course, decreases as our cache size gets larger since we don't spend as much time performing calculations. The interesting thing is that there does seem to be a dip from 15 to 16, 17 and 31 to 32, 33. Let's zoom in on the higher numbers:
Not only do we lose that pattern in the higher numbers, but we actually see that performance decreases for some of the powers of two (511 to 512, 513).
Edit: The note about power-of-two was added in 2012, but the algorithm for functools.lru_cache looks the same at that commit, so unfortunately that disproves my theory that the algorithm has changed and the docs are out of date.
Edit: Removed my hypotheses. The original author replied above - the problem with my code is that I was working with "small" caches - meaning that the O(n) resize on the dicts was not very expensive. It would be cool to experiment with very large lru_caches and lots of cache misses to see if we can get the effect to appear.
Input file contains 20 lines. I am trying to count total number of records using reduce function. Can anyone please explain me why there is difference in the results? Because here value of y is nothing but only 1.
Default number of partitions : 4
scala> rdd = sc.textFile("D:\LearningPythonTomaszDenny\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8
scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20
Because here value of y is nothing but only 1.
That is simply not true. reduce consist of three stages (not in a strict Spark meaning of the word):
Distributed reduce on each partition.
Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
Local driver reduction.
In your case the results of the first and second stage will be the same, but the first approach will simply ignore the partial results. In other words, no matter what was the result for the partition, it will always add only 1.
Such approach would work only with non-parallel, non-sequential reduce implementations.
I use Spark 2.1 in local mode and I'm running this simple application.
val N = 10 << 20
sparkSession.conf.set("spark.sql.shuffle.partitions", "5")
sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", (N + 1).toString)
sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")
val df1 = sparkSession.range(N).selectExpr(s"id as k1")
val df2 = sparkSession.range(N / 5).selectExpr(s"id * 3 as k2")
df1.join(df2, col("k1") === col("k2")).count()
Here, the range(N) creates a dataset of Long (with unique values), so I assume that the size of
df1 = N * 8 bytes ~ 80MB
df2 = N / 5 * 8 bytes ~ 16MB
Ok now let's take df1 as an example.
df1 consists of 8 partitions and shuffledRDDs of 5, so I assume that
# of mappers (M) = 8
# of reducers (R) = 5
As the # of partitions is low, Spark will use the Hash Shuffle which will create M * R files in the disk but I haven't understood if every file has all the data, thus each_file_size = data_size resulting to M * R * data_size files or all_files = data_size.
However when executing this app, shuffle write of df1 = 160MB which doesn't match either of the above cases.
Spark UI
What am I missing here? Why has the shuffle write data doubled in size?
First of all, let's see what data size total(min, med, max) means:
According to SQLMetrics.scala#L88 and ShuffleExchange.scala#L43, the data size total(min, med, max) we see is the final value of dataSize metric of shuffle. Then, how is it updated? It get updated each time a record is serialized: UnsafeRowSerializer.scala#L66 by dataSize.add(row.getSizeInBytes) (UnsafeRow is the internal representation of records in Spark SQL).
Internally, UnsafeRow is backed by a byte[], and is copied directly to the underlying output stream during serialization, its getSizeInBytes() method just return the length of the byte[]. Therefore, the initial question is transformed to: Why the bytes representation is twice big as the only long column a record have? This UnsafeRow.scala doc gives us the answer:
Each tuple has three parts: [null bit set] [values] [variable length portion]
The bit set is used for null tracking and is aligned to 8-byte word boundaries. It stores one bit per field.
since it's 8-byte word aligned, the only 1 null bit is taking another 8 byte, the same width as the long column. Therefore, each UnsafeRow represents your one-long-column-row using 16 bytes.
I'm looking for a trick to force Spark to perform reduction operation locally between all the tasks executed by the worker cores before making it for all tasks.
Indeed, it seems my driver node and the network bandwitch are overload because of big task results (=400MB).
val arg0 = sc.broadcast(fs.read(0, 4))
val arg1 = sc.broadcast(fs.read(1, 4))
val arg2 = fs.read(5, 4)
val index = info.sc.parallelize(0.toLong to 10000-1 by 1)
val mapres = index.map{ x => function(arg0.value, arg1.value, x, arg2) }
val output = mapres.reduce(Util.bitor)
The driver distributes 1 partition by processor core so 8 partitions by worker.
There is nothing to force because reduce applies reduction locally for each partition. Only the final merge is applied on the driver. Not to mention 400MB shouldn't be a problem in any sensible configuration.
Still if you want to perform more work on the workers you can use treeReduce although with 8 partitions there is almost nothing to gain.