Scala parallel collections workload balancing strategies - multithreading

I've been toying around with the Scala parallel collections and I was wondering if there was a way to easily define what workload balancing strategy to use.
For instance, let's say we're calculating how many prime numbers we have between 1 and K = 500 000:
def isPrime(k: Int) = (2 to k/2).forall(k % _ != 0)
Array.range(1, 500*1000).par.filter(isPrime).length
If all .par is doing is dividing the data to be processed into contiguous blocks, then there's not much advantage in parallelizing this algorithm, as the last blocks would dominate the total running time anyway.
On the other hand, running this algorithm such that each thread gets an evenly distributed share of the work would solve the issue (by having each of the N threads start at index x ∈ (0 .. N-1) and then work only on the elements at x + kN).
I would like to avoid having to write such boilerplate code. Is there some parameter that would allow me to easily tell Scala's library how to do this?
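For concreteness, a minimal sketch of the strided scheme described above (the thread count and names are illustrative; it reuses the isPrime from the question):

// Strided decomposition: worker x of n handles indices x, x + n, x + 2n, ...
// so cheap and expensive elements are spread evenly across workers.
val data = Array.range(1, 500 * 1000)
val n = Runtime.getRuntime.availableProcessors
val total = (0 until n).par
  .map(x => Iterator.range(x, data.length, n).count(i => isPrime(data(i))))
  .sum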

Related

How to reduce white space in the task stream?

I have obtained the task stream using distributed computing in Dask for different numbers of workers. I can observe that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computations per worker), I see the same trend. Can anyone suggest how to reduce the white space?
PS: I need to extend the computation to 1000s of workers, so reducing the number of workers is not an option for me.
[Task stream screenshots for 16, 32 and 64 workers]
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can have many causes. I'll list a few below:
Very short tasks (sub millisecond)
Algorithms that are not very parallelizable
Objects in the task graph that are expensive to serialize
...
Looking at your images I don't think that any of these apply to you.
Instead, I see gaps of inactivity alternating with bursts of activity. My guess is that this is caused by some code that you are running locally, and that your code looks something like the following:
for i in ...:
    results = dask.compute(...)  # do some Dask work
    next_inputs = ...            # do some local work
So you're being blocked by local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or it might be the fault of your own code (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html

Faster way to count values greater than 0 in Spark DataFrame?

I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall-clock time. With Python Pandas, it takes a fraction of a second. I am aware that for small data sets and local operation Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling a job are expensive on their own and add significant overhead depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless the data is fully cached with a significant memory safety margin that ensures the cached data will not be evicted.
This means that in the worst-case scenario the nested-loop-like structure you use can be roughly quadratic in the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
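If what you ultimately need is a single grand total of cells rather than per-column counts, you can sum the one-row result on the driver. A minimal sketch building on the query above (only the question's df is assumed):

import org.apache.spark.sql.functions.{col, count, when}

// One scan: per-column counts of cells > 0, collected as a single row,
// then summed locally on the driver.
val row = df.select(df.columns.map(c => count(when(col(c) > 0, 1)).as(c)): _*).first()
val totalCells = (0 until row.length).map(row.getLong).sum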
The problem with your approach is that the file is scanned for every column (unless you have cached it in memory). The fastest way, with a single FileScan, should be:
import org.apache.spark.sql.functions.{explode, array}
import spark.implicits._  // for the $"..." column syntax (spark is the SparkSession)

val cnt: Long = df
  .select(
    explode(
      array(df.columns.head, df.columns.tail: _*)
    ).as("cell")
  )
  .where($"cell" > 0).count
Still, I think it will be slower than Pandas, as Spark has a certain overhead due to its parallelization engine.

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour.
After the (inner) join the dataset is a bit skewed; however, every time, one or more tasks take an inordinate amount of time to complete.
As you can see, the median is 6 ms per task (I'm running it on a smaller source dataset to test), but one task takes 10 minutes. It's hardly using any CPU cycles; it does actually join data, but so, so slowly.
The next slowest task runs in 14s, has 4x more records & actually spills to disk.
The join itself is an inner join between the two datasets on pos & hashValue (the MinHash), in accordance with the MinHash specification, plus a UDF to calculate the Jaccard distance between match pairs.
Explode the hashtables:
modelDataset.select(
  struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
Jaccard distance function:
override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
  val xSet = x.toSparse.indices.toSet
  val ySet = y.toSparse.indices.toSet
  val intersectionSize = xSet.intersect(ySet).size.toDouble
  val unionSize = xSet.size + ySet.size - intersectionSize
  assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
  1 - intersectionSize / unionSize
}
Join of the processed datasets:
// Do a hash join on where the exploded hash values are equal.
val joinedDataset = explodedA.join(explodedB, explodeCols)
  .drop(explodeCols: _*).distinct()

// Add a new column to store the distance of the two rows.
val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
val joinedDatasetWithDist = joinedDataset.select(col("*"),
  distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
)

// Filter the joined datasets where the distance are smaller than the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold)
I've tried combinations of caching, repartitioning and even enabling spark.speculation, all to no avail.
The data consists of shingled address text that has to be matched:
53536, Evansville, WI => 53, 35, 36, ev, va, an, ns, vi, il, ll, le, wi
will have a short distance to records where there is a typo in the city or zip code.
This gives pretty accurate results, but may also be the cause of the join skew.
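For reference, a rough sketch of one way to produce character shingles like those above (the exact scheme used in the question may differ slightly):

// Lower-case, split on non-alphanumerics, and emit overlapping two-character
// shingles per token (tokens shorter than two characters are kept as-is).
def shingle(address: String): Seq[String] =
  address.toLowerCase
    .split("[^a-z0-9]+")
    .filter(_.nonEmpty)
    .toSeq
    .flatMap(t => if (t.length < 2) Seq(t) else t.sliding(2).toSeq)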
My questions are:
What may cause this discrepancy? (One task taking very, very long, even though it has fewer records.)
How can I prevent this skew in MinHash without losing accuracy?
Is there a better way to do this at scale? (I can't do a Jaro-Winkler / Levenshtein comparison of millions of records against all records in the location dataset.)
It might be a bit late, but I will post my answer here anyway to help others out. I recently had similar issues with matching misspelled company names (All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster). Someone helped me out by suggesting I take NGrams to reduce the data skew. It helped me a lot. You could also try using, e.g., 3-grams or 4-grams.
I don't know how dirty the data is, but you could try to make use of states. That already reduces the number of possible matches substantially.
What really helped me improve the accuracy of the matches is to postprocess the connected components (groups of connected matches made by MinHashLSH) by running a label propagation algorithm on each component. This also allows you to increase the N of the NGrams (mitigating the skewed-data problem) and to set the Jaccard distance parameter in approxSimilarityJoin less tightly, with label propagation cleaning up the results afterwards.
Finally, I am currently looking into using skipgrams to match it. I found that in some cases it works better and reduces the data skew somewhat.
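As a rough illustration of the NGram suggestion above, Spark ML ships an NGram transformer. A minimal sketch, assuming a DataFrame addressDf with a column "chars" that already holds each address split into character tokens (both names are made up for this example):

import org.apache.spark.ml.feature.NGram

// Turn character tokens into 3-grams (try setN(3) or setN(4), as suggested above).
val ngram = new NGram()
  .setN(3)
  .setInputCol("chars")      // assumed: array<string> of characters per address
  .setOutputCol("ngrams")

val withNGrams = ngram.transform(addressDf)
// Vectorize the "ngrams" column (e.g. with HashingTF or CountVectorizer) before
// feeding it to MinHashLSH, just as was presumably done for the 2-character shingles.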

Spark short-circuiting, sorting, and lazy maps

I'm working on an optimization problem that involves minimizing an expensive map operation over a collection of objects.
The naive solution would be something like
rdd.map(expensive).min()
However, the map function returns values that are guaranteed to be >= 0. So, if any single result is 0, I can take that as the answer and do not need to compute the rest of the map operations.
Is there an idiomatic way to do this using Spark?
Is there an idiomatic way to do this using Spark?
No. If you're concerned with low level optimizations like this one, then Spark is not the best option. It doesn't mean it is completely impossible.
You can, for example, try something like this:
rdd.cache()
(min_value, ) = rdd.filter(lambda x: x == 0).take(1) or [rdd.min()]
rdd.unpersist()
Or short-circuit within partitions:
def min_part(xs):
    min_ = None
    for x in xs:
        min_ = min(x, min_) if min_ is not None else x
        if x == 0:
            return [0]   # found a zero: stop scanning this partition
    return [min_] if min_ is not None else []

rdd.mapPartitions(min_part).min()
Both will usually execute more than strictly required, each with a slightly different performance profile, but they can skip evaluating some records. With rare zeros the first one might be better.
You can even listen to accumulator updates and use sc.cancelJobGroup once a 0 is seen. Here is one example of a similar approach: Is there a way to stream results to driver without waiting for all partitions to complete execution?
If "expensive" is really expensive, maybe you can write the result of "expensive" to, say, SQL (Or any other storage available to all the workers).
Then in the beginning of "expensive" check the number currently stored, if it is zero return zero from "expensive" without performing the expensive part.
You can also do this localy for each worker which will save you a lot of time but won't be as "global".
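A minimal Scala sketch of that per-worker variant (the expensive function and element type are assumed; a module-level flag plays the same role in PySpark):

// JVM-local flag: each executor keeps its own copy, so the short circuit is
// "local" to a worker rather than global.
object ZeroSeen {
  @volatile var found = false
}

def guarded(x: Int): Double =
  if (ZeroSeen.found) 0.0
  else {
    val r = expensive(x)            // assumed user-supplied function, always >= 0
    if (r == 0.0) ZeroSeen.found = true
    r
  }

rdd.map(guarded).min()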

How to use Spark to distribute a processing load?

All, I'll need to distribute some computation (for now it is only academic), and I was planning on using Spark to do so.
I'm now conducting some tests, and they go like this:
I have a large file with variables; I sum them, line by line, and then output the result. I've made a non-Spark version, as below:
import csv

def linesum(inputline):
    m = 0
    for i in inputline:
        m = m + i
    return m

with open('numbers.txt', 'r') as f:
    reader = csv.reader(f, delimiter=';')
    testdata = [list(map(float, rec)) for rec in reader]

testdata_out = list()
print('input : ' + str(testdata))
for i in testdata:
    testdata_out.append(linesum(i))
testdata = testdata_out[:]
print('output : ' + str(testdata_out))
print(len(testdata))
print('OK')
and ran it on a 600k-line text file. I then made a local Spark installation and ran the following code:
import os
import csv
from pyspark import SparkConf, SparkContext

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = r'C:\spark\spark-2.0.1-bin-hadoop2.7'

conf = SparkConf().setAppName('file_read_sum').setMaster('local[4]')
sc = SparkContext(conf=conf)

from pyspark.sql import SparkSession

def linesum(inputline):
    m = 0
    for i in inputline:
        m = m + i
    return m

with open('numbers.txt', 'r') as f:
    reader = csv.reader(f, delimiter=';')
    testdata = [list(map(float, rec)) for rec in reader]

print('input : ' + str(testdata))
print(len(testdata))
testdata_rdd = sc.parallelize(testdata, numSlices=len(testdata) // 10000)
testdata_out = testdata_rdd.map(linesum).collect()
testdata = testdata_out[:]
print('output : ' + str(testdata_out))
print(len(testdata_out))
print('OK')
The results match, but the first (without Spark) is much faster than the second. I've also made a distributed Spark installation across 4 VMs and, as expected, the result is even worse.
I do understand that there is some overhead, especially when using the VMs. The questions are:
1) Is my reasoning sound? Is Spark an appropriate tool to distribute this kind of job? (For now I am only summing the lines, but the lines can be VERY large and the operations may be much more complex; think genetic programming fitness evaluation here.)
2) Is my code appropriate for distributing calculations?
3) How can I improve the speed of this?
A1) No. Spark may be a great tool for other tasks, but not for GP.
The core idea behind the power that GP approaches have opened up is the zero-indoctrination of the process. It is evolutionary; it is the process' self-developed diversity of candidates (each population member is a candidate solution with a (very) different fitness, its "bestness of fit"). Most of the processing power therefore goes, in principle, into increasing the potential to evolve the maximum width of the evolutionary search, where genetic operations mediate self-actualisation (via cross-over, mutation and changes of architecture) together with self-reproduction. Spark is a fit for the very opposite: a rigid, scripted workflow with zero space for any evolutionary behaviour.
The richer the diversity of population members the evolutionary generator is able to scan, the better. So let the diversity grow wider and forget about tools for rigid & repetitive RDD calculus (where an RDD is "the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel". Notice the word immutable.).
Nota Bene: using VMs for testing a (potential) speedup of parallel (well, in practice not [PARALLEL] but "just"-(possibly-highly)-[CONCURRENT] scheduling) processing performance is an exceptionally bad idea. Why? It consumes extra overhead on shared resources (in the case of a merely container-based deployment), plus additional overhead in the hypervisor service planes; next, it absolutely devastates all temporal cache locality inside the VM-abstracted vCPU/vCore L1/L2/L3 caches, all of it criss-cross-chopped by the external O/S fighting for its few CPU-CLK ticks on the external process scheduler. So the idea is indeed a bad, very bad anti-pattern, one that may get some super-dogmatic advertisement from cloud protagonists (hard-core, technically unadjusted PR clichés + heavy bell$ & whistle$), but it shows negative performance gains if rigorously validated against raw-silicon execution.
A2) + A3) The distributed nature of evolutionary systems depends a lot on the nature of the processing (the job).
Given that we are talking about GP here, distributed execution may best help in generating the increased width of evolution-accelerated diversity, not in naive code execution.
Very beneficial in GP is the global self-robustness of the evolutionary concept: many uncoordinated (async) and very independent processing nodes are typically much more powerful (in terms of the global TFLOPS levels achieved), and practical reports even show that failed nodes, even in large numbers (if not small tens of percent (!!)), still do not devastate the quality finally achieved in the last epochs of the global search across the self-evolving populations. That is a point! You will indeed love GP if you can harness these few core principles into a lightweight async herd of distributed-computing nodes correctly & just enough for HPC-ruled GP/GA searches!
The Best Next step:
To get some first-hand experience, read John R. Koza's remarks on his GP distributed-processing concepts, where +99% of the processing is actually used on the problem itself (and where CPU-bound processing deserves the maximum possible acceleration, surprisingly not by re-distribution, precisely because one is not willing to lose a single item's locality). I am almost sure that if you are serious about GP/GA, you will both like it and benefit from his pioneering work.
