How to use Spark to distribute a processing load? - apache-spark

All, I will need to distribute some computing (for now it is only academic), and I was planning on using Spark to do so.
I'm now conducting some tests, and they go like this:
I take a large file of numeric values, sum each line's values, and then output the results. I've made a non-Spark version, as below:
import csv

def linesum(inputline):
    # sum all values on one line
    m = 0
    for i in inputline:
        m = m + i
    return m

with open('numbers.txt', 'r') as f:
    reader = csv.reader(f, delimiter=';')
    testdata = [list(map(float, rec)) for rec in reader]

testdata_out = list()
print('input : ' + str(testdata))
for i in testdata:
    testdata_out.append(linesum(i))
testdata = testdata_out[:]
print('output : ' + str(testdata_out))
print(len(testdata))
print('OK')
and ran it on a 600k-line text file. Then I made a local Spark installation and ran the following code:
import os
import csv
from pyspark import SparkConf, SparkContext

if 'SPARK_HOME' not in os.environ:
    # raw string, so the backslashes are not treated as escapes
    os.environ['SPARK_HOME'] = r'C:\spark\spark-2.0.1-bin-hadoop2.7'

conf = SparkConf().setAppName('file_read_sum').setMaster('local[4]')
sc = SparkContext(conf=conf)

def linesum(inputline):
    # sum all values on one line
    m = 0
    for i in inputline:
        m = m + i
    return m

with open('numbers.txt', 'r') as f:
    reader = csv.reader(f, delimiter=';')
    testdata = [list(map(float, rec)) for rec in reader]

print('input : ' + str(testdata))
print(len(testdata))

# numSlices must be an integer, hence the floor division
testdata_rdd = sc.parallelize(testdata, numSlices=len(testdata) // 10000)
testdata_out = testdata_rdd.map(linesum).collect()
testdata = testdata_out[:]
print('output : ' + str(testdata_out))
print(len(testdata_out))
print('OK')
The results match, but the first version (without Spark) is much faster than the second. I've also made a distributed Spark installation across 4 VMs and, as expected, the result is even worse.
I do understand that there is some overhead, especially when using the VMs. The questions are:
1) - Is my reasoning sound? Is Spark an appropriate tool to distribute this kind of job? (for now I am only summing the lines, but the lines can be VERY large and the operations may be much, much more complex -- think genetic-programming fitness evaluation here)
2) - Is my code appropriate for distributing calculations?
3) - How can I improve the speed of this?
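(For reference only, and touching question 3: a minimal sketch of letting Spark itself read and split the file, instead of parsing it on the driver and then calling parallelize. This is illustrative, not a benchmarked answer, and assumes the same semicolon-delimited numbers.txt.)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('file_read_sum_v2').setMaster('local[4]')
sc = SparkContext(conf=conf)

# let Spark read and partition the file; the driver never materialises the data
lines_rdd = sc.textFile('numbers.txt')
line_sums = lines_rdd.map(lambda line: sum(float(x) for x in line.split(';'))).collect()

print(len(line_sums))
print('OK')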

A1) No, Spark may be a great tool for other tasks, but not for GP
The core idea behind the power that GP approaches have opened up is zero indoctrination of the process. It is evolutionary; it is the process's self-developed diversity of candidates (each population member is a candidate solution, each having a (very) different fitness, its "bestness-of-fit"). So most of the processing power goes, as a matter of principle, into increasing the potential to evolve the maximum width of the evolutionary search, where genetic operations mediate self-actualisation (via cross-over, mutation and changes of architecture) together with self-reproduction. Spark is a fit for the very opposite: a rigid, scripted workflow with zero space for any evolutionary behaviour.
The richer the diversity of population members the evolutionary generator is able to scan, the better. So let the diversity grow wider and forget about tools for rigid & repetitive RDD calculus (where an RDD is "the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel" -- notice the word immutable).
Nota bene: using VMs for testing a (potential) speedup of parallel (well, in practice not [PARALLEL] but "just"-(might-be-highly)-[CONCURRENT] scheduling) processing performance is an exceptionally bad idea. Why? It consumes more overhead on shared resources (in the case of a container-based deployment), plus additional overhead in the hypervisor service planes; next, it absolutely devastates all temporal cache locality inside the VM-abstracted vCPU/vCore L1/L2/L3 caches, all of it criss-cross-chopped by the external O/S fighting for its few CPU-CLK ticks on the external process scheduler. So the idea is indeed a bad, very bad anti-pattern, one that may get some super-dogmatic advertisement from cloud protagonists (hard-core, technically unadjusted PR clichés plus heavy bell$ & whistle$), but it shows negative performance gains if rigorously validated against raw-silicon execution.
A2) + A3) The distributed nature of evolutionary systems depends a lot on the nature of the processing ( The Job )
Given that we are talking about GP here, distributed execution can best help in generating the increased width of the evolution-accelerated diversity, not in naive code execution.
Very beneficial in GP is the global self-robustness of the evolutionary concept: many uncoordinated (async) and very independent processing nodes are typically much more powerful (in terms of the global TFLOPs levels achieved), and practical reports even show that failed nodes, even in large numbers (if not small tens of percent (!!)), still do not devastate the finally achieved quality of the last epochs of the global search across the self-evolving populations. That is the point! You will indeed love GP if you can harness these few core principles into a light-weight async herd of distributed-computing nodes correctly & just enough for the HPC-ruled GP/GA searches!
The Best Next step:
To get some first-hand experience, read John R. KOZA's remarks on his GP distributed-processing concepts, where +99% of the problem is actually used (and where CPU-bound processing deserves the maximum possible acceleration, surprisingly not by re-distribution, right because of not being willing to lose a single item's locality). I am almost sure that if you are serious about GP/GA, you will both like it and benefit from his pioneering work.

Related

how to benchmark pyspark queries?

I have got a simple pyspark script and I would like to benchmark each section.
# section 1: prepare data
df = spark.read.option(...).csv(...)
df.registerTempTable("MyData")
# section 2: Dataframe API
avg_earnings = df.agg({"earnings": "avg"}).show()
# section 3: SQL
avg_earnings = spark.sql("""SELECT AVG(earnings)
                            FROM MyData""").show()
To generate reliable measurements, one would need to run each section multiple times. My solution, using the Python time module, looks like this:
import time

for _ in range(iterations):
    t1 = time.time()
    df = spark.read.option(...).csv(...)
    df.registerTempTable("MyData")
    t2 = time.time()
    avg_earnings = df.agg({"earnings": "avg"}).show()
    t3 = time.time()
    avg_earnings = spark.sql("""SELECT AVG(earnings)
                                FROM MyData""").show()
    t4 = time.time()
    write_to_csv(t1, t2, t3, t4)
My question is: how would one benchmark each section? Would you use the time module as well? How would one disable caching for PySpark?
Edit:
Plotting the first 5 iterations of the benchmark shows that PySpark is doing some form of caching.
How can I disable this behaviour?
First, you can't benchmark using show(); it only computes and returns the top 20 rows.
Second, in general, the PySpark DataFrame API and Spark SQL share the same Catalyst optimizer behind the scenes, so overall what you are doing (.agg vs AVG()) is pretty much the same and doesn't make much of a difference.
Third, usually, benchmarking is only meaningful if your data is really big, or your operation is much longer than expected. Other than that, if the runtime difference is only a couple of minutes, it doesn't really matter.
Anyway, to answer your question:
Yes, there is nothing wrong with using time.time() to measure.
You should use count() instead of show(). count() will go ahead and compute over your entire dataset.
You don't have to worry about cache if you don't call it. Spark won't cache unless you ask for it. In fact, you shouldn't cache at all when benchmarking.
You should also use static allocation instead of dynamic allocation. And if you're using Databricks or EMR, use a fixed number of workers and don't auto-scale.
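Putting those points together, a hedged sketch of the adjusted loop (it assumes an existing spark session and a hypothetical CSV path, keeps the question's iterations and write_to_csv placeholders, and swaps show() for count() to force full evaluation):

import time

for _ in range(iterations):
    t1 = time.time()
    df = spark.read.csv("data/earnings.csv", header=True, inferSchema=True)  # hypothetical path
    df.createOrReplaceTempView("MyData")   # modern replacement for registerTempTable
    t2 = time.time()
    df.agg({"earnings": "avg"}).count()    # count() materialises the aggregation
    t3 = time.time()
    spark.sql("SELECT AVG(earnings) FROM MyData").count()
    t4 = time.time()
    write_to_csv(t1, t2, t3, t4)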

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, and I'm seeing some strange behaviour.
After the (inner) join the dataset is a bit skewed; however, every time, one or more tasks take an inordinate amount of time to complete.
As you can see, the median is 6 ms per task (I'm running it on a smaller source dataset to test), but 1 task takes 10 min. It's hardly using any CPU cycles; it does actually join data, but so, so slowly.
The next slowest task runs in 14 s, has 4x more records and actually spills to disk.
The join itself is an inner join between the two datasets on pos & hashValue (minhash), in accordance with the minhash specification, plus a udf to calculate the Jaccard distance between match pairs.
Explode the hashtables:
modelDataset.select(
  struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
Jaccard distance function:
override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
  val xSet = x.toSparse.indices.toSet
  val ySet = y.toSparse.indices.toSet
  val intersectionSize = xSet.intersect(ySet).size.toDouble
  val unionSize = xSet.size + ySet.size - intersectionSize
  assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
  1 - intersectionSize / unionSize
}
Join of the processed datasets:
// Do a hash join on where the exploded hash values are equal.
val joinedDataset = explodedA.join(explodedB, explodeCols)
  .drop(explodeCols: _*).distinct()

// Add a new column to store the distance of the two rows.
val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
val joinedDatasetWithDist = joinedDataset.select(col("*"),
  distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
)

// Filter the joined datasets where the distance are smaller than the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold)
I've tried combinations of caching, repartitioning and even enabling spark.speculation, all to no avail.
The data consists of shingled address text that has to be matched:
53536, Evansville, WI => 53, 35, 36, ev, va, an, ns, vi, il, ll, le, wi
will have a short distance to records where there is a typo in the city or zip code.
This gives pretty accurate results, but may also be the cause of the join skew.
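(For reference, a tiny Python sketch of producing such 2-character shingles; this is illustrative only, not the question's actual preprocessing code.)

def shingles(text, n=2):
    # overlapping character n-grams per token, roughly as in the example above
    out = []
    for token in text.replace(',', ' ').lower().split():
        if len(token) <= n:
            out.append(token)
        else:
            out.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return out

print(shingles("53536, Evansville, WI"))
# ['53', '35', '53', '36', 'ev', 'va', 'an', 'ns', 'sv', 'vi', 'il', 'll', 'le', 'wi']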
My question is:
What may cause this discrepancy? (One task taking very, very long, even though it has fewer records.)
How can I prevent this skew in minhash without losing accuracy?
Is there a better way to do this at scale? (I can't do a Jaro-Winkler / Levenshtein comparison of millions of records against all records in the location dataset.)
It might be a bit late, but I will post my answer here anyway to help others out. I recently had similar issues with matching misspelled company names (All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster). Someone helped me out by suggesting taking NGrams to reduce the data skew, and it helped a lot. You could also try using e.g. 3-grams or 4-grams.
I don't know how dirty the data is, but you could try to make use of states. That already reduces the number of possible matches substantially.
What really helped me improve the accuracy of the matches is postprocessing the connected components (groups of connected matches made by the MinHashLSH) by running a label propagation algorithm on each component. This also allows you to increase N (of the NGrams), thereby mitigating the problem of skewed data, and to set the Jaccard distance parameter in approxSimilarityJoin less tightly, since the label propagation cleans up the matches afterwards.
Finally, I am currently looking into using skipgrams for the matching. I found that in some cases it works better and reduces the data skew somewhat.
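A hedged PySpark sketch of the NGram idea (the column name address, the threshold, and the numFeatures value are illustrative assumptions, not code from the answer above):

from pyspark.sql import functions as F, types as T
from pyspark.ml.feature import NGram, HashingTF, MinHashLSH

# split each (non-empty) address string into single characters
chars_udf = F.udf(lambda s: list(s.lower()) if s else [], T.ArrayType(T.StringType()))
df = df.withColumn("chars", chars_udf(F.col("address")))

# 3-character shingles instead of 2-character ones, to reduce skew
df = NGram(n=3, inputCol="chars", outputCol="ngrams").transform(df)

# hash the shingles into sparse vectors; each row must yield at least one non-zero entry
df = HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 18).transform(df)

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)
matches = model.approxSimilarityJoin(df, df, threshold=0.6, distCol="jaccard_dist")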

sparkR gapply SLOW compared with SQL

I have a data set of ~8 GB with ~10 million rows (about 10 columns) and wanted to prove the point that SparkR could outperform SQL. On the contrary, I see extremely poor performance from SparkR compared with SQL.
My code simply loads the file from S3, then runs gapply, where my groupings will typically consist of 1-15 rows -- so 10 million rows divided by 15 gives a lot of groups. Am I forcing too much shuffling and serialization/deserialization? Is that why things run so slowly?
To illustrate that my build_transition function is not the performance bottleneck, I created a trivial version called build_transition2, shown below, which returns dummy information with what should be constant execution time per group.
Is there anything fundamental or obviously wrong with my solution formulation?
build_transition2 <- function(key, x) {
  patient_id <- integer()
  seq_val <- integer()

  patient_id <- append(patient_id, as.integer(1234))
  seq_val <- append(seq_val, as.integer(5678))

  y <- data.frame(patient_id,
                  seq_val,
                  stringsAsFactors = FALSE)
}

dat_spark <- read.df("s3n://my-awss3/data/myfile.csv", "csv",
                     header = "true", inferSchema = "true", na.strings = "NA")

schema <- structType(structField("patient_ID", "integer"),
                     structField("sequence", "integer"))

result <- gapply(dat_spark, "patient_encrypted_id", build_transition2, schema)
and wanted to prove the point that SparkR could outperform SQL.
That's just not the case. The overhead of indirection caused by the guest language:
Internal Catalyst format
External Java type
Sending data to R
....
Sending data back to JVM
Converting to Catalyst format
is huge.
On top of that, gapply is basically an example of group-by-key, something that we normally avoid in Spark.
Overall, gapply should be used if, and only if, the business logic cannot be expressed using standard SQL functions. It is definitely not a way to optimize your code under normal circumstances (there might be border cases where it is faster, but in general any special logic, if required, will benefit more from native JVM execution with a Scala UDF, UDAF, Aggregator, or reduceGroups / mapGroups).
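As a purely illustrative sketch of "express it with built-in functions instead" (written in PySpark only to keep the added examples in one language; SparkR has equivalent window-function APIs, and the ordering column visit_date is a made-up name):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3n://my-awss3/data/myfile.csv", header=True, inferSchema=True)

# a per-patient sequence number computed entirely inside Catalyst/JVM,
# with no per-group round trips to a guest language
w = Window.partitionBy("patient_encrypted_id").orderBy("visit_date")
result = df.withColumn("sequence", F.row_number().over(w))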

Scala parallel collections workload balancing strategies

I've been toying around with the Scala parallel collections and I was wondering if there was a way to easily define what workload balancing strategy to use.
For instance, let's say we're calculating how many prime numbers we have between 1 and K = 500 000:
def isPrime(k: Int) = (2 to k/2).forall(k % _ != 0)
Array.range(1, 500*1000).par.filter(isPrime).length
If all .par is doing is dividing the data to be processed into different contiguous blocks, then there's not much advantage in parallelizing this algorithm, as the last blocks would dominate the total running time anyway.
On the other hand, running this algorithm such that each thread has an equally distributed share of work would solve the issue (by having each one of N threads start at index x ∈ (0 .. N-1) and then work only on elements at x + kN).
I would like to avoid having to write such boilerplate code. Is there some parameter that would allow me to easily tell Scala's library how to do this?
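For concreteness, here is a small sketch of that round-robin split, written in Python rather than Scala only to keep the added examples in this write-up in one language (so it says nothing about the parallel-collections API itself, and CPython threads won't give a CPU-bound speedup because of the GIL; it only illustrates the index interleaving):

from concurrent.futures import ThreadPoolExecutor

def is_prime(k):
    # mirrors the question's isPrime: forall over 2..k/2
    return all(k % d != 0 for d in range(2, k // 2 + 1))

def count_primes_strided(worker, n_workers, upper):
    # worker x checks x+1, x+1+N, x+1+2N, ... so every worker gets a mix of small and large numbers
    return sum(1 for k in range(worker + 1, upper, n_workers) if is_prime(k))

N = 4
with ThreadPoolExecutor(max_workers=N) as pool:
    total = sum(pool.map(lambda w: count_primes_strided(w, N, 500_000), range(N)))
print(total)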

For Scikit-Learn's RandomForestRegressor, can I specify a different n_jobs for predictions?

Scikit-Learn's RandomForestRegressor has an n_jobs instance attribute that, per the documentation:
n_jobs : integer, optional (default=1)
    The number of jobs to run in parallel for both fit and predict.
    If -1, then the number of jobs is set to the number of cores.
Training the Random Forest model with more than one core is obviously more performant than on a single core. But I have noticed that predictions are a lot slower (approximately 10 times slower) - this is probably because I am using .predict() on an observation-by-observation basis.
Therefore, I would like to train the random forest model on, say, 4 cores, but run the prediction on a single core. (The model is pickled and used in a separate process.)
Is it possible to configure the RandomForestRegressor() in this way?
Oh sure you can, I use a similar strategy for stored models.
Just set <_aRFRegressorModel_>.n_jobs = 1 on the pickle.load()-ed model, before calling its .predict() method.
Nota bene:
the amount of work in a .predict() task is pretty "lightweight" compared to .fit(), so in case of doubt, ask what the core motivation for tweaking this is. Memory could be the issue: once forests get large-scale, they may need to be scanned in n_jobs-"many" replicas (which, due to the joblib nature, re-instantiate the entire Python process state into that many full-scale replicas ... and the new, overhead-strict re-formulation of Amdahl's Law shows what a bad idea that is -- to pay far more than is finally earned, performance-wise). This is not an issue for .fit(), where the concurrent processes can easily amortise the setup overheads (in my models, ~ 4:00:00+ hrs of runtime per process), but right due to this cost/benefit "imbalance" it can be a killer factor for the "lightweight" .predict(), where not much work is to be done, so the process setup/termination costs cannot be masked (and you pay far more than you get).
BTW, do you pickle.dump() the object(s) from the top-level namespace? I had issues when this was not the case and the stored object(s) did not reconstruct correctly. (I spent ages on this issue.)
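A minimal sketch of that pattern (the file name rf_model.pkl and the feature vector are illustrative assumptions):

import pickle
import numpy as np

# load the regressor that was fit elsewhere, e.g. with n_jobs=4
with open('rf_model.pkl', 'rb') as fh:
    model = pickle.load(fh)

# switch joblib parallelism off for the lightweight per-observation predictions
model.n_jobs = 1

x = np.array([[0.1, 2.3, 4.5, 6.7]])   # a single observation, shape (1, n_features)
y_hat = model.predict(x)
print(y_hat)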
