Spark cluster does not scale to small data - apache-spark

I am currently evaluating Spark 2.1.0 on a small cluster (3 nodes with 32 CPUs and 128 GB RAM) with a linear regression benchmark (Spark ML). I only measured the time for the parameter calculation (not including startup, data loading, …) and observed the following behavior: for small datasets of 0.1–3 million data points the measured time barely increases and stays at about 40 seconds; only with much larger datasets, around 300 million data points, does the processing time go up to about 200 seconds. So it seems the cluster does not scale down to small datasets at all.
I also compared the small dataset on my local PC with the cluster, using only 10 workers and 16 GB RAM. The processing time on the cluster is larger by a factor of 3. So is this considered normal Spark behavior, explainable by communication overhead, or am I doing something wrong (or is linear regression simply not representative)?
The cluster is a standalone cluster (without YARN or Mesos) and the benchmarks were submitted with 90 workers, each with 1 core and 4 GB RAM.
Spark submit:
./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData
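For reference, a minimal PySpark sketch of the kind of measurement described above (the actual Benchmark class is not shown; the libsvm input format and the iteration count are assumptions):

import sys
import time
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("Benchmark").getOrCreate()

# Hypothetical input format; the data path is passed as the first argument.
path_to_data = sys.argv[1]
df = spark.read.format("libsvm").load(path_to_data)
df.cache().count()  # materialize the data so loading time is excluded from the measurement

lr = LinearRegression(maxIter=100)
start = time.time()
model = lr.fit(df)  # only the parameter calculation is timed
print("fit time: %.1f s" % (time.time() - start))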

The optimum cluster size and configuration vary with the data and the nature of the job. In this case, I think your intuition is correct: the job seems to take disproportionately long to complete on the smaller datasets because of the excess overhead given the size of the cluster (cores and executors).
Notice that increasing the amount of data by two orders of magnitude increases the processing time only 5-fold. You are increasing the data toward an optimum size for your cluster setup.
Spark is a great tool for processing lots of data, but it isn't going to be competitive with running a single process on a single machine if the data fits there. However, it can be much faster than other, disk-based distributed processing tools when the data does not fit on a single machine.
I was at a talk a couple of years back where the speaker gave the analogy that Spark is like a locomotive racing a bicycle: the bike will win if the load is light, as it is quicker to accelerate and more agile, but with a heavy load the locomotive might take a while to get up to speed, and it's going to be faster in the end. (I'm afraid I forget the speaker's name, but it was at a Cassandra meetup in London, and the speaker was from a company in the energy sector.)

I agree with @ImDarrenG's assessment and generally also with the locomotive/bicycle analogy.
With such a small amount of data, I would strongly recommend
A) caching the entire dataset, and
B) broadcasting the dataset to each node (especially if you need to do something like joining your 300M-row table to the small datasets).
Another thing to consider is the number of files (if the data isn't already cached): if you're reading in a single unsplittable file, only one core will be able to read it. However, once you cache the dataset (coalescing or repartitioning as appropriate), performance will no longer be bound by disk or by deserializing the rows.
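A rough PySpark sketch of (A) and (B), assuming hypothetical paths and a hypothetical join key named id:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
small_df = spark.read.parquet("path/to/small_data")   # hypothetical path
big_df = spark.read.parquet("path/to/big_table")      # hypothetical 300M-row table

# (A) spread the small dataset over the cluster and keep it in memory
small_df = small_df.repartition(90).cache()
small_df.count()  # force materialization of the cache

# (B) broadcast the small side so the join avoids shuffling the large table
joined = big_df.join(broadcast(small_df), on="id")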

Related

Spark Multi-Core experiment [duplicate]

I am doing a simple scaling test on Spark using a sort benchmark, going from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
# run Spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
# run Spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories are, in each case, in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.
Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as follows:
S(s) = 1 / ((1 - p) + p / s)
where:
s - is the speedup of the parallel part.
p - is the fraction of the program that can be parallelized.
In practice the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95) the theoretical limit is quite low (1 / (1 - 0.95) = 20x):
[Plot of Amdahl's law: speedup versus number of processors for several values of p. Image by Daniels220 at English Wikipedia, licensed under Creative Commons Attribution-Share Alike 3.0 Unported.]
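To get a feel for these numbers, the formula can be evaluated directly (plain Python; s = 8 workers as in the question):

# Quick check of the formula above for a few values of p, with s = 8 parallel workers.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

for p in (0.5, 0.9, 0.95):
    print("p=%.2f  speedup on 8 workers: %.2fx  limit: %.0fx" % (p, amdahl(p, 8), 1.0 / (1.0 - p)))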
Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream of anything close to 0.95 or higher. This is because
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes
and execute long-running jobs, but it doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing the amount of available memory without increasing the cost per unit.
These are fundamental problems for large-scale, data-intensive systems.
Parallel processing is more a side effect of this particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant as the amount of data increases by scaling out, not to speed up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than on a typical Spark cluster, but that doesn't necessarily help in data-intensive jobs, due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvements, and will typically decrease performance due to the overhead of its components.
In this context:
Assuming that the class and jar are meaningful and it is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to execute the whole Spark sorting machinery with shuffle files and data exchange.
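In other words, for this size of input, a single-partition, in-memory sort is roughly all the work that actually needs to happen (a rough sketch, assuming the file fits comfortably in memory):

# Hypothetical single-process equivalent of the benchmark: read, sort in memory, write.
with open("data_800MB.txt") as f:
    lines = f.readlines()
lines.sort()
with open("data_800MB_output", "w") as f:
    f.writelines(lines)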

Tuning model fits in Spark ML

I'm fitting a large number of models in PySpark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a Spark DataFrame that's approximately 50 GB, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with allocated executor memory = 10 GB. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
As with any Spark job, there are a few things you can do to improve your training time.
spark.driver.memory: look out for your driver memory; some algorithms shuffle data to the driver (in order to reduce computing time), so it might be a source of improvement, or at least a point of failure to keep an eye on.
spark.executor.memory: set it to the maximum the job needs, but also as little as possible, so you can fit more executors on each node (machine) in the cluster; with more workers you'll have more compute power to handle the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate the data, try different values for this parameter so that you can execute more tasks per executor.
spark.executor.cores: keep it below 5 and you're good; above that, you will probably increase the time an executor spends handling the shuffling of tasks inside it.
cache/persist: try to persist your data before huge transformations; if you're afraid your executors won't be able to hold it all, use StorageLevel.MEMORY_AND_DISK so you can use both (a combined sketch follows below).
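To illustrate, a PySpark sketch combining these knobs (values are illustrative, not tuned for your job; driver/executor memory and cores normally have to be passed at launch, e.g. spark-submit --driver-memory 8g --executor-memory 10g --executor-cores 4, since they cannot be changed after the JVMs have started):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lr-tuning").getOrCreate()

# Shuffle parallelism can be changed at runtime; experiment with this value.
spark.conf.set("spark.sql.shuffle.partitions", "400")

df = spark.read.format("libsvm").load("path/to/training_data")  # hypothetical path
df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if the executors run out of memory
df.count()                                # materialize the persisted data before fitting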
Important: all of this is based solely on my own experience training algorithms with Spark ML over datasets of 1-5 TB with 30-50 features. I've researched ways to improve my own jobs, but I'm not qualified to be a source of truth for your problem. Learn more about your data and watch your executors' logs for further improvements.

Spark only uses 1 CPU when 2x4 CPUs are available on reduce()

I have 3 machines: 1x master with 4 CPUs and 8 GB RAM; 2x executors with 4 CPUs and 16 GB RAM.
The master runs in standalone mode (no YARN), and I'm using PySpark.
Even if it is not a huge infrastructure, I would still expect some performance out of it.
When running a reduce operation:
tfsent = tfsent.reduce(lambda x,y: Row(tf=spvecadd(x.tf, y.tf), sentiment=spvecadd(x.sentiment, y.sentiment)))
where tfsent has tf and sentiment, which are SparseVectors, and spvecadd is a home-made function to add SparseVectors.
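For reference, a plausible shape for such a helper (hypothetical; the actual implementation is not shown), assuming pyspark.ml.linalg.SparseVectors of equal size:

from pyspark.ml.linalg import SparseVector

def spvecadd(a, b):
    # Merge the index/value pairs of both sparse vectors, summing overlapping entries.
    values = dict(zip(a.indices.tolist(), a.values.tolist()))
    for i, v in zip(b.indices.tolist(), b.values.tolist()):
        values[i] = values.get(i, 0.0) + v
    return SparseVector(a.size, sorted(values.items()))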
While this runs, of the 3x 4 CPUs only one, on a single executor, is at 100%. The others are at 0%, and memory use is around 5G/16G.
I don't get:
* why this takes so long,
* why only one CPU is working.
Should I partition my data myself? (I mean explicitly distribute the data across both executors? Even if, to my mind, that's Spark's job.)
Thanks for any help, ideas or tips
pltrdy
Additional information
Both executors are connected to the master and "assigned" to the task (you can check this in the Spark web UI).
I have around 380k lines, and both vectors have a dimension of less than 100
(which is not a lot).
Complexity may depend far more on the dimension than on the number of rows.
Update
It turns out that I have to use repartition(8) to make the RDD distributed. This solved my problem, but not entirely my question: why do I have to do this?
I guess that it is because of how I get the data. I'm reading from a database, i.e.
df = (sqlContext
      .read.format('jdbc')
      .options(url=c.url, dbtable='(%s) tmp' % initial_query,
               user=c.user, password=c.password)
      .load())
which, I guess, stores the data without distributing it (by default a JDBC read produces a single partition unless partitioning options are supplied).
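Two ways to get the parallelism back with this kind of JDBC read (the column name and bounds below are hypothetical):

# What solved it here: redistribute after loading.
df = df.repartition(8)

# Alternatively, let Spark issue partitioned queries so the load itself is parallel:
df = (sqlContext
      .read.format('jdbc')
      .options(url=c.url, dbtable='(%s) tmp' % initial_query,
               user=c.user, password=c.password,
               partitionColumn='id',        # numeric column to split on (hypothetical)
               lowerBound='0', upperBound='380000',
               numPartitions='8')
      .load())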

Hadoop on Azure - file processing on a larger number of nodes takes the same amount of time

I ran a wordcount program in Python on HDInsight clusters of different sizes, and every time it took the same amount of time. The file size is 600 MB and I ran it on 2, 4 and 8 nodes - the same amount of time every run (not to the second, but very close).
I expected the time to change, since the file is processed by a larger number of nodes as the cluster grows in size... I am wondering whether this is just what happens with a relatively small file, or whether there is a way to define the number of nodes on which the job should run - I personally don't think so, since the cluster size is set in advance.
Or is it the nature of the wordcount application and the fact that the reducer does the same amount of work?
Or is it because it's Python, which I have read is said to be slower than Java (or Scala on Spark)?
The same thing happens on Spark clusters: although the number of nodes goes up, the time does not go down.
In my experience, 600 MB is a small amount of data to process on Hadoop. Not all of the elapsed time is spent processing the files, because Hadoop needs some time to prepare and start the M/R job and stage the data on HDFS.
For a small dataset, it isn't necessary to use many compute nodes. In fact, a single computer can achieve higher performance than a Hadoop cluster, as with the Hadoop sample wordcount over several small text files.
As far as I know, the dataset size on Hadoop generally needs to reach the hundreds-of-GB level to see a performance advantage, with performance then increasing with the number of nodes.
For reference, there is an SO thread (Why submitting job to mapreduce takes so much time in General?) that you may find useful.

Does Spark incur the same amount of overhead as Hadoop for vnodes?

I just read https://stackoverflow.com/a/19974621/260805. Does Spark (specifically Datastax's Cassandra Spark connector) incur the same amount of overhead as Hadoop when reading from a Cassandra cluster? I know Spark uses threads more heavily than Hadoop does.
Performance with and without vnodes in the connector should be basically the same. With Hadoop, each vnode split generated its own task, which created a large amount of overhead.
With Spark, the token ranges from multiple vnodes are merged into a single task, so the overall task overhead is lower. There is a slight locality issue where it becomes difficult to get a balanced number of tasks across all the nodes in the C* cluster with smaller data sizes. This issue is being worked on in SPARKC-43.
I'll give three separate answers. I apologize for the rather unstructured answer, but it's been building up over time:
A previous answer:
Here's one potential answer: Why not enable virtual node in an Hadoop node?
Does this also apply to Spark?
No, if you're using the official DataStax spark-cassandra-connector. It can process multiple token ranges in a single Spark task. There is still some minor performance hit, but not as huge as with Hadoop.
A production benchmark
We ran a Spark job against a vnode-enabled Cassandra (Datastax Enterprise) datacenter with 3 nodes. The job took 9.7 hours. Running the same job on slightly less data, using 5 non-vnode nodes, a couple of weeks back took 8.8 hours.
A controlled benchmark
To further test the overhead we ran a controlled benchmark on a Datastax Enterprise node in a single-node cluster. With vnodes enabled and then disabled, the node was 1) reset, 2) X rows were written, and 3) SELECT COUNT(*) FROM emp was executed in Shark a couple of times to get cold vs. hot cache timings. The values of X tested were 10^0 through 10^8.
Assuming that Shark is not dealing with vnodes in any way, the average (quite stable) overhead for vnodes was ~28 seconds for cold Shark query executions and 17 seconds for hot executions. The latency difference generally did not vary with data size.
All the numbers for the benchmark can be found here. All scripts used to run the benchmark (see output.txt for usage) can be found here.
My only guess why there was a difference between "Cold diff" and "Hot diff" (see spreadsheet) is that it took Shark some time to create metadata, but this is simply speculation.
Conclusion
Our conclusion is that the overhead of vnodes is a constant time between 13 and 30 seconds, independent of data size.
