Spark Performance EMR(2.4.5) vs EKS(3.0.0) - apache-spark

I am running experiments with some Spark jobs, trying to compare performance on EMR and on EKS. The hardware I used was 2 m5.2xlarge instances (8 vCores, 32 GiB memory each); the same instance type can back both EKS and EMR, which makes the performance comparison more reliable.
Both setups used the same Spark configuration:
--conf spark.executor.instances=2 \
--conf spark.executor.cores=3 \
--conf spark.default.parallelism=16 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=4g \
--conf spark.executor.memoryOverhead=4g
(Spark 2.4.5 for EMR, Spark 3.0.0 for Kubernetes)
The Spark jobs read JSON files from S3 and write Parquet back to S3.
I consistently get faster S3 reads and writes on EMR (approximately 23% faster).
Could that be because of S3-specific optimizations on EMR? What could I do to improve the performance on Kubernetes?

The performance improvement on EMR is due to EMRFS, Amazon's S3 connector for EMR, which comes by default with EMR but is not available on EKS.
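If you are running plain Spark on EKS with the S3A connector, a common mitigation is to tune S3A and switch to a cloud-aware output committer. Below is a minimal PySpark sketch, assuming Spark 3.0 with the hadoop-aws and spark-hadoop-cloud jars on the classpath; the bucket paths are hypothetical and the values are starting points, not tuned recommendations.

from pyspark.sql import SparkSession

# Sketch: tune the S3A connector and use the S3A "magic" committer instead of
# the default rename-based file committer (renames are slow on S3).
spark = (
    SparkSession.builder
    .appName("s3a-tuning-sketch")
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/input/")                 # hypothetical path
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # hypothetical path

Whether this closes the whole 23% gap depends on the workload, but it removes the slow rename-based commit path that hurts most on S3.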

Related

Monitoring EMR using Prometheus

My question is about monitoring Amazon EMR with Prometheus and Grafana when deploying in cluster mode.
When running a job in standalone mode on EMR, metrics are exposed on the endpoints for the master, driver, and executors; in cluster mode, however, none of the metrics are there.
I also tried exporting to other metric sinks such as CSV and console, which work perfectly fine in standalone mode, but nothing happens in cluster mode.
I am using PrometheusServlet for exporting metrics and the same endpoints as explained here.
spark-submit --files metrics.properties --conf spark.metrics.conf=metrics.properties --conf spark.ui.prometheus.enabled=true --deploy-mode cluster --master yarn testCluster.py
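For reference, here is a minimal sketch of the same PrometheusServlet setup expressed as spark.metrics.conf.* properties instead of a metrics.properties file, assuming Spark 3.0+ (where that sink exists). This is not a confirmed fix for the cluster-mode behaviour, only an equivalent configuration that does not depend on the properties file being resolvable on whichever node the driver lands on.

from pyspark.sql import SparkSession

# Sketch: declare the PrometheusServlet sink via Spark configuration properties
# (the same keys as in metrics.properties, prefixed with spark.metrics.conf.).
spark = (
    SparkSession.builder
    .appName("prometheus-metrics-sketch")
    .config("spark.ui.prometheus.enabled", "true")
    .config("spark.metrics.conf.*.sink.prometheusServlet.class",
            "org.apache.spark.metrics.sink.PrometheusServlet")
    .config("spark.metrics.conf.*.sink.prometheusServlet.path",
            "/metrics/prometheus")
    .config("spark.metrics.conf.master.sink.prometheusServlet.path",
            "/metrics/master/prometheus")
    .config("spark.metrics.conf.applications.sink.prometheusServlet.path",
            "/metrics/applications/prometheus")
    .getOrCreate()
)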

Spark reads twice as much data

I have a Spark cluster in my private network, and I have a job containing only one line of code:
spark.read.parquet('/path/to/data').count()
I ran the job on the same data on an EMR Spark cluster and on my private Spark cluster, both with the same parameters:
spark-submit --driver-memory 1G --driver-cores 1 --num-executors 1 --executor-memory 8G --executor-cores 2 --conf spark.dynamicAllocation.enabled=false dummy_job.py
On the Spark monitoring web page, I saw that EMR read only 3 GiB of data while my private cluster read 6.1 GiB.
Both Spark clusters prune the amount of data read considerably, but EMR reads much less, which may indicate that our gzidc Spark cluster has an incorrect configuration related to Parquet I/O.
Any ideas? Thanks
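As a first diagnostic, it may help to compare the Parquet and file-scan settings on both clusters. The keys below are standard Spark SQL options; the assumption that one of them explains the 2x difference is mine, not confirmed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-io-diagnostic").getOrCreate()

# Sketch: print the settings that most directly affect how much data a
# Parquet scan reads, so the two clusters can be compared side by side.
for key in [
    "spark.sql.parquet.filterPushdown",
    "spark.sql.parquet.enableVectorizedReader",
    "spark.sql.parquet.compression.codec",
    "spark.sql.files.maxPartitionBytes",
]:
    print(key, "=", spark.conf.get(key, "<not set>"))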

Spark UI for a certain application_id is slow when the job has been running for quite some time

When I click the application_id of a long-running job (say 24 hours) in the Spark UI, it takes a long time to load the stages. I don't know whether this is related to my Spark config or to the client deploy mode. Here is more info on my Spark config:
--master yarn \
--deploy-mode client \
--driver-memory 12g \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 108 \
The UI runs on the driver's machine, so if that machine runs out of RAM, the UI gets very slow.
Here I see that you request 12 GB of RAM for the driver. That is a lot, and if it is all the memory available on the machine, it makes sense that the UI eventually becomes very slow. The driver process is only supposed to coordinate the computation and distribute it among the workers.
I suspect you are collecting a large amount of data to the driver, which is generally not a good idea (see https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html).
A better option would be to write the RDD to a file or a distributed DB.
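As a minimal sketch of that last point (paths are hypothetical), keep the result on the cluster instead of pulling it into the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-instead-of-collect").getOrCreate()
df = spark.read.parquet("hdfs:///input/path")  # hypothetical input

# rows = df.collect()  # avoid: materializes every row in driver memory
df.write.mode("overwrite").parquet("hdfs:///output/path")  # stays distributed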

Multiple executors for a Spark application

Can one worker have multiple executors for the same Spark application in standalone and YARN mode? If not, what is the reason (for both standalone and YARN mode)?
Yes. You can specify the resources Spark will use; for example, with these properties:
--num-executors 3
--driver-memory 4g
--executor-memory 2g
--executor-cores 2
If a node has enough resources, the cluster assigns more than one executor to the same node.
You can read more information about Spark resources configuration here.
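For illustration, here is a sketch of the same executor sizing set programmatically (the application name is hypothetical); on a worker with, say, 8 cores and 16 GB free, several 2-core / 2 GB executors of one application can then be co-located.

from pyspark.sql import SparkSession

# Sketch: equivalent executor sizing via SparkSession configuration.
# (Driver memory must still be set at submit time, before the driver JVM starts.)
spark = (
    SparkSession.builder
    .appName("multi-executor-sketch")
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)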

Spark 2 on YARN is utilizing more cluster resources automatically

I am on CDH 5.7.0 and I am seeing a strange issue with Spark 2 running on a YARN cluster. Here is my job submission command:
spark2-submit --master yarn --deploy-mode cluster --conf "spark.executor.instances=8" --conf "spark.executor.cores=4" --conf "spark.executor.memory=8g" --conf "spark.driver.cores=4" --conf "spark.driver.memory=8g" --class com.learning.Trigger learning-1.0.jar
Even though I have limited the cluster resources my job can use, I can see that resource utilization exceeds the allocated amount.
The job starts with a modest memory footprint of about 8 GB and then gradually eats up the whole cluster.
I do not have dynamic allocation set to true.
I am just triggering an INSERT OVERWRITE query on top of SparkSession.
Any pointers would be very helpful.
I created a resource pool in the cluster and assigned resources as follows:
Min Resources: 4 virtual cores and 8 GB memory
I used this pool for the Spark job to limit its resource usage (vCores and memory).
e.g. spark2-submit --class org.apache.spark.SparkProgram.rt_app --master yarn --deploy-mode cluster --queue rt_pool_r1 /usr/local/abc/rt_app_2.11-1.0.jar
If anyone has better options to achieve the same, please let us know.
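One thing worth ruling out, sketched below under the assumption that dynamic allocation might still be enabled by a cluster-wide default (common on CDH) even when not set per job: pin the allocation explicitly in the job itself and check whether usage still grows. The table names are hypothetical.

from pyspark.sql import SparkSession

# Sketch: explicitly disable dynamic allocation and fix the executor count,
# so cluster-wide defaults cannot grow the job beyond the intended size.
spark = (
    SparkSession.builder
    .appName("resource-cap-sketch")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table")  # hypothetical tables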
