I have a Spark cluster in my private network, and I have a job containing only one line of code:
spark.read.parquet('/path/to/data').count()
I tried to run the job with the same data on an EMR Spark cluster and on my private Spark cluster, both with the same parameters:
spark-submit --driver-memory 1G --driver-cores 1 --num-executors 1 --executor-memory 8G --executor-cores 2 --conf spark.dynamicAllocation.enabled=false dummy_job.py
On the Spark monitoring web page, I saw that EMR read only 3 GiB of data while my private cluster read 6.1 GiB.
Both Spark clusters prune the read size a lot, but EMR reads much less data, which may indicate that our gzidc Spark cluster has an incorrect configuration related to Parquet I/O.
Any ideas? Thanks
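One way to narrow this down is to diff the Parquet-related I/O settings on the two clusters. Below is a minimal PySpark sketch; the keys listed are common suspects I would check first, not an exhaustive or EMR-specific list, and the s3a keys only matter if the path is actually read through s3a://.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-io-config-diff").getOrCreate()

# Spark SQL settings that influence how much of a Parquet file is actually read
for key in ["spark.sql.parquet.filterPushdown",
            "spark.sql.parquet.enableVectorizedReader",
            "spark.sql.files.maxPartitionBytes"]:
    print(key, "=", spark.conf.get(key, "<not set>"))

# Hadoop-level input settings (internal handle, used here only for inspection)
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ["fs.s3a.experimental.input.fadvise", "fs.s3a.readahead.range"]:
    print(key, "=", hadoop_conf.get(key))

Running the same snippet on EMR and on the private cluster and diffing the output should show whether the two readers are configured differently.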
Related
I am running experiments with some Spark jobs and I am trying to compare performance on EMR and on EKS. The hardware I used was 2 m5.2xlarge instances (8 vCores, 32 GiB memory each). The reason is that this instance type is commonly available to both EKS and EMR, which makes the performance comparison more reliable.
Here is the Spark configuration I used:
--conf spark.executor.instances=2 \
--conf spark.executor.cores=3 \
--conf spark.default.parallelism=16 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=4g \
--conf spark.executor.memoryOverhead=4g
(Spark 2.4.5 for EMR, Spark 3.0.0 for Kubernetes)
The Spark jobs read some JSON files from S3 and write Parquet back to S3.
I consistently get faster reads and writes on S3 with EMR (approximately 23% faster on EMR).
Could that be because of S3-specific optimizations on EMR? What could be done to improve the performance on Kubernetes?
The performance improvement on EMR comes from EMRFS, which ships by default with EMR but is not available on EKS.
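EMRFS itself cannot be installed outside EMR, but the S3A connector used on Kubernetes can be tuned. A hedged sketch of settings worth trying with Spark 3.0 on EKS, assuming hadoop-aws and the spark-hadoop-cloud module are on the image's classpath (verify those classes exist in your build before relying on them; bucket paths are hypothetical):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("eks-s3a-tuning")
         # commit output through an S3A committer instead of slow S3 renames
         .config("spark.hadoop.fs.s3a.committer.name", "directory")
         .config("spark.sql.sources.commitProtocolClass",
                 "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
         .config("spark.sql.parquet.output.committer.class",
                 "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
         # allow more parallel S3 connections for reads and writes
         .config("spark.hadoop.fs.s3a.connection.maximum", "100")
         .getOrCreate())

df = spark.read.json("s3a://my-bucket/input/")               # hypothetical input
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # hypothetical output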
I am writing a Spark Structured Streaming application in which the data processed by Spark needs to be written to an S3 bucket.
This is my development environment.
Hadoop 2.6.0-cdh5.16.1
Spark version 2.3.0.cloudera4
I want to limit the VCore usage.
So far I have used spark2-submit with the option --conf spark.cores.max=4. However, after submitting the job I observed that it occupied the maximum number of VCores available in the cluster (my cluster has 12 VCores).
As a result, the next job does not start because no VCores are available.
What is the best way to limit the VCore usage per job?
For now I am using a workaround: I created a Resource Pool in the cluster and assigned it some resources:
Min Resources: 4 Virtual Cores and 8 GB memory
I then use this pool to submit the Spark job so that its VCore usage is limited.
e.g. spark2-submit --class org.apache.spark.SparkProgram.rt_app --master yarn --deploy-mode cluster --queue rt_pool_r1 /usr/local/abc/rt_app_2.11-1.0.jar
I want to limit the VCore usage without any workaround.
I also tried
spark2-shell --num-executors 1 --executor-cores 1 --jars /tmp/elasticsearch-hadoop-7.1.1.jar
and below is my observation.
You can use the "--executor-cores" option it will assign the number of core to each of your executor.
can refer 1 and 2
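Note that spark.cores.max only applies to standalone and Mesos; on YARN the per-job vCore footprint is roughly num-executors × executor-cores plus the driver/AM cores. A minimal sketch of capping it that way (the numbers are only an illustration):

from pyspark.sql import SparkSession

# Cap the job at 2 executors x 2 cores = 4 executor VCores, plus 1 VCore for the driver/AM.
spark = (SparkSession.builder
         .appName("vcore-capped-job")
         .config("spark.dynamicAllocation.enabled", "false")  # keep the executor count fixed
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "2")
         .config("spark.driver.cores", "1")
         .getOrCreate())

The equivalent on the command line is --num-executors 2 --executor-cores 2 on spark2-submit.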
When I click the application_id of a long-running job (say, 24 hours) in the Spark UI, the stages take a long time to load. I don't know if this is connected to my Spark config or to my client deploy mode. Here is more info on my Spark config:
--master yarn \
--deploy-mode client \
--driver-memory 12g \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 108 \
The UI runs on the driver's machine. Therefore, if that machine runs out of RAM, the UI gets very slow.
Here I see that you request 12 GB of RAM for the driver. This is a lot, and if it is all the memory available on the machine, it makes sense that the UI gets very slow at some point. The driver process is only supposed to drive the computation and distribute the work among the executors.
I guess that you are collecting a large amount of data, which is generally not a good idea (see https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html).
A better option would be to write the RDD to a file or a distributed DB.
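A minimal sketch of that suggestion (the paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-instead-of-collect").getOrCreate()
df = spark.read.parquet("hdfs:///user/me/input/")          # hypothetical input

# Avoid: rows = df.collect()  -- this pulls every row into the 12g driver heap
# and competes with the Spark UI for memory on the driver machine.
# Prefer: let the executors write the result out in parallel.
df.write.mode("overwrite").parquet("hdfs:///user/me/output/")   # hypothetical output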
Can one worker have multiple executors for the same Spark application in standalone and YARN mode? If not, what is the reason for that (for both standalone and YARN mode)?
Yes. You can specify the resources Spark will use; for example, you can configure these properties:
--num-executors 3
--driver-memory 4g
--executor-memory 2g
--executor-cores 2
If your node has enough resources, the cluster assigns more than one executor to the same node.
You can read more about Spark resource configuration here.
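As an illustration, a standalone-mode sketch: on a worker that advertises 8 cores, asking for 2 cores per executor with an 8-core application cap lets the master place up to 4 executors on that single worker (the master URL and all numbers here are assumptions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")   # hypothetical standalone master
         .appName("multi-executor-per-worker")
         .config("spark.executor.cores", "2")  # cores per executor
         .config("spark.executor.memory", "2g")
         .config("spark.cores.max", "8")       # total cores for the whole application
         .getOrCreate())

On YARN the same effect comes from --num-executors and --executor-cores: as long as a NodeManager has free vCores and memory, several executor containers of the same application can land on one node.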
I am on CDH 5.7.0 and I am seeing a strange issue with Spark 2 running on a YARN cluster. Below is my job submit command:
spark2-submit --master yarn --deploy-mode cluster --conf "spark.executor.instances=8" --conf "spark.executor.cores=4" --conf "spark.executor.memory=8g" --conf "spark.driver.cores=4" --conf "spark.driver.memory=8g" --class com.learning.Trigger learning-1.0.jar
Even though I have limited the cluster resources my job can use, I can see that the resource utilization is higher than the allocated amount.
The job starts with a basic memory consumption of around 8 GB and then eats up the whole cluster.
I do not have dynamic allocation set to true.
I am just triggering an INSERT OVERWRITE query through SparkSession.
Any pointers would be very helpful.
I created a Resource Pool in the cluster and assigned it some resources:
Min Resources: 4 Virtual Cores and 8 GB memory
I then use this pool to submit the Spark job so that its resource usage (VCores and memory) is limited.
e.g. spark2-submit --class org.apache.spark.SparkProgram.rt_app --master yarn --deploy-mode cluster --queue rt_pool_r1 /usr/local/abc/rt_app_2.11-1.0.jar
If anyone has better options to achieve the same, please let us know.
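For reference, the queue can also be set through Spark configuration (spark.yarn.queue) instead of the --queue flag. A sketch assuming the rt_pool_r1 pool above and client deploy mode (in cluster mode the queue has to be fixed at submit time):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rt_app")
         .config("spark.yarn.queue", "rt_pool_r1")      # the Resource Pool created above
         .config("spark.executor.instances", "8")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .getOrCreate())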