Spark + Elasticsearch write performance issue - apache-spark

I am seeing a low number of writes to Elasticsearch using Spark (Java).
Here are the configurations:
ES cluster:
using 13.xlarge machines
4 instances, each with 4 processors.
Set refresh_interval to -1 and number_of_replicas to 0, plus the other basic configurations recommended for faster indexing.
Spark :
2 node EMR cluster with
2 Core instances
- 8 vCPU, 16 GiB memory, EBS only storage
- EBS Storage:1000 GiB
1 Master node
- 1 vCPU, 3.8 GiB memory, 410 SSD GB storage
ES index has 16 shards defined in mapping.
The job runs with the following config:
executor-memory - 8g
spark.executor.instances=2
spark.executor.cores=4
and using
es.batch.size.bytes - 6MB
es.batch.size.entries - 10000
es.batch.write.refresh - false
With this configuration, I try to load 1 million documents (each document is about 1,300 bytes), and the load runs at around 500 records/docs per second per ES node.
In the Spark log I see, for each task:
- 1116 bytes result sent to driver
Spark Code
JavaRDD<String> javaRDD = jsc.textFile("<S3 Path>");
JavaEsSpark.saveJsonToEs(javaRDD,"<Index name>");
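For reference, the elasticsearch-hadoop batch settings listed above can also be passed at submit time rather than in code. This is only a sketch: the class name, jar name, and ES host are placeholders, and the resource flags mirror the values from the question.

```shell
# Sketch only: submitting the same job with the question's batch settings
# passed as --conf flags (class, jar, and es.nodes value are placeholders).
spark-submit \
  --class com.example.EsLoadJob \
  --executor-memory 8g \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=4 \
  --conf es.nodes=es-node-1 \
  --conf es.batch.size.bytes=6mb \
  --conf es.batch.size.entries=10000 \
  --conf es.batch.write.refresh=false \
  es-load-job.jar
```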
Also, when I look at the network-in graph on the ES cluster it is very low, and I can see that EMR is not sending much data over the network. Is there a way I can tell Spark to send the right amount of data to make the writes faster?
OR
Is there any other config that I am missing to tweak?
500 docs per second per ES instance seems low to me. Can someone please point out what I am missing in these settings to improve my ES write performance?
Thanks in advance

You may have an issue here.
spark.executor.instances=2
You are limited to two executors, where you could have four based on your cluster configuration. I would change this to 4 or greater. I might also try executor-memory = 1500M, cores = 1, instances = 16. I like to leave a little overhead in memory, which is why I dropped from 2G to 1.5G (and since you can't specify 1.5G, it has to be written as 1500M). If you are connecting from your executors, this will improve performance.
I would need some code to debug further. I wonder if you are connected to Elasticsearch only in your driver, and not in your worker nodes, meaning you are getting only one connection instead of one per executor.
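A quick back-of-the-envelope check of the layout suggested above, assuming the cluster specs from the question (2 core nodes with 8 vCPUs and 16 GiB each, running 1-core executors):

```shell
# Sanity check (assumptions taken from the question's cluster specs):
NODES=2
VCPUS_PER_NODE=8
MEM_GB_PER_NODE=16
TOTAL_EXECUTOR_SLOTS=$(( NODES * VCPUS_PER_NODE ))              # 16 one-core executors
MEM_PER_SLOT_MB=$(( MEM_GB_PER_NODE * 1024 / VCPUS_PER_NODE ))  # 2048 MB per slot; 1500M leaves headroom
echo "${TOTAL_EXECUTOR_SLOTS} executors, up to ${MEM_PER_SLOT_MB}MB each"
```

This is why 16 instances at 1500M fits: 16 slots of ~2 GB each, with the 1500M request leaving room for overhead.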

Related

Can we have the same Spark config for all jobs/applications?

I am trying to understand Spark configuration. I see that the number of executors, executor cores, and executor memory is calculated based on the cluster. E.g.:
Cluster config:
10 nodes
16 cores per node
64 GB RAM per node
Recommended config is 29 executors, 18 GB memory each, and 5 cores each!
However, would this config be the same for all the jobs/applications that run on the cluster? What happens if more than one job/app is running at the same time? Also, would this config be the same irrespective of the data I am processing, whether it is 1 GB or 100 GB, or would the config change based on the data as well? If so, how is it calculated?
Reference for recommend config- https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
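The sizing rule behind that recommendation can be reproduced as a small calculation. This is a sketch of the usual heuristic (reserve 1 core and 1 GB per node for OS/Hadoop daemons, cap executors at 5 cores, leave one slot for the YARN application master); the exact memory figure then shrinks a little further once off-heap overhead (~7%) is subtracted, landing near the quoted 18 GB.

```shell
# Sketch of the sizing heuristic for the 10-node / 16-core / 64 GB cluster:
NODES=10
CORES_PER_NODE=16
MEM_GB_PER_NODE=64
CORES_PER_EXECUTOR=5
EXECUTORS_PER_NODE=$(( (CORES_PER_NODE - 1) / CORES_PER_EXECUTOR ))    # 3 (1 core reserved per node)
TOTAL_EXECUTORS=$(( EXECUTORS_PER_NODE * NODES - 1 ))                  # 29 (1 slot kept for the YARN AM)
MEM_PER_EXECUTOR_GB=$(( (MEM_GB_PER_NODE - 1) / EXECUTORS_PER_NODE ))  # 21 GB before overhead deduction
echo "${TOTAL_EXECUTORS} executors, ${CORES_PER_EXECUTOR} cores, ~${MEM_PER_EXECUTOR_GB}GB each"
```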
The default configuration in Spark applies to all jobs; you can set it in spark-defaults.conf.
In the case of YARN, jobs are automatically queued if enough resources are not available.
You can set the number of executor cores and other configuration during spark-submit to override the defaults. You can also look at dynamic allocation to avoid doing this yourself, though it is not guaranteed to work as efficiently as setting the configuration yourself.
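As a sketch of the override approach (the jar name here is a placeholder): per-job flags passed at submit time win over spark-defaults.conf, and on YARN dynamic allocation also needs the external shuffle service enabled.

```shell
# Sketch: per-job overrides at submit time; on YARN, dynamic allocation
# additionally requires the external shuffle service.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my-job.jar
```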

Experiencing very Low CPU utilization of Spark Jobs on AWS EMR

We have a Spark job that reads a CSV file, applies a series of transformations, and writes the result to an ORC file.
The job breaks into close to 20 stages and runs for around an hour.
input csv file size: 10 GB
spark-submit job resource configuration:
driver-memory= 5 GB
num-executors= 2
executor-cores = 3
executor-memory= 20 GB
EC2 instance type: r5d.xlarge, i.e. 32 GB memory and 4 vCPUs, with an attached 128 GB EBS volume
The EMR cluster comprises 1 master node and 2 core machines.
When we run the Spark job on the above cluster configuration, CPU utilization is only close to 10-15%.
Our requirement is to maximize the CPU utilization of the EC2 instances for the Spark job.
Any suggestions are appreciated!
AFAIK, if you increase parallelism, CPU usage will automatically increase. Try using these in your Spark job configuration:
num-executors = 4
executor-cores = 5
executor-memory = 25 GB
Especially if you increase CPU cores, parallelism will increase.
More than 5 cores per executor is not recommended. This is based on a study showing that any application with more than 5 concurrent threads per executor starts to hamper performance.
spark.dynamicAllocation.enabled could be another option.
spark.default.parallelism = 2 * number of CPUs in total on worker nodes
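Applying that rule of thumb to the cluster in the question (2 core nodes of r5d.xlarge, 4 vCPUs each) gives a concrete starting value; the node counts here are taken from the question above.

```shell
# Rule of thumb: spark.default.parallelism = 2 x total worker vCPUs.
WORKER_NODES=2
VCPUS_PER_NODE=4
PARALLELISM=$(( 2 * WORKER_NODES * VCPUS_PER_NODE ))   # 2 x 8 vCPUs
echo "spark.default.parallelism=${PARALLELISM}"
```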
Make sure that you always use YARN mode.
Follow "Using maximizeResourceAllocation" in the AWS docs, where all of these things are discussed in detail. Read it completely.
You can configure your executors to utilize the maximum resources possible on each node in a cluster by using the spark configuration classification to set maximizeResourceAllocation option to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets the corresponding spark-defaults settings based on this information.
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
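One way to apply that classification is at cluster creation time. This is a sketch, not a complete command: the release label, instance parameters, and config file name are placeholders chosen for illustration.

```shell
# Sketch: applying the classification JSON (saved as spark-config.json)
# when launching the EMR cluster; parameters below are placeholders.
aws emr create-cluster \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type r5d.xlarge \
  --instance-count 3 \
  --configurations file://spark-config.json \
  --use-default-roles
```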
Further reading
Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
EMR-spark-tuning-demystified

Ideal Spark configuration

I am using Apache Spark on HDFS with MapR in our project. We are facing issues running Spark jobs, as they fail after a small increase in the data. We read data from a CSV file, do some transformation and aggregation, and then store the result in HBase.
Current Data size = 3TB
Available resources:
Total nodes : 14
Memory Available : 1TB
Total VCores : 450
Total Disk : 150 TB
Spark Conf:
executorCores : 2
executorInstance : 50
executorMemory: 40GB
minPartitions: 600
Please suggest whether the above configuration looks fine, because the error I am getting looks like an out-of-memory error.
Can you say a bit about how the jobs are failing? Without more information, it will be very hard to say. It would help to know which version of Spark you are on and whether you are running under YARN, with a standalone Spark cluster, or even on Kubernetes.
Even without any information, however, it seems likely that there is a configuration issue here. What may be happening is that Spark is being told contradictory things about how much memory is available so that when it tries to use memory it thinks it is allowed to use, the system says no.

Spark jobs seem to only be using a small amount of resources

Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
The cluster is 1 + 16 nodes, with 8 cores / 40 GB memory / 1 TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34 of 128 vCores are in use, and they do not appear to be evenly distributed (the jobs were started simultaneously, but the distribution is 2/7/7/11/7). There is only one core allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6 which doesn't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue: I had no cap on memory usage, so it looked as though all memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job; it instantly allocated 92 cores.
Hope this helps someone else!
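For anyone submitting via the Dataproc CLI, the property that fixed it can be passed at submit time. This is only a sketch: the cluster name, region, class, and jar path are placeholders.

```shell
# Sketch: passing the executor-memory property on Dataproc
# (cluster, region, class, and jar path are placeholders).
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/my-job.jar \
  --properties=spark.executor.memory=4g
```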
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
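As a concrete starting point for the 1 + 16 cluster above (16 workers x 8 cores = 128 vCores), a common heuristic is to set both partition knobs to roughly 2x the total vCores; the multiplier here is an assumption, not a rule from the answer.

```shell
# Sketch: deriving starting values for both partition-count properties
# from the cluster size (2x total vCores is a common heuristic).
TOTAL_VCORES=$(( 16 * 8 ))           # 128
PARTITIONS=$(( 2 * TOTAL_VCORES ))   # 256
echo "--conf spark.default.parallelism=${PARTITIONS} --conf spark.sql.shuffle.partitions=${PARTITIONS}"
```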

EMR Cluster utilization

I have a 20-node c4.4xlarge cluster to run a Spark job. Each node is a 16 vCore, 30 GiB memory, EBS-only storage (EBS storage: 32 GiB) machine.
Since each node has 16 vCores, I understand that the maximum number of executors is 16*20 = 320. Total memory available is 20 (nodes) * 30 ≈ 600 GB. Assigning one third to system operations, I have 400 GB of memory to process my data in-memory. Is this the right understanding?
Also, the Spark History UI shows a non-uniform distribution of input and shuffle. I believe the processing is not distributed evenly across executors. I pass these config parameters in my spark-submit:
> --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=20
The executor summary from the Spark History UI also shows that the data distribution is completely skewed, and I am not using the cluster in the best way. How can I distribute my load in a better way?
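The memory arithmetic in the question checks out; here it is written as a small calculation, using the figures stated above (20 nodes x 30 GiB, one third reserved for system operations):

```shell
# Checking the question's memory arithmetic.
NODES=20
MEM_GB_PER_NODE=30
TOTAL_MEM_GB=$(( NODES * MEM_GB_PER_NODE ))   # 600 GB total
USABLE_MEM_GB=$(( TOTAL_MEM_GB * 2 / 3 ))     # 400 GB left for in-memory processing
echo "${USABLE_MEM_GB}GB usable of ${TOTAL_MEM_GB}GB total"
```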
