Spark Performance on AWS EMR - apache-spark

I am new to Spark with Scala and this is my first question on Stack Overflow. I have a Spark Scala ETL pipeline running on AWS EMR with 6 m5.4xlarge instances. After researching and reading through articles, I came up with the following Spark configuration for spark-submit:
executor-memory - 18G
num-executors - 17
executor-cores - 5
The ETL pipeline takes about 10-20 minutes in my local environment (deploy-mode client) for 20 MB of data. However, with the above configuration, the pipeline takes around an hour and a half. I am not sure whether the instances launched on EMR are not enough or where the problem is. Please help me with this issue. Thank you.
I have also tweaked the instance type and configuration to 6 m6g.4xlarge with 20 executors, 8 executor-cores and 32G executor-memory, but the performance does not seem to improve significantly. The pipeline still takes 1 hour and 10 minutes.
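For reference, the settings listed above would be passed to spark-submit roughly as in the sketch below; the application class and JAR path are placeholders, not taken from the question:
# Hypothetical invocation; class name and JAR path are placeholders.
spark-submit \
  --num-executors 17 \
  --executor-cores 5 \
  --executor-memory 18G \
  --class com.example.EtlPipeline \
  s3://my-bucket/etl-pipeline.jar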

Related

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks. 1 DPU is reserved for the master and 1 executor is reserved for the driver. So, since my development endpoint has 4 DPUs, I expect to have 5 executors and 20 tasks.
The script I am developing loads 1 million rows using a JDBC connection. Then I coalesce the single one-million-row partition into 5 partitions and write it to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job again runs for 34 seconds. So if I have 20 tasks available, why does the script take the same amount of time to complete with more partitions?
Edit: I started executing the script as an actual job rather than on the development endpoint. I set the number of workers to 10 and the worker type to Standard. Looking at the metrics I can see that I have only 9 executors instead of 17, and only 1 executor is doing anything while the rest are idle.
Code:
...
# Read from the JDBC source, reduce to 17 partitions, and write gzip CSV to S3.
df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"])
      .option("dbtable", query)
      .option("fetchSize", 50000)
      .load())

df = df.coalesce(17)

(df.write.mode("overwrite")
 .format("csv")
 .option("compression", "gzip")
 .option("maxRecordsPerFile", 1000000)
 .save(job_config["s3Path"]))
...
This is most likely a limitation of the number of connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the JDBC partitioning options.
Since you are reading into a DataFrame, you can set the partition column together with the lower and upper bounds and the number of partitions. More can be found in the Spark JDBC data source documentation.
To size your DPUs correctly, I would suggest enabling the Spark UI; it can help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.
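To make the parallel read concrete, a sketch of the partitioned JDBC options might look like the following; the partition column "id", the bounds and numPartitions are illustrative assumptions that must match your table:
# Hypothetical sketch: split the JDBC read across multiple tasks/connections.
# "id", the bounds and numPartitions are assumptions, not from the original post.
df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"])
      .option("dbtable", query)
      .option("partitionColumn", "id")
      .option("lowerBound", 1)
      .option("upperBound", 1000000)
      .option("numPartitions", 20)
      .option("fetchSize", 50000)
      .load())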

Why Spark applications are not running on all nodes

I installed the following spark benchmark:
https://github.com/BBVA/spark-benchmarks
I run Spark on top of YARN with 8 workers, but I only get 2 running executors during the job (TestDFSIO).
I also set executor-cores to 9, but only 2 executors are running.
Why would that happen?
I think the problem comes from YARN, because I get a similar (almost identical) issue with TestDFSIO on Hadoop. In fact, at the beginning of the job only two nodes run, but then all the nodes execute the application in parallel!
Note that I am not using HDFS for storage!
I solved this issue. What I did was set the number of cores per executor to 5 (--executor-cores) and the total number of executors to 23 (--num-executors), which was 2 by default at first.
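For illustration, those flags would be passed on the command line roughly as below; the benchmark class and JAR names are placeholders, not taken from the post:
# Hypothetical invocation; only --num-executors and --executor-cores come from the answer.
spark-submit \
  --num-executors 23 \
  --executor-cores 5 \
  --class com.example.TestDFSIO \
  spark-benchmarks.jar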

Spark SQL job optimization

I have a job which consists of around 9 SQL statements that pull data from Hive and write back to the Hive DB. It is currently running for 3 hours, which seems too long considering Spark's ability to process data. The application launches a total of 11 stages.
I did some analysis using the Spark UI and found the following grey areas which could be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between Job 4 and Job 5 is 20 minutes. I read about this time gap and found that Spark performs some I/O outside of Spark jobs, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the below configuration to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching an image of the UI as well.
Now my questions are:
Where can I find the driver log for this job?
In the image I see a long list of "Executor added" entries which sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one stage has around 35,000 tasks, while the rest of the stages have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility?
Below are some thoughts that may guide you to some extent:
Is it necessary to have one core per executor? You can have more cores in one executor; it is a trade-off between creating slim vs. fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions.
Ensure that while reading data from Hive you are using a SparkSession with Hive support enabled (essentially the HiveContext). This will pull the data into Spark memory from HDFS and the schema information from the Hive Metastore.
Yes, dynamic allocation of resources is a feature that helps in allocating the right set of resources, and it is better than a fixed allocation. A configuration sketch covering these points follows this list.
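A minimal sketch of how these settings could be wired together, assuming illustrative values (none of the numbers below come from the answer, and they should be tuned to the queue's limits):
# Hypothetical configuration sketch for the suggestions above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-sql-job")
         .enableHiveSupport()                               # schema from the Hive metastore, data from HDFS
         .config("spark.sql.shuffle.partitions", "2000")    # default is 200; a 1.5 TB shuffle usually needs more
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.maxExecutors", "400")
         .config("spark.shuffle.service.enabled", "true")   # needed for dynamic allocation on YARN
         .getOrCreate())

# The job's 9 Hive SQL statements would then run through spark.sql(...).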

Spark + Elasticsearch write performance issue

Seeing a low number of writes to Elasticsearch using Spark (Java).
Here are the configurations:
using i3.xlarge machines for the ES cluster
4 instances, each with 4 processors.
Set the refresh interval to -1, replicas to 0, and the other basic configurations required for better write performance.
Spark:
2-node EMR cluster with
2 Core instances
- 8 vCPU, 16 GiB memory, EBS-only storage
- EBS storage: 1000 GiB
1 Master node
- 1 vCPU, 3.8 GiB memory, 410 GB SSD storage
The ES index has 16 shards defined in the mapping.
Using the below config when running the job:
executor-memory - 8g
spark.executor.instances=2
spark.executor.cores=4
and using
es.batch.size.bytes - 6MB
es.batch.size.entries - 10000
es.batch.write.refresh - false
With this configuration I try to load 1 million documents (each document has a size of 1300 bytes), and the load runs at about 500 records/docs per second per ES node.
In the Spark log I see for each task:
1116 bytes result sent to driver
Spark Code
JavaRDD<String> javaRDD = jsc.textFile("<S3 Path>");
JavaEsSpark.saveJsonToEs(javaRDD,"<Index name>");
Also, when I look at the network-in graph of the ES cluster it is very low, and I see that EMR is not sending much data over the network. Is there a way I can tell Spark to send the right amount of data to make the writes faster?
OR
Is there any other config that I am missing to tweak?
Because 500 docs per second per ES instance seems low, can someone please advise what I am missing in these settings to improve my ES write performance?
Thanks in advance
You may have an issue here:
spark.executor.instances=2
You are limited to two executors, where you could have 4 based on your cluster configuration. I would change this to 4 or greater. I might also try executor-memory = 1500M, cores = 1, instances = 16. I like to leave a little overhead in memory, which is why I dropped from 2G to 1.5G (but you can't specify 1.5G, so it has to be 1500M). If you are connecting from your executors, this will improve performance.
I would need some code to debug further. I wonder if you are connecting to Elasticsearch only in your driver and not in your worker nodes, meaning you are only getting one connection instead of one per executor.
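As a rough sketch of the sizing suggested above (the class name and JAR are placeholders; the es.* batch settings quoted in the question would stay as they are):
# Hypothetical spark-submit settings reflecting the answer's suggestion.
spark-submit \
  --conf spark.executor.instances=16 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=1500M \
  --class com.example.EsLoadJob \
  es-load-job.jar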

Tuning Spark (YARN) cluster for reading 200GB of CSV files (pyspark) via HDFS

I'm working with an 11-node cluster running on AWS right now (1 master, 10 workers - c3.4xlarge) and I'm attempting to read in ~200GB of .csv files (only about 10 actual .csv files) from HDFS.
This process is going really slowly. I'm watching Spark at the command line and it looks like this:
[Stage 0:> (30 + 2) / 2044]
With progress increasing by +2 units (meaning 30+2 going to 32+2, then 34+2, etc.) maybe every 20 seconds. So this is in serious need of improvement, or else we'll be here for about 11 hours before the files are even done being read.
This is the code up to this point:
# AMAZON AWS EMR
from pyspark import SparkConf, SparkContext

def sparkconfig():
    conf = SparkConf()
    conf.setMaster("yarn-client")  # client mode sends output to the terminal
    conf.set("spark.default.parallelism", 340)
    conf.setAppName("my app")
    conf.set("spark.executor.memory", "20g")
    return conf

sc = SparkContext(conf=sparkconfig(),
                  pyFiles=['/home/hadoop/temp_files/redis.zip'])

path = 'hdfs:///tmp/files/'
all_tx = sc.textFile(path).coalesce(1024)
... more code for processing
Now clearly 1024 partitions may not be correct; that was just from googling and trying different things. I'm really at a loss when it comes to tuning this job.
The worker nodes on AWS are c3.4xlarge instances (I have 10 in the cluster), each with 30GB of RAM and 16 vCPUs. The HDFS partition is made up of the local storage of each node in the cluster, which is 2 x 160GB SSDs, so I believe we're looking at (2 * 160GB * 10 nodes / 3x replication) = ~1TB of HDFS.
The .csv files themselves range from 5GB to 90GB in size.
To clarify, in case it is relevant: the Hadoop cluster is the same as the Spark cluster as far as the nodes go. I'm allocating 20GB of the 30GB per node to the Spark executor, leaving 10GB per node for the OS + Hadoop/YARN, etc. The name node / Spark master node is an m3.xlarge, which has 4 vCPUs and 16GB of RAM.
Does anyone have suggestions on tuning options (or anything, really) that I might try to speed up this file-read process?
Shameless plug (author): try Sparklens, https://github.com/qubole/sparklens
Most of the time the real question is not whether the application is slow, but whether it will scale. And for most applications, the answer is: only up to a limit.
The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, the dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
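For reference, Sparklens is typically attached at submit time via its listener; a sketch along those lines (the package version may be out of date and the script name is a placeholder, so check the Sparklens README) could look like:
# Hypothetical invocation; verify the package coordinates against the Sparklens README.
spark-submit \
  --packages qubole:sparklens:0.3.2-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  my_csv_job.py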
