AWS Glue Spark job does not scale when partitioning DataFrame - apache-spark

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks. 1 DPU is reserved for the master and 1 executor is reserved for the driver. Since my development endpoint has 4 DPUs, I expect to have 5 executors and 20 tasks.
The script I am developing loads 1 million rows using a JDBC connection, coalesces the single one-million-row partition into 5 partitions, and writes the result to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. When I change the number of partitions to 10, the job again runs for 34 seconds. If I have 20 tasks available, why does the script take the same amount of time to complete with more partitions?
Edit: I started executing the script as an actual job instead of on the development endpoint. I set the number of workers to 10 and the worker type to Standard. Looking at the metrics, I can see that I have only 9 executors instead of 17, and only 1 executor is doing anything while the rest are idle.
Code:
...
df = spark.read.format("jdbc") \
    .option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"]) \
    .option("user", jdbc_config["user"]).option("password", jdbc_config["password"]) \
    .option("dbtable", query).option("fetchSize", 50000).load()
df = df.coalesce(17)  # coalesce returns a new DataFrame; the result must be reassigned
df.write.mode("overwrite").format("csv").option("compression", "gzip") \
    .option("maxRecordsPerFile", 1000000).save(job_config["s3Path"])
...

This is most likely a limitation of the connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the JDBC read options.
Since you are reading into a DataFrame, you can set the partition column together with the lower and upper bounds and the number of partitions. More can be found here.
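As a rough sketch of a partitioned read (the partition column name and bounds below are hypothetical; use a numeric, evenly distributed column from your own table):

df = spark.read.format("jdbc") \
    .option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"]) \
    .option("user", jdbc_config["user"]).option("password", jdbc_config["password"]) \
    .option("dbtable", query) \
    .option("partitionColumn", "id") \
    .option("lowerBound", 1).option("upperBound", 1000000) \
    .option("numPartitions", 20) \
    .option("fetchSize", 50000).load()

Without partitionColumn, the whole result set is pulled through a single JDBC connection into a single partition, so repartitioning or coalescing after the load cannot speed up the read itself.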
To size your DPUs correctly, I would suggest enabling the Spark UI; it can help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.

Related

Spark UI Executor

In the Spark UI, 18 executors were added and 6 executors were removed. When I checked the Executors tab, I saw many dead and excluded executors. Dynamic allocation is currently enabled on EMR.
I've looked up some posts about dead executors, but they mostly relate to job failures. In my case the job itself does not fail, yet I can still see dead and excluded executors.
What are these "dead" and "excluded" executors?
How does it affect the performance of current spark cluster configuration?
(If it affects performance) then what would be good way to improve the performance?
With dynamic allocation enabled, Spark tries to adjust the number of executors to the number of tasks in the active stages. Let's take a look at an example:
The job starts, and the first stage reads from a huge source, which takes some time. Let's say this source is partitioned and Spark generates 100 tasks to get the data. If each executor has 5 cores, Spark is going to spawn 20 executors to get the best parallelism (20 executors x 5 cores = 100 tasks in parallel).
Suppose the next step is a repartition or a sort-merge join with shuffle partitions set to 200, so Spark generates 200 tasks. Spark is smart enough to figure out that it currently has only 100 cores available, so if new resources are available it will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel).
Now the join is done, and the next stage has only 50 partitions. To process these in parallel you don't need 40 executors; 10 are enough (10 executors x 5 cores = 50 tasks in parallel). If this stage runs long enough, Spark can free some resources, and that is when you see removed executors.
The next stage again involves a repartition, this time to 200 partitions. With 10 executors you can process only 50 partitions in parallel, so Spark will try to acquire new executors...
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that it requires you to set subproperties. Some example subproperties are spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors. Subproperties are required for most cases to use the right number of executors in a cluster for an application, especially when you need multiple applications to run simultaneously. Setting subproperties requires a lot of trial and error to get the numbers right. If they're not right, the capacity might be reserved but never actually used. This leads to wastage of resources or memory errors for other applications.
Here you will find some hints: from my experience it is worth setting maxExecutors if you are going to run a few jobs in parallel on the same cluster, as most of the time it is not worth starving other jobs just to squeeze 100% efficiency out of a single one.
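A minimal sketch of such a configuration, assuming it is applied before the session is created (the executor counts are illustrative, not a recommendation; tune them to your cluster and queue limits):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.initialExecutors", "2") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "20") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .getOrCreate()
# maxExecutors caps the job so it cannot starve the others on the cluster;
# shuffleTracking (Spark 3.x) is an alternative to running the external shuffle service.

On EMR these properties are usually set in spark-defaults or passed with --conf at submit time; setting them in code only takes effect if it happens before the SparkContext starts.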

Do multiple spark sessions which query on the same partition in Hadoop table make the query slower?

I have multiple Spark sessions on different driver nodes querying the same HDFS file, let's say table T1. Below is the structure of T1:
Partitioned by date
Then partitioned by the first 2 digits of the user id
User id field
Other related information about the user
partition date | partition first 2 digits | user id | other info
17             | 01                       | 01234   | ...
18             | 01                       | 01234   | ...
18             | 02                       | 02345   | ...
Now I tried two scenarios:
Create 8 Spark sessions on different driver nodes (using cluster mode) and concurrently query the exact same partition and user id. Every session acquires all the resources allowed in the code; for instance, each session takes
1 driver / 1 CPU / 4 GB
4 executors / 1 CPU per executor / 4 GB per executor
And below is the sample query.
SELECT * FROM T1
WHERE user_id = '01234' AND partition_date IN ('17', '18') AND partition_first_2_digit IN ('01', '02')
Create only 1 Spark session and query using the same string as above. This session also acquires all the resources allowed in the code, which is equal to the multiple-session case:
1 driver / 1 CPU / 4 GB
4 executors / 1 CPU per executor / 4 GB per executor
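For reference, a per-application resource setup like the one listed above might look roughly like this (a sketch only; driver cores and memory normally have to be set at submit time, before the driver JVM starts):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.driver.cores", "1") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
# 1 driver core / 4 GB, plus 4 executors with 1 core / 4 GB each, matching the limits above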
And the results surprised me: the query time per application in the multiple-session case is 10 minutes, which is much higher than the single session at 5 minutes!
I am curious why multiple Spark sessions make the query so much slower. Does anyone have the same problem?
Thanks in advance everyone!
In theory the performance should be the same in both cases, but there are other factors that may be influencing the scheduling of the jobs. From my experience, you could be hitting any of these constraints:
The job pool in which you are firing your Spark jobs has fewer resources than required by 8 instances of your job. Let's say each instance/run requires 4 executors; running 8 instances of the job means you are requesting 32 executors in total. If your job queue is configured to support only 20 executors (executors could be limited by the number of cores or the memory allocated to the queue, or both), your 8 jobs cannot achieve maximum parallelism and hence appear slow compared to a single instance running at the same time, which gets all the resources it needs and achieves maximum parallelism.
Your cluster is running at full capacity and utilising most of its resources for other production jobs, due to which your 8 instances of the same job are all waiting for resources at the same time.
If you ran only a single test of the 8 instances, there could also have been a write job happening on the parent directory or Hive table at the time, which could have caused your jobs to enter a wait state.
These are some of the problems I have seen in my career; there could be other factors as well. I would suggest getting access to the YARN scheduler or queues from your sysadmin and checking whether one of these is the culprit.

Spark Sql Job optimization

I have a job consisting of around 9 SQL statements that pull data from Hive and write back to a Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the following grey areas which could be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between Job 4 and Job 5 is 20 minutes. I read about this gap and found that Spark performs some I/O outside of any Spark job, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the following conf to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching an image of the UI as well.
Now my questions are:
Where can I find driver log for this job?
In the image, I see a long list of executors added, which sums to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one stage has around 35000 tasks, while the rest of the stages have only 200 tasks. Should I increase the number of executors or should I go for Spark's dynamic allocation facility?
Below are some thoughts that may guide you to some extent:
Is it necessary to have only one core per executor? Executors do not always have to be that slim; you can give each executor more cores. It is a trade-off between slim and fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions.
Ensure that while reading data from Hive you use a SparkSession (essentially a HiveContext). This pulls the data into Spark memory from HDFS and the schema information from the Hive metastore (see the sketch after this list).
Yes, dynamic allocation of resources is a feature that helps allocate the right amount of resources. It is better than a fixed allocation.
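A minimal sketch of that session setup, assuming Hive is configured on the cluster (the app name, table name, and shuffle-partition value are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive-etl") \
    .enableHiveSupport() \
    .config("spark.sql.shuffle.partitions", "400") \
    .getOrCreate()
# enableHiveSupport reads table schemas from the Hive metastore;
# spark.sql.shuffle.partitions defaults to 200 - tune it to your shuffle volume.

df = spark.sql("SELECT * FROM some_db.some_table")  # placeholder table name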

Spark 12 GB data load with Window function Performance issue

I am using Spark SQL to transform 12 GB of data. My transformation applies a row_number function partitioned by one of the fields, then splits the data into two sets (the first set where row_number is 1, and a second set containing the rest of the data) and writes the data to the target location in 30 partitions.
My job currently takes approximately 1 hour. I want to run it in less than 10 minutes.
I am running this job on a 3-node cluster (16 cores and 32 GB RAM per node).
Node 1: YARN master node.
Node 2: two executors, 1 for the driver and 1 other.
Node 3: two executors, both for processing.
Each executor is assigned 5 cores and 10 GB of memory.
Is my hardware enough or do I need more powerful hardware?
Is the executor configuration right?
If both the hardware and the configuration are fine, then I definitely need to improve my code.
My code is as follows:
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc, lit, when
from pyspark.sql.window import Window

sqlContext = SQLContext(sc)
SalesDf = sqlContext.read.options(header='true').load(path, format='csv')
SalesDf.cache()
# rank records per id, newest recorddate first
SalesDf_Version = SalesDf.withColumn('row_number', F.row_number().over(Window.partitionBy("id").orderBy(desc("recorddate"))))
initialLoad = SalesDf_Version.withColumn('partition', SalesDf_Version.year).withColumn('isActive', when(col('row_number') == 1, lit('Y')).when(col('row_number') != 1, lit('N')))
initialLoad = initialLoad.withColumn('version_flag', col('isActive')).withColumn('partition', col('city'))
initialLoad = initialLoad.drop('row_number')
initialLoad.coalesce(1).write.partitionBy('isActive', 'partition').option("header", "true").mode('overwrite').csv(path + 'Temp/target/')
sc.stop()
Thanks in advance for your help
You have a coalesce(1) before writing; what is the reason for that? coalesce reduces the parallelism of that stage, which in your case will cause the windowing query to run on a single core, so you are losing the benefit of the 16 cores per node.
Remove the coalesce and that should start improving things.
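A sketch of what the write could look like without the coalesce(1) (30 simply matches the partition count mentioned in the question; tune it to your data and cluster):

# let the window stage stay fully parallel, then control the output file count at write time
initialLoad.repartition(30) \
    .write.partitionBy('isActive', 'partition') \
    .option("header", "true").mode('overwrite') \
    .csv(path + 'Temp/target/')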
Following are the changes we implemented to improve the performance of our code:
We removed coalesce and used repartition(50). We tried higher and lower numbers, but 50 was the optimal number in our case.
We were using S3 as our target, but it was costing us a lot because of the rename step Spark performs when committing output, so we used HDFS instead, and our job time was reduced to half of what it was before.
Overall, with the above changes, our code now runs in 12 minutes; previously it took 50 minutes.
Thanks
Ammar

Processing time for my Spark program does not decrease on increasing the number of nodes in the cluster

I have a Cloudera cluster with 3 nodes on which Apache Spark is installed. I am running a Spark program which reads data from HBase tables, transforms the data, and stores it in a different HBase table. With 3 nodes, the time taken is approximately 1 minute 10 seconds for 5 million rows of HBase data. On decreasing or increasing the number of nodes the time taken stayed the same, whereas it was expected to decrease when the number of nodes was increased and increase when it was decreased. The times taken were:
1) With 3 nodes: Approximately 1 minute 10 seconds for 5 million rows.
2) With 1 node: Approximately 1 minute 10 seconds for 5 million rows.
3) With 6 nodes: Approximately 1 minute 10 seconds for 5 million rows.
What can be the reason for same time taken despite increasing or decreasing the number of nodes?
Thank You.
By default, HBase will probably read the 5 million rows from a single region, or maybe 2 regions; that is the degree of parallelism. The write will likewise go to a single region, or maybe 2, depending on the scale of the data.
Is Spark your bottleneck? Allocating different resources (more or fewer cores or memory) will only change the overall job time if the computation in the job is the bottleneck.
If your computation (the transform) is relatively simple, the bottleneck is likely reading from or writing to HBase. In that case, irrespective of how many nodes or cores you give it, the run time will be constant.
From the runtimes you have mentioned, it seems that is the issue.
The bottleneck may be on the HBase side, the Spark side, or both. On the HBase side, check the number of regions serving your tables; this determines the read and write parallelism of the data. The more, the better, usually. You should also watch out for hotspotting.
Spark-side parallelism can be checked by looking at the number of partitions of the RDD/DataFrame holding your data; maybe you should repartition your data (see the quick check sketched below). Beyond this, cluster resource utilisation may be your problem. To check it, monitor the Spark master web interface: number of nodes, number of workers per node, number of jobs and tasks per worker, and so on. Also check the number of CPUs and the amount of RAM used per worker in this interface.
For details here
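A quick way to check and adjust the Spark-side parallelism mentioned above (the DataFrame name and partition count are placeholders):

print(df.rdd.getNumPartitions())  # number of partitions = number of tasks for the next stage
df = df.repartition(48)           # illustrative value; roughly match the total executor cores available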
