I have written a Spark job that performs the following operations:
1) Reads data from HDFS text files.
2) Calls distinct() to filter duplicates.
3) Runs a mapToPair phase to generate a pair RDD.
4) Calls reduceByKey.
5) Applies the aggregation logic to each grouped tuple.
6) Calls foreach on the result of #5. Inside the foreach it does the following:
   a) makes a call to the Cassandra DB
   b) creates an AWS SNS and SQS client connection
   c) does some JSON record formatting
   d) publishes the record to SNS/SQS
When I run this job it creates three Spark stages:
First stage: performs the distinct and takes nearly 45 seconds.
Second stage: mapToPair and reduceByKey, takes 1.5 minutes.
Third stage: takes 19 minutes.
What I did:
I turned off the Cassandra call to see how much the DB hit costs; that part takes little time.
The offending part I found is creating the SNS/SQS connection for each partition.
It takes more than 60% of the entire job time.
I am creating the SNS/SQS connection within foreachPartition to reduce the number of connections. Is there an even better way?
I cannot create the connection object on the driver because it is not serializable.
I am using 9 executors, 15 executor cores, 2g driver memory and 5g executor memory.
Each machine has 16 cores and 64 GB of memory.
Cluster size: 1 master and 9 slaves, all with the same configuration.
EMR deployment, Spark 1.6.

It sounds like you would want to set up exactly one SNS/SQS connection per node and then use it to process all of your data on each node.
I think foreachPartition is the right idea here, but you might want to coalesce your RDD beforehand. This will collapse partitions on the same node without shuffling, and will allow you to avoid starting extra SNS/SQS connections.
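A pattern that often goes together with coalesce is to keep the client in a lazily initialized, per-process holder, so it is created on the worker (never serialized from the driver) and reused by every partition that process handles. Below is a minimal sketch of that idea in plain Python, without Spark. FakeSnsClient, get_client and publish_partition are hypothetical names made up for illustration, and the loop at the end only simulates one executor process receiving three partitions. It also assumes the real client is safe to share between tasks running in the same process.

```python
from typing import Iterable, List, Optional


class FakeSnsClient:
    """Hypothetical stand-in for an expensive, non-serializable SNS/SQS client."""

    instances_created = 0  # counts how many "connections" were opened

    def __init__(self) -> None:
        FakeSnsClient.instances_created += 1
        self.published: List[str] = []

    def publish(self, message: str) -> None:
        self.published.append(message)


# One holder per worker process; it is filled on the worker, not on the driver.
_client: Optional[FakeSnsClient] = None


def get_client() -> FakeSnsClient:
    """Create the client on first use and reuse it for every later partition."""
    global _client
    if _client is None:
        _client = FakeSnsClient()
    return _client


def publish_partition(records: Iterable[str]) -> None:
    """The kind of function that would be handed to foreachPartition."""
    client = get_client()
    for record in records:
        client.publish(record)


# Simulate one executor process handling three partitions in a row.
for partition in (["a", "b"], ["c"], ["d", "e", "f"]):
    publish_partition(partition)

print(FakeSnsClient.instances_created)  # 1
```

With this in place the cost of opening the connection is paid once per process instead of once per partition, and coalesce then only has to bring the partition count down to something reasonable rather than to exactly one per node.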

Spark SQL slow execution with resource idle

I have a Spark SQL job that used to finish in under 10 minutes and now runs for 3 hours after a cluster migration, so I need to dig into what it is actually doing. I'm new to Spark, so please bear with me if I ask something unrelated.
I increased spark.executor.memory, but no luck.
Env: Azure HDInsight Spark 2.4 on Azure Storage
SQL: reads and joins some data, then writes the result to a Hive metastore.
The spark.sql script ends with the code below:
Application behavior:
Within the first 15 minutes it loads the data and completes most tasks (199/200); only 1 executor process is left alive, and it continually shuffle reads and writes data. Because only 1 executor is left, we have to wait 3 hours for the application to finish.
Only 1 executor left alive
Not sure what the executor is doing:
From time to time we can see that the shuffle read has increased:
Therefore I increased spark.executor.memory to 20g, but nothing changed. From Ambari and YARN I can tell the cluster has plenty of resources left.
Release of almost all executors
Any guidance is greatly appreciated.
I would like to start with some observations about your case:
From the task list you can see that Shuffle Spill (Disk) and Shuffle Spill (Memory) both have very high values. The maximum block size for each partition during the exchange of data should not exceed 2GB, so you should keep the size of the shuffled data as low as possible. As a rule of thumb, the size of each partition should be ~200-500MB. For instance, if the total data is 100GB you need at least 250-500 partitions to keep the partition size within those limits.
The presence of both spill metrics also means that executor memory was not sufficient and Spark was forced to spill data to disk.
The duration of the tasks is too high. A normal task should last between 50-200ms.
Too many killed executors is another sign that you are facing OOM problems.
Locality is RACK_LOCAL, which is considered one of the lowest levels you can get within a cluster. Briefly, it means that the task is being executed on a different node than the one where the data is stored.
As a solution I would try the following:
Increase the number of partitions, either with repartition() or through the Spark setting spark.sql.shuffle.partitions, to a number that meets the requirements above, i.e. 1000 or more.
Change the way you store the data and introduce partitioned data, e.g. by day/month/year, using partitionBy.
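The partition-size rule of thumb above is easy to turn into quick arithmetic. The sketch below is plain Python, not Spark code. The helper names are made up for illustration, sizes are in decimal megabytes (1 GB = 1000 MB), and the 400 MB and 200 MB targets are just two sample points inside the suggested range.

```python
def partitions_needed(total_mb: int, target_partition_mb: int) -> int:
    """Smallest partition count that keeps every partition at or below the target size."""
    return (total_mb + target_partition_mb - 1) // target_partition_mb  # integer ceiling


def partition_size_mb(total_mb: int, num_partitions: int) -> float:
    """Average partition size for a given partition count, assuming no skew."""
    return total_mb / num_partitions


total_mb = 100 * 1000  # the 100GB example, in decimal MB

print(partitions_needed(total_mb, 400))   # 250
print(partitions_needed(total_mb, 200))   # 500
print(partition_size_mb(total_mb, 1000))  # 100.0
```

Keep in mind that this assumes evenly sized partitions. When a single task out of 200 runs for hours, as in the question, the data is probably skewed, and one partition can be far larger than the average this calculation gives.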

Spark Sql Job optimization

I have a job which consists of around 9 SQL statements that pull data from Hive and write back to a Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the grey areas below, which can be improved:
Stage 8 in Job 5 has a shuffle output of 1.5 TB.
The time gap between job 4 and job 5 is 20 minutes. I read about this kind of gap and found that Spark performs IO outside of Spark jobs, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the configuration below to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching an image of the UI as well.
Now my questions are:
Where can I find the driver log for this job?
In the image I see a long list of "Executor added" events which sum to more than 200, but in the Executors tab the number is exactly 200. Is there any explanation for this?
Out of all the stages, only one stage has around 35000 tasks; the rest of the stages have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation?
Below are some thoughts that may guide you to some extent:
Is it necessary to have one core per executor? An executor need not always be fat, but you can give it more than one core. It is a trade-off between slim and fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions.
Ensure that you read data from Hive through a SparkSession with Hive support enabled (the replacement for HiveContext). This pulls the data from HDFS into Spark memory and the schema information from the Hive metastore.
Yes, dynamic allocation of resources is a feature that helps in allocating the right set of resources. It is usually better than a fixed allocation.
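To see how the executor layout and the partition count interact, it can help to count how many "waves" of tasks a stage needs. The sketch below is plain Python, not Spark code. It assumes one task per core (the default spark.task.cpus of 1), task_waves is a made-up helper name, and the 4-cores-per-executor layout is only a hypothetical alternative to the 200 x 1 setup from the question.

```python
def task_waves(num_tasks: int, num_executors: int, cores_per_executor: int) -> int:
    """Rounds of tasks a stage needs when every core runs one task at a time."""
    slots = num_executors * cores_per_executor
    return (num_tasks + slots - 1) // slots  # integer ceiling


# Layout from the question: 200 executors with 1 core each.
print(task_waves(35000, 200, 1))  # 175
print(task_waves(200, 200, 1))    # 1

# Hypothetical fatter layout: 200 executors with 4 cores each.
print(task_waves(35000, 200, 4))  # 44
print(task_waves(200, 200, 4))    # 1
```

More cores shorten the 35000-task stage, but the stages with only 200 tasks cannot use the extra slots at all. That is why the core count and spark.sql.shuffle.partitions are worth tuning together.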

Spark concurrency performance issue Vs Presto

We are benchmarking Spark with Alluxio against Presto with Alluxio. To evaluate the performance we took 5 different queries (with some joins, group by and sort) and ran them on a 650GB dataset in ORC.
The Spark execution environment is set up in such a way that we have an ever-running Spark context and we submit queries through a REST API (Jetty server). We are not considering the first batch's execution time for this load test, as it takes a little longer because of task deserialization and so on.
What we observed while evaluating is that when we ran the queries individually, or even all 5 queries concurrently, Spark performed very well compared to Presto and finished all the execution in half the time Presto took.
But for the actual load test we executed 10 batches (one batch is these 5 queries submitted at the same time) with a batch interval of 60 sec. At this point Presto performs a lot better than Spark. Presto finished all jobs in ~11 mins, while Spark takes ~20 mins to complete all the tasks.
We tried different configurations to improve Spark concurrency, such as:
Using 20 pools with equal resource allocation and submitting jobs in a round-robin fashion.
Using one FAIR pool, submitting all jobs to this default pool and letting Spark decide on resource allocation.
Tuning some Spark properties like spark.locality.wait and some other memory-related Spark properties.
Making all tasks NODE_LOCAL (we replicated data in Alluxio to make this happen).
Playing around with executor memory allocation, e.g. 35 small executors (5 cores, 30G) and also (60 cores, 200G) executors.
But all of them result in the same execution time.
We used dstat on all the workers to see what was happening while Spark was executing tasks, and we could see no or minimal IO or network activity. CPU was always at 95%+ (it looks CPU-bound). (We saw almost the same dstat output with Presto.)
Can someone suggest something we can try to achieve similar or better results than Presto?
And is there any explanation why Presto performs better than Spark under concurrency? We observed that Presto's 1st batch takes more time than the succeeding batches. Is Presto caching some data in memory that Spark is missing? Or is Presto's resource management / execution planning better than Spark's?
Note: Both clusters are running with the same hardware configuration.

Spark write to CSV fails even after 8 hours

I have a dataframe with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the Spark shell (Scala) on an Elastic MapReduce cluster.
Here's how I'm writing to CSV:
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the csv data it is based on in roughly 22 minutes. In this same program, I'm also able to write another (smaller) dataframe to csv in 8 minutes. However, for this result dataframe it takes 8+ hours and still fails ... saying one of the connections was closed.
I'm also running this job on 13 x c4.8xlarge instances on ec2, with 36 cores each and 60 gb of ram, so I thought I'd have the capacity to write to csv, especially after 8 hours.
Many stages required retries or had failed tasks and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
I ran the following command to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with numpartitions
However, even after several hours, the jobs are still failing. What am I still doing wrong?
Note: I'm running the Spark shell via spark-shell yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
But I had many stages fail, returning either "Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3" or "Connection from ip-172-31-48-180.ec2.internal/ closed".
Executors page
Spark web UI page for a node returning a shuffle error
Spark web UI page for a node returning an ec2 connection closed error
Overall Job Summary page
I can see from the Spark UI that it never even got to the write CSV
stage and was busy with persist stages, but without the persist
function it was still failing after 8 hours. Any ideas?
It is a FetchFailedException, i.e. a failure to fetch a shuffle block.
Since you are able to deal with small files and it fails only with huge data,
I strongly suspect that there are not enough partitions.
The first thing to do is verify/print sources.rdd.getNumPartitions(), destinations.rdd.getNumPartitions() and result.rdd.getNumPartitions().
You need to repartition after the data is loaded in order to distribute the data (via a shuffle) to the other nodes in the cluster. This will give you the parallelism you need for faster processing without failures.
Furthermore, to verify the other configurations applied,
print all the config like this and adjust the values as needed.
Also have a look at
Spark-TaskRunner-FetchFailedException Possible reasons : OOM or Container memory limits
Repartition both sources and destinations before joining, with a number of partitions such that each partition is 10MB - 128MB (try to tune this); there is no need to make it 20000 (IMHO too many).
Then join on those two columns and write, without repartitioning again (i.e. the output partitions should be the same as after the repartitioning before the join).
If you still have trouble, try the same thing after converting both dataframes to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).

Spark Streaming not distributing task to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is sample code that demonstrates the process I am executing.
val ssc=new StreamingContext()
val inDStream = ssc.receiverStream // batch of 500 ms as I would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = // converting to pair dstream
val stateStream = keyDStream.updateStateByKey // updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and materialize the state stream
val withHistory = keyDStream.join(stateStream) // joining state with input stream for further processing
val alertStream = withHistory.filter // decision to be taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes, or not distributing tasks to the other node, which causes high response latency. My input load is around 100,000 tuples per second.
I have tried the things below, but nothing is working:
1) Set spark.locality.wait to 1 sec.
2) Reduced the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even when usage goes beyond the memory limit of the first node (m1), where the driver is also running.
3) Increased spark.streaming.concurrentJobs from 1 (default) to 3.
4) Checked in the streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1.
If I run SparkPi 100000, Spark is able to utilize the other node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
One thing I have noticed is that even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows Memory Serialized 1x Replicated.
Spark will not distribute stream data across the cluster automatically, because it tries to make full use of data locality (launching a task where its data lies is preferred; this is the default configuration). But you can use repartition to distribute the stream data and improve the parallelism.
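For receiver-based streams there is also a simple way to estimate how many tasks each batch starts with: received data is cut into blocks every spark.streaming.blockInterval (200 ms by default), and each block becomes one partition of the batch RDD. The sketch below is plain Python arithmetic, not Spark code. tasks_per_batch is a made-up helper name, and the 50 ms value is only an example of a smaller block interval.

```python
def tasks_per_batch(batch_interval_ms: int,
                    block_interval_ms: int = 200,
                    receivers: int = 1) -> float:
    """Approximate number of blocks (and so initial tasks) in each batch."""
    return receivers * batch_interval_ms / block_interval_ms


# 500 ms batches, as in the question, with the default block interval.
print(tasks_per_batch(500))                        # 2.5

# Same batch interval with a smaller block interval of 50 ms.
print(tasks_per_batch(500, block_interval_ms=50))  # 10.0
```

With only two or three blocks per batch, all stored on the receiver's node, there is very little for the scheduler to spread out. Calling repartition on the input DStream, as suggested above, trades a shuffle for more and better distributed tasks.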
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI and not the master URI.
The default value of the spark.default.parallelism property depends on the cluster manager (in local mode it is the number of cores on the local machine), and with a low value all the tasks may be executed on the node that is receiving the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
