High shuffle write and high execution time - apache-spark

I'm using EMR and I'm developing a collaborative-filtering approach with ALS. I have three doubt:
In order to see the execution time in Spark UI, I do several experiments. I've noticed that with one master and 4 workers the execution time is less than the EMR cluster with one master and six workers. Anyone know why?
The other thing is shuffle write. With one master and six workers, I have 3,2GB. Is it too high, isn't it? In the code I use RDD, groupbykey and I do two join. How can I minimize it?
With one master and six workers, the execution time is 7,5 min. Considering I'm using the Movielens Dataset, with Machine Learning approach I can't understand if this execution time is too high or it's quite good.
I attach a picture with the result form the Spark UI.
Thank you in advance

Related

Apache Spark: is it possible to get dataset counts in a spark job?

Sometimes some Spark job which runs in our cluster runs too long not because of bad optimization, but because of bad logic of the algorithm. In most cases this is a consequence of some unnecessary joins that produce too many rows. Normally we spot such jobs by looking at Spark execution plan where we can find such joins by looking at "number of output rows: xxx" in blue stage labels.
I want to understand - is it possible to optimize this procedure and somehow automatically notify the programmer that the job has too many rows in some dataset (after execution)?
Maybe we can print this in logs (without manually counting dataset's size in code)?
Maybe after running the job we can get the output of the execution plan somehow and save it for further investigations?
No, it's not an option. Spark will do its best to optimize the query plan, so manual interaction with lower execution level is pretty much limited. However, you can "control" the rows for each jobs/tasks by changing some configurations (like spark.sql.shuffle.partitions or spark.sql.files.maxPartitionBytes), or by repartitioning data, which will cause data to be shuffled and re-distributed nearly equally between executors.

How do I process non-real time data in batches in Spark?

I am new Big data and Spark. I have to work on real-time data and old data from the past 2 years. There are around a million rows for each day. I am using PySpark and Databricks. Data is partitioned on created date. I have to perform some transformations and load it to a database.
For real-time data, I will be using spark streaming (readStream to read, perform transformation and then writeStream).
How do I work with the data from the past 2 years? I tried filtering data from 30 days I got good throughput. Should I be running the process on all 2 years of data at once or should I doing it in batches? If I perform this processes in batches, does Spark provide a way to batch it or do I do it in Python. Also, do I run these batches in parallel or in sequence?
It is kind of open ended but let me try to address your concerns.
How do I work with the data from the past 2 years? I tried filtering data from 30 days I got good throughput. Should I be running the process on all 2 years of data at once or should I doing it in batches?
Since you are new to Spark, do it in batches and start by running 1 day at a time, then 1 week and so one. Get your program to run successfully and optimize. As you increase the batch size you can increase your cluster size using Pyspark Dataframes (not Pandas). If your job is verified and efficient, you can run monthly, bi-monthly or larger batches (smaller jobs are better in your case).
If I perform this processes in batches, does Spark provide a way to batch it or do I do it in Python. Also, do I run these batches in parallel or in sequence?
You can use the date range as parameters to your Databricks job and use data bricks to schedule your jobs to ran back to back. Sure you can run them in parallel on different clusters but the whole idea with Spark is to use Sparks distributed capability and run your job on as many worker nodes as your job requires. Again, get one small job to work and validate your results, then validate a larger set and so on. If you feel confident, start a large cluster (many and fat workers) and run a large date range.
It is not an easy task for a newbie but should be a lot of fun. Best wishes.

Processing Pipeline using Spark SQL- jobs, stages and DAG sizes

I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) in order to achieve the functional output. Now, these operations are quite in number (more than 100), which means I am running around 50 to 60 spark sql queries in a single pipeline. While the application completes successfully without any issues, my focus area has shifted to optimizing the overall process. I have been able to speed up the executions using spark.sql.shuffle.partitions, changing the executor memory and reducing the size of the spark.memory.fraction from default 0.6 to 0.2. I got great benefits by doing all these changes and the over all execution time reduced from 20-25 mins to around 10 mins. Data volume is around 100k rows (source side).
The observations that I have from the Cluster are:
-The number of jobs triggered as apart of application id are 235.
-The total number of stages across all the jobs created are around 600.
-8 executors are used in a two node cluster (64 GB RAM in total with 10 cores).
-The resource manager UI of Yarn (for an application id) becomes very slow to retrieve the details of jobs/stages.
In one of the videos of Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum, also DAG size should be smaller. What are the guidelines to do this. How to find the number of shuffles that are happening (my SQLs have many joins and group by clauses).
I would like to have suggestions on the above scenario of what possible things I can do in order to improvise the performance and handle the data skews in the SQL queries that are JOIN/GROUP_BY heavy.
Thanks

Spark concurrency performance issue Vs Presto

We are benchmarking spark with alluxio and presto with alluxio. For evaluating the performance we took 5 different queries (with some joins, group by and sort) and ran this on a dataset 650GB in orc.
Spark execution environment is setup in such a way that we have a ever running spark context and we are submitting queries using REST api (Jetty server). We are not considering first batch execution time for this load test as its taking little more time because of task deserialization and all.
What we observed while evaluating is that when we ran individual queries or even all these 5 queries executed concurrently, spark is performing very well compared to presto and is finishing all the execution in half the time than of presto.
But for actual load test, we executed 10 batches (one batch is this 5 queries submitted at the same time) with a batch interval of 60 sec. At this point presto is performing a lot better than spark. Presto finished all job in ~11 mins and spark is taking ~20 mins to complete all the task.
We tried different configurations to improve spark concurrency like
Using 20 pools with equal resource allocation and submitting jobs in a round robin fashion.
Tried using one FAIR pool and submitted all jobs to this default pool and let spark decide on resource allocations
Tuning some spark properties like spark.locality.wait and some other memory related spark properties.
All tasks are NODE_LOCAL (we replicated data in alluxio to make this happen)
Also tried playing arround with executor memory allocation, like tried with 35 small executors (5 cores, 30G) and also tried with (60core, 200G) executors.
But all are resulting in same execution time.
We used dstat on all the workers to see what was happening when spark was executing task and we could see no or minimal IO or network activity . And CPU was alway at 95%+ (Looks like its bounded on CPU) . (Saw almost similar dstat out with presto)
Can someone suggest me something which we can try to achieve similar or better results than presto?
And any explanation why presto is performing well with concurrency than spark ? We observed that presto's 1st batch is taking more time than the succeeding batches . Is presto cacheing some data in memory which spark is missing ? Or presto's resource management/ execution plan is better than spark ?
Note: Both clusters are running with same hardware configuration

Spark Performance issue while adding more worker nodes

I am being new on Spark. I am facing performance issue when the number of worker nodes are increased. So to investigate that, I have tried some sample code on spark-shell.
I have created a Amazon AWS EMR with 2 worker nodes (m3.xlarge). I have used the following code on spark-shell on the master node.
var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()
This code executed without any issues and took around 8 mins. But when I have added 2 more worker nodes (m3.xlarge) and executed the same code using spark-shell on master node, the time increased to 10 mins.
Here is the issue, I think the time should be decreased, not by half, but I should decrease. I have no idea why on increasing worker node same spark job is taking more time. Any idea why this is happening? Am I missing any thing?
This should not happen, but it is possible for an algorithm to run slower when distributed.
Basically, if the synchronization part is a heavy one, doing that with 2 nodes will take more time then with one.
I would start by comparing some simpler transformations, running a more asynchronous code, as without any sync points (such as group by key), and see if you get the same issue.
#z-star, yes an algorithm might b slow when distributed. I found the solution by using Spark Dynamic Allocation. This enable spark to use only required executors. While the static allocation runs a job on all executors, which was increasing the execution time with more nodes.

Resources