Spark saveAsTable takes a lot of time - apache-spark

I am trying to perform a few joins on different Hive tables using Spark, and then to save the final table back into Hive.
The problem is that the saveAsTable stage takes almost 12 minutes. The table has 16 million rows.
There are two executors and 64 tasks in total. Most tasks process around 17 MB each, but the last task processes 250 MB of data.
I tried repartitioning to 264 partitions, but that creates a new stage after the stage above, which seems weird.
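A minimal sketch of the usual first fix, assuming Spark 2.x: round-robin repartition the joined DataFrame right before saveAsTable so no single task carries the skewed 250 MB slice. The table and column names below are placeholders; note that repartition always adds a shuffle, so the extra stage it creates is expected.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("save-joined-table")
  .enableHiveSupport()
  .getOrCreate()

val a = spark.table("db.table_a")        // placeholder Hive tables
val b = spark.table("db.table_b")
val joined = a.join(b, Seq("join_key"))  // placeholder join key

// repartition(n) round-robins rows across n partitions, evening out the skewed task;
// it always introduces a shuffle, which is why a new stage appears before the write.
joined
  .repartition(64)
  .write
  .mode("overwrite")
  .saveAsTable("db.final_table")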

Related

Creating Dynamic Partitions in Hive is taking too much time

I have a Parquet Hive table with date and hour as the partitioning columns. My Spark job runs at an interval of 3 hours, and every run creates dynamic partitions. The job itself completes quickly, but creating the partitions takes a long time. Is there any way to speed this process up?
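A minimal sketch of this kind of dynamic-partition insert, with placeholder table and column names; repartitioning by the partition columns first keeps the write to roughly one file per (date, hour) partition, which is usually the first lever for cutting partition-creation overhead.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Hive needs these for dynamic-partition inserts issued from Spark.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

val batch = spark.table("db.staging_events")   // placeholder source table

batch
  // Group rows by the partition columns so each (date, hour) partition gets few output files.
  .repartition(batch("date"), batch("hour"))
  // insertInto matches columns by position, so the partition columns must come last.
  .select("payload", "date", "hour")
  .write
  .mode("append")
  .insertInto("db.events")                     // placeholder partitioned target table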

Why is Drill fastest with one partition?

My cluster has 6 nodes, each with 2 cores. I have a Spark job saving a Parquet file of ~150 MB to HDFS. If I repartition my dataframe to 6 partitions before saving, Drill queries are actually 30-40% slower than when I repartition it to 1 partition. Why is that? Is it expected? Can it indicate an issue with my setup?
Update
Results of the same SQL query in seconds (3 runs per number of partitions):
1 partition: 1.238, 1.29, 1.404
2 partitions: 1.286, 1.175, 1.259
3 partitions: 1.699, 1.8, 1.7
6 partitions: 2.223, 1.96, 1.772
12 partitions: 1.311, 1.335, 1.339
24 partitions: 1.261, 1.302, 1.235
48 partitions: 1.664, 1.757, 2.133
As you can see, 1, 2, 12 and 24 partitions are fast, while 3, 6 and 48 partitions are very clearly slower. What could be causing that?
When saving the Parquet file in Spark using a single partition, you save the file locally on a single node. Once that happens, replication needs to kick in and distribute the file over the other nodes.
When saving the Parquet file in Spark using multiple partitions, Spark saves the file already distributed, although perhaps not exactly the way HDFS needs it. Replication and redistribution still need to kick in, but now in a much more complex situation.
Then, depending on your Spark process, your data might already be sorted differently (1 vs. multiple partitions), potentially making it more suitable for the next process in line (Drill).
I really cannot pinpoint a reason, but with such a small difference in time (you are talking seconds), I am not sure the difference is even significant.
We might also need to question the test method: Java garbage collection, background processes running (including the replication process), and so on.
One suggestion I would have is to leave your HDFS cluster at rest for a while, to make sure replication and other processes quiet down, before you start the Drill measurements.
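For reference, the experiment described above boils down to something like the following sketch (the HDFS paths are placeholders):

import org.apache.spark.sql.SparkSession

// Write the same ~150 MB dataset with different partition counts,
// then query each output directory from Drill.
val spark = SparkSession.builder().appName("drill-partition-test").getOrCreate()
val df = spark.read.parquet("hdfs:///data/source")

for (n <- Seq(1, 2, 3, 6, 12, 24, 48)) {
  df.repartition(n)
    .write
    .mode("overwrite")
    .parquet(s"hdfs:///data/drill_test/partitions_$n")
}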

Spark write to CSV fails even after 8 hours

I have a dataframe with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the spark shell (Scala) on an Elastic MapReduce cluster.
Here's how I'm writing to CSV:
result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the CSV data it is based on in roughly 22 minutes. In the same program, I'm also able to write another (smaller) dataframe to CSV in 8 minutes. However, for this result dataframe it takes 8+ hours and still fails, saying one of the connections was closed.
I'm running this job on 13 c4.8xlarge EC2 instances, each with 36 cores and 60 GB of RAM, so I thought I'd have the capacity to write to CSV, especially after 8 hours.
Many stages required retries or had failed tasks, and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write-CSV stage and was busy with persist stages, but without the persist call it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
Update:
I've run the following command to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with getNumPartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")
However, even after several hours, the jobs are still failing. What am I still doing wrong?
Note: I'm running the Spark shell via spark-shell yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
r2.persist(StorageLevel.MEMORY_AND_DISK)
But many stages failed, returning "Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3" or saying "Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed".
(Screenshots: Executors page; Spark web UI page for a node returning a shuffle error; Spark web UI page for a node returning an EC2 connection-closed error; overall job summary page.)
"I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas?"
It is a FetchFailedException, i.e. a failure to fetch a shuffle block.
Since you are able to deal with the small files and it fails only with the huge data,
I strongly suspect there are not enough partitions.
The first thing to do is verify/print sources.rdd.getNumPartitions(), destinations.rdd.getNumPartitions(), and result.rdd.getNumPartitions().
You need to repartition after the data is loaded in order to distribute the data (via a shuffle) to the other nodes in the cluster. This gives you the parallelism you need for faster processing without failures.
Furthermore, verify the other configurations that are applied.
Print all the config like this, and adjust the values as demand requires:
sc.getConf.getAll
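Put together, those checks look roughly like this in the spark-shell, where sc and the dataframes from the question already exist:

// Partition counts for each side of the join and for the result.
println(s"sources:      ${sources.rdd.getNumPartitions}")
println(s"destinations: ${destinations.rdd.getNumPartitions}")
println(s"result:       ${result.rdd.getNumPartitions}")

// Dump every effective Spark setting so memory and shuffle values can be verified.
sc.getConf.getAll.sorted.foreach { case (key, value) => println(s"$key = $value") }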
Also have a look at
SPARK-5928
Spark-TaskRunner-FetchFailedException Possible reasons : OOM or Container memory limits
Repartition both sources and destinations before joining, with a number of partitions such that each partition holds 10 MB-128 MB (try to tune this); there is no need to go to 20000 (IMHO that is too many).
Then join on those two columns and write without repartitioning again (i.e. the output partitions should be the same as the repartitioning before the join).
If you still have trouble, try the same thing after converting both dataframes to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).
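A sketch of that suggestion, using the dataframes from the question; the partition count of 4000 is only an illustration of the 10 MB-128 MB-per-partition target and needs tuning to the real data volume.

import org.apache.spark.sql.functions.col

// Placeholder partition count; pick it so each shuffle partition lands in the
// 10 MB - 128 MB range for the actual data size.
val parts = 4000

val src = sources.repartition(parts, col("source_d"), col("destination_d"))
val dst = destinations.repartition(parts, col("source_d"), col("destination_d"))

val joined = src.join(dst, Seq("source_d", "destination_d"))
  .select("source_i", "destination_i")

// No further repartition/coalesce: the output keeps the partitioning produced by the join.
joined.write.option("delimiter", ",").csv("s3://bucket-name/results")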

Data Processing in Parallel using Apache Spark with Pyspark

I have a daily-level transaction dataset for three months, adding up to around 17 GB combined. I have a server with 16 cores, 64 GB of RAM and 1 TB of hard disk space. The transaction data is broken into 90 files, each with the same format, and there is a set of queries to be run over this entire dataset; the query for each daily file is the same for all 90 files. The result after each query is run is appended, and then we get the resultant summary back. Before I start on this endeavour, I was wondering whether Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and I ultimately ran out of memory.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there a way to tell Spark to work on these 90 datasets in parallel? (See the sketch after this list.)
2. Can I expect a significant speed improvement if I am not working with Hadoop?
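A minimal sketch of the single-read approach from part 1, written in Scala to match the other snippets on this page (PySpark's DataFrameReader has the same methods); the path, file format, column names, and local-mode master are assumptions.

import org.apache.spark.sql.SparkSession

// Single-machine setup: 16 cores on the server described above (assumption).
val spark = SparkSession.builder()
  .appName("daily-transactions")
  .master("local[16]")
  .getOrCreate()

// One glob covers all 90 daily files, so Spark plans the read and the queries
// in parallel across them instead of looping file by file.
val transactions = spark.read
  .option("header", "true")
  .csv("/data/transactions/day_*.csv")

// Run the shared per-day query once over the whole dataset, then aggregate.
transactions.createOrReplaceTempView("transactions")
val summary = spark.sql("SELECT date, COUNT(*) AS txns FROM transactions GROUP BY date")
summary.show()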

increasing number of partitions in spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables with this configuration, with a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query:
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatekey group by col1,col2,col3")
In Hive this query takes 90 seconds, but Spark takes 21 minutes for the same query. Looking at the job, I found that the issue is that Spark creates 2 stages, and the first stage has only 7 tasks, one for each of the 7 blocks of data within the given ORC partition. The blocks are of different sizes, one 5 MB and another 45 MB, so stragglers take more time and the job runs far too long.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?
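Two commonly tried knobs, sketched below under the assumption of Spark 2.x with the ORC table read through Spark's data source path; the byte and partition values are placeholders to tune, not recommendations.

import org.apache.spark.sql.functions.sum

// Smaller input splits => more than 7 scan tasks, so a single 45 MB block no longer
// becomes one straggler (applies only when the ORC table is read via Spark's data
// source path rather than the Hive SerDe path).
sqlContext.setConf("spark.sql.files.maxPartitionBytes", (16 * 1024 * 1024).toString)

// More reducers for the group-by stage.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")

val q1 = sqlContext.table("mytable")
  .where("date_key = 'somedatekey'")
  .groupBy("col1", "col2", "col3")
  .agg(sum("col4"), sum("col5"))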
