I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size.
I am reading the data into a DataFrame and writing it back without any transformations, using partitionBy.
For example:
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
As I monitored the job status in the Spark UI, the job and its stage complete in 20 minutes, but the "SQL" tab in the Spark UI shows 40 minutes.
After running the job in debug mode and going through the Spark logs, I realised the tasks writing to "_temporary" complete in 20 minutes.
After that, the merge from "_temporary" to the actual output path takes another 20 minutes.
So my question is: is the driver process merging the data from "_temporary" to the output path sequentially, or is it done by executor tasks?
Is there anything I can do to improve the performance?
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which has been known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on the version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
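For reference, a minimal sketch of setting this when building the session (the app name is a placeholder; passing the same key via spark-submit --conf works just as well):

import org.apache.spark.sql.SparkSession

// Pass the committer algorithm choice through to Hadoop's FileOutputCommitter.
val spark = SparkSession.builder()
  .appName("orc-partitioned-write")  // placeholder name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Same write as in the question; with algorithm version 2 the tasks move their own output
// during task commit instead of the driver moving every file during job commit.
df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")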
On a separate note, having 8 cores per executor is not recommended, as it might saturate disk I/O when all 8 tasks are writing output at once.
Related
I am new to Spark and am trying to understand its internals. So,
I am reading a small 50 MB parquet file from S3, performing a group by, and then saving back to S3.
When I observe the Spark UI, I can see 3 stages created for this:
stage 0: load (1 task)
stage 1: shufflequerystage for grouping (12 tasks)
stage 2: save (coalescedshufflereader) (26 tasks)
Code Sample:
from pyspark.sql import functions as F

df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute)\
           .agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
df_agg.write.mode("overwrite").parquet(target_loc)
I am using an EMR instance with 1 master and 3 core nodes (each with 4 vCores), so the default parallelism is 12. I am not changing any config at runtime. But I am not able to understand why 26 tasks are created in the final stage. As I understand it, the default number of shuffle partitions should be 200. Screenshot of the UI attached.
I tried a similar logic on Databricks with Spark 2.4.5.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So the job output in the Spark UI does not reflect this. Maybe a repartition occurs at the end. Interesting, but not really an issue.
In Spark SQL, the number of shuffle partitions is set using spark.sql.shuffle.partitions, which defaults to 200. In most cases, this number is too high for smaller data and too small for bigger data. Selecting the right value is always tricky for the developer.
So we need the ability to coalesce the shuffle partitions by looking at the mapper output. If the map stage generates a small number of partitions, we want to reduce the overall number of shuffle partitions, which will improve performance.
In the latest version, Spark 3.0 with Adaptive Query Execution, this reduction of tasks is automated.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Considering this, in Spark 2.4.5 the Catalyst optimiser or EMR might also have enabled this feature internally to reduce the number of tasks, rather than running 200 tasks.
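For reference, a minimal sketch of the settings discussed above (shown in Scala rather than the question's PySpark; spark.sql.adaptive.coalescePartitions.enabled is the explicit Spark 3.0 switch for coalescing and is my assumption here, the other keys are the ones mentioned above):

// Let AQE coalesce small shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // Spark 3.0+ only

// Or switch AQE off and pin the shuffle partition count manually.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "75")

// Check the resulting partition count of the aggregated dataframe.
println(dfAgg.rdd.getNumPartitions)  // dfAgg stands in for df_agg above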
Using Spark 2.4.4 running in YARN cluster mode with the Spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);

// Read all input paths into one large dataframe up front.
Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);

// Submit one write job per value of "some_column".
List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());

// Wait for each submitted job, giving up after 30 minutes per job.
IntStream.range(0, futures.size()).forEach(i -> {
    try {
        ecs.poll(30, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
pool.shutdownNow();
At some point, as nThreads increases, Spark no longer seems to be scheduling any jobs, as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to the EMR cluster's HDFS (to leverage the almost-instantaneous renames and better file I/O of local HDFS) and add a distcp step to move the files to S3, to save yourself all the trouble of handling the innards of an object store trying to be a filesystem. Writing to local HDFS will also allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until we have improved consistency from S3A, we have to work with the workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. One could use hashing/random prefixes or reversed date-times to get around this. The trick is to name your keys hierarchically, putting the things you filter by most often on the left side of your key. And never have underscores in bucket names, due to DNS issues.
Enabling fs.s3a.fast.upload uploads parts of a single file to Amazon S3 in parallel.
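A minimal sketch of applying the settings above when building the session (assuming a Hadoop version that supports the s3a fast upload option; spark-submit --conf or spark-defaults.conf work equally well):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.speculation", "false")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")  // parallel multi-part uploads to S3
  .getOrCreate()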
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem the wrong way. Unless you can guarantee that the number of tasks per job is very low, you're not going to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only support 300 tasks at once; assuming you're using the default parallelism of 200, that's only 1.5 jobs. I'd suggest rewriting your code to cap the maximum number of concurrent queries at 10. I strongly suspect that you have 300 queries, each with only a single task out of several hundred actually running. Most OLAP data processing systems intentionally have a fairly low level of concurrent queries compared to more traditional RDS systems for this reason.
Also, FYI, Apache Hudi has a default parallelism of several hundred.
Why don't you just partition based on your filter column?
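To illustrate that last point, a rough sketch (Scala for brevity; plain Parquet stands in for the Hudi format, and the output path is a placeholder): one partitioned write replaces the hundreds of per-value queries, and Spark parallelises it across the whole cluster.

// Each value of the filter column ends up in its own sub-directory of the output path.
aHugeDf
  .write
  .format("parquet")            // placeholder; the question writes org.apache.hudi
  .partitionBy("some_column")
  .mode("overwrite")
  .save("s3://bucket/output/")  // placeholder path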
I would start by eliminating possible causes. Are you sure it's Spark that is not able to submit many jobs? Is it Spark or is it YARN? If it is the latter, you might need to play with the YARN scheduler settings. Could it be something to do with the ExecutorService implementation, which may have some limitation at the scale you are trying to achieve? Could it be Hudi? With that snippet, it's hard to determine.
How does the problem manifest itself, other than no jobs starting up? Do you see any metrics/monitoring on the cluster, or any logs that point to the problem?
If it is to do with scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application. I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler. How? Use dedicated resource pools, one per module; when used, the pools show up in the Spark UI (see the sketch after this list).
Check that the unit of parallelism (number of partitions) is correct for the dataframes you use by looking at the Spark admin UI. This is Spark's native way of using parallelism.
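A minimal sketch of the fair-scheduling setup (the pool name and the fairscheduler.xml path are placeholders; the pools themselves are defined in that XML file):

import org.apache.spark.sql.SparkSession

// Enable the fair scheduler and point Spark at the pool definitions.
val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // placeholder path
  .getOrCreate()

// Submit each module's jobs into its own pool so they share the cluster fairly.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "module_a_pool")  // placeholder pool name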
When my Spark program executes, it creates 1000 stages. However, I have seen that the recommended number is only 200. I have two actions at the end that write data to S3, and after that I unpersist the dataframes. Now, when my Spark program writes the data to S3, it still runs for almost 30 more minutes. Why is that? Is it due to the large number of dataframes I have persisted?
P.S. I am running the program with only 5 input records.
The cluster probably takes a long time to append data to an existing dataset. In particular, even though all the Spark jobs have finished, your command has not, because the driver node is moving the output files of the tasks from the job's temporary directory to the final destination one by one, which is slow with cloud storage. Try setting the configuration mapreduce.fileoutputcommitter.algorithm.version to 2.
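One hedged way to try that from an existing session is to set it on the underlying Hadoop configuration before the write (this only affects jobs started after the call):

// Use the v2 FileOutputCommitter algorithm for subsequent writes in this session.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")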
I have a dataframe with roughly 200-600 GB of data that I am reading, manipulating, and then writing to CSV using the Spark shell (Scala) on an Elastic MapReduce cluster. The Spark write to CSV fails even after 8 hours.
Here's how I'm writing to CSV:
result.persist.coalesce(20000).write.option("delimiter",",").csv("s3://bucket-name/results")
The result variable is created through a mix of columns from some other dataframes:
var result=sources.join(destinations, Seq("source_d","destination_d")).select("source_i","destination_i")
Now, I am able to read the CSV data it is based on in roughly 22 minutes. In the same program, I'm also able to write another (smaller) dataframe to CSV in 8 minutes. However, for this result dataframe it takes 8+ hours and still fails, saying one of the connections was closed.
I'm also running this job on 13 c4.8xlarge EC2 instances, with 36 cores each and 60 GB of RAM, so I thought I'd have the capacity to write to CSV, especially after 8 hours.
Many stages required retries or had failed tasks, and I can't figure out what I'm doing wrong or why it's taking so long. I can see from the Spark UI that it never even got to the write CSV stage and was busy with persist stages, but without the persist function it was still failing after 8 hours. Any ideas? Help is greatly appreciated!
Update:
I've run the following command to repartition the result variable into 66K partitions:
val r2 = result.repartition(66000) // confirmed with getNumPartitions
r2.write.option("delimiter",",").csv("s3://s3-bucket/results")
However, even after several hours, the jobs are still failing. What am I still doing wrong?
Note: I'm running the Spark shell via spark-shell --master yarn --driver-memory 50G
Update 2:
I've tried running the write with a persist first:
r2.persist(StorageLevel.MEMORY_AND_DISK)
But many stages failed, returning Job aborted due to stage failure: ShuffleMapStage 10 (persist at <console>:36) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3, or saying Connection from ip-172-31-48-180.ec2.internal/172.31.48.180:7337 closed
Screenshots attached: Executors page; Spark web UI page for a node returning a shuffle error; Spark web UI page for a node returning an EC2 connection-closed error; overall job summary page.
I can see from the Spark UI that it never even got to the write CSV
stage and was busy with persist stages, but without the persist
function it was still failing after 8 hours. Any ideas?
It is a FetchFailedException, i.e. it failed to fetch a shuffle block.
Since you are able to deal with small files, it is only on the huge data that it fails...
I strongly suspect there are not enough partitions.
The first thing to do is verify/print sources.rdd.getNumPartitions(), destinations.rdd.getNumPartitions() and result.rdd.getNumPartitions().
You need to repartition after the data is loaded in order to distribute the data (via shuffle) to other nodes in the cluster. This will give you the parallelism that you need for faster processing without failures.
Furthermore, to verify the other configurations applied, print all the config like this and adjust the values as needed:
sc.getConf.getAll
Also have a look at
SPARK-5928
Spark TaskRunner FetchFailedException. Possible reasons: OOM or container memory limits.
Repartition both source and destination before joining, with a number of partitions such that each partition is about 10 MB-128 MB (try to tune this); there is no need to make it 20000 (IMHO too many). See the sketch below.
Then join on those two columns and write, without repartitioning again (i.e. the output partitions should be the same as the repartitioning done before the join).
If you still have trouble, try to do the same thing after converting both dataframes to RDDs (there are some differences between the APIs, especially regarding repartitioning, key-value RDDs, etc.).
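A rough sketch of the first two points (the partition count is only an illustration; tune it so each partition lands in the 10 MB-128 MB range):

import org.apache.spark.sql.functions.col

// Repartition both sides on the join keys so matching rows are co-located,
// then join and write without any further repartition or coalesce.
val numParts = 2000  // illustrative; tune against your data size
val src = sources.repartition(numParts, col("source_d"), col("destination_d"))
val dst = destinations.repartition(numParts, col("source_d"), col("destination_d"))

val joined = src.join(dst, Seq("source_d", "destination_d"))
  .select("source_i", "destination_i")

joined.write.option("delimiter", ",").csv("s3://bucket-name/results")  // path from the question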
We run Spark in standalone mode with 3 nodes on a 240 GB "large" EC2 box to merge three CSV files, read into DataFrames and converted to JavaRDDs, into output CSV part files on S3 using s3a.
We can see from the Spark UI that the first stages, reading and merging to produce the final JavaRDD, run at 100% CPU as expected, but the final stage, writing out the CSV files using saveAsTextFile at package.scala:179, gets "stuck" for many hours on 2 of the 3 nodes, with 2 of the 32 tasks taking hours (the box is at 6% CPU, 86% memory, 15 kB/s network I/O and 0 disk I/O for the entire period).
We are reading and writing uncompressed CSV (we found uncompressed was much faster than gzipped CSV), with repartition(16) on each of the three input DataFrames and without coalescing the write.
Would appreciate any hints as to what we can investigate to find out why the final stage takes so many hours doing very little on 2 of the 3 nodes in our standalone local cluster.
Many thanks
--- UPDATE ---
I tried writing to local disk rather than s3a and the symptoms are the same: 2 of the 32 tasks in the final saveAsTextFile stage get "stuck" for hours.
If you are writing to S3, via s3n, s3a or otherwise, do not set spark.speculation = true unless you want to run the risk of corrupted output.
What I suspect is happening is that the final stage of the process is renaming the output files, which on an object store involves copying lots (many GB?) of data. The rename takes place on the server, with the client just keeping an HTTPS connection open until it finishes. I'd estimate S3A rename time at about 6-8 megabytes/second; for example, at 7 MB/s, renaming 100 GB of output would take roughly 4 hours. Would that number tie in with your results?
Write to local HDFS then, afterwards, upload the output.
gzip compression can't be split, so Spark will not assign parts of a file to different executors for processing. One file: one executor.
Try to avoid CSV; it's an ugly format. Embrace Avro, Parquet or ORC. Avro is great for other apps to stream into; the others are better for downstream processing in other queries. Significantly better.
And consider compressing the files with a format such as lzo or snappy, both of which can be split.
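For example, a sketch of writing the merged data as snappy-compressed Parquet instead of CSV text (mergedDf and the path are placeholders for your merged DataFrame and output location):

// Parquet is columnar and splittable, and snappy keeps it splittable and cheap to decompress.
mergedDf.write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("hdfs:///tmp/merged_parquet")  // placeholder path; copy up to S3 afterwards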
see also slides 21-22 on: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
I have seen similar behavior. There is a bug fix in HEAD as of October 2016 that may be relevant. But for now you might enable
spark.speculation=true
in the SparkConf or in spark-defaults.conf.
Let us know if that mitigates the issue.