Spark write Parquet to S3 the last task takes forever - apache-spark

I'm writing a parquet file from DataFrame to S3.
When I look at the Spark UI, I can see that all but one of the tasks in the writing stage complete swiftly (e.g. 199/200). This last task appears to take forever to complete, and very often it fails due to exceeding the executor memory limit.
I'd like to know what is happening in this last task and how to optimize it.
Thanks.

I have tried Glemmie Helles Sindholt's solution and it works very well.
Here is the code:
path = 's3://...'
n = 2 # number of repartitions, try 2 to test
spark_df = spark_df.repartition(n)
spark_df.write.mode("overwrite").parquet(path)

It sounds like you have data skew. You can fix this by calling repartition on your DataFrame before writing to S3.
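A minimal sketch of one way to do this, assuming the skew comes from a few hot keys: add a random "salt" column and repartition on it so the heavy keys are spread across many tasks (df, the partition count and the output path are all hypothetical here).
from pyspark.sql import functions as F

n = 200  # hypothetical target partition count
df = df.withColumn("salt", (F.rand() * n).cast("int"))
df.repartition(n, "salt").drop("salt").write.mode("overwrite").parquet("s3://bucket/output")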

As others have noted, data skew is likely at play.
Besides that, I noticed that your task count is 200.
The configuration parameter spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.
200 is the default for this setting, but generally it is far from an optimal value.
For small data, 200 could be overkill and you would waste time in the overhead of multiple partitions.
For large data, 200 can result in large partitions, which should be broken down into more, smaller partitions.
The really rough rules of thumb are:
- have 2-3x the number of partitions as CPUs,
- or ~128 MB per partition.
2 GB is the max partition size. If you are hovering just below 2000 partitions, note that Spark uses a different data structure for shuffle bookkeeping when the number of partitions is greater than 2000 [1]:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
You can try playing with this parameter at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "300")
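As a rough illustration of those rules of thumb (the numbers are purely hypothetical, and defaultParallelism is used here as a stand-in for the total core count):
cores = spark.sparkContext.defaultParallelism  # total cores available to the application
shuffle_mb = 50 * 1024                         # assume ~50 GB of shuffled data
n_parts = max(3 * cores, shuffle_mb // 128)    # 2-3x cores, or ~128 MB per partition
spark.conf.set("spark.sql.shuffle.partitions", str(n_parts))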
[1]What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

This article - The Bleeding Edge: Spark, Parquet and S3 - has a lot of useful information about Spark, S3 and Parquet. In particular, it talks about how the driver ends up writing out the _common_metadata files, which can take quite a bit of time. There is a way to turn it off.
Unfortunately, they say that they went on to generate the common metadata themselves, but don't really talk about how they did so.
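For reference, a hedged sketch of "turning it off": older Spark/Parquet versions honor the Hadoop property parquet.enable.summary-metadata (newer Parquet versions replace it and skip summary files by default), so treat the property name as version dependent.
# set on the Hadoop configuration before writing; can also be passed as
# --conf spark.hadoop.parquet.enable.summary-metadata=false at submit time
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")
df.write.mode("overwrite").parquet("s3://bucket/output")  # hypothetical DataFrame and path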

Related

Optimising Spark read and write performance

I have around 12K binary files, each 100 MB in size, containing multiple compressed records with variable lengths. I am trying to find the most efficient way to read them, uncompress them and then write them back in Parquet format. The cluster I have has 6 nodes with 4 cores each.
At the moment, with the pseudocode below, it takes around 8 hrs to read all the files, and writing back to Parquet is very, very slow.
import os
import struct

def reader(file_name):
    keyMsgList = []
    with open(file_name, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                break
            keyBytes = header[0:8]
            msgLenBytes = header[8:12]
            # convert keyBytes & msgLenBytes to int (big-endian assumed)
            key = struct.unpack(">q", keyBytes)[0]
            msgLen = struct.unpack(">i", msgLenBytes)[0]
            message = f.read(msgLen)
            keyMsgList.append((key, decode(message)))  # decode() is my message decoder (not shown)
    return keyMsgList

files = os.listdir("/path/to/binary/files")
rddFiles = sc.parallelize(files, 6000)
df = spark.createDataFrame(rddFiles.flatMap(reader), schema)
df.repartition(6000).write.mode("append").partitionBy("key").parquet("/directory")
The rationale behind choosing 6000 in sc.parallelize(files, 6000) is to create partitions of about 200 MB each, i.e. (12K files * 100 MB) / 200 MB. Given the sequential nature of the file content, which has to be read byte by byte, I'm not sure whether the read can be further optimised.
Similarly, when writing back to Parquet, the number in repartition(6000) is to make sure the data is distributed uniformly and all executors can write in parallel. However, it turns out to be a very slow operation.
One solution is to increase the number of executors, which will improve the read performance, but I'm not sure if it will improve writes.
Looking for any suggestions on how I can improve the performance here.
Suggestion 1: do not use repartition but coalesce.
See here. You identified the bottleneck of the repartition operation: this is because you have launched a full shuffle. With coalesce you won't do that. You will still end up with N partitions. They won't be as balanced as those you would get with repartition, but does that matter?
I would recommend you favor coalesce over repartition.
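A minimal sketch of Suggestion 1, assuming the same df as in the question (the target of 48 partitions, i.e. 2x the 24 cores, is a hypothetical value to tune):
df.coalesce(48).write.mode("append").partitionBy("key").parquet("/directory")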
Suggestion 2: 6000 partitions is maybe not optimal
Your application runs on 6 nodes with 4 cores. You have 6000 partitions. This means you have around 250 partitions per core (not even counting what is given to your master). That is, in my opinion, too much.
Since your partitions are small (around 200 MB), your master probably spends more time awaiting answers from executors than executing the queries.
I would recommend you reduce the number of partitions.
Suggestion 3: can you use the DataFrame API?
DataFrame API operations are generally faster and better than a hand-coded solution.
Maybe have a look at pyspark.sql.functions to see if you can find something there (see here). I don't know if it's relevant since I have not seen your data, but that's a general recommendation I make from experience.
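For example, a hedged sketch assuming Spark 3.0+ where the "binaryFile" data source is available: read the files as a DataFrame and parse the record bytes in a UDF instead of a hand-rolled RDD pipeline. The big-endian header layout and the plain UTF-8 decode are assumptions, since decode() from the question is not shown.
import struct
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

record_schema = ArrayType(StructType([
    StructField("key", LongType()),
    StructField("message", StringType()),
]))

@F.udf(returnType=record_schema)
def parse_records(content):
    # same 8-byte key / 4-byte length header layout as the question's reader()
    out, pos = [], 0
    while pos + 12 <= len(content):
        key, msg_len = struct.unpack(">qi", content[pos:pos + 12])
        pos += 12
        out.append((key, content[pos:pos + msg_len].decode("utf-8", "replace")))
        pos += msg_len
    return out

df = (spark.read.format("binaryFile")
      .load("/path/to/binary/files")
      .select(F.explode(parse_records("content")).alias("rec"))
      .select("rec.key", "rec.message"))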

How to control size of Parquet files in Glue?

I'm loading a data set into a DynamicFrame, performing a transformation and then writing it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set"
    },
    format = "parquet"
)
The result is 12 Parquet files with an average size of about 3 MB.
First of all, I don't get why Glue/Spark won't by default create a single file about 36 MB large, given that almost all consuming software (Presto/Athena, Spark) prefers files of about 100 MB rather than a pile of small files. If somebody has an insight here, I'd appreciate hearing about it.
But practically speaking, I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Using coalesce(1) will deteriorate the performance of Glue in the long run. While it may work for small files, it will take ridiculously long amounts of time for larger files.
coalesce(1) makes only one Spark executor write the file, whereas without coalesce() all the executors would write the file in parallel.
Also, using coalesce(1) has a higher cost: one executor running for a long time costs more than all executors running for a fraction of that time.
Coalesce(1) took 4 hrs 48 minutes to process 1 GB of Parquet Snappy-compressed data.
Coalesce(9) took 48 minutes for the same.
No coalesce() did the same job in 25 minutes.
I haven't tried it yet, but you can set accumulator_size in write_from_options.
Check https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass the value.
Alternatively, you can use a PySpark DataFrame with 1 partition before the write in order to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to 1 file will not take advantage of parallel writing and hence will increase the time to write.
You could try repartition(1) before writing the dynamic frame to S3. Refer here to understand why coalesce(1) is a bad choice for merging. It might also cause Out Of Memory (OOM) exceptions if a single node cannot hold all the data to be written.
I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files.
The number of output files is directly linked to the number of partitions. Spark cannot assume a default size for output files as it is application dependent. The only way you can control the size of output files is to act on your partition numbers.
I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using AWS Glue API [1], you can control how to group small files into a single partition while you read data:
glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set",
        "groupFiles": "inPartition",
        "groupSize": "10485760"  # 10485760 bytes (10 MB)
    },
    format = "parquet"
)
If your transformation code does not impact the data distribution too much (no filtering, no joining, etc.), you should expect the output files to have almost the same total size as the input (not considering the compression rate). In general, though, Spark transformations are pretty complex, with joins, aggregates and filtering, and these change the data distribution and the number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect.
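For instance, a hedged sketch using the AWS Glue PySpark API: convert the DynamicFrame to a DataFrame, reduce the partition count, and convert back before writing (the target of 4 partitions is purely hypothetical).
from awsglue.dynamicframe import DynamicFrame

df = someDateFrame.toDF().coalesce(4)
coalesced = DynamicFrame.fromDF(df, glueContext, "coalesced")
glueContext.write_dynamic_frame.from_options(
    frame = coalesced,
    connection_type = "s3",
    connection_options = {"path": "s3://the-bucket/some-data-set"},
    format = "parquet"
)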
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/

Spark Streaming: avoid small files in HDFS

I have a Spark Streaming application that writes its output to HDFS.
What precautions and strategies can I take to ensure that this process does not generate too many small files and create memory pressure on the HDFS NameNode?
Does Apache Spark provide any pre-built solution to avoid small files in HDFS?
No. Spark does not provide any such solution.
What you can do:
Increase the batch interval - this does not guarantee anything, but there is a higher chance. The tradeoff is that the streaming job will have higher latency.
Manually manage it. For example, on each batch you could calculate the size of the RDD and accumulate RDDs until they satisfy your size requirement. Then you just union the RDDs and write to disk. This will unpredictably increase latency, but will guarantee efficient space usage (see the sketch after this list).
Another solution is to run another Spark application that reaggregates the small files every hour/day/week, etc.
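A minimal sketch of the "accumulate, then union and write" option, assuming a DStream named stream, a SparkContext sc, and a purely hypothetical record-count threshold; estimating bytes rather than records would be more accurate in practice.
import time

pending_rdds = []
pending_count = 0

def accumulate_and_write(rdd):
    global pending_count
    n = rdd.count()                  # runs a small job on each batch
    if n == 0:
        return
    rdd.cache()                      # keep the batch around until we flush
    pending_rdds.append(rdd)
    pending_count += n
    if pending_count >= 1000000:     # hypothetical size requirement
        sc.union(pending_rdds).saveAsTextFile("hdfs:///output/chunk-%d" % int(time.time()))
        for r in pending_rdds:
            r.unpersist()
        del pending_rdds[:]
        pending_count = 0

stream.foreachRDD(accumulate_and_write)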
I know this question is old, but this may be useful for someone in the future.
Another option is to use coalesce with a smaller number of partitions. coalesce merges partitions together and creates larger ones. This can increase the processing time of each streaming batch because of the reduced number of partitions during the write, but it will help reduce the number of files.
This reduces the parallelism, so having too few partitions can cause issues for the streaming job. You will have to test different values for coalesce to find which works best in your case.
You can reduce the number of part files.
By default, Spark writes shuffled output in 200 part files (spark.sql.shuffle.partitions); you can decrease this number.
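For example (df is a hypothetical DataFrame; both values below are things to tune, not recommendations):
spark.conf.set("spark.sql.shuffle.partitions", "50")   # fewer shuffle partitions -> fewer part files
df.coalesce(10).write.mode("append").parquet("hdfs:///output/table")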

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM each, and the input data size is a few hundred GBs.
The application is a Spark SQL job; it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all the data can fit into memory, nor can it be processed on a single machine.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data of each partition is processed, it needs to be merged (reduce). This is traditional map-reduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark is built around a concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another and are lazily evaluated. Those RDDs can be treated as intermediate results we don't want to materialize.
Actions are used when you really want to get the data; those RDDs/data are the results we actually want, e.g. taking the top records.
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, then forgets the intermediates.
(DAG illustration omitted; source: cloudera.com)
I made a small screencast for a presentation on YouTube: Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data" (from the Spark FAQ). The issue here is with large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
- less concurrency,
- increased memory pressure for transformations that involve a shuffle,
- more susceptibility to data skew.
Too many partitions might also have a negative impact:
- too much time spent scheduling a large number of tasks.
If you store your data on HDFS, it will already be partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
- spark.default.parallelism (cores available to the application)
- spark.sql.files.maxPartitionBytes (default 128 MB)
- spark.sql.files.openCostInBytes (default 4 MB)
Links:
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit, Aaron Davidson gave some tips about partition tuning. He also defined a reasonable number of partitions, summarized in the three points below:
- commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" depends on the size of the dataset and the cluster),
- lower bound = at least 2x the number of cores in the cluster,
- upper bound = each task should finish within 100 ms.
Rockie's answer is right, but it doesn't address the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, as you have more partitions, Spark keeps fewer of them in the LRU cache so that they do not cause OOM (this may have a negative impact too, like the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data: take the total data size and divide by X, and that is roughly the memory for each task, each of which is executed on a (virtual) core. So it is not difficult to see why X = 64 leads to OOM, but X = 1024 does not.
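A hypothetical back-of-the-envelope check, assuming ~300 GB of input (the question only says "a few hundred GBs"):
data_gb = 300
print(data_gb / 64)     # ~4.7 GB per task: easily exceeds the memory available per core
print(data_gb / 1024)   # ~0.3 GB per task: fits comfortably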

What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?

I am using Spark SQL (actually hiveContext.sql()) with GROUP BY queries and I am running into OOM issues. So I'm thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping.
I believe these partitions will share the data shuffle load, so the more partitions, the less data each one holds. I am new to Spark. I am using Spark 1.4.0 and I have around 1 TB of uncompressed data to process using hiveContext.sql() GROUP BY queries.
If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions to 2001.
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
I really wish they would let you configure this independently.
By the way, I found this information in a Cloudera slide deck.
OK so I think your issue is more general. It's not specific to Spark SQL, it's a general problem with Spark where it ignores the number of partitions you tell it when the files are few. Spark seems to have the same number of partitions as the number of files on HDFS, unless you call repartition. So calling repartition ought to work, but has the caveat of causing a shuffle somewhat unnecessarily.
I raised this question a while ago and have still yet to get a good answer :(
Spark: increase number of partitions without causing a shuffle?
It actually depends on your data and your query; if Spark must load 1 TB, there is something wrong with your design.
Use the superb web UI to see the DAG, i.e. how Spark translates your SQL query into jobs/stages and tasks.
Useful metrics are "Input" and "Shuffle".
- Partition your data (Hive / directory layout like /year=X/month=X) - see the sketch after this list
- Use Spark's CLUSTER BY feature to work per data partition
- Use the ORC / Parquet file formats because they provide push-down filtering, so useless data is not loaded into Spark
- Analyze the Spark History server to see how Spark is reading your data
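A hedged sketch of the layout/format advice above, assuming a DataFrame df with year and month columns: write Parquet partitioned as /year=X/month=X so that partition pruning and push-down filters can skip useless data.
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("hdfs:///warehouse/events"))

# later queries that filter on the partition columns only read the matching directories
pruned = spark.read.parquet("hdfs:///warehouse/events").where("year = 2015 AND month = 6")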
Also, could the OOM be happening on your driver?
That is another issue: the driver collects, at the end, the data you asked for. If you ask for too much data, the driver will OOM; try limiting your query, or write to another table (Spark syntax CREATE TABLE ...AS).
I came across this post from Cloudera about Hive partitioning. Check out the "Pointers" section, which talks about how the number of partitions and the number of files in each partition can overload the NameNode, which might cause OOM.
