Optimising Spark read and write performance - apache-spark

I have around 12K binary files, each around 100 MB in size, containing multiple compressed records of variable length. I am trying to find the most efficient way to read them, uncompress them and then write them back in Parquet format. The cluster I have has 6 nodes with 4 cores each.
At the moment, with the pseudocode below, it takes around 8 hrs to read all the files, and writing back to Parquet is very slow.
import os

def reader(file_name):
    keyMsgList = []
    with open(file_name, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                break
            keyBytes = header[0:8]
            msgLenBytes = header[8:12]
            # convert keyBytes & msgLenBytes to int (byte order depends on the file format)
            key = int.from_bytes(keyBytes, "big")
            msgLen = int.from_bytes(msgLenBytes, "big")
            message = f.read(msgLen)
            keyMsgList.append((key, decode(message)))  # decode() decompresses a single record
    return keyMsgList
files = [os.path.join("/path/to/binary/files", f) for f in os.listdir("/path/to/binary/files")]
rddFiles = sc.parallelize(files, 6000)
df = spark.createDataFrame(rddFiles.flatMap(reader), schema)
df.repartition(6000).write.mode("append").partitionBy("key").parquet("/directory")
The rationale behind choosing 6000 in sc.parallelize(files, 6000) is to create partitions of about 200 MB each, i.e. (12K files * 100 MB) / 200 MB. Given the sequential nature of the file content, which has to be read byte by byte, I am not sure whether the read can be further optimised.
Similarly, when writing back to Parquet, the number in repartition(6000) is there to make sure the data is distributed uniformly and all executors can write in parallel. However, it turns out to be a very slow operation.
One solution is to increase the number of executors, which will improve the read performance, but I am not sure whether it will improve writes.
Looking for any suggestions on how I can improve the performance here.

Suggestion 1: do not use repartition but coalesce.
See here. You identified the bottleneck as the repartition operation: this is because you have launched a full shuffle. With coalesce you won't do that. You will still end up with N partitions; they won't be as balanced as those you would get with repartition, but does that matter?
I would recommend favoring coalesce over repartition.
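For instance, the last line of your pseudocode could become the sketch below (keeping the question's numbers and paths):
df.coalesce(6000).write.mode("append").partitionBy("key").parquet("/directory")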
Suggestion 2: 6000 partitions may not be optimal
Your application runs on 6 nodes with 4 cores each. You have 6000 partitions. This means you have around 250 partitions per core (not even counting what is given to your master). That is, in my opinion, too much.
Since your partitions are small (around 200 MB), your master probably spends more time awaiting answers from the executors than executing the queries.
I would recommend reducing the number of partitions.
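As an illustration only (the multiplier of 3 below is a common rule of thumb, not something measured on your data):
num_nodes = 6
cores_per_node = 4
num_partitions = num_nodes * cores_per_node * 3  # ~2-3x the total core count
rddFiles = sc.parallelize(files, num_partitions)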
Suggestion 3: can you use the DataFrame API?
DataFrame API operations are generally faster and better than a hand-coded solution.
Maybe have a look at pyspark.sql.functions to see if you can find something there (see here). I don't know if it is relevant since I have not seen your data, but that is a general recommendation I make from experience.
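One possibility worth checking (this assumes Spark 3.0+ and I have not tried it on your data) is the built-in binaryFile data source, which reads whole files through the DataFrame API:
# each row has columns: path, modificationTime, length, content (the raw bytes of the file)
binary_df = spark.read.format("binaryFile").load("/path/to/binary/files")
binary_df.select("path", "length").show()
You would still need your decoding logic (for example in a UDF or mapPartitions over the content column), but the file listing and reading move into the DataFrame API.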

Related

Create 1GB partitions Spark SQL

I'm trying to split my data into 1 GB files when writing to S3 using Spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round it, and use that number to write to S3:
# Vacuum to leave 1 week of history
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.vacuum(168)
deltaTable.generate("symlink_format_manifest")
# Reading delta table and rewriting with coalesce to reach 1GB per file
df = spark.read.format('delta').load(f"s3a://{delta_table}")
coalesce_number = define_coalesce(delta_table)  # this function calculates the size of the delta in GB
df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.generate("symlink_format_manifest")
I'm trying this approach because our Delta is the open-source one and we don't have the optimize method built in.
I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said that it did not solve their problem, and that this config affects partitioning when reading, not when writing.
Any suggestions?
I understand your problem and what you are trying to do, but I am not sure what the output of your current solution is. If the partitions are still not close to 1 GB, you may try to replace coalesce with repartition. Coalesce does not guarantee that the resulting partitions are equal in size, so your formula may not work. If you know how many partitions you need on output, use repartition(coalesce_number) and it should create equal partitions in a round-robin fashion.
If the problem is with the function that calculates the dataset size (and so the number of partitions), I know two solutions:
You can cache the dataset and then take its size from the statistics. Of course this may be problematic and you have to spend some resources to do that. Something similar is done in the first answer here: How spark get the size of a dataframe for broadcast?
You can calculate the count and divide it by the number of records you want in a single partition. The size of a single record depends on your schema, so it may be tricky to estimate, but it is a viable option to try.
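A minimal sketch of that second idea (the records-per-gigabyte figure is an assumption you would tune for your schema):
records_per_gb = 5_000_000  # rough estimate of how many records fit in ~1 GB, depends on your schema
df = spark.read.format('delta').load(f"s3a://{delta_table}")
num_partitions = max(1, round(df.count() / records_per_gb))
df.repartition(num_partitions).write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(f"s3a://{delta_table}")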
Finally solved my problem. Since we are using Delta, I had the idea of reading the manifest file to find all the parquet file names. After that, I sum the sizes of the parquets listed in the manifest by connecting to S3 with boto:
def define_repartition(delta_table_path):
    conn = S3Connection()
    bk = conn.get_bucket(bucket)  # `bucket` is the S3 bucket name, defined elsewhere
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    parquets = [data[0].replace(f's3a://{bucket}/', '') for data in manifest.select('value').collect()]
    size = 0
    for parquet in parquets:
        key = bk.lookup(parquet)
        size = size + key.size
    return round(size / 1073741824)  # bytes -> GB
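For reference, a sketch of how the returned value can then be used (the write options are simply the ones from the question):
n = define_repartition(delta_table)
df = spark.read.format('delta').load(f"s3a://{delta_table}")
df.repartition(n).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")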
Thank you all for the help. Regards from Brazil. :)

Resolving small file issue in pyspark

I am reading from a partitioned table that has close to 4 billion records.
The files that I am reading from are my source, and I have no control over them to alter the records.
While reading the files through DataFrames, for each partition I am creating 2000 files of size less than 2 KB. This is because the shuffle partitions are set to 2000 to increase the execution speed.
Approach followed to resolve this issue:
I have looped over the HDFS path of the table after its execution is completed and created a list of data paths [/dv/hdfs/..../table_name/partition_value=01, /dv/hdfs/..../table_name/partition_value=02, ...].
For each such path, I have calculated the disk usage and block size from the cluster and got the appropriate number of partitions as no_of_partitions = ceil[disk_usage / block_size], and then written the data into another location with the same partition_id, such as [/dv/hdfs/..../table2_name/partition_value=01].
Now, though this works in consolidating the small files from 2 KB to an average size of 82 MB, it is taking about 2.5 minutes per partition. With 256 such partitions, it is taking more than 10 hrs to finish the execution.
Kindly suggest any other method by which this could be achieved in less than 2 hrs.
Although you have 2000 shuffle partitions, you can and should control the output files.
Generating small files in Spark is itself a performance degradation for subsequent read operations.
Now, to control the small files issue, you can do the following:
While writing the DataFrame to HDFS, repartition it on the partition column and control the number of output files per partition:
df.repartition(partition_col).write.option("maxRecordsPerFile", 100000).partitionBy(partition_col).parquet(path)
This will generate files with at most 100000 records each in every partition, solving your small files issue and improving the overall read and write performance of your job.
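If you want to aim for a specific file size rather than a record count, a rough way to pick the value (the sizes below are assumptions you would tune from your own data) is:
target_file_bytes = 128 * 1024 * 1024   # aim for ~128 MB files (assumption, tune for your cluster)
approx_record_bytes = 2048              # sampled/estimated average record size (assumption)
max_records = target_file_bytes // approx_record_bytes

df.repartition(partition_col) \
    .write.option("maxRecordsPerFile", int(max_records)) \
    .partitionBy(partition_col) \
    .parquet(path)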
Hope it helps.

How to control size of Parquet files in Glue?

I'm loading a data set into a DynamicFrame, performing a transformation and then writing it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
    frame = someDateFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set"
    },
    format = "parquet"
)
The result is 12 Parquet files with an average size of about 3 MB.
First of all, I don't get why Glue/Spark won't by default instead create a single file about 36 MB in size, given that almost all consuming software (Presto/Athena, Spark) prefers a file size of about 100 MB and not a pile of small files. If somebody has an insight here, I'd appreciate hearing about it.
But practically speaking, I'm wondering if it is possible to make Glue/Spark produce a large file, or at least larger files. Is that possible?
Using coalesce(1) will deteriorate the performance of Glue in the long run. While it may work for small files, it will take a ridiculously long time for larger files.
coalesce(1) makes only 1 Spark executor write the file, whereas without coalesce() all the Spark executors would write in parallel.
Also, using coalesce(1) costs more: 1 executor running for a long time costs more than all executors running for a fraction of that time.
Coalesce(1) took 4 hrs 48 minutes to process 1GB of Parquet Snappy Compressed Data.
Coalesce(9) took 48 minutes for the same.
No Coalesce() did the same job in 25 minutes.
I haven't tried it yet, but you can set accumulator_size in write_from_options.
Check https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass value.
Alternatively, you can use a PySpark DataFrame coalesced to 1 partition before the write in order to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to 1 file will not take advantage of parallel writing and hence will increase time to write.
You could try repartition(1) before writing the dynamic frame to S3. Refer here to understand why coalesce(1) is a bad choice for merging. coalesce(1) might also cause Out Of Memory (OOM) exceptions if a single node cannot hold all the data to be written.
I don't get why Glue/Spark won't by default instead create a single file about 36 MB in size, given that almost all consuming software (Presto/Athena, Spark) prefers a file size of about 100 MB and not a pile of small files.
The number of output files is directly linked to the number of partitions. Spark cannot assume a default size for output files, as it is application dependent. The only way to control the size of the output files is to act on your partition numbers.
I'm wondering if it is possible to make Glue/Spark produce a large file, or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using AWS Glue API [1], you can control how to group small files into a single partition while you read data:
someDateFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {
        "path": "s3://the-bucket/some-data-set",
        "groupFiles": "inPartition",
        "groupSize": "10485760"  # 10485760 bytes (10 MB)
    },
    format = "parquet"
)
If your transformation code does not impact the data distribution too much (no filtering, no joining, etc.), you should expect the output files to have almost the same size as the input (not considering the compression rate). In general, though, Spark transformations are pretty complex, with joins, aggregates and filtering; these change the data distribution and the number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect.
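For example, a sketch of that pattern in a Glue job (the partition count of 4 is arbitrary, and someDateFrame is the DynamicFrame from the question; DynamicFrame.fromDF converts back after repartitioning):
from awsglue.dynamicframe import DynamicFrame

# drop to a plain Spark DataFrame, control the partition count, then convert back
repartitioned_df = someDateFrame.toDF().repartition(4)
repartitioned_dyf = DynamicFrame.fromDF(repartitioned_df, glueContext, "repartitioned_dyf")

glueContext.write_dynamic_frame.from_options(
    frame = repartitioned_dyf,
    connection_type = "s3",
    connection_options = {"path": "s3://the-bucket/some-data-set"},
    format = "parquet"
)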
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/

spark write to disk with N files less than N partitions

Can we write data to say 100 files, with 10 partitions in each file?
I know we can use repartition or coalesce to reduce the number of partitions. But I have seen some Hadoop-generated Avro data with many more partitions than files.
The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).
Now, having said that, when data is read back in it could be split into smaller chunks based on your configured split size, depending on the format and/or compression.
If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.
The following will result in 2 files being written out even though it's only got 1 partition:
val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")

Spark write Parquet to S3 the last task takes forever

I'm writing a parquet file from DataFrame to S3.
When I look at the Spark UI, I can see that all but 1 task of the writing stage completed swiftly (e.g. 199/200). This last task appears to take forever to complete, and very often it fails due to exceeding the executor memory limit.
I'd like to know what is happening in this last task. How can I optimize it?
Thanks.
I have tried Glemmie Helles Sindholt's solution and it works very well.
Here is the code:
path = 's3://...'
n = 2 # number of repartitions, try 2 to test
spark_df = spark_df.repartition(n)
spark_df.write.mode("overwrite").parquet(path)
It sounds like you have data skew. You can fix this by calling repartition on your DataFrame before writing to S3.
As others have noted, data skew is likely at play.
Besides that, I noticed that your task count is 200.
The configuration parameter spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.
200 is the default for this setting, but generally it is far from an optimal value.
For small data, 200 could be overkill and you would waste time in the overhead of multiple partitions.
For large data, 200 can result in large partitions, which should be broken down into more, smaller partitions.
The really rough rules of thumb are:
- have 2-3x as many partitions as CPU cores, or
- aim for ~128 MB per partition.
2 GB is the maximum partition size. Also, if you are hovering just below 2000 partitions, note that Spark uses a different data structure for shuffle bookkeeping when the number of partitions is greater than 2000 [1]:
private[spark] object MapStatus {
  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
  ...
You can try playing with this parameter at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "300")
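Putting the rules of thumb above together, a rough way to pick a value (the shuffle size below is an assumption; in practice you would read it off the Spark UI for the stage in question):
shuffle_size_bytes = 50 * 1024**3        # e.g. ~50 GB of shuffle data for the stage
target_partition_bytes = 128 * 1024**2   # ~128 MB per partition
num_partitions = shuffle_size_bytes // target_partition_bytes
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))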
[1] What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
This article, The Bleeding Edge: Spark, Parquet and S3, has a lot of useful information about Spark, S3 and Parquet. In particular, it talks about how the driver ends up writing out the _common_metadata files, which can take quite a bit of time. There is a way to turn it off.
Unfortunately, they say that they go on to generate the common metadata themselves, but don't really talk about how they did so.
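If I recall correctly, the switch is a Hadoop-level Parquet setting rather than a Spark SQL option; the sketch below is from memory, so please verify it against your Spark version:
# disable the _metadata / _common_metadata summary files (assumed setting, verify before relying on it)
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")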
