Create 1GB partitions Spark SQL - apache-spark

I'm trying to split my data in 1GB when writing in S3 using spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round, and using that number to write in S3:
# Vaccum to leave 1 week of history
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.vacuum(168)
deltaTable.generate("symlink_format_manifest")
# Reading delta table and rewriting with coalesce to reach 1GB per file
df = spark.read.format('delta').load(f"s3a://{delta_table}")
coalesce_number = define_coalesce(delta_table) < this function calculates the size of the delta in GB
df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.generate("symlink_format_manifest")
I'm trying this way cause our Delta is the opensource one and we don't have the optimize method built in.
I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said that it was not solving their problems, and that this config partitions when reading and not writing.
Any suggestions?

I understand your problem, and what you are trying to do but i am not sure what is the output of your current solution. If partitions are still not equal to 1 gb you may try to replace coalesce with repartition. Coalesce does not guarantee that after this operation partitions are equal so your formula may not work. If you know how many partition you need on output use repartition(coalesce_number) and it should create equal partitions with round robin
If the problem is with function which is calculating dataset size (so number of partitions) i know two solutions:
You can cache dataset and then take its size from statistics. Of course this may be problematic and you have to spend some resource to due that. Something similar is done here in first answer: How spark get the size of a dataframe for broadcast?
You can calculate count and divide it by number of records you want to have in single partition. Size of single record depends on your schema, it may be tricky to estimate it but it is viable option to try

Finally solved my problem. Since we are using Delta, I had the idea of trying to read the manifest files to find all the parquet names. After that, I get the sum of the list of parquets on manifest connecting in S3 with boto3:
def define_repartition(delta_table_path):
conn = S3Connection()
bk = conn.get_bucket(bucket)
manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
parquets = [data[0].replace(f's3a://{bucket}/','') for data in manifest.select('value').collect()]
size = 0
for parquet in parquets:
key = bk.lookup(parquet)
size = size + key.size
return round(size/1073741824)
Thank you all for the help.Regards from Brazil. :)

Related

How to specify file size using repartition() in spark

Im using pyspark and I have a large data source that I want to repartition specifying the files size per partition explicitly.
I know using the repartition(500) function will split my parquet into 500 files with almost equal sizes.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB to 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same file size per file per day irrespective of the number of files.
This will help me when running my job on this large dataset later on to avoid skewed executor times and shuffle times etc.
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?
You could consider writing your result with the parameter maxRecordsPerFile.
storage_location = //...
estimated_records_with_desired_size = 2000
result_df.write.option(
"maxRecordsPerFile",
estimated_records_with_desired_size) \
.parquet(storage_location, compression="snappy")

Optimising Spark read and write performance

I have around 12K binary files, each of 100mb in size and contains multiple compressed records with variables lengths. I am trying to find the most efficient way to read them, uncompress and then write back in parquet format. The cluster i have has is 6 nodes with 4 cores each.
At this moment with pseudocode below, it takes around 8 hrs to read all the files and writing back to parquet is very very slow.
def reader(file_name):
keyMsgList = []
with open(file_name, "rb") as f:
while True:
header = f.read(12)
if not header:
break
keyBytes = header[0:8]
msgLenBytes = header[8:12]
# conver keyBytes & msgLenBytes to int
message = f.read(msgLen)
keyMsgList.append((key, decode(message)))
return keyMsgList
files = os.listdir("/path/to/binary/files")
rddFiles = sc.parallelize(files, 6000)
df = spark.createDataFrame(rddFiles.flatMap(reader), schema)
df.repartition(6000).write.mode("append").partitionBy("key").parquet("/directory")
The rational behind choosing 6000 here sc.parallelize(files, 6000) is creating partitions each with 200 MB in size i.e. (12k files * 100mb size) / 200MB. Being the sequential nature of file content that is needs to read each of them byte by byte, not sure if read can be further optimised?
Similarly, when writing back to parquet, the number in repartition(6000) is to make sure data is distributed uniformly and all executors can write in parallel. However, it turns out be a very slow operation.
One solution is to increase the number of executors, which will improve the read performance but not sure if it will improve writes?
Looking for any suggestion on how can I improve the performance here?
Suggestion 1: do not use repartition but coalesce.
See here. You identified the bottleneck of the repartition operatio, this is because you have launched a full shuffle. With coalesce you won't do that. You will end up with N partitions also. They won't be as balanced as those you would get with repartition but does it matter ?
I would recommend you to favor coalesce rather than repartition
Suggestion 2: 6000 partitions is maybe not optimal
Your application runs with 6 nodes with 4 cores. You have 6000 partitions. This means you have around 250 partitions by core (not even counting what is given to your master). That's, in my opinion, too much.
Since your partitions are small (around 200Mb) your master probably spend more time awaiting anwsers from executor than executing the queries.
I would recommend you to reduce the number of partitions
Suggestion 3: can you use the DataFrame API ?
DataFrame API operations are generally faster and better than a hand-coded solution.
Maybe have a look at pyspark.sql.functions to see if you can find something there (see here). I don't know if it's relevent since I have not seen your data but that's a general recommendation I do from my experience.

Is there a way to maintain appropriate number of files while doing insert overwrite using spark?

I am constantly using "insert overwrite table table_name partition(partition_column) query"" to write data into my table but the problem here is the number of files generated.
so i started using spark.sql.shuffle.partitions property to fix the number of files.
Now the problem statement here is that there is less data in some partition and very huge data in some partitions. when this happens, when i choose my shuffle partitions as per my large partition data there are unnecessary small files created and if i choose shuffle partitions as per partitions with low data, job starts failing with memory issues.
Is there a good way to solve this?
You need to consider .repartition() in this case, As repartition results almost same size partitions which further increases processing times!
Need to comeup with number of partitions to the dataframe, based on dataframe count..etc and apply repartition.
Refer to this link to dynamically create repartition based on number of rows in dataframe.
The function you are looking for is Size Estimator, it will return the number of bytes your file is. Spark is horrendous when it comes to files and number of files. To control the number of files being output you are going to want to run the repartition command because the number of output files form Spark is directly associated with number of partitions the object has. For my example below I am taking the size of an arbitrary input data frame find the "true" number of partitions (the reason for the + 1 is Spark on longs and ints innately rounds down so 0 partitions would be impossible.
Hope this helps!
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator
val inputDF2 : Long = SizeEstimator.estimate(inputDF.rdd)
//find its appropiate number of partitions
val numPartitions : Long = (inputDF2/134217728) + 1
//write it out with that many partitions
val outputDF = inputDF.repartition(numPartitions.toInt)

Spark coalescing on the number of objects in each partition

We are starting to experiment with spark on our team.
After we do reduce job in Spark, we would like to write the result to S3, however we would like to avoid collecting the spark result.
For now, we are writing the files to Spark forEachPartition of the RDD, however this resulted in a lot of small files. We would like to be able to aggregate the data into a couple files partitioned by the number of objects written to the file.
So for example, our total data is 1M objects (this is constant), we would like to produce 400K objects file, and our current partition produce around 20k objects file (this varies a lot for each job). Ideally we want to produce 3 files, each containing 400k, 400k and 200k instead of 50 files of 20K objects
Does anyone have a good suggestion?
My thought process is to let each partition handle which index it should write it to by assuming that each partition will roughy produce the same number of objects.
So for example, partition 0 will write to the first file, while partition 21 will write to the second file since it will assume that the starting index for the object is 20000 * 21 = 42000, which is bigger than the file size.
The partition 41 will write to the third file, since it is bigger than 2 * file size limit.
This will not always result on the perfect 400k file size limit though, more of an approximation.
I understand that there is coalescing, but as I understand it coalesce is to reduce the number of partition based on the number of partition wanted. What I want is to coalesce the data based on the number of objects in each partition, is there a good way to do it?
What you want to do is to re-partition the files into three partitions; the data will be split approximately 333k records per partition. The partition will be approximate, it will not be exactly 333,333 per partition. I do not know of a way to get the 400k/400k/200k partition you want.
If you have a DataFrame `df', you can repartition into n partitions as
df.repartition(n)
Since you want a maximum number or records per partition, I would recommend this (you don't specify Scala or pyspark, so I'm going with Scala; you can do the same in pyspark) :
val maxRecordsPerPartition = ???
val numPartitions = (df.count() / maxRecordsPerPartition).toInt + 1
df
.repartition(numPartitions)
.write
.format('json')
.save('/path/file_name.json')
This will guarantee your partitions are less than maxRecordsPerPartition.
We have decided to just go with the number of files being generated and just making sure that each files contain less than 1 million line items

spark write parquet file. How can I specify the row groups size?

I am struggling to find how to specify the row group size of the parquet file writer in the Spark API.
I found one way to do this which is to use the fast parquet python module that has this option :
from fastparquet import write
write has the parameter:
row_group_offsets
Also, what is the optimal number for row_group size ?
Thanks to fast parquet, I did some experiments. Picking a row_groupsize of 1 million is ten times faster than 10 000 for instance. But if I pick more than 1 million, it starts to slow down my simple queries.
Thank you in advance for your help
Parquet parameters are part of the hadoop options and can be set before the parquet write command like this:
val sc : SparkContext // An existing SparkContext.
sc.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024 * 1024)
Thanks Roberto. It seems that also modifying the number of partition (which by default is 600) helped. Now I can see with parquet-tools that the block size of my parquet files has increased. I have 1 million row by block.
But loading my data and doing a simple count operation is still quite slow with spark.
The dataset I am talking about has only 4 millions rows and 15 columns

Resources