Spark write parquet file. How can I specify the row group size? - apache-spark

I am struggling to find how to specify the row group size for the Parquet file writer in the Spark API.
I found one way to do this, which is to use the fastparquet Python module, which has this option:
from fastparquet import write
write has the parameter row_group_offsets.
Also, what is the optimal row group size?
Thanks to fastparquet, I did some experiments. Picking a row group size of 1 million is ten times faster than 10,000, for instance. But if I pick more than 1 million, it starts to slow down my simple queries.
Thank you in advance for your help

Parquet parameters are part of the Hadoop options and can be set before the Parquet write command like this:
val sc : SparkContext // An existing SparkContext.
sc.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024 * 1024)
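For context, a minimal end-to-end sketch of this approach (the example data, app name, and output path are placeholders, not from the question; parquet.block.size is the row group size the writer buffers in memory before flushing each row group):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("row-group-size").getOrCreate()

// 128 MB is the Parquet default; larger row groups mean more memory buffered per writer task.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

val df = spark.range(1000000).toDF("id")   // placeholder data
df.write.mode("overwrite").parquet("/tmp/parquet_with_tuned_row_groups")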

Thanks Roberto. It seems that also modifying the number of partitions (which by default is 600) helped. Now I can see with parquet-tools that the block size of my Parquet files has increased: I have 1 million rows per block.
But loading my data and doing a simple count operation is still quite slow with Spark.
The dataset I am talking about has only 4 million rows and 15 columns.

Related

Create 1GB partitions Spark SQL

I'm trying to split my data into 1 GB files when writing to S3 using Spark. The approach I tried was to calculate the size of the DeltaTable in GB (the define_coalesce function), round it, and use that number to write to S3:
from delta.tables import DeltaTable

# Vacuum to leave 1 week of history
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.vacuum(168)
deltaTable.generate("symlink_format_manifest")

# Reading delta table and rewriting with coalesce to reach 1GB per file
df = spark.read.format('delta').load(f"s3a://{delta_table}")
coalesce_number = define_coalesce(delta_table)  # this function calculates the size of the delta in GB
df.coalesce(coalesce_number).write.format("delta").mode('overwrite').option('overwriteSchema', 'true').save(f"s3a://{delta_table}")
deltaTable = DeltaTable.forPath(spark, f"s3a://{delta_table}")
deltaTable.generate("symlink_format_manifest")
I'm trying it this way because our Delta is the open-source one and we don't have the optimize method built in.
I did some searching and found the spark.sql.files.maxPartitionBytes configuration in Spark, but some people said it was not solving their problem, and that this config partitions when reading, not when writing.
Any suggestions?
I understand your problem and what you are trying to do, but I am not sure what the output of your current solution is. If the partitions are still not equal to 1 GB, you may try to replace coalesce with repartition. Coalesce does not guarantee that the partitions are equal after the operation, so your formula may not work. If you know how many partitions you need on output, use repartition(coalesce_number) and it should create equal partitions with round robin.
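To illustrate the difference, a quick sketch (the partition count is arbitrary; the same calls exist in PySpark):
// coalesce(n) only merges existing partitions, so the resulting partitions
// (and output files) can stay skewed in size.
val merged = df.coalesce(10)

// repartition(n) does a full round-robin shuffle, producing roughly
// equal-sized partitions and therefore roughly equal-sized files.
val balanced = df.repartition(10)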
If the problem is with the function that calculates the dataset size (and so the number of partitions), I know two solutions:
You can cache the dataset and then take its size from the statistics. Of course this may be problematic and you have to spend some resources to do that. Something similar is done in the first answer here: How spark get the size of a dataframe for broadcast? A sketch of this approach is shown below.
You can calculate the count and divide it by the number of records you want in a single partition. The size of a single record depends on your schema and may be tricky to estimate, but it is a viable option to try.
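A rough Scala sketch of the first option, borrowing the statistics trick from that linked answer (the 1 GB target and variable names are assumptions; the stats API differs slightly across Spark versions):
// Materialize the cache so Spark computes in-memory statistics.
df.cache().count()

// Estimated size in bytes from the optimized plan's statistics (Spark 2.3+).
val sizeInBytes = spark.sessionState
  .executePlan(df.queryExecution.logical)
  .optimizedPlan.stats.sizeInBytes

// Number of ~1 GB output files to aim for.
val targetBytesPerFile = 1024L * 1024 * 1024
val numFiles = ((sizeInBytes / targetBytesPerFile) + 1).toInt

df.repartition(numFiles)  // then write as before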
Finally solved my problem. Since we are using Delta, I had the idea of reading the manifest file to find all the Parquet file names. After that, I sum the sizes of the Parquet files listed in the manifest, connecting to S3 with boto:
from boto.s3.connection import S3Connection  # S3Connection comes from the legacy boto package

def define_repartition(delta_table_path):
    conn = S3Connection()
    bk = conn.get_bucket(bucket)  # bucket is defined elsewhere
    # The manifest lists the s3a paths of the table's current parquet files
    manifest = spark.read.text(f's3a://{delta_table_path}_symlink_format_manifest/manifest')
    parquets = [data[0].replace(f's3a://{bucket}/', '') for data in manifest.select('value').collect()]
    # Sum the sizes of all parquet files referenced by the manifest
    size = 0
    for parquet in parquets:
        key = bk.lookup(parquet)
        size = size + key.size
    # Convert bytes to GB and round to get the target number of files
    return round(size / 1073741824)
Thank you all for the help. Regards from Brazil. :)

Spark: Generate txt files

I have data stored in Parquet format, and I want to generate a delimited text file from Spark with a limit of 100 rows per file. Is it possible to handle this from Spark notebooks?
I am building an ADF pipeline which triggers this notebook, and the expected output is a text file something like the format below. Please suggest possible ways to do this.
5431732167 899 1011381 1 teststring
5431732163 899 912 teststring
5431932119 899 108808 40 teststring
5432032116 899 1082223 40 teststring
I also need to process the batch of text files and load them into a database; please suggest options for doing this.
Thanks in advance.
Thanks,
Manoj.
This question appears to be a functional duplicate of: How to get 1000 records from dataframe and write into a file using PySpark?
Before running your job to write your CSV files, set maxRecordsPerFile, so in Spark SQL:
set spark.sql.files.maxRecordsPerFile = 100
You should be able to use maxRecordsPerFile with the CSV output. This will not guarantee that you will have only one file with fewer than 100 records, though, only that there will be no files with more than 100 rows. Spark writes in parallel, so this cannot be ensured across nodes.
df.write
  .option("maxRecordsPerFile", 100)
  .csv(outputPath)
If your data is very small, you can coalesce it to 1 partition and ensure that only one file has fewer than 100 rows, but then you lose the parallel processing speed advantage (most of your cluster will be unused during the last calculation and the write).
For writing to databases, the solution depends on the particular database. One example many databases support is JDBC; Spark can read/write data with it, see: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
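For the database-loading part, a hedged sketch of the JDBC write from that documentation page (the driver, URL, table and credentials below are placeholders):
// Placeholder connection details; adjust the JDBC URL, table and credentials
// for your database, and make sure the JDBC driver jar is on the classpath.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.target_table")
  .option("user", "my_user")
  .option("password", "my_password")
  .mode("append")
  .save()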

Is there a way to maintain appropriate number of files while doing insert overwrite using spark?

I am constantly using "insert overwrite table table_name partition(partition_column) query" to write data into my table, but the problem here is the number of files generated.
So I started using the spark.sql.shuffle.partitions property to fix the number of files.
Now the problem is that some partitions have very little data and some have huge amounts. If I choose my shuffle partitions based on the large partitions, unnecessary small files are created for the small ones, and if I choose shuffle partitions based on the partitions with little data, the job starts failing with memory issues.
Is there a good way to solve this?
You need to consider .repartition() in this case, as repartition results in partitions of almost the same size, which helps processing times!
You need to come up with the number of partitions for the dataframe, based on the dataframe count etc., and apply repartition.
Refer to this link to dynamically compute the repartition count based on the number of rows in the dataframe; a rough sketch of the idea is shown below.
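A rough sketch of that idea (the rows-per-file target is an assumed tuning knob; df and table_name stand in for your data and table):
// Assume roughly 1 million rows per output file; tune this for your row width.
val rowsPerFile = 1000000L
val totalRows = df.count()
val numFiles = math.max(1L, (totalRows + rowsPerFile - 1) / rowsPerFile).toInt

// Round-robin repartition gives evenly sized tasks. Note that with dynamic
// partition columns, each task may still write one file per Hive partition it touches.
df.repartition(numFiles)
  .write
  .mode("overwrite")
  .insertInto("table_name")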
The function you are looking for is SizeEstimator; it returns the number of bytes your object occupies. Spark is horrendous when it comes to files and the number of files. To control the number of files being output, you are going to want to run the repartition command, because the number of output files from Spark is directly associated with the number of partitions the object has. In my example below I take the size of an arbitrary input data frame and find the "true" number of partitions (the reason for the + 1 is that Spark innately rounds down on longs and ints, so 0 partitions would be impossible).
Hope this helps!
import org.apache.spark.util.SizeEstimator

// Estimated in-memory size of the DataFrame, in bytes
val inputDF2: Long = SizeEstimator.estimate(inputDF.rdd)
// Find its appropriate number of partitions, targeting ~128 MB (134217728 bytes) each;
// the + 1 guards against integer division rounding down to 0
val numPartitions: Long = (inputDF2 / 134217728) + 1
// Repartition so the write produces that many files
val outputDF = inputDF.repartition(numPartitions.toInt)

spark write to disk with N files less than N partitions

Can we write data to, say, 100 files, with 10 partitions in each file?
I know we can use repartition or coalesce to reduce the number of partitions. But I have seen some Hadoop-generated Avro data with many more partitions than files.
The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).
Now, having said that, when data is read back in it could be split into smaller chunks based on your configured split size, depending on the format and/or compression.
If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.
The following will result in 2 files being written out even though it's only got 1 partition:
val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")

Need less parquet files

I am doing the following process
rdd.toDF.write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
However, under each partition there are too many Parquet files, and each of them is very small, which makes loading all the Parquet files in my following steps very slow. Is there a better way to produce fewer, larger Parquet files under each partition?
You can repartition before save:
rdd.toDF.repartition($"Some Column").write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
I used to have this problem.
Actually you can't directly control the partitioning of the output files because it depends on what the executors do.
The way to work around it is to use the coalesce method to reduce the number of partitions to however many you want, but it's not an efficient approach, and you also need to set enough driver memory to handle this operation.
df.coalesce(numPartitions).write.partitionBy("yyyyy").parquet("xxxx")
I also faced this issue. The problem is that if you use coalesce, each partition gets the same number of Parquet files. Different partitions have different sizes, so ideally I need a different coalesce for each partition.
It's going to be really quite expensive if you open a lot of small files. Let's say you open 1k files and each file size is far below the value of your parquet.block.size.
Here are my suggestions:
Create a job that will first merge your input Parquet files so that you have a smaller number of files whose sizes are near or equal to parquet.block.size. The default block size is 128 MB, though it's configurable by updating parquet.block.size. Spark works best when your Parquet file size is near or equal to the value of parquet.block.size. The block size is the size of a row group being buffered in memory. A sketch of such a compaction pass is shown after this list.
Or update your Spark job to read only a limited number of files.
Or, if you have a big machine and/or resources, just do the right tuning.
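A minimal sketch of such a compaction job (the paths and the 128 MB target are assumptions, not from the question):
import org.apache.hadoop.fs.Path

val inputPath  = "s3a://my-bucket/my-table/"            // placeholder
val outputPath = "s3a://my-bucket/my-table_compacted/"  // placeholder

// Total on-disk size of the existing small files, via the Hadoop FileSystem API.
val fs = new Path(inputPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength

// Aim for output files near the default 128 MB parquet.block.size.
val targetBytes = 128L * 1024 * 1024
val numFiles = math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt

spark.read.parquet(inputPath)
  .repartition(numFiles)
  .write
  .mode("overwrite")
  .parquet(outputPath)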
Hive queries have a way to merge small files into larger ones. This is not available in Spark SQL. Also, reducing spark.sql.shuffle.partitions won't help with the Dataframe API.
I tried the solution below and it generated a smaller number of Parquet files (from 800 Parquet files down to 29).
Suppose the data is loaded into a dataframe df.
Create a temporary view and a temporary Hive table:
df.createOrReplaceTempView("tempTable")
spark.sql("CREATE TABLE test_temp LIKE test")
spark.sql("INSERT INTO TABLE test_temp SELECT * FROM tempTable")
The test_temp table will contain small parquet files.
Populate the final Hive table from the temporary table:
spark.sql("INSERT INTO test SELECT * FROM test_temp")
The final table will contain fewer files. Drop the temporary table after populating the final table.
