Spark DataFrame repartitioning causing uneven partitions - apache-spark

I am using Spark's repartition to change the number of partitions in a DataFrame.
When writing the data after repartitioning, I see that parquet files of very different sizes have been created.
Here is the code I am using to repartition:
df.repartition(partitionCount).write.mode(SaveMode.Overwrite).parquet("/test")
Most of the output files are only a few KB in size, while some are around 100 MB, which is the size I want to keep per partition.
Here is a sample:
20.2 K /test/part-00010-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
20.2 K /test/part-00011-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
99.9 M /test/part-00012-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
Now if I read one of the 20.2 K parquet files and run a count, the result is 0. The same count on the 99.9 M file gives a non-zero result.
As per my understanding, repartition on a DataFrame does a full shuffle and tries to keep each partition the same size. The example above contradicts that.
Could someone please help me here?
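One way to narrow this down is to check whether the shuffle itself produces empty partitions, before any parquet writing is involved. A minimal diagnostic sketch, using the df and partitionCount from the question and the built-in spark_partition_id() function:

import org.apache.spark.sql.functions.spark_partition_id

// Rows per shuffle partition, checked before the write; partitions with zero
// rows simply won't show up in the output.
df.repartition(partitionCount)
  .groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show(partitionCount, truncate = false)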

Related

Weird data skew after loading a parquet file in Spark

I am reading some data from cloud storage in parquet format into a Spark DataFrame using PySpark (~10 executors, 4-5 cores per executor). spark.sql.files.maxPartitionBytes is set before loading so that I can control the size of each partition and, in turn, the number of partitions I get. After loading the data, I apply Spark functions/UDFs to it, so there should be no shuffle happening (no join or groupBy). I expect each partition to have a relatively equal size after loading, but in reality the loaded data is highly skewed.
When looking in YARN, the min, 25th-percentile, median, and 75th-percentile partition sizes are all 21 B (basically empty partitions), while the max partition size is, say, 100 MB, where all the rows end up.
What I am doing now to work around this skew is a df.repartition() right after loading to distribute the data evenly, and it works well. But this introduces one more shuffle of the data, which is not ideal.
So the question is: why are the partitions so highly skewed by default after loading? Is there a way to load the data with relatively even partition sizes and skip that df.repartition() step?
Thanks!
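For reference, a sketch of the setup described above, in Scala since that is what the rest of this page uses (the question uses PySpark); the path and the 128 MB value are placeholders:

// Cap the input split size before the read, then check how many partitions
// the load produced.
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")   // 128 MB

val loaded = spark.read.parquet("/path/to/input")   // placeholder path

println(s"Partitions after load: ${loaded.rdd.getNumPartitions}")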

Spark Performance Issue vs Hive

I am working on a pipeline that will run daily. It joins two tables, say x and y (approx. 18 MB and 1.5 GB respectively), and loads the output of the join into a final table.
Here are the facts about the environment:
For table x:
Data size: 18 MB
Number of files in a partition : ~191
file type: parquet
For table y:
Data size: 1.5 GB
Number of files in a partition : ~3200
file type: parquet
Now the problem is:
Hive and Spark are giving the same performance (the time taken is the same).
I tried different combinations of resources for the Spark job.
e.g.:
executors:50 memory:20GB cores:5
executors:70 memory:20GB cores:5
executors:1 memory:20GB cores:5
All three combinations give the same performance. I am not sure what I am missing here.
I also tried broadcasting the small table 'x' to avoid a shuffle while joining, but it did not improve performance much.
One key observation is:
70% of the execution time is spent reading the big table 'y', and I guess this is due to the large number of files per partition.
I am not sure how Hive is achieving the same performance.
Kindly suggest.
I assume you are comparing Hive on MR vs Spark; please let me know if that is not the case, because Hive (on Tez or Spark) vs Spark SQL will not differ vastly in terms of performance.
I think the main issue is that there are too many small files.
A lot of CPU time is consumed by the I/O itself, so you never really get to see the processing power of Spark.
My advice is to coalesce the Spark DataFrames immediately after reading the parquet files: coalesce the 'x' DataFrame into a single partition and the 'y' DataFrame into 6-7 partitions.
After doing the above, perform the join (a broadcast hash join).
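A rough sketch of that suggestion; the paths, the join key, and the exact partition counts are illustrative:

import org.apache.spark.sql.functions.broadcast

val x = spark.read.parquet("/warehouse/table_x").coalesce(1)   // ~18 MB table
val y = spark.read.parquet("/warehouse/table_y").coalesce(7)   // ~1.5 GB table

// Broadcasting the small side keeps the big table from being shuffled for the join.
val joined = y.join(broadcast(x), Seq("join_key"))

joined.write.mode("overwrite").parquet("/warehouse/final_table")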

Spark coalescing on the number of objects in each partition

We are starting to experiment with Spark on our team.
After we run a reduce job in Spark, we would like to write the result to S3; however, we would like to avoid collecting the result to the driver.
For now, we are writing the files inside foreachPartition on the RDD, but this results in a lot of small files. We would like to aggregate the data into a few files, partitioned by the number of objects written to each file.
For example, our total data is 1M objects (this is constant); we would like to produce files of 400k objects each, while our current partitions produce files of around 20k objects (this varies a lot from job to job). Ideally we want to produce 3 files containing 400k, 400k, and 200k objects, instead of 50 files of 20k objects each.
Does anyone have a good suggestion?
My thought process is to let each partition decide which file index it should write to, assuming that each partition produces roughly the same number of objects.
So, for example, partition 0 will write to the first file, while partition 21 will write to the second file, since it assumes the starting index for its objects is 20000 * 21 = 420,000, which is bigger than the 400k file size.
Partition 41 will write to the third file, since its starting index (820,000) is bigger than 2x the file size limit.
This will not always hit the 400k file size exactly; it is more of an approximation.
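A minimal sketch of that indexing idea (Scala; rdd stands in for the reduced RDD, and the 20k/400k numbers are the ones from the example above):

val recordsPerPartition = 20000L    // rough per-partition count from the example
val targetRecordsPerFile = 400000L  // desired objects per output file

// Tag each record with the output file index its partition maps to.
val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, records) =>
  val fileIndex = (partitionIndex * recordsPerPartition) / targetRecordsPerFile
  records.map(record => (fileIndex, record))
}
// Records can then be grouped or partitioned by fileIndex and each group written as one file.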
I understand there is coalesce, but as I understand it, coalesce reduces the number of partitions to a requested partition count. What I want is to coalesce the data based on the number of objects in each partition. Is there a good way to do that?
What you want to do is repartition the data into three partitions; the data will be split into approximately 333k records per partition. The split is approximate; it will not be exactly 333,333 per partition. I do not know of a way to get the 400k/400k/200k split you want.
If you have a DataFrame df, you can repartition it into n partitions with:
df.repartition(n)
Since you want a maximum number of records per partition, I would recommend this (you don't specify Scala or PySpark, so I'm going with Scala; you can do the same in PySpark):
val maxRecordsPerPartition: Long = ???   // supply your per-partition cap here
val numPartitions = (df.count() / maxRecordsPerPartition).toInt + 1
df
  .repartition(numPartitions)
  .write
  .format("json")
  .save("/path/file_name.json")
This will guarantee that each partition holds fewer than maxRecordsPerPartition records.
We have decided to just go with the number of files being generated and make sure that each file contains fewer than 1 million line items.

spark write to disk with N files less than N partitions

Can we write data to say 100 files, with 10 partitions in each file?
I know we can use repartition or coalesce to reduce the number of partitions. But I have seen some Hadoop-generated Avro data with many more partitions than the number of files.
The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).
Now, having said that, when the data is read back in it could be split into smaller chunks based on your configured split size, depending on the format and/or compression.
If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.
The following will result in 2 files being written out even though it's only got 1 partition:
val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")

Spark DataFrames with Parquet and Partitioning

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the block size would have been much larger, meaning the partitions would be larger as well.
So let me clarify with compressed parquet (these numbers are not fully accurate):
1 GB parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, making it 25 blocks / 25 partitions. But unless you repartition the 1 GB parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
1 block = 1 partition for Spark
1 core operates on 1 partition
A Spark DataFrame doesn't load parquet files into memory; it uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
Spark 1.5 DataFrames partition a parquet file as follows:
1 partition per HDFS block
If the HDFS block size is less than the Parquet block size configured in Spark, a partition will be created from multiple HDFS blocks, such that the total size of the partition is no less than the Parquet block size
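A sketch of the 5-block scenario from the question, using the SparkSession API used elsewhere on this page (the path and partition counts are illustrative): read the compressed parquet, check the partition count, and repartition if more parallelism is wanted, keeping in mind that repartitioning costs one extra shuffle.

val df = spark.read.parquet("/data/one_gb_file.parquet")   // placeholder path

println(s"Partitions from the read: ${df.rdd.getNumPartitions}")   // e.g. 5

val wider = df.repartition(25)   // target from the question's 25-block estimate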
I saw the other answer, but I thought I could clarify this a bit more. If you are reading Parquet from a POSIX filesystem, then you can increase the number of parallel reads just by having more workers in Spark.
But to control the balance of data that goes to the workers, you can use the hierarchical (partitioned) directory structure of the Parquet files, and later have the workers point to different partitions or parts of the Parquet dataset. This gives you control over how much data goes to each worker according to the domain of your dataset (if, by balancing data across workers, you mean that an equal batch of data per worker is not efficient for your case).
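A sketch of pointing a job at parts of a hierarchically partitioned Parquet dataset; the directory layout and the 'region' column are illustrative:

// Assumed layout:
//   /data/events/region=us/part-*.parquet
//   /data/events/region=eu/part-*.parquet
val usOnly = spark.read
  .parquet("/data/events")
  .where("region = 'us'")   // partition pruning: only the region=us directories are read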
