Whether the source is a Hive table or an HDFS file, I assumed that when Spark reads the data and creates a DataFrame, the number of partitions in the RDD/DataFrame equals the number of part files in HDFS. But when I tested this with a Hive external table, the number of partitions differed from the number of part files: the DataFrame had 119 partitions, while the table, a Hive partitioned table, had 150 part files ranging from 30 MB to 118 MB each. So what decides the number of partitions?
You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB, see Spark Tuning.
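To make the role of spark.sql.files.maxPartitionBytes more concrete, here is a minimal pure-Python sketch of how a file-based source might turn files into read partitions. It is modeled on my understanding of Spark's file-source logic and is not Spark's actual code: the function names are hypothetical, the 4 MB open cost is assumed to mirror the default of spark.sql.files.openCostInBytes, and details differ between Spark versions and file formats.

```python
MB = 1024 * 1024

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost=4 * MB):
    """Target size of one read partition (assumed formula, see lead-in)."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def pack_files(file_sizes, split_bytes, open_cost=4 * MB):
    """Cut files into chunks of at most split_bytes, then pack the chunks greedily."""
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split_bytes, size - offset))
            offset += split_bytes
    chunks.sort(reverse=True)  # largest chunks first

    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        # Close the current partition if this chunk would overflow it.
        if current and current_size + chunk > split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk + open_cost
    if current:
        partitions.append(current)
    return partitions

# Four files (100 MB plus three of 30 MB) end up in two read partitions.
files = [100 * MB, 30 * MB, 30 * MB, 30 * MB]
split = max_split_bytes(files, default_parallelism=1)
partitions = pack_files(files, split)
print(len(files), "files ->", len(partitions), "partitions")
```

In this sketch small files share a partition and files larger than the split size are cut into several chunks, which is one way the partition count can end up different from the file count.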
I think this link answers my question. The number of partitions depends on the number of input splits, and the splits depend on the Hadoop InputFormat.
https://intellipaat.com/community/7671/how-does-spark-partition-ing-work-on-files-in-hdfs
Spark reads the data in blocks of 128 MB each.
If your Hive table was approx. 14.8 GB, Spark divides the table data into 128 MB blocks, which results in 119 partitions.
On the other hand, your Hive table is partitioned, so the partition column has 150 unique values.
So the number of part files in Hive and the number of partitions in Spark are not linked.
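The arithmetic behind the 119 figure can be checked with a few lines of Python. This assumes the approx. 14.8 GB total from the answer and a 128 MB partition size; the variable names are only illustrative.

```python
import math

table_size_mb = 14.8 * 1024   # approx. table size from the answer, in MB
partition_size_mb = 128       # assumed partition size (the 128 MB default)

# 15155.2 MB / 128 MB = 118.4, rounded up to whole partitions
num_partitions = math.ceil(table_size_mb / partition_size_mb)
print(num_partitions)
```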
df = spark.read.parquet('path')
df.rdd.getNumPartitions()
This gives the number of partitions, but I would also like to know more, such as the size of each partition of this DataFrame.
How do I get it?
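One possible way to inspect the partitions, sketched under the assumption that row counts are an acceptable proxy for size: glom() turns each partition into a list, so mapping len over the result yields the number of rows in each partition. The helper name partition_row_counts is hypothetical; rdd, glom, map and collect are standard PySpark calls.

```python
def partition_row_counts(df):
    """Return a list with the number of rows in each partition of df.

    glom() materializes each partition as a list on the executors,
    so this can be memory-hungry for very large partitions.
    """
    return df.rdd.glom().map(len).collect()

# Usage with the DataFrame from the question:
#   df = spark.read.parquet('path')
#   print(partition_row_counts(df))
```

This reports rows rather than bytes, so it shows skew between partitions but not their exact size.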
I need to read data from one Hive table and insert it into another Hive table. The schema of both tables is the same. The table is partitioned by date and country. The size of each partition is ~500 MB. I want to insert this data into a new table where the files inside each partition are roughly 128 MB (i.e., 4 files).
Step 1: Read data from the source table in Spark.
Step 2: Repartition by the columns (country, date) with the number of partitions set to 4.
df.repartition(4, col("country_code"), col("record_date"))
I am getting only 1 partition per country_code & record_date.
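Step 2 repartitions your data into 4 partitions in memory, but that alone will not give you 4 files per table partition when you call df.write.
To write into the partition directories, also call partitionBy on the writer. Note that partitionBy takes column names as strings, not Column objects:
df.repartition(4, col("country_code"), col("record_date"))
  .write
  .partitionBy("country_code", "record_date")
  .mode(SaveMode.Append)
  .saveAsTable("TableName")
Because the repartition hashes on the same two columns, all rows for a given country_code and record_date still land in one in-memory partition, so each directory receives a single file. To get several files per directory, the repartition expressions need an additional value that varies within each table partition, for example a random salt.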
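One reason a repartition on the partition columns yields a single file per directory is that hash partitioning sends every row with the same key to the same in-memory partition. The small pure-Python simulation below illustrates this, and how an extra salt value spreads one key across several partitions. It uses crc32 as a stand-in for Spark's hash function, which is an assumption made for illustration; the sample rows and the partition count of 16 are made up.

```python
import zlib
from collections import defaultdict

def partition_of(key_parts, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's real hash)."""
    data = "|".join(str(p) for p in key_parts).encode("utf-8")
    return zlib.crc32(data) % num_partitions

rows = [("US", "2020-01-01", i) for i in range(1000)] + \
       [("IN", "2020-01-01", i) for i in range(1000)]

plain = defaultdict(set)   # partitions hit when hashing on (country, date)
salted = defaultdict(set)  # partitions hit when a salt 0..3 is added to the key
for country, date, i in rows:
    key = (country, date)
    plain[key].add(partition_of(key, 16))
    salted[key].add(partition_of(key + (i % 4,), 16))

for key in plain:
    print(key, "plain:", len(plain[key]), "salted:", len(salted[key]))
```

Without the salt every key maps to exactly one partition; with it, a key can reach up to four (fewer if two salted keys happen to collide). In Spark the analogous step would be to pass an extra expression to repartition, such as a column derived from rand(), which is one common approach rather than something taken from the answer.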
I am using repartition on columns to store the data in Parquet. But I see that the number of Parquet partition files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
When I repartition the RDD, write the data to a Parquet partition, and then read it back, is there any condition under which the number of RDD partitions will be the same on read and write?
How is bucketing a DataFrame by a column id different from repartitioning a DataFrame by the same column id?
When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
You are asking about a couple of things here: partitioning, bucketing, and balancing of data.
Partitioning:
Partitioning data is often used for distributing load horizontally; this has performance benefits and helps organize data in a logical fashion.
Partitioning a table changes how the persisted data is structured: subdirectories are created that reflect the partitioning structure.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.
In Spark, this is done with df.write.partitionBy(column*), which groups data by the partitioning columns into the same subdirectory.
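As an illustration of the subdirectory structure, the sketch below builds the kind of Hive-style path (column=value per level) that a partitioned write produces. The helper partition_dir and the base path /warehouse/events are hypothetical and only show the naming convention; Spark creates these directories itself.

```python
def partition_dir(base_path, **partition_values):
    """Build a Hive-style partition directory such as base/col=value/..."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([base_path.rstrip("/")] + parts)

path = partition_dir("/warehouse/events",
                     country_code="US", record_date="2020-01-01")
print(path)  # /warehouse/events/country_code=US/record_date=2020-01-01
```

A query that filters on country_code or record_date only has to read the matching directories, which is where the performance benefit comes from.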
Bucketing:
Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data set is hashed into a user-defined number of buckets (files).
It is comparable to Hive's CLUSTERED BY ... INTO n BUCKETS.
In Spark, this is done with df.write.bucketBy(n, column*), which groups data by the bucketing columns into the same bucket. The number of buckets is controlled by n; the number of files written can be higher, since each write task may produce its own file per bucket.
Repartition:
It returns a new DataFrame balanced evenly, based on the given partitioning expressions, into the given number of in-memory partitions. The resulting DataFrame is hash partitioned.
Spark manages data in these partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.
In Spark, this is done with df.repartition(n, column*), which groups data by the partitioning columns into the same in-memory partition. Note that no data is persisted to storage; this is just internal balancing of data based on constraints similar to bucketBy.
Tl;dr
1) I am using repartition on columns to store the data in Parquet. But I see that the number of Parquet partition files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
repartition correlates with bucketBy, not partitionBy. The number of partitioned files is governed by other configs such as spark.sql.shuffle.partitions and spark.default.parallelism.
2) When I repartition the RDD, write the data to a Parquet partition, and then read it back, is there any condition under which the number of RDD partitions will be the same on read and write?
At read time, the number of partitions does not depend on how the data was repartitioned before writing. It is derived from the file sizes and from settings such as spark.sql.files.maxPartitionBytes and spark.default.parallelism.
3) How is bucketing a DataFrame by a column id different from repartitioning a DataFrame by the same column id?
They work similarly, except that bucketing is a write operation and is used for persistence.
4) When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
Repartitioning both datasets happens in memory; if one or both of the datasets are persisted, then look into bucketBy as well.
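To see why hashing both sides of a join on the same key helps, here is a small pure-Python sketch. It assumes both datasets use the same hash function and the same number of buckets; crc32 is a stand-in rather than Spark's real hash, and the orders/users data is made up. Under those assumptions, rows with equal ids always land in the same bucket index, so each pair of buckets can be joined locally without moving rows between buckets.

```python
import zlib

def bucket_of(key, num_buckets):
    """Deterministic stand-in for a bucketing hash (not Spark's real hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, num_buckets):
    """Group (id, value) rows into num_buckets lists by hashing the id."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets

orders = [(i, f"order-{i}") for i in range(20)]
users = [(i, f"user-{i}") for i in range(20)]

order_buckets = bucketize(orders, 4)
user_buckets = bucketize(users, 4)

# Join bucket by bucket: matching ids are always in the same bucket index.
joined = []
for ob, ub in zip(order_buckets, user_buckets):
    names = dict(ub)
    joined.extend((oid, order, names[oid]) for oid, order in ob if oid in names)

print(len(joined))  # 20: every order found its user within its own bucket
```

With repartition this co-location is recomputed in memory for each job, whereas bucketBy stores it with the table so later reads can reuse it.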
Example: assume we have an input RDD that is filtered in the second step. I want to calculate the data size of the filtered RDD and work out how many partitions are needed when repartitioning, assuming a block size of 128 MB.
This will let me pass the number of partitions to the repartition method.
InputRDD = sc.textFile("sample.txt")
FilteredRDD = InputRDD.filter(some_filter_condition)  # some_filter_condition: placeholder predicate
FilteredRDD.repartition(XX)
Q1. How do I calculate the value of XX?
Q2. What is the similar approach for Spark SQL/DataFrame?
The block size of 128 MB only comes into the picture when reading data from or writing data to HDFS. Once the RDD is created, the data is in memory or spilled to disk, depending on the executor RAM size.
You can't calculate the data size without calling the collect() action on the filtered RDD, and that is not recommended.
The maximum partition size is 2 GB; you can choose the number of partitions based on cluster size or data model.
df.repartition(col)
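For Q1, one rough way to pick XX without collecting the whole RDD is to estimate the total size from a small sample and the row count. The sketch below is only an estimate under stated assumptions: the function estimate_num_partitions is hypothetical, it treats the UTF-8 length of text lines as the data size (in-memory size in Spark can differ considerably), and it assumes a 128 MB target per partition. In Spark the sample could come from FilteredRDD.takeSample and the count from FilteredRDD.count(), both of which run a job.

```python
import math

def estimate_num_partitions(sample_lines, total_count,
                            target_bytes=128 * 1024 * 1024):
    """Estimate a partition count from sampled lines and the total row count."""
    if not sample_lines or total_count == 0:
        return 1
    avg_bytes = sum(len(line.encode("utf-8")) for line in sample_lines) / len(sample_lines)
    estimated_total = avg_bytes * total_count
    return max(1, math.ceil(estimated_total / target_bytes))

# Ten sampled lines of 100 bytes each, 10 million rows in total:
# about 1e9 bytes / 128 MB = 7.45, rounded up to 8.
sample = ["x" * 100] * 10
xx = estimate_num_partitions(sample, total_count=10_000_000)
print(xx)
```

The result can then be passed as the argument to repartition in place of XX.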
thirdCateBrandres.createOrReplaceTempView("tempTable2")
sql("insert overwrite table temp_cate3_Brand_List select * from tempTable2")
In the code above, thirdCateBrandres is a Spark DataFrame that is registered as a temp table and then written to the table temp_cate3_Brand_List. The table has 3 billion rows with 7 fields, and the data size is about 4 GB in ORC+SNAPPY format.
This code took about 20 minutes.
How can I speed up the program?