How to repartition into a fixed number of partitions per column in Spark? - apache-spark

I need to read data from one Hive table and insert it into another Hive table. The schema of both tables is the same. The table is partitioned by date & country, and the size of each partition is ~500 MB. I want to insert this data into a new table where the files inside each partition are roughly 128 MB (i.e., 4 files).
Step 1: Read data from the source table in Spark.
Step 2: Repartition by columns (country, date) and set the number of partitions to 4.
df.repartition(4, col("country_code"), col("record_date"))
I am getting only 1 partition per country_code & record_date.

Whatever you are doing in step 2 will repartition your data into 4 partitions in memory, but it won't save 4 files when you do df.write.
In order to do that, you can use the code below:
df.repartition(4, col("country_code"), col("record_date"))
  .write
  .partitionBy("country_code", "record_date")
  .mode(SaveMode.Append)
  .saveAsTable("TableName")
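If this still yields a single file per (country_code, record_date) directory (all rows of a given pair hash into the same in-memory partition), one commonly used variation, not from the original answer, is to add a small random salt to the repartition expression so each pair is spread across roughly 4 shuffle partitions, and hence roughly 4 files:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, floor, rand}

// Salted repartition: each (country_code, record_date) pair lands in up to 4
// shuffle partitions, so the partitioned write below produces up to 4 files per directory
df.repartition(col("country_code"), col("record_date"), floor(rand() * 4))
  .write
  .partitionBy("country_code", "record_date")
  .mode(SaveMode.Append)
  .saveAsTable("TableName")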

Related

Spark Generates A Lot Of Tasks Although Partition Number Is 1 (Pyspark)

My code:
df = self.sql_context.sql(f"select max(id) as id from {table}")
return df.collect()[0][0]
My table is partitioned by id; it has 100M records but only 3 distinct ids.
I expected this query to work with 1 task and scan just the partition column (id).
I don't understand why I get 691 tasks for the collect line with just 3 partitions.
I guess the query is executing a full scan on the table, but I can't figure out why it doesn't just scan the metadata.
Your df contains the result of an aggregation over the entire table; it has only one row (with a single field holding the max(id)), which is why it has only 1 partition.
But the original table DataFrame may have many partitions (or only 1 partition whose computation needs ~600 stages, triggering 1 task per stage, which is not that common).
Without details on your parallelism configuration, input source type, and transformations, it is not easy to help more!
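If the goal is only the largest value of the partition column, one way to touch metadata alone (an alternative sketch, not from the answer above, assuming the table is a Hive table partitioned by id; the name my_db.my_table is a placeholder) is to list the partitions instead of scanning the data:

// SHOW PARTITIONS reads only the metastore; each returned row looks like "id=3"
val maxId = spark.sql("SHOW PARTITIONS my_db.my_table")
  .collect()
  .map(_.getString(0).split("=")(1).toLong)
  .max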

How to partition a SQL Server table, where the partition column is an integer but in date format (20170101 to 20200306), using pyspark?

I have an integer column which is actually a date, like this:
20170101
20170103
20170102
.....
20200101
There are around 10 million rows in each partition.
How do I read the table using this field as the partition column in PySpark?
Run Spark SQL:
spark.sql("select * from table where intPartitionColumn=20200101")
This will push the partition filters down to the source so that only the directory intPartitionColumn=20200101 is read.
You can also check the physical plan (PartitionFilters & PushedFilters) to verify this.
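A quick way to do that check (a sketch, assuming the table is partitioned by intPartitionColumn; the exact plan text varies by Spark version):

val df = spark.sql("select * from table where intPartitionColumn = 20200101")
df.explain(true)
// In the scan node of the physical plan, look for a line similar to:
// PartitionFilters: [isnotnull(intPartitionColumn), (intPartitionColumn = 20200101)]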

How many partitions does Spark create when loading a Hive table?

Whether it is a Hive table or an HDFS file, when Spark reads the data and creates a DataFrame, I was thinking that the number of partitions in the RDD/DataFrame would be equal to the number of part files in HDFS. But when I did a test with a Hive external table, I could see that the number came out different from the number of part files. The number of partitions in the DataFrame was 119. The table was a Hive partitioned table with 150 part files in it, with a minimum file size of 30 MB and a maximum of 118 MB. So what decides the number of partitions?
You can control how many bytes Spark packs into a single partition by setting spark.sql.files.maxPartitionBytes. The default value is 128 MB, see Spark Tuning.
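For illustration, a minimal sketch (the 256 MB value and the table name are placeholders):

// Raise the per-partition byte target before reading, then check the partition count
spark.conf.set("spark.sql.files.maxPartitionBytes", (256L * 1024 * 1024).toString)
val df = spark.table("my_hive_table")
println(df.rdd.getNumPartitions)   // fewer, larger partitions than with the 128 MB default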
I think this link answers my question. The number of partitions depends on the number of splits, and the splits depend on the Hadoop InputFormat.
https://intellipaat.com/community/7671/how-does-spark-partition-ing-work-on-files-in-hdfs
Spark will read the data with a block size of 128 MB per block.
Say your Hive table size is approximately 14.8 GB; Spark will divide the Hive table data into 128 MB blocks, which results in 119 partitions.
On the other hand, your Hive table is partitioned, so the partition column has 150 unique values.
So the number of part files in Hive and the number of partitions in Spark are not linked.
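The arithmetic behind the 119 figure, with the sizes treated as approximate:

val tableSizeMB = 14.8 * 1024          // ~15155 MB
val blockSizeMB = 128                  // default split / maxPartitionBytes size
math.ceil(tableSizeMB / blockSizeMB)   // ~119 partitions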

How to speed up a Spark DataFrame write to a Hive table in ORC format

thirdCateBrandres.createOrReplaceTempView("tempTable2")
sql("insert overwrite table temp_cate3_Brand_List select * from tempTable2")
In the code above, thirdCateBrandres is a Spark DataFrame registered as a temp table and then written to the table temp_cate3_Brand_List. The table has 3 billion rows with 7 fields, and the data size is about 4 GB in ORC+SNAPPY format.
This code took about 20 minutes.
How can I speed up the program?
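One common first thing to try (a sketch, not an answer from the original thread; the partition count of 64 is an arbitrary illustration) is to raise the write parallelism by repartitioning before registering the temp view:

// More partitions means more concurrent ORC writer tasks, assuming free executor slots
thirdCateBrandres
  .repartition(64)
  .createOrReplaceTempView("tempTable2")
sql("insert overwrite table temp_cate3_Brand_List select * from tempTable2")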

Spark-Hive partitioning

The Hive table was created with 4 buckets.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the hive table, it has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4 as that leads to a very very slow system. Also, I have tried the DataFrame.coalesce method but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables using spark-sql. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the specified CREATE TABLE statement. Assuming that it does exist and you know the clustering column, you could add
.bucketBy(numBuckets, colName)
when using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
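A rough sketch of that approach (it writes Spark's own bucketing layout, which is not Hive-compatible, via saveAsTable to a new, hypothetical table hourly_suspect_bucketed with a hypothetical clustering column cells):

import org.apache.spark.sql.SaveMode

hourlies.write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .partitionBy("traffic_date_hour")
  .bucketBy(4, "cells")              // bucketBy requires saveAsTable rather than insertInto
  .saveAsTable("hourly_suspect_bucketed")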
