I'm trying to read a config file with spark.read.textFile, which basically contains my table list. My task is to iterate through the table list and convert each table from Avro to ORC format. Please find below my code snippet, which does this logic.
val tableList = spark.read.textFile("tables.txt")
tableList.collect().foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})
Please find my configurations below
DriverMemory: 4GB
ExecutorMemory: 10GB
NoOfExecutors: 5
Input DataSize: 45GB
My question here is: will this execute on the Executor or the Driver? Will this throw an Out of Memory error? Please comment with your suggestions.
Re:
Will this execute on the Executor or the Driver?
Once you call tableList.collect(), the contents of 'tables.txt' are brought to the Driver application. As long as that is well within the Driver memory, it should be alright.
However, the save operation on the DataFrame is executed on the executors.
Re:
Will this throw an Out of Memory error?
Have you faced one? IMO, unless your tables.txt is too huge, you should be alright. I am assuming the 45 GB input data size refers to the data in the tables mentioned in tables.txt.
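To make the split concrete, here is the same snippet from the question (only the quotes fixed), annotated with where each part runs. This is just an illustration of the point above, not a change to the logic:
// Runs on the driver: collect() pulls the table *names* from tables.txt into a local array.
val tableList = spark.read.textFile("tables.txt").collect()

tableList.foreach { tblName =>
  // The loop body runs on the driver, but it only holds table names, not table data.
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  // The read/convert/write of the actual 45 GB of table data runs as distributed jobs
  // on the executors; that data never has to fit in driver memory.
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
}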
Hope this helps.
I would suggest eliminating the collect, since it is an action and therefore pulls data into driver memory. You can try something like this:
val tableList = spark.read.textFile("tables.txt")
tableList.foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})
I am new to Spark distributed development. I'm attempting to optimize my existing Spark job which takes up to 1 hour to complete.
Infrastructure:
EMR [10 instances of r4.8xlarge (32 cores, 244GB)]
Source Data: 1000 .gz files in S3 (~30MB each)
Spark Execution Parameters [Executors: 300, Executor Memory: 6gb, Cores: 1]
In general, the Spark job performs the following:
private def processLines(lines: RDD[String]): DataFrame = {
val updatedLines = lines.mapPartitions(row => ...)
spark.createDataFrame(updatedLines, schema)
}
// Read S3 files and repartition() and cache()
val lines: RDD[String] = spark.sparkContext
.textFile(pathToFiles, numFiles)
.repartition(2 * numFiles) // double the parallelism
.cache()
val numRawLines = lines.count()
// Custom process each line and cache table
val convertedLines: DataFrame = processLines(lines)
convertedLines.createOrReplaceTempView("temp_tbl")
spark.sqlContext.cacheTable("temp_tbl")
val numRows = spark.sql("select count(*) from temp_tbl").collect().head().getLong(0)
// Select a subset of the data
val myDataFrame = spark.sql("select a, b, c from temp_tbl where field = 'xxx' ")
// Define # of parquet files to write using coalesce
val numParquetFiles = (numRows / 1000000).toInt
var lessParts = myDataFrame.rdd.coalesce(numParquetFiles)
var lessPartsDataFrame = spark.sqlContext.createDataFrame(lessParts, myDataFrame.schema)
lessPartsDataFrame.createOrReplaceTempView("my_view")
// Insert data from view into Hive parquet table
spark.sql("insert overwrite destination_tbl
select * from my_view")
lines.unpersist()
The app reads all S3 files => repartitions to twice the number of files => caches the RDD => custom-processes each line => creates a temp view/cached table => counts the number of rows => selects a subset of the data => decreases the number of partitions => creates a view of the subset of data => inserts into the Hive destination table using the view => unpersists the RDD.
I am not sure why it takes a long time to execute. Are the spark execution parameters incorrectly set or is there something being incorrectly invoked here?
Before looking at the metrics, I would try the following change to your code.
private def processLines(lines: DataFrame): DataFrame = {
lines.mapPartitions(row => ...)
}
val convertedLinesDf = spark.read.text(pathToFiles)
.filter("field = 'xxx'")
.cache()
val numLines = convertedLinesDf.count() // the dataset gets materialized in memory here; it takes time
// Select a subset of the data, but it will be fast if you have enough memory
// Just use Dataframe API
val myDataFrame = convertedLinesDf.transform(processLines).select("a","b","c")
//coalesce here without converting to RDD, experiment what best
myDataFrame.coalesce(<desired_output_files_number>)
  .write.mode(SaveMode.Overwrite)
  .saveAsTable("destination_tbl")
Caching is useless if you don't actually need the row count (otherwise the data is only read once), and it takes memory and adds GC pressure.
Caching the table may consume more memory and add more GC pressure.
Converting a DataFrame to an RDD is costly, as it implies ser/deser operations.
Not sure what you are trying to do with val numParquetFiles = numRows / 1000000 and repartition(2 * numFiles). With your setup, 1000 files of ~30 MB each will give you 1000 partitions, which could be fine as is. Calling repartition and coalesce may trigger a shuffle, which is costly (coalesce may not trigger a shuffle; see the short sketch below).
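A small sketch of those last two points, assuming pathToFiles as in the question (the numbers are illustrative):
val df = spark.read.text(pathToFiles)
println(df.rdd.getNumPartitions)          // ~1000: one partition per (non-splittable) .gz file

val merged = df.coalesce(100)             // narrow dependency: merges existing partitions, no full shuffle
val reshuffled = df.repartition(100)      // wide dependency: full shuffle, costly but rebalances the data

println(merged.rdd.getNumPartitions)      // 100
println(reshuffled.rdd.getNumPartitions)  // 100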
Tell me if you get any improvements !
I have a Kafka topic in which I have received around 500k events.
Currently, I need to insert those events into a Hive table.
Since events are time-driven, I decided to use the following strategy:
1) Define a path inside HDFS, which I call users. Inside this path, there will be several Parquet files, each one corresponding to a certain date, e.g. 20180412, 20180413, 20180414, etc. (format YYYYMMDD).
2) Create a Hive table and use the date in the format YYYYMMDD as a partition. The idea is to use each of the files inside the users HDFS directory as a partition of the table, by simply adding the corresponding parquet file through the command:
ALTER TABLE users DROP IF EXISTS PARTITION
(fecha='20180412') ;
ALTER TABLE users ADD PARTITION
(fecha='20180412') LOCATION '/users/20180412';
3) Read the data from the Kafka topic, iterating from the earliest event, get the date value of the event (in the parameter dateCliente), and, given that date value, insert the event into the corresponding Parquet file.
4) In order to accomplish point 3, I read each event and save it into a temporary HDFS file, which I then read with Spark and convert into a DataFrame.
5) Using Spark, I insert the DataFrame values into the Parquet file.
The code follows this approach:
val conf = ConfigFactory.parseResources("properties.conf")
val brokersip = conf.getString("enrichment.brokers.value")
val topics_in = conf.getString("enrichment.topics_in.value")
val spark = SparkSession
.builder()
.master("yarn")
.appName("ParaTiUserXY")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val properties = new Properties
properties.put("key.deserializer", classOf[StringDeserializer])
properties.put("value.deserializer", classOf[StringDeserializer])
properties.put("bootstrap.servers", brokersip)
properties.put("auto.offset.reset", "earliest")
properties.put("group.id", "UserXYZ2")
// Schema to transform the Kafka topic values into JSON
val my_schema = new StructType()
.add("longitudCliente", StringType)
.add("latitudCliente", StringType)
.add("dni", StringType)
.add("alias", StringType)
.add("segmentoCliente", StringType)
.add("timestampCliente", StringType)
.add("dateCliente", StringType)
.add("timeCliente", StringType)
.add("tokenCliente", StringType)
.add("telefonoCliente", StringType)
val consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe( util.Collections.singletonList("geoevents") )
val fs = {
val conf = new Configuration()
FileSystem.get(conf)
}
val temp_path:Path = new Path("hdfs:///tmp/tmpstgtopics")
if( fs.exists(temp_path)){
fs.delete(temp_path, true)
}
while (true) {
  val records = consumer.poll(100)
  for (record <- records.asScala) {
    val data = record.value.toString
    val dataos: FSDataOutputStream = fs.create(temp_path)
    val bw: BufferedWriter = new BufferedWriter(new OutputStreamWriter(dataos, "UTF-8"))
    bw.append(data)
    bw.close
    val data_schema = spark.read.schema(my_schema).json("hdfs:///tmp/tmpstgtopics")
    val fechaCliente = data_schema.select("dateCliente").first.getString(0)
    if (fechaCliente < date) {
      data_schema.select("longitudCliente", "latitudCliente", "dni", "alias",
          "segmentoCliente", "timestampCliente", "dateCliente", "timeCliente",
          "tokenCliente", "telefonoCliente")
        .coalesce(1).write.mode(SaveMode.Append)
        .parquet("/desa/landing/parati/xyusers/" + fechaCliente)
    } else {
      break
    }
  }
}
consumer.close()
However, this method takes around 1 second to process each record on my cluster. At that rate, it would take around 6 days to process all the events I have.
Is this the optimal way to insert the whole amount of events inside a Kafka topic into a Hive table?
What other alternatives exist, or what improvements could I make to my code in order to speed it up?
Other than the fact that you're not using Spark Streaming correctly to poll from Kafka (you wrote a vanilla Scala Kafka consumer with a while loop), and that coalesce(1) will always be a bottleneck since it forces one executor to collect the records, I'll just say you're really reinventing the wheel here.
What other alternatives exist
Those that I know of, all open source:
Gobblin (replaces Camus) by LinkedIn
Kafka Connect w/ HDFS Sink Connector (built into Confluent Platform, but also builds from source on Github)
Streamsets
Apache NiFi
Secor by Pinterest
From those listed, it would be beneficial for you to have JSON- or Avro-encoded Kafka messages, not a flat string. That way, you can drop the files as-is onto a Hive SerDe and not parse them while consuming them. If you cannot edit the producer code, make a separate Kafka Streams job that takes the raw string data, parses it, then writes to a new topic in Avro or JSON.
If you choose Avro (which you really should for Hive support), you can use the Confluent Schema Registry. Or if you're running Hortonworks, they offer a similar Registry.
Hive on Avro operates far better than text or JSON. Avro can easily be transformed into Parquet, and I believe each of the above options offers at least Parquet support, while some of them can also do ORC (Kafka Connect doesn't do ORC at this time).
Each of the above also supports some level of automatic Hive partition generation based on the Kafka record time.
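If you do want to stay in plain Spark rather than adopt one of the tools above, a rough sketch of reading the topic with Structured Streaming (instead of a hand-rolled KafkaConsumer loop) is below. The topic name, schema and output path are taken from the question; the checkpoint location and other options are assumptions, and you need the spark-sql-kafka package on the classpath:
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokersip)
  .option("subscribe", "geoevents")
  .option("startingOffsets", "earliest")
  .load()

// Parse the JSON value using the schema defined in the question.
val parsed = raw.selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", my_schema).as("data"))
  .select("data.*")

// Continuously write date-partitioned Parquet; no per-record temp files needed.
parsed.writeStream
  .format("parquet")
  .option("path", "/desa/landing/parati/xyusers")
  .option("checkpointLocation", "/tmp/checkpoints/xyusers") // assumed location
  .partitionBy("dateCliente")
  .start()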
You can improve the parallelism by increasing the partitions of the Kafka topic and having one or more consumer groups with multiple consumers, consuming one-to-one with each partition.
As cricket_007 mentioned, you can use one of the open-source frameworks, or you can have more consumer groups consuming the same topic to off-load the data.
Spark version - 2.2.1.
I've created a bucketed table with 64 buckets, and I'm executing an aggregation query: select t1.ifa, count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa. I can see 64 tasks in the Spark UI, which utilize just 4 executors (each executor has 16 cores) out of 20. Is there a way I can scale out the number of tasks, or is that how bucketed queries should run (the number of running cores equal to the number of buckets)?
Here's the create table:
sql("""CREATE TABLE level_1 (
bundle string,
date_ date,
hour SMALLINT)
USING ORC
PARTITIONED BY (date_ , hour )
CLUSTERED BY (ifa)
SORTED BY (ifa)
INTO 64 BUCKETS
LOCATION 'XXX'""")
Here's the query:
sql(s"select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa").show
With bucketing, the number of tasks equals the number of buckets, so you should be aware of the number of cores/tasks that you need/want to use and then set that as the number of buckets.
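If what you actually need for this aggregation is more read parallelism rather than the bucketing guarantees, one workaround is to disable bucket-aware reading for the query so the scan is split by file size like a normal table. A sketch, assuming the standard spark.sql.sources.bucketing.enabled conf is available in your Spark version:
// Treat the bucketed table as a normal table for this query: the scan is then split by
// file size instead of producing exactly one task per bucket (you lose bucket pruning
// and pre-shuffled joins while this setting is off).
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")
sql(s"select t1.ifa, count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa").show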
num of tasks = num of buckets is probably the most important and under-discussed aspect of bucketing in Spark. Buckets (by default) are historically solely useful for creating "pre-shuffled" dataframes which can optimize large joins. When you read a bucketed table, all of the file or files for each bucket are read by a single Spark executor (30 buckets = 30 Spark tasks when reading the data), which allows the table to be joined to another table bucketed on the same column(s) into the same number of buckets. I find this behavior annoying and, like the user above mentioned, problematic for tables that may grow.
You might be asking yourself now, why and when in the world would I ever want to bucket, and when will my real-world data grow exactly in the same way over time? (You probably partitioned your big data by date, be honest.) In my experience you probably don't have a great use case to bucket tables in the default Spark way. BUT ALL IS NOT LOST FOR BUCKETING!
Enter "bucket pruning". Bucket pruning only works when you bucket on ONE column, but it is potentially your greatest friend in Spark since the advent of SparkSQL and DataFrames. It allows Spark to determine which files in your table contain specific values based on some filter in your query, which can MASSIVELY reduce the number of files Spark physically reads, resulting in hugely efficient and fast queries. (I've taken 2+ hour queries down to 2 minutes and 1/100th of the Spark workers.) But you probably don't care, because the buckets-to-tasks issue means your table will never "scale up" if you have too many files per bucket, per partition.
Enter Spark 3.2.0. There is a new feature coming that will allow bucket pruning to stay active when you disable bucket-based reading, letting you distribute the Spark reads while still pruning by bucket. I also have a trick for doing this with Spark < 3.2, as follows.
(Note: the leaf scan for files with a vanilla spark.read on S3 adds overhead, but if your table is big it doesn't matter, because your bucket-optimized table will now be a distributed read across all your available Spark workers and will be scalable.)
val table = "ex_db.ex_tbl"
val target_partition = "2021-01-01"
val bucket_target = "valuex"
val bucket_col = "bucket_col"
val partition_col = "date"
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.datasources.{FileScanRDD,FilePartition}
val df = spark.table(table).where((col(partition_col)===lit(target_partition)) && (col(bucket_col)===lit(bucket_target)))
val sparkplan = df.queryExecution.executedPlan
val scan = sparkplan.collectFirst { case exec: FileSourceScanExec => exec }.get
val rdd = scan.inputRDDs.head.asInstanceOf[FileScanRDD]
val bucket_files = for {
  FilePartition(bucketId, files) <- rdd.filePartitions
  f <- files
} yield s"$f".replaceAll("path: ", "").split(",")(0)
val format = bucket_files(0).split("\\.").last
val result_df = spark.read.option("mergeSchema", "False").format(format).load(bucket_files:_*).where(col(bucket_col) === lit(bucket_target))
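An optional sanity check for the trick above: compare how many files the pruned read actually touches, and inspect the plan.
// If pruning worked, this is roughly (matching partitions) x (files per bucket), not the whole table.
println(s"Pruned read covers ${bucket_files.length} file(s) for bucket value '$bucket_target'")
result_df.explain()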
I want to overwrite specific partitions instead of all in spark. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is dataframe having the incremental data to be overwritten.
hdfs-base-path contains the master data.
When I try the above command, it deletes all the partitions, and inserts those present in df at the hdfs path.
My requirement is to overwrite only those partitions present in df at the specified HDFS path. Can someone please help me with this?
Finally! This is now a feature in Spark 2.3.0:
SPARK-20236
To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
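A minimal sketch of that recommendation in Scala (partition_col is a placeholder for your partition column; partitioned_table as in the example above):
import org.apache.spark.sql.functions.col

// One shuffle so that each partition value lands in as few output files as possible.
data.repartition(col("partition_col"))
  .write.mode("overwrite")
  .insertInto("partitioned_table")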
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
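A sketch of that pre-2.3.0 approach, assuming the data is registered as a partitioned Hive table named partitioned_table with a string partition column col4 (as in the question); adjust names and quoting to your schema:
// 1) Drop only the partitions that are present in df.
val partitionValues = df.select("col4").distinct().collect().map(_.getString(0))
partitionValues.foreach { p =>
  spark.sql(s"ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (col4='$p')")
}

// 2) Append them back; dynamic-partition inserts into Hive usually need these settings.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
df.write.mode("append").insertInto("partitioned_table")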
This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,
df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")
If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
This works for me on AWS Glue ETL jobs (Glue 1.0 - Spark 2.4 - Python 2)
Adding 'overwrite=True' parameter in the insertInto statement solves this:
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)
By default overwrite=False. Changing it to True allows us to overwrite specific partitions contained in df and in the partioned_table. This helps us avoid overwriting the entire contents of the partioned_table with df.
Using Spark 1.6...
The HiveContext can simplify this process greatly. The key is that you must create the table in Hive first using a CREATE EXTERNAL TABLE statement with partitioning defined. For example:
# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'
From here, let's say you have a Dataframe with new records in it for a specific partition (or multiple partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE using this Dataframe, which will overwrite the table for only the partitions contained in the Dataframe:
# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')
hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
SELECT name, age
FROM update_dataframe""")
Note: update_dataframe in this example has a schema that matches that of the target test table.
One easy mistake to make with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and just make the table using the Dataframe API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE... PARTITION function.
Hope this helps.
Tested this on Spark 2.3.1 with Scala.
Most of the answers above write to a Hive table. However, I wanted to write directly to disk, with an external Hive table on top of that folder.
First the required configuration
val sparkSession: SparkSession = SparkSession
  .builder
  .enableHiveSupport()
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
  .appName("spark_write_to_dynamic_partition_folders")
  .getOrCreate()
Usage here:
DataFrame
.write
.format("<required file format>")
.partitionBy("<partitioned column name>")
.mode(SaveMode.Overwrite) // This is required.
.save(s"<path_to_root_folder>")
I tried the below approach to overwrite a particular partition in a Hive table.
### load Data and check records
raw_df = spark.table("test.original")
raw_df.count()
Let's say this table is partitioned on the column **c_birth_year** and we would like to update the partitions for years up to 1925.
### Check data in few partitions.
sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
print "Number of records: ", sample.count()
sample.show()
### Back-up the partitions before deletion
raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")
### UDF : To delete particular partition.
def delete_part(table, part):
qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
spark.sql(qry)
### Delete partitions
part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
part_list = part_df.rdd.map(lambda x : x[0]).collect()
table = "test.original"
for p in part_list:
delete_part(table, p)
### Do the required Changes to the columns in partitions
df = spark.table("test.original_bkp")
newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
newdf.select("c_customer_sk", "c_preferred_cust_flag").show()
### Write the Partitions back to Original table
newdf.write.insertInto("test.original")
### Verify data in Original table
spark.table("test.original").filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()
Hope it helps.
Regards,
Neeraj
As Jatin wrote, you can delete the partitions from Hive and from the path and then append the data.
Since I was wasting too much time with it, I added the following example for other Spark users.
I used Scala with Spark 2.2.1.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}
case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)
object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"
val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"
val someText = List("text1", "text2")
val ids = (0 until 5).toList
val listData = partitions1.flatMap(p1 => {
partitions2.flatMap(p2 => {
someText.flatMap(
text => {
ids.map(
id => DataExample(p1, p2, text, id)
)
}
)
}
)
})
val asDataFrame = spark.createDataFrame(listData)
//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
val p = new Path(path)
val fs = p.getFileSystem(new Configuration())
fs.delete(p, recursive)
}
def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
if (spark.catalog.tableExists(tableName)) {
//clean partitions
val asColumns = partitions.map(c => new Column(c))
val relevantPartitions = df.select(asColumns: _*).distinct().collect()
val partitionToRemove = relevantPartitions.map(row => {
val fields = row.schema.fields
s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
})
val cleanFolders = relevantPartitions.map(partition => {
val fields = partition.schema.fields
path + "/" + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
})
println(s"Going to clean ${partitionToRemove.size} partitions")
partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
cleanFolders.foreach(partition => deletePath(partition, true))
}
df.write
  .options(Map("path" -> path))
.mode(SaveMode.Append)
.partitionBy(partitionColumns: _*)
.saveAsTable(tableName)
}
//Now test
tableOverwrite(asDataFrame, partitionColumns, myTablePath)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, myTablePath)
import spark.implicits._
val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
println("Overwrite is working !!!")
}
}
If you use a DataFrame, possibly you want to use a Hive table over the data.
In this case you just need to call the method
df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)
It'll overwrite the partitions that the DataFrame contains.
There's no need to specify the format (orc), because Spark will use the Hive table format.
It works fine in Spark version 1.6
Instead of writing to the target table directly, I would suggest you create a temporary table like the target table and insert your data there.
CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation>';
Once the table is created, you would write your data to the tmpLocation
df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)
Then you would recover the table partition paths by executing:
MSCK REPAIR TABLE tmpTbl;
Get the partition paths by querying the Hive metadata like:
SHOW PARTITIONS tmpTbl;
Delete these partitions from trgtTbl and move the directories from tmpTbl to trgtTbl; a rough sketch of this last step follows below.
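A rough sketch of that step, assuming a single partition column p_col and the placeholder locations from above (adapt names, paths and quoting to your table):
import org.apache.hadoop.fs.{FileSystem, Path}

val tmpLocation = "<tmpLocation>"        // same path used in the CREATE TABLE above
val trgtLocation = "<trgtTbl location>"  // root path of the target table
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Each row of SHOW PARTITIONS looks like "p_col=20180412".
spark.sql("SHOW PARTITIONS tmpTbl").collect().map(_.getString(0)).foreach { p =>
  val Array(colName, value) = p.split("=", 2)
  spark.sql(s"ALTER TABLE trgtTbl DROP IF EXISTS PARTITION ($colName='$value')")
  fs.delete(new Path(s"$trgtLocation/$p"), true) // external table: the old directory may still exist
  fs.rename(new Path(s"$tmpLocation/$p"), new Path(s"$trgtLocation/$p"))
}
spark.sql("MSCK REPAIR TABLE trgtTbl") // re-register the moved partitions on the target table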
I would suggest doing a clean-up and then writing the new partitions with Append mode:
import scala.sys.process._
def deletePath(path: String): Unit = {
s"hdfs dfs -rm -r -skipTrash $path".!
}
df.select(partitionColumn).distinct.collect().foreach(p => {
val partition = p.getAs[String](partitionColumn)
deletePath(s"$path/$partitionColumn=$partition")
})
df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)
This deletes only the partitions that are present in df. After writing the data, run this command if you need to update the metastore:
sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")
Note: deletePath assumes that the hdfs command is available on your system.
My solution involves overwriting each specific partition starting from a Spark DataFrame. It skips the partition-dropping part. I'm using pyspark >= 3 and I'm writing to AWS S3:
def write_df_on_s3(df, s3_path, field, mode):
# get the list of unique field values
list_partitions = [x.asDict()[field] for x in df.select(field).distinct().collect()]
df_repartitioned = df.repartition(1,field)
for p in list_partitions:
# create dataframes by partition and send it to s3
df_to_send = df_repartitioned.where("{}='{}'".format(field,p))
df_to_send.write.mode(mode).parquet(s3_path+"/"+field+"={}/".format(p))
The arguments of this simple function are the df, the s3_path, the partition field, and the mode (overwrite or append). The first part gets the unique field values: it means that if I'm partitioning the df by day, I get a list of all the days in the df. Then I repartition the df. Finally, I select from the repartitioned df by each day and write it to its specific partition path.
You can change the repartition integer by your needs.
You could do something like this to make the job reentrant (idempotent):
(tried this on spark 2.2)
# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)
# delete directory
dbutils.fs.rm(<partition_directoy>,recurse=True)
# Load the partition
df.write\
.partitionBy("partition_col")\
.saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
For >= Spark 2.3.0 :
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)