Data frame re partition is not happening as expected - apache-spark

I am running below code in spark to create table temp1 with number of parttion 200 . But while i am checking the actual number of partition by creating an rdd out of temp1 table its coming to be more than 200.
How is this possible , am i missing any thing .It would be really helpful if any one can tell me ,if i am missing any thing !! Thanks
val TransDataFrame = hiveContext.sql(
s""" SELECT *
FROM uacc.TRANS
WHERE PROD_SURRO_ID != 0
AND MONTH_ID >= 201401
AND MONTH_ID <= 201403
AND CRE_DT <= '2016-11-13'
""").repartition(200,$"NDC").registerTempTable("temp")
hiveContext.sql(
s"""
CREATE TABLE uacc.temp1
AS SELECT * FROM temp
""")
val df = hiveContext.sql("SELECT * FROM uacc.temp1")
df.rdd.getNumPartitions
1224

As you create the table uacc.temp1 you actually write your dataframe to hdfs, now as you load that table again, the number of partitions is controlled by the number of hdfs files (more specific: file splits), see How does partitioning work for data from files on HDFS?

Related

Is spark.read or spark.sql lazy transformations?

In Spark if the source data has changed in between two action calls why I still get previous o/p not the most recent ones. Through DAG all operations will get executed including read operation once action is called. Isn't it?
e.g.
df = spark.sql("select * from dummy.table1")
#Reading from spark table which has two records into dataframe.
df.count()
#Gives count as 2 records
Now, a record inserted into table and action is called withou re-running command1 .
df.count()
#Still gives count as 2 records.
I was expecting Spark will execute read operation again and fetch total 3 records into dataframe.
Where my understanding is wrong ?
To contrast your assertion, this below does give a difference - using Databricks Notebook (cells). The insert operation is not known that you indicate.
But the following using parquet or csv based Spark - thus not Hive table, does force a difference in results as the files making up the table change. For a DAG re-compute, the same set of files are used afaik, though.
//1st time in a cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//1st time in another cell
val df2 = spark.sql("select * from tab2")
df2.count()
//4 is returned
//2nd time in a different cell
val df = spark.read.csv("/FileStore/tables/count.txt")
df.write.mode("append").saveAsTable("tab2")
//2nd time in another cell
df2.count()
//8 is returned
Refutes your assertion. Also tried with .enableHiveSupport(), no difference.
Even if creating a Hive table directly in Databricks:
spark.sql("CREATE TABLE tab5 (id INT, name STRING, age INT) STORED AS ORC;")
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
...
df.count()
...
spark.sql(""" INSERT INTO tab5 VALUES (2, 'Amy SmithS', 77) """)
df.count()
...
Still get updated counts.
However, the for a Hive created ORC Serde table, the following "hive" approach or using an insert via spark.sql:
val dfX = Seq((88,"John", 888)).toDF("id" ,"name", "age")
dfX.write.format("hive").mode("append").saveAsTable("tab5")
or
spark.sql(""" INSERT INTO tab5 VALUES (1, 'Amy Smith', 7) """)
will sometimes show and sometimes not show an updated count when just the 2nd df.count() is issued. This is due to Hive / Spark lack of synchronization that may depend on some internal flagging of changes. In any event not consistent. Double-checked.
This is most related to inmutability as I see it. DataFrames are inmutables, hence changes in the original table are not reflected on them.
Once a dataframe is evaluated, it will be never calculated again. So once the dataframe named df is evaluated, it is the picture of table1 at the time of evaluation, it doesn't matter if table1 changes, df won't. So the second df.count does not trigger evaluation it just return the previous result, which is 2
If you want the desired results you have to load again the DF in a different variable:
val df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
val df2 = spark.sql("select * from dummy.table1")
df2.count() //Will trigger evaluation and return 3
Or using var instead of val (which is bad)
var df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 2
//Insert record
df = spark.sql("select * from dummy.table1")
df.count() //Will trigger evaluation and return 3
This said: yes, spark read and spark sql are lazy, those are not called until an action is found, but once that happens, evaluation won't be trigger ever again in that dataframe

JDBC to Spark Dataframe - How to ensure even partitioning?

I am new to Spark, and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc.
I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions.
The documentation seems to indicate that these fields are optional.
What happens if I don't provide them?
How does Spark know how to partition the queries? How efficient will that be?
If I DO specify these options, how do I ensure that the partition sizes are roughly even if the partitionColumn is not evenly distributed?
Let's say I'm going to have 20 executors, so I set my numPartitions to 20.
My partitionColumn is an auto-incremented ID field, and let's say the values range from 1 to 2,000,000
However, because the user selects to process some really old data, along with some really new data, with nothing in the middle, most of the data has ID values either under 100,000 or over 1,900,000.
Will my 1st and 20th executors get most of the work, while the other 18 executors sit there mostly idle?
If so, is there a way to prevent this?
I found a way to manually specify the partition boundaries, by using the jdbc constructor with the predicates parameter.
It allows you to explicitly specify individual conditions to be inserted in the "where" clause for each partition, which allows you to specify exactly which range of rows each partition will receive. So, if you don't have a uniformly distributed column to auto-partition on, you can customize your own partition strategy.
An example of how to use it can be found in the accepted answer to this question.
What are all these options : spark.read.jdbc refers to reading a table from RDBMS.
parallelism is power of spark, in order to achieve this you have to mention all these options.
Question[s] :-)
1) The documentation seems to indicate that these fields are optional. What happens if I don't provide them ?
Answer : default Parallelism or poor parallelism
Based on scenario developer has to take care about the performance tuning strategy. and to ensure data splits across the boundaries (aka partitions) which in turn will be tasks in parallel. By seeing this way.
2) How does Spark know how to partition the queries? How efficient will that be?
jdbc-reads -referring to databricks docs
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism on read.
These options must all be specified if any of them is specified.
Note
These options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but do not
filter the rows in table. Therefore, Spark partitions and returns all
rows in the table.
Example 1:
You can split the table read across executors on the emp_no column using the partitionColumn, lowerBound, upperBound, and numPartitions parameters.
val df = spark.read.jdbc(url=jdbcUrl,
table="employees",
columnName="emp_no",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties)
also numPartitions means number of parllel connections you are asking RDBMS to read the data. if you are providing numPartitions then you are limiting number of connections... with out exhausing the connections at RDBMS side..
Example 2 source : datastax presentation to load oracle data in cassandra :
val basePartitionedOracleData = sqlContext
.read
.format("jdbc")
.options(
Map[String, String](
"url" -> "jdbc:oracle:thin:username/password#//hostname:port/oracle_svc",
"dbtable" -> "ExampleTable",
"lowerBound" -> "1",
"upperBound" -> "10000",
"numPartitions" -> "10",
"partitionColumn" -> “KeyColumn"
)
)
.load()
The last four arguments in that map are there for the purpose of getting a partitioned dataset. If you pass any of them,
you have to pass all of them.
When you pass these additional arguments in, here’s what it does:
It builds a SQL statement template in the format
SELECT * FROM {tableName} WHERE {partitionColumn} >= ? AND
{partitionColumn} < ?
It sends {numPartitions} statements to the DB engine. If you suppled these values: {dbTable=ExampleTable,
lowerBound=1, upperBound=10,000, numPartitions=10, partitionColumn=KeyColumn}, it would create these ten
statements:
SELECT * FROM ExampleTable WHERE KeyColumn >= 1 AND KeyColumn < 1001
SELECT * FROM ExampleTable WHERE KeyColumn >= 1001 AND KeyColumn < 2000
SELECT * FROM ExampleTable WHERE KeyColumn >= 2001 AND KeyColumn < 3000
SELECT * FROM ExampleTable WHERE KeyColumn >= 3001 AND KeyColumn < 4000
SELECT * FROM ExampleTable WHERE KeyColumn >= 4001 AND KeyColumn < 5000
SELECT * FROM ExampleTable WHERE KeyColumn >= 5001 AND KeyColumn < 6000
SELECT * FROM ExampleTable WHERE KeyColumn >= 6001 AND KeyColumn < 7000
SELECT * FROM ExampleTable WHERE KeyColumn >= 7001 AND KeyColumn < 8000
SELECT * FROM ExampleTable WHERE KeyColumn >= 8001 AND KeyColumn < 9000
SELECT * FROM ExampleTable WHERE KeyColumn >= 9001 AND KeyColumn < 10000
And then it would put the results of each of those queries in its own partition in Spark.
Question[s] :-)
If I DO specify these options, how do I ensure that the partition
sizes are roughly even if the partitionColumn is not evenly
distributed?
Will my 1st and 20th executors get most of the work, while the other
18 executors sit there mostly idle?
If so, is there a way to prevent this?
All questions has one answer
Below is the way...
1) You need to understand how many number of records/rows per partition.... based on this you can repartition or coalesce
Snippet 1: Spark 1.6 >
spark 2.x provides facility to know how many records are there in the partition.
spark_partition_id() exists in org.apache.spark.sql.functions
import org.apache.spark.sql.functions._
val df = "<your dataframe read through rdbms.... using spark.read.jdbc>"
df.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count.show
Snippet 2 : for all version of spark
df
.rdd
.mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
.toDF("partition_number","NumberOfRecordsPerPartition")
.show
and then you need to incorporate your strategy again query tuning between ranges or repartitioning etc.... , you can use mappartitions or foreachpartitions
Conclusion : I prefer using given options which works on number columns since I have seen it was dividing data in to uniform across
bounderies/partitions.
Some time it may not be possible to use these option then manually
tuning the partitions/parllelism is required...
Update :
With the below we can achive uniform distribution...
Fetch the primary key of the table.
Find the key minimum and maximum values.
Execute Spark with those values.
def main(args: Array[String]){
// parsing input parameters ...
val primaryKey = executeQuery(url, user, password, s"SHOW KEYS FROM ${config("schema")}.${config("table")} WHERE Key_name = 'PRIMARY'").getString(5)
val result = executeQuery(url, user, password, s"select min(${primaryKey}), max(${primaryKey}) from ${config("schema")}.${config("table")}")
val min = result.getString(1).toInt
val max = result.getString(2).toInt
val numPartitions = (max - min) / 5000 + 1
val spark = SparkSession.builder().appName("Spark reading jdbc").getOrCreate()
var df = spark.read.format("jdbc").
option("url", s"${url}${config("schema")}").
option("driver", "com.mysql.jdbc.Driver").
option("lowerBound", min).
option("upperBound", max).
option("numPartitions", numPartitions).
option("partitionColumn", primaryKey).
option("dbtable", config("table")).
option("user", user).
option("password", password).load()
// some data manipulations here ...
df.repartition(10).write.mode(SaveMode.Overwrite).parquet(outputPath)
}

Spark bucketing read performance

Spark version - 2.2.1.
I've created a bucketed table with 64 buckets, I'm executing an aggregation function select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa . I can see that 64 tasks in Spark UI, which utilize just 4 executors (each executor has 16 cores) out of 20. Is there a way I can scale out the number of tasks or that's how bucketed queries should run (number of running cores as the number of buckets)?
Here's the create table:
sql("""CREATE TABLE level_1 (
bundle string,
date_ date,
hour SMALLINT)
USING ORC
PARTITIONED BY (date_ , hour )
CLUSTERED BY (ifa)
SORTED BY (ifa)
INTO 64 BUCKETS
LOCATION 'XXX'""")
Here's the query:
sql(s"select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa").show
With bucketing, the number of tasks == number of buckets, so you should be aware of the number of cores/tasks that you need/want to use and then set it as the buckets number.
num of task = num of buckets is probably the most important and under-discussed aspect of bucketing in Spark. Buckets (by default) are historically solely useful for creating "pre-shuffled" dataframes which can optimize large joins. When you read a bucketed table all of the file or files for each bucket are read by a single spark executor (30 buckets = 30 spark tasks when reading the data) which would allow the table to be joined to another table bucketed on the same # of columns. I find this behavior annoying and like the user above mentioned problematic for tables that may grow.
You might be asking yourself now, why and when in the would I ever want to bucket and when will my real-world data grow exactly in the same way over time? (you probably partitioned your big data by date, be honest) In my experience you probably don't have a great use case to bucket tables in the default spark way. BUT ALL IS NOT LOST FOR BUCKETING!
Enter "bucket-pruning". Bucket pruning only works when you bucket ONE column but is potentially your greatest friend in Spark since the advent of SparkSQL and Dataframes. It allows Spark to determine which files in your table contain specific values based on some filter in your query, which can MASSIVELY reduce the number of files spark physically reads, resulting in hugely efficient and fast queries. (I've taken 2+hr queries down to 2 minutes and 1/100th of the Spark workers). But you probably don't care because of the # of buckets to tasks issue means your table will never "scale-up" if you have too many files per bucket, per partition.
Enter Spark 3.2.0. There is a new feature coming that will allow bucket pruning to stay active when you disable bucket-based reading, allowing you to distribute the spark reads with bucket-pruning/scan. I also have a trick for doing this with spark < 3.2 as follows.
(note the leaf-scan for files with vanilla spark.read on s3 is added overhead but if your table is big it doesn't matter, bc your bucket optimized table will be a distributed read across all your available spark workers and will now be scalable)
val table = "ex_db.ex_tbl"
val target_partition = "2021-01-01"
val bucket_target = "valuex"
val bucket_col = "bucket_col"
val partition_col = "date"
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.datasources.{FileScanRDD,FilePartition}
val df = spark.table(tablename).where((col(partition_col)===lit(target_partition)) && (col(bucket_col)===lit(bucket_target)))
val sparkplan = df.queryExecution.executedPlan
val scan = sparkplan.collectFirst { case exec: FileSourceScanExec => exec }.get
val rdd = scan.inputRDDs.head.asInstanceOf[FileScanRDD]
val bucket_files = for
{ FilePartition(bucketId, files) <- rdd.filePartitions f <- files }
yield s"$f".replaceAll("path: ", "").split(",")(0)
val format = bucket_files(0).split("
.").last
val result_df = spark.read.option("mergeSchema", "False").format(format).load(bucket_files:_*).where(col(bucket_col) === lit(bucket_target))

Reading hive orc table using spark

I have a partitioned table. Partitons from 2017-06-20 and up.
My query.
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val test_enc_orc = hiveContext.sql("select * from db.tbl where time_key = '2017-06-21' limit 1")
Every time I run it, spark looks for this partition 2017-06-20
INFO OrcFileOperator: ORC file hdfs://nameservice1/apps/hive/warehouse/db.db/tbl/time_key=2017-06-20/000016_0 has empty schema, it probably contains no rows. Trying to read another ORC file to figure out the schema.
and searches for all files for date 2017-06-20. It holds empty ORC files. But partition 2017-06-21 has files with data. Why doesn't spark search that date or any other?
EDIT
Created test table
drop table arstel.evkuzmin_test_it;
create table arstel.evkuzmin_test_it(name string)
partitioned by(ban int)
stored as orc;
insert into arstel.evkuzmin_test_it partition(ban) values
("bob", 1)
, ("marty", 1)
, ("monty", 2)
, ("naruto", 2)
, ("death", 4);
Seems like the problem is exactly because of empty files. In this case there are none, so everything works. Is there a way to make spark ignore them?

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all in spark. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is dataframe having the incremental data to be overwritten.
hdfs-base-path contains the master data.
When I try the above command, it deletes all the partitions, and inserts those present in df at the hdfs path.
What my requirement is to overwrite only those partitions present in df at the specified hdfs path. Can someone please help me in this?
Finally! This is now a feature in Spark 2.3.0:
SPARK-20236
To use it, you need to set the spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite. Example:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
This is a common problem. The only solution with Spark up to 2.0 is to write directly into the partition directory, e.g.,
df.write.mode(SaveMode.Overwrite).save("/root/path/to/data/partition_col=value")
If you are using Spark prior to 2.0, you'll need to stop Spark from emitting metadata files (because they will break automatic partition discovery) using:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
If you are using Spark prior to 1.6.2, you will also need to delete the _SUCCESS file in /root/path/to/data/partition_col=value or its presence will break automatic partition discovery. (I strongly recommend using 1.6.2 or later.)
You can get a few more details about how to manage large partitioned tables from my Spark Summit talk on Bulletproof Jobs.
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.toDF().write.mode("overwrite").format("parquet").partitionBy("date", "name").save("s3://path/to/somewhere")
This works for me on AWS Glue ETL jobs (Glue 1.0 - Spark 2.4 - Python 2)
Adding 'overwrite=True' parameter in the insertInto statement solves this:
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.mode("overwrite").insertInto("database_name.partioned_table", overwrite=True)
By default overwrite=False. Changing it to True allows us to overwrite specific partitions contained in df and in the partioned_table. This helps us avoid overwriting the entire contents of the partioned_table with df.
Using Spark 1.6...
The HiveContext can simplify this process greatly. The key is that you must create the table in Hive first using a CREATE EXTERNAL TABLE statement with partitioning defined. For example:
# Hive SQL
CREATE EXTERNAL TABLE test
(name STRING)
PARTITIONED BY
(age INT)
STORED AS PARQUET
LOCATION 'hdfs:///tmp/tables/test'
From here, let's say you have a Dataframe with new records in it for a specific partition (or multiple partitions). You can use a HiveContext SQL statement to perform an INSERT OVERWRITE using this Dataframe, which will overwrite the table for only the partitions contained in the Dataframe:
# PySpark
hiveContext = HiveContext(sc)
update_dataframe.registerTempTable('update_dataframe')
hiveContext.sql("""INSERT OVERWRITE TABLE test PARTITION (age)
SELECT name, age
FROM update_dataframe""")
Note: update_dataframe in this example has a schema that matches that of the target test table.
One easy mistake to make with this approach is to skip the CREATE EXTERNAL TABLE step in Hive and just make the table using the Dataframe API's write methods. For Parquet-based tables in particular, the table will not be defined appropriately to support Hive's INSERT OVERWRITE... PARTITION function.
Hope this helps.
Tested this on Spark 2.3.1 with Scala.
Most of the answers above are writing to a Hive table. However, I wanted to write directly to disk, which has an external hive table on top of this folder.
First the required configuration
val sparkSession: SparkSession = SparkSession
.builder
.enableHiveSupport()
.config("spark.sql.sources.partitionOverwriteMode", "dynamic") // Required for overwriting ONLY the required partitioned folders, and not the entire root folder
.appName("spark_write_to_dynamic_partition_folders")
Usage here:
DataFrame
.write
.format("<required file format>")
.partitionBy("<partitioned column name>")
.mode(SaveMode.Overwrite) // This is required.
.save(s"<path_to_root_folder>")
I tried below approach to overwrite particular partition in HIVE table.
### load Data and check records
raw_df = spark.table("test.original")
raw_df.count()
lets say this table is partitioned based on column : **c_birth_year** and we would like to update the partition for year less than 1925
### Check data in few partitions.
sample = raw_df.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag")
print "Number of records: ", sample.count()
sample.show()
### Back-up the partitions before deletion
raw_df.filter(col("c_birth_year") <= 1925).write.saveAsTable("test.original_bkp", mode = "overwrite")
### UDF : To delete particular partition.
def delete_part(table, part):
qry = "ALTER TABLE " + table + " DROP IF EXISTS PARTITION (c_birth_year = " + str(part) + ")"
spark.sql(qry)
### Delete partitions
part_df = raw_df.filter(col("c_birth_year") <= 1925).select("c_birth_year").distinct()
part_list = part_df.rdd.map(lambda x : x[0]).collect()
table = "test.original"
for p in part_list:
delete_part(table, p)
### Do the required Changes to the columns in partitions
df = spark.table("test.original_bkp")
newdf = df.withColumn("c_preferred_cust_flag", lit("Y"))
newdf.select("c_customer_sk", "c_preferred_cust_flag").show()
### Write the Partitions back to Original table
newdf.write.insertInto("test.original")
### Verify data in Original table
orginial.filter(col("c_birth_year") <= 1925).select("c_customer_sk", "c_preferred_cust_flag").show()
Hope it helps.
Regards,
Neeraj
As jatin Wrote you can delete paritions from hive and from path and then append data
Since I was wasting too much time with it I added the following example for other spark users.
I used Scala with spark 2.2.1
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Column, DataFrame, SaveMode, SparkSession}
case class DataExample(partition1: Int, partition2: String, someTest: String, id: Int)
object StackOverflowExample extends App {
//Prepare spark & Data
val sparkConf = new SparkConf()
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val tableName = "my_table"
val partitions1 = List(1, 2)
val partitions2 = List("e1", "e2")
val partitionColumns = List("partition1", "partition2")
val myTablePath = "/tmp/some_example"
val someText = List("text1", "text2")
val ids = (0 until 5).toList
val listData = partitions1.flatMap(p1 => {
partitions2.flatMap(p2 => {
someText.flatMap(
text => {
ids.map(
id => DataExample(p1, p2, text, id)
)
}
)
}
)
})
val asDataFrame = spark.createDataFrame(listData)
//Delete path function
def deletePath(path: String, recursive: Boolean): Unit = {
val p = new Path(path)
val fs = p.getFileSystem(new Configuration())
fs.delete(p, recursive)
}
def tableOverwrite(df: DataFrame, partitions: List[String], path: String): Unit = {
if (spark.catalog.tableExists(tableName)) {
//clean partitions
val asColumns = partitions.map(c => new Column(c))
val relevantPartitions = df.select(asColumns: _*).distinct().collect()
val partitionToRemove = relevantPartitions.map(row => {
val fields = row.schema.fields
s"ALTER TABLE ${tableName} DROP IF EXISTS PARTITION " +
s"${fields.map(field => s"${field.name}='${row.getAs(field.name)}'").mkString("(", ",", ")")} PURGE"
})
val cleanFolders = relevantPartitions.map(partition => {
val fields = partition.schema.fields
path + fields.map(f => s"${f.name}=${partition.getAs(f.name)}").mkString("/")
})
println(s"Going to clean ${partitionToRemove.size} partitions")
partitionToRemove.foreach(partition => spark.sqlContext.sql(partition))
cleanFolders.foreach(partition => deletePath(partition, true))
}
asDataFrame.write
.options(Map("path" -> myTablePath))
.mode(SaveMode.Append)
.partitionBy(partitionColumns: _*)
.saveAsTable(tableName)
}
//Now test
tableOverwrite(asDataFrame, partitionColumns, tableName)
spark.sqlContext.sql(s"select * from $tableName").show(1000)
tableOverwrite(asDataFrame, partitionColumns, tableName)
import spark.implicits._
val asLocalSet = spark.sqlContext.sql(s"select * from $tableName").as[DataExample].collect().toSet
if (asLocalSet == listData.toSet) {
println("Overwrite is working !!!")
}
}
If you use DataFrame, possibly you want to use Hive table over data.
In this case you need just call method
df.write.mode(SaveMode.Overwrite).partitionBy("partition_col").insertInto(table_name)
It'll overwrite partitions that DataFrame contains.
There's not necessity to specify format (orc), because Spark will use Hive table format.
It works fine in Spark version 1.6
Instead of writing to the target table directly, i would suggest you create a temporary table like the target table and insert your data there.
CREATE TABLE tmpTbl LIKE trgtTbl LOCATION '<tmpLocation';
Once the table is created, you would write your data to the tmpLocation
df.write.mode("overwrite").partitionBy("p_col").orc(tmpLocation)
Then you would recover the table partition paths by executing:
MSCK REPAIR TABLE tmpTbl;
Get the partition paths by querying the Hive metadata like:
SHOW PARTITONS tmpTbl;
Delete these partitions from the trgtTbl and move the directories from tmpTbl to trgtTbl
I would suggest you doing clean-up and then writing new partitions with Append mode:
import scala.sys.process._
def deletePath(path: String): Unit = {
s"hdfs dfs -rm -r -skipTrash $path".!
}
df.select(partitionColumn).distinct.collect().foreach(p => {
val partition = p.getAs[String](partitionColumn)
deletePath(s"$path/$partitionColumn=$partition")
})
df.write.partitionBy(partitionColumn).mode(SaveMode.Append).orc(path)
This will delete only new partitions. After writing data run this command if you need to update metastore:
sparkSession.sql(s"MSCK REPAIR TABLE $db.$table")
Note: deletePath assumes that hfds command is available on your system.
My solution implies overwriting each specific partition starting from a spark dataframe. It skips the dropping partition part. I'm using pyspark>=3 and I'm writing on AWS s3:
def write_df_on_s3(df, s3_path, field, mode):
# get the list of unique field values
list_partitions = [x.asDict()[field] for x in df.select(field).distinct().collect()]
df_repartitioned = df.repartition(1,field)
for p in list_partitions:
# create dataframes by partition and send it to s3
df_to_send = df_repartitioned.where("{}='{}'".format(field,p))
df_to_send.write.mode(mode).parquet(s3_path+"/"+field+"={}/".format(p))
The arguments of this simple function are the df, the s3_path, the partition field, and the mode (overwrite or append). The first part gets the unique field values: it means that if I'm partitioning the df by daily, I get a list of all the dailies in the df. Then I'm repartition the df. Finally, I'm selecting the repartitioned df by each daily and I'm writing it on its specific partition path.
You can change the repartition integer by your needs.
You could do something like this to make the job reentrant (idempotent):
(tried this on spark 2.2)
# drop the partition
drop_query = "ALTER TABLE table_name DROP IF EXISTS PARTITION (partition_col='{val}')".format(val=target_partition)
print drop_query
spark.sql(drop_query)
# delete directory
dbutils.fs.rm(<partition_directoy>,recurse=True)
# Load the partition
df.write\
.partitionBy("partition_col")\
.saveAsTable(table_name, format = "parquet", mode = "append", path = <path to parquet>)
For >= Spark 2.3.0 :
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.insertInto("partitioned_table", overwrite=True)

Resources