What is the difference between saveAsTable and save in Spark - apache-spark

I am using PySpark and want to insert-overwrite partitions into an existing Hive table.
For this use case saveAsTable() is not suitable: it overwrites the whole existing table.
insertInto() is behaving strangely: I have 3 partition levels, but it inserts into only one.
And what is the right way to use save()?
Can save() take options like the database name and table name to insert into, or only an HDFS path?
Example:
df\
.write\
.format('orc')\
.mode('overwrite')\
.option('database', db_name)\
.option('table', table_name)\
.save()
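One commonly used approach for insert-overwriting only specific partitions of an existing Hive table (also covered in the answers below) is dynamic partition overwrite combined with insertInto(); a minimal sketch, assuming db_name.table_name already exists with matching partition columns:
# With dynamic partition overwrite, only the partitions present in df are replaced.
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

# insertInto() writes in the table's existing format and matches columns by
# position (partition columns last), not by name.
df.write \
    .mode('overwrite') \
    .insertInto('{}.{}'.format(db_name, table_name))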

Related

How to write to Hive table with static partition using PySpark?

I've created a Hive table with a partition like this:
CREATE TABLE IF NOT EXISTS my_table
(uid INT, num INT) PARTITIONED BY (dt DATE)
Then with PySpark, I have a DataFrame and I've tried to write it to the Hive table like this:
df.write.format('hive').mode('append').partitionBy('dt').saveAsTable('my_table')
Running this I'm getting an exception:
Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
I then added this config:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
This time no exception but the table wasn't populated either!
Then I removed the above config and added this:
hive.exec.dynamic.partition=false
Also altered the code to be like:
df.write.format('hive').mode('append').partitionBy(dt='2022-04-29').saveAsTable('my_table')
This time I am getting:
Dynamic partition is disabled. Either enable it by setting hive.exec.dynamic.partition=true or specify partition column values
The Spark job I want to run is going to have daily data, so I guess what I want is the static partition, but how does it work?
If you haven't predefined all the partitions you will need to use:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
Remember that Hive is schema-on-read, and it won't automagically sort your data into partitions. You need to inform the metastore of the partitions.
You will need to do that manually with one of these two commands:
alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of_the_specific_partition>';
or
MSCK REPAIR TABLE [tablename]
If the table is already created and you are using append mode anyway, you can use insertInto instead of saveAsTable, and you don't even need .partitionBy('dt'):
df.write.format('hive').mode('append').insertInto('my_table')
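Putting the pieces together, a minimal sketch of a Hive-enabled session with the dynamic-partition settings above applied (the app name is illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName('daily_load')
    .config('hive.exec.dynamic.partition', 'true')
    .config('hive.exec.dynamic.partition.mode', 'nonstrict')
    .enableHiveSupport()
    .getOrCreate())

# Append into the existing partitioned table; the dt partition values are
# resolved from the data itself (dynamic partitioning).
df.write.format('hive').mode('append').insertInto('my_table')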

Delete rows from cassandra table using pyspark or cql query

I have a table with lots of columns, for example test_event, and I also have another table, test, in the same keyspace that contains the ids of the rows I have to delete from test_event.
I tried deleteFromCassandra, but it doesn't work because Spark cannot see the SparkContext.
I found some solutions that used DELETE FROM, but they were written in Scala.
After about a hundred attempts I finally got confused and am asking for your help. Can somebody walk through it with me step by step?
Take a look at this code:
from pyspark.sql import SQLContext
# Spark SQL cannot execute CQL DELETE statements, so this sketch assumes the
# DataStax Python driver (cassandra-driver) is available to run the deletes;
# the host below is a placeholder.
from cassandra.cluster import Cluster

def main_function():
    sql = SQLContext(sc)
    session = Cluster(["<cassandra_host>"]).connect("your_keyspace")
    tests = sql.read.format("org.apache.spark.sql.cassandra") \
        .load(keyspace="your_keyspace", table="test").where(...)
    for test in tests.collect():
        # delete each matching row from test_event by its id
        session.execute("delete from test_event where id = %s", (test.id,))
Be aware that deleting one row at a time is not a best practice in Spark, but the above code is just an example to help you figure out your implementation.
The Spark Cassandra Connector (SCC) itself provides only a DataFrame API for Python. But there is a pyspark-cassandra package that provides an RDD API on top of the SCC, so the deletion can be performed as follows.
Start the pyspark shell with (I've tried this with Spark 2.4.3):
bin/pyspark --conf spark.cassandra.connection.host=IPs \
--packages anguenot:pyspark-cassandra:2.4.0
and inside it read data from one table and do the delete. The source data needs to have the columns corresponding to the primary key. It could be the full primary key, a partial primary key, or only the partition key - depending on this, Cassandra will use the corresponding tombstone type (row/range/partition tombstone).
In my example, the table has a primary key consisting of one column - that's why I specified only one element in the list:
rdd = sc.cassandraTable("test", "m1")
rdd.deleteFromCassandra("test","m1", keyColumns = ["id"])

Spark-Cassandra connector: how to change collections write behavior

In Java, I have a Spark dataset (Spark Structured Streaming) with a column of type java.util.ArrayList<Short> and I want to write the dataset in a Cassandra table which has a corresponding list<smallint>.
Each time I write the row to Cassandra it updates an existing row, and I want to customize the write behavior of the list in order to control whether
the written list will overwrite the existing list or
the content of the written list will be appended to the content of the list already saved in Cassandra
I found in the spark-cassandra-connector source code a class CollectionBehavior which is extended by both CollectionAppend and CollectionOverwrite. It seems exactly like what I am looking for, but I didn't find a way to use it while writing to Cassandra.
The dataset is written to Cassandra using:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.option("table", table)
.option("keyspace", keyspace)
.mode(SaveMode.Append)
.save();
Is it possible to change this behavior?
To save to Cassandra collections while setting the save behavior for the collection, use the RDD API. The Dataset API seems to be missing this so far. So converting the Dataset to an RDD and using the RDD methods to save to Cassandra should give you the behaviour you want.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md

Write files inside Hive table hdfs folder and make them available to be queried from Hive

I am using Spark 2.2.1, which has a useful option to specify how many records I want to save in each file of a partition; this feature makes it possible to avoid a repartition before writing a file.
However, it seems this option is usable only with the FileWriter interface and not with the DataFrameWriter one:
This way the option is ignored:
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while this way it works:
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am directly writing ORC files into the warehouse folder of the specified table. The problem is that if I query the Hive table after the insertion, this data is not recognized by Hive.
Do you know if there's a way to write partition files directly inside the table's folder and make them available through the Hive table as well?
Debug steps:
1. Check the file format your Hive table consumes:
Show create table table_name
and check "STORED AS".
For better efficiency, save your output in Parquet at the partition location (you can see that under "LOCATION" in the above query). If the table is stored as any other specific type, create the files as that type.
2. If you are saving data into a partition and manually creating the partition folder, avoid that. Create the partition using:
alter table {table_name} add partition ({partition_column}={value});
3. After creating the output files in Spark, you can reload them and check for "_corrupt_record" (you can print the dataframe and check this).
Adding to this, I also found out that the command 'MSCK REPAIR TABLE' automatically discovers new partitions inside the Hive table folder.
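To illustrate the flow described above, a minimal sketch that writes ORC files directly under one partition location and then registers that partition with the metastore (the database, table, path and partition column are only examples):
# Write the files for a single partition under the table's warehouse directory.
path_hive_table = '/user/hive/warehouse/mydb.db/my_table/dt=2018-01-01'
df.write.option('maxRecordsPerFile', 10000) \
    .mode('overwrite').orc(path_hive_table)

# Make the new files visible to Hive, either by discovering all partitions...
spark.sql('MSCK REPAIR TABLE mydb.my_table')
# ...or by adding the single partition explicitly:
# spark.sql("ALTER TABLE mydb.my_table ADD IF NOT EXISTS PARTITION (dt='2018-01-01')")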

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this:
dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path)
As mentioned in this question, partitionBy will delete the full existing hierarchy of partitions at path and replace them with the partitions in dataFrame. Since new incremental data for a particular day will come in periodically, what I want is to replace only those partitions in the hierarchy that dataFrame has data for, leaving the others untouched.
To do this it appears I need to save each partition individually using its full path, something like this:
singlePartition.write.mode(SaveMode.Overwrite).parquet(path + "/eventdate=2017-01-01/hour=0/processtime=1234567890")
However I'm having trouble understanding the best way to organize the data into single-partition DataFrames so that I can write them out using their full path. One idea was something like:
dataFrame.repartition("eventdate", "hour", "processtime").foreachPartition ...
But foreachPartition operates on an Iterator[Row] which is not ideal for writing out to Parquet format.
I also considered using a select...distinct eventdate, hour, processtime to obtain the list of partitions, and then filtering the original data frame by each of those partitions and saving the results to their full partitioned path. But the distinct query plus a filter for each partition doesn't seem very efficient since it would be a lot of filter/write operations.
I'm hoping there's a cleaner way to preserve existing partitions for which dataFrame has no data?
Thanks for reading.
Spark version: 2.1
This is an old topic, but I was having the same problem and found another solution: just set your partition overwrite mode to dynamic by using:
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
So, my spark session is configured like this:
spark = SparkSession.builder.appName('AppName').getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
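With that setting in place, an overwrite write replaces only the partitions present in the DataFrame; a minimal sketch using the columns from the question:
# Only the (eventdate, hour, processtime) combinations present in dataFrame
# are overwritten; all other existing partitions under `path` are kept.
dataFrame.write \
    .mode('overwrite') \
    .partitionBy('eventdate', 'hour', 'processtime') \
    .parquet(path)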
The mode option Append has a catch!
df.write.partitionBy("y","m","d")
.mode(SaveMode.Append)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName)
I've tested this and saw that it will keep the existing partition files. However, the problem this time is the following: if you run the same code twice (with the same data), it will create new parquet files instead of replacing the existing ones for the same data (Spark 1.6). So, instead of using Append, we can still solve this problem with Overwrite: instead of overwriting at the table level, we overwrite at the partition level.
df.write.mode(SaveMode.Overwrite)
.parquet("/data/hive/warehouse/mydbname.db/" + tableName + "/y=" + year + "/m=" + month + "/d=" + day)
See the following link for more information:
Overwrite specific partitions in spark dataframe write method
(I've updated my reply after suriyanto's comment. Thnx.)
I know this is very old. As I cannot see any solution posted, I will go ahead and post one. This approach assumes you have a Hive table over the directory you want to write to.
One way to deal with this problem is to create a temp view from the dataFrame that should be added to the table, and then use a normal Hive-like insert overwrite table ... command:
dataFrame.createOrReplaceTempView("temp_view")
spark.sql("insert overwrite table table_name partition ('eventdate', 'hour', 'processtime')select * from temp_view")
It preserves old partitions while (over)writing to only new partitions.
