How to truncate a snapshottable table using Spark - apache-spark

Is there a way to truncate/overwrite an external Hive table that has HDFS snapshots, using Spark?
I've already tried
spark.sql("TRUNCATE TABLE xyz");
and
df.limit(0).write.mode("overwrite").insertInto("xyz")
but every time I get something like
The directory /path/to/table/xyz cannot be deleted since /path/to/table/xyz is snapshottable and already has snapshots
I noticed that it works in Hive - apparently this was implemented years ago: https://issues.apache.org/jira/browse/HIVE-11667
I'm really surprised Spark can't handle that. Any ideas?
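One possible workaround (a sketch, not something the thread confirms): mimic what HIVE-11667 does and delete the contents of the table directory instead of the directory itself, since HDFS only blocks deleting the snapshottable directory, not the files inside it. The path here comes from the error message, and spark/df are the question's own objects.
import org.apache.hadoop.fs.Path

val tablePath = new Path("/path/to/table/xyz")
val fs = tablePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Delete every child of the table directory, but leave the snapshottable directory in place.
fs.listStatus(tablePath).foreach { status =>
  fs.delete(status.getPath, true)
}

// With the location emptied, append the new data so Spark never tries to drop the directory.
df.write.mode("append").insertInto("xyz")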

Related

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.;

In my Spark job, I try to overwrite a table in each micro-batch of a Structured Streaming query:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;
I know that in Spark 2.x the way to solve this issue is to add the following option:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
It works well in Spark 2.x. However, this option was removed in Spark 3.0.0. How should we solve this issue in Spark 3.0.0?
Thanks!
It looks like you're running your test data generation and your actual test in the same process - can you just replace these writes with createOrReplaceTempView, so the data goes into Spark's in-memory catalog instead of into a Hive catalog?
Something like: batchDF.createOrReplaceTempView("mytable")
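A minimal sketch of that suggestion, assuming the job uses foreachBatch (streamingDF is a placeholder for whatever streaming DataFrame the job builds):
import org.apache.spark.sql.DataFrame

val query = streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Register the micro-batch in Spark's in-memory catalog instead of
    // overwriting a managed Hive table on every trigger.
    batchDF.createOrReplaceTempView("mytable")

    // Anything running in the same SparkSession can now read it back.
    batchDF.sparkSession.sql("SELECT count(*) FROM mytable").show()
  }
  .start()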

Can SparkSession.catalog.clearCache() delete data from HDFS?

I have been experiencing a data deletion issue since we migrated from CDH to HDP (Spark 2.2 to 2.3). The tables are read from an HDFS location, and after a Spark job that reads and processes those tables has been running for some time, it throws a table-not-found exception; when we check that location, all the records have vanished. In my Spark (Java) code I see that clearCache() is called before that table is read. Can it delete those files? If yes, how do I fix it?
I think you should look at the source code -
Spark has its own implementation of caching user data, managed via CacheManager, and it never deletes the underlying data. Have a look.
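To illustrate the point, a minimal sketch (the path is a placeholder): clearCache() only unpersists data that Spark itself cached; the source files on HDFS are not touched.
val df = spark.read.parquet("hdfs:///some/table/path")   // hypothetical path
df.cache()
df.count()                  // materializes the in-memory cache

spark.catalog.clearCache()  // drops the cached blocks tracked by CacheManager

// The files on HDFS are untouched, so the same data can be read again.
spark.read.parquet("hdfs:///some/table/path").count()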

Spark + write to Hive table + workaround

I am trying to understand the pros and cons of an approach that I hear about frequently in my workplace.
When writing data to a Hive table (insertInto), Spark does the following:
1) Write to a staging folder
2) Move the data to the Hive table using the output committer.
Now I see people complaining that the above 2-step approach is time-consuming and hence resorting to
1) Write files directly to HDFS
2) Refresh metastore for Hive
And I see people reporting a lot of improvement with this approach.
But somehow I am not yet convinced that it is a safe and correct method. Isn't it trading away atomicity (either the full table is written or nothing is)?
What if the executor that is writing a file to HDFS crashes? I don't see a way to completely revert the half-done writes.
I also think Spark would have done it this way if it were the right way to do it, wouldn't it?
Are my questions valid? Do you see any good in the above approach? Please comment.
It is not 100% safe, because in Hive 3 you can only access Hive data through the Hive driver, so as not to break the new transactional machinery.
Even if you are using Hive 2, you should at least keep in mind that you will not be able to access the data directly as soon as you upgrade.
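For reference, the direct-write workaround being debated usually looks roughly like the sketch below (database, table, and partition path are hypothetical, and df stands for the data to load). Note that it bypasses the output committer, which is exactly where the atomicity concern above comes from.
// 1) Write files straight into the partition directory on HDFS.
df.write
  .mode("append")
  .orc("/warehouse/mydb.db/summary/server=a1/date=2015-05-22")

// 2) Make the new files/partitions visible to the Hive metastore and to Spark's catalog.
spark.sql("MSCK REPAIR TABLE mydb.summary")      // or ALTER TABLE ... ADD PARTITION
spark.catalog.refreshTable("mydb.summary")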

SparkSQL attempts to read data from non-existing path

I am having an issue with the PySpark SQL module. I created a partitioned table and saved it as Parquet into a Hive table by running a Spark job after multiple transformations.
The data loads into Hive successfully and I can also query it. But when I try to query the same data from Spark, it says the file path doesn't exist.
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the above error is an old partition value that doesn't even exist anymore.
I have since run the Spark job, which populates a new partition value.
I searched for solutions, but all I can find is people saying there was no such issue in Spark 1.4 and that there is an issue in 1.6.
Can someone please suggest a solution to this problem?
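A hedged suggestion, since no accepted fix appears in the thread: this error typically means Spark has cached a stale file or partition listing for the table, so refreshing the table before querying it usually helps. Shown in Scala (the PySpark calls have the same names); the table name is a placeholder.
// Invalidate Spark's cached metadata for the table so it re-lists files and partitions.
spark.sql("REFRESH TABLE mydb.mytable")
// equivalently, via the catalog API (or hiveContext.refreshTable(...) on Spark 1.6):
spark.catalog.refreshTable("mydb.mytable")

spark.sql("SELECT * FROM mydb.mytable").show()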

Spark dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work

I have a Hive insert into query that creates new Hive partitions. I have two Hive partition columns named server and date. Now I execute the insert into query using the following code and try to save the result:
DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sourcetbl bla bla");
// the above query creates an ORC file at /user/db/a1/20-05-22
// I want only one part-00000 file at the end of the above query, so I tried the following and none of them worked
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); // OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).saveAsTable("summary1"); // OR
dframe.coalesce(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22"); // OR
dframe.repartition(1).write().format("orc").mode(SaveMode.Overwrite).save("/user/db/a1/20-05-22");
No matter whether I use coalesce or repartition, the above query creates around 200 small files of about 20 MB each at the location /user/db/a1/20-05-22. I want only one part-00000 file for performance reasons when using Hive. I was thinking that if I call coalesce(1) it would create a single final part file, but that does not seem to happen. Am I wrong?
Repartition controls how many partitions the data is split into while the Spark job runs; the actual saving of the files is handled by the Hadoop cluster.
Or that's how I understand it. Also, you can see the same question answered here: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3CCA+2Pv=hF5SGC-SWTwTMh6zK2JeoHF1OHPb=WG94vp2GW-vL5SQ#mail.gmail.com%3E
This shouldn't really matter, though - why are you set on a single file? hadoop fs -getmerge will merge it together for you if it's just for your own system.
df.coalesce(1) worked for me in Spark 2.1.1, so anyone seeing this page doesn't have to worry like I did.
df.coalesce(1).write.format("parquet").save("a.parquet")
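One likely reason coalesce(1) appears to do nothing in the original code: the insert statement itself is what writes the files, and the DataFrame returned by hiveContext.sql("insert into ...") is just its (empty) result, so coalescing that result cannot change how those files were written. A sketch of one way to end up with a single file instead, keeping the question's query and path as placeholders:
import org.apache.spark.sql.SaveMode

// Run the SELECT itself as a DataFrame, reduce it to one partition, then write that.
val result = hiveContext.sql("select ... from sourcetbl ...")

result.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .save("/user/db/a1/20-05-22")   // a single part-00000 file in the partition directory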

Resources