Can SparkSession.catalog.clearCache() delete data from hdfs? - apache-spark

I am experiencing a data deletion issue since we migrated from CDH to HDP (Spark 2.2 to 2.3). The tables are read from an HDFS location, and after the Spark job that reads and processes those tables has been running for a while, it throws a table not found exception, and when we check that location all the files have vanished. In my Spark (Java) code I see that clearCache() is called before that table is read. Can it delete those files? If yes, how do I fix it?

I think you should look at the source code.
Spark has its own implementation for caching user data, and it never deletes those files while managing this cache via CacheManager. Have a look.
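For what it's worth, here is a minimal sketch (the table name mytable is a placeholder) showing that clearCache() only evicts Spark's cache entries and leaves the files backing the table untouched:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("clearCacheDemo")
  .enableHiveSupport()
  .getOrCreate()

// pin the table's data in executor memory via CacheManager
spark.catalog.cacheTable("mytable")
println(spark.catalog.isCached("mytable"))   // true

// clearCache() only drops CacheManager's cached entries
spark.catalog.clearCache()
println(spark.catalog.isCached("mytable"))   // false

// the files under the table's HDFS location are untouched; a new scan simply re-reads them
spark.table("mytable").count()

So if files are disappearing from HDFS, the cause is most likely something other than clearCache().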

Related

Getting duplicate records while querying Hudi table using Hive on Spark Engine in EMR 6.3.1

I am querying a Hudi table using Hive running on the Spark engine in an EMR 6.3.1 cluster.
The Hudi version is 0.7.
I have inserted a few records and then updated them using Hudi Merge on Read. This internally creates new files under the same partition with the updated data/records.
Now, when I query the same table using Spark SQL, it works fine and does not return any duplicates; basically, it only honours the latest records/parquet files for processing. It also works fine when I use Tez as the underlying engine for Hive.
But when I run the same query at the Hive prompt with Spark as the underlying execution engine, it returns all the records and does not filter out the previous parquet files.
I have tried setting the property spark.sql.hive.convertMetastoreParquet=false, but it still did not work.
Please help.
This is a known issue in Hudi.
Still, using the property below, I am able to remove the duplicates in RO (read optimised) Hudi tables. The issue still persists in RT (real time) tables.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

How to solve the following issue in Spark 3.0? Can not create the managed table. The associated location already exists.;

In my Spark job, I tried to overwrite a table in each micro-batch of Structured Streaming:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;
I know that in Spark 2.x the way to solve this issue was to add the following option:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
It works well in Spark 2.x. However, this option was removed in Spark 3.0.0. How should we solve this issue in Spark 3.0.0?
Thanks!
It looks like you run your test data generation and your actual test in the same process. Can you just replace these writes with createOrReplaceTempView, to save the data to Spark's in-memory catalog instead of into a Hive catalog?
Something like: batchDF.createOrReplaceTempView("mytable")
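A rough end-to-end sketch of that suggestion, assuming a foreachBatch sink (the rate source, view name, and query below are placeholders, not from the original post):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("tempViewSink").getOrCreate()

def processBatch(batchDF: DataFrame, batchId: Long): Unit = {
  // temp views live in Spark's in-memory catalog, so no warehouse directory
  // is created and nothing can "already exist" on disk
  batchDF.createOrReplaceTempView("mytable")
  spark.sql("SELECT count(*) FROM mytable").show()
}

val query = spark.readStream
  .format("rate")               // placeholder streaming source
  .load()
  .writeStream
  .foreachBatch(processBatch _)
  .start()

query.awaitTermination()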

Spark + write to Hive table + workaround

I am trying to understand the pros and cons of an approach that I hear about frequently in my workplace.
When writing data to a Hive table (insertInto), Spark does the following:
Write to the staging folder
Move data to hive table using output committer.
Now I see people complaining that the above two-step approach is time-consuming and hence resorting to the following (sketched below):
1) Write files directly to HDFS
2) Refresh metastore for Hive
And I see people reporting a lot of improvement with this approach.
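For reference, the direct-write variant people describe usually looks roughly like this sketch (table names, paths and the partition value are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("directWriteSketch")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("staging.events")   // placeholder input

// 1) write the files straight under the target partition location, skipping the staging/commit step
df.write.mode("overwrite")
  .parquet("hdfs:///warehouse/mydb.db/events/dt=2020-01-01")

// 2) refresh Hive's metadata and Spark's cached file listing so the new files become visible
spark.sql("MSCK REPAIR TABLE mydb.events")
spark.catalog.refreshTable("mydb.events")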
But somehow I am not yet convinced that it is a safe and right method. Is it not a trade-off of atomicity (either the full table is written or nothing is written)?
What if the executor that's writing a file to HDFS crashes? I don't see a way to completely revert the half-done writes.
I also think Spark would have done it this way if it were the right way to do it, wouldn't it?
Are my questions valid? Do you see any good in the above approach? Please comment.
It is not 100% right, because in Hive 3 you can only access Hive data through the Hive driver, so as not to break the new transaction (ACID) machinery.
Even if you are using Hive 2, you should at least keep in mind that you will not be able to access the data directly as soon as you upgrade.

SparkSQL attempts to read data from non-existing path

I am having an issue with the pyspark sql module. I created a partitioned table and saved it as a parquet file into a Hive table by running a Spark job after multiple transformations.
The data loads into Hive successfully and I am also able to query it there. But when I try to query the same data from Spark it says the file path doesn't exist:
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the above error contains the old partition column data, which doesn't even exist any more.
I have run the Spark job that populates a new partition value.
I searched for solutions, but all I can see is people saying there was no issue in Spark version 1.4 and that there is an issue in 1.6.
Can someone please suggest a solution for this problem?
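This is not part of the original post, but one thing commonly tried for this symptom is refreshing Spark's cached file and partition listing for the table before querying it; a minimal sketch (database and table names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// drop the stale cached listing for the table
spark.catalog.refreshTable("mydb.mytable")
// or, equivalently, in SQL
spark.sql("REFRESH TABLE mydb.mytable")

spark.table("mydb.mytable").show()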

Pyspark write to External Hive table in S3 is not parallel

I have an external Hive table defined with a LOCATION in S3:
LOCATION 's3n://bucket/path/'
When writing to this table at the end of a pyspark job that aggregates a bunch of data, the write to Hive is extremely slow because only one executor/container is used for the write. When writing to an HDFS-backed table, the write happens in parallel and is significantly faster.
I've tried defining the table using the s3a path, but my job fails due to some vague errors.
This is on Amazon EMR 5.0 (Hadoop 2.7) with pyspark 2.0, but I have experienced the same issue with previous versions of EMR/Spark.
Is there a configuration or alternative library that I can use to make this write more efficient?
I guess you're using parquet. The DirectParquetOutputCommitter was removed to avoid a potential data loss issue. The change was actually made in 04/2016.
It means the data you write to S3 will first be saved in a _temporary folder and then "moved" to its final location. Unfortunately, "moving" == "copying & deleting" in S3, and it is rather slow. To make it worse, this final "move" is done only by the driver.
If you don't want to fight to add that class back, you will have to write to local HDFS and then copy the data over (I do recommend this). In HDFS, "moving" ~ "renaming", so it takes almost no time.
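A sketch of that recommended workaround, i.e. write to HDFS in parallel and then bulk-copy to S3 (paths and table names are placeholders; the copy step is shown as a comment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdfsThenS3")
  .enableHiveSupport()
  .getOrCreate()

val result = spark.table("staging.aggregates")   // placeholder for the aggregated output

// parallel write to HDFS: every task commits via a cheap rename
result.write.mode("overwrite").parquet("hdfs:///tmp/aggregates_out")

// then bulk-copy the finished files to the external table's S3 location, e.g. on EMR:
//   s3-dist-cp --src hdfs:///tmp/aggregates_out --dest s3://bucket/path/
// and repair/refresh the Hive table afterwards if partitions changed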
