Delta Lake change data feed - delete, vacuum, read - java.io.FileNotFoundException - delta-lake

I used the following to write to Google Cloud Storage:
df.write.format("delta").partitionBy("g","p").option("delta.enableChangeDataFeed", "true").mode("append").save(path)
I then inserted data in versions 1, 2, 3 and 4, and deleted some of the data in version 5.
Then I ran
deltaTable.vacuum(8)
and tried to read starting from version 3:
spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 3)
.load(path)
Caused by: java.io.FileNotFoundException:
File not found: gs://xxx/yyy.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
I deleted the cluster and tried to read again. Same issue. Why is it looking for the vacuumed files?
I expected to see all the data inserted starting from version 3.

When you run deltaTable.vacuum(8), note that you are removing data files that are more than 8 hours old and no longer needed by the current version of the table. Even though your delete happened in version 5, if the files backing version 3 are older than 8 hours, the only files still available are the ones needed by the most current version.
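For comparison, a minimal sketch of the same call with a retention long enough to keep the files behind the older versions around (168 hours is just an example value, not something from the question):

# keep data files that are less than 7 days old, so reads of older versions
# and of the change feed can still find their underlying files
deltaTable.vacuum(168)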

Adding the setting worked!
spark.sql.files.ignoreMissingFiles -> true
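For completeness, a minimal sketch of the working read, assuming the same spark session and path variable as above; rows whose underlying files were vacuumed are simply skipped instead of failing the query:

# skip data files that the change feed still references but VACUUM has already removed
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 3)
    .load(path))
changes.show()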

Related

Manually Deleted data file from delta lake

I have manually deleted a data file from Delta Lake, and now the command below is giving an error:
mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)
Error
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
I have tried restarting the cluster with no luck.
I also tried the below:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")
Any help on repairing the transaction log or fixing the error would be appreciated.
As explained before, you must use VACUUM to remove files; manually deleting files does not update the Delta transaction log, which is what Spark uses to identify which files to read.
In your case you can also use the FSCK REPAIR TABLE command.
As per the docs:
"Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
The above error indicates that you have manually deleted a data file without using the proper DELETE statement.
As per the MS docs, you can try the VACUUM command. Using the VACUUM command fixes the error:
%sql
vacuum 'Your_path'
For more information, refer to this link.
The FSCK command worked for me. Thanks all!

Spark: refresh Delta Table in S3

How can I run the refresh table command on a Delta table in S3?
When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!
OK, that's really the wrong function: spark.catalog.refreshTable (doc) is used to refresh table metadata inside Spark. It has nothing to do with recovery of a Delta table.
To fix this on Delta you need to do something different. Unfortunately I'm not 100% sure about the right way for the open-source Delta implementation; on Databricks we have the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, make a backup!):
If the removed files were in the most recent version, you may try the RESTORE command with spark.sql.files.ignoreMissingFiles set to true.
If the removed files belonged to a specific partition, you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for those partitions, and write it back using overwrite mode with a replaceWhere option (doc) that contains the condition, as sketched below.
Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode; it will of course duplicate your data, but the old files will be removed by VACUUM.
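A rough sketch of the partition-level option; the column name part_col and its value are placeholders, not from the question, while the S3 path is the one used above:

# ignore data files listed in the transaction log that no longer exist in storage
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

path = "s3a://test-bucket/delta_table/"

# read only the affected partition; files missing from storage are skipped
affected = (spark.read.format("delta")
    .load(path)
    .where("part_col = 'affected_value'"))

# rewrite just that partition so the new table version only references files that exist
(affected.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "part_col = 'affected_value'")
    .save(path))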

Why DeltaTable.forPath throws "[path] is not a Delta table"?

I'm trying to read a Delta Lake table that I previously loaded using Spark; I'm using the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
Now when I try to read the table again I get the error below. It was working fine, but suddenly it throws errors like this. What might be the reason?
Note:
I checked the files in the Delta Lake path and they look good.
A colleague was able to read the same Delta Lake files.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)
AnalysisException: /some/path/ is not a Delta table.
The AnalysisException is thrown when the given path has no transaction log under the _delta_log directory.
There could be other issues, but that's the first thing to check.
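One quick way to run that first check (a PySpark sketch; the path is the placeholder from the question):

from delta.tables import DeltaTable

# returns False when there is no _delta_log transaction log at the given path
print(DeltaTable.isDeltaTable(spark, "/some/path/"))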
BTW, from the stack trace I figured you may not be using the latest and greatest Delta Lake 2.0.0. Please upgrade as soon as possible, as it brings tons of improvements you don't want to miss.

PySpark: how to clear readStream cache?

I am reading a directory with Spark's readStream. Earlier I gave a local path and got a FileNotFoundException. I have since changed the path to an HDFS path, but the execution log still shows it referring to the old setting (the local path).
22/06/01 10:30:32 WARN scheduler.TaskSetManager: Lost task 0.2 in stage 1.0 (TID 3, my.nodes.com, executor 3): java.io.FileNotFoundException: File file:/home/myuser/testing_aiman/data/fix_rates.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:129)
In fact I have hardcoded the path variable, but it is still referring to the previously set local path.
df = spark.readStream.csv("hdfs:///user/myname/streaming_test_dir",sep=sep,schema=df_schema,inferSchema=True,header=True)
I also ran spark.sql("CLEAR CACHE").collect(), but it didn't help either.
Before running spark.readStream, I ran the following code:
spark.sql("REFRESH \"file:///home/myuser/testing_aiman/data/fix_rates.csv\"").collect()
spark.sql("CLEAR CACHE").collect()
REFRESH <file:///path/that/showed/FileNotFoundException> actually did the trick.

Exclude some directories while Copying Data on Hadoop Cluster using hadoop distcp

Is there any way to skip some directories while copying Hadoop data from one cluster to another? I.e., I am copying some data from my existing cluster to a new cluster, but I don't want to copy the current month's data.
/user/username/year=2021/month=06/day=01
/user/username/year=2021/month=06/day=02
.
.
.
/user/username/year=2021/month=07/day=01
I don't want to include month 07's data. How can I skip the current month's directories?
I am trying to use the -filters option in the command, but it is not working for me.
1st Approach:
hadoop distcp -filters /user/username/year=2021/month=07/day=0.*
-skipcrccheck
-update webhdfs://<src_host>:port/user/username/year=2021/month=07
webhdfs://<target_host>:port/user/username/year=2021/month=07
so that it can filter out days less than 10, i.e. (01, 02, 03, ... 09), but it's not taking the * and shows a warning.
In the logs it shows it cannot find the filters file:
webhdfs://<src_host>:port//user/username/year=2021/month=07/day=0.*
I even tried giving the complete path down to the filename,
i.e. webhdfs://<src_host>:port/user/username/year=2021/month=07/day=01/myfile.txt, but it shows the same issue: can't find the filter file.
When I checked the log for the filter file, instead of // it shows a single / in the path:
webhdfs:/<src_host>:port/user/username/year=2021/month=07/day=01/myfile.txt
2nd Approach:
I created a file "myfilter" (/user/username/myfilter.txt) containing:
.*webhdfs://<src_host>:src_port/user/username/year=2021/month=07/hour=0.*
hadoop distcp -filters
webhdfs://<src_host>:src_port/user/username/myfilter.txt
-skipcrccheck -update webhdfs://<src_host>:
<src_port>/user/username/year=2021/month=07/
webhdfs://<target_host>:
<target_port>/user/username/year=2021/month=07/
Error:
ERROR tools.RegexCopyFilter: Can't find filters file
webhdfs:/<src_host>:src_port/user/myfilter.txt
In the logs I saw this issue: after webhdfs: it takes a single / but ideally it should be double (//). Not sure if this is the problem; kindly guide.
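For reference, a sketch of how the -filters option is typically used; the paths and hosts below are placeholders. As far as I know, RegexCopyFilter opens the -filters file from the local filesystem of the node running distcp (which matches the "Can't find filters file" error above), so it should be a local path, not a webhdfs:// URL. A local file such as /home/username/distcp-filters.txt with one Java regex per line, e.g.

.*year=2021/month=07.*

would exclude everything under month=07 when copying the whole year:

hadoop distcp -filters /home/username/distcp-filters.txt \
  -update -skipcrccheck \
  webhdfs://<src_host>:<src_port>/user/username/year=2021/ \
  webhdfs://<target_host>:<target_port>/user/username/year=2021/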
