Why does DeltaTable.forPath throw "[path] is not a Delta table"? - delta-lake

I'm trying to read a Delta Lake table that I previously wrote with Spark, and I'm using the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
Now, when I try to read the table again, I get the error below. It was working fine before but suddenly started throwing this error. What might be the reason?
Note:
I checked the files in the Delta Lake path and they look fine.
A colleague was able to read the same Delta Lake table.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)

AnalysisException: /some/path/ is not a Delta table.
AnalysisException is thrown when the given path has no transaction log under the _delta_log directory.
There could be other issues, but that's the first thing to check.
BTW, from the stack trace I figure you may not be using the latest and greatest Delta Lake 2.0.0. Please upgrade as soon as possible, as it brings tons of improvements you don't want to miss.
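A minimal sketch of that first check in Scala (assuming the same spark session and path as in the question; DeltaTable.isDeltaTable just looks for a valid _delta_log under the given path):
import io.delta.tables.DeltaTable

// true only when the path contains a readable _delta_log transaction log
if (DeltaTable.isDeltaTable(spark, "/some/path/")) {
  val dt = DeltaTable.forPath(spark, "/some/path/")
  dt.toDF.show()
} else {
  // check the path itself, the filesystem/mount it resolves against,
  // and the credentials your Spark session is using
  println("No _delta_log found under /some/path/")
}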

Related

Manually Deleted data file from delta lake

I manually deleted a data file from a Delta Lake table, and now the command below gives an error:
mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)
Error
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions
I have tried restarting the cluster, with no luck.
I also tried the settings below:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")
Any help on repairing the transaction log or fixing the error would be appreciated.
As explained before, you must use VACUUM to remove files; manually deleting files does not update the Delta transaction log, which is what Spark uses to identify which files to read.
In your case you can also use the FSCK REPAIR TABLE command.
As per the docs:
"Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
The above error indicates that you have manually deleted a data file without using the proper DELETE statement.
As per the MS docs, you can try the VACUUM command. Using the VACUUM command can fix the error.
%sql
vacuum 'Your_path'
For more information, refer to this link.
The FSCK command worked for me. Thanks, all.

Spark: refresh Delta Table in S3

How can I run the refresh table command on a Delta table in S3?
When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!
OK, that's really the wrong function: spark.catalog.refreshTable (doc) is used to refresh table metadata inside Spark, and it expects a table name rather than a DeltaTable object (hence the _get_object_id error). It has nothing to do with recovery of the Delta table.
To fix this on Delta you need to do something different. Unfortunately I'm not 100% sure about the right way for the open-source Delta implementation; on Databricks we have the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, and make a backup!):
If the removed files were only part of the most recent version, you may try the RESTORE command (to roll back to an earlier version) with spark.sql.files.ignoreMissingFiles set to true.
If the removed files belonged to specific partitions, you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for those partitions, and write it back in overwrite mode with the replaceWhere option (doc) containing the matching condition; see the sketch after this list.
Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode. This will of course duplicate your data, but the old files will be removed by VACUUM.
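A rough Scala sketch of the second option (the path is the one from the question; the partition column part_date and its condition are hypothetical placeholders, and the same API exists in PySpark):
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

val path = "s3a://test-bucket/delta_table/"

// read whatever data still exists for the affected partition, skipping missing files
val surviving = spark.read.format("delta").load(path)
  .where("part_date = '2022-01-01'")

// rewrite only that slice of the table; replaceWhere limits what gets overwritten
surviving.write.format("delta")
  .mode("overwrite")
  .option("replaceWhere", "part_date = '2022-01-01'")
  .save(path)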

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks:
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
Such an error usually occurs when there is already data in another format inside the target folder, for example if you wrote Parquet or CSV files into it before. Remove the folder completely and try again (a sketch follows below).
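A minimal sketch of that approach on Databricks (destructive: it deletes everything under the target path; dbutils is only available in Databricks notebooks/jobs, fulldf is the DataFrame from the question, and the Scala shown here is nearly identical to the PySpark equivalent):
// wipe the folder holding the non-Delta files (irreversible!)
dbutils.fs.rm("/mnt/lake/BASE/flights/Full/", true)

// re-run the original write; with an empty target the Delta format check passes
fulldf.write.format("delta").mode("overwrite").save("/mnt/lake/BASE/flights/Full/")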
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a dataset into a Hive table, I am getting the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My dataset contains the same columns as the Hive table, and the column for which I am getting the error has a Date datatype in my code (Java) as well as in Hive.
Java code:
Date IPL_APPL_SIGNED_DATE = rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); // using JDBC to get the record
Encoder<DimPolicy> encoder = Encoders.bean(DimPolicy.class);
Dataset<DimPolicy> test = spark.createDataset(allRows, encoder); // spark is the Spark session
test.write().mode("append").insertInto("someSchema.someTable");
I think the issue is due to a bug in Spark, i.e. [SPARK-26379] Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp, which was fixed in 2.3.3, 2.4.1 and 3.0.0.
A solution is to upgrade to one of the Spark versions with the fix, or downgrade to a version that is unaffected by the bug (or wait for a new release).
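For example, in an sbt build the version bump could look like the sketch below (the question's code is Java, so the same change would go into the Maven or Gradle Spark dependencies instead):
// build.sbt: depend on a Spark release that already contains the SPARK-26379 fix
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.1" % Provided,
  "org.apache.spark" %% "spark-hive" % "2.4.1" % Provided
)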

Spark connecting to Phoenix: NoSuchMethodError exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things from those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the error below in my code:
cannot resolve symbol phoenix
I have included the jars below in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find any example of the suggested method. I am not able to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of load, as shown below:
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
It's late to answer, but here's what I did to solve a similar problem (a different method-not-found error and the deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to the project. Also add the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark: Spark 2.0+ is compatible with phoenix-spark 4.10+
(maven-central-link). This resolved the NoSuchMethodError.
2.) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame. For reference, see "Load as a DataFrame directly using a Configuration object" (a short sketch follows below).
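A rough Scala sketch of that method, loosely based on the Phoenix documentation (the columns ID and COL1 and the ZooKeeper quorum are placeholders from the docs, not from your cluster, and the exact signature can vary between phoenix-spark versions):
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._   // adds phoenixTableAsDataFrame to SQLContext

val configuration = new Configuration()
// phoenix-spark needs the ZooKeeper quorum to reach HBase/Phoenix
configuration.set("hbase.zookeeper.quorum", "localhost:2181")

// load the columns ID and COL1 of TABLE1 as a DataFrame
val df = sparkSession.sqlContext.phoenixTableAsDataFrame("TABLE1", Array("ID", "COL1"), conf = configuration)
df.show()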
