Databricks Error: AnalysisException: Incompatible format detected. with Delta - apache-spark

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks:
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?

This error usually occurs when the folder already contains data in another format, for example Parquet or CSV files that were written there earlier. Remove the folder completely and try again.

This worked for me in a similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`
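To choose between the two suggestions above (removing the folder, or converting it in place), it helps to see what the target folder actually contains. The following is a minimal sketch, not a Databricks API: `describe_target` is a hypothetical helper, and it assumes the folder is reachable through the local filesystem (on Databricks that would typically be a `/dbfs/...` style path, which is an assumption about your workspace).

```python
from pathlib import Path


def describe_target(path):
    """Classify a directory as 'missing', 'empty', 'delta', or 'other:<extensions>'.

    Hypothetical helper for illustration; assumes `path` is a local filesystem path.
    """
    root = Path(path)
    if not root.is_dir():
        return "missing"
    # A Delta table keeps its transaction log in a _delta_log subdirectory.
    if (root / "_delta_log").is_dir():
        return "delta"
    # Collect extensions of data files, skipping marker/hidden files
    # such as _SUCCESS or .crc checksums.
    extensions = sorted(
        {
            p.suffix
            for p in root.rglob("*")
            if p.is_file() and not p.name.startswith(("_", "."))
        }
    )
    if not extensions:
        return "empty"
    return "other:" + ",".join(extensions)
```

A result of `other:.parquet` suggests CONVERT TO DELTA may apply, while a mix such as `other:.csv,.parquet` points toward clearing the folder first.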

Related

Spark: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable

I received this error (java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable) while running a PySpark .py file that reads data from ORC files in a partitioned folder.
The data in this input folder should be read, transformed, and written to a folder that has an existing external table built on top of it (MSCK REPAIR will be run after the data is written to this target folder).
Code sample (process):
Step 1:
Df = spark.read.orc("input_path")
Step 2:
--> apply transformations (no cast function used)
Step 3:
Transformed_Df.write\
.partitionBy("columns")\
.mode("overwrite")\
.orc("output_path")
When I checked the logs, I saw that this error occurs multiple times right after the partitions are read. I believe it happens before the transformations are applied and the data is written to the target.
Attached a picture of the log, please check.

Why DeltaTable.forPath throws "[path] is not a Delta table"?

I'm trying to read a Delta Lake table that I previously loaded using Spark. I'm running the code from the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
When I try to read the table again, I get the error below. It was working fine, but it suddenly started throwing this error. What might be the reason?
Note:
I checked the files in the Delta Lake path and they look good.
A colleague was able to read the same Delta Lake table.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)
AnalysisException: /some/path/ is not a Delta table.
This AnalysisException is thrown when the given path has no transaction log under its _delta_log directory.
There could be other issues, but that's the first thing to check.
By the way, judging by the stack trace, you may not be using the latest Delta Lake release (2.0.0 at the time of writing). Consider upgrading as soon as you can, since it brings many improvements.
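As a first check along these lines, you can look for commit files in the `_delta_log` directory yourself. This is a rough sketch that assumes the table path is on a filesystem Python can read directly; `delta_commit_versions` is a made-up helper name, not part of the Delta Lake API. It relies on the Delta convention of naming each commit file as a 20-digit, zero-padded version number with a `.json` extension.

```python
import re
from pathlib import Path

# Commit files look like 00000000000000000000.json (20 digits).
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def delta_commit_versions(table_path):
    """Return the sorted commit versions found under <table_path>/_delta_log.

    Hypothetical helper; returns an empty list if there is no transaction log.
    """
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_FILE.match(entry.name)
        if match:  # ignores checkpoints, .crc files, and other entries
            versions.append(int(match.group(1)))
    return sorted(versions)
```

An empty list for the path as it resolves on your machine would be consistent with the exception above.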

AssertionError: assertion failed: No plan for DeleteFromTable In Databricks

Is there any reason this command works well:
%sql SELECT * FROM Azure.Reservations WHERE timestamp > '2021-04-02'
returning 2 rows, while the below:
%sql DELETE FROM Azure.Reservations WHERE timestamp > '2021-04-02'
fails with:
Error in SQL statement: AssertionError: assertion failed: No plan for
DeleteFromTable (timestamp#394 > 1617321600000000)
?
I'm new to Databricks, but I'm sure I ran a similar command on another table (without a WHERE clause). The table was created from a Parquet file.
DELETE FROM (and similarly UPDATE or MERGE) isn't supported on Parquet files; right now on Databricks it's supported only for the Delta format. You can convert your Parquet files to Delta using CONVERT TO DELTA, and then this command will work for you.
Another alternative is to read the Parquet files, keep only the rows that you want to retain, and overwrite your Parquet files with the result.
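The read, filter, and overwrite alternative can be sketched as follows. One detail that is easy to miss: `DELETE ... WHERE cond` removes only rows where `cond` is true, so rows where it evaluates to NULL have to be kept as well. `keep_predicate` and `rewrite_without` are hypothetical names, and `spark` is assumed to be an existing SparkSession; the sketch writes to a separate `output_path` rather than back onto the source.

```python
def keep_predicate(delete_condition):
    """Build a SQL filter selecting the rows a DELETE with this condition would leave.

    DELETE removes rows where the condition is TRUE, so rows where it is
    FALSE or NULL must be retained.
    """
    return f"(NOT ({delete_condition})) OR (({delete_condition}) IS NULL)"


def rewrite_without(spark, source_path, output_path, delete_condition):
    """Write a copy of the Parquet data at source_path minus the 'deleted' rows.

    Assumption: `spark` is a live SparkSession and both paths are Parquet locations.
    """
    kept = spark.read.parquet(source_path).filter(keep_predicate(delete_condition))
    kept.write.mode("overwrite").parquet(output_path)
```

For the question above, `keep_predicate("timestamp > '2021-04-02'")` produces the filter for the rows to retain. Writing to a separate path is deliberate: overwriting a location while reading from it runs into the problem described in the S3 question further down.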
It could be that you are trying to DELETE from a VIEW (in case it is not a Parquet file).
Unfortunately, there is no easy way to differentiate between a VIEW and a TABLE in Databricks; the only way to test whether it's indeed a view is:
SHOW VIEWS FROM Azure like 'reser*'
or, if it's a table:
SHOW TABLES FROM Azure like 'reser*'
Just delete from the Delta table directly, addressing it by path:
%sql
delete from delta.`/mnt/path`
where x

Invalid date: Error while importing CSV to Cassandra using PySpark

I'm using a Jupyter notebook to run PySpark code that imports a CSV file into Cassandra v3.11.3, and I'm getting the error below.
... 1 more
(The error output and the PySpark code were attached as images.)
Any input would be appreciated.
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the Py4J wrapper method, and we would really need to see the underlying Java exception.
From what I can tell it looks like you are attempting to also use some options on the C* write that are unsupported. For example "MODE" - "DROPMALFORMED" is not a valid C* connector option. DataFrame Writer and Reader options are source specific so you are unfortunately unable to mix and match.
This makes me think that the data being written actually has a malformed date string or two, and that the code is dying when it attempts to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
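Before changing the Spark job, it may be worth confirming that suspicion by scanning the CSV for date strings that do not parse. This is a plain-Python sketch, independent of the Cassandra connector; the column name `created` and the `%Y-%m-%d` format are assumptions, since the original code was only posted as an image.

```python
import csv
from datetime import datetime


def find_bad_dates(csv_lines, column, fmt="%Y-%m-%d"):
    """Return (record_index, raw_value) for every record whose date does not parse.

    csv_lines: an iterable of text lines (for example an open file), header first.
    record_index is 1-based and counts data records, not the header.
    """
    bad = []
    reader = csv.DictReader(csv_lines)
    for record_index, row in enumerate(reader, start=1):
        value = row.get(column)
        try:
            # Empty or missing values are treated as malformed too.
            datetime.strptime(value or "", fmt)
        except ValueError:
            bad.append((record_index, value))
    return bad


# Example usage (hypothetical file and column names):
# with open("input.csv", newline="") as handle:
#     print(find_bad_dates(handle, "created"))
```

If this turns up a handful of records, fixing or dropping them at read time is likely simpler than chasing the failure on the Cassandra write.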

Spark - Read and Write back to same S3 location

I am reading two datasets, dataset1 and dataset2, from S3 locations. I then transform them and write the result back to the same location that dataset2 was read from.
However, I get below error message:
An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet
If I try to write to a new S3 location e.g. s3://dataset_new_path.../ then the code works fine.
my_df \
.write.mode('overwrite') \
.format('parquet') \
.save(s3_target_location)
Note: I have tried using .cache() after reading in the dataframe but still get the same error.
The reason this causes a problem is that you are reading from and writing to the same path that you are trying to overwrite. It is a standard Spark issue and has nothing to do with AWS Glue.
Spark evaluates DataFrame transformations lazily; they run only when an action is called. Spark builds a DAG that records all the transformations to be applied to the DataFrame.
When you read data from a location and write back to it in overwrite mode, the overwrite is the action. In its execution plan, Spark first deletes the target path and then tries to read from that same path, which is now empty; hence the error.
A possible workaround is to write to a temporary location first and then use that as the source to overwrite the dataset2 location.
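A minimal sketch of that workaround, assuming `spark` is an active SparkSession and `staging_path` is a scratch location you are free to overwrite (both the helper name `overwrite_in_place` and the staging location are hypothetical, not part of any Spark API):

```python
def overwrite_in_place(spark, df, target_path, staging_path, fmt="parquet"):
    """Overwrite target_path with df even when df was read from target_path.

    Assumption: staging_path is a different location from target_path.
    """
    # 1. Materialize the result somewhere else, so the source files are
    #    still present while Spark reads them.
    df.write.mode("overwrite").format(fmt).save(staging_path)

    # 2. Re-read from staging; this DataFrame's plan no longer refers
    #    to target_path.
    staged = spark.read.format(fmt).load(staging_path)

    # 3. Now it is safe to replace the original location.
    staged.write.mode("overwrite").format(fmt).save(target_path)
```

In the question's terms this would be called as `overwrite_in_place(spark, my_df, s3_target_location, some_staging_location)`. The staging data is left in place afterwards; cleaning it up is a separate step. This also explains why `.cache()` did not help: caching is lazy as well, so nothing is materialized before the overwrite starts.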
