PySpark: Delta table as stream source, how to do it?

I am facing an issue with readStream on a Delta table.
What is expected, per the following reference:
https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-stream-source
Ex:
spark.readStream.format("delta").table("events") -- as expected, this should work fine
The issue: I have tried the same in the following way:
df.write.format("delta").saveAsTable("deltatable") -- saved the DataFrame as a Delta table
spark.readStream.format("delta").table("deltatable") -- called readStream
error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'DataStreamReader' object has no attribute 'table'
Note:
I am running it locally, using the PyCharm IDE,
with the latest version of PySpark installed: Spark version 2.4.5, Scala version 2.11.12.

The DataStreamReader.table and DataStreamWriter.table methods are not in Apache Spark yet. Currently you need to use a Databricks notebook in order to call them.

Try now with the Delta Lake 0.7.0 release, which adds support for registering your tables with the Hive metastore. As mentioned in a comment, most Delta Lake examples used a folder path because metastore support wasn't integrated before this release.
Also note that for the open-source version of Delta Lake it's best to follow the docs at https://docs.delta.io/latest/index.html
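In the meantime, a path-based workaround keeps you on plain Apache Spark. A minimal sketch, assuming the local folder /tmp/deltatable is a placeholder you can write to:

# Save the DataFrame to a Delta folder instead of a metastore table
df.write.format("delta").mode("overwrite").save("/tmp/deltatable")

# Stream from the folder path; DataStreamReader.load works where .table does not
stream_df = spark.readStream.format("delta").load("/tmp/deltatable")

# Print incoming micro-batches to the console
query = stream_df.writeStream.format("console").start()
query.awaitTermination()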

Related

Glue not able to recognize Delta Lake Python Library

I am trying to use the Delta Lake Python library in my Glue job. However, my Glue job is not able to recognize it and I get the error "NameError: name 'DeltaTable' is not defined". Per the Glue-Delta Lake documentation, I added the parameter --datalake-formats = delta and also updated the required Spark configuration:
.config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog")
My code fails at the line below:
deltaTable = DeltaTable.forPath(self.spark,self.dest_path_sdad)
Any ideas?
These configuration properties configure Glue with the Delta Lake file format, so you can write spark.read.format("delta").load(...) or df.write.format("delta").save(...). But they don't provide the Python API that ships in the delta-spark package. That package can be made available to Glue by using the --additional-python-modules option (doc).
I was missing the import statement
from delta.tables import *
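Putting those pieces together, a minimal sketch of the setup, assuming delta-spark was installed via --additional-python-modules and the S3 path below is a placeholder for the actual destination used in the job:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # provided by the delta-spark package

spark = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Placeholder path standing in for self.dest_path_sdad in the question
delta_table = DeltaTable.forPath(spark, "s3://my-bucket/my-delta-table/")
delta_table.toDF().show()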

Why DeltaTable.forPath throws "[path] is not a Delta table"?

I'm trying to read a Delta Lake table which I loaded previously using Spark; I'm using the IntelliJ IDE.
val dt = DeltaTable.forPath(spark, "/some/path/")
Now when I try to read the table again, I get the error below. It was working fine, but suddenly it started throwing this error. What might be the reason?
Note:
Checked the files in the Delta Lake path - they look good.
A colleague was able to read the same Delta Lake files.
Exception in thread "main" org.apache.spark.sql.AnalysisException: `/some/path/` is not a Delta table.
at org.apache.spark.sql.delta.DeltaErrors$.notADeltaTableException(DeltaErrors.scala:260)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:593)
at com.datalake.az.core.DeltaLake$.delayedEndpoint$com$walmart$sustainability$datalake$az$core$DeltaLake$1(DeltaLake.scala:66)
at com.datalake.az.core.DeltaLake$delayedInit$body.apply(DeltaLake.scala:18)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.datalake.az.core.DeltaLake$.main(DeltaLake.scala:18)
at com.datalake.az.core.DeltaLake.main(DeltaLake.scala)
AnalysisException: /some/path/ is not a Delta table.
AnalysisException is thrown when the given path has no transaction log under the _delta_log directory.
There could be other issues, but that's the first check.
BTW, from the stack trace I figured you may not be using the latest and greatest Delta Lake 2.0.0. Please upgrade as soon as possible, as it brings tons of improvements you don't want to miss.
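As a quick first check, you can ask Delta Lake directly whether the path carries a transaction log. A sketch in PySpark (the Scala DeltaTable API has the same method), reusing the path from the question:

from delta.tables import DeltaTable

# Returns False when there is no _delta_log directory under the path
print(DeltaTable.isDeltaTable(spark, "/some/path/"))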

Unable to Save Apache Spark parquet file to csv with Databricks

I'm trying to save/convert a parquet file to CSV on Apache Spark with Databricks, but I'm not having much luck.
The following code successfully writes to a folder called tempDelta:
df.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(saveloc+"/tempDelta")
I would then like to convert the parquet file to CSV as follows:
df.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(saveloc+"/tempDelta").csv(saveloc+"/tempDelta")
AttributeError Traceback (most recent call last)
<command-2887017733757862> in <module>
----> 1 df.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(saveloc+"/tempDelta").csv(saveloc+"/tempDelta")
AttributeError: 'NoneType' object has no attribute 'csv'
I have also tried the following after writing to the location:
df.write.option("header","true").csv(saveloc+"/tempDelta2")
But I get the error:
A transaction log for Databricks Delta was found at `/CURATED/F1Area/F1Domain/final/_delta_log`,
but you are trying to write to `/CURATED/F1Area/F1Domain/final/tempDelta2` using format("csv"). You must use
'format("delta")' when reading and writing to a delta table.
And when I try to save as a CSV to a folder that isn't a Delta folder, I get the following error:
df.write.option("header","true").csv("testfolder")
AnalysisException: CSV data source does not support struct data type.
Can someone let me know the best way of saving/converting from parquet to CSV with Databricks?
You can use either of the below two options:
1. df.write.option("header", "true").csv(path)
2. df.write.format("csv").save(path)
Note: You can't specify format("parquet") and then call the .csv function in the same chain; .save() returns None, which is why the AttributeError above complains about 'NoneType'.
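Putting it together, a minimal sketch: write the parquet output, read it back, and write CSV to a separate non-Delta folder. The tempCsv folder name is a placeholder, and the cast to string is only there to sidestep the struct-column limitation of the CSV source:

from pyspark.sql.functions import col

# Step 1: write parquet; save() returns None, so nothing can be chained onto it
df.coalesce(1).write.format("parquet").mode("overwrite").save(saveloc + "/tempDelta")

# Step 2: read the parquet back and flatten all columns to strings for CSV
pq = spark.read.parquet(saveloc + "/tempDelta")
flat = pq.select([col(c).cast("string").alias(c) for c in pq.columns])

# Step 3: write CSV to a folder that is not under a Delta transaction log
flat.coalesce(1).write.mode("overwrite").option("header", "true").csv(saveloc + "/tempCsv")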

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a Dataset into a Hive table, I am getting the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My Dataset contains the same columns as the Hive table, and the column for which I am getting the error has a Date datatype in my code (Java) as well as in Hive.
Java code:
Date IPL_APPL_SIGNED_DATE = rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); // using JDBC to get the record
Encoder<DimPolicy> encoder = Encoders.bean(Foo.class);
Dataset<DimPolicy> test = spark.createDataset(allRows, encoder); // spark is the SparkSession
test.write().mode("append").insertInto("someSchema.someTable");
I think the issue is due to a bug in Spark, i.e. [SPARK-26379] Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp, which got fixed in 2.3.3, 2.4.1, and 3.0.0.
A solution is to move to a version of Spark that is unaffected by the bug, i.e. 2.3.3, 2.4.1, or later (or wait for a new release).

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things from those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the below error in my code:
cannot resolve symbol phoenix
I have included the below jars in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't get any reference, and I was not able to find any example of the suggested method. I am not able to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of .load, as shown below:
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
It's late to answer, but here's what I did to solve a similar problem (a different method-not-found error and the deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from the HBase installation lib folder and added them to the project, along with the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark: Spark 2.0+ is compatible with phoenix-spark 4.10+ (maven-central-link). This resolved the NoSuchMethodError.
2.) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame instead. For reference, see "Load as a DataFrame directly using a Configuration object".
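For reference, the same Data Source read also works from PySpark. A minimal sketch, assuming the phoenix-spark and HBase client jars are already on the Spark classpath and reusing the table and zkUrl values from the Scala example above:

# Read a Phoenix table as a DataFrame via the Data Source API
df = (spark.read
    .format("org.apache.phoenix.spark")
    .option("table", "TABLE1")
    .option("zkUrl", "localhost:2181")
    .load())
df.show()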
