error: not found: value SchemaConverters - databricks

I am using Databricks for a use case where I have to convert an Avro schema to a Spark StructType. From what I found, spark-avro has SchemaConverters to do that. However, I am using the spark-avro_2.11-4.0 library, and when I use SchemaConverters I get
"error: not found: value SchemaConverters".
Please help on how to solve this issue.
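In Scala, "not found: value SchemaConverters" usually just means the symbol is not in scope: either the import is missing or the spark-avro library is not actually attached to the cluster. Below is a minimal, hedged sketch of the conversion, assuming a runtime where Spark's built-in Avro module is available; on the older com.databricks spark-avro library the class lives under com.databricks.spark.avro instead, so treat the import as an assumption to adapt to your version.
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters  // com.databricks.spark.avro on the spark-avro 4.x library
import org.apache.spark.sql.types.StructType

// Hypothetical Avro schema, used only to illustrate the conversion.
val avroSchemaJson =
  """{"type":"record","name":"User","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

val avroSchema = new Schema.Parser().parse(avroSchemaJson)

// toSqlType returns a SchemaType; its dataType is the StructType equivalent of the Avro record.
val structType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
structType.printTreeString()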

Related

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
Such an error usually occurs when the folder already contains data in another format, for example if you wrote Parquet or CSV files into it before. Remove the folder completely and try again.
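A Scala sketch of that suggestion, assuming the folder is disposable and that Databricks dbutils is available in the notebook (the path is the one from the question):
// Remove the non-Delta files first, then re-write the DataFrame in Delta format.
val targetPath = "/mnt/lake/BASE/flights/Full/"
dbutils.fs.rm(targetPath, true)

val fulldf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/databricks-datasets/flights/")

fulldf.write.format("delta").mode("overwrite").save(targetPath)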
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`

Spark wrongly casting integers as `struct<int:int,long:bigint>`

In a spark job, I am using
.withColumn("year", year(to_timestamp(lit(col("timestamp")))))
This code used to work, but now I get the error:
"cannot resolve 'CAST(`timestamp` AS TIMESTAMP)' due to data type mismatch: cannot cast struct<int:int,long:bigint> to timestamp;"
It looks like Spark is reading my timestamp column as a struct<int:int,long:bigint> instead of an int.
How can I prevent that?
Context: the initial data is in JSON Lines. I read it using AWS Glue's glueContext.create_dynamic_frame.from_catalog. In the Glue catalog the timestamp column is typed int.
Finally I solved it this way:
GF_resolved = ResolveChoice.apply(
frame=GF_raw,
specs=[("timestamp", "cast:int")],
transformation_ctx="resolve timestamp type",
)
ResolveChoice is a method available on AWS Glue DynamicFrame.
The short answer is that you cannot prevent it when creating a dynamic frame from the catalog because, as the name suggests, the schema is dynamic. See this SO answer for more information.
An alternative approach that is a little more compact is:
gf_resolved = gf_raw.resolveChoice(specs = [('timestamp','cast:int')])
Official documentation for the resolve choice class can be found here.
AWS Resolve Choice
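If the column has already landed in a Spark DataFrame as struct<int:int,long:bigint>, the choice can also be flattened with plain Spark functions. A hedged Scala sketch; df and the branch field names are assumptions based on the struct type shown in the error:
import org.apache.spark.sql.functions.{coalesce, col, to_timestamp, year}

// df is a hypothetical DataFrame whose "timestamp" column arrived as struct<int:int,long:bigint>.
// The Glue "choice" keeps one branch per observed type; take whichever branch is populated,
// collapse it into a single long column, and then derive the year as in the question.
val fixed = df
  .withColumn("ts", coalesce(col("timestamp.int").cast("long"), col("timestamp.long")))
  .withColumn("year", year(to_timestamp(col("ts"))))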

Unresolved reference lit when adding a string constant as a column in PySpark

I'm trying to add a string constant as a new column in PySpark. I'm using Spark version 2.4.4.
I'm using this:
data.withColumn("currentdate", lit(constant_name))
I'm getting the error "Unresolved reference lit". From the error it seems like there is no lit function in 2.4.4, but when I checked the documentation, it is there.
You need to install the pyspark-stubs package in order for your IDE to resolve the references to many of the Spark SQL functions, including lit, i.e.
pip install pyspark-stubs==2.4.0.post8

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, "Load as a DataFrame using the Data Source API", I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things from those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the below exception in my code:
cannot resolve symbol phoenix
I have included the below jars in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find any example of the suggested method. I am also not able to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of load as shown below:
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
It's late to answer, but here's what I did to solve a similar problem (a different missing method and the deprecation warning):
1.) About the NoSuchMethodError: I took all the jars from the HBase installation's lib folder and added them to the project. Also add the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark: Spark 2.0+ is compatible with phoenix-spark 4.10+
(maven-central-link). This resolved the NoSuchMethodError.
2.) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame. For reference, see "Load as a DataFrame directly using a Configuration object".
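A hedged Scala sketch of that call, assuming the phoenix-spark implicits are on the classpath; the table name, column list, and ZooKeeper URL are placeholders:
import org.apache.phoenix.spark._   // adds phoenixTableAsDataFrame to SQLContext

// sqlContext can be obtained from an existing SparkSession, e.g. sparkSession.sqlContext.
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1",                      // placeholder table name
  Seq("ID", "COL1"),             // placeholder column list
  zkUrl = Some("localhost:2181")
)
df.show()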

spark 2.0 error "Multiple sources found for json" when reading a json file

When I use Spark 2.0 to read a JSON file like:
Dataset<Row> logDF = spark.read().json(path);
logDF.show();
it fails with:
16/08/04 15:35:05 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Multiple sources found for json (org.apache.spark.sql.execution.datasources.json.JsonFileFormat, org.apache.spark.sql.execution.datasources.json.DefaultSource), please specify the fully qualified class name.
java.lang.RuntimeException: Multiple sources found for json (org.apache.spark.sql.execution.datasources.json.JsonFileFormat, org.apache.spark.sql.execution.datasources.json.DefaultSource), please specify the fully qualified class name.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:167)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:78)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:287)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
When I use Spark 1.6 it ran correctly.
The error says to specify the fully qualified class name, but I can't find which classes conflict.
Thank you very much!
I came across this and found the below to work for me:
df = spark.read.format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat").load(path)
More details can be found here: https://github.com/AbsaOSS/ABRiS/issues/147
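The same workaround as the snippet above, written in Scala to match the reader in the question; naming the data source by its fully qualified class avoids the ambiguous lookup of the short name "json":
// path is the same variable as in the question.
val logDF = spark.read
  .format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
  .load(path)
logDF.show()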
