Call from_avro function in spark sql - apache-spark

I have a dataframe that contains Avro bytes and the schema string. I would like to decode the Avro bytes using that schema.
Here is my sample code to do the conversion with Spark SQL:
val rsDf = ss.sql("SELECT from_avro(avro_payload, avro_schema_string) FROM testDF")
but I got this error:
Caused by: org.apache.spark.sql.AnalysisException: Undefined function: 'from_avro'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$16.$anonfun$applyOrElse$121(Analyzer.scala:2084)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
Any suggestions on how to call from_avro using Spark SQL?
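For context, a possible explanation and sketch (not taken from an answer): in the Spark versions I have worked with, from_avro lives in the external spark-avro module and is only exposed through the DataFrame API, not registered as a SQL function, which is why the lookup fails. It also expects the Avro schema as a literal JSON string rather than a column. Assuming a single schema applies to every row of testDF, something along these lines should work with the ss session from the question:
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Assumption: all rows share the same schema, so grab it once as a literal string.
val testDF = ss.table("testDF")
val avroSchemaString = testDF.select("avro_schema_string").head.getString(0)

// from_avro takes the payload column plus the schema as a plain JSON string.
val rsDf = testDF.select(from_avro(col("avro_payload"), avroSchemaString).as("decoded_payload"))
If every row really does carry a different schema, this approach does not apply, since from_avro cannot read the schema from a column.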

Related

Error when passing Spark SQL higher order functions lambda variable to custom SQL user defined function (UDF)

I have created a SQL UDF which takes one string parameter and returns a string.
Reference doc: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html#parameters
I am trying to use this function inside Spark SQL's transform (array transform) higher-order function.
SQL UDF function DDL
CREATE or REPLACE FUNCTION printStr(str String )
RETURNS STRING
COMMENT 'print given string'
LANGUAGE SQL
RETURN str
Temp view to test SQL UDF
CREATE or REPLACE TEMP VIEW values AS
select * from values (array("1","2","3")),(array("2","3","5")),(array("3","44"))
Calling this SQL UDF inside Spark SQL's transform function:
select transform(col1, val -> printStr(val)) from VALUES
Exception: (Spark version - 3.2.1, Scala - 2.12, Databricks runtime - 10.4 LTS)
Error in SQL statement: AnalysisException: Resolved attribute(s) val#1298163 missing from in operator !Project [cast(lambda val#1298163 as string) AS str#1298165].; line 1 pos 30
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Resolved attribute(s) val#1298163 missing from in operator !Project [cast(lambda val#1298163 as string) AS str#1298165].; line 1 pos 30
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:60)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:59)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:225)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$2(CheckAnalysis.scala:533)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$2$adapted(CheckAnalysis.scala:105)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:358)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:357)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:357)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:357)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:105)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:100)
I am guessing the NamedLambdaVariable is not getting resolved inside SQL UDFs.
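One thing that might be worth trying, sketched here as an assumption rather than a confirmed fix: replace the SQL UDF with a regular registered Scala UDF. A registered UDF is evaluated as an ordinary expression instead of being inlined as a sub-plan, so the lambda variable from transform() can usually be passed to it directly. Using the values view from the question (the printStrUdf name is made up for this example):
// Register a plain Scala UDF that mirrors the SQL UDF above.
spark.udf.register("printStrUdf", (s: String) => s)

// Call it inside the higher-order function; val is the lambda variable.
spark.sql("SELECT transform(col1, val -> printStrUdf(val)) AS transformed FROM values").show(false)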

How to query data from the PySpark SQL context if a key is not present in the JSON file, and how to catch the SQL AnalysisException

I am using PySpark to transform JSON into a DataFrame, and I am able to transform it successfully. The problem I am facing is that there is a key which will be present in some JSON files and not in others. When I flatten the JSON with the PySpark SQL context and the key is not present in a given JSON file, creating my PySpark data frame fails with a SQL AnalysisException.
For example, my sample JSON:
{
"_id" : ObjectId("5eba227a0bce34b401e7899a"),
"origin" : "inbound",
"converse" : "72412952",
"Start" : "2020-04-20T06:12:20.89Z",
"End" : "2020-04-20T06:12:53.919Z",
"ConversationMos" : 4.88228940963745,
"ConversationRFactor" : 92.4383773803711,
"participantId" : "bbe4de4c-7b3e-49f1-8",
}
The participantId key above will be available in some JSON files and not in others.
My PySpark code snippet:
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin, converse, start, end, participantId from CreateDataFrame")
When participantId is not present in a JSON file, an exception is raised. How can I handle this so that, if the key is not present, the column contains null, or is there any other way to handle it?
You can simply check whether the column is missing and, if it is, add it with empty values.
The code for that goes like this:
from pyspark.sql import functions as f
fetchFile = spark.read.format(file_type)\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(generated_FileLocation)
if 'participantId' not in fetchFile.columns:
    fetchFile = fetchFile.withColumn('participantId', f.lit(''))
fetchFile.registerTempTable("CreateDataFrame")
tempData = sqlContext.sql("select origin, converse, start, end, participantId from CreateDataFrame")
I think you're calling Spark to read one file at a time and inferring the schema at the same time.
What Spark is telling you with the AnalysisException is that your file, and therefore your inferred schema, don't have the key you're looking for. What you have to do is get to your good schema and apply it to all of the files you want to process, ideally processing all of your files at once.
There are three strategies:
1. Infer your schema from lots of files. You should get the union of all of the keys. Spark will run two passes over the data.
df = spark.read.json('/path/to/your/directory/full/of/json/files')
schema = df.schema
print(schema)
2. Create a schema object by hand. I find this tedious to do, but it will speed up your code. Here is a reference: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.types.StructType
3. Read the schema from a well-formed file, then use that to read your whole directory. Also, by printing the schema object, you can copy-paste it back into your code for option #2.
schema = spark.read.json('path/to/well/formed/file.json').schema
print(schema)
my_df = spark.read.schema(schema).json('path/to/entire/folder/full/of/json')

How to collect a streaming dataset (to a Scala value)?

How can I store a dataframe value in a Scala variable?
I need to store values from the dataframe below (assuming the "timestamp" column produces the same values) in a variable, and later I need to use this variable somewhere.
I have tried the following:
val spark = SparkSession.builder().appName("micro").
  enableHiveSupport().config("hive.exec.dynamic.partition", "true").
  config("hive.exec.dynamic.partition.mode", "nonstrict").
  config("spark.sql.streaming.checkpointLocation", "hdfs://dff/apps/hive/warehouse/area.db").
  getOrCreate()
val xmlSchema = new StructType().add("id", "string").add("time_xml", "string")
val xmlData = spark.readStream.option("sep", ",").schema(xmlSchema).csv("file:///home/shp/sourcexml")
val xmlDf_temp = xmlData.select($"id", unix_timestamp($"time_xml", "dd/MM/yyyy HH:mm:ss").cast(TimestampType).as("timestamp"))
val collect_time = xmlDf_temp.select($"timestamp").as[String].collect()(0)
It's throwing an error saying the following:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Is there any way I can store some dataframe values in a variable and use them later?
Is there any way I can store some dataframe values in a variable and use them later?
That's not possible in Spark Structured Streaming, since a streaming query never ends, so there is no way to express collect over it.
and later I need to use this variable somewhere
This "later" has to be another streaming query that you could join together and produce a result.

Spark SQL - Cast to UUID of the Dataset Column throws Parse Exception

Dataset<Row> finalResult = df.selectExpr("cast(col1 as uuid())", "col2");
When we try to cast the column in the dataset to UUID and persist it in Postgres, I see the following exception. Please suggest an alternate way to convert the column in a dataset to UUID.
java.lang.RuntimeException: org.apache.spark.sql.catalyst.parser.ParseException:
DataType uuid() is not supported.(line 1, pos 21)
== SQL ==
cast(col1 as UUID)
---------------------^^^
Spark has no uuid type, so casting to one is just not going to work.
You can try to use the database.column.type metadata property, as explained in Custom Data Types for DataFrame columns when using Spark JDBC and in SPARK-10849.
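An alternative workaround, offered only as a sketch and not as the approach from the answer above: keep the column as a plain string in Spark and let PostgreSQL cast it to uuid on insert by adding the pgjdbc connection parameter stringtype=unspecified. This assumes the target table already declares col1 as uuid; the URL, table name and credentials below are illustrative.
import java.util.Properties

// Illustrative connection properties.
val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "secret")

// Write col1 as a plain string; with stringtype=unspecified the server infers uuid.
df.selectExpr("cast(col1 as string) AS col1", "col2")
  .write
  .mode("append")
  .jdbc("jdbc:postgresql://localhost:5432/mydb?stringtype=unspecified", "target_table", props)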

to_json function in spark sql

I am trying to use the new to_json function in Spark 2.1 to convert a Map type column to JSON. Code snippet below:
import org.apache.spark.sql.functions.to_json
val df = spark.sql("SELECT *, to_json(Properties) AS Prop, to_json(Measures) AS Meas FROM tempTable")
However, I get the below error. Any ideas what could be wrong?
org.apache.spark.sql.AnalysisException: Undefined function: 'to_json'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 10
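A likely explanation, with a sketch of a workaround: in Spark 2.1, to_json is exposed through the Dataset API (org.apache.spark.sql.functions) but, as far as I recall, is not yet registered in the SQL function registry, so it cannot be called inside spark.sql(...). The same projection can be expressed with the API instead, using the table and column names from the question:
import org.apache.spark.sql.functions.{col, to_json}

// Equivalent of: SELECT *, to_json(Properties) AS Prop, to_json(Measures) AS Meas FROM tempTable
val df = spark.table("tempTable")
  .withColumn("Prop", to_json(col("Properties")))
  .withColumn("Meas", to_json(col("Measures")))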
