to_json function in spark sql - apache-spark

I am trying to use the new to_json function in Spark 2.1 to convert a Map type column to JSON. Code snippet below:
import org.apache.spark.sql.functions.to_json
val df = spark.sql("SELECT *, to_json(Properties) AS Prop, to_json(Measures) AS Meas FROM tempTable")
However, I get the below error. Any ideas what could be wrong?
org.apache.spark.sql.AnalysisException: Undefined function: 'to_json'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 10
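A minimal workaround sketch, assuming Spark 2.1 where to_json is exposed through the DataFrame API but (as far as I know) not yet registered with the SQL parser; column and view names are taken from the question, and note that older releases may restrict to_json to struct columns, so a Map column could still need a newer Spark version:

import org.apache.spark.sql.functions.{col, to_json}

// Call to_json through the DataFrame API instead of the SQL string.
// "tempTable" is the registered temp view from the question.
val jsonDf = spark.table("tempTable")
  .withColumn("Prop", to_json(col("Properties")))
  .withColumn("Meas", to_json(col("Measures")))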

Related

Call from_avro function in spark sql

I have a dataframe that contains Avro bytes and the schema string. I would like to decode the Avro bytes using the schema.
Here is my sample code to do the conversion with Spark SQL:
val rsDf = ss.sql("SELECT from_avro(avro_payload, avro_schema_string) FROM testDF")
but I got this error:
Caused by: org.apache.spark.sql.AnalysisException: Undefined function: 'from_avro'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$16.$anonfun$applyOrElse$121(Analyzer.scala:2084)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
Any suggestions on how to call from_avro from Spark SQL?
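A hedged sketch of the usual workaround: from_avro is not registered as a SQL function, but it can be called through the DataFrame API. This assumes Spark 3.x with the spark-avro package on the classpath, and note that the schema argument is a plain String, not a Column, so it has to be known up front rather than read per-row from avro_schema_string:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.avro.functions.from_avro

// Assumes every row shares the same schema; pull it once from the first row.
val schemaJson = ss.table("testDF")
  .select("avro_schema_string")
  .head()
  .getString(0)

val rsDf = ss.table("testDF")
  .select(from_avro(col("avro_payload"), schemaJson).as("decoded"))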

apache-spark: How to use UDF in expr?

How do we use a Spark UDF inside an expression?
I have:
def power3(input: Int) = input * input * input
import org.apache.spark.sql.functions.udf
val power3UDF = udf(power3(_:Int):Int).withName("power3UDF")
numDF
.select(expr("Number"), expr("power3UDF(Number) AS POWER3"))
.show()
This gives error:
Undefined function: power3UDF. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.power3udf.
You need to register the function as a Spark SQL function:
spark.udf.register("power3", power3(_:Int):Int)
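Putting it together, a minimal sketch (numDF with an integer Number column is assumed from the question); once the UDF is registered under a SQL name, expr() can resolve it:

import org.apache.spark.sql.functions.expr

def power3(input: Int): Int = input * input * input

// Registering under the name "power3" makes it visible to expr()/selectExpr()
spark.udf.register("power3", power3 _)

numDF
  .select(expr("Number"), expr("power3(Number) AS POWER3"))
  .show()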

How to remove special characters from dataframe using udf function

I am a learner in Spark SQL. Could anyone please help with the scenario below?
Package name: sparksql, class name: custommethod, method name: removespecialchar.
1. Create a custom method in Scala which takes one String argument and returns a String. The method has to remove all special characters and the digits 0 to 9 (- ? , / _ ( ) [ ]) from one dataframe column using the replaceAll function.
Input: windows-X64 (os system)
Output: windows x os system
2. I have a dataframe called df1 with 6 columns inside another class called sparksql2.
3. Import the package, instantiate the custommethod method inside the sparksql2 class, and register the method from the step above as a UDF for invoking on the Spark SQL dataframe.
4. Call the above UDF in the DSL by passing a single column name as an argument to get the special characters removed from the dataframe, and save the result as JSON to an HDFS location.
You don't need a UDF for that; you can just use plain Spark and wrap regexp_replace in a function. Take this example:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.regexp_replace

def removeFromColumn(spark: SparkSession, columnName: String, df: DataFrame) =
  df.select(regexp_replace(
    df(columnName),
    "[0-9]|\\[|\\]|\\-|\\?|\\(|\\)|\\,|_|/",
    ""
  ).as(columnName))
With this you can use it on a DataFrame without going to the trouble of registering a UDF:
import spark.implicits._
val df = Seq("2res012-?,/_()[]ult").toDF("columnName")
removeFromColumn(spark, "columnName", df)
Output:
+----------+
|columnName|
+----------+
| result|
+----------+
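The question also asks to save the cleaned result as JSON to HDFS; a minimal sketch with a hypothetical output path:

// Placeholder HDFS path; adjust it to your actual location.
removeFromColumn(spark, "columnName", df)
  .write
  .json("hdfs:///user/hypothetical/output")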

Cast datatype from array to String for multiple columns in Spark throwing error

I have a dataframe df that contains three columns of type array. I am trying to save the output to CSV, so I converted the data types to string.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("Total", col("total").cast("string")),
("BOOKID", col("BOOKID").cast("string"),
"PublisherID", col("PublisherID").cast("string")
.write
.csv(path="D:/pennymac/SOLUTION1/OUTPUT")
But I am getting an error: "Cannot resolve symbol write".
Spark 2.2
Scala
Try the code below. It's not possible to add multiple columns inside a single withColumn call.
val df2 = df
.withColumn("Total", col("total").cast("string"))
.withColumn("BOOKID", col("BOOKID").cast("string"))
.withColumn("PublisherID", col("PublisherID").cast("string"))
.write
.csv(path="D:/pennymac/SOLUTION1/OUTPUT")
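If there are many columns to cast, a fold over the column names avoids repeating withColumn; a sketch using the same (assumed) column names from the question:

import org.apache.spark.sql.functions.col

// Cast each listed column to string, threading the DataFrame through the fold.
val toCast = Seq("total", "BOOKID", "PublisherID")
val df2 = toCast.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("string")))
df2.write.csv("D:/pennymac/SOLUTION1/OUTPUT")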

Not able to register UDF in spark sql

I am trying to register my UDF and use it in my Spark SQL query, but I am not able to register the UDF. I am getting the error below.
val squared = (s: Column) => {
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
}
squared: org.apache.spark.sql.Column => org.apache.spark.sql.Column = <function1>
scala> sqlContext.udf.register("dc",squared)
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:143)
... 48 elided
I tried changing Column to String, but got the error below.
val squared = (s: String) => {
| concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
| }
<console>:28: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
Can someone please guide me on how I should implement this?
Spark functions from the org.apache.spark.sql.functions._ package cannot be called inside a UDF: they operate on Columns, not on plain Scala values. Instead of a UDF, you can wrap the built-in functions in an ordinary Scala function that takes and returns a Column, or use plain Scala code inside a UDF to get the same result.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.sql("select * from your_table")

def date_concat(date: Column): Column =
  concat(substring(date, 4, 2), year(to_date(from_unixtime(unix_timestamp(date, "dd-MM-yyyy")))))

df.withColumn("date_column_name", date_concat($"date_column_name")) // with the helper function
df.withColumn("date_column_name", concat(substring($"date_column_name", 4, 2), year(to_date(from_unixtime(unix_timestamp($"date_column_name", "dd-MM-yyyy")))))) // without the helper, inlined
df.createOrReplaceTempView("table_name")
spark.sql("[...]") // write your further logic in SQL if you want
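If you really do need a registered UDF (for example to call it from a SQL string), here is a sketch of the same logic written in plain Scala, assuming the input is always in dd-MM-yyyy format:

import java.text.SimpleDateFormat
import java.util.Calendar

// Concatenate the month part of the string with the parsed year,
// e.g. "25-12-2020" -> "122020", mirroring the Column-based version above.
def dateConcat(s: String): String = {
  val cal = Calendar.getInstance()
  cal.setTime(new SimpleDateFormat("dd-MM-yyyy").parse(s))
  s.substring(3, 5) + cal.get(Calendar.YEAR).toString
}

spark.udf.register("dc", dateConcat _)
// spark.sql("SELECT dc(date_column_name) FROM table_name")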
