Spark 2.1 register UDF to functionRegistry - apache-spark

Hi I want to register a UDF object that is already created. I'm using spark 2.1, and the sparkSession.udf.register() function does not accept a UDF parameter only a regular scala function. It's easy to miss something from the large Spark API so just asking is there a function or constructor that will allow this in 2.1?

In this case I'd reverse the the problem and use udf registration to get UserDefinedFunction:
import org.apache.spark.sql.expressions.UserDefinedFunction
val id: UserDefinedFunction = spark.udf.register("id", (x: Int) => x)
Which would work both in DataFrames:
val id: UserDefinedFunction = spark.udf.register("id", (x: Int) => x)
and SQL:
spark.sql("SELECT id(id) FROM RANGE(42)")

Related

Kotlin with spark create dataframe from POJO which has pojo classes within

I have a kotlin data class as shown below
data class Persona_Items(
val key1:Int = 0,
val key2:String = "Hello")
data class Persona(
val persona_type: String,
val created_using_algo: String,
val version_algo: String,
val createdAt:Long,
val listPersonaItems:List<Persona_Items>)
data class PersonaMetaData
(val user_id: Int,
val persona_created: Boolean,
val persona_createdAt: Long,
val listPersona:List<Persona>)
fun main() {
val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2="abc"), Persona_Items(10,"rrr"))
val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2="abcffffff"),Persona_Items(20,"rrr"))
val persona1 = Persona("HelloWorld","tttAlgo","1.0",10L,personalItemList1)
val persona2 = Persona("HelloWorld","qqqqAlgo","1.0",10L,personalItemList2)
val personMetaData = PersonaMetaData(884,true,1L, listOf(persona1,persona2))
val spark = SparkSession
.builder()
.master("local[2]")
.config("spark.driver.host","127.0.0.1")
.appName("Simple Application").orCreate
val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
df.show(false)
}
When I try to create a dataframe I get the below error.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that for list of data classes, creating dataframe is not supported? Please help me understand what is missing this the above code.
It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
val ds = dsOf(Persona_Items(1), Persona_Items(key2="abc"), Persona_Items(10,"rrr")))
// rest of logics here
}
Thing is Spark does not support data classes out of the box and we had to make an there are nothing like import spark.implicits._ in Kotlin, so we had to make extra step to make it work automatically.
In Scala import spark.implicits._ is required to encode your serialize and deserialize your entities automatically, in the Kotlin API we do this almost at compile time.
Error means that Spark doesn't know how to serialize the Person class.
Well, it works for me out of the box. I've created a simple app for you to demonstrate it check it out here, https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
you can just run this main and you will see that correct content is displayed.
Maybe you need to fix your build tool configuration if you see something scala specific in kotlin project, then you can check my build.gradle inside this project or you can read more about it here https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

Higher Order functions in Spark SQL

Can anyone please explain the transform() and filter() in Spark Sql 2.4 with some advanced real-world use-case examples ?
In a sql query, is this only to be used with array columns or it can also be applied to any column type in general. It would be great if anyone could demonstrate with a sql query for an advanced application.
Thanks in advance.
Not going down the .filter road as I cannot see the focus there.
For .transform
dataframe transform at DF-level
transform on an array of a DF in v 2.4
transform on an array of a DF in v 3
The following:
dataframe transform
From the official docs https://kb.databricks.com/data/chained-transformations.html transform on DF can end up like spaghetti. Opinion can differ here.
This they say is messy:
...
def inc(i: Int) = i + 1
val tmp0 = func0(inc, 3)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))
compared to:
val res = testDf.transform(func0(inc, 4))
.transform(func1(1))
.transform(func2(2))
.withColumn("col3", expr("col2 + 3"))
transform with lambda function on an array of a DF in v 2.4 which needs the select and expr combination
import org.apache.spark.sql.functions._
val df = Seq(Seq(Array(1,999),Array(2,9999)),
Seq(Array(10,888),Array(20,8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
transform with lambda function new array function on a DF in v 3 using withColumn
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df = Seq(
(Array("New York", "Seattle")),
(Array("Barcelona", "Bangalore"))
).toDF("cities")
val df2 = df.withColumn("fun_cities", transform(col("cities"),
(col: Column) => concat(col, lit(" is fun!"))))
Try them.
Final note and excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):
transform works similar to the map function in Scala. I’m not sure why
they chose to name this function transform… I think array_map would
have been a better name, especially because the Dataset#transform
function is commonly used to chain DataFrame transformations.
Update
If wanting to use %sql or display approach for Higher Order Functions, then consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html

Not able to register UDF in spark sql

I trying to register my UDF function and want to use this in my spark sql query but not able to register my udf Im getting below error.
val squared = (s: Column) => {
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
}
squared: org.apache.spark.sql.Column => org.apache.spark.sql.Column = <function1>
scala> sqlContext.udf.register("dc",squared)
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:143)
... 48 elided
I tried to change Column to String but getting below error.
val squared = (s: String) => {
| concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
| }
<console>:28: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
concat(substring(s,4,2),year(to_date(from_unixtime(unix_timestamp(s,"dd-MM-yyyy")))))
can someone please guide me how should i implement this.
All spark functions from this package org.apache.spark.sql.functions._ will not be able to access inside UDF.
Instead of built in spark functions ..you can use plain scala code to get same result.
val df = spark.sql("select * from your_table")
def date_concat(date:Column): Column = {
concat(substring(date,4,2),year(to_date(from_unixtime(unix_timestamp(date,"dd-MM-yyyy")))))
}
df.withColumn("date_column_name",date_concat($"date_column_name")) // with function.
df.withColumn("date_column_name",concat(substring($"date_column_name",4,2),year(to_date(from_unixtime(unix_timestamp($"date_column_name","dd-MM-yyyy")))))) // without function, direct method.
df.createOrReplaceTempView("table_name")
spark.sql("[...]") // Write your furthur logic in sql if you want.

HBase batch get with spark scala

I am trying to fetch data from HBase based on a list of row keys, in the API document there is a method called get(List gets), I am trying to use that, however the compiler is complaining something like this, does anyone had this experiance
overloaded method value get with alternatives: (x$1: java.util.List[org.apache.hadoop.hbase.client.Get])Array[org.apache.hadoop.hbase.client.Result] <and> (x$1: org.apache.hadoop.hbase.client.Get)org.apache.hadoop.hbase.client.Result cannot be applied to (List[org.apache.hadoop.hbase.client.Get])
The code I tried.
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
val results = hTable.get(gets)
I ended up using JavaConvert to make it java.util.List, then it worked
val gets:List[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
import scala.collection.JavaConverters._
val getJ=gets.asJava
val results = hTable.get(getJ).toList
your gets is of type List[Get]. Here List is of Scala type. However, HBase get request expects Java List type. You can use Seq[Get] instead of List[Get] as Scala Seq is more closer to Java List.
So, you can try with below code:
val keys: List[String] = df.select("id").rdd.map(r => r.getString(0)).collect.toList
val gets:Seq[Get]=keys.map(x=> new Get(Bytes.toBytes(x)))
val results = hTable.get(gets)

Convert Hive Sql to Spark Sql

i want to convert my Hive Sql to Spark Sql to test the performance of query. Here is my Hive Sql. Can anyone suggests me how to convert the Hive Sql to Spark Sql.
SELECT split(DTD.TRAN_RMKS,'/')[0] AS TRAB_RMK1,
split(DTD.TRAN_RMKS,'/')[1] AS ATM_ID,
DTD.ACID,
G.FORACID,
DTD.REF_NUM,
DTD.TRAN_ID,
DTD.TRAN_DATE,
DTD.VALUE_DATE,
DTD.TRAN_PARTICULAR,
DTD.TRAN_RMKS,
DTD.TRAN_AMT,
SYSDATE_ORA(),
DTD.PSTD_DATE,
DTD.PSTD_FLG,
G.CUSTID,
NULL AS PROC_FLG,
DTD.PSTD_USER_ID,
DTD.ENTRY_USER_ID,
G.schemecode as SCODE
FROM DAILY_TRAN_DETAIL_TABLE2 DTD
JOIN ods_gam G
ON DTD.ACID = G.ACID
where substr(DTD.TRAN_PARTICULAR,1,3) rlike '(PUR|POS).*'
AND DTD.PART_TRAN_TYPE = 'D'
AND DTD.DEL_FLG <> 'Y'
AND DTD.PSTD_FLG = 'Y'
AND G.schemecode IN ('SBPRV','SBPRS','WSSTF','BGFRN','NREPV','NROPV','BSNRE','BSNRO')
AND (SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,6) IN ('405997','406228','406229','415527','415528','417917','417918','418210','421539','421572','432198','435736','450502','450503','450504','468805','469190','469191','469192','474856','478286','478287','486292','490222','490223','490254','512932','512932','514833','522346','522352','524458','526106','526701','527114','527479','529608','529615','529616','532731','532734','533102','534680','536132','536610','536621','539149','539158','549751','557654','607118','607407','607445','607529','652189','652190','652157') OR SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,8) IN ('53270200','53270201','53270202','60757401','60757402') )
limit 50;
Query is lengthy to write code for above, I won't attempt to write code here, But I would offer DataFrames approach.
which has flexibility to implement above query Using DataFrame , Column operations
like filter,withColumn(if you want to convert/apply hive UDF to scala function/udf) , cast for casting datatypes etc..
Recently I've done this and its performant.
Below is the psuedo code in Scala
val df1 = hivecontext.sql ("select * from ods_gam").as("G")
val df2 = hivecontext.sql("select * from DAILY_TRAN_DETAIL_TABLE2).as("DTD")
Now, join using your dataframes
val joinedDF = df1.join(df2 , df1("G.ACID") = df2("DTD.ACID"), "inner")
// now apply your string functions here...
joinedDF.withColumn or filter ,When otherwise ... blah.. blah here
Note : I think in your case udfs are not required, simple string functions would suffice.
Also have a look at DataFrameJoinSuite.scala which could be very useful for you...
Further details refer docs
Spark 1.5 :
DataFrame.html
All the dataframe column operations Column.html
If you are looking for sample code of UDF below is code snippet.
Construct Dummy Data
import util.Random
import org.apache.spark.sql.Row
implicit class Crossable[X](xs: Traversable[X]) {
def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}
val students = Seq("John", "Mike","Matt")
val subjects = Seq("Math", "Sci", "Geography", "History")
val random = new Random(1)
val data =(students cross subjects).map{x => Row(x._1, x._2,random.nextInt(100))}.toSeq
// Create Schema Object
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val schema = StructType(Array(
StructField("student", StringType, nullable=false),
StructField("subject", StringType, nullable=false),
StructField("score", IntegerType, nullable=false)
))
// Create DataFrame
import org.apache.spark.sql.hive.HiveContext
val rdd = sc.parallelize(data)
val df = sqlContext.createDataFrame(rdd, schema)
// Define udf
import org.apache.spark.sql.functions.udf
def udfScoreToCategory=udf((score: Int) => {
score match {
case t if t >= 80 => "A"
case t if t >= 60 => "B"
case t if t >= 35 => "C"
case _ => "D"
}})
df.withColumn("category", udfScoreToCategory(df("score"))).show(10)
Just try to use it as it is, you should benefit from this right away if you run this query with Hive on MapReduce before that, from there if you still would need to get better results you can analyze Query plan and optimize it further like using partitioning for example. Spark uses memory more heavily and beyond simple transformations is generally faster than MapReduce, Spark sql also uses Catalyst Optimizer, your query benefit from that too.
Considering your comment about "using spark functions like Map, Filter etc", map() just transforms data, but you just have string functions I don't think you will gain anything by rewriting them using .map(...), spark will do transformations for you, filter() if you can filter the input data, you can just rewrite query using sub queries and other sql capabilities.

Resources