spark udf with data frame

I am using Spark 1.3. I have a dataset where the dates in the ordering_date column are in yyyy/MM/dd format. I want to do some calculations with dates, so I want to use Joda-Time for the conversions and formatting. Here is the UDF that I have:
val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.formatted(str))
Here is the code where the UDF is called. However, I get an error saying "Not Applicable". Do I need to register this UDF, or am I missing something here?
val user_with_dates_formatted = users.withColumn(
"formatted_date",
return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
)

I don't believe you can pass in the DateTimeFormatter as an argument to the UDF. You can only pass in a Column. One solution would be to do:
val return_date = udf((str: String, format: String) => {
DateTimeFormat.forPattern(format).formatted(str)
})
And then:
val user_with_dates_formatted = users.withColumn(
"formatted_date",
return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
Honestly, though, both this and your original approach have the same problem: they compile the yyyy/MM/dd pattern with forPattern for every record. It would be better to create a singleton object wrapped around a Map[String,DateTimeFormatter], maybe like this (thoroughly untested, but you get the idea):
object DateFormatters {
var formatters = Map[String,DateTimeFormatter]()
def getFormatter(format: String) : DateTimeFormatter = {
if (formatters.get(format).isEmpty) {
formatters = formatters + (format -> DateTimeFormat.forPattern(format))
}
formatters.get(format).get
}
}
Then you would change your UDF to:
val return_date = udf((str: String, format: String) => {
DateFormatters.getFormatter(format).formatted(str)
})
That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
One thing to note about the singleton object solution is that you can't define the object in the spark-shell -- you have to pack it up in a JAR file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.
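A side note on the snippets above: formatted is Scala's generic string-formatting helper rather than a Joda-Time method, so the UDF body probably wants a parse call (parseDateTime in Joda-Time) instead. The sketch below illustrates the caching idea with a thread-safe map. It is an illustration under assumptions, not the answer's code: it uses java.time from the JDK in place of Joda-Time so that it runs without extra dependencies, and the names CachedFormatters and parseDate are made up for the example.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.collection.concurrent.TrieMap

// Hypothetical helper: caches compiled formatters per pattern within one JVM (executor).
object CachedFormatters {
  // TrieMap is a thread-safe map, so concurrent tasks can share the cache.
  private val cache = TrieMap.empty[String, DateTimeFormatter]

  // Compile the pattern only if it is not cached yet.
  def get(pattern: String): DateTimeFormatter =
    cache.getOrElseUpdate(pattern, DateTimeFormatter.ofPattern(pattern))

  // Parse a date string with the cached formatter for the given pattern.
  def parseDate(str: String, pattern: String): LocalDate =
    LocalDate.parse(str, get(pattern))
}
```

Inside a UDF you would call CachedFormatters.parseDate(str, format) and convert the result to a type Spark can store, for example java.sql.Date.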

Related

Overloaded method value udf with alternatives

I am trying to register a table in Databricks Community Edition using the following code:
import org.apache.spark.sql.functions.udf
val getDataUDF(url: String):Unit = udf(getData(url: String):Unit)
However, I get an error:
overloaded method value udf with alternatives:

Your UDF syntax looks a bit strange: you shouldn't specify the types when calling getData(). In addition, the input to the UDF should be handled inside the function itself.
For example, you have a method getData like this (it should have a return value):
def getData(url: String): String = {...}
To make it into a UDF, there are two ways:
1. Rewrite getData as a function
val getData: (String => String) = {...}
val getDataUDF = udf(getData)
2. Call the getData method inside the udf
val getDataUDF = udf((url: String) => {
getData(url)
})
Both of these should work; personally, I think method 1 looks a bit better.
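The difference between the two options is only in how you obtain a function value, which is what udf expects. A minimal plain-Scala sketch (no Spark required; the function bodies are placeholders invented for the example, as are all the names):

```scala
object FunctionValueSketch {
  // Option 1: define the logic directly as a function value.
  val getDataFn: String => String = url => url.trim.toLowerCase

  // Option 2: keep a method and wrap the call in a lambda.
  def getDataMethod(url: String): String = url.trim.toLowerCase
  val getDataWrapped: String => String = (url: String) => getDataMethod(url)
}
```

Either value could then be handed to Spark as udf(FunctionValueSketch.getDataFn) or udf(FunctionValueSketch.getDataWrapped).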

Define UDF in Spark Scala

I need to use a UDF in Spark that takes in a timestamp, an Integer and another dataframe and returns a tuple of 3 values.
I keep hitting error after error and I'm not sure I'm trying to fix it right anymore.
Here is the function:
def determine_price (view_date: org.apache.spark.sql.types.TimestampType , product_id: Int, price_df: org.apache.spark.sql.DataFrame) : (Double, java.sql.Timestamp, Double) = {
var price_df_filtered = price_df.filter($"mkt_product_id" === product_id && $"created"<= view_date)
var price_df_joined = price_df_filtered.groupBy("mkt_product_id").agg("view_price" -> "min", "created" -> "max").withColumn("last_view_price_change", lit(1))
var price_df_final = price_df_joined.join(price_df_filtered, price_df_joined("max(created)") === price_df_filtered("created")).filter($"last_view_price_change" === 1)
var result = (price_df_final.select("view_price").head().getDouble(0), price_df_final.select("created").head().getTimestamp(0), price_df_final.select("min(view_price)").head().getDouble(0))
return result
}
val det_price_udf = udf(determine_price)
The error it gives me is:
error: missing argument list for method determine_price
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `determine_price _` or `determine_price(_,_,_)` instead of `determine_price`.
If I start adding the arguments, I keep running into other errors such as "Int expected, Int.type found" or "object DataFrame is not a member of package org.apache.spark.sql".
To give some context:
The idea is that I have a dataframe of prices, a product id and a date of creation and another dataframe containing product IDs and view dates.
I need to determine the price based on which was the last created price entry that is older than the view date.
Since each product ID has multiple view dates in the second dataframe, I thought a UDF would be faster than a cross join. If anyone has a different idea, I'd be grateful.

You cannot pass a DataFrame into a UDF, because the UDF runs on the workers, on a particular partition. Just as you cannot use an RDD on a worker (see "Is it possible to create nested RDDs in Apache Spark?"), you cannot use a DataFrame on a worker either.
You need a workaround for this.
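One possible shape for such a workaround is to express the lookup as a join between the two dataframes (match on product id, keep the prices created on or before the view date, then keep the latest one per view) instead of reaching into one dataframe from inside a UDF. The plain-Scala sketch below shows only that lookup logic on in-memory collections. The case classes and the name latestPriceBefore are assumptions made for illustration, and it is not Spark code.

```scala
import java.sql.Timestamp

object PriceLookupSketch {
  // Hypothetical row shapes mirroring the columns named in the question.
  final case class Price(productId: Int, created: Timestamp, viewPrice: Double)
  final case class View(productId: Int, viewDate: Timestamp)

  // For one view, pick the most recent price created on or before the view date.
  def latestPriceBefore(view: View, prices: Seq[Price]): Option[Price] = {
    val candidates = prices.filter { p =>
      p.productId == view.productId && !p.created.after(view.viewDate)
    }
    if (candidates.isEmpty) None
    else Some(candidates.maxBy(_.created.getTime))
  }
}
```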

spark spelling correction via udf

I need to correct some spellings using spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work with Spark (see "How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?").
The simple cases (the first 2 examples) can nicely be handled via
val spellingMistakes = Map(
"error1" -> "fix1"
)
val spellingNameCorrection: (String => String) = (t: String) => {
spellingMistakes.get(t) match {
case Some(tt) => tt // correct spelling
case None => t // keep original
}
}
val spellingUDF = udf(spellingNameCorrection)
val misspellings1 = hiddenSeasonalities
.withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex, chained conditional replacements in a UDF in a clean and generalizable manner.
If it is only a rather small list of spellings (fewer than 50), would you suggest hard-coding them within a UDF?

You can make the UDF receive more than one column:
val spellingCorrection2= udf((x: String, y: String) => if (x=="conditionC" && y=="conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To make this more general, you can use a map from a tuple of the two conditions to a string, the same way you did for the first case.
If you want to generalize it even more then you can use dataset mapping. Basically create a case class with the relevant columns and then use as to convert the dataframe to a dataset of the case class. Then use the dataset map and in it use pattern matching on the input data to generate the relevant corrections and convert back to dataframe.
This should be easier to write but would have a performance cost.
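The tuple-keyed map mentioned above can be sketched in plain Scala as follows. The map contents and the name correctB are placeholders for illustration; in Spark the function would be wrapped with udf and applied to the two columns.

```scala
object TupleCorrectionSketch {
  // Key: (value of the column to fix, value of the condition column).
  val corrections: Map[(String, String), String] = Map(
    ("conditionC", "conditionD") -> "replacementC"
  )

  // Return the replacement when the pair is known, otherwise keep the original value.
  def correctB(b: String, d: String): String =
    corrections.getOrElse((b, d), b)
}
```

Adding another conditional fix then means adding one more entry to the map rather than one more chained when clause.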

For now I will go with the following which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306

Suppose spellingMap is the map containing the correct spellings and df is the dataframe:
val df: DataFrame = ??? // your dataframe
val spellingMap = Map.empty[String, String] //fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this:
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
// Fall back to the original value when no correction is known
udf((value: String) => spellingMap.getOrElse(value, value))
And finally, you can apply it like this (col comes from org.apache.spark.sql.functions):
val newColumns = df.columns.map { columnName =>
if (columnsWithSpellingMistakes.contains(columnName)) spellingCorrectionUDF(spellingMap)(col(columnName)).as(columnName)
else col(columnName)
}
df.select(newColumns: _*)

How to use java.time.LocalDate in Cassandra query from Spark?

We have a table in Cassandra with column start_time of type date.
When we execute the following code:
val resultRDD = inputRDD.joinWithCassandraTable(KEY_SPACE,TABLE)
.where("start_time = ?", java.time.LocalDate.now)
We get the following error:
com.datastax.spark.connector.types.TypeConversionException: Cannot convert object 2016-10-13 of type class java.time.LocalDate to com.datastax.driver.core.LocalDate.
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:45)
at com.datastax.spark.connector.types.TypeConverter$$anonfun$convert$1.apply(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$LocalDateConverter$$anonfun$convertPF$14.applyOrElse(TypeConverter.scala:449)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$LocalDateConverter$.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:439)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$LocalDateConverter$.convert(TypeConverter.scala:439)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter$$anonfun$convertPF$29.applyOrElse(TypeConverter.scala:788)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:771)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.convert(TypeConverter.scala:771)
at com.datastax.spark.connector.writer.BoundStatementBuilder$$anonfun$8.apply(BoundStatementBuilder.scala:93)
I've tried to register custom converters according to documentation:
object JavaLocalDateToCassandraLocalDateConverter extends TypeConverter[com.datastax.driver.core.LocalDate] {
def targetTypeTag = typeTag[com.datastax.driver.core.LocalDate]
def convertPF = {
case ld: java.time.LocalDate => com.datastax.driver.core.LocalDate.fromYearMonthDay(ld.getYear, ld.getMonthValue, ld.getDayOfMonth)
case _ => com.datastax.driver.core.LocalDate.fromYearMonthDay(1971, 1, 1)
}
}
object CassandraLocalDateToJavaLocalDateConverter extends TypeConverter[java.time.LocalDate] {
def targetTypeTag = typeTag[java.time.LocalDate]
def convertPF = {
case ld: com.datastax.driver.core.LocalDate => java.time.LocalDate.of(ld.getYear(), ld.getMonth(), ld.getDay())
case _ => java.time.LocalDate.now
}
}
TypeConverter.registerConverter(JavaLocalDateToCassandraLocalDateConverter)
TypeConverter.registerConverter(CassandraLocalDateToJavaLocalDateConverter)
But it didn't help.
How can I use JDK8 Date/Time classes in Cassandra queries executed from Spark?

I think the simplest thing to do in a where clause like this is to just call
sc
.cassandraTable("test","test")
.where("start_time = ?", java.time.LocalDate.now.toString)
.collect
And just pass in the string since that will be a well defined conversion.
There seems to be an issue in the TypeConverters where your converter is not taking precedence over the built in converter. I'll take a quick look.
--Edit--
It seems like the registered converters are not being properly transferred to the Executors. In Local mode the code works as expected which makes me think this is a serialization issue. I would open a ticket on the Spark Cassandra Connector for this issue.
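The string workaround relies on java.time.LocalDate.toString producing the ISO-8601 form yyyy-MM-dd, which is the same text form that appears in the error message above (2016-10-13). A small plain-Scala check, with no Cassandra connection involved (the object and method names are made up for the example):

```scala
import java.time.LocalDate

object LocalDateStringSketch {
  // LocalDate.toString yields the ISO-8601 calendar date, e.g. 2016-10-13.
  def asQueryArg(date: LocalDate): String = date.toString
}
```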

The Cassandra date format is yyyy-MM-dd HH:mm:ss.SSS, so if you are using Java 8 you can use the code below to parse a Cassandra date string into a java.time value and then apply your logic.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
val dateTime = LocalDateTime.parse(cassandraDateTime, formatter)
Or you can convert the LocalDate to the Cassandra date format and check against that.

Spark dataframe: Schema for type Unit is not supported

I am using Spark 1.5.0 and I have this issue:
val df = paired_rdd.reduceByKey {
case (val1, val2) => val1 + "|" + val2
}.toDF("user_id","description")
Here is sample data for df, as you can see the column description has
this format (text1#text3#weight | text1#text3#weight|....)
user1
book1#author1#0.07841217886795074|tool1#desc1#0.27044260397331488|song1#album1#-0.052661673730870676|item1#category1#-0.005683148395350108
I want to sort this df based on weight in descending order. Here is what I tried:
First split the contents at "|" and then for each of those strings, split them at "#" and get the 3rd string which is weight and then convert that into a double value
val getSplitAtWeight = udf((str: String) => {
str.split("|").foreach(_.split("#")(2).toDouble)
})
Sort based on the weight value returned by the udf (in descending order)
val df_sorted = df.sort(getSplitAtWeight(col("description")).desc)
I get the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:153)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:64)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:29)
at org.apache.spark.sql.functions$.udf(functions.scala:2242)

Changing foreach in your UDF to map, as follows, will eliminate the exception:
def getSplitAtWeight = udf((str: String) => {
str.split('|').map(_.split('#')(2).toDouble)
})
The problem with your approach is that foreach doesn't return anything useful: its result is of type Unit, which is why you get the exception. To understand more about foreach, check this blog.
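Two things differ between the question's UDF and the corrected one: map returns the converted values where foreach returns Unit, and split is called with the characters '|' and '#' instead of strings. The second change matters because split with a String argument treats it as a regular expression, in which | means alternation rather than a literal pipe. The plain-Scala sketch below (no Spark; the object and method names are invented for the example) shows the parsing step on its own:

```scala
object WeightParsingSketch {
  // Split on the literal characters, then take the third '#'-separated field
  // (the weight) of each '|'-separated entry and convert it to Double.
  def weights(description: String): Array[Double] =
    description.split('|').map(_.split('#')(2).toDouble)
}
```

Because the result type is Array[Double] rather than Unit, Spark can derive a schema for the UDF's return value.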
