I am trying to register a table in Databricks Community Edition using the following code:
import org.apache.spark.sql.functions.udf
val getDataUDF(url: String):Unit = udf(getData(url: String):Unit)
However, I get an error:
overloaded method value udf with alternatives:
Your UDF syntax looks a bit strange: you shouldn't write the parameter type when calling getData(). In addition, the input to the UDF should be declared inside the function you pass to udf.
For example, say you have a method getData like this (it should have a return value, not Unit):
def getData(url: String): String = {...}
To turn it into a UDF, there are two ways:
Rewrite getData as a function
val getData: (String => String) = {...}
val getDataUDF = udf(getData)
Call the getData method inside the udf
val getDataUDF = udf((url: String) => {
  getData(url)
})
Both of these should work; personally I think the first one looks a bit better.
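For completeness, applying the resulting UDF to a DataFrame could look roughly like this. This is only a sketch: the DataFrame df, its url column, and the getData body are assumptions for illustration.

import org.apache.spark.sql.functions.udf
import spark.implicits._  // assumes an existing SparkSession named spark

// hypothetical getData that returns a String instead of Unit
def getData(url: String): String = url.toUpperCase

val getDataUDF = udf((url: String) => getData(url))

// df is a placeholder DataFrame with a string column "url"
val withData = df.withColumn("data", getDataUDF($"url"))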
I have the following UDT type:
CREATE TYPE tag_partitions(
year bigint,
month bigint);
and the following table
CREATE TABLE ${tableName} (
tag text,
partition_info set<FROZEN<tag_partitions>>,
PRIMARY KEY ((tag))
)
The table schema is mapped using the following model
case class TagPartitionsInfo(year:Long, month:Long)
case class TagPartitions(tag:String, partition_info:Set[TagPartitionsInfo])
I have written a function which should create an Update.IfExists query, but I don't know how I should update the UDT value. I tried to use set but it isn't working.
def updateValues(tableName: String, model: TagPartitions, id: TagPartitionKeys): Update.IfExists = {
  val partitionInfoType: UserType = session.getCluster().getMetadata
    .getKeyspace("codingjedi").getUserType("tag_partitions")

  // create value
  // the logic below assumes that there is only one element in the set
  val partitionsInfoSet: Set[UDTValue] = model.partition_info.map((partitionInfo: TagPartitionsInfo) => {
    partitionInfoType.newValue()
      .setLong("year", partitionInfo.year)
      .setLong("month", partitionInfo.month)
  })
  println("partition info converted to UDTValue: " + partitionsInfoSet)

  QueryBuilder.update(tableName)
    .`with`(QueryBuilder.WHAT_TO_DO_HERE_TO_UPDATE_UDT("partition_info", partitionsInfoSet))
    .where(QueryBuilder.eq("tag", id.tag)).ifExists()
}
The mistake was that I was adding partitionsInfoSet to the table as a Scala Set. I needed to convert it into a Java Set using setAsJavaSet:
QueryBuilder.update(tableName)
  .`with`(QueryBuilder.set("partition_info", setAsJavaSet(partitionsInfoSet)))
  .where(QueryBuilder.eq("tag", id.tag))
  .ifExists()
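Side note: setAsJavaSet comes from the scala.collection.JavaConversions helpers (if I recall correctly), which are deprecated in newer Scala versions. The JavaConverters decorators do the same conversion, if you prefer them; a small sketch reusing the partitionsInfoSet value from above:

import scala.collection.JavaConverters._

// same query, converting the Scala Set[UDTValue] with .asJava instead of setAsJavaSet
QueryBuilder.update(tableName)
  .`with`(QueryBuilder.set("partition_info", partitionsInfoSet.asJava))
  .where(QueryBuilder.eq("tag", id.tag))
  .ifExists()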
Although it doesn't answer your exact question, wouldn't it be easier to use the Object Mapper for this? Something like this (I didn't modify it heavily to match your code):
@UDT(name = "scala_udt")
case class UdtCaseClass(id: Integer, @(Field @field)(name = "t") text: String) {
  def this() {
    this(0, "")
  }
}

@Table(name = "scala_test_udt")
case class TableObjectCaseClassWithUDT(@(PartitionKey @field) id: Integer,
                                       udts: java.util.Set[UdtCaseClass]) {
  def this() {
    this(0, new java.util.HashSet[UdtCaseClass]())
  }
}
Then just create a case class instance and use mapper.save on it. (Also note that you need to use Java collections, unless you've imported the Scala codecs.)
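Saving an instance could look roughly like this with the driver's MappingManager; just a sketch, assuming the session from your code above is already connected:

import com.datastax.driver.mapping.MappingManager

// build a mapper for the annotated case class above
val manager = new MappingManager(session)
val mapper = manager.mapper(classOf[TableObjectCaseClassWithUDT])

// note the Java collection, per the remark above
val udts = new java.util.HashSet[UdtCaseClass]()
udts.add(UdtCaseClass(1, "some text"))

mapper.save(TableObjectCaseClassWithUDT(1, udts))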
The primary reasons for using the Object Mapper are ease of use and better performance, because it uses prepared statements under the hood instead of built statements, which are much less efficient.
You can find more information about Object Mapper + Scala in an article that I wrote recently.
I defined a UDF that takes a UDO (user-defined object) as a parameter. But when I try to call it on a DataFrame, I get the error "org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array) => int)". I just want to know whether it is expected that the exception refers to the UDO as binary, and how I should fix it.
val logCount = (logs: util.List[LogRecord]) => logs.size()
val logCountUdf = udf(logCount)
// The column 'LogRecords' is the agg function collect_list of UDO LogRecord
df.withColumn("LogCount", logCountUdf($"LogRecords"))
In general you can't pass custom objects into UDFs, and you should only call the UDF for non-null rows, otherwise there will be a NullPointerException inside your UDF. Try:
val logCount = (logs: Seq[Row]) => logs.size()
val logCountUdf = udf(logCount)
df.withColumn("LogCount", when($"LogRecords".isNotNull,logCountUdf($"LogRecords")))
Or just use the built-in function size to get the logCount:
df.withColumn("LogCount", size($"LogRecords"))
I have a UDF that accepts string parameters as well as fields, but it seems that "callUDF" can only accept fields.
I found a workaround using selectExpr(...) or by using spark.sql(...), but I wonder if there is any better way of doing that.
Here is an example:
Schema - id, map[String, String]
spark.sqlContext.udf.register("get_from_map", (map: Map[String, String], att: String) => map.getOrElse(att, ""))
val data = spark.read...
data.selectExpr("id", "get_from_map(map, 'attr')").show(15)
This will work, but I was kind of hoping for a better approach like:
data.select($"id", callUDF("get_from_map", $"map", "attr"))
Any ideas? Am I missing something?
I haven't seen any JIRA ticket open about this, so either I'm missing something or I'm misusing it.
Thanks!
You can use the lit function for that:
data.select($"id", callUDF("get_from_map", $"map", lit("attr")))
Essentially, lit() allows you to pass literals (strings, numbers) where columns are expected.
You might also want to register your function using the udf function, so you can use it directly rather than via callUDF:
import org.apache.spark.sql.functions._
val getFromMap = udf((map:Map[String,String], att : String) => map.getOrElse(att,""))
data.select($"id", getFromMap($"map", lit("attr")))
I have the following Slick class that includes a date:
import java.sql.Date
import java.time.LocalDate
class ReportDateDB(tag: Tag) extends Table[ReportDateVO](tag, "report_dates") {
  def reportDate = column[LocalDate]("report_date")(localDateColumnType)
  def * = (reportDate) <> (ReportDateVO.apply, ReportDateVO.unapply)

  implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
    d => Date.valueOf(d),
    d => d.toLocalDate
  )
}
When I attempt to sort the table by date:
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I get the following compilation error
not enough arguments for method sortBy: (implicit evidence$2: slick.lifted.Rep[java.time.LocalDate] ⇒
slick.lifted.Ordered)slick.lifted.Query[fdic.ReportDateDB,fdic.ReportDateDB#TableElementType,Seq].
Unspecified value parameter evidence$2.
No implicit view available from slick.lifted.Rep[java.time.LocalDate] ⇒ slick.lifted.Ordered.
How do I specify the implicit default ordering?
You need to make your implicit val localDateColumnType available where you run the query. For example, this will work:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
  d => Date.valueOf(d),
  d => d.toLocalDate)
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I'm not sure where the best place to put this is, but I usually put all these conversions in a package object.
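For example, a package object along these lines keeps the mapping in one place (the package name and the MySQL profile are just placeholders for whatever your project uses):

package object persistence {
  import java.sql.Date
  import java.time.LocalDate
  import slick.driver.MySQLDriver.api._

  // shared LocalDate <-> java.sql.Date mapping; bring it into scope with `import persistence._`
  implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
    d => Date.valueOf(d),
    d => d.toLocalDate)
}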
It should work as described here:
implicit def localDateOrdering: Ordering[LocalDate] = Ordering.fromLessThan(_ isBefore _)
Try adding this line to your import list:
import slick.driver.MySQLDriver.api._
I am using Spark 1.3. I have a dataset where the dates in the ordering_date column are in yyyy/MM/dd format. I want to do some calculations with dates, and therefore I want to use joda-time to do some conversions/formatting. Here is the UDF that I have:
val return_date = udf((str: String, dtf: DateTimeFormatter) => dtf.formatted(str))
Here is the code where the UDF is being called. However, I get an error saying "Not Applicable". Do I need to register this UDF or am I missing something here?
val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), DateTimeFormat.forPattern("yyyy/MM/dd"))
)
I don't believe you can pass in the DateTimeFormatter as an argument to the UDF. You can only pass in a Column. One solution would be to do:
val return_date = udf((str: String, format: String) => {
  DateTimeFormat.forPattern(format).formatted(str)
})
And then:
val user_with_dates_formatted = users.withColumn(
"formatted_date",
return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
Honestly, though, both this and your original approach have the same problem: they both call forPattern on yyyy/MM/dd for every record. It would be better to create a singleton object wrapped around a Map[String, DateTimeFormatter], maybe like this (thoroughly untested, but you get the idea):
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

object DateFormatters {
  // cache of format string -> formatter, filled in lazily as formats are requested
  var formatters = Map[String, DateTimeFormatter]()

  def getFormatter(format: String): DateTimeFormatter = {
    if (formatters.get(format).isEmpty) {
      formatters = formatters + (format -> DateTimeFormat.forPattern(format))
    }
    formatters.get(format).get
  }
}
Then you would change your UDF to:
val return_date = udf((str: String, format: String) => {
  DateFormatters.getFormatter(format).formatted(str)
})
That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
One thing to note about the singleton object solution is that you can't define the object in the spark-shell; you have to package it up in a JAR file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.