Spark SQL doesn't call my UDT equals/hashcode methods - apache-spark

I want to implement my comparison operators(equals, hashcode, ordering) in a data type defined by me in Spark SQL. Although Spark SQL UDT's still remains private, I follow some examples like this, to workaround this situation.
I have a class called MyPoint:
#SQLUserDefinedType(udt = classOf[MyPointUDT])
case class MyPoint(x: Double, y: Double) extends Serializable {
override def hashCode(): Int = {
println("hash code")
31 * (31 * x.hashCode()) + y.hashCode()
}
override def equals(other: Any): Boolean = {
println("equals")
other match {
case that: MyPoint => this.x == that.x && this.y == that.y
case _ => false
}
}
Then, I have the UDT class:
private class MyPointUDT extends UserDefinedType[MyPoint] {
override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
override def serialize(obj: MyPoint): ArrayData = {
obj match {
case features: MyPoint =>
new GenericArrayData2(Array(features.x, features.y))
}
}
override def deserialize(datum: Any): MyPoint = {
datum match {
case data: ArrayData if data.numElements() == 2 => {
val arr = data.toDoubleArray()
new MyPoint(arr(0), arr(1))
}
}
}
override def userClass: Class[MyPoint] = classOf[MyPoint]
override def asNullable: MyPointUDT = this
}
Then I create a simple DataFrame:
val p1 = new MyPoint(1.0, 2.0)
val p2 = new MyPoint(1.0, 2.0)
val p3 = new MyPoint(10.0, 20.0)
val p4 = new MyPoint(11.0, 22.0)
val points = Seq(
("P1", p1),
("P2", p2),
("P3", p3),
("P4", p4)
).toDF("label", "point")
points.registerTempTable("points")
spark.sql("SELECT Distinct(point) FROM points").show()
The problem is: Why the SQL query doesn't execute the equals method inside MyPoint class? How comparasions are being made? How can I implement my comparasion operators in this example?

Related

Defining and reading a nullable date column in Slick 3.x

I have a table with a column type date. This column accepts null values, therefore, I declared it as an Option (see field perDate below). The issue is that apparently the implicit conversion from/to java.time.LocalDate/java.sql.Date is incorrect as reading from this table when perDate is null fails with the error:
slick.SlickException: Read NULL value (null) for ResultSet column <computed>
This is the Slick table definition, including the implicit function:
import java.sql.Date
import java.time.LocalDate
class FormulaDB(tag: Tag) extends Table[Formula](tag, "formulas") {
def sk = column[Int]("sk", O.PrimaryKey, O.AutoInc)
def name = column[String]("name")
def descrip = column[Option[String]]("descrip")
def formula = column[Option[String]]("formula")
def notes = column[Option[String]]("notes")
def periodicity = column[Int]("periodicity")
def perDate = column[Option[LocalDate]]("per_date")(localDateColumnType)
def * = (sk, name, descrip, formula, notes, periodicity, perDate) <>
((Formula.apply _).tupled, Formula.unapply)
implicit val localDateColumnType = MappedColumnType.base[Option[LocalDate], Date](
{
case Some(localDate) => Date.valueOf(localDate)
case None => null
},{
sqlDate => if (sqlDate != null) Some(sqlDate.toLocalDate) else None
}
)
}
Actually your implicit conversion from/to java.time.LocalDate/java.sql.Date is not incorrect.
I have faced the same error, and doing some research I found that the Node created by the Slick SQL Compiler is actually of type MappedJdbcType[Scala.Option -> LocalDate], and not Option[LocalDate].
That is the reason why when the mapping compiler create the column converter for your def perDate it is creating a Base ResultConverterand not a Option ResultConverter
Here is the Slick code for the base converter:
def base[T](ti: JdbcType[T], name: String, idx: Int) = (ti.scalaType match {
case ScalaBaseType.byteType => new BaseResultConverter[Byte](ti.asInstanceOf[JdbcType[Byte]], name, idx)
case ScalaBaseType.shortType => new BaseResultConverter[Short](ti.asInstanceOf[JdbcType[Short]], name, idx)
case ScalaBaseType.intType => new BaseResultConverter[Int](ti.asInstanceOf[JdbcType[Int]], name, idx)
case ScalaBaseType.longType => new BaseResultConverter[Long](ti.asInstanceOf[JdbcType[Long]], name, idx)
case ScalaBaseType.charType => new BaseResultConverter[Char](ti.asInstanceOf[JdbcType[Char]], name, idx)
case ScalaBaseType.floatType => new BaseResultConverter[Float](ti.asInstanceOf[JdbcType[Float]], name, idx)
case ScalaBaseType.doubleType => new BaseResultConverter[Double](ti.asInstanceOf[JdbcType[Double]], name, idx)
case ScalaBaseType.booleanType => new BaseResultConverter[Boolean](ti.asInstanceOf[JdbcType[Boolean]], name, idx)
case _ => new BaseResultConverter[T](ti.asInstanceOf[JdbcType[T]], name, idx) {
override def read(pr: ResultSet) = {
val v = ti.getValue(pr, idx)
if(v.asInstanceOf[AnyRef] eq null) throw new SlickException("Read NULL value ("+v+") for ResultSet column "+name)
v
}
}
}).asInstanceOf[ResultConverter[JdbcResultConverterDomain, T]]
Unfortunately I have no solution for this problem, what I suggest as a workaround, is to map your perDate property as follows:
import java.sql.Date
import java.time.LocalDate
class FormulaDB(tag: Tag) extends Table[Formula](tag, "formulas") {
def sk = column[Int]("sk", O.PrimaryKey, O.AutoInc)
def name = column[String]("name")
def descrip = column[Option[String]]("descrip")
def formula = column[Option[String]]("formula")
def notes = column[Option[String]]("notes")
def periodicity = column[Int]("periodicity")
def perDate = column[Option[Date]]("per_date")
def toLocalDate(time : Option[Date]) : Option[LocalDate] = time.map(t => t.toLocalDate))
def toSQLDate(localDate : Option[LocalDate]) : Option[Date] = localDate.map(localDate => Date.valueOf(localDate)))
private type FormulaEntityTupleType = (Int, String, Option[String], Option[String], Option[String], Int, Option[Date])
private val formulaShapedValue = (sk, name, descrip, formula, notes, periodicity, perDate).shaped[FormulaEntityTupleType]
private val toFormulaRow: (FormulaEntityTupleType => Formula) = { formulaTuple => {
Formula(formulaTuple._1, formulaTuple._2, formulaTuple._3, formulaTuple._4, formulaTuple._5, formulaTuple._6, toLocalDate(formulaTuple._7))
}
}
private val toFormulaTuple: (Formula => Option[FormulaEntityTupleType]) = { formulaRow =>
Some((formulaRow.sk, formulaRow.name, formulaRow.descrip, formulaRow.formula, formulaRow.notes, formulaRow.periodicity, toSQLDate(formulaRow.perDate)))
}
def * = formulaShapedValue <> (toFormulaRow, toFormulaTuple)
Hopefully the answer comes not too late.
I'm pretty sure the problem is that your'e returning null from your mapping function instead of None.
Try rewriting your mapping function as a function from LocalDate to Date:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
{
localDate => Date.valueOf(localDate)
},{
sqlDate => sqlDate.toLocalDate
}
)
Alternately, mapping from Option[LocalDate] to Option[Date] should work:
implicit val localDateColumnType =
MappedColumnType.base[Option[LocalDate], Option[Date]](
{
localDateOption => localDateOption.map(Date.valueOf)
},{
sqlDateOption => sqlDateOption.map(_.toLocalDate)
}
)

Slick 3.0 best practices

Coming from JPA and Hibernate, I find using Slick pretty straight forward apart from using some Joins and Aggregate queries.
Now is there some best practices that I could employ when defining my tables and the mapping case classes? I currently have a single class that holds all the queries, Table definitions and the case classes to which I map to when I query the database. This class deals with say 10 to 15 tables and the file has become quite big.
I'm now wondering if I should split this into different packages for readability. What do you guys think?
You can separate out slick mapping table from slick queries. Put slick mapping table into trait. Mixed this trait where you want write quires and join into table. for example:
package com.knol.db.repo
import com.knol.db.connection.DBComponent
import scala.concurrent.Future
trait BankRepository extends BankTable { this: DBComponent =>
import driver.api._
def create(bank: Bank): Future[Int] = db.run { bankTableAutoInc += bank }
def update(bank: Bank): Future[Int] = db.run { bankTableQuery.filter(_.id === bank.id.get).update(bank) }
def getById(id: Int): Future[Option[Bank]] = db.run { bankTableQuery.filter(_.id === id).result.headOption }
def getAll(): Future[List[Bank]] = db.run { bankTableQuery.to[List].result }
def delete(id: Int): Future[Int] = db.run { bankTableQuery.filter(_.id === id).delete }
}
private[repo] trait BankTable { this: DBComponent =>
import driver.api._
private[BankTable] class BankTable(tag: Tag) extends Table[Bank](tag,"bank") {
val id = column[Int]("id", O.PrimaryKey, O.AutoInc)
val name = column[String]("name")
def * = (name, id.?) <> (Bank.tupled, Bank.unapply)
}
protected val bankTableQuery = TableQuery[BankTable]
protected def bankTableAutoInc = bankTableQuery returning bankTableQuery.map(_.id)
}
case class Bank(name: String, id: Option[Int] = None)
For joining two table:
package com.knol.db.repo
import com.knol.db.connection.DBComponent
import scala.concurrent.Future
trait BankInfoRepository extends BankInfoTable { this: DBComponent =>
import driver.api._
def create(bankInfo: BankInfo): Future[Int] = db.run { bankTableInfoAutoInc += bankInfo }
def update(bankInfo: BankInfo): Future[Int] = db.run { bankInfoTableQuery.filter(_.id === bankInfo.id.get).update(bankInfo) }
def getById(id: Int): Future[Option[BankInfo]] = db.run { bankInfoTableQuery.filter(_.id === id).result.headOption }
def getAll(): Future[List[BankInfo]] = db.run { bankInfoTableQuery.to[List].result }
def delete(id: Int): Future[Int] = db.run { bankInfoTableQuery.filter(_.id === id).delete }
def getBankWithInfo(): Future[List[(Bank, BankInfo)]] =
db.run {
(for {
info <- bankInfoTableQuery
bank <- info.bank
} yield (bank, info)).to[List].result
}
def getAllBankWithInfo(): Future[List[(Bank, Option[BankInfo])]] =
db.run {
bankTableQuery.joinLeft(bankInfoTableQuery).on(_.id === _.bankId).to[List].result
}
}
private[repo] trait BankInfoTable extends BankTable { this: DBComponent =>
import driver.api._
private[BankInfoTable] class BankInfoTable(tag: Tag) extends Table[BankInfo](tag,"bankinfo") {
val id = column[Int]("id", O.PrimaryKey, O.AutoInc)
val owner = column[String]("owner")
val bankId = column[Int]("bank_id")
val branches = column[Int]("branches")
def bank = foreignKey("bank_product_fk", bankId, bankTableQuery)(_.id)
def * = (owner, branches, bankId, id.?) <> (BankInfo.tupled, BankInfo.unapply)
}
protected val bankInfoTableQuery = TableQuery[BankInfoTable]
protected def bankTableInfoAutoInc = bankInfoTableQuery returning bankInfoTableQuery.map(_.id)
}
case class BankInfo(owner: String, branches: Int, bankId: Int, id: Option[Int] = None)
For more explanation see Blog and github.
It might be helpful!!!

Slick DBIO: master-detail

I need to collect report data from a master-detail relations. Here is a simplified example:
case class Person(id: Int, name: String)
case class Order(id: String, personId: Int, description: String)
class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
def id = column[Int]("id")
def name = column[String]("name")
override def * = (id, name) <>(Person.tupled, Person.unapply)
}
class OrderTable(tag: Tag) extends Table[Order](tag, "order") {
def id = column[String]("id")
def personId = column[Int]("personId")
def description = column[String]("description")
override def * = (id, personId, description) <>(Order.tupled, Order.unapply)
}
val persons = TableQuery[PersonTable]
val orders = TableQuery[OrderTable]
case class PersonReport(nameToDescription: Map[String, Seq[String]])
/** Some complex function that cannot be expressed in SQL and
* in slick's #join.
*/
def myScalaCondition(person: Person): Boolean =
person.name.contains("1")
// Doesn't compile:
// val reportDbio1:DBIO[PersonReport] =
// (for{ allPersons <- persons.result
// person <- allPersons
// if myScalaCondition(person)
// descriptions <- orders.
// filter(_.personId == person.id).
// map(_.description).result
// } yield (person.name, descriptions)
// ).map(s => PersonReport(s.toMap))
val reportDbio2: DBIO[PersonReport] =
persons.result.flatMap {
allPersons =>
val dbios = allPersons.
filter(myScalaCondition).map { person =>
orders.
filter(_.personId == person.id).
map(_.description).result.map { seq => (person.name, seq) }
}
DBIO.sequence(dbios)
}.map(ps => PersonReport(ps.toMap))
It looks far away from straightforward. When I need to collect master-detail data with 3 levels, it becomes incomprehensible.
Is there a better way?

Exception when using UDT in Spark DataFrame

I'm trying to create a user defined type in spark sql, but I receive:
com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType even when using their example. Has anyone made this work?
My code:
test("udt serialisation") {
val points = Seq(new ExamplePoint(1.3, 1.6), new ExamplePoint(1.3, 1.8))
val df = SparkContextForStdout.context.parallelize(points).toDF()
}
#SQLUserDefinedType(udt = classOf[ExamplePointUDT])
case class ExamplePoint(val x: Double, val y: Double)
/**
* User-defined type for [[ExamplePoint]].
*/
class ExamplePointUDT extends UserDefinedType[ExamplePoint] {
override def sqlType: DataType = ArrayType(DoubleType, false)
override def pyUDT: String = "pyspark.sql.tests.ExamplePointUDT"
override def serialize(obj: Any): Seq[Double] = {
obj match {
case p: ExamplePoint =>
Seq(p.x, p.y)
}
}
override def deserialize(datum: Any): ExamplePoint = {
datum match {
case values: Seq[_] =>
val xy = values.asInstanceOf[Seq[Double]]
assert(xy.length == 2)
new ExamplePoint(xy(0), xy(1))
case values: util.ArrayList[_] =>
val xy = values.asInstanceOf[util.ArrayList[Double]].asScala
new ExamplePoint(xy(0), xy(1))
}
}
override def userClass: Class[ExamplePoint] = classOf[ExamplePoint]
}
The usefull stackstrace is this:
com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType
java.lang.ClassCastException: com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:316)
at org.apache.spark.sql.SQLContext$implicits$.rddToDataFrameHolder(SQLContext.scala:254)
It seems that the UDT needs to be used inside of another class to work (as the type of a field). One solution to use it directly is to wrap it into a Tuple1:
test("udt serialisation") {
val points = Seq(new Tuple1(new ExamplePoint(1.3, 1.6)), new Tuple1(new ExamplePoint(1.3, 1.8)))
val df = SparkContextForStdout.context.parallelize(points).toDF()
df.collect().foreach(println(_))
}

scala parallel collections: Idiomatic way of having thread-local-variables for worker threads

The progress function below is my worker function. I need to give it access to some classes which are costly to create / acquire. Is there any standard machinery for thread-local-variables in the libraries for this ? Or will I have to write a object pool manager myself ?
object Start extends App {
def progress {
val current = counter.getAndIncrement
if(current % 100 == 0) {
val perc = current.toFloat * 100 / totalPosts
print(f"\r$perc%4.2f%%")
}
}
val lexicon = new Global()
def processTopic(forumId: Int, topicId: Int) {
val(topic, posts) = ProcessingQueries.getTopicAndPosts(forumId,topicId)
progress
}
val (fid, tl) = ProcessingQueries.getAllTopics("hon")
val totalPosts = tl.size
val counter = new AtomicInteger(0)
val par = tl.par
par.foreach { topic_id =>
processTopic(fid,topic_id)
}
}
Replaced the previous answer. This does the trick nice and tidy
object MyAnnotator extends ThreadLocal[StanfordCoreNLP] {
val props = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
props.put("ssplit.newlineIsSentenceBreak", "two")
props.put("parse.maxlen", "40")
override def initialValue = new StanfordCoreNLP(props)
}

Resources