Scala Slick: retrieve data from tables in parallel (multithreading)

I need to read data from two different tables (each with over 100k rows) in the same database. So I created two Futures with a connection pool size of 50, but performance doesn't seem to improve (total time is around 5 seconds). Then I found this article:
So if you want to run multiple queries in parallel: no problem, just start them in separate Futures. However you won't have performance benefits, JDBC simply blocks a different Thread, not your main Thread of execution.
Does this mean all the threads get stuck at JDBC and the queries are processed sequentially? Is that true even if my connection pool size is 50? If so, could you suggest an efficient way to deal with tables this large (for example, loading the data in less than 2 seconds)?
Here is my piece of code:
case class User(name: String, age: Int)

class Users(tag: Tag) extends Table[User](tag, "User") {
  def user_id = column[Long]("user_id")
  def name    = column[String]("user_name")
  def age     = column[Int]("user_age")
  // Only name and age are needed for the report, so project just those two columns.
  def * = (name, age) <> (User.tupled, User.unapply)
}
val users = TableQuery[Users]
case class Patron(name: String, patronType: Int)

class Patrons(tag: Tag) extends Table[Patron](tag, "Patron") {
  def patron_id  = column[Long]("patron_id")
  def name       = column[String]("patron_name")
  // `type` is a reserved word in Scala, hence the field is named patronType.
  def patronType = column[Int]("patron_type")
  def * = (name, patronType) <> (Patron.tupled, Patron.unapply)
}
val patrons = TableQuery[Patrons]
def getUsers(implicit session: Session): Future[Map[String, Int]] = Future {
  val allUserQuery = for (user <- users) yield (user.name, user.age)
  allUserQuery.run.toMap
}

def getPatrons(implicit session: Session): Future[Map[String, Int]] = Future {
  val allPatronQuery = for (patron <- patrons) yield (patron.name, patron.patronType)
  allPatronQuery.run.toMap
}
val (userMap: Map[String, Int], patronMap: Map[String, Int]) = Await.result(
  for {
    userData   <- getUsers
    patronData <- getPatrons
  } yield (userData, patronData),
  10.seconds)
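For reference, because getUsers and getPatrons are defs, the for-comprehension above only starts the second query after the first one has completed. A minimal sketch (the val names are illustrative, not from the original code) that kicks both queries off before combining them:

val usersFuture   = getUsers
val patronsFuture = getPatrons

// Both futures are already running here, so the two queries can proceed
// on separate threads; the for-comprehension only combines their results.
val (usersByName, patronsByName) = Await.result(
  for {
    userData   <- usersFuture
    patronData <- patronsFuture
  } yield (userData, patronData),
  10.seconds)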

Related

In Spark, how do I read a field by its name instead of by its index?

I use Spark 1.3.
My data has 50 or more attributes, hence I went for a custom class.
How do I access a field of a custom class by its name rather than by its position?
Currently I need to invoke the method productElement(0) every time.
Also, I am not supposed to use a case class, hence I am using a custom class for the schema.
class OnlineEvents(gsm_id: String,
                   attribution_id: String,
                   event_date: String,
                   event_timestamp: String,
                   event_type: String
                  ) extends Product {
  override def productElement(n: Int): Any = n match {
    case 0 => gsm_id
    case 1 => attribution_id
    case 2 => event_date
    case 3 => event_timestamp
    case 4 => event_type
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
  override def productArity: Int = 5
  override def canEqual(that: Any): Boolean = that.isInstanceOf[OnlineEvents]
}
My Spark code:
val onlineRDD = sc.textFile("/user/cloudera/input_files/online_events.txt")
val schemaRDD = onlineRDD.map { record =>
  val arr: Array[String] = record.split(",")
  new OnlineEvents(arr(0), arr(1), arr(2), arr(3), arr(4))
}
val keyvalueRDD = schemaRDD.map(online => ((online.productElement(0).toString, online.productElement(4).toString), online))
If I try to access any field of OnlineEvents, I need to use productElement() (i.e. online.productElement(0) for gsm_id).
Can I access the fields directly as online.gsm_id ... online.event_type, so that my code is more readable?
How do I access a field directly by its name when I use a custom class for the schema?
As I understand your question, you need to define some accessor functions inside OnlineEvents that return the fields. So your solution could be:
class OnlineEvents(gsm_id: String,
                   attribution_id: String,
                   event_date: String,
                   event_timestamp: String,
                   event_type: String
                  ) extends Product {
  def get_gsm_id(): String = gsm_id
  def get_attribution_id(): String = attribution_id
  def get_event_date(): String = event_date
  def get_event_timestamp(): String = event_timestamp
  def get_event_type(): String = event_type

  override def productElement(n: Int): Any = n match {
    case 0 => gsm_id
    case 1 => attribution_id
    case 2 => event_date
    case 3 => event_timestamp
    case 4 => event_type
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
  override def productArity: Int = 5
  override def canEqual(that: Any): Boolean = that.isInstanceOf[OnlineEvents]
}
And call the functions as below:
val keyvalueRDD = schemaRDD.map(online => ((online.get_gsm_id(), online.get_event_type()), online))
I strongly recommend using one case class per use case (together they cover all the use cases that use the data).
A single use case would then be a single case class that would save you a lot of thinking about how to maintain the 50+ fields.
Yeah, you'd "trade" a single big 50-or-more-field class for 10 5-field case classes, but given how easy it is to create a case class and how nicely they would describe your data I think it's worth the hassle.
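For illustration only (ImpressionEvent and its field selection are hypothetical, not from the original post), a single use case might then look like:

// Hypothetical per-use-case case class: only the fields this use case needs.
case class ImpressionEvent(gsm_id: String, event_timestamp: String, event_type: String)

val impressionsRDD = onlineRDD.map { record =>
  val arr = record.split(",")
  ImpressionEvent(arr(0), arr(3), arr(4))
}

// Fields are now accessible by name, no productElement needed.
val keyedRDD = impressionsRDD.map(e => ((e.gsm_id, e.event_type), e))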

Defining and reading a nullable date column in Slick 3.x

I have a table with a column of type date. The column accepts null values, so I declared it as an Option (see the field perDate below). The issue is that the implicit conversion between java.time.LocalDate and java.sql.Date is apparently incorrect, as reading from this table fails with the following error when perDate is null:
slick.SlickException: Read NULL value (null) for ResultSet column <computed>
This is the Slick table definition, including the implicit function:
import java.sql.Date
import java.time.LocalDate

class FormulaDB(tag: Tag) extends Table[Formula](tag, "formulas") {
  def sk          = column[Int]("sk", O.PrimaryKey, O.AutoInc)
  def name        = column[String]("name")
  def descrip     = column[Option[String]]("descrip")
  def formula     = column[Option[String]]("formula")
  def notes       = column[Option[String]]("notes")
  def periodicity = column[Int]("periodicity")
  def perDate     = column[Option[LocalDate]]("per_date")(localDateColumnType)

  def * = (sk, name, descrip, formula, notes, periodicity, perDate) <>
    ((Formula.apply _).tupled, Formula.unapply)

  implicit val localDateColumnType = MappedColumnType.base[Option[LocalDate], Date](
    {
      case Some(localDate) => Date.valueOf(localDate)
      case None            => null
    },
    sqlDate => if (sqlDate != null) Some(sqlDate.toLocalDate) else None
  )
}
Actually your implicit conversion from/to java.time.LocalDate/java.sql.Date is not incorrect.
I have faced the same error, and doing some research I found that the Node created by the Slick SQL Compiler is actually of type MappedJdbcType[Scala.Option -> LocalDate], and not Option[LocalDate].
That is why, when the mapping compiler creates the column converter for your def perDate, it creates a Base ResultConverter and not an Option ResultConverter.
Here is the Slick code for the base converter:
def base[T](ti: JdbcType[T], name: String, idx: Int) = (ti.scalaType match {
  case ScalaBaseType.byteType => new BaseResultConverter[Byte](ti.asInstanceOf[JdbcType[Byte]], name, idx)
  case ScalaBaseType.shortType => new BaseResultConverter[Short](ti.asInstanceOf[JdbcType[Short]], name, idx)
  case ScalaBaseType.intType => new BaseResultConverter[Int](ti.asInstanceOf[JdbcType[Int]], name, idx)
  case ScalaBaseType.longType => new BaseResultConverter[Long](ti.asInstanceOf[JdbcType[Long]], name, idx)
  case ScalaBaseType.charType => new BaseResultConverter[Char](ti.asInstanceOf[JdbcType[Char]], name, idx)
  case ScalaBaseType.floatType => new BaseResultConverter[Float](ti.asInstanceOf[JdbcType[Float]], name, idx)
  case ScalaBaseType.doubleType => new BaseResultConverter[Double](ti.asInstanceOf[JdbcType[Double]], name, idx)
  case ScalaBaseType.booleanType => new BaseResultConverter[Boolean](ti.asInstanceOf[JdbcType[Boolean]], name, idx)
  case _ => new BaseResultConverter[T](ti.asInstanceOf[JdbcType[T]], name, idx) {
    override def read(pr: ResultSet) = {
      val v = ti.getValue(pr, idx)
      if(v.asInstanceOf[AnyRef] eq null) throw new SlickException("Read NULL value ("+v+") for ResultSet column "+name)
      v
    }
  }
}).asInstanceOf[ResultConverter[JdbcResultConverterDomain, T]]
Unfortunately I have no solution for this problem. What I suggest as a workaround is to map your perDate property as follows:
import java.sql.Date
import java.time.LocalDate

class FormulaDB(tag: Tag) extends Table[Formula](tag, "formulas") {
  def sk          = column[Int]("sk", O.PrimaryKey, O.AutoInc)
  def name        = column[String]("name")
  def descrip     = column[Option[String]]("descrip")
  def formula     = column[Option[String]]("formula")
  def notes       = column[Option[String]]("notes")
  def periodicity = column[Int]("periodicity")
  def perDate     = column[Option[Date]]("per_date")

  def toLocalDate(time: Option[Date]): Option[LocalDate] = time.map(_.toLocalDate)
  def toSQLDate(localDate: Option[LocalDate]): Option[Date] = localDate.map(Date.valueOf)

  private type FormulaEntityTupleType = (Int, String, Option[String], Option[String], Option[String], Int, Option[Date])

  private val formulaShapedValue = (sk, name, descrip, formula, notes, periodicity, perDate).shaped[FormulaEntityTupleType]

  private val toFormulaRow: FormulaEntityTupleType => Formula = { formulaTuple =>
    Formula(formulaTuple._1, formulaTuple._2, formulaTuple._3, formulaTuple._4,
            formulaTuple._5, formulaTuple._6, toLocalDate(formulaTuple._7))
  }

  private val toFormulaTuple: Formula => Option[FormulaEntityTupleType] = { formulaRow =>
    Some((formulaRow.sk, formulaRow.name, formulaRow.descrip, formulaRow.formula,
          formulaRow.notes, formulaRow.periodicity, toSQLDate(formulaRow.perDate)))
  }

  def * = formulaShapedValue <> (toFormulaRow, toFormulaTuple)
}
Hopefully this answer does not come too late.
I'm pretty sure the problem is that you're returning null from your mapping function instead of None.
Try rewriting your mapping function as a function from LocalDate to Date:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
  localDate => Date.valueOf(localDate),
  sqlDate => sqlDate.toLocalDate
)
Alternately, mapping from Option[LocalDate] to Option[Date] should work:
implicit val localDateColumnType =
  MappedColumnType.base[Option[LocalDate], Option[Date]](
    localDateOption => localDateOption.map(Date.valueOf),
    sqlDateOption => sqlDateOption.map(_.toLocalDate)
  )
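With the non-Option mapping implicitly in scope, the column from the question should then be declarable as a plain optional column; a minimal sketch, assuming Slick derives the Option[LocalDate] column type from the base LocalDate mapping:

// Sketch: with an implicit LocalDate <-> Date mapping in scope, Slick can lift it
// to Option[LocalDate], so a NULL cell is read back as None.
def perDate = column[Option[LocalDate]]("per_date")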

Slick 3 for Scala: update query not functioning

First of all I would like to state that I am new to Slick and am using version 3.1.1. I have been reading the manual but I am having trouble getting my query to work. Either something is wrong with my connection string or something is wrong with my Slick code. I got my config from http://slick.typesafe.com/doc/3.1.1/database.html and my update example from the bottom of http://slick.typesafe.com/doc/3.1.1/queries.html. OK, so here is my code.
Application Config.
mydb = {
  dataSourceClass = org.postgresql.ds.PGSimpleDataSource
  properties = {
    databaseName = "Jsmith"
    user = "postgres"
    password = "unique"
  }
  numThreads = 10
}
My controller (the database table is called relations):
package controllers

import play.api.mvc._
import slick.driver.PostgresDriver.api._

class Application extends Controller {

  class relations(tag: Tag) extends Table[(Int, Int, Int)](tag, "relations") {
    def id        = column[Int]("id", O.PrimaryKey)
    def me        = column[Int]("me")
    def following = column[Int]("following")
    def * = (id, me, following)
  }

  val profiles = TableQuery[relations]

  val db = Database.forConfig("mydb")

  try {
    // ...
  } finally db.close()

  def index = Action {
    val q = for { p <- profiles if p.id === 2 } yield p.following
    val updateAction = q.update(322)
    val invoker = q.updateStatement
    Ok()
  }
}
What could be wrong with my code above? I have a separate project that uses plain JDBC, and this configuration works perfectly for it:
db.default.driver=org.postgresql.Driver
db.default.url="jdbc:postgresql://localhost:5432/Jsmith"
db.default.user="postgres"
db.default.password="unique"
You did not run your action yet. db.run(updateAction) executes your query (or rather, your action) against the database (untested):
def index = Action.async {
  val q = for { p <- profiles if p.id === 2 } yield p.following
  val updateAction = q.update(322)
  val db = Database.forConfig("mydb")
  db.run(updateAction).map(_ => Ok())
}
db.run() returns a Future which will eventually be completed. It is then simply mapped to a Result in Play.
q.updateStatement, on the other hand, just generates the SQL statement. This can be useful while debugging.
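For example, a small sketch using the query from above:

// Builds and prints the SQL of the UPDATE without executing anything against the database.
println(q.updateStatement)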
Here is the code from my project:
def updateStatus(username: String, password: String, status: Boolean): Future[Boolean] =
  db.run(
    (for {
      user <- Users if user.username === username
    } yield user).map(_.online).update(status)
  ).map(_ > 0) // update returns the affected row count; treat > 0 as success (needs an implicit ExecutionContext)

Slick DBIO: master-detail

I need to collect report data from a master-detail relation. Here is a simplified example:
case class Person(id: Int, name: String)
case class Order(id: String, personId: Int, description: String)
class PersonTable(tag: Tag) extends Table[Person](tag, "person") {
  def id   = column[Int]("id")
  def name = column[String]("name")
  override def * = (id, name) <> (Person.tupled, Person.unapply)
}

class OrderTable(tag: Tag) extends Table[Order](tag, "order") {
  def id          = column[String]("id")
  def personId    = column[Int]("personId")
  def description = column[String]("description")
  override def * = (id, personId, description) <> (Order.tupled, Order.unapply)
}

val persons = TableQuery[PersonTable]
val orders  = TableQuery[OrderTable]
case class PersonReport(nameToDescription: Map[String, Seq[String]])
/** Some complex function that cannot be expressed in SQL and
* in slick's #join.
*/
def myScalaCondition(person: Person): Boolean =
person.name.contains("1")
// Doesn't compile:
// val reportDbio1:DBIO[PersonReport] =
// (for{ allPersons <- persons.result
// person <- allPersons
// if myScalaCondition(person)
// descriptions <- orders.
// filter(_.personId == person.id).
// map(_.description).result
// } yield (person.name, descriptions)
// ).map(s => PersonReport(s.toMap))
val reportDbio2: DBIO[PersonReport] =
  persons.result.flatMap { allPersons =>
    val dbios = allPersons.filter(myScalaCondition).map { person =>
      orders.
        filter(_.personId === person.id).
        map(_.description).result.map { seq => (person.name, seq) }
    }
    DBIO.sequence(dbios)
  }.map(ps => PersonReport(ps.toMap))
This looks far from straightforward, and when I need to collect master-detail data with three levels it becomes incomprehensible.
Is there a better way?
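One way to flatten this (a sketch only, not from the original post; it assumes the tables above, Slick's inSet operator, and an implicit ExecutionContext in scope) is to load the matching persons first, then fetch all of their orders in a single query and group them in Scala:

// Sketch: two queries instead of one query per person.
// 1. Load all persons and apply the Scala-only predicate in memory.
// 2. Fetch the orders of the remaining persons with a single `inSet` query.
// 3. Group the descriptions per person name in Scala.
val reportDbio3: DBIO[PersonReport] =
  for {
    allPersons  <- persons.result
    selected     = allPersons.filter(myScalaCondition)
    theirOrders <- orders.filter(_.personId inSet selected.map(_.id).toSet).result
  } yield {
    val byPerson = theirOrders.groupBy(_.personId)
    PersonReport(selected.map { p =>
      p.name -> byPerson.getOrElse(p.id, Seq.empty).map(_.description)
    }.toMap)
  }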

Scala parallel collections: idiomatic way of having thread-local variables for worker threads

The progress function below is my worker function. I need to give it access to some classes which are costly to create or acquire. Is there any standard machinery for thread-local variables in the libraries for this, or will I have to write an object pool manager myself?
object Start extends App {

  def progress {
    val current = counter.getAndIncrement
    if (current % 100 == 0) {
      val perc = current.toFloat * 100 / totalPosts
      print(f"\r$perc%4.2f%%")
    }
  }

  val lexicon = new Global()

  def processTopic(forumId: Int, topicId: Int) {
    val (topic, posts) = ProcessingQueries.getTopicAndPosts(forumId, topicId)
    progress
  }

  val (fid, tl) = ProcessingQueries.getAllTopics("hon")
  val totalPosts = tl.size
  val counter = new AtomicInteger(0)

  val par = tl.par
  par.foreach { topic_id =>
    processTopic(fid, topic_id)
  }
}
Replaced the previous answer. This does the trick, nice and tidy:
object MyAnnotator extends ThreadLocal[StanfordCoreNLP] {
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
  props.put("ssplit.newlineIsSentenceBreak", "two")
  props.put("parse.maxlen", "40")

  override def initialValue = new StanfordCoreNLP(props)
}
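A possible way to use it from the worker (a sketch, reusing processTopic from the question): each thread that calls MyAnnotator.get() lazily creates its own StanfordCoreNLP instance via initialValue and then reuses it on subsequent calls.

def processTopic(forumId: Int, topicId: Int) {
  val (topic, posts) = ProcessingQueries.getTopicAndPosts(forumId, topicId)
  // get() returns this worker thread's own pipeline, creating it on first use.
  val annotator = MyAnnotator.get()
  // ... annotate the posts with `annotator` ...
  progress
}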
