How to use DBIO.sequence and avoid StackOverflowError - slick

I am a Slick beginner, just experimenting with Slick 3.0 RC1. In my first project I'd like to import data from a text file into various tables. The whole import should happen in sequence, as the data appear in the file, within one transaction.
I tried to create an Iterator of the actions and wrap them in a DBIO.sequence.
The problem is that when the number of rows is big, the import fails with a StackOverflowError. Obviously I have misunderstood how to use Slick to do what I want to do. Is there a better way to chain a large number of actions into one transaction?
Here is a simplified version of my code where, instead of reading the data from a file, I simply "import" the numbers from a Range: the even ones go to the table XS, the odd ones to the table YS.
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import slick.driver.H2Driver.api._
import slick.lifted.ProvenShape

val db = Database.forConfig("h2mem1")
try {
  class Xs(tag: Tag) extends Table[(Long, String)](tag, "XS") {
    def id = column[Long]("ID", O.PrimaryKey)
    def name = column[String]("NAME")
    override def * : ProvenShape[(Long, String)] = (id, name)
  }
  class Ys(tag: Tag) extends Table[(Long, String)](tag, "YS") {
    def id = column[Long]("ID", O.PrimaryKey)
    def name = column[String]("NAME")
    override def * : ProvenShape[(Long, String)] = (id, name)
  }

  val xs = TableQuery[Xs]
  val ys = TableQuery[Ys]

  val setupAction = DBIO.seq((xs.schema ++ ys.schema).create)
  val importAction = DBIO.sequence((1L to 100000L).iterator.map { x =>
    if (x % 2 == 0) xs += ((x, x.toString))
    else ys += ((x, x.toString))
  })

  val f = db.run(setupAction andThen importAction)
  Await.result(f, Duration.Inf)
} finally {
  db.close()
}

The problem was caused by an inefficient implementation in the RC1 version of Slick. The implementation has since been improved, as can be seen in the issue thread here.
In Slick 3.0 RC3 the problem is solved :)
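If you are stuck on an affected version (or simply importing very large files), a common workaround is to avoid building one insert action per row and to batch the inserts with ++= instead. Below is a minimal sketch continuing the example above; the batch size of 1000 and the use of .transactionally to keep everything in a single transaction are my own assumptions, not part of the original question.
val rows = (1L to 100000L).map(x => (x, x.toString))
val (evens, odds) = rows.partition(_._1 % 2 == 0)

// One batched insert action per 1000 rows instead of one action per row.
val batchedImport = DBIO.seq(
  (evens.grouped(1000).map(xs ++= _) ++ odds.grouped(1000).map(ys ++= _)).toSeq: _*
)

// .transactionally keeps the whole import in a single transaction.
val f2 = db.run((setupAction andThen batchedImport).transactionally)
Await.result(f2, Duration.Inf)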

Related

Spark Aggregator on sorted Window never uses merge - is this reliable?

I am using org.apache.spark.sql.expressions.Aggregator to implement custom logic on a series of rows. I have noticed that the merge() function is never called when the Aggregator is applied to an ordered window with rows between unboundedPreceding and currentRow, i.e. the aggregation behavior is entirely determined by how new elements are added to the latest reduction via reduce().
If merge() is indeed never called in this case, UDAFs would be a great tool for integrating arbitrary custom logic on large partitions of ordered rows; see https://softwarerecs.stackexchange.com/questions/83666/foss-data-stack-to-perform-complex-custom-logic-on-billions-of-ordered-rows. However, I cannot find this mentioned in the Spark documentation or the Spark issue tracker, and hence I am wondering whether it is safe to rely on it - specifically for custom algorithms that don't allow for a merge()-like operation.
Below is some code specifically to test this behavior. I have checked the observation locally with a set of 300 million rows partitioned on three columns (each partition having a few million rows), and the observation holds up.
timestampdata.csv
category,eventTime
a,240
a,489
b,924
a,890
b,563
a,167
a,134
b,600
b,901
OrderedProcessing.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.{UserDefinedFunction, Window}
import org.apache.spark.sql.functions.udaf

object OrderedProcessing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val checkOrderingUdf: UserDefinedFunction =
      udaf[Int, OrderProcessingInfo, OrderProcessingInfo](CheckOrdering)

    val df_data = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
      .csv("./timestampdata.csv")

    val df_checked = df_data
      .withColumn("orderProcessingInfo",
        checkOrderingUdf.apply($"eventTime").over(
          Window.partitionBy("category").orderBy("eventTime")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .select($"category", $"eventTime",
        $"orderProcessingInfo".getItem("processedAllInOrder").alias("processedAllInOrder"),
        $"orderProcessingInfo".getItem("haveUsedReduce").alias("haveUsedReduce"),
        $"orderProcessingInfo".getItem("haveUsedMerge").alias("haveUsedMerge"))

    df_checked.groupBy("processedAllInOrder", "haveUsedReduce", "haveUsedMerge").count().show()
  }
}
OrderProcessingInfo.scala
case class OrderProcessingInfo(latestTime: Int, processedAllInOrder: Boolean, haveUsedReduce: Boolean, haveUsedMerge: Boolean)
CheckOrdering.scala
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

object CheckOrdering extends Aggregator[Int, OrderProcessingInfo, OrderProcessingInfo] {
  override def zero = OrderProcessingInfo(0, true, false, false)

  override def reduce(agg: OrderProcessingInfo, e: Int) = OrderProcessingInfo(
    latestTime = e,
    processedAllInOrder = agg.processedAllInOrder & (e >= agg.latestTime),
    haveUsedReduce = true,
    haveUsedMerge = agg.haveUsedMerge
  )

  override def merge(agg1: OrderProcessingInfo, agg2: OrderProcessingInfo) = OrderProcessingInfo(
    latestTime = agg1.latestTime.max(agg2.latestTime),
    processedAllInOrder = agg1.processedAllInOrder & agg2.processedAllInOrder & (agg2.latestTime >= agg1.latestTime),
    haveUsedReduce = agg1.haveUsedReduce | agg2.haveUsedReduce,
    haveUsedMerge = true
  )

  override def finish(agg: OrderProcessingInfo) = agg

  override def bufferEncoder: Encoder[OrderProcessingInfo] = implicitly(ExpressionEncoder[OrderProcessingInfo])
  override def outputEncoder: Encoder[OrderProcessingInfo] = implicitly(ExpressionEncoder[OrderProcessingInfo])
}
output
+-------------------+--------------+-------------+-----+
|processedAllInOrder|haveUsedReduce|haveUsedMerge|count|
+-------------------+--------------+-------------+-----+
| true| true| false| 9|
+-------------------+--------------+-------------+-----+

How come the test case still passes even though, in my opinion, I have not provided correct mocking?

I am testing this function. The main bit for me is the call to the add method of a repository (partitionsOfATagTransactionRepository.add(transaction, infoToAdd, mutationCondition)).
def updateOrCreateTagPartitionInfo(transaction: DistributedTransaction,
                                   currentTagPartition: Option[TagPartitions],
                                   tag: String) = {
  val currentCalendar = Calendar.getInstance() //TODOM - should I use a standard Locale/Timezone (eg GMT) to keep time consistent across all instances of the server application
  val currentYear = currentCalendar.get(Calendar.YEAR).toLong
  val currentMonth = currentCalendar.get(Calendar.MONTH).toLong
  val newTagParitionInfo = TagPartitionsInfo(currentYear.toLong, currentMonth.toLong)
  val (infoToAdd, mutationCondition) = currentTagPartition match {
    case Some(tagPartitionInfo) =>
      //checktest-should add new tag partition info to existing partition info
      (TagPartitions(tagPartitionInfo.tag, tagPartitionInfo.partitionInfo + newTagParitionInfo), new PutIfExists)
    case None =>
      //checktest-should add new tag partition info if existing partition doesn't exist
      (TagPartitions(tag, Set(newTagParitionInfo)), new PutIfNotExists)
  }
  partitionsOfATagTransactionRepository.add(transaction, infoToAdd, mutationCondition) // calling a repository method which I suppose needs mocking
  infoToAdd
}
I wrote this test case to test the method
"should add new tag partition info if existing partition doesn't exist" in {
val servicesTestEnv = new ServicesTestEnv(components = components)
val questionTransactionDBService = new QuestionsTransactionDatabaseService(
servicesTestEnv.mockAnswersTransactionRepository,
servicesTestEnv.mockPartitionsOfATagTransactionRepository,
servicesTestEnv.mockPracticeQuestionsTagsTransactionRepository,
servicesTestEnv.mockPracticeQuestionsTransactionRepository,
servicesTestEnv.mockSupportedTagsTransactionRepository,
servicesTestEnv.mockUserProfileAndPortfolioTransactionRepository,
servicesTestEnv.mockQuestionsCreatedByUserRepo,
servicesTestEnv.mockTransactionService,
servicesTestEnv.mockPartitionsOfATagRepository,
servicesTestEnv.mockHelperMethods
)
val currentCalendar = Calendar.getInstance() //TODOM - should I use a standard Locale/Timezone (eg GMT) to keep time consistent across all instances of the server application
val currentYear = currentCalendar.get(Calendar.YEAR).toLong
val currentMonth = currentCalendar.get(Calendar.MONTH).toLong
val newTagParitionInfo = TagPartitionsInfo(currentYear.toLong, currentMonth.toLong)
val existingTag = "someExistingTag"
val existingTagPartitions = None
val result = questionTransactionDBService.updateOrCreateTagPartitionInfo(servicesTestEnv.mockDistributedTransaction,
existingTagPartitions,existingTag) //calling the funtion under test but have not provided mock for the repository's add method. The test passes! how? Shouldn't the test throw Null Pointer exception?
val expectedResult = TagPartitions(existingTag,Set(newTagParitionInfo))
verify(servicesTestEnv.mockPartitionsOfATagTransactionRepository,times(1))
.add(servicesTestEnv.mockDistributedTransaction,expectedResult,new PutIfNotExists())
result mustBe expectedResult
result mustBe TagPartitions(existingTag,Set(newTagParitionInfo))
}
The various mocks are defined as
val mockCredentialsProvider = mock(classOf[CredentialsProvider])
val mockUserTokenTransactionRepository = mock(classOf[UserTokenTransactionRepository])
val mockUserTransactionRepository = mock(classOf[UserTransactionRepository])
val mockUserProfileAndPortfolioTransactionRepository = mock(classOf[UserProfileAndPortfolioTransactionRepository])
val mockHelperMethods = mock(classOf[HelperMethods])
val mockTransactionService = mock(classOf[TransactionService])
val mockQuestionsCreatedByUserRepo = mock(classOf[QuestionsCreatedByAUserForATagTransactionRepository])
val mockQuestionsAnsweredByUserRepo = mock(classOf[QuestionsAnsweredByAUserForATagTransactionRepository])
val mockDistributedTransaction = mock(classOf[DistributedTransaction])
val mockQuestionTransactionDBService = mock(classOf[QuestionsTransactionDatabaseService])
val mockQuestionNonTransactionDBService = mock(classOf[QuestionsNonTransactionDatabaseService])
val mockAnswersTransactionRepository = mock(classOf[AnswersTransactionRepository])
val mockPartitionsOfATagTransactionRepository = mock(classOf[PartitionsOfATagTransactionRepository])
val mockPracticeQuestionsTagsTransactionRepository = mock(classOf[PracticeQuestionsTagsTransactionRepository])
val mockPracticeQuestionsTransactionRepository = mock(classOf[PracticeQuestionsTransactionRepository])
val mockSupportedTagsTransactionRepository = mock(classOf[SupportedTagsTransactionRepository])
val mockPartitionsOfATagRepository = mock(classOf[PartitionsOfATagRepository])
The test case passes even though I have not provided any mock behaviour for partitionsOfATagTransactionRepository.add. Shouldn't I get a NullPointerException when the add method is called?
I was expecting that I would need to write something like doNothing().when(servicesTestEnv.mockPartitionsOfATagTransactionRepository).add(ArgumentMatchers.any[DistributedTransaction],ArgumentMatchers.any[TagPartitions],ArgumentMatchers.any[MutationCondition]) or when(servicesTestEnv.mockPartitionsOfATagTransactionRepository).add(ArgumentMatchers.any[DistributedTransaction],ArgumentMatchers.any[TagPartitions],ArgumentMatchers.any[MutationCondition]).thenReturn(...) for the test case to pass.
The Mockito team made a decision to return a default value for a method when no stubbing is provided.
See: https://javadoc.io/doc/org.mockito/mockito-core/latest/org/mockito/Mockito.html#stubbing
By default, for all methods that return a value, a mock will return either null, a primitive/primitive wrapper value, or an empty collection, as appropriate. For example 0 for an int/Integer and false for a boolean/Boolean.
This decision was made consciously: if you are focusing on a different aspect of the behaviour of the method under test, and the default value is good enough, you don't need to specify it.
Note that other mocking frameworks have taken the opposite path - they raise an exception when an unstubbed call is detected (for example, EasyMock).
See EasyMock vs Mockito: design vs maintainability?
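To see the default-return behaviour in isolation, here is a small sketch with a hypothetical Repo trait (the name and signature are made up, standing in for PartitionsOfATagTransactionRepository): an unstubbed call simply returns the default, and stubbing is only needed when the return value matters.
import org.mockito.ArgumentMatchers.any
import org.mockito.Mockito.{mock, when}

trait Repo { def add(transactionId: String, payload: String): Boolean }

val repo = mock(classOf[Repo])

// Unstubbed call: Mockito returns the default for the return type
// (false for Boolean, 0 for numerics, null for AnyRef) instead of throwing.
assert(!repo.add("tx-1", "data"))

// Stubbing is only needed when the code under test depends on the returned value.
when(repo.add(any[String], any[String])).thenReturn(true)
assert(repo.add("tx-1", "data"))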

How to parse a string of data into key-value pairs and append it into "put" method?

I have a string of data and want to parse it into key/value pairs, append them via the "put" method, and encode them into a GenericRecord. However, it doesn't work and I would be grateful for a hint on how to do it.
I converted the string into a list of strings, but apparently the method expects just two strings (a key and a value). I would be grateful for any ideas on how to solve the issue.
data = "{"name":"John", "surname":"Peterson", "country":"France", “amount”: null}"
val parsedData = data.split(",").map(x => {val y = x.split(":");(y(0),y(1))}).map(x => (x._1,x._2)).toList
//output type here List[(String, String)]
rec.put(parsedData)
//input for “put” method - public void put(String key, Object value)
Expected results: to append data dynamically as they come from a message:
rec.put("name", "John")
rec.put("surname", "Peterson")
rec.put("country", "France")
rec.put("amount", null)
I think this is a more elegant way of doing what you asked for.
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = org.json4s.DefaultFormats
val data = "{"name":"John", "surname":"Peterson", "country":"France", “amount”: null}"
val parsedData: Map[String, String] = parse(data).extract[Map[String, String]]
parsedData.foreach { case (key, value) => rec.put(key, value) }
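If rec is an Avro GenericRecord, a sketch of the full flow could look like the following; the Avro schema here is an illustrative assumption, and extracting into Map[String, Option[String]] is one way to let the JSON null for amount come through as None rather than breaking the extraction.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical schema matching the fields in the message; adjust to your real schema.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Person","fields":[
    |  {"name":"name","type":["null","string"]},
    |  {"name":"surname","type":["null","string"]},
    |  {"name":"country","type":["null","string"]},
    |  {"name":"amount","type":["null","string"]}
    |]}""".stripMargin)

implicit val formats: Formats = DefaultFormats
val data = """{"name":"John", "surname":"Peterson", "country":"France", "amount": null}"""

val rec: GenericRecord = new GenericData.Record(schema)
parse(data).extract[Map[String, Option[String]]].foreach { case (key, value) =>
  rec.put(key, value.orNull) // a JSON null ends up as null in the record, e.g. for "amount"
}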

spark spelling correction via udf

I need to correct some spellings using spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
  .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
  .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
  .withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work with Spark; see How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?
The simple cases (the first two examples) can be handled nicely via
val spellingMistakes = Map(
  "error1" -> "fix1"
)

val spellingNameCorrection: (String => String) = (t: String) => {
  spellingMistakes.get(t) match {
    case Some(tt) => tt // correct spelling
    case None     => t  // keep original
  }
}

val spellingUDF = udf(spellingNameCorrection)

val misspellings1 = hiddenSeasonalities
  .withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex / chained conditional replacements in a UDF in a nice and generalizable manner.
If it is only a rather small list of spellings (fewer than 50), would you suggest hard-coding them within a UDF?
You can make the UDF receive more than one column:
val spellingCorrection2 = udf((x: String, y: String) => if (x == "conditionC" && y == "conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To make this more generalized, you can use a map from a tuple of the two conditions to a string, the same as you did for the first case, as sketched below.
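As a sketch of that idea (the column names B and C and the corrections themselves are illustrative assumptions):
import org.apache.spark.sql.functions.udf

val pairCorrections: Map[(String, String), String] = Map(
  ("conditionC", "conditionD") -> "replacementC"
)

// Fall back to the original value of B when the pair of conditions does not match.
val pairCorrectionUDF = udf((b: String, c: String) => pairCorrections.getOrElse((b, c), b))

val misspellings4 = misspellings1.withColumn("B", pairCorrectionUDF($"B", $"C"))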
If you want to generalize it even more, you can use Dataset mapping. Basically, create a case class with the relevant columns, use as to convert the DataFrame to a Dataset of the case class, then use the Dataset's map and, inside it, pattern matching on the input data to generate the relevant corrections, and convert back to a DataFrame.
This should be easier to write but would have a performance cost.
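A minimal sketch of that approach, assuming the frame has string columns A, B and C (the case class and the corrections are illustrative):
import org.apache.spark.sql.{DataFrame, SparkSession}

case class RowData(A: String, B: String, C: String)

def correctWithDataset(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  df.as[RowData].map {
    case r @ RowData("error1", _, _)                => r.copy(A = "fix1")
    case r @ RowData(_, "conditionC", "conditionD") => r.copy(B = "replacementC")
    case r                                          => r
  }.toDF()
}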
For now I will go with the following, which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
If spellingMap is the map containing correct spellings, and df is the dataframe.
val df: DataFrame = _
val spellingMap = Map.empty[String, String] //fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf[String, Row]((value: Row) => {
    val cellValue = value.getString(0)
    if (spellingMap.contains(cellValue)) spellingMap(cellValue)
    else cellValue
  })
And finally, you can call them as
// struct and col come from org.apache.spark.sql.functions; the UDF above takes a Row,
// so each affected column is wrapped in a single-field struct.
val newColumns = df.columns.map { columnName =>
  if (columnsWithSpellingMistakes.contains(columnName))
    spellingCorrectionUDF(spellingMap)(struct(col(columnName))).as(columnName)
  else col(columnName)
}
df.select(newColumns: _*)

Sorting a table by date in Slick 3.1.x

I have the following Slick class that includes a date:
import java.sql.Date
import java.time.LocalDate

class ReportDateDB(tag: Tag) extends Table[ReportDateVO](tag, "report_dates") {
  def reportDate = column[LocalDate]("report_date")(localDateColumnType)
  def * = (reportDate) <> (ReportDateVO.apply, ReportDateVO.unapply)

  implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
    d => Date.valueOf(d),
    d => d.toLocalDate
  )
}
When I attempt to sort the table by date:
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I get the following compilation error
not enough arguments for method sortBy: (implicit evidence$2: slick.lifted.Rep[java.time.LocalDate] ⇒
slick.lifted.Ordered)slick.lifted.Query[fdic.ReportDateDB,fdic.ReportDateDB#TableElementType,Seq].
Unspecified value parameter evidence$2.
No implicit view available from slick.lifted.Rep[java.time.LocalDate] ⇒ slick.lifted.Ordered.
How to specify the implicit default order?
You need to make your implicit val localDateColumnType available where you run the query. For example, this will work:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
d => Date.valueOf(d),
d => d.toLocalDate)
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I'm not sure where the best place to put this is, but I usually put all these conversions in a package object.
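For example, something along these lines (the package name is arbitrary and the H2 driver is only an assumption - use the driver you actually run against):
package object persistence {
  import java.sql.Date
  import java.time.LocalDate
  import slick.driver.H2Driver.api._

  // In scope for both the table definition and the call site of sortBy.
  implicit val localDateColumnType =
    MappedColumnType.base[LocalDate, Date](d => Date.valueOf(d), d => d.toLocalDate)
}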
It should work as described here:
implicit def localDateOrdering: Ordering[LocalDate] = Ordering.fromLessThan(_ isBefore _)
Try adding this line to your import list:
import slick.driver.MySQLDriver.api._
