Spark Aggregator on sorted Window never uses merge - is this reliable? - apache-spark

I am using org.apache.spark.sql.expressions.Aggregator to implement custom logic on a series of rows. I have noticed that the merge() function is never called when the Aggregator is applied to an ordered window with rows between unboundedPreceding and currentRow; in other words, the aggregation behavior is determined entirely by reduce(), which adds each new element to the latest reduction.
If merge() is indeed never called in this case, UDAFs would be a great tool for integrating arbitrary custom logic on large partitions of ordered rows; see https://softwarerecs.stackexchange.com/questions/83666/foss-data-stack-to-perform-complex-custom-logic-on-billions-of-ordered-rows. However, I cannot find this mentioned in the Spark documentation or the Spark issue tracker, so I am wondering whether it is safe to rely on this behavior - specifically for custom algorithms that do not allow for a merge()-like operation.
Below is some code written specifically to test this behavior. I have checked the observation locally with a set of 300 million rows partitioned on three columns (each partition holding a few million rows), and it holds up.
timestampdata.csv
category,eventTime
a,240
a,489
b,924
a,890
b,563
a,167
a,134
b,600
b,901
OrderedProcessing.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.{UserDefinedFunction, Window}
import org.apache.spark.sql.functions.udaf

object OrderedProcessing {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Wrap the Aggregator as a UDAF so it can be applied over a window.
    val checkOrderingUdf: UserDefinedFunction = udaf[Int, OrderProcessingInfo, OrderProcessingInfo](CheckOrdering)

    val df_data = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
      .csv("./timestampdata.csv")

    val df_checked = df_data
      .withColumn("orderProcessingInfo",
        checkOrderingUdf.apply($"eventTime").over(
          Window.partitionBy("category").orderBy("eventTime")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .select($"category", $"eventTime",
        $"orderProcessingInfo".getItem("processedAllInOrder").alias("processedAllInOrder"),
        $"orderProcessingInfo".getItem("haveUsedReduce").alias("haveUsedReduce"),
        $"orderProcessingInfo".getItem("haveUsedMerge").alias("haveUsedMerge"))

    df_checked.groupBy("processedAllInOrder", "haveUsedReduce", "haveUsedMerge").count().show()
  }
}
OrderProcessingInfo.scala
case class OrderProcessingInfo(latestTime: Int, processedAllInOrder: Boolean, haveUsedReduce: Boolean, haveUsedMerge: Boolean)
CheckOrdering.scala
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

object CheckOrdering extends Aggregator[Int, OrderProcessingInfo, OrderProcessingInfo] {

  override def zero = OrderProcessingInfo(0, true, false, false)

  override def reduce(agg: OrderProcessingInfo, e: Int) = OrderProcessingInfo(
    latestTime = e,
    processedAllInOrder = agg.processedAllInOrder & (e >= agg.latestTime),
    haveUsedReduce = true,
    haveUsedMerge = agg.haveUsedMerge
  )

  override def merge(agg1: OrderProcessingInfo, agg2: OrderProcessingInfo) = OrderProcessingInfo(
    latestTime = agg1.latestTime.max(agg2.latestTime),
    processedAllInOrder = agg1.processedAllInOrder & agg2.processedAllInOrder & (agg2.latestTime >= agg1.latestTime),
    haveUsedReduce = agg1.haveUsedReduce | agg2.haveUsedReduce,
    haveUsedMerge = true
  )

  override def finish(agg: OrderProcessingInfo) = agg

  override def bufferEncoder: Encoder[OrderProcessingInfo] = ExpressionEncoder[OrderProcessingInfo]
  override def outputEncoder: Encoder[OrderProcessingInfo] = ExpressionEncoder[OrderProcessingInfo]
}
output
+-------------------+--------------+-------------+-----+
|processedAllInOrder|haveUsedReduce|haveUsedMerge|count|
+-------------------+--------------+-------------+-----+
| true| true| false| 9|
+-------------------+--------------+-------------+-----+

Related

How come the test case still passes even though, in my opinion, I have not provided correct mocking?

I am testing this function. The main bit for me is the call to the add method of a repository (partitionsOfATagTransactionRepository.add(transaction, infoToAdd, mutationCondition)).
def updateOrCreateTagPartitionInfo(transaction: DistributedTransaction, currentTagPartition: Option[TagPartitions], tag: String) = {
  val currentCalendar = Calendar.getInstance() //TODOM - should I use a standard Locale/Timezone (e.g. GMT) to keep time consistent across all instances of the server application?
  val currentYear = currentCalendar.get(Calendar.YEAR).toLong
  val currentMonth = currentCalendar.get(Calendar.MONTH).toLong
  val newTagParitionInfo = TagPartitionsInfo(currentYear.toLong, currentMonth.toLong)
  val (infoToAdd, mutationCondition) = currentTagPartition match {
    case Some(tagPartitionInfo) => {
      //checktest - should add new tag partition info to existing partition info
      (TagPartitions(tagPartitionInfo.tag, tagPartitionInfo.partitionInfo + newTagParitionInfo), new PutIfExists)
    }
    case None => {
      //checktest - should add new tag partition info if existing partition doesn't exist
      (TagPartitions(tag, Set(newTagParitionInfo)), new PutIfNotExists)
    }
  }
  partitionsOfATagTransactionRepository.add(transaction, infoToAdd, mutationCondition) //calling a repository method which I suppose needs mocking
  infoToAdd
}
I wrote this test case to test the method
"should add new tag partition info if existing partition doesn't exist" in {
val servicesTestEnv = new ServicesTestEnv(components = components)
val questionTransactionDBService = new QuestionsTransactionDatabaseService(
servicesTestEnv.mockAnswersTransactionRepository,
servicesTestEnv.mockPartitionsOfATagTransactionRepository,
servicesTestEnv.mockPracticeQuestionsTagsTransactionRepository,
servicesTestEnv.mockPracticeQuestionsTransactionRepository,
servicesTestEnv.mockSupportedTagsTransactionRepository,
servicesTestEnv.mockUserProfileAndPortfolioTransactionRepository,
servicesTestEnv.mockQuestionsCreatedByUserRepo,
servicesTestEnv.mockTransactionService,
servicesTestEnv.mockPartitionsOfATagRepository,
servicesTestEnv.mockHelperMethods
)
val currentCalendar = Calendar.getInstance() //TODOM - should I use a standard Locale/Timezone (eg GMT) to keep time consistent across all instances of the server application
val currentYear = currentCalendar.get(Calendar.YEAR).toLong
val currentMonth = currentCalendar.get(Calendar.MONTH).toLong
val newTagParitionInfo = TagPartitionsInfo(currentYear.toLong, currentMonth.toLong)
val existingTag = "someExistingTag"
val existingTagPartitions = None
val result = questionTransactionDBService.updateOrCreateTagPartitionInfo(servicesTestEnv.mockDistributedTransaction,
existingTagPartitions,existingTag) //calling the funtion under test but have not provided mock for the repository's add method. The test passes! how? Shouldn't the test throw Null Pointer exception?
val expectedResult = TagPartitions(existingTag,Set(newTagParitionInfo))
verify(servicesTestEnv.mockPartitionsOfATagTransactionRepository,times(1))
.add(servicesTestEnv.mockDistributedTransaction,expectedResult,new PutIfNotExists())
result mustBe expectedResult
result mustBe TagPartitions(existingTag,Set(newTagParitionInfo))
}
The various mocks are defined as
val mockCredentialsProvider = mock(classOf[CredentialsProvider])
val mockUserTokenTransactionRepository = mock(classOf[UserTokenTransactionRepository])
val mockUserTransactionRepository = mock(classOf[UserTransactionRepository])
val mockUserProfileAndPortfolioTransactionRepository = mock(classOf[UserProfileAndPortfolioTransactionRepository])
val mockHelperMethods = mock(classOf[HelperMethods])
val mockTransactionService = mock(classOf[TransactionService])
val mockQuestionsCreatedByUserRepo = mock(classOf[QuestionsCreatedByAUserForATagTransactionRepository])
val mockQuestionsAnsweredByUserRepo = mock(classOf[QuestionsAnsweredByAUserForATagTransactionRepository])
val mockDistributedTransaction = mock(classOf[DistributedTransaction])
val mockQuestionTransactionDBService = mock(classOf[QuestionsTransactionDatabaseService])
val mockQuestionNonTransactionDBService = mock(classOf[QuestionsNonTransactionDatabaseService])
val mockAnswersTransactionRepository = mock(classOf[AnswersTransactionRepository])
val mockPartitionsOfATagTransactionRepository = mock(classOf[PartitionsOfATagTransactionRepository])
val mockPracticeQuestionsTagsTransactionRepository = mock(classOf[PracticeQuestionsTagsTransactionRepository])
val mockPracticeQuestionsTransactionRepository = mock(classOf[PracticeQuestionsTransactionRepository])
val mockSupportedTagsTransactionRepository = mock(classOf[SupportedTagsTransactionRepository])
val mockPartitionsOfATagRepository = mock(classOf[PartitionsOfATagRepository])
The test case passes even though I have not provided any mock behavior for partitionsOfATagTransactionRepository.add. Shouldn't I get a NullPointerException when the add method is called?
I was expecting that I would need to write something like doNothing().when(servicesTestEnv.mockPartitionsOfATagTransactionRepository).add(ArgumentMatchers.any[DistributedTransaction],ArgumentMatchers.any[TagPartitions],ArgumentMatchers.any[MutationCondition]) or when(servicesTestEnv.mockPartitionsOfATagTransactionRepository).add(ArgumentMatchers.any[DistributedTransaction],ArgumentMatchers.any[TagPartitions],ArgumentMatchers.any[MutationCondition]).thenReturn(...) for the test case to pass.
The Mockito team made a deliberate decision to return a default value for a method if no stubbing is provided.
See: https://javadoc.io/doc/org.mockito/mockito-core/latest/org/mockito/Mockito.html#stubbing
By default, for all methods that return a value, a mock will return either null, a primitive/primitive wrapper value, or an empty collection, as appropriate. For example 0 for an int/Integer and false for a boolean/Boolean.
This decision was made consciously: if you are focusing on a different aspect of the behaviour of the method under test, and the default value is good enough, you don't need to specify it.
Note that other mocking frameworks have taken the opposite path - they raise an exception when an unstubbed call is detected (for example, EasyMock).
See EasyMock vs Mockito: design vs maintainability?
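For illustration, here is a minimal, self-contained sketch of that default behaviour. The GreetingRepository trait is hypothetical and not part of the code above; it only demonstrates what Mockito returns for unstubbed calls.

import org.mockito.Mockito.{mock, when}

// Hypothetical repository trait, used only to illustrate Mockito's defaults.
trait GreetingRepository {
  def findGreeting(name: String): String
  def count(): Int
}

val repo = mock(classOf[GreetingRepository])

// Unstubbed calls do not throw; they return Mockito's defaults:
// null for object types, 0 for Int, false for Boolean.
assert(repo.findGreeting("alice") == null)
assert(repo.count() == 0)

// Stub only when the test actually depends on the return value.
when(repo.findGreeting("bob")).thenReturn("hello")
assert(repo.findGreeting("bob") == "hello")

This is presumably why the unstubbed call to partitionsOfATagTransactionRepository.add does nothing observable and the test above still passes.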

spark spelling correction via udf

I need to correct some spellings using Spark.
Unfortunately, a naive approach like
val misspellings3 = misspellings1
  .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
  .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
  .withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work with Spark; see How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?
The simple cases (the first 2 examples) can nicely be handled via
val spellingMistakes = Map(
  "error1" -> "fix1"
)

val spellingNameCorrection: (String => String) = (t: String) => {
  spellingMistakes.get(t) match {
    case Some(tt) => tt // correct spelling
    case None => t // keep original
  }
}
val spellingUDF = udf(spellingNameCorrection)

val misspellings1 = hiddenSeasonalities
  .withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex / chained conditional replacements in a UDF in a nice and generalizable manner.
If it is only a rather small list of spellings (< 50), would you suggest hard-coding them within a UDF?
You can make the UDF receive more than one column:
val spellingCorrection2= udf((x: String, y: String) => if (x=="conditionC" && y=="conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To make this more generalized, you can use a map from a tuple of the two conditions to a string, the same as you did for the first case.
If you want to generalize it even more, you can use dataset mapping. Basically, create a case class with the relevant columns and use as to convert the dataframe to a dataset of that case class. Then use the dataset's map and, inside it, use pattern matching on the input data to generate the relevant corrections, and convert back to a dataframe; see the sketch below.
This should be easier to write but would have a performance cost.
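Here is a minimal sketch of that Dataset-mapping idea, assuming misspellings1 has three string columns A, B and D and that a SparkSession named spark is in scope; the Record case class and the concrete rules are illustrative only.

import org.apache.spark.sql.SparkSession

// Illustrative case class matching the assumed columns.
case class Record(A: String, B: String, D: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val corrections = Map("error1" -> "replacement1") // simple one-column fixes

val corrected = misspellings1
  .as[Record] // dataframe -> typed dataset
  .map {
    // chained / conditional rules become ordinary pattern matches
    case Record(a, "conditionC", "conditionD") =>
      Record(corrections.getOrElse(a, a), "replacementC", "conditionD")
    case Record(a, b, d) =>
      Record(corrections.getOrElse(a, a), b, d)
  }
  .toDF() // back to a dataframe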
For now I will go with the following, which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
Assume spellingMap is the map containing correct spellings and df is the dataframe:
val df: DataFrame = _
val spellingMap = Map.empty[String, String] //fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf[(String), Row]((value: Row) => {
    val cellValue = value.getString(0)
    if (spellingMap.contains(cellValue)) spellingMap(cellValue)
    else cellValue
  })
And finally, you can call them as
val newColumns = df.columns.map {
  case columnName =>
    if (columnsWithSpellingMistakes.contains(columnName)) spellingCorrectionUDF(spellingMap)(Column(columnName)).as(columnName)
    else Column(columnName)
}
df.select(newColumns: _*)

Sorting a table by date in Slick 3.1.x

I have the following Slick class that includes a date:
import java.sql.Date
import java.time.LocalDate
class ReportDateDB(tag: Tag) extends Table[ReportDateVO](tag, "report_dates") {
  def reportDate = column[LocalDate]("report_date")(localDateColumnType)

  def * = (reportDate) <> (ReportDateVO.apply, ReportDateVO.unapply)

  implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
    d => Date.valueOf(d),
    d => d.toLocalDate
  )
}
When I attempt to sort the table by date:
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I get the following compilation error
not enough arguments for method sortBy: (implicit evidence$2: slick.lifted.Rep[java.time.LocalDate] ⇒
slick.lifted.Ordered)slick.lifted.Query[fdic.ReportDateDB,fdic.ReportDateDB#TableElementType,Seq].
Unspecified value parameter evidence$2.
No implicit view available from slick.lifted.Rep[java.time.LocalDate] ⇒ slick.lifted.Ordered.
How do I specify the implicit default ordering?
You need to make your implicit val localDateColumnType available where you run the query. For example, this will work:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
  d => Date.valueOf(d),
  d => d.toLocalDate)
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I'm not sure where the best place to put this is, but I usually put all these conversions in a package object.
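For example, a package object along these lines keeps the mapping in one place (the package name fdic is taken from the error message above; adjust the driver import to the one you actually use, and remove the copy of the implicit inside the table class to avoid ambiguity):

package object fdic {
  import java.sql.Date
  import java.time.LocalDate
  import slick.driver.MySQLDriver.api._

  // Everything inside package fdic now sees this implicit, so the implicit
  // view from Rep[LocalDate] to Ordered required by sortBy can be resolved.
  implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
    d => Date.valueOf(d),
    d => d.toLocalDate
  )
}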
It should work as described here:
implicit def localDateOrdering: Ordering[LocalDate] = Ordering.fromLessThan(_ isBefore _)
Try adding this line to your import list:
import slick.driver.MySQLDriver.api._

Accessing nested data in spark

I have a collection of nested case classes. I've got a job that generates a dataset using these case classes, and writes the output to parquet.
I was pretty annoyed to discover that I have to manually do a load of faffing around to load and convert this data back to case classes to work with it in subsequent jobs. Anyway, that's what I'm now trying to do.
My case classes are like:
case class Person(userId: String, tech: Option[Tech])
case class Tech(browsers: Seq[Browser], platforms: Seq[Platform])
case class Browser(family: String, version: Int)
So I'm loading my parquet data. I can get the tech data as a Row with:
val df = sqlContext.load("part-r-00716.gz.parquet")
val x = df.head
val tech = x.getStruct(x.fieldIndex("tech"))
But now I can't find how to actually iterate over the browsers. If I try val browsers = tech.getStruct(tech.fieldIndex("browsers")) I get an exception:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to org.apache.spark.sql.Row
How can I iterate over my nested browser data using spark 1.5.2?
Update
In fact, my case classes contain optional values, so Browser actually is:
case class Browser(family: String,
                   major: Option[String] = None,
                   minor: Option[String] = None,
                   patch: Option[String] = None,
                   language: String,
                   timesSeen: Long = 1,
                   firstSeenAt: Long,
                   lastSeenAt: Long)
I also have something similar for Os:
case class Os(family: String,
              major: Option[String] = None,
              minor: Option[String] = None,
              patch: Option[String] = None,
              patchMinor: Option[String],
              override val timesSeen: Long = 1,
              override val firstSeenAt: Long,
              override val lastSeenAt: Long)
And so Tech is really:
case class Technographic(browsers: Seq[Browser],
                         devices: Seq[Device],
                         oss: Seq[Os])
Now, given the fact that some values are optional, I need a solution that will allow me to reconstruct my case classes correctly. The current solution doesn't support None values, so for example given the input data:
Tech(browsers = Seq(
  Browser(family = Some("IE"), major = Some(7), language = Some("en"), timesSeen = 3),
  Browser(family = None, major = None, language = Some("en-us"), timesSeen = 1),
  Browser(family = Some("Firefox"), major = None, language = None, timesSeen = 1)
))
I need it to load the data as follows:
family=IE, major=7, language=en, timesSeen=3,
family=None, major=None, language=en-us, timesSeen=1,
family=Firefox, major=None, language=None, timesSeen=1
Because the current solution doesn't support None values, the loaded data in fact has an arbitrary number of values per list item, i.e.:
browsers.family = ["IE", "Firefox"]
browsers.major = [7]
browsers.language = ["en", "en-us"]
timesSeen = [3, 1, 1]
As you can see, there's no way of converting the final data (returned by spark) into the case classes that generated it.
How can I work around this insanity?
Some examples
// Select two columns
df.select("userId", "tech.browsers").show()
// Select the nested values only
df.select("tech.browsers").show(truncate = false)
+-------------------------+
|browsers |
+-------------------------+
|[[Firefox,4], [Chrome,2]]|
|[[Firefox,4], [Chrome,2]]|
|[[IE,25]] |
|[] |
|null |
+-------------------------+
// Extract the family (nested value)
// This way you can iterate over the persons, and get their browsers
// Family values are nested
df.select("tech.browsers.family").show()
+-----------------+
| family|
+-----------------+
|[Firefox, Chrome]|
|[Firefox, Chrome]|
| [IE]|
| []|
| null|
+-----------------+
// Normalize the family: One row for each family
// Then you can iterate over all families
// Family values are un-nested, empty values/null/None are handled by explode()
df.select(explode(col("tech.browsers.family")).alias("family")).show()
+-------+
| family|
+-------+
|Firefox|
| Chrome|
|Firefox|
| Chrome|
| IE|
+-------+
Based on the last example:
val families = df.select(explode(col("tech.browsers.family")))
.map(r => r.getString(0)).distinct().collect().toList
println(families)
gives the unique list of browsers in a "normal" local Scala list:
List(IE, Firefox, Chrome)
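For the original question of iterating over the browsers themselves (rather than exploding a single field), one approach that should work on Spark 1.5 is to read the nested array as Seq[Row] instead of calling getStruct on it. A rough sketch, using the simple Browser(family, version) shape from the top of the question (missing collections would need the same Option treatment as the tech struct):

import org.apache.spark.sql.Row

val people = df.map { row =>
  val userId = row.getAs[String]("userId")
  // "tech" is a struct column and may be null, because Tech is optional.
  val techRow = Option(row.getAs[Row]("tech"))
  val browsers = techRow
    .map(_.getAs[Seq[Row]]("browsers").map(b =>
      Browser(b.getAs[String]("family"), b.getAs[Int]("version"))))
    .getOrElse(Seq.empty)
  (userId, browsers)
}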

How to use DBIO.sequence and avoid StackOverflowError

I am a Slick beginner, just experimenting with Slick 3.0 RC1. In my first project I'd like to import data from a text file into various tables. The whole import should happen in the order in which the data appear in the file, within one transaction.
I tried to create an Iterator of the actions and wrap them in a DBIO.sequence.
The problem is that when the number of rows is big, the import fails with a StackOverflowError. Obviously I have misunderstood how to use Slick to do what I want to do. Is there a better way to chain a large number of actions into one transaction?
Here is a simplified version of my code, where instead of reading the data from a file, I simply "import" the numbers from a Range: the even ones go into the table XS, the odd ones into the table YS.
val db = Database.forConfig("h2mem1")
try {

  class Xs(tag: Tag) extends Table[(Long, String)](tag, "XS") {
    def id = column[Long]("ID", O.PrimaryKey)
    def name = column[String]("NAME")
    override def * : ProvenShape[(Long, String)] = (id, name)
  }

  class Ys(tag: Tag) extends Table[(Long, String)](tag, "YS") {
    def id = column[Long]("ID", O.PrimaryKey)
    def name = column[String]("NAME")
    override def * : ProvenShape[(Long, String)] = (id, name)
  }

  val xs = TableQuery[Xs]
  val ys = TableQuery[Ys]

  val setupAction = DBIO.seq((xs.schema ++ ys.schema).create)

  val importAction = DBIO.sequence((1L to 100000L).iterator.map { x =>
    if (x % 2 == 0) {
      xs += ((x, x.toString))
    } else {
      ys += ((x, x.toString))
    }
  })

  val f = db.run(setupAction andThen importAction)
  Await.result(f, Duration.Inf)
} finally {
  db.close
}
The problem was caused by an inefficient implementation in the RC1 version of Slick. The implementation has been improved, as can be seen in the issue thread here.
In Slick 3 RC3 the problem is solved :)
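Independently of that fix, you can keep the action tree small by batching the rows and using bulk inserts instead of one insert action per row. A sketch based on the tables above (batchedImport and f2 are new names introduced here):

// Two bulk-insert actions instead of 100000 single inserts,
// so there is no deeply nested action tree to traverse.
val batchedImport = DBIO.seq(
  xs ++= (1L to 100000L).filter(_ % 2 == 0).map(x => (x, x.toString)),
  ys ++= (1L to 100000L).filter(_ % 2 != 0).map(x => (x, x.toString))
)

val f2 = db.run((setupAction andThen batchedImport).transactionally)
Await.result(f2, Duration.Inf)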
