The function in map is not executed [duplicate] - apache-spark

When I call the map function of an RDD, it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List ("a" , "d" , "c" , "d")
list.map(l => {
println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
println("mapping RDD")
})
Result of above code is :
mapping list
mapping list
mapping list
mapping list
But notice "mapping RDD" is not printed to screen. Why is this occurring ?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap( dist: RDD[(String)]) = {
var testMap = new java.util.HashMap[String , String]();
dist.map(m => {
println("populating map")
testMap.put(m , m)
})
testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null
Is this due to lazy evaluation ?

Lazy evaluation is part of this if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; it runs on the worker executing that closure. Try looking into the logs of the workers.
A similar thing is happening with the HashMap population in the second part of the question. The closure, including a copy of testMap, is serialized and sent to the workers, and the same piece of code is executed on each partition. Each worker updates only its own deserialized copy, and those changes are never sent back to the driver. The testMap on the driver therefore stays empty, and get("a") on an empty HashMap returns null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data must fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
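The transformation/action split described above can be illustrated without Spark. The following is a toy model in plain Python, not Spark's API: `LazyDataset` and its `map` and `collect` methods are hypothetical names that only mimic how a transformation is recorded and then run when an action is requested.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records functions, runs them only on collect()."""

    def __init__(self, items, funcs=()):
        self._items = list(items)
        self._funcs = tuple(funcs)

    def map(self, f):
        # Transformation: only remember f, do not call it yet.
        return LazyDataset(self._items, self._funcs + (f,))

    def collect(self):
        # Action: every recorded function is applied now.
        out = []
        for item in self._items:
            for f in self._funcs:
                item = f(item)
            out.append(item)
        return out


calls = []

def tag(x):
    calls.append(x)  # side effect, like the println in the question
    return (x, x)

ds = LazyDataset(["a", "d", "c", "d"]).map(tag)
calls_before_action = len(calls)  # nothing has run yet
test_map = dict(ds.collect())     # the action triggers the work
```

In this sketch the map is built from the collected result on the caller's side, which mirrors the collect-then-build approach of the answer rather than mutating a captured variable inside the mapped function.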


How do you transform a `FixedSqlAction` into a `StreamingDBIO` in Slick?

I'm creating an akka-stream using Alpakka and the Slick module but I'm stuck in a type mismatch problem.
One branch is about getting the total number of invoices in their table:
def getTotal(implicit session: SlickSession) = {
import session.profile.api._
val query = TableQuery[Tables.Invoice].length.result
Slick.source(query)
}
But the end line doesn't compile because Alpakka is expecting a StreamingDBIO but I'm providing a FixedSqlAction[Int,slick.dbio.NoStream,slick.dbio.Effect.Read].
How can I move from the non-streaming result to the streaming one?
Taking the length of a table results in a single value, not a stream. So the simplest way to get a Source to feed a stream is
def getTotal(implicit session: SlickSession): Source[Int, NotUsed] =
Source.lazyFuture { () =>
// Don't actually run the query until the stream has materialized and
// demand has reached the source
val query = TableQuery[Tables.Invoice].length.result
session.db.run(query)
}
Alpakka's Slick connector is more oriented towards streaming (including managing pagination etc.) results of queries that have a lot of results. For a single result, converting the Future of the result that vanilla Slick gives you into a stream is sufficient.
If you want to start executing the query as soon as you call getTotal (note that this happens whether or not the downstream ever runs or demands data from the source), you can have
def getTotal(implicit session: SlickSession): Source[Int, NotUsed] = {
val query = TableQuery[Tables.Invoice].length.result
Source.future(session.db.run(query))
}
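Would something like this work for you?
import scala.concurrent.Await
import scala.concurrent.duration._
def getTotal()(implicit session: SlickSession) = {
import session.profile.api._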
// Doc Expressions (Scalar values)
// https://scala-slick.org/doc/3.2.0/queries.html
val query = TableQuery[Tables.Invoice].length.result
val res = Await.result(session.db.run(query), 60.seconds)
println(s"Result: $res")
res
}

Read, update and save cached value atomically

I have multiple streams (N) which should update the same cache, so assume that there are at least N threads. Each thread may process values with the same keys. The problem is that if I do the update as follows:
1. Read the old value from the cache (multiple threads get the same old value)
2. Merge the new value with the old value (each thread updates the old value)
3. Save the updated value back to the cache (only the last update is saved, the others are lost)
I can lose some updates if multiple threads simultaneously try to update the same record. At first glance, there is a solution: make all updates atomic, for example by using the Increment mutation in HBase or add in Aerospike (currently, I'm considering these caches for my case). If the value consists only of numeric primitive types, this is fine, because both cache implementations support atomic inc/dec.
1. Inc/dec each value (the cache resolves the sequence of these operations by itself)
But what if the value does not consist only of primitives? Then I have to read the value and update it in my code. In this case I can still lose some updates.
As I wrote, I'm currently considering HBase and Aerospike, but neither fully fits my case. In HBase, as far as I know, there is no way to lock a row from the client side (> ~0.98), so I have to use the checkAndPut operation for each complex type. In Aerospike I can achieve something like a row-based lock using Lua UDFs, but I want to avoid them. Redis allows watching a record: if there was an update from another thread, the transaction fails, and I can catch this error and try again.
So, my question is how to achieve something like a row-based lock for such updates, and whether a row-based lock is the correct way at all. Maybe there is another approach?
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("sample")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Duration(500))
val source = Source()
val stream = source.stream(ssc)
stream.foreachRDD(rdd => {
if (!rdd.isEmpty()) {
rdd.foreachPartition(partition => {
if (partition.nonEmpty) {
val cache = Cache()
partition.foreach(entity=> {
// In this block, if 2 distributed workers (Apache Spark executors, for example)
// process entities with the same key, one of the updates can be lost.
// worker1 and worker2 will both read the same value
val value = cache.get(entity.key)
// both workers update this value but may compute different results
val updatedValue = ??? // some non-trivial update that depends on entity
// For example, worker1 puts its new value, then worker2 puts its new value. Only the update from worker2 is visible; the update from worker1 is lost.
cache.put(entity.key, updatedValue)
})
}
})
}
})
ssc.start()
ssc.awaitTermination()
}
If I use Kafka as the source, I can work around this when messages are partitioned by key. In that case I can rely on the fact that only one worker will process a particular record at any point in time. But how do I handle the same situation when messages are partitioned randomly (the key is inside the message body)?
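The retry approach mentioned above for Redis WATCH (and, in spirit, HBase checkAndPut) is a form of optimistic concurrency: read a value together with a version, compute the merge, and write only if the version has not changed. Below is a minimal in-memory sketch of that loop. `VersionedCache`, `compare_and_set`, and `atomic_update` are hypothetical names for illustration and do not correspond to any real client API.

```python
import threading


class VersionedCache:
    """In-memory stand-in for a store that offers a compare-and-set primitive."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        # Write only if nobody else has written since we read.
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, new_value)
            return True


def atomic_update(cache, key, merge):
    # Optimistic loop: read, merge, try to write, retry on conflict.
    while True:
        version, old = cache.get(key)
        if cache.compare_and_set(key, version, merge(old)):
            return


cache = VersionedCache()

def add_event(old):
    # Non-trivial (non-numeric) merge: append to a list.
    return (old or []) + ["event"]

def worker():
    for _ in range(100):
        atomic_update(cache, "k", add_event)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_version, final_value = cache.get("k")
```

With this shape no update is lost even when several workers hit the same key, at the cost of recomputing the merge whenever a conflict is detected. Under heavy contention on a single key the retries can become expensive, which is where partitioning by key, as in the Kafka case, helps.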

How does Spark Structured Streaming flush in-memory state when state data is no longer being checked?

I am trying to build a sessionization application with Spark Structured Streaming (version 2.2.0).
When using mapGroupsWithState in Update mode, I understand that the executor will crash with an OOM exception if the state data grows too large. Hence, I have to manage memory with the GroupStateTimeout option.
(Ref. How does Spark Structured Streaming handle in-memory state when state data is growing?)
However, I can't check whether the state has timed out and is ready to be removed if no new streaming data arrives for the particular keys.
For example, let's say I have the following code.
myDataset
.groupByKey(_.key)
.flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(makeSession)
The makeSession() function checks whether the state has timed out and removes the timed-out state.
Now, let's say the key "foo" already has some stored state in memory, and no new data with the key "foo" is streaming into the application. As a result, makeSession() does not process any data with key "foo" and the stored state is never checked. This means the stored state for key "foo" persists in memory. If there are many keys like "foo", their stored states will never be flushed and the JVM will raise an OOM exception.
I might be misunderstanding mapGroupsWithState, but I suspect my OOM exception is caused by the above issue.
If I am correct, what would be the solution for this case?
I want to flush all stored states that have timed out and receive no more new streaming data.
Is there any good code example?
Now, let's say the key "foo" has some stored state in memory already,
and no new data with the key "foo" is streaming into the application.
As a result, makeSession() does not process the data with key "foo"
and the stored state is not being checked.
This is incorrect. As long as you have new data for any key, Spark will make sure that each batch validates the entire key set, and it invokes your function for the timed-out keys one last time.
As part of every call to flat/mapGroupsWithState, we have:
val outputIterator =
updater.updateStateForKeysWithData(filteredIter) ++
updater.updateStateForTimedOutKeys()
And this is updateStateForTimedOutKeys:
def updateStateForTimedOutKeys(): Iterator[InternalRow] = {
if (isTimeoutEnabled) {
val timeoutThreshold = timeoutConf match {
case ProcessingTimeTimeout => batchTimestampMs.get
case EventTimeTimeout => eventTimeWatermark.get
case _ =>
throw new IllegalStateException(
s"Cannot filter timed out keys for $timeoutConf")
}
val timingOutKeys = store.filter { case (_, stateRow) =>
val timeoutTimestamp = getTimeoutTimestamp(stateRow)
timeoutTimestamp != NO_TIMESTAMP && timeoutTimestamp < timeoutThreshold
}
timingOutKeys.flatMap { case (keyRow, stateRow) =>
callFunctionAndUpdateState(keyRow, Iterator.empty, Some(stateRow), hasTimedOut = true)
}
} else Iterator.empty
}
The relevant part is the flatMap over the timed-out keys, which invokes the function for each of them one last time with hasTimedOut = true.
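To make the two-phase behavior concrete, here is a toy model in plain Python of a single micro-batch: first the keys that received data are processed, then every stored key whose timeout timestamp is below the threshold is processed once more with the timed-out flag set. This is only a sketch of the logic quoted above, not Spark code; `run_batch`, `make_session`, and the 10-unit timeout are assumptions made for illustration.

```python
NO_TIMESTAMP = -1


def run_batch(store, batch, threshold, func):
    """Toy model of one micro-batch: keys with data first, then timed-out keys.

    store maps key -> (state, timeout_timestamp).
    func(key, values, state, has_timed_out) returns the new
    (state, timeout_timestamp), or None to remove the state.
    """
    def apply(key, values, has_timed_out):
        state = store.get(key, (None, NO_TIMESTAMP))[0]
        result = func(key, values, state, has_timed_out)
        if result is None:
            store.pop(key, None)
        else:
            store[key] = result

    # Phase 1: keys that have new data in this batch.
    for key, values in batch.items():
        apply(key, values, False)

    # Phase 2: sweep the whole store, regardless of which keys had data.
    timed_out = [k for k, (_, ts) in store.items()
                 if ts != NO_TIMESTAMP and ts < threshold]
    for key in timed_out:
        apply(key, [], True)


def make_session(key, values, state, has_timed_out):
    if has_timed_out:
        return None  # drop the expired session
    count = (state or 0) + len(values)
    return (count, max(values) + 10)  # hypothetical timeout: last event + 10


store = {}
run_batch(store, {"foo": [1], "bar": [2]}, threshold=0, func=make_session)
after_first = dict(store)
# Only "bar" receives data, yet "foo" is still visited and removed.
run_batch(store, {"bar": [50]}, threshold=40, func=make_session)
```

In the second batch "foo" has no new data, but its stored timeout (11) is below the threshold (40), so the function is still invoked for it and its state is dropped.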

Pyspark applying foreach

I'm new to PySpark and I'm trying out a couple of functions to better understand how I could use them in more realistic scenarios. For a while now, I have been trying to apply a specific function to each number in an RDD. My problem is that when I try to print what I got from my RDD, the result is None.
My code:
from pyspark import SparkConf , SparkContext
conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
changed = []
def div_two(n):
    opera = n / 2
    return opera
numbers = [8,40,20,30,60,90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
#result = numbersRDD.map(lambda x: div_two(x))
for i in changed:
    print(i)
I would appreciate a clear explanation of why None ends up in the list, and what the right approach is to achieve this using foreach, if that is possible.
Thanks
Your definition of div_two is fine, although it can be reduced to
def div_two(n):
    return n / 2
Converting the list of integers to an RDD is fine too.
The main issue is that you are trying to add RDDs to the list changed by using the foreach function. But if you look at the definition of foreach
def foreach(self, f) Inferred type: (self: RDD, f: Any) -> None
it says that the return type is None, and that is what gets printed.
You don't need a list variable to print the changed elements of an RDD. You can simply write a function for printing and call that function in foreach
def printing(x):
    print(x)
numbersRDD.map(div_two).foreach(printing)
You should get the results printed.
You can still add the RDD to a list variable, but an RDD is itself a distributed collection and a list is a collection too. So if you add an RDD to a list you have a collection of collections, which means you need two loops
changed.append(numbersRDD.map(div_two))
def printing(x):
    print(x)
for i in changed:
    i.foreach(printing)
The main difference between your code and mine is that I used map (which is a transformation) instead of foreach (which is an action) when adding the RDD to the changed variable, and I used two loops to print the elements of the RDD.
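The difference between the two calls can be seen in plain Python, without a Spark cluster. In the sketch below, `foreach` is a hypothetical stand-in that mimics an operation run only for its side effects; it is not PySpark's implementation. In PySpark itself, the counterpart of the last line would be `numbersRDD.map(div_two).collect()`, which brings the mapped values back to the driver as a list.

```python
def foreach(items, f):
    # Runs f on every element for its side effects only.
    # There is no return statement, so the call evaluates to None.
    for x in items:
        f(x)


def div_two(n):
    return n / 2


numbers = [8, 40, 20, 30, 60, 90]

from_foreach = foreach(numbers, div_two)  # results of div_two are discarded
from_map = list(map(div_two, numbers))    # results are kept
```

Appending `from_foreach` to a list and printing that list shows None, which matches the behavior in the question.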

Updating an Array via Scala Parallel Collections

I have this array of HashMap defined as below
var distinctElementsDefinitionMap: scala.collection.mutable.ArrayBuffer[HashMap[String, Int]] = new scala.collection.mutable.ArrayBuffer[HashMap[String, Int]](300) with scala.collection.mutable.SynchronizedBuffer[HashMap[String, Int]]
Now, I have a parallel collection of 300 elements
val max_length = 300
val columnArray = (0 until max_length).toParArray
import scala.collection.parallel.ForkJoinTaskSupport
columnArray.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(100))
columnArray foreach(i => {
// Do Some Computation and get a HashMap
var distinctElementsMap: HashMap[String, Int] = //Some Value
//This line might result in Concurrent Access Exception
distinctElementsDefinitionMap.update(i, distinctElementsMap)
})
I am running a computation-intensive task within a foreach loop on the columnArray defined above.
After the computation is complete, I would like each of the threads to update a particular entry of the distinctElementsDefinitionMap array.
Each thread would update only one particular index, unique to the thread executing it.
I want to know whether updating an entry of the array like this is safe when multiple threads may be writing to it at the same time.
If not, is there a synchronized way of doing it so that it is thread-safe?
Thank you!
Update:
It appears this is really not a safe way to do it. I am getting a java.util.ConcurrentModificationException.
Any tips on how to avoid this while using parallel collections?
Use the .groupBy operation; as far as I can judge, it is parallelized (unlike some other methods, such as .sorted).
case class Row(a: String, b: String, c: String)
val data = Vector(
Row("foo", "", ""),
Row("bar", "", ""),
Row("foo", "", "")
)
data.par.groupBy(x => x.a).seq
// Map(bar -> ParVector(Row(bar,,)), foo -> ParVector(Row(foo,,), Row(foo,,)))
Hope you get the idea.
Alternatively, if your RAM allows it, parallelize the processing over columns rather than rows; it should be much more efficient than your current approach (less contention).
val columnsCount = 3 // 300 in your case
Vector.range(0, columnsCount).par.map { column =>
data.groupBy(row => row.productElement(column))
}.seq
You will likely have memory problems even with a single column, though (8M rows might be quite a lot).
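The general idea behind both suggestions is to let each parallel task return its result and have the framework assemble the results, instead of having tasks write into a shared mutable buffer. Below is a small Python analogue of that shape, assuming a hypothetical `distinct_counts` function and made-up column data. In Scala the comparable form would be calling map on the parallel collection and using the collection it returns.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: one list of values per column.
columns = [["a", "b", "a"], ["x", "x", "x"], ["p", "q", "r"]]


def distinct_counts(column):
    # Pure function: builds and returns a new dict, touches no shared state.
    return dict(Counter(column))


with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map yields results in input order, so index i of results
    # corresponds to column i without any explicit synchronization.
    results = list(pool.map(distinct_counts, columns))
```

Because no task writes to shared state, there is nothing that could raise a concurrent-modification error.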
