Does Spark unpersist() have a different strategy? - apache-spark

I just did some experimenting with Spark's unpersist() and am confused about what it actually does. I googled a lot, and almost everyone says that unpersist() will immediately evict the RDD from the executor's memory, but in this test we can see that's not always true. See the simple test below:
private static int base = 0;

public static Integer[] getInts() {
    Integer[] res = new Integer[5];
    for (int i = 0; i < 5; i++) {
        res[i] = base++;
    }
    System.out.println("number generated:" + res[0] + " to " + res[4] + "---------------------------------");
    return res;
}

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder().appName("spark test").getOrCreate();
    JavaSparkContext spark = new JavaSparkContext(sparkSession.sparkContext());

    JavaRDD<Integer> first = spark.parallelize(Arrays.asList(getInts()));
    System.out.println("first: " + Arrays.toString(first.collect().toArray())); // action

    first.unpersist();
    System.out.println("first is unpersisted");

    System.out.println("compute second ========================");
    JavaRDD<Integer> second = first.map(i -> {
        System.out.println("double " + i);
        return i * 2;
    }).cache(); // transform
    System.out.println("second: " + Arrays.toString(second.collect().toArray())); // action

    second.unpersist();

    System.out.println("compute third ========================");
    JavaRDD<Integer> third = second.map(i -> i + 100); // transform
    System.out.println("third: " + Arrays.toString(third.collect().toArray())); // action
}
The output is:
number generated:0 to 4---------------------------------
first: [0, 1, 2, 3, 4]
first is unpersisted
compute second ========================
double 0
double 1
double 2
double 3
double 4
second: [0, 2, 4, 6, 8]
compute third ========================
double 0
double 1
double 2
double 3
double 4
third: [100, 102, 104, 106, 108]
As we can see, calling unpersist() on 'first' seems to do nothing: 'first' is not recalculated afterwards.
But calling unpersist() on 'second' does trigger recalculation.
Can anyone help me figure out why unpersist() on 'first' does not trigger recalculation? If I want to force 'first' to be evicted from memory, how should I do it? Is there anything special about RDDs created by the parallelize() or textFile() APIs?
Thanks!

This behavior has nothing to do with caching and unpersisting. In fact first is not even persisted, although it wouldn't make much difference here.
When you parallelize, you pass a local, non-distributed object. parallelize takes its argument by value, and its life cycle is completely out of Spark's scope. As a result, Spark has no reason to recompute it at all once the ParallelCollectionRDD has been initialized. If you want to distribute a different collection, just create a new RDD.
It is also worth noting that unpersist can be called in both blocking and non-blocking mode, depending on the blocking argument.
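A minimal sketch building on the question's own code (spark is the JavaSparkContext and getInts() the helper above), illustrating both points: distributing different data means a new parallelize() call, and unpersist() accepts a blocking flag:
JavaRDD<Integer> fresh = spark.parallelize(Arrays.asList(getInts())); // new local data -> new RDD
JavaRDD<Integer> doubled = fresh.map(i -> i * 2).cache();
doubled.count();          // action that materializes the cached blocks
doubled.unpersist(true);  // blocking: returns only once the blocks have been removed
// doubled.unpersist(false) returns immediately and removes the blocks asynchronously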

Related

How to write >1 file from a partition

First, I wanted to split a partition by a predefined size, so I can update the file system. For example, if a partition's data size is 200MB (each row in the partition RDD can be a different size), I want to write 4 files for that partition, each file 50MB, while trying to avoid a shuffle. Is it possible to do that without a repartition or a coalesce, which would cause a shuffle? Since I don't have a fixed row size, I can't really use the maxRecordsPerFile Spark config.
The next option is to repartition the entire dataset, causing a shuffle. So to calculate the size I did the following, but it fails with: "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong? How can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++;
        }
    }
    return strLst.iterator();
});
...
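As a hedged sketch of one way around the compiler error (not the poster's code; blobToString is assumed to behave as in the snippet above): a local variable captured by a Spark lambda is serialized to the executors, so incrementing it there would not update the driver-side copy even if Java allowed it. Instead, the total size can be computed as an RDD aggregate and the partition count derived on the driver:
JavaRDD<String> xmlStrings = xmlDataSet.toJavaRDD().map(row -> blobToString(row));
long totalBytes = xmlStrings
        .map(xml -> (long) xml.getBytes().length)
        .reduce(Long::sum);                                                // aggregated on the cluster
int numOfPartitions = (int) Math.max(1, totalBytes / (50L * 1024 * 1024)); // roughly 50MB per file
JavaRDD<String> resized = xmlStrings.repartition(numOfPartitions);         // repartition does shuffle, as noted in the question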

Collecting a GPars loop to a Map

I need to iterate over a List, run a time-expensive operation for every item, and then collect the results into a map, something like this:
List<String> strings = ['foo', 'bar', 'baz']
Map<String, Object> result = strings.collectEntries { key ->
    [key, expensiveOperation(key)]
}
so that my result is something like
[foo: <an object>, bar: <another object>, baz: <another object>]
Since the operations I need to do are pretty long and don't depend on each other, I'd like to investigate using GPars to run the loop in parallel.
However, GPars has a collectParallel method that loops through a collection in parallel and collects into a List, but no collectEntriesParallel that collects into a Map: what's the correct way to do this with GPars?
There is no collectEntriesParallel because it would have to produce the same result as:
collectParallel {}.collectEntries {}
as Tim mentioned in the comment. It's hard to reduce a list of values into a map (or any other mutable container) deterministically other than by collecting the results into a list in parallel and, at the end, turning them into map entries sequentially. Consider the following sequential example:
static def expensiveOperation(String key) {
    Thread.sleep(1000)
    return key.reverse()
}

List<String> strings = ['foo', 'bar', 'baz']

GParsPool.withPool {
    def result = strings.inject([:]) { seed, key ->
        println "[${Thread.currentThread().name}] (${System.currentTimeMillis()}) seed = ${seed}, key = ${key}"
        seed + [(key): expensiveOperation(key.toString())]
    }
    println result
}
In this example we are using Collection.inject(initialValue, closure), which is the equivalent of the good old "fold left" operation - it starts with the initial value [:], iterates over all values, and adds them as keys and values to the initial map. Sequential execution in this case takes approximately 3 seconds (each expensiveOperation() sleeps for 1 second).
Console output:
[main] (1519925046610) seed = [:], key = foo
[main] (1519925047773) seed = [foo:oof], key = bar
[main] (1519925048774) seed = [foo:oof, bar:rab], key = baz
[foo:oof, bar:rab, baz:zab]
And this is basically what collectEntries() does - it's a kind of reduction operation whose initial value is an empty map.
Now let's see what happens if we try to parallelize it - instead of inject we will use the injectParallel method:
GParsPool.withPool {
    def result = strings.injectParallel([:]) { seed, key ->
        println "[${Thread.currentThread().name}] (${System.currentTimeMillis()}) seed = ${seed}, key = ${key}"
        seed + [(key): expensiveOperation(key.toString())]
    }
    println result
}
Let's see what the result is:
[ForkJoinPool-1-worker-1] (1519925323803) seed = foo, key = bar
[ForkJoinPool-1-worker-2] (1519925323811) seed = baz, key = [:]
[ForkJoinPool-1-worker-1] (1519925324822) seed = foo[bar:rab], key = baz[[:]:]:[]
foo[bar:rab][baz[[:]:]:[]:][:]:]:[[zab]
As you can see, the parallel version of inject does not care about order (which is expected): for example, the first thread received foo as its seed variable and bar as the key. This is what can happen if the reduction to a map (or any mutable object) is performed in parallel and without a specific order.
Solution
There are two ways to parallelize the process:
1. collectParallel + collectEntries combination
As Tim Yates mentioned in the comment, you can run the expensive operations in parallel and, at the end, collect the results into a map sequentially:
static def expensiveOperation(String key) {
    Thread.sleep(1000)
    return key.reverse()
}

List<String> strings = ['foo', 'bar', 'baz']

GParsPool.withPool {
    def result = strings.collectParallel { [it, expensiveOperation(it)] }.collectEntries { [(it[0]): it[1]] }
    println result
}
This example executes in approximately 1 second and produces the following output:
[foo:oof, bar:rab, baz:zab]
2. Java's parallel stream
Alternatively, you can use Java's parallel stream with the Collectors.toMap() collector:
static def expensiveOperation(String key) {
    Thread.sleep(1000)
    return key.reverse()
}

List<String> strings = ['foo', 'bar', 'baz']

def result = strings.parallelStream()
        .collect(Collectors.toMap(Function.identity(), { str -> expensiveOperation(str) }))
println result
This example also executes in approximately 1 second and produces output like this:
[bar:rab, foo:oof, baz:zab]
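For reference, a minimal plain-Java version of the same approach (the snippet above is Groovy); the only extra noise is that the checked InterruptedException has to be handled explicitly:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelCollectToMap {
    static String expensiveOperation(String key) {
        try {
            Thread.sleep(1000);                          // simulate the long-running work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new StringBuilder(key).reverse().toString();
    }

    public static void main(String[] args) {
        List<String> strings = Arrays.asList("foo", "bar", "baz");
        Map<String, String> result = strings.parallelStream()
                .collect(Collectors.toMap(Function.identity(), ParallelCollectToMap::expensiveOperation));
        System.out.println(result);                      // e.g. {bar=rab, foo=oof, baz=zab}
    }
}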
Hope it helps.

Spark GraphX program does not utilize CPU and memory

I have a function that takes the neighbors of a node (for the neighbors I use a broadcast variable) and the id of the node itself, and calculates the closeness centrality for that node. I map each node of the graph to the result of that function. When I open the task manager, the CPU is not utilized at all, as if the job were not running in parallel; the same goes for memory. Yet the function runs for every node in parallel, the data is large, and it takes time to complete, so it's not that it doesn't need the resources. Any help is truly appreciated, thank you.
For loading the graph I use val graph = GraphLoader.edgeListFile(sc, path).cache
object ClosenessCentrality {

  case class Vertex(id: VertexId)

  def run(graph: Graph[Int, Float], sc: SparkContext): Unit = {
    // Have to reverse edges and make the graph undirected because it is bipartite
    val neighbors = CollectNeighbors.collectWeightedNeighbors(graph).collectAsMap()
    val bNeighbors = sc.broadcast(neighbors)

    val result = graph.vertices.map(f => shortestPaths(f._1, bNeighbors.value))

    //result.coalesce(1)
    result.count()
  }

  def shortestPaths(source: VertexId, neighbors: Map[VertexId, Map[VertexId, Float]]): Double = {
    val predecessors = new mutable.HashMap[VertexId, ListBuffer[VertexId]]()
    val distances = new mutable.HashMap[VertexId, Double]()
    val q = new FibonacciHeap[Vertex]
    val nodes = new mutable.HashMap[VertexId, FibonacciHeap.Node[Vertex]]()

    distances.put(source, 0)
    for (w <- neighbors) {
      if (w._1 != source)
        distances.put(w._1, Int.MaxValue)
      predecessors.put(w._1, ListBuffer[VertexId]())
      val node = q.insert(Vertex(w._1), distances(w._1))
      nodes.put(w._1, node)
    }

    while (!q.isEmpty) {
      val u = q.minNode
      val node = u.data.id
      q.removeMin()
      // discover paths
      //println("Current node is:" + node + " " + neighbors(node).size)
      for (w <- neighbors(node).keys) {
        //print("Neighbor is" + w)
        val alt = distances(node) + neighbors(node)(w)
        // if (distances(w) > alt) {
        //   distances(w) = alt
        //   q.decreaseKey(nodes(w), alt)
        // }
        // if (distances(w) == alt)
        //   predecessors(w).+=(node)
        if (alt < distances(w)) {
          distances(w) = alt
          predecessors(w).+=(node)
          q.decreaseKey(nodes(w), alt)
        }
      } // for
    } // while

    val sum = distances.values.sum
    sum
  }
}
To provide somewhat of an answer to your original question, I suspect that your RDD only has a single partition, and is thus processed using a single core.
The edgeListFile method has an argument to specify the minimum number of partitions you want.
Also, you can use repartition to get more partitions.
You mentioned coalesce, but by default that only reduces the number of partitions; see this question: Spark Coalesce More Partitions
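A minimal sketch of those mechanics with the plain RDD API (Java shown here; in the Scala GraphX API, GraphLoader.edgeListFile takes a numEdgePartitions argument for the same purpose). The path argument is a placeholder for your edge-list file:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCheck {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("partition check"));
        String path = args[0];                              // the edge list file
        JavaRDD<String> edgeLines = jsc.textFile(path, 8);  // ask for at least 8 partitions at load time
        System.out.println("partitions after load: " + edgeLines.getNumPartitions());
        JavaRDD<String> wider = edgeLines.repartition(16);  // full shuffle into 16 partitions
        System.out.println("partitions after repartition: " + wider.getNumPartitions());
        jsc.stop();
    }
}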

Spark - operate a non serialized method on every element in a JavaPairRDD

Is it possible to iterate over the elements of a JavaPairRDD but perform some calculation f() on one machine only?
Each element (x,y) in the JavaPairRDD is a record (features and label), and I would like to call some calculation f(x,y) which can't be serialized.
f() is a self-implemented algorithm that updates some weights $w_i$.
The weights $w_i$ should be the same object for the data from all partitions (rather than being broken up into tasks, each of which is operated on by an executor).
Update:
What I'm trying to do is run the "algo" f() 100 times in such a way that every iteration of algo is single-threaded, but the whole 100 iterations run in parallel on different nodes. At a high level, the code is something like this:
JavaPairRDD<String, String> dataPairs = ... // Load the data

boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;

Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();

for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData = dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
I would like algo.fit() to update the same weights $w_i$ instead of fitting separate weights for every partition.

Performance difference in toString.map and toString.toArray.map

While coding Euler problems, I ran across something I think is bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String]) {
  def toDigit(num: Int) = num.toString.map(_ - 48)              // 2137 ms
  def toDigitFast(num: Int) = num.toString.toArray.map(_ - 48)  // 592 ms

  val startTime = System.currentTimeMillis
  (1 to 1200000).map(toDigit)
  println(System.currentTimeMillis - startTime)
}
Shouldn't the map method on String fall back to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes a stack overflow in the non-array case.)
Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this)
  for (x <- this) b += f(x)
  b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }

def append(x: Any): StringBuilder = {
  underlying append String.valueOf(x)
  this
}
The String.valueOf call for Any uses Java's Object.toString on the Char instances, which possibly get boxed first. These extra ops might be the cause of the speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess though, would have to measure.
Edit
After revisiting this, the general point still stands, but I referred to the wrong implicits, since the toDigit methods return an Int sequence (or the like), not a transformed string as I misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly puts the two methods on par:
object FastStringToArrayBuild {
  def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
    private def newBuilder = scala.collection.mutable.ArrayBuilder.make[T]()
    def apply(from: String) = newBuilder
    def apply() = newBuilder
  }
}
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
  if (num == 0) {
    val ans = new Array[Byte](math.max(1, iter))
    var i = 0
    var ac = accum
    while (i < ans.length) {
      ans(ans.length - i - 1) = (ac & 0xF).toByte
      ac >>= 4
      i += 1
    }
    ans
  }
  else {
    val next = num / 10
    toDigitReallyFast(next, (accum << 4) | (num - 10 * next), iter + 1)
  }
}
which on my machine is about 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results into an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
  if (num == 0) accum | (iter.toLong << 48)
  else {
    val next = num / 10
    toDigitExtremelyFast(next, accum | ((num - 10 * next).toLong << (4 * iter)), iter + 1)
  }
}

// loop, instead of 1 to N map, for the 1.2k number case
{
  var i = 10000
  val a = new Array[Long](1201)
  while (i <= 11200) {
    a(i - 10000) = toDigitExtremelyFast(i)
    i += 1
  }
  a
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.
