Pregel API - why do iterations on a small graph consume so much memory? - apache-spark

I'm relatively new to Spark and Scala, but I've decided to post an example of code that is quite simple and, in my view, shouldn't cause serious problems. In practice, however, it quite often causes an OutOfMemory error in an AWS EMR Spark environment, depending on the value of maxIterations:
import java.net.URI
import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import scala.util.Try
val config = new SparkConf().setAppName("test graphx")
config.set("spark.driver.allowMultipleContexts", "true")

val batch_id = new Integer(31)
val maxIterations = 2 // 200 iterations are causing out of memory

// Vertex attribute: (label, batchId, x, y, z, weight-like scalar)
var myVertices = sc.makeRDD(Array(
  (1L, ("A", batch_id, 0.0, 0.0, 0.0, 11.0)),
  (2L, ("B", batch_id, 0.0, 1000.0, 0.0, 300.0)),
  (3L, ("C", batch_id, 1000.0, 1000.0, 0.0, 8.0)),
  (4L, ("D", batch_id, 1000.0, 0.0, 0.0, 400.0))
))
var myEdges = sc.makeRDD(Array(
  Edge(4L, 3L, (7.7, 0.0)),
  Edge(2L, 3L, (5.0, 0.0)),
  Edge(2L, 1L, (12.0, 0.0))
))
var myGraph = Graph(myVertices, myEdges)
myGraph.cache()
myGraph.triplets.foreach(println)
// We need to calculate some constant values for each edge before the start of Pregel:
// new edge attr = (weight, weight * distance^2 / dstWeight^2)
val initGraph = myGraph.mapTriplets { tr =>
  val dx = tr.dstAttr._3 - tr.srcAttr._3
  val dy = tr.dstAttr._4 - tr.srcAttr._4
  val dz = tr.dstAttr._5 - tr.srcAttr._5
  val dist = scala.math.sqrt(dx * dx + dy * dy + dz * dz)
  (tr.attr._1, tr.attr._1 * dist * dist / (tr.dstAttr._6 * tr.dstAttr._6))
}
initGraph.triplets.take(100).foreach(println)
val distanceStep = 0.1
val tolerance = 1
val sssp = initGraph.pregel((0.0, 0.0, 0.0, 0.0), maxIterations /* 500-3000 */)(
  // Vertex Program: shift the vertex along each axis when the message
  // component exceeds the tolerance
  (id: VertexId, vert: (String, Integer, Double, Double, Double, Double), msg: (Double, Double, Double, Double)) =>
    (
      vert._1, vert._2,
      if (scala.math.abs(msg._1) > tolerance) vert._3 + distanceStep * msg._1 else vert._3,
      if (scala.math.abs(msg._2) > tolerance) vert._4 + distanceStep * msg._2 else vert._4,
      if (scala.math.abs(msg._3) > tolerance) vert._5 + distanceStep * msg._3 else vert._5,
      vert._6
    ),
  // Send Message: one message per edge, toward the destination vertex
  e => {
    val dx = e.srcAttr._3 - e.dstAttr._3
    val dy = e.srcAttr._4 - e.dstAttr._4
    val dz = e.srcAttr._5 - e.dstAttr._5
    val distSq = dx * dx + dy * dy + dz * dz
    val scale = distanceStep * scala.math.sqrt(2 * e.attr._2 * e.srcAttr._6 / distSq)
    Iterator((
      e.dstId,
      (
        dx * scale, // x
        dy * scale, // y
        dz * scale, // z
        e.attr._1 * distanceStep * scala.math.sqrt(distSq) // vector module
      )
    ))
  },
  // Merge Message: component-wise sum
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3, 0.0)
)
sssp.vertices.take(10).foreach(println)
I run it in AWS EMR on a 4-node m5.2xlarge cluster via Zeppelin; however, it can easily be adapted and executed as a Spark job.
In short, this code creates a graph myGraph with 4 vertices and 3 edges. Then, for each triplet, I calculate some constant values and store them in the graph object initGraph.
Then I apply the Pregel API to initGraph; its execution is limited only by the number of iterations, maxIterations. And at this point I see strange behavior from the Pregel API. For small maxIterations values (less than 10) it works quite fast; for 100-150 iterations it runs for 3-4 minutes in Zeppelin; and for 200 iterations it fails with various errors (ConnectionClosed, etc.).
I tried to monitor what's going on with the cluster once I set maxIterations = 150 or 200, and it looks like this: allocated memory goes straight up, and available memory decreases at the same pace.
As I'm quite new to Spark, I'm not sure this is correct behavior, and quite honestly I can't find an explanation of what could consume gigabytes of memory even with 200 iterations of Pregel on such a small graph. If you can reproduce and check it on your end, I'm curious to hear your advice on performance optimization, because if I expand the cluster and run the same code on a larger hardware setup, it is simply a question of maxIterations and graph size before I hit the same OutOfMemory error. And the real graph I need to run this on has more than 1M vertices and ~7M edges, so I can't figure out what kind of hardware it would require if this problem isn't solved.
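One commonly suggested mitigation for long Pregel runs (an assumption here, not something verified on this workload) is periodic checkpointing, since GraphX extends the RDD lineage on every iteration. A minimal sketch, assuming Spark 2.2+, where the spark.graphx.pregel.checkpointInterval setting is available:
// Sketch only: the interval and directory are placeholder assumptions.
// The interval must be set on the SparkConf before the SparkContext is created.
config.set("spark.graphx.pregel.checkpointInterval", "25") // e.g. checkpoint every 25 iterations
sc.setCheckpointDir("hdfs:///tmp/graphx-checkpoints") // hypothetical HDFS path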

Related

Is .show() a Spark action? [duplicate]

I have the following code:
import org.apache.spark.sql.functions.{col, size, udf}

val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs
// some operations in here to create df as df_in with two more columns "terms1" and "terms2"

val intersectUDF = udf((seq1: Seq[String], seq2: Seq[String]) => seq1 intersect seq2) // intersection of two sequences
val symmDiffUDF = udf((seq1: Seq[String], seq2: Seq[String]) => (seq1 diff seq2) ++ (seq2 diff seq1)) // symmetric difference of two sequences

// add the intersection and difference columns and filter the resulting DF
val df1 = df.withColumn("termsInt", intersectUDF(df("terms1"), df("terms2")))
  .withColumn("termsDiff", symmDiffUDF(df("terms1"), df("terms2")))
  .where(size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2)
  .cache()

df1.show()
df1.count()
The app works properly and fast up to the show(), but the count() step creates 40,000 tasks.
My understanding is that df1.show() should trigger the full creation of df1, so df1.count() should be very fast afterwards. What am I missing here? Why is count() that slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is partitioned into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>) and running count again.
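A hedged sketch of that suggestion (200 is a placeholder, not a recommendation):
// Fewer partitions means fewer tasks for the full scan; tune to cluster and DF size.
val df1Repartitioned = df1.repartition(200)
df1Repartitioned.count()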
show() by default displays only 20 rows, so if the first partition(s) already return 20 rows, the remaining partitions are not executed.
Note that show has several overloads. show(false) still shows 20 rows but with truncation of long values disabled; show(n) displays n rows, and the larger n is, the more partitions may have to be evaluated, which may take more time. So show() equals show(20), and it behaves as a partial action.
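For illustration, a minimal sketch of the difference (assuming a SparkSession named spark, which is not part of the question):
import org.apache.spark.sql.functions.col

val big = spark.range(0L, 500000000L).toDF("id").withColumn("x", col("id") % 7)
big.show() // partial: stops once 20 rows are available
big.show(50, false) // still partial: 50 rows, values untruncated
big.count() // full: every partition is scanned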

performance of UDF in apache spark

I am trying to do high-performance calculations that require custom functions.
As a first stage, I am trying to profile the effect of using a UDF, and I am getting weird results.
I created a simple test (in https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6100413009427980/1838590056414286/3882373005101896/latest.html)
Basically, I create a dataframe using the range option with 50M records and cache it.
I then apply a filter to find the values smaller than 10 and count them - once using column < 10 and once using a UDF.
I ran each action 10 times to get a good time estimate.
What I found was that both methods took around the same time: ~4 seconds.
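For reference, a minimal sketch of the two variants being compared (assuming a SparkSession named spark; this is not the notebook's exact code):
import org.apache.spark.sql.functions.{col, udf}

val df = spark.range(50000000L).toDF("id").cache()
df.count() // materialize the cache

// Variant 1: native column expression, eligible for codegen
df.filter(col("id") < 10).count()

// Variant 2: Scala UDF, opaque to the optimizer
val lt10 = udf((x: Long) => x < 10)
df.filter(lt10(col("id"))).count()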
I also tried it on an on-premise cluster I have (8 nodes, using YARN, each node with ~40GB memory and plenty of cores). There I got a result of 1 second for the first option and 8 seconds for the second.
First, I do not understand how I got the same performance on the Databricks cluster. Shouldn't the UDF be much slower? After all, there is no codegen, so I should be seeing a much slower process.
Second, I don't understand the huge difference between the two clusters: on one I get almost the same time, and on the other an 8x difference.
Lastly, I was trying to figure out how to write a custom function natively (i.e. the way Spark does it). I tried to look at the Spark code and came up with something like this:
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types._

// A Catalyst predicate expression that evaluates `child < 10` for integer input
case class genf(child: Expression) extends UnaryExpression with Predicate with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(IntegerType)

  override def toString: String = s"$child < 10"

  // Interpreted evaluation path
  override def eval(input: InternalRow): Any = {
    val value = child.eval(input)
    if (value == null) {
      false
    } else {
      child.dataType match {
        case IntegerType => value.asInstanceOf[Int] < 10
      }
    }
  }

  // Codegen path
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"($c) < 10")
}
This, however, doesn't work, as it would only compile from within the sql package (for example, AbstractDataType is private to it).
Is this code even in the right direction? How would I make it work?
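If the goal is just a natively optimized predicate rather than a reusable expression class, a hedged workaround using only public API is to let Spark parse the predicate itself:
import org.apache.spark.sql.functions.expr

// Spark parses "id < 10" into a native, codegen-capable Catalyst expression.
df.filter(expr("id < 10")).count()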
Thanks.

spark GMM fail to divide points to correct clusters

I have a data set which looks like this (user names have been obfuscated; also, there are 40 users - I didn't want to show them all):
(a1,List(1.0, 1015.0))
(a2,List(2.0, 2015.0))
(a3,List(3.0, 3015.0))
(a4,List(1.0, 1015.0))
(a5,List(0.0, 15.0))
(a6,List(0.0, 15.0))
Basically, I want to create a sample app with 4 really obvious clusters (10 users in each) and show that users with the same characteristics fall into the same cluster.
The code creates vectors from the data, trains a model, and predicts according to the learned model:
// Strip the List(...) wrapper and split each line into fields
val cuttedData = data1.map(s => s.replace("List", "").replace("(", "").replace(")", "").trim.split(','))
// Drop the user name, keep the numeric features as a dense vector
val parsedData1 = cuttedData.map(item => Vectors.dense(item.drop(1).map(_.toDouble))).cache()
parsedData1.foreach(println)

val gmm1 = new GaussianMixture().setK(4).run(parsedData1)
cuttedData.foreach { item =>
  println(item(0) + " : " + gmm1.predict(Vectors.dense(item.drop(1).map(_.toDouble))))
}
The issue is that the users are predicted into only 3 clusters, so the feature -> cluster relations are:
(1.0, 1015.0) -> 1
(0.0, 15.0) -> 0
(2.0, 2015.0), (3.0, 3015.0) -> 2
The clusters, as printed from the model, are:
weight=0.150000
mu=[2.4445803037860923E-11,15.000000024445805]
sigma=
5.7824573923325105E-11 5.782457392332511E-8
5.782457392332511E-8 5.782457377877571E-5
weight=0.150000
mu=[2.4445803037860923E-11,15.000000024445805]
sigma=
5.7824573923325105E-11 5.782457392332511E-8
5.782457392332511E-8 5.782457377877571E-5
weight=0.371495
mu=[2.1293846059560595,2144.3846059560597]
sigma=
0.4824411855092578 482.4411855092562
482.4411855092562 482441.185509255
weight=0.328505
mu=[1.8536826474572847,1868.6826474572852]
sigma=
0.4795182413266584 479.51824132665774
479.51824132665774 479518.24132665637
I don't understand why the users are classified incorrectly. I tried increasing the K parameter to 100 and using a Normalizer on the data, but it didn't help.
Another thing to note is that when I use KMeans on the same data, it works perfectly.
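One hedged thing to try (an assumption, not verified on this data): since the second feature's scale dwarfs the first, standardizing the features before fitting may separate the Gaussians better. A minimal sketch with MLlib's StandardScaler:
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.feature.StandardScaler

// Rescale each feature to zero mean and unit variance before GMM;
// parsedData1 is the RDD[Vector] from the question.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(parsedData1)
val scaled = scaler.transform(parsedData1).cache()
val gmmScaled = new GaussianMixture().setK(4).run(scaled)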

Spark RDD.isEmpty costs much time

I built a Spark cluster.
Workers: 2
Cores: 12
Memory: 32.0 GB total, 20.0 GB used
Each worker gets 1 CPU, 6 cores, and 10.0 GB memory.
My program gets its data from a MongoDB cluster. Spark and the MongoDB cluster are on the same LAN (1000 Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There are about 13 million documents.
I want to get the average value for a specific name over a specific hour, which contains 60 documents.
Here is my key function:
/*
 * rdd = sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
 * Apache-Spark-1.3.1 scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
 */
def findValueByNameAndRange(rdd: RDD[(Object, BSONObject)], name: String, time: Date): RDD[BasicBSONObject] = {
  // Keep only documents with the requested name
  val nameRdd = rdd.map(_._2).filter(_.get("name").equals(name))
  // Pair each document with its timestamp
  val timeRangeRdd1 = nameRdd.map(tuple => (tuple, tuple.get("time").asInstanceOf[Date]))
  // Flag documents inside the one-hour window ...
  val timeRangeRdd2 = timeRangeRdd1.map(tuple => (tuple._1, duringTime(tuple._2, time, getHourAgo(time, 1))))
  // ... and keep only those
  val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
  // Sum the values per name
  val timeRangeRdd4 = timeRangeRdd3
    .map(x => (x.get("name").toString, x.get("value").toString.toDouble))
    .reduceByKey(_ + _)
  if (timeRangeRdd4.isEmpty()) {
    basicBSONRDD(name, time)
  } else {
    timeRangeRdd4.map { tuple =>
      val bson = new BasicBSONObject()
      bson.put("name", tuple._1)
      bson.put("value", tuple._2 / 60)
      bson.put("time", time)
      bson
    }
  }
}
Here is part of the job information.
My program runs very slowly. Is that because of isEmpty and reduceByKey? If yes, how can I improve it? If not, why?
======= update =======
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on line 34.
I know reduceByKey is a global operation and may cost much time; however, what it costs is beyond my budget. How can I improve it, or is it a defect of Spark? With the same calculation and hardware, it takes just several seconds if I use multiple Java threads.
First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method shown in the UI is always the method that triggers a stage change/shuffle... in this case isEmpty. Why it's running slowly is not easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size and then does a take(1) and verifies whether data was returned or not. So the odds are that there is a bottleneck in the network or something else blocking along the way. It could even be a GC issue... Click into isEmpty and see what more you can discern from there.
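A hedged sketch of one possible mitigation (an assumption, not something verified here): isEmpty triggers a job via take(1), and the returned RDD's lineage is recomputed again when the result is finally used, so caching the aggregated RDD avoids running the MongoDB-backed pipeline twice:
val timeRangeRdd4 = timeRangeRdd3
  .map(x => (x.get("name").toString, x.get("value").toString.toDouble))
  .reduceByKey(_ + _)
  .cache() // evaluated once by isEmpty's take(1), reused by the final map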
