I am trying to use connected components but having issue with scaling. My Here is what I have -
// get vertices
val vertices = stage_2.flatMap(x => GraphUtil.getVertices(x)).cache
// get edges
val edges = stage_2.map(x => GraphUtil.getEdges(x)).filter(_ != null).flatMap(x => x).cache
// create graph
val identityGraph = Graph(vertices, edges)
// get connected components
val cc = identityGraph.connectedComponents.vertices
Where, GraphUtil has helper functions to return vertices and edges. At this point, my graph has ~1 million nodes and ~2 million edges (btw, this is expected to grow to ~100 million nodes). My graph is pretty sparsely connected - so I expect plenty of small graphs.
When I run the above, I keep getting java.lang.OutOfMemoryError: Java heap space. I have tried with executor-memory 32g and running a cluster of 15 nodes with 45g as yarn container size.
Here is the exception detail:
16/10/26 10:32:26 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:360)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:98)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2216)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:146)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:173)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
In addition, I am getting plenty of the following logs:
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 320 is 263 bytes
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 321 is 268 bytes
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 322 is 264 bytes
My question is has anyone tried ConnectedComponents at this scale? If yes, what am I doing wrong?
As I posted above in the comments, I implemented connected component using map/reduce on Spark. You can find more details here - https://www.linkedin.com/pulse/connected-component-using-map-reduce-apache-spark-shirish-kumar and the source code under MIT license here - https://github.com/kwartile/connected-component.
The connected component algorithm does not scale very well, and its performance depends quite a lot on the topology of you graph. The sparsity of you edges doesn't mean you have small components. A long string of edges is very sparse (number of edges = number of vertices - 1), but the brute force algo implemented in GraphX wouldn't be very efficient (see source of cc and pregel).
Here is what you can try (sorted, code only):
Checkpoint your vertices and edges in parquet (on disk), then load them again to build you graph. Caching sometimes just doesn't cut it when you execution plan grows too big.
Transform your graph in a way that will leave the result of the algorithm unchanged. For instance, you can see in the code that the algo is propagating the information in both directions (as it should, by default). So if you have several edges connecting the same two vertices, filter them out from your graph on which you apply the algo.
Optimize GraphX code yourself (it is really quite straightforward), using either generic optimisation saving memory (i.e. checkpointing on disk at each iteration to avoid OOM), or domain specific optimisation (similar to point 2)
If you are ok to leave GraphX (which is becoming somewhat legacy) behind, you can consider GraphFrames (package, blog
). I never tried, so I don't know if it has CC.
I'm certain you can find other possibilities among spark packages, but maybe you will even want to use something outside of Spark. But this is out of scope of the question.
Best of luck!
Related
I am doing some benchmark in a cluster using Spark. Among the various things I want to get a good approximation of the average size reduction achieved by serialization and compression. I am running in client deploy-mode and with the local master, and tired with both shells of versions 1.6 and 2.2 of spark.
I want to do that calculating the in-memory size and then the size on disk, so the fraction should be my answer. I have obviously no problems getting the on-disk size, but I am really struggling with the in-memory one.
Since my RDD is made of doubles and they occupy 8 bytes each in memory I tried counting the number of elements in the RDD and multiplying by 8, but that leaves out a lot of things.
The second approach was using "SizeEstimator" (https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$
), but this is giving me crazy results! In Spark 1.6 it is either 30, 130 or 230 randomly (47 MB on disk), in Spark 2.2 it starts at 30 and everytime I execute it it increases by 0 or by 1. I know it says it's not super accurate but I can't even find a bit of consistency! I even tried setting persisting level in memory only
rdd.persist(StorageLevel.MEMORY_ONLY)
but still, nothing changed.
Is there any other way I can get the in-memory size of the RDD? Or should I try another approach? I am writing to disk with rdd.SaveAsTextFile, and generating the rdd via RandomRDDs.uniformRDD.
EDIT
sample code:
write
val rdd = RandomRDDs.uniformRDD(sc, nBlocks, nThreads)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
println("RDD count: " + rdd.count)
rdd.saveAsObjectFile("file:///path/to/folder")
read
val rdd = sc.wholeTextFiles(name,nThreads)
rdd.count() //action so I'm sure the file is actually read
webUI
Try caching the rdd as you mentioned and check in the storage tab of the spark UI.
By default rdd is deserialised and stored in memory. If you want to serialise it then specifically use persist with option MEMORY_ONLY_SER.The memory consumption will be less. In disk data always will be stored in serialised fashion
Check once the spark UI
I'm working with a largish (?) graph (60 million vertices and 9.5 billion edges) using Spark Graphframes. The underlying data is not large - the vertices take about 500mb on disk and the edges are about 40gb. My containers are frequently shutting down due to java heap out of memory problems, but I think the underlying problem is that the graphframe is constantly shuffling data around (I'm seeing shuffle read/write of up to 150gb). Is there a way to efficiently partition a Graphframe or the underlying edges/vertices to reduce shuffle?
TL;DR It is not possible to efficiently partition Graphframe.
Graphframe algorithms can be separated into two categories:
Methods which delegate processing to GraphX counterpart. GraphX supports a number of partitioning methods but these are not exposed via Graphframe API. If you use one of these it is probably better to use GraphX directly.
Unfortunately development of GraphX stopped almost completely with only a handful of small fixes over the last two years and overall performance is highly disappointing compared to both in-core libraries and out-of-core libraries.
Methods which are implemented natively using Spark Datasets, which considering limited programming model and only a single partitioning mode, are deeply unfit for complex graph processing.
While relational columnar storage can be used for efficient graph processing naive iterative join approach employed by Graphframes just don't scale (but it is OK for shallow traversing with one or two hops).'
You can try to repartition vertices and edges DataFrames by id and src respectively:
val nPart: Int = ???
GraphFrame(v.repartition(nPart, v("id")), e.repartition(e(nPart, "src")))
what should help in some cases.
Overall, at it's current (Dec, 2016) state, Spark is not a good choice for intensive graph analytics.
Here's the partial solution / workaround - create a UDF that mimics one of the partition functions to create a new column and partition on that.
num_parts = 256
random_vertex_cut = udf.register("random_vertex_cut", lambda src, dst: math.abs((src, dst).hashCode()) % num_parts, IntegerType())
edge.withColumn("v_cut", random_vertex_cut(col("src"), col("dst")).repartition(256, "v_cut")
This approach can help some, but not as well as GraphX.
I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.
Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809; which I cherry picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but more worrisome, incoherent (topics/predictions). If a document's predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
Not being able to tell if something's wrong in my prediction code - or if it's because I was using the Distributed/EM-based model (as is hinted at here by jasonl here) - I decided to try the newer Local/Online model. I spent a couple of days tuning my 240 core/768GB RAM 3-node cluster to no avail; seemingly no matter what I try, I run out of memory attempting to build a model this way.
I tried various settings for:
driver-memory (8G)
executor-memory (1-225G)
spark.driver.maxResultSize (including disabling it)
spark.memory.offheap.enabled (true/false)
spark.broadcast.blockSize (currently at 8m)
spark.rdd.compress (currently true)
changing the serializer (currently Kryo) and its max buffer (512m)
increasing various timeouts to allow for longer computation
(executor.heartbeatInterval, rpc.ask/lookupTimeout,
spark.network.timeout) spark.akka.frameSize (1024)
At different settings, it seems to oscillate between a JVM core dump due to off-heap allocation errors (Native memory allocation (mmap) failed to map X bytes for committing reserved memory) and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.
Questions:
Does my prediction routine look OK? Is this an off-by-one error
somewhere w.r.t the irrelevant predicted topics?
Do I stand a chance of building a model with Spark on the order of magnitude described above? Yahoo can do it with modest RAM requirements.
Any pointers as to what I can try next would be much appreciated!
Problem with Gradient Boosted Trees (GBT):
I am running on AWS EC2 with version spark-1.4.1-bin-hadoop2.6
What happens if I run GBT for 40 iterations, the input as
seen in spark UI becomes larger and larger for certain stages
(and the runtime increases correspondingly)
MapPartition in DecisionTree.scala L613
Collect in DecisionTree.scala L977
count DecistionTreeMetadata.scala L 111.
I start with 4GB input and eventually this goes up to over 100GB
input increasing by a constant amount. The completion of the related tasks
becomes slower and slower.
The question is whether this is a correct procedure or whether this is a bug in the MLLib.
My feeling is that somehow more and more data is bound to the relevant data rdd.
Does anyone know how to fix it?
I think a problematic line might be L 225 in GradientBoostedTrees.scala, where a new data rdd is defined.
I am referring to
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree