ReliableCheckpointRDD - Error writing partitioner org.apache.spark.HashPartitioner - apache-spark

I am working on a large graph with 200m vertices and 350m edges. Because the graph is large, I call the checkpoint method at regular intervals on the interim graphs created by the long-running GraphX algorithms. Most of the time, checkpointing works fine. However, I have observed the following warnings in the driver logs a few times.
22/08/04 03:25:48 [WARN] o.a.s.r.ReliableCheckpointRDD - Error writing partitioner org.apache.spark.HashPartitioner#e10 to file:/mnts/nfsvol/AML/tmp/checkpoint/a153fc7b-1be7-460d-a41b-f17c820c3bf1/rdd-496
22/08/04 03:29:35 [WARN] o.a.s.r.ReliableCheckpointRDD - Error writing partitioner org.apache.spark.HashPartitioner#e10 to file:/mnts/nfsvol/AML/tmp/checkpoint/a153fc7b-1be7-460d-a41b-f17c820c3bf1/rdd-482
What causes these warnings, and how can I fix them?
As I am new to Spark and GraphX, I would highly appreciate a detailed explanation. Thanks!
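For context, the checkpointing pattern described above looks roughly like the following sketch (the directory is taken from the warning paths above; the helper name and how often it is called are placeholders, not the actual job):

import org.apache.spark.graphx.Graph

// sc is the active SparkContext; the checkpoint directory must be set once
// before any checkpoint call (placeholder path based on the warnings above).
sc.setCheckpointDir("file:/mnts/nfsvol/AML/tmp/checkpoint")

// Hypothetical helper: mark the interim graph for checkpointing, then force
// materialization so the vertex and edge checkpoint files are actually written.
def checkpointGraph[VD, ED](graph: Graph[VD, ED]): Graph[VD, ED] = {
  graph.checkpoint()
  graph.vertices.count()
  graph.edges.count()
  graph
}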

Related

Spark Mllib DecisionTree skewed task runtime

I'm using Apache Spark MLlib to learn a regression tree for a fairly large dataset, and I have found what I think is abnormal behaviour in one of the algorithm's stages.
As you can see in the next picture, the time distribution is clearly skewed, and it is caused by just 3 tasks.
It happens during the following algorithm step, which always runs as 200 tasks no matter what.
And I can't get rid of it; I do not know if it is intended behaviour or if I am doing something wrong. I have tried:
Repartitioning the dataframe just before calling fit (sketched below)
Changing the number of features/tree depth, to no avail
I am using Spark 3.1.1 on AWS EMR.
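For reference, the repartition-before-fit attempt looks roughly like this sketch (trainingDF, the column names, and the partition count are placeholders, not the actual job):

import org.apache.spark.ml.regression.DecisionTreeRegressor

// trainingDF is a placeholder DataFrame with a "features" vector column and a "label" column.
val repartitioned = trainingDF.repartition(400) // spread the rows before fitting

val dt = new DecisionTreeRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setMaxDepth(10)

// The skewed 200-task stage still shows up inside fit() despite the repartition.
val model = dt.fit(repartitioned)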

Identifying why data is skewed in Spark

I am investigating a Spark SQL job (Spark 1.6.0) that is performing poorly due to badly skewed data across the 200 partitions; most of the data is in one partition:
What I'm wondering is: is there anything in the Spark UI to help me find out more about how the data is partitioned? From looking at this, I don't know which columns the dataframe is partitioned on. How can I find that out (other than looking at the code - I'm wondering if there's anything in the logs and/or UI that could help me)?
Additional details, this is using Spark's dataframe API, Spark version 1.6. Underlying data is stored in parquet format.
The Spark UI and logs will not be terribly helpful for this. Spark uses a simple hash partitioning algorithm as the default for almost everything. As you can see here, this basically recycles the Java hashCode method.
I would suggest the following:
Try to debug by sampling and printing the contents of the RDD or dataframe. See if there are obvious issues with the data distribution (i.e. low variance or low cardinality) of the key.
If that's ineffective, you can work back from the logs and UI to figure out how many partitions there are. You can find the hashCode of the data using Spark and then take the modulus to see what the collision is (see the sketch below).
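A minimal sketch of that diagnostic (df, the key column name, and the partition count are placeholders; the double modulus mirrors how the default hash partitioner keeps the result non-negative):

// Count how many rows land in each partition under default hash partitioning.
val numPartitions = 200 // placeholder: the job's shuffle partition count
val partitionCounts = df.rdd
  .map(row => row.getAs[String]("joinKey")) // "joinKey" is a placeholder key column
  .map(key => (((key.hashCode % numPartitions) + numPartitions) % numPartitions, 1L))
  .reduceByKey(_ + _)
  .collect()
  .sortBy(-_._2)

partitionCounts.take(10).foreach { case (partition, count) =>
  println(s"partition $partition -> $count rows")
}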
Once you find the source of the collision you can try a few techniques to remove it:
See if there's a better key you can use
See if you can improve the hashCode function of the key (the default one in Java isn't that great)
See if you can process the data in two steps by doing an initial scatter/gather step to force some parallelism and reduce the processing overhead for that one partition. This is probably the trickiest of the optimizations mentioned here to get right. Basically, partition the data once using a random number generator to force some initial parallel combining of the data, then push it through again with the natural partitioner to get the final result. This requires that the operation you're applying be commutative and associative. This technique hits the network twice and is therefore very expensive unless the data really is that highly skewed.
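A rough sketch of that two-step approach for a skewed sum-by-key (pairs is a placeholder RDD[(String, Long)]; the salt factor is arbitrary, and the combining operation here, addition, is commutative and associative):

import scala.util.Random

// Step 1: scatter -- append a random salt to each key so a hot key is split
// across several partitions, and combine partially under the salted key.
val saltFactor = 16
val partiallyCombined = pairs
  .map { case (key, value) => ((key, Random.nextInt(saltFactor)), value) }
  .reduceByKey(_ + _)

// Step 2: gather -- drop the salt and combine the partial results under the
// natural key to produce the final totals.
val totals = partiallyCombined
  .map { case ((key, _), partial) => (key, partial) }
  .reduceByKey(_ + _)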

Spark - GraphX - scaling connected components

I am trying to use connected components but am having issues with scaling. Here is what I have:
// get vertices
val vertices = stage_2.flatMap(x => GraphUtil.getVertices(x)).cache
// get edges
val edges = stage_2.map(x => GraphUtil.getEdges(x)).filter(_ != null).flatMap(x => x).cache
// create graph
val identityGraph = Graph(vertices, edges)
// get connected components
val cc = identityGraph.connectedComponents.vertices
Here, GraphUtil has helper functions to return vertices and edges. At this point, my graph has ~1 million nodes and ~2 million edges (btw, this is expected to grow to ~100 million nodes). My graph is pretty sparsely connected, so I expect plenty of small graphs.
When I run the above, I keep getting java.lang.OutOfMemoryError: Java heap space. I have tried with executor-memory 32g and running a cluster of 15 nodes with 45g as yarn container size.
Here is the exception detail:
16/10/26 10:32:26 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:360)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:98)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2216)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:146)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:173)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
In addition, I am getting plenty of the following logs:
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 320 is 263 bytes
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 321 is 268 bytes
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 322 is 264 bytes
My question is: has anyone tried ConnectedComponents at this scale? If yes, what am I doing wrong?
As I posted above in the comments, I implemented connected components using map/reduce on Spark. You can find more details here - https://www.linkedin.com/pulse/connected-component-using-map-reduce-apache-spark-shirish-kumar and the source code under MIT license here - https://github.com/kwartile/connected-component.
The connected components algorithm does not scale very well, and its performance depends quite a lot on the topology of your graph. The sparsity of your edges doesn't mean you have small components. A long string of edges is very sparse (number of edges = number of vertices - 1), but the brute-force algorithm implemented in GraphX wouldn't be very efficient (see the source of cc and pregel).
Here is what you can try (sorted, code only):
Checkpoint your vertices and edges to parquet (on disk), then load them again to build your graph (see the sketch at the end of this answer). Caching sometimes just doesn't cut it when your execution plan grows too big.
Transform your graph in a way that leaves the result of the algorithm unchanged. For instance, you can see in the code that the algorithm propagates the information in both directions (as it should, by default). So if you have several edges connecting the same two vertices, filter them out of the graph before you apply the algorithm.
Optimize the GraphX code yourself (it is really quite straightforward), using either generic memory-saving optimisations (e.g. checkpointing to disk at each iteration to avoid OOM) or domain-specific optimisations (similar to point 2).
If you are OK with leaving GraphX (which is becoming somewhat legacy) behind, you can consider GraphFrames (package, blog). I never tried it, so I don't know if it has CC.
I'm certain you can find other possibilities among Spark packages, and maybe you will even want to use something outside of Spark. But that is out of the scope of the question.
Best of luck!
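For the first suggestion, a minimal sketch of checkpointing the vertices and edges to parquet and rebuilding the graph from the saved copies (the paths are placeholders, spark is a SparkSession, and Long attributes are assumed on both vertices and edges purely for illustration):

import org.apache.spark.graphx.{Edge, Graph}
import spark.implicits._

// graph: Graph[Long, Long] built earlier (attribute types assumed for illustration).

// 1. Persist the current vertices and edges as parquet (placeholder paths).
graph.vertices.toDF("id", "attr").write.mode("overwrite").parquet("/tmp/graph/vertices")
graph.edges.map(e => (e.srcId, e.dstId, e.attr)).toDF("src", "dst", "attr")
  .write.mode("overwrite").parquet("/tmp/graph/edges")

// 2. Read them back; the rebuilt graph starts from a short lineage, unlike a
//    cached graph whose execution plan keeps growing across iterations.
val vertices = spark.read.parquet("/tmp/graph/vertices")
  .rdd.map(r => (r.getLong(0), r.getLong(1)))
val edges = spark.read.parquet("/tmp/graph/edges")
  .rdd.map(r => Edge(r.getLong(0), r.getLong(1), r.getLong(2)))

val reloaded = Graph(vertices, edges)
val cc = reloaded.connectedComponents().vertices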

Spark LDA woes - prediction and OOM questions

I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.
Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but, more worrisome, incoherent (topics/predictions). If a document is predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
Not being able to tell if something's wrong in my prediction code - or if it's because I was using the Distributed/EM-based model (as is hinted at here by jasonl) - I decided to try the newer Local/Online model. I spent a couple of days tuning my 240-core/768GB-RAM 3-node cluster to no avail; seemingly no matter what I try, I run out of memory attempting to build a model this way.
I tried various settings for:
driver-memory (8G)
executor-memory (1-225G)
spark.driver.maxResultSize (including disabling it)
spark.memory.offheap.enabled (true/false)
spark.broadcast.blockSize (currently at 8m)
spark.rdd.compress (currently true)
changing the serializer (currently Kryo) and its max buffer (512m)
increasing various timeouts to allow for longer computation (executor.heartbeatInterval, rpc.ask/lookupTimeout, spark.network.timeout)
spark.akka.frameSize (1024)
At different settings, it seems to oscillate between a JVM core dump due to off-heap allocation errors (Native memory allocation (mmap) failed to map X bytes for committing reserved memory) and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.
Questions:
Does my prediction routine look OK? Is this an off-by-one error somewhere w.r.t. the irrelevant predicted topics?
Do I stand a chance of building a model with Spark on the order of magnitude described above? Yahoo can do it with modest RAM requirements.
Any pointers as to what I can try next would be much appreciated!
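For reference, building with the online optimizer in the RDD-based API looks roughly like the sketch below (k, the iteration count, and the mini-batch fraction are placeholder values; corpus stands in for the already-built RDD of (document id, term-count vector) pairs):

import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

// corpus: RDD[(Long, Vector)] of (document id, term-count vector) -- assumed to exist.
val lda = new LDA()
  .setK(250)
  .setMaxIterations(100)
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))

// The online optimizer returns a LocalLDAModel directly (no GraphX-backed state).
val model = lda.run(corpus).asInstanceOf[LocalLDAModel]

// Topic mixtures for documents via the bulk prediction API.
val topicMixtures = model.topicDistributions(corpus)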

Spark Indefinite Waiting with "Asked to send map output locations for shuffle"

My jobs often hang with this kind of message:
14/09/01 00:32:18 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark#*:37619
It would be great if someone could explain what Spark is doing when it spits out this message. What does this message mean? What could the user be doing wrong to cause this? What configurables should be tuned?
It's really hard to debug because it doesn't OOM, it doesn't give an ST, it just sits and sits and sits.
This has been an issue in Spark at least as far back as 1.0.0 and is still ongoing with Spark 1.5.0.
Based on this thread, more recent versions of Spark have gotten better at shuffling (and at reporting errors if it fails anyway). Also, the following tips were mentioned:
This is very likely because the serialized map output locations buffer
exceeds the akka frame size. Please try setting "spark.akka.frameSize"
(default 10 MB) to some higher number, like 64 or 128.
In the newest version of Spark, this would throw a better error, for
what it's worth.
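That setting is passed like any other Spark configuration value, for example (a sketch; the value is the one suggested above, and spark.akka.frameSize only applies to older, Akka-based Spark versions):

import org.apache.spark.{SparkConf, SparkContext}

// Raise the Akka frame size (the value is in MB) before creating the context.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job") // placeholder app name
  .set("spark.akka.frameSize", "128")
val sc = new SparkContext(conf)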
A possible workaround:
If the distribution of the keys in your groupByKey is skewed (some
keys appear way more often than others) you should consider modifying
your job to use reduceByKey instead wherever possible.
And a side track:
The issue was fixed for me by allocating just one core per executor.
Maybe your executor-memory config should be divided by executor-cores.
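To make the groupByKey-versus-reduceByKey point above concrete, a minimal sketch (pairs is a placeholder RDD[(String, Long)] with a skewed key distribution):

// groupByKey ships every value for a key to a single reducer, so a hot key
// concentrates both network traffic and memory on one task.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines map-side first, so far less data crosses the network
// for the hot key and the single reducer holds far less at once.
val sumsViaReduce = pairs.reduceByKey(_ + _)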

Resources