Spark LDA woes - prediction and OOM questions - apache-spark

I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.
Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but, more worrisome, incoherent (topics/predictions). If a document is predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
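(The linked skeleton code is not reproduced here; the following is only an illustrative sketch of such a wrapper using the stock batch API, with made-up names like predictTopics and topN. The cherry-picked single-document routine performs roughly the same inference for one document at a time.)
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Illustrative sketch only: infer topic mixtures for unseen documents, where
// each document is a (docId, term-count vector) pair indexed against the same
// vocabulary that was used at training time.
def predictTopics(model: DistributedLDAModel,
                  docs: RDD[(Long, Vector)],
                  topN: Int): Array[(Long, Array[(Int, Double)])] = {
  val local: LocalLDAModel = model.toLocal       // EM-trained model -> local model
  local.topicDistributions(docs)                 // RDD[(docId, topic mixture)]
    .mapValues { dist =>
      dist.toArray.zipWithIndex                  // (weight, topicId)
        .map { case (weight, topic) => (topic, weight) }
        .sortBy(-_._2)                           // heaviest topics first
        .take(topN)
    }
    .collect()
}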
Not being able to tell if something's wrong in my prediction code - or if it's because I was using the Distributed/EM-based model (as is hinted at by jasonl here) - I decided to try the newer Local/Online model. I spent a couple of days tuning my 240-core/768 GB RAM, 3-node cluster to no avail; seemingly no matter what I try, I run out of memory when attempting to build a model this way.
I tried various settings for the following (a sketch of how these combine is shown after the list):
driver-memory (8G)
executor-memory (1-225G)
spark.driver.maxResultSize (including disabling it)
spark.memory.offHeap.enabled (true/false)
spark.broadcast.blockSize (currently at 8m)
spark.rdd.compress (currently true)
changing the serializer (currently Kryo) and its max buffer (512m)
increasing various timeouts to allow for longer computation (executor.heartbeatInterval, rpc.ask/lookupTimeout, spark.network.timeout)
spark.akka.frameSize (1024)
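For reference, here is a sketch of how those settings combine into a single SparkConf (values are the ones mentioned above where given, otherwise placeholders; this is an illustration, not a recommended configuration). Note that driver and executor memory only take effect when set before the JVMs launch, e.g. via the spark-submit --driver-memory/--executor-memory flags rather than in code.
import org.apache.spark.SparkConf

// Illustration of the tuning knobs listed above (Spark 1.6-era config names).
val conf = new SparkConf()
  .setAppName("lda-online-oom-tuning-sketch")
  .set("spark.driver.maxResultSize", "0")             // 0 disables the limit
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "16g")            // must be > 0 when off-heap is enabled; placeholder size
  .set("spark.broadcast.blockSize", "8m")
  .set("spark.rdd.compress", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m")
  .set("spark.executor.heartbeatInterval", "300s")    // placeholder timeout
  .set("spark.network.timeout", "600s")               // placeholder timeout
  .set("spark.akka.frameSize", "1024")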
At different settings, it seems to oscillate between a JVM core dump due to off-heap allocation errors (Native memory allocation (mmap) failed to map X bytes for committing reserved memory) and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.
Questions:
Does my prediction routine look OK? Is this an off-by-one error somewhere w.r.t. the irrelevant predicted topics?
Do I stand a chance of building a model of the order of magnitude described above with Spark? Yahoo! LDA can do it with modest RAM requirements.
Any pointers as to what I can try next would be much appreciated!

Related

What happens if a Spark broadcast join is too large?

In doing Spark performance tuning, I've found (unsurprisingly) that doing broadcast joins eliminates shuffles and improves performance. I've been experimenting with broadcasting on larger joins, and I've been able to successfully use far larger broadcast joins than I expected -- e.g. broadcasting a 2 GB compressed (and much larger uncompressed) dataset, running on a 60-node cluster with 30 GB of memory per node.
However, I have concerns about putting this into production, as the size of our data fluctuates, and I'm wondering what will happen if the broadcast becomes "too large". I'm imagining two scenarios:
A) Data is too big to fit in memory, so some of it gets written to disk, and performance degrades slightly. This would be okay. Or,
B) Data is too big to fit in memory, so it throws an OutOfMemoryError and crashes the whole application. Not so okay.
So my question is: What happens when a Spark broadcast join is too large?
Broadcast variables are plain local objects; setting aside distribution and serialization, they behave like any other object you use. If they don't fit into memory, you'll get an OOM. Other than memory paging, there is no magic that can prevent that.
So broadcasting is not applicable to objects that may not fit into memory (and that must still leave plenty of free memory for standard Spark operations).
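As a concrete, hedged sketch of what that means for broadcast joins: an explicit broadcast() hint ships the whole table to the driver and every executor and bypasses the auto-broadcast size threshold, so an oversized table fails with an OOM rather than spilling (scenario B above). The paths, join column, and threshold value below are made up.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Spark only auto-broadcasts tables smaller than this threshold (default ~10 MB);
// raising it, or using an explicit hint, trades shuffle cost for memory pressure.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (2L * 1024 * 1024 * 1024).toString)

val large = spark.read.parquet("/data/large")   // hypothetical inputs
val small = spark.read.parquet("/data/small")

// The hint forces a broadcast regardless of size: if `small` (deserialized)
// does not fit in driver and executor memory, the job dies with an OOM.
val joined = large.join(broadcast(small), Seq("id"))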

Tuning model fits in Spark ML

I'm fitting a large number of models in Pyspark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a Spark data frame of approximately 50 GB, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with 10 GB of executor memory allocated. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go, but the typical stage runtime seems to be about 5 seconds. Is there anything I can look at to improve performance on these fits?
As with any Spark job, there are a few things you can do to improve training time (a configuration sketch follows below).
spark.driver.memory: keep an eye on your driver memory. Some algorithms shuffle data back to the driver (in order to reduce computing time), so it can be a source of improvement, or at least a point of failure to watch.
spark.executor.memory: set it to the maximum the job needs, but as little as possible beyond that, so you can fit more executors on each node (machine) in the cluster; with more workers you have more compute power to handle the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate data, try different values for this parameter so you can execute more tasks per executor.
spark.executor.cores: keep it below 5 and you're good; above that, you will probably increase the time an executor spends handling the "shuffle" of tasks inside it.
cache/persist: persist your data before heavy transformations. If you're afraid your executors can't hold it all in memory, use StorageLevel.MEMORY_AND_DISK so partitions that don't fit spill to disk.
Important: all of this is based purely on my own experience training algorithms with Spark ML on datasets of 1-5 TB with 30-50 features. I've researched these points to improve my own jobs, but I'm not qualified to be a source of truth for your problem. Learn more about your data and watch your executors' logs for further improvements.
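Pulling these knobs together, a minimal sketch (shown in Scala; the same keys apply from PySpark, and the values are placeholders to show where each setting lives, not recommendations):
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("lr-fit-tuning-sketch")
  .config("spark.executor.memory", "10g")         // as small as the job allows, to fit more executors per node
  .config("spark.executor.cores", "5")            // keep at 5 or below
  .config("spark.sql.shuffle.partitions", "400")  // tune shuffle parallelism
  .getOrCreate()
// Driver memory must be set before the driver JVM starts,
// e.g. via spark-submit --driver-memory 8g, not in code.

// Persist the training data before the iterative fit; MEMORY_AND_DISK spills
// partitions that do not fit in memory to disk instead of recomputing or failing.
val training = spark.read.format("libsvm").load("/path/to/train.libsvm")  // hypothetical path
training.persist(StorageLevel.MEMORY_AND_DISK)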

Spark - GraphX - scaling connected components

I am trying to use connected components but am having issues with scaling. Here is what I have:
// get vertices
val vertices = stage_2.flatMap(x => GraphUtil.getVertices(x)).cache
// get edges
val edges = stage_2.map(x => GraphUtil.getEdges(x)).filter(_ != null).flatMap(x => x).cache
// create graph
val identityGraph = Graph(vertices, edges)
// get connected components
val cc = identityGraph.connectedComponents.vertices
Where GraphUtil has helper functions to return vertices and edges. At this point, my graph has ~1 million nodes and ~2 million edges (btw, this is expected to grow to ~100 million nodes). My graph is pretty sparsely connected, so I expect plenty of small graphs.
When I run the above, I keep getting java.lang.OutOfMemoryError: Java heap space. I have tried with executor-memory 32g, running a cluster of 15 nodes with 45g as the YARN container size.
Here is the exception detail:
16/10/26 10:32:26 ERROR util.Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:360)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:98)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2216)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:32)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:44)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:146)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:173)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:34)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
In addition, I am getting plenty of log lines like the following:
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 320 is 263 bytes
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 321 is 268 bytes
16/10/26 10:30:32 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 322 is 264 bytes
My question is: has anyone tried ConnectedComponents at this scale? If so, what am I doing wrong?
As I posted above in the comments, I implemented connected components using map/reduce on Spark. You can find more details here - https://www.linkedin.com/pulse/connected-component-using-map-reduce-apache-spark-shirish-kumar - and the source code, under the MIT license, here - https://github.com/kwartile/connected-component.
The connected components algorithm does not scale very well, and its performance depends quite a lot on the topology of your graph. The sparsity of your edges doesn't mean you have small components. A long chain of edges is very sparse (number of edges = number of vertices - 1), but the brute-force algorithm implemented in GraphX wouldn't be very efficient on it (see the source of cc and pregel).
Here is what you can try (sorted, code only; a sketch of the first two points appears at the end of this answer):
Checkpoint your vertices and edges as Parquet (on disk), then load them again to build your graph. Caching sometimes just doesn't cut it when your execution plan grows too big.
Transform your graph in a way that leaves the result of the algorithm unchanged. For instance, you can see in the code that the algorithm propagates information in both directions (as it should, by default). So if you have several edges connecting the same two vertices, filter them out of the graph you apply the algorithm to.
Optimize the GraphX code yourself (it is really quite straightforward), using either generic memory-saving optimisations (e.g. checkpointing to disk at each iteration to avoid OOM) or domain-specific optimisations (similar to point 2).
If you are OK with leaving GraphX behind (it is becoming somewhat legacy), you can consider GraphFrames (package, blog). I never tried it, so I don't know whether it has connected components.
I'm certain you can find other possibilities among Spark packages, and maybe you will even want to use something outside of Spark, but that is out of scope for this question.
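To make the first two points concrete, here is a rough sketch; the Parquet path and the stand-in rawEdges RDD are illustrative, not taken from the question.
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cc-checkpoint-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the real extraction output (e.g. what GraphUtil.getEdges yields).
val rawEdges: RDD[(Long, Long)] =
  spark.sparkContext.parallelize(Seq((1L, 2L), (2L, 1L), (2L, 3L), (4L, 5L)))

// 1. Materialise the extracted edge list to Parquet and read it back, so the
//    graph is built from a short lineage rather than a long cached plan.
def checkpointEdges(edges: RDD[(Long, Long)], path: String): RDD[(Long, Long)] = {
  edges.toDF("src", "dst").write.mode("overwrite").parquet(path)
  spark.read.parquet(path).rdd.map(r => (r.getLong(0), r.getLong(1)))
}

// 2. Normalise direction and drop duplicate edges; the connected components
//    are unchanged, but Pregel has fewer messages to propagate.
val edges: RDD[Edge[Int]] = checkpointEdges(rawEdges, "/tmp/cc/edges")
  .map { case (a, b) => if (a < b) (a, b) else (b, a) }
  .distinct()
  .map { case (src, dst) => Edge(src, dst, 1) }

// GraphX derives the vertex set from the deduplicated edge list.
val cc = Graph.fromEdges(edges, defaultValue = 0).connectedComponents().vertices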
Best of luck!

Extremely Slow Processing on Dataproc: 9 hours vs 3 mins on local machine

From the log I can see that there are 182k rows (70 MB). It takes 1.5 hours to load 70 MB of data and 9 hours (started at 15/11/14 01:58:28 and ended at 15/11/14 09:19:09) to train on the 182K rows on Dataproc. Loading the same data and running the same algorithm on my local machine takes 3 minutes.
DataProc Log
15/11/13 23:27:09 INFO com.google.cloud.hadoop.io.bigquery.ShardedExportToCloudStorage: Table 'mydata-data:website_wtw_feed.video_click20151111' to be exported has 182712 rows and 70281790 bytes
15/11/13 23:28:13 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor#rc-spark-poc-w-1.c.dailymotion-data.internal:60749] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/11/14 01:58:28 INFO com.dailymotion.recommender.BigQueryRecommender: Fetching the Ratings RDD
15/11/14 01:58:28 INFO com.dailymotion.recommender.BigQueryRecommender: Transforming the video feature matrix
15/11/14 01:58:28 INFO com.dailymotion.recommender.BigQueryRecommender: Training ALS Matrix factorization Model
[Stage 2:=============================> (1 + 1) / 2]
15/11/14 09:19:09 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/11/14 09:19:09 WARN com.github.fommil.netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
15/11/14 09:19:44 INFO com.dailymotion.recommender.BigQueryRecommender: Transforming the video feature matrix
15/11/14 09:19:44 INFO com.dailymotion.recommender.BigQueryRecommender: Transforming the user feature matrix
Copied the data to local machine
r.viswanadha$ gsutil cp -r gs://<mycompany>-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000 .
Copying gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-0/data-000000000000.json...
Downloading ...201511132327_0000/shard-0/data-000000000000.json: 141.3 MiB/141.3 MiB
Copying gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-0/data-000000000001.json...
Copying gs://<mycompany>-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-1/data-000000000000.json...
Ran the same algorithm; the ALS train step took ~3 mins:
com.dailymotion.recommender.BigQueryRecommender --app_name BigQueryRecommenderTest --master local[4] --input_dir /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/input/job_201511132327_0000/shard-0/
First Run
15/11/14 13:19:36 INFO BigQueryRecommender: Training implicit features for the ALS Matrix factorization Model
...
15/11/14 13:22:24 INFO BigQueryRecommender: Transforming the video feature matrix
Second Run
15/11/14 13:29:05 INFO BigQueryRecommender: Training implicit features for the ALS Matrix factorization Model
...
15/11/14 13:31:57 INFO BigQueryRecommender: Transforming the video feature matrix
The Dataproc cluster has 1 master and 3 workers, each with 104 GB of RAM and 16 CPUs.
My local machine has 8 GB of RAM and a 2-core 2.7 GHz Core i5.
gsutil ls -l -r -h gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000
gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/:
gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-0/:
0 B 2015-11-13T23:27:13Z gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-0/
141.3 MiB 2015-11-13T23:29:21Z gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-0/data-000000000000.json
0 B 2015-11-13T23:29:21Z gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-0/data-000000000001.json
gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-1/:
0 B 2015-11-13T23:27:13Z gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-1/
0 B 2015-11-13T23:28:47Z gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/shard-1/data-000000000000.json
0 B 2015-11-13T23:27:09Z gs://dailymotion-spark-rc-test/bqdata/hadoop/tmp/bigquery/job_201511132327_0000/
TOTAL: 6 objects, 148165416 bytes (141.3 MiB)
To recap some of the offline findings: when things run orders of magnitude slower on a distributed cluster than on a local setup, the main bottleneck to look for is I/O round-trip latency, both from cross-network service dependencies and from disk and local I/O.
Things to look for in general (some of which may or may not apply to your specific case, but which may be common to others encountering similar issues):
Make sure the GCS bucket holding the data is in the same region as the GCE zone you've deployed your Dataproc cluster in (check with gsutil ls -L gs://[your-bucket]). Cross-continental traffic is not only significantly slower, it may also incur additional network costs in your project.
If your job has any other network dependencies, such as querying APIs or a separate database running on GCE, try to colocate them in the same zone; even within the same continent, GCE cross-region traffic can have tens of milliseconds of round-trip latency, which adds up significantly if requests are made per record (for example, 30 ms * 180k records comes out to 1.5 hours).
Even though this may not have been applicable to your specific case this time, remember to avoid per-record round-trip I/O to GCS via the Hadoop FileSystem interfaces where possible. Overall throughput to GCS is very scalable, but because it is remote storage, round-trip latencies are much higher than anything you'd measure on a local machine, where reads often hit the OS buffer cache, or where a laptop with an SSD can sustain high volumes of sub-millisecond round trips, compared to 30-100 ms round trips to GCS.
In general, for use cases that can sustain very high throughput but suffer from long round-trip latencies, make sure to shard out the data, e.g. with repartition(), if it is small and doesn't already naturally partition into sufficient parallelism, to ensure good utilization of your Spark cluster.
Finally, our latest Dataproc release fixes a number of native-library configurations, so it may show much better performance for the ALS portion as well as other MLlib use cases.
For anyone who hits something similar: when dealing with only a single small object in GCS (or a single shard of data from the BigQuery connector), you can end up with a single partition in your Spark RDD and, as a result, little or no parallelism.
While it adds an extra shuffle stage, the input RDD can be repartitioned right after reading from GCS or BigQuery to get the desired number of partitions. Whether the extra shuffle is beneficial depends on how much processing or I/O is required per record in the RDD.
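A hedged sketch of that pattern (the bucket path and the partition multiplier are made up):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-after-read-sketch").getOrCreate()

// A single small GCS object (or one BigQuery export shard) often yields a
// single input partition, so the whole job runs on one core.
val ratings = spark.read.json("gs://my-bucket/bqexport/shard-0/")
println(s"partitions before: ${ratings.rdd.getNumPartitions}")

// Repartition right after the read; the extra shuffle is usually worth it when
// each record needs non-trivial CPU or I/O downstream (e.g. ALS training).
val parallel = ratings.repartition(spark.sparkContext.defaultParallelism * 3)
println(s"partitions after: ${parallel.rdd.getNumPartitions}")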

Spark 1.4 MLlib memory pile-up with gradient boosted trees

Problem with Gradient Boosted Trees (GBT):
I am running on AWS EC2 with version spark-1.4.1-bin-hadoop2.6
What happens is that if I run GBT for 40 iterations, the input (as seen in the Spark UI) becomes larger and larger for certain stages, and the runtime increases correspondingly:
MapPartition in DecisionTree.scala L613
Collect in DecisionTree.scala L977
count in DecisionTreeMetadata.scala L111
I start with 4 GB of input and eventually this goes up to over 100 GB, increasing by a constant amount each iteration. The completion of the related tasks becomes slower and slower.
The question is whether this is the correct behaviour or whether this is a bug in MLlib.
My feeling is that somehow more and more data is being bound to the relevant data RDD.
Does anyone know how to fix it?
I think a problematic line might be L225 in GradientBoostedTrees.scala, where a new data RDD is defined.
I am referring to
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree
