Spark 1.4 MLlib memory pile-up with gradient boosted trees

Problem with Gradient Boosted Trees (GBT):
I am running on AWS EC2 with spark-1.4.1-bin-hadoop2.6.
When I run GBT for 40 iterations, the input size shown in the Spark UI grows larger and larger for certain stages (and the runtime increases correspondingly):
mapPartitions in DecisionTree.scala L613
collect in DecisionTree.scala L977
count in DecisionTreeMetadata.scala L111
I start with 4 GB of input and eventually this goes up to over 100 GB, with the input increasing by a constant amount. The related tasks complete more and more slowly.
The question is whether this is the expected behaviour or a bug in MLlib.
My feeling is that somehow more and more data gets bound to the relevant data RDD.
Does anyone know how to fix it?
I think a problematic line might be L225 in GradientBoostedTrees.scala, where a new data RDD is defined.
I am referring to
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree

Related

How to handle Spark tasks that require much more memory than the inputs and outputs combined?

I want to perform a computation that can be written as a simple Python UDF. This particular computation consumes much more memory while generating intermediate results than is needed to store the inputs and outputs combined.
Here's the rough structure of the computational task:
import pyspark.sql.functions as fx

@fx.udf("double")
def my_computation(reference):
    # Small input -> large intermediate object -> small result
    large_object = load_large_object(reference)
    result = summarize_large_object(large_object)
    return result

df_input = spark.read.parquet("list_of_references.parquet")
df_result = df_input.withColumn("result", my_computation(fx.col("reference")))
pdf_result = df_result.toPandas()
The idea is that load_large_object takes a small input (reference) and generates a much larger object, whereas summarize_large_object takes that large object and summarizes it down to a much smaller result. So, while the inputs and outputs (reference and result) can be quite small, the intermediate value large_object requires much more memory.
I haven't found any way to reliably run computations like this without severely limiting the amount of parallelism that I can achieve on upstream and downstream computations in the same Spark session.
Naively running a computation like the one above (without any changes to the default Spark configuration) often leads to worker nodes running out of memory. For example, if large_object consumes 2 GB of memory and the relevant Spark stage is running as 8 parallel tasks on 8 cores of a machine with less than 16 GB of RAM, the worker node will run out of memory (or at least start swapping to disk and slow down significantly).
IMO, the best solution would be to temporarily limit the number of parallel tasks that can run simultaneously. The closest configuration parameter that I'm aware of is spark.task.cpus, but this affects all upstream and downstream computations within the same Spark session.
Ideally, there would be some way to provide a hint to Spark that effectively says "for this step, make sure to allocate X amount of extra memory per task" (and then Spark wouldn't schedule such a task on any worker node that isn't expected to have that amount of extra memory available). Upstream and downstream jobs/stages could remain unaffected by the constraints imposed by this sort of hint.
Does such a mechanism exist?
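To make the arithmetic in the question concrete, here is a minimal sketch in plain Python (no Spark needed). The function names and the 15 GB node in the example are illustrative assumptions, not part of Spark's API. The only Spark behaviour relied on is that an executor runs at most (executor cores // spark.task.cpus) tasks at once.

```python
def max_safe_parallel_tasks(node_ram_gb, per_task_peak_gb, reserved_gb=0.0):
    """How many tasks fit in RAM at once if each one peaks at per_task_peak_gb."""
    if per_task_peak_gb <= 0:
        raise ValueError("per_task_peak_gb must be positive")
    usable_gb = max(0.0, node_ram_gb - reserved_gb)
    return int(usable_gb // per_task_peak_gb)


def task_cpus_for(executor_cores, target_parallel_tasks):
    """Smallest spark.task.cpus value that keeps concurrency within the target.

    Concurrent tasks per executor = executor_cores // spark.task.cpus.
    """
    if target_parallel_tasks < 1:
        raise ValueError("the node cannot hold even one such task")
    return executor_cores // (target_parallel_tasks + 1) + 1


# Hypothetical node: 8 cores, 15 GB of RAM, 2 GB peak per task.
safe = max_safe_parallel_tasks(node_ram_gb=15, per_task_peak_gb=2)
cpus = task_cpus_for(executor_cores=8, target_parallel_tasks=safe)
print(safe, cpus, 8 // cpus)
```

With these numbers only 7 tasks fit in memory, so spark.task.cpus would have to be 2, which cuts concurrency to 4 for every stage in the session. That is exactly the drawback described above.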

Spark java.lang.StackOverflowError on Power Iteration Clustering

I am trying to run Spark's Power Iteration Clustering algorithm for 5000 iterations on 80 million rows of data. With a low iteration count (a couple hundred) it finishes fine, so it's not a code issue. With a high iteration count it gives me a java.lang.StackOverflowError exception.
I know this means that the DAG grew too large and Spark can't keep track of the lineage. I have also read that checkpointing can solve this issue in iterative algorithms. The problem is that PIC has no checkpoint interval parameter like the LDA algorithm does, so I can't (or at least don't know how to) checkpoint while the algorithm is running.
Is there another possible fix for this issue? I have also tried increasing the stack memory, but that hasn't worked. I can't decrease the number of iterations because then it won't converge.
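As a rough illustration of why a long lineage overflows the stack and why checkpointing helps, here is a toy model in plain Python. It is not Spark code: Step, run and walk_recursively are hypothetical names, and the "checkpoint" simply drops the reference to the parent, which is only an analogy for truncating an RDD's lineage.

```python
class Step:
    """One iteration's result, remembering the step it was derived from."""
    def __init__(self, parent=None):
        self.parent = parent


def lineage_length(step):
    """Count a step and all of its ancestors (iteratively, so it cannot overflow)."""
    n = 0
    while step is not None:
        n += 1
        step = step.parent
    return n


def walk_recursively(step):
    """Visit ancestors one stack frame per step, as a recursive traversal would."""
    return 1 if step.parent is None else 1 + walk_recursively(step.parent)


def run(iterations, checkpoint_every=None):
    current = Step()
    for i in range(1, iterations + 1):
        current = Step(parent=current)
        if checkpoint_every and i % checkpoint_every == 0:
            current = Step()  # "checkpoint": keep the result, forget the ancestry
    return current


long_chain = run(5000)
short_chain = run(5050, checkpoint_every=100)
print(lineage_length(long_chain), lineage_length(short_chain))

try:
    walk_recursively(long_chain)
except RecursionError:
    print("deep lineage overflowed the stack")
```

In real Spark code the corresponding calls are SparkContext.setCheckpointDir and RDD.checkpoint. The difficulty described above is that PIC does not expose a parameter for invoking them between iterations.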

Tuning model fits in Spark ML

I'm fitting a large number of models in PySpark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a Spark DataFrame of approximately 50 GB, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with 10 GB of executor memory. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
In general, there are a few things you can do in Spark to improve training time.
spark.driver.memory: keep an eye on your driver memory. Some algorithms move data to the driver (to reduce computing time), so it can be a source of improvement, or at least a point of failure to watch.
spark.executor.memory: set it to cover what the job needs but no more, so that you can fit more executors on each node (machine) in the cluster; more workers give you more compute power for the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate data, try different values for this parameter so that you can execute more tasks per executor.
spark.executor.cores: keep it below 5 and you're good; above that, you will probably increase the time an executor spends handling the "shuffle" of tasks inside it.
cache/persist: try to persist your data before huge transformations. If you're afraid your executors won't be able to hold it, use StorageLevel.MEMORY_AND_DISK so that both memory and disk are used.
Important: all of this is based on my own experience alone training algorithms using Spark ML over datasets with 1TB-5TB and 30-50 features, I've researched to improve my own jobs but I'm not qualified as a source of truth for your problem. Learn more about your data and watch the logs of your executors for further enhancements.
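The settings above could be collected in one place and passed to spark-submit as --conf flags. The sketch below is plain Python and only builds the command line. The values are placeholders to experiment with (only the 10g executor memory comes from the question), and to_submit_args and train.py are hypothetical names.

```python
def to_submit_args(conf):
    """Render a settings dict as spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


# Placeholder values: tune them against your own data and cluster.
tuning = {
    "spark.driver.memory": "8g",
    "spark.executor.memory": "10g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 400,
}

print(" ".join(["spark-submit"] + to_submit_args(tuning) + ["train.py"]))
```

Keeping the settings in a dict makes it easy to try several combinations and compare stage runtimes in the Spark UI.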

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30 GB of RAM, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the results back to HDFS.
Initially I split the data into 64 partitions and got an OOM; I was then able to fix the memory issue by using 1024 partitions. But why did using more partitions solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all of the data fits into memory, and it cannot all be processed on a single machine either.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data has been processed for each partition, the results need to be merged (reduce). This is traditional MapReduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark uses a revolutionary concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated. Those RDDs can be treated as intermediate results that we don't want to retrieve.
Actions are used when you really want to get the data. Those RDDs/data can be treated as what we actually want, like "take top failing".
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, then forgets it.
(source: cloudera.com)
I made a small screencast for a presentation on YouTube: "Spark Makes Big Data Sparking".
"Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data". The issue is large partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
Less concurrency
Increased memory pressure for transformations that involve a shuffle
More susceptibility to data skew
Too many partitions can also have a negative impact:
Too much time spent scheduling multiple tasks
If you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks, as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
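As far as I understand the Spark SQL file source, these three properties combine roughly as in the sketch below. It is a plain-Python approximation of that logic, not Spark's actual code, and the function name and example file sizes are assumptions. The target split size is the per-core share of the input, floored at the open cost and capped at maxPartitionBytes.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_in_bytes=4 * MB):
    """Approximate target size of one input split when reading files."""
    # Each file is charged its length plus a fixed cost for opening it.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# Ten 1 GB files on 16 cores: the 128 MB cap applies.
print(max_split_bytes([1024 * MB] * 10, default_parallelism=16) // MB)
# Four 10 MB files on 8 cores: splits shrink so every core gets work.
print(max_split_bytes([10 * MB] * 4, default_parallelism=8) // MB)
```

For large inputs the split size is therefore usually just spark.sql.files.maxPartitionBytes, and the partition count is roughly the total input size divided by that value.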
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit, Aaron Davidson gave some tips on partition tuning. He also defined a reasonable number of partitions, summarized in the three points below:
Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" depends on the sizes of the dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = few enough partitions that each task still takes at least 100 ms
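A starting point based on these rules of thumb could be computed as below. This is only a sketch: the function name is hypothetical, and clamping the result to the "common" 100 to 10000 range is my reading of the first point, not something the talk prescribes. The task-duration bound has to be checked empirically in the Spark UI.

```python
def suggest_partitions(total_cores, common_min=100, common_max=10_000):
    """Starting partition count: at least 2x the cores, kept in the common range."""
    lower_bound = 2 * total_cores
    return min(max(lower_bound, common_min), common_max)


print(suggest_partitions(16))   # small cluster: the common minimum wins
print(suggest_partitions(200))  # larger cluster: 2x cores wins
```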
Rockie's answer is right, but it doesn't get to the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain point Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, as you have more partitions, Spark stores fewer of them in the LRU cache, so that they do not cause OOM (this may have a negative impact too, such as the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X: this is the memory needed by each task, and each task is executed on a (virtual) core. So it's not difficult to see that X = 64 leads to OOM, but X = 1024 does not.
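The division described above can be worked through in a few lines of plain Python. The 300 GB total and the 8 concurrently running tasks are assumed figures for illustration (the question only says "a few hundred GB"), and the function names are hypothetical.

```python
def per_partition_gb(total_gb, num_partitions):
    """Average amount of data a single task has to handle."""
    return total_gb / num_partitions


def concurrent_footprint_gb(total_gb, num_partitions, running_tasks):
    """Data held at once by all tasks running in parallel on one node."""
    active = min(running_tasks, num_partitions)
    return per_partition_gb(total_gb, num_partitions) * active


for x in (64, 1024):
    print(x, per_partition_gb(300, x), concurrent_footprint_gb(300, x, running_tasks=8))
```

Under these assumptions, 64 partitions mean about 4.7 GB per task and 37.5 GB across 8 parallel tasks, which is more than a 30 GB node can hold, while 1024 partitions bring that down to roughly 0.3 GB per task and about 2.3 GB in total.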

Spark LDA woes - prediction and OOM questions

I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.
Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but, more worrisome, incoherent (topics/predictions). If a document's predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
Not being able to tell if something's wrong in my prediction code, or if it's because I was using the Distributed/EM-based model (as is hinted at here by jasonl), I decided to try the newer Local/Online model. I spent a couple of days tuning my 240 core/768 GB RAM 3-node cluster to no avail; seemingly no matter what I try, I run out of memory attempting to build a model this way.
I tried various settings for:
driver-memory (8G)
executor-memory (1-225G)
spark.driver.maxResultSize (including disabling it)
spark.memory.offHeap.enabled (true/false)
spark.broadcast.blockSize (currently at 8m)
spark.rdd.compress (currently true)
changing the serializer (currently Kryo) and its max buffer (512m)
increasing various timeouts to allow for longer computation (executor.heartbeatInterval, rpc.ask/lookupTimeout, spark.network.timeout)
spark.akka.frameSize (1024)
At different settings, it seems to oscillate between a JVM core dump due to off-heap allocation errors (Native memory allocation (mmap) failed to map X bytes for committing reserved memory) and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.
Questions:
Does my prediction routine look OK? Is this an off-by-one error somewhere w.r.t. the irrelevant predicted topics?
Do I stand a chance of building a model with Spark on the order of magnitude described above? Yahoo can do it with modest RAM requirements.
Any pointers as to what I can try next would be much appreciated!

Resources