I am trying to run Spark's Power Iteration Clustering algorithm for 5000 iterations on 80 million rows of data. At low iteration counts (a couple hundred) it finishes fine, so it's not a code issue. At high iteration counts it throws a java.lang.StackOverflowError.
I know this means the DAG grew too large and Spark can no longer keep track of the lineage. I have also read that checkpointing can solve this issue in iterative algorithms. The problem is that PIC has no checkpoint interval parameter like the LDA algorithm does, so I can't (or at least don't know how to) checkpoint in the middle of the run.
Is there another possible fix for this issue? I have also tried increasing the stack size, but that hasn't helped. I can't decrease the iteration count because the algorithm won't converge.
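Since PIC exposes no checkpoint interval, one hedged thing to try is simply setting a checkpoint directory before running, in case the iterative internals of your Spark version can make use of it; a minimal sketch, where the path and k are placeholders and similarities is the affinity RDD:

import org.apache.spark.mllib.clustering.PowerIterationClustering

// A checkpoint directory must be set before anything in the job can checkpoint at all.
// Whether PIC's internals actually use it depends on the Spark version.
sc.setCheckpointDir("hdfs:///tmp/pic-checkpoints")   // placeholder path

val model = new PowerIterationClustering()
  .setK(10)                    // placeholder number of clusters
  .setMaxIterations(5000)
  .run(similarities)           // similarities: RDD[(Long, Long, Double)]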
Looking at the Spark UI timeline, I find that the last task of a specific stage in my Spark application always takes far too long. It seems like the task will never finish; I have waited up to six times longer than a normal task takes.
I want to get more information about this last task, but I don't know how to debug a single task. Can anyone give me some suggestions?
Thanks for your help!
The data has been partitioned evenly, so the last task shouldn't hold too much data.
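One way to double-check the partitioning is to count the records in each partition and look for skew; a minimal sketch, where rdd stands for whatever dataset feeds the slow stage:

// Count records per partition and print the largest partitions first
val perPartitionCounts = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
perPartitionCounts.sortBy(-_._2).take(10).foreach(println)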
Check the explain plan of the resulting DataFrame to understand what operations are happening. Are there any shuffles? Sometimes operations on a DataFrame (such as joins) produce intermediate DataFrames mapped to a smaller number of partitions, and this hurts performance because the data isn't as well distributed as it could be.
Check whether there are many shuffles and repeated uses of such DataFrames, and try to cache the DataFrame that comes right after a shuffle.
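As a rough sketch of both checks (df, other and the join condition are placeholders):

// Look for Exchange (shuffle) operators in the physical plan
val joined = df.join(other, df("id") === other("id"))
joined.explain()

// If the post-shuffle result is reused several times, cache it
joined.cache()
joined.count()   // the first action materializes the cache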
Check the Spark UI (the driver's address on port 4040 by default) to see the data volume of the cached DataFrames, what the running processes are, and whether there is other overhead such as GC or whether it is pure processing time.
Hope that helps.
If we run a Spark job, let's say logistic regression in Spark,
the first iteration takes around 80s and further iterations take about 1s. Why is that?
What is Spark's internal behaviour here? I know Spark stores data in memory and that's why computation is faster, but a detailed explanation would be good!
A few things:
The first iteration can include sending code to the workers, etc.
Most ML algorithms cache the input data in memory. Caching is lazy, so in the first iteration the whole dataset is cached (moved to RAM), and in subsequent iterations the algorithm uses the cached data, which is much faster (see the sketch below).
The Spark infrastructure must also be initialized: parts of the context, the executor JVMs, and so on.
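A small sketch of that lazy-caching behaviour (the input path is a placeholder):

val training = sc.textFile("hdfs:///data/train").cache()   // cache() is lazy: nothing is stored yet

training.count()   // first action: reads from disk and fills the cache (slow, like the ~80s first iteration)
training.count()   // later actions read from memory (fast, like the ~1s subsequent iterations)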
I used the newAPIHadoopRDD() method to load HBase records into an RDD and ran a simple count job.
However, this count job takes far more time than I expected. Looking at the code, I suspect that one column family in HBase simply has too much data, and that loading all of it into the RDD may cause the executors' memory to overflow.
Could that be the cause of the issue?
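For reference, the load presumably looks roughly like this (the table name and configuration details are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

val records = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

records.count()   // each Result carries every cell of its row, so a very wide column family inflates memory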
I use Spark with the Cassandra Spark connector and direct Kafka.
I see batch processing time increasing slowly over time,
even when there is nothing incoming from Kafka to process.
I think it grows by a few milliseconds per batch, but after a long time a batch can take several seconds more, until it reaches the batch interval and the application finally crashes.
At first I thought it was a memory leak, but in that case I would expect the processing time to grow exponentially rather than roughly linearly.
I don't really know whether it is the stages that become longer and longer, or the latency
between stages that increases.
I use Spark 1.4.0.
Any pointers about this?
EDIT:
A closer look at the evolution of each batch's processing time, compared with the total job processing time,
shows that even when the batch processing time increases, the job processing times do not.
Example: for a batch that takes 7s, the sum of the job processing times is only 1.5s (as shown in the image below).
Is it the driver-side computing time that increases, rather than the executor-side computing time?
And is this driver-side time not shown in the job processing UI?
If that's the case, how can I correct it?
I finally found the solution to my problem.
I had this code in the function that adds filters and transformations to my RDD:
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])
Because this function is called for every batch, the same converters end up registered over and over inside TypeConverter.
I don't know exactly how the Cassandra Spark converters work internally, but they appear to use reflection on these objects,
and performing that slow reflection over an ever-growing list of converters, once per batch, makes the batch processing time keep increasing.
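Presumably the fix is simply to register the converters once, at driver start-up, instead of inside the per-batch function; a minimal sketch (SomethingToOptionConverter, OptionToSomethingConverter and EventCC are the classes from the snippet above, and the import path assumes the DataStax spark-cassandra-connector):

import com.datastax.spark.connector.types.TypeConverter

// Register once when the application starts ...
TypeConverter.registerConverter(new SomethingToOptionConverter[EventCC])
TypeConverter.registerConverter(new OptionToSomethingConverter[EventCC])

// ... and keep the per-batch function limited to building filters/transformations,
// so TypeConverter's internal list of converters no longer grows with every batch.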
Problem with Gradient Boosted Trees (GBT):
I am running on AWS EC2 with spark-1.4.1-bin-hadoop2.6.
When I run GBT for 40 iterations, the input size shown in
the Spark UI becomes larger and larger for certain stages
(and the runtime increases correspondingly):
MapPartitions in DecisionTree.scala L613
Collect in DecisionTree.scala L977
Count in DecisionTreeMetadata.scala L111
I start with 4 GB of input, and this eventually grows to over 100 GB,
increasing by a constant amount per iteration. The completion of the related tasks
becomes slower and slower.
The question is whether this is expected behaviour or a bug in MLlib.
My feeling is that somehow more and more data is being bound to the relevant data RDD.
Does anyone know how to fix it?
I think a problematic line might be L225 in GradientBoostedTrees.scala, where a new data RDD is defined.
I am referring to
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/tree
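For lineage growth in iterative MLlib algorithms like GBT, checkpointing is the usual mitigation; a hedged sketch, assuming your Spark version's tree Strategy exposes a checkpointInterval field (the path, depth and trainingData are placeholders):

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")       // placeholder path

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 40
boostingStrategy.treeStrategy.maxDepth = 5
boostingStrategy.treeStrategy.checkpointInterval = 10   // if available, truncates lineage every 10 iterations

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)   // trainingData: RDD[LabeledPoint]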