Why Spark's First Iteration is slow and further iterations are faster?

Why Spark's First Iteration is slow and further iterations are faster? - apache-spark

If we are running spark job, lets say logistic regression in spark,
for the first iteration spark will take around 80s and further it will take 1s why is that so ?
Whats the internal behavior of spark here ? i know spark stores the data in-memory thats why computation is faster but detailed explanation would be good!

Few things:
First iteration can contain sending code to workers, etc.
Most of ML algorithms caches input data in memory. Cache is lazy, so in the first iteration whole dataset is cached - moved to RAM - and in next iterations algorithm uses cached data - which is much faster
Spark infrastructure must be initialized - parts of the context, executor JVMs

Related

Is there a way to "precompile" the spark optimization plan, so that it doesn't need to be recomputed everytime?

If I have an application that runs the same job on the same set of columns (not necessarily same row values) every day. Is there a way I can save the spark execution plan without having spark recompute it every time?
My application requires thousands of transformations and there is significant time involved in building the lineage graph and optimization plan.

Is there a way I can save the spark execution plan without having spark recompute it every time?
I have never came across such possibility, so with large dose of confidence I can say that it's not an option.
What instead you can do it to optimize the data that is the input to Spark - optimal partitioning, compression, a format that supports predicate pushdown is probably the places where you can look for some time savings.

Why there are so many partitions required before shuffling data in Apache Spark?

Background
I am a newbie in Spark and want to understand about shuffling in spark.
I have two following questions about shuffling in Apache Spark.
1) Why there is change in no. of partitions before performing shuffling ? Spark does it by default by changing partition count to value given in spark.sql.shuffle.partitions.
2) Shuffling usually happens when there is a wide transformation. I have read in a book that data is also saved on disk. Is my understanding correct ?

Two questions actually.
Nowhere it it stated that you need to change this parameter. 200 is the default if not set. It applies to JOINing and AGGregating. You make have a far bigger set of data that is better served by increasing the number of partitions for more processing capacity - if more Executors are available. 200 is the default, but if your quantity is huge, more parallelism if possible will speed up processing time - in general.
Assuming an Action has been called - so as to avoid the obvious comment if this is not stated, assuming we are not talking about ResultStage and a broadcast join, then we are talking about ShuffleMapStage. We look at an RDD initially:
DAG dependency involving a shuffle means creation of a separate Stage.
Map operations are followed by Reduce operations and a Map and so forth.
CURRENT STAGE
All the (fused) Map operations are performed intra-Stage.
The next Stage requirement, a Reduce operation - e.g. a reduceByKey, means the output is hashed or sorted by key (K) at end of the Map
operations of current Stage.
This grouped data is written to disk on the Worker where the Executor is - or storage tied to that Cloud version. (I would have
thought in memory was possible, if data is small, but this is an architectural Spark
approach as stated from the docs.)
The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. ShuffleManager keeps track of all
keys/locations once all of the map side work is done.
NEXT STAGE
The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using Block Manager.
The Executor may be re-used or be a new on another Worker, or another Executor on same Worker.
Stages mean writing to disk, even if enough memory present. Given finite resources of a Worker it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' style of implementation.
Of course, fault tolerance is aided by this persistence, less re-computation work.
Similar aspects apply to DFs.

Difference between one-pass and multi-pass computations

I'm reading an article on Apache Spark and I came across the following sentence:
"Hadoop as a big data processing technology has been around for 10 years and has proven to be the solution of choice for processing large data sets. MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms." (Full article)
Searching the web yields results about the difference between one-pass and multi-pass compilers (For instance, see This SO question)
However, I'm not really sure if the answer also applies for data processing. Can somebody explain me what one-pass computation and multi-pass computation is, and why the latter is better, and thus is used in Spark?

Map Reduce
Source : https://www.guru99.com/introduction-to-mapreduce.html
Here you can see, the input file is processed as follows.
first split
goes into mapping phase
Shuffling
Reducer
In Map-reduce paradigm, after every stage the intermediate result is written to disk. Also, Mapper and Reducer are two different process. That is, first the mapper job runs, spits out the mapping files, then the reducer job starts. At every stage the job requires resource allocation. Therefore, a single map-reduce job required multiple iterations. If you have multiple map phases, after every map the data needs to spit out to disk before other map task starts. This is the multi-step process.
Each step in the data processing workflow has one Map phase and one Reduce phase and you'll need to convert any use case into MapReduce pattern to leverage this solution.
Spark
On the other hand, spark does the resource negotiation only once. Once the negotiation is completed, it spawns all the executors and that stays throughout the tenure of the job.
During the execution, spark doesn't write the intermediate output of the Map phases to the disk, rather keeps in memory. Therefore, all the map operations can happen back to back without writing to disk or spawning new executors. This is the single step process.
Spark allows programmers to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.

One pass computations is when you are reading the dataset once whereas multipass computations is when a dataset is read once from the disk and multiple computations or operation are done on the same dataset. Apache Spark processing framework allows you to read data once which is then cached into memory and then we can perform multi pass computations on the data. These computations can be done on the dataset very quickly because the data is present into memory of the machine and apache spark does not need to read the data again from the disk which helps us to save lot of input output operations time. As per the definition of apache spark it is an in memory processing framework which means the data and transformation on which the computation is done is present in memory itself.

which is faster in spark, collect() or toLocalIterator()

I have a spark application in which I need to get the data from executors to driver and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on Internet, it returns an iterator rather than sending whole RDD instantly, so it has better memory performance, but what about speed? How is the performance between collect() and toLocalIterator() when it comes to execution/computation time?

The answer to this question depends on what would you do after making df.collect() and df.rdd.toLocalIterator(). For example, if you are processing a considerably big file about 7M rows and for each of the records in there, after doing all the required transformations, you needed to iterate over each of the records in the DataFrame and make a service calls in batches of 100.
In the case of df.collect(), it will dumping the entire set of records to the driver, so the driver will need an enormous amount of memory. Where as in the case of toLocalIterator(), it will only return an iterator over a partition of the total records, hence the driver does not need to have enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cause you a lot of expense, where as toLocalIterator() will not and it will be faster and reliable as well.
On the other hand if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also if your file size is so small that Spark's default partitioning logic does not break it down into partitions at all then df.collect() will be more faster.

To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However lower memory footprint can partially compensate that, depending on a particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.

As for the toLocalIterator, it is used to collect the data from the RDD scattered around your cluster into one only node, the one from which the program is running, and do something with all the data in the same node. It is similar to the collect method, but instead of returning a List it will return an Iterator.
So, after applying a function to an RDD using foreach you can call toLocalIterator to get an iterator to all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to transform it to an RDD again after doing the operations you need, use the SparkContext to parallelize it.

Tuning model fits in Spark ML

I'm fitting a large number of models in Pyspark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a spark data frame that's approximately 50gb, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with allocated executor memory = 10gb. Fitting a logistic regression classifier, it creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340mb each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?

As a general job in Spark, you can do some stuff to improve your training time.
spark.driver.memory look out for your driver memory, some algorithms do shuffle data to your driver (in order to reduce computing time), so it might be a source of enhancement or at least one point of failure to keep an eye at.
Change the spark.executor.memory so it uses the maximum needed by the job but it also uses as little as much so you can fit more executors in each node (machine) on the cluster, and as you have more workers, you'll have more computer power to handle the job.
spark.sql.shuffle.partitions since you probably use DataFrames to manipulate data, try different values on this parameter so that you can execute more tasks per executor.
spark.executor.cores use it below 5 and you're good, above that, you probably will increase the time an executor has to handle the "shuffle" of tasks inside of it.
cache/persist: try to persist your data before huge transformations, if you're afraid of your executors not being able to handle it use StorageLevel.DISK_AND_MEMORY, so you're able to use both.
Important: all of this is based on my own experience alone training algorithms using Spark ML over datasets with 1TB-5TB and 30-50 features, I've researched to improve my own jobs but I'm not qualified as a source of truth for your problem. Learn more about your data and watch the logs of your executors for further enhancements.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string