How to reduce spark batch job creation overhead - apache-spark

We have a requirement where a calculation must be done in near real time (within 100 ms at most) and involves a moderately complex computation that can be parallelized easily. One of the options we are considering is to use Spark in batch mode apart from Apache Hadoop YARN. I've read, however, that submitting batch jobs to Spark has a huge overhead. Is there a way we can reduce or eliminate this overhead?

Spark makes the best use of the available resources, i.e. memory and cores, and it relies on the concept of data locality.
If data and the code that operates on it are together, then computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data, because code is much smaller than data.
If you are low on resources, scheduling and processing time will certainly shoot up. Spark builds its scheduling around this general principle of data locality.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible.
Check https://spark.apache.org/docs/1.2.0/tuning.html#data-locality

Related

Will Spark structured streaming benefit from dynamic allocation if number of cores more than number of Kafka partitions?

Suppose we have an application that reads from a topic with X partitions, does some filtering on the data, then saves it into storage (no complex shuffling logic, just some simple transformations) using a Structured Streaming query. Will this application benefit from the dynamic allocation feature adding more than X single-core executors in case of a data spike?
I am asking this because I've mostly worked with DStreams, where there is a quite well-known recommendation to have a single core per partition, so that every executor core is busy processing data from one partition and adding more executors usually does not give much scaling benefit. My intuition says no, because the data will still end up on the same workers, but I might be missing something.
Are you talking about dynamic allocation by YARN?
But you can use the minPartitions setting in Spark Structured Streaming.
Refer to https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
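As a minimal sketch of how the minPartitions setting might be passed to the Kafka source: the helper name, broker address, and topic below are placeholders that do not come from the question, while the option keys are the ones listed in the Kafka integration guide linked above.

```python
def kafka_source_options(bootstrap_servers, topic, min_partitions):
    """Build the option map for a Structured Streaming Kafka source.

    When minPartitions is larger than the number of topic partitions,
    Spark splits large Kafka partitions into smaller offset ranges, so
    more than one task can read from the same topic partition.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "minPartitions": str(min_partitions),
    }


# Hypothetical values: a small topic read by up to 16 tasks per micro-batch.
options = kafka_source_options("broker1:9092", "events", 16)

# With an existing SparkSession named `spark`, the options would be applied as:
#   stream = spark.readStream.format("kafka").options(**options).load()
```

Whether the extra tasks actually help depends on where the bottleneck is, so treat the value 16 as something to experiment with rather than a recommendation.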

Memory Management Pyspark

1.) I understand that "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the number of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM than Hive because it performs operations in memory, whereas Hive repeatedly reads from and writes to disk. Is that correct?
There is one angle that you need to consider here. You may run into memory problems if the data is not properly distributed. That means you need to distribute your data evenly (if possible) across the tasks, so that you reduce shuffling as much as possible and let each task manage its own data. If you need to perform a join and the data is distributed randomly, every task (and therefore executor) will have to:
See what data it has
Send data to the other executors (and tasks) that need the same keys
Request from the others the data that the task itself needs
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory plus whatever has been sent to it and any temporary objects. All of that can blow up memory.
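The effect of an uneven key distribution can be illustrated with a small pure-Python sketch. It uses integer keys and a plain modulo as a stand-in for Spark's hash partitioning, and the data is made up for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition when split by key."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# 1,000 records spread over 100 distinct keys.
even_keys = [i % 100 for i in range(1000)]
# 1,000 records where 900 share the single key 7.
skewed_keys = [7] * 900 + list(range(100))

even = partition_sizes(even_keys, 4)
skewed = partition_sizes(skewed_keys, 4)

print(even)    # [250, 250, 250, 250]
print(skewed)  # [25, 25, 25, 925]
```

In the skewed case one task has to hold 925 of the 1,000 records, which is the situation that puts a single executor under memory pressure while the others sit mostly idle.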
So to prevent that situation you can:
Load the data already partitioned. By that I mean, if you are loading from a DB, try Spark's JDBC stride partitioning; see the partitionColumn, lowerBound and upperBound attributes. That way you create a number of partitions on the dataframe that place the data on different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that their partitions are similar (not to say the same), which should help limit shuffling over the network.
When you define partitions, try to make the values as evenly distributed among tasks as possible.
Each partition should fit in memory. Although Spark can spill to disk, that slows down performance.
If you don't have a column that distributes the data evenly, try to create one with n distinct values, where n is the number of tasks you have.
If you are reading from a CSV, creating partitions is harder but still possible. You can either split the data (CSV) into multiple files and create multiple dataframes (performing a union after they are loaded), or read the big CSV and apply a repartition on the column you need. That causes a shuffle as well, but it is done only once if you cache the repartitioned dataframe.
When reading from Parquet you may have multiple files, but if they are not evenly sized (because the process that generated them did not do it well) you may end up with OOM errors. To prevent that, you can load the dataframe and apply a repartition to it too.
Another trick, valid for CSV, Parquet, ORC files, etc., is to create a Hive table on top of them and run a query from Spark with a DISTRIBUTE BY clause on the data, so that Hive redistributes the data instead of Spark.
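To make the stride idea for DB loads more concrete, here is a simplified sketch of how partitionColumn, lowerBound, upperBound and the number of partitions translate into one WHERE clause per partition. It approximates Spark's behaviour rather than reproducing its exact code, and the function name, column name, and bounds are hypothetical.

```python
def stride_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the WHERE clauses of a stride-partitioned JDBC read.

    The bounds only decide the stride. Rows outside the bounds are not
    filtered out: they fall into the first or the last partition.
    """
    if num_partitions < 2:
        return []  # a single partition reads the whole table unfiltered
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


for clause in stride_predicates("id", 0, 1000, 4):
    print(clause)

# With an existing SparkSession `spark` and placeholder `url`, `table`, `props`:
#   df = spark.read.jdbc(url, table, column="id", lowerBound=0,
#                        upperBound=1000, numPartitions=4, properties=props)
```

If the values of the column are not spread evenly between the bounds, the partitions will be uneven too, which is why the choice of column matters more than the bounds themselves.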
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (MapReduce, Tez, Hive on Spark, LLAP) you can see different behaviours. With MapReduce, as the operations are mostly disk-based, the chance of an OOM is much lower than with Spark. In fact, from a memory point of view, MapReduce is not much affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem.
Another consideration is whether you are testing in a dev environment that doesn't have the same data as prod. I suppose the data distribution should be similar, although volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass on the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach on dev and then fine-tune in prod.
The vast majority of OOM errors in Spark are on the driver, not on the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in driver memory.
Spark does a lot of work under the hood to parallelize the work. When using the structured APIs (in contrast to RDDs), the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that impacts performance and causes a lot of garbage collection, so you need to address it; however, Spark should be able to handle low memory without an explicit exception.
Not really. As above, Spark should be able to recover from memory issues when using the structured APIs; however, it may need intervention if you see heavy garbage collection and a performance impact.
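To put rough numbers on question 2 (more cores per executor means less memory per running task), here is a back-of-the-envelope sketch. It assumes Spark's default unified memory settings (300 MiB reserved, spark.memory.fraction = 0.6), one task per core, and an even split between running tasks; the 8 GiB heap and the function name are made up for illustration.

```python
RESERVED_MIB = 300      # fixed amount Spark keeps out of the executor heap
MEMORY_FRACTION = 0.6   # default value of spark.memory.fraction


def per_task_memory_mib(executor_heap_mib, executor_cores):
    """Rough share of unified (execution + storage) memory per running task."""
    unified = (executor_heap_mib - RESERVED_MIB) * MEMORY_FRACTION
    return unified / executor_cores


# Same 8 GiB heap, increasing number of cores (and thus concurrent tasks).
for cores in (2, 4, 8):
    print(cores, "cores ->", round(per_task_memory_mib(8192, cores)), "MiB per task")
```

This is only an estimate: the real split is dynamic and cached data competes for the same region, but it shows why doubling the cores without adding memory roughly halves what each task can use before it has to spill.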

In spark, is it possible to reuse a DataFrame's execution plan to apply it to different data sources

I have a fairly complex PySpark pipeline which takes 20 minutes to come up with an execution plan. Since I have to execute the same pipeline multiple times with different data frames (as the source), I'm wondering whether there is any option to avoid building the execution plan every time. Can I build the execution plan once and reuse it with different source data?
There is a way to do what you ask but it requires advanced understanding of Spark internals. Spark plans are simply trees of objects. These trees are constantly transformed by Spark. They can be "tapped" and transformed "outside" of Spark. There is a lot of devil in the details and thus I do not recommend this approach unless you have a severe need for it.
Before you go there, it is important to look at other options, such as:
Understanding what exactly is causing the delay. On some managed platforms, e.g., Databricks, plans are logged in JSON for analysis/debugging purposes. We have sometimes seen delays of 30+ minutes with the CPU pegged at 100% on a single core while a plan produces tens of megabytes of JSON and pushes them over the wire. Make sure something like this is not happening in your case.
Depending on your workflow, if you have to do this with many datasources at the same time, use driver-side parallelism to analyze/optimize plans using many cores at the same time. This will also increase your cluster utilization if your jobs have any skew in the reduce phases of processing.
Investigate the benefit of Spark's analysis/optimization to see if you can introduce analysis barriers to speed up transformations.
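The driver-side parallelism option above can be sketched with a thread pool. This is only an outline: run_pipeline is a hypothetical placeholder for your own pipeline code, and the source names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(source_name):
    """Placeholder for building and running the pipeline on one source.

    In a real job this would read the source into a DataFrame, apply the
    transformations, and trigger an action on a shared SparkSession.
    """
    return f"{source_name}: done"


sources = ["source_a", "source_b", "source_c"]  # hypothetical source names

# Each thread plans and submits its own job, so the plan analysis for
# different sources overlaps instead of running one after another.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_pipeline, sources))

print(results)
```

Threads rather than processes are used here on the assumption that all jobs share one SparkSession and that the expensive planning work happens on the JVM side; if that does not hold in your setup, the benefit may be smaller.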
This is impossible because the source DataFrame affects the execution of the optimizations applied to the plan.
As @EnzoBnl pointed out, this is not possible, as Tungsten applies optimisations specific to the object. What you could do instead (if possible with your data) is to split your large file into smaller chunks that can be shared between the multiple input dataframes, and use persist() or checkpoint() on them.
Specifically, checkpoint makes the execution plan shorter by storing a mid-point, but there is no way to reuse the plan.
See
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.

Difference between one-pass and multi-pass computations

I'm reading an article on Apache Spark and I came across the following sentence:
"Hadoop as a big data processing technology has been around for 10 years and has proven to be the solution of choice for processing large data sets. MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms." (Full article)
Searching the web yields results about the difference between one-pass and multi-pass compilers (for instance, see this SO question).
However, I'm not really sure if the answer also applies to data processing. Can somebody explain to me what one-pass and multi-pass computations are, why the latter is better, and thus why it is used in Spark?
Map Reduce
Source : https://www.guru99.com/introduction-to-mapreduce.html
Here you can see that the input file is processed as follows:
Splitting
Mapping
Shuffling
Reducing
In the MapReduce paradigm, the intermediate result is written to disk after every stage. Also, the Mapper and the Reducer are two different processes. That is, first the mapper job runs and writes out the mapping files, then the reducer job starts. At every stage the job requires resource allocation. Therefore, a single MapReduce job requires multiple iterations. If you have multiple map phases, after every map the data needs to be written out to disk before the next map task starts. This is the multi-step process.
Each step in the data processing workflow has one Map phase and one Reduce phase, and you'll need to convert any use case into the MapReduce pattern to leverage this solution.
Spark
On the other hand, Spark does the resource negotiation only once. Once the negotiation is completed, it spawns all the executors, and they stay up for the duration of the job.
During execution, Spark doesn't write the intermediate output of the map phases to disk but keeps it in memory. Therefore, all the map operations can happen back to back without writing to disk or spawning new executors. This is the single-step process.
Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
A one-pass computation reads the dataset once, whereas a multi-pass computation reads the dataset once from disk and then performs multiple computations or operations on that same dataset. Apache Spark lets you read the data once, cache it in memory, and then perform multi-pass computations on it. These computations run very quickly because the data is already in the machine's memory, so Spark does not need to read it from disk again, which saves a lot of I/O time. Apache Spark is by definition an in-memory processing framework, which means the data and the transformations being computed are held in memory.
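The difference can be illustrated with a small pure-Python toy, not Spark code. A variance needs two passes over the data (one for the mean, one for the squared deviations); the sketch counts how often the "disk" is read when every pass re-reads the source versus when the data is read once and kept in memory. All names are invented for the example.

```python
class CountingSource:
    """Stand-in for a dataset on disk that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._values)


def variance_rereading(source):
    """Two passes, each one re-reading the source from 'disk'."""
    data = source.read()
    mean = sum(data) / len(data)
    data = source.read()
    return sum((x - mean) ** 2 for x in data) / len(data)


def variance_cached(source):
    """Two passes over a single in-memory copy of the data."""
    data = source.read()
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


disk_a = CountingSource([1, 2, 3, 4])
disk_b = CountingSource([1, 2, 3, 4])
print(variance_rereading(disk_a), disk_a.reads)  # 1.25 2
print(variance_cached(disk_b), disk_b.reads)     # 1.25 1
```

Both versions give the same result; the cached one simply touches the source once. With iterative algorithms that make dozens of passes, that saving is what the quoted article is referring to.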

setting tuning parameters of a spark job

I'm relatively new to Spark and I have a few questions about tuning with respect to the spark-submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to choose the number of executors, memory and cores when I have a relatively small operation to run, since if I give it maximum resources they will be underutilised.
For instance,
if I just have to do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) for about 60-70 GB of data (assume each file is 128 MB in size, which is the HDFS block size) in Avro format without compression, what would be the ideal memory, number of executors and cores for this?
Assume my nodes are configured the same as in the link above.
I can't understand how much memory will be used by the entire job, given that there are no joins, aggregations, etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it and writing it out, then you will need very little memory per CPU, because the dataset is never fully materialized before it is written out. If you're doing joins, group-bys or other aggregate operations, all of those will require much more memory. The exception to this rule is that Spark isn't really tuned for large files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately, the best way to get your answers is to run your job with the default parameters and see what blows up.
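As a starting point for that experiment, the number of read tasks can be estimated with simple arithmetic. The sketch below assumes one task per 128 MB file, as in the question, and uses 64 GiB as a point inside the stated 60-70 GB range; the executor and core counts are hypothetical.

```python
import math


def input_tasks(total_gib, block_mib=128):
    """Number of read tasks if each input file is one HDFS block."""
    return math.ceil(total_gib * 1024 / block_mib)


def waves(num_tasks, executors, cores_per_executor):
    """Rounds of tasks needed given the number of parallel task slots."""
    slots = executors * cores_per_executor
    return math.ceil(num_tasks / slots)


tasks = input_tasks(64)
print(tasks)               # 512
print(waves(tasks, 4, 4))  # 32
```

Since each task only streams through one 128 MB file at a time, this suggests that core count and I/O throughput, rather than memory, set the runtime of the read side. The estimate covers the read tasks only; the effect of the final coalesce into a single file is not modelled here.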
