Avoid chunk / batch processing in Spark - apache-spark

I often encounter a pattern of dividing big processing steps into batches when those steps can't be processed in one go on our Spark cluster.
For instance, when a large cross join or some calculation fails if run over all the input data, we usually divide the Spark job into chunks so that the smaller tasks can complete.
I doubt this is the right way to do it in Spark.
Is there a recipe for this problem? Or, even with Spark, are we back to the old way of chunking/batching the work so that it can be completed on a small cluster?
Is this simply a matter of repartitioning the input data so that Spark does more sequential processing and less parallel processing?
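For concreteness, the chunking pattern described above usually looks something like the loop below. This is only an illustrative sketch: the bucketing column, chunk count, paths, and the heavy_step stand-in are made up, and the repartitioned variant is shown just to contrast the two shapes of the job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("chunking-example").getOrCreate()

def heavy_step(frame):
    # stand-in for the expensive cross join / calculation described above
    return frame.crossJoin(frame.limit(100)).withColumn("score", F.rand())

df = spark.read.parquet("hdfs:///input")             # hypothetical input path

# Chunked version: run the heavy step once per slice of the data.
for chunk_id in range(10):
    chunk = df.filter(F.col("bucket") == chunk_id)   # assumes a bucketing column exists
    heavy_step(chunk).write.mode("append").parquet("hdfs:///output_chunked")

# Repartitioned alternative: one job, but more and smaller tasks.
heavy_step(df.repartition(2000)).write.mode("overwrite").parquet("hdfs:///output_repartitioned")
```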

Related

Huge latency in spark streaming job

I have a near-real-time Spark Streaming application for image recognition, where receivers get the input frames from Kafka. I have 6 receivers per executor and 5 executors in total, so I can see 30 active tasks per iteration on the Spark UI.
My problem is that Spark is able to read 850 frames/sec from Kafka but processes the tasks very slowly, which is why I am facing backpressure-related issues. Within each batch, the task is expected to run a few TensorFlow models by first loading them with keras.models.load_model and then performing the related processing to get a prediction from the model. The output of the 1st TensorFlow model is the input to the 2nd model, which in turn loads another model and performs a prediction on top of it. Finally, the output of model #2 is the input to model #3, which does the same thing: load the model and perform a prediction. The final prediction is sent back to Kafka on another topic. With this process flow for each task, the overall latency to process a single task comes to somewhere between 10 and 15 seconds, which is huge for a Spark Streaming application.
Can anyone help me make this program faster?
Keep in mind that I have to use these custom TensorFlow models in my program to get the final output.
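For clarity, the per-task flow described above looks roughly like the sketch below; the model file paths are placeholders, not the actual code.

```python
from keras.models import load_model

def process_frame(frame):
    # every task currently reloads all three models before predicting,
    # which likely accounts for much of the 10-15 s per-task latency
    model1 = load_model("model1.h5")
    model2 = load_model("model2.h5")
    model3 = load_model("model3.h5")

    out1 = model1.predict(frame)    # output of model #1 feeds model #2
    out2 = model2.predict(out1)     # output of model #2 feeds model #3
    return model3.predict(out2)     # final prediction, sent back to Kafka
```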
I have the following options in mind:
Option 1 - Replace Spark Streaming with Structured Streaming
Option 2 - Break up the sequential processing and put each sub-process in a separate RDD, i.e. model #1 processing in RDD1, model #2 processing in RDD2, and so on
Option 3 - Rewrite the custom TensorFlow functionality in Spark only; currently it is a single Python program that I use with each task. I am not sure about this option yet and haven't even checked its feasibility so far, but I assume that if I can do it, I will have full control over the distribution of the models, and may therefore get faster processing of these tasks on the GPU machines in the AWS cluster, which is not happening currently.
Tuning a Spark job is the most time-consuming part. You can try out the following options:
Go through this link; it is a must-read for any Spark job tuning: http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
Try to use direct Kafka ingestion instead of the receiver-based approach (see the sketch after this list).
Analyze the logs and find the most time-consuming part of your execution. If your custom code takes a long time because of sequential processing, Spark tuning will not help.
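A minimal sketch of the direct (receiver-less) Kafka approach, assuming the spark-streaming-kafka-0-8 integration; the topic name, broker address, and process_partition function are placeholders, not the asker's actual setup.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def process_partition(records):
    # placeholder for the per-partition model inference described above
    for _ in records:
        pass

sc = SparkContext(appName="image-recognition")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second batches (illustrative)

# One direct stream replaces the receiver tasks; Kafka partitions map 1:1 to
# Spark partitions, so parallelism is controlled on the Kafka side.
frames = KafkaUtils.createDirectStream(
    ssc,
    topics=["frames"],                                   # hypothetical topic
    kafkaParams={"metadata.broker.list": "broker:9092"}  # hypothetical broker
)

frames.foreachRDD(lambda rdd: rdd.foreachPartition(process_partition))

ssc.start()
ssc.awaitTermination()
```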

Tuning model fits in Spark ML

I'm fitting a large number of models in PySpark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up the individual fits.
My data set is a Spark DataFrame of approximately 50 GB, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with executor memory set to 10 GB. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
As with any Spark job, there are a few things you can do to improve your training time (a configuration sketch follows this list).
spark.driver.memory: keep an eye on your driver memory. Some algorithms shuffle data to the driver (to reduce computing time), so it can be a source of improvement, or at least a point of failure to watch.
spark.executor.memory: set it to the maximum the job needs, but also as low as possible, so you can fit more executors on each node (machine) in the cluster; with more workers you have more compute power to handle the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate the data, try different values for this parameter so that you can execute more tasks per executor.
spark.executor.cores: keep it below 5 and you're good; above that, you will probably increase the time an executor spends juggling the tasks inside it.
cache/persist: try to persist your data before huge transformations. If you're afraid your executors won't be able to hold it all in memory, use StorageLevel.MEMORY_AND_DISK so you can use both.
Important: all of this is based solely on my own experience training Spark ML algorithms on datasets of 1-5 TB with 30-50 features. I've done this research to improve my own jobs, but I'm not a source of truth for your problem. Learn more about your data and watch your executor logs for further improvements.
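A minimal sketch of how those settings could be passed, with purely illustrative values; the right numbers depend on your cluster and data, and memory settings (especially driver memory) are usually passed on spark-submit rather than set after the driver has started.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = (
    SparkSession.builder
    .appName("lr-tuning")
    .config("spark.executor.memory", "10g")         # as much as needed, as little as possible
    .config("spark.executor.cores", "4")            # keep it below 5
    .config("spark.sql.shuffle.partitions", "400")  # tune against the ~340 MB shuffles
    .getOrCreate()
)

df = spark.read.format("libsvm").load("path/to/data")  # hypothetical path
df.persist(StorageLevel.MEMORY_AND_DISK)               # spill to disk only if memory runs out
```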

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run k-means on EC2 using Spark's MLlib. As I was reading through the tutorial, I found the following code snippet at
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
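(The snippet in question is roughly the PySpark k-means example from that page, reproduced below as a sketch for context; the data path and parameters are the documentation's illustrative values, and sc is the already-created SparkContext.)

```python
from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the data (sc is the existing SparkContext)
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
```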
I am having trouble understanding how this code runs inside the cluster. Specifically:
After submitting the code to the master node, how does Spark know how to parallelize the job? There seems to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation as well?
How do the nodes communicate the partial result of each iteration? Is this handled inside the KMeans.train code, or does Spark core take care of it automatically?
Spark divides the data into many partitions. For example, if you read a file from HDFS, the partitions should match the partitioning of the data in HDFS. You can manually specify the number of partitions with repartition(numberOfPartitions). Each partition can be processed on a separate node, thread, etc. Sometimes data is partitioned by e.g. a HashPartitioner, which looks at the hash of the data.
The number and size of the partitions generally tell you whether the data is distributed/parallelized correctly. The creation of partitions is hidden in the RDD.getPartitions methods (see the sketch below).
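A minimal sketch of inspecting and changing the partitioning, assuming an existing SparkContext sc and a purely illustrative HDFS path.

```python
rdd = sc.textFile("hdfs:///data/events")   # partitions follow the HDFS blocks
print(rdd.getNumPartitions())              # how the data is currently split

repartitioned = rdd.repartition(200)       # manually pick the parallelism
print(repartitioned.getNumPartitions())    # now 200 partitions
```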
Resource scheduling depends on the cluster manager. We could write a very long post about them ;) I think that for this question, the partitioning is the most important part. If not, please let me know and I will edit the answer.
Spark serializes the closures that are given as arguments to transformations and actions. Spark creates a DAG, which is sent to all executors, and the executors execute this DAG on the data: they launch the closures on each partition.
Currently, after each iteration, the data is returned to the driver and then the next job is scheduled. In the Drizzle project, AMPLab/RISELab is adding the ability to create multiple jobs at one time, so data won't be sent back to the driver: the DAG is created once and, e.g., a job with 10 iterations is scheduled, with the shuffle between iterations limited or eliminated entirely. Currently, the DAG is created in each iteration and a job is scheduled to the executors every time.
There is a very helpful presentation about resource scheduling in Spark and Spark Drizzle.

How to enable dynamic repartitioning in Spark Streaming for uneven data load

I have a use case where the input stream data is skewed; the volume can range from 0 to 50,000 events per batch. Each data entry is independent of the others. Therefore, to avoid the shuffle caused by repartitioning, I want to use some kind of dynamic repartitioning based on the batch size, but I cannot get the size of the batch using a DStream count.
My use case is very simple: I have an unknown volume of data coming into the Spark Streaming process that I want to process in parallel and save to a text file. To run this in parallel I am using repartition, which has introduced a shuffle, and I want to avoid the shuffle caused by that repartition.
I want to know the recommended approach for handling skewed data in a Spark Streaming application.
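One commonly suggested pattern (not taken from the question itself) is to decide the partition count per batch inside foreachRDD, where the batch's RDD, and hence its count, is available. A minimal sketch, assuming an existing DStream named events; the per-partition target of 1,000 events and the output path are illustrative, and the extra count() does cost one additional pass over the batch.

```python
def save_batch(rdd):
    count = rdd.count()                      # the batch size is known here
    if count == 0:
        return                               # skip empty batches entirely
    num_partitions = max(1, count // 1000)   # aim for ~1,000 events per partition
    # in practice the output path must be unique per batch (e.g. include the batch time)
    rdd.repartition(num_partitions).saveAsTextFile("hdfs:///output/batch")

events.foreachRDD(save_batch)
```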

How to reduce spark batch job creation overhead

We have a requirement where a calculation must be done in near real time (within 100 ms at most) and involves a moderately complex computation that can be parallelized easily. One of the options we are considering is to use Spark in batch mode together with Apache Hadoop YARN. However, I've read that submitting batch jobs to Spark has a huge overhead. Is there a way we can reduce or eliminate this overhead?
Spark makes the best use of the available resources, i.e. memory and cores, and it uses the concept of data locality.
If the data and the code that operates on it are together, computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data, because code is much smaller than data.
If you are low on resources, scheduling and processing times will certainly shoot up. Spark builds its scheduling around this general principle of data locality.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible.
Check https://spark.apache.org/docs/1.2.0/tuning.html#data-locality
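A minimal sketch of the knob that tuning page points at: spark.locality.wait controls how long Spark waits for a data-local slot before scheduling a task at a worse locality level (it defaults to 3s). The value below is illustrative; trading locality for faster scheduling may matter under a ~100 ms budget.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("low-latency-calc")
    .config("spark.locality.wait", "100ms")  # wait less for locality when latency matters
    .getOrCreate()
)
```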
