Huge latency in spark streaming job - apache-spark

I have a near-real-time Spark Streaming application for image recognition in which receivers get the input frames from Kafka. There are 6 receivers per executor and 5 executors in total, so I can see 30 active tasks per iteration on the Spark UI.
My problem is that Spark is able to read 850 frames/sec from Kafka but processes the tasks very slowly, which is why I am facing backpressure-related issues. Within each batch, a task is expected to run a few TensorFlow models: it first loads them using keras.model_loads and then performs the related processing to get a prediction. The output of the 1st TensorFlow model is the input to the 2nd, which in turn loads another model and runs a prediction on top of it. Finally, the output of model #2 is the input to model #3, which does the same thing: loads its model and runs a prediction. The final prediction is sent back to Kafka on another topic. With this flow for each task, the overall latency to process a single task comes out somewhere between 10 and 15 seconds, which is huge for a Spark Streaming application.
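For illustration, a rough PySpark sketch of the per-task flow just described; the model paths, the preprocess step and frames_dstream are placeholders rather than the actual code:

```python
from keras.models import load_model

def predict_partition(frames):
    # The three models are loaded once per partition/task; paths are placeholders.
    model1 = load_model("/models/stage1.h5")
    model2 = load_model("/models/stage2.h5")
    model3 = load_model("/models/stage3.h5")
    for frame in frames:
        features = preprocess(frame)     # hypothetical frame decoding/preprocessing
        out1 = model1.predict(features)  # prediction of model #1
        out2 = model2.predict(out1)      # fed into model #2
        out3 = model3.predict(out2)      # fed into model #3
        yield out3                       # final prediction, later published to Kafka

# frames_dstream is the DStream built from the Kafka receivers
predictions = frames_dstream.mapPartitions(predict_partition)
```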
Can anyone help me make this program faster?
Remember that I have to use these custom TensorFlow models in my program to get the final output.
I have the following thoughts in mind:
Option 1 - Replace Spark Streaming with Structured Streaming.
Option 2 - Break up the sequential processing and put each sub-process in a separate RDD, i.e. model #1 processing in RDD1, model #2 processing in RDD2, and so on.
Option 3 - Rewrite the custom TensorFlow functionality in Spark itself; currently it is a single Python program that I use within each task. I am not sure about this option yet and have not even checked its feasibility, but my assumption is that if I can do it, I will have full control over how the models are distributed and may therefore get these tasks processed faster on the GPU machines of the AWS cluster, which is not happening currently.

Tuning a Spark job is the most time-consuming part; you can try the following options:
Go through this link, it is a must-read for any Spark job tuning: http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
Try to use direct Kafka ingestion instead of the receiver-based approach (see the sketch after this list).
Try to analyze the logs and find the most time-consuming part of your execution. If your custom code takes a long time because of its sequential processing, Spark tuning will not help.
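A minimal sketch of the direct (receiver-less) Kafka ingestion mentioned above, using the DStream kafka-0-8 integration that ships with Spark 2.x; the app name, broker list, topic and batch interval are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="frame-recognition")
ssc = StreamingContext(sc, 5)  # 5-second batch interval (placeholder)

# Direct (receiver-less) stream: each Kafka partition maps to one RDD partition
# and offsets are tracked by Spark instead of by dedicated receivers.
frames = KafkaUtils.createDirectStream(
    ssc,
    topics=["frames"],                                     # placeholder topic
    kafkaParams={"metadata.broker.list": "broker1:9092"})  # placeholder brokers

frames.count().pprint()  # replace with the actual model-scoring logic
ssc.start()
ssc.awaitTermination()
```

Among other things, the direct approach frees the cores currently tied up by the 30 receivers for the actual scoring work.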

Related

Spark Model serving: How to compute the state just once, then keep the state to answer all real-time requests? (recommendation engine)

I am trying to implement a recommendation engine that uses Kafka to collect real-time click data and then processes it with Spark Structured Streaming. My problem is how to serve predictions in near real time using this streaming dataframe.
What works fine is: collecting click data, sending it to Kafka, subscribing from Spark, and using Structured Streaming to compute a dataframe that describes the 'state of a visitor'. With this streaming dataframe in hand, just a few lines of code (business logic) determine the best recommendation.
Now my problem is how to put this into production. I could create an mlflow.pyfunc model, but it would not contain the 'state of a visitor' dataframe. Looking at model-serving frameworks, I understand that every inference request would create an independent runtime, which would have to run the whole data pipeline again.
My idea would be to have 1 Spark instance which would:
create this streaming dataframe
wait for incoming requests and answer them using the dataframe from (1.)
Is this a reasonable approach? If yes: How do I set this up? If no: What is the preferred way to do real-time recommendations?
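A rough sketch of what I have in mind, assuming the visitor-state streaming dataframe already exists as visitor_state, is an aggregation, and a SparkSession named spark is available; the table name, the request handler and best_recommendation are placeholders:

```python
# Keep the latest visitor state queryable inside the same Spark instance.
query = (visitor_state.writeStream
         .outputMode("complete")           # assuming the state is an aggregation
         .format("memory")                 # driver-local, continuously updated table
         .queryName("visitor_state_table")
         .start())

def recommend(visitor_id):
    # Called by whatever request handler (e.g. a small HTTP endpoint) fronts this instance.
    state = spark.sql(
        "SELECT * FROM visitor_state_table WHERE visitor_id = '{}'".format(visitor_id)
    ).collect()
    return best_recommendation(state)      # the 'few lines of business logic' mentioned above
```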

Spark realtime processing with 1 or multiple jobs in one application

I'm curious to find out what the best-practice approach is for designing Spark Streaming applications.
We have a number of data sources we want to ingest, clean and transform over Kafka using Spark Streaming.
The processing is broken down into 3 steps, each resulting in a new topic with a new structure, e.g. raw, standardised and logical.
The question relates to the design of the Spark Streaming applications. I see 3 options:
1 streaming application per step, meaning 3 running Spark jobs per source
1 streaming application per source, meaning 1 running Spark job that reads and writes multiple topics for the same source
1 streaming application for all sources and topics.
My intuition tells me that option 2 is the best trade-off, as option 1 results in far too many running Spark jobs and option 3 in too much complexity in a single job.
However, is it actually a good idea at all to have a single Spark job do more than one step of the pipeline? If the job were to stop or fail, could it be less reliable or result in data loss of some sort?
As confirmed in the comments section, the flow looks something like the following:
sources -> step1(raw) -> topic1 -> step2(standardized) -> topic2 -> step3(logical) -> target
I would keep the entire streaming pipeline in a single application (i.e. the 3rd option you mentioned). The benefits of this approach are:
No need to write the intermediate results (of steps 1 and 2) to disk, either to a Kafka topic or to files. Why involve disk I/O when the entire computation can be done in memory? That is the whole point of Spark's in-memory processing.
A single application is easy to maintain, i.e. all your transformation logic lives in one place. Adding a new transformation (step) to the same application is also easier than spawning a new application for each new step. A minimal sketch of this layout follows below.
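The sketch below assumes Structured Streaming; the broker, topics and the standardise/to_logical functions are placeholders, and only one source is shown for brevity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-logical-pipeline").getOrCreate()

# Step 1: read the raw events once from the source topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
       .option("subscribe", "source_topic")                # placeholder topic
       .load())

# Steps 2 and 3 stay in memory; no intermediate topics are written.
standardised = standardise(raw)     # hypothetical cleansing/standardisation
logical = to_logical(standardised)  # hypothetical business transformation

(logical.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker1:9092")
 .option("topic", "target_topic")                          # placeholder target
 .option("checkpointLocation", "/checkpoints/pipeline")    # enables recovery after failure
 .start()
 .awaitTermination())
```

The checkpointLocation is what lets Spark recover the query after a failure, which ties into the data-loss question below.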
Regarding your concern about data loss:
I am not quite sure about DStream-based streaming, but for Structured Streaming, if your application fails for whatever reason, Spark will reprocess the data of the most recent batch (the one whose job failed) as long as your source is replayable. So there won't be data loss, but there could be duplicate data. Check this link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#fault-tolerance-semantics
For DStream-based streaming I also believe there is a zero-data-loss guarantee. Check this link: https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
However, I don't have much hands-on experience with the DStream-based model, so I won't comment much on that.
Note: I have assumed that the intermediate results of steps 1 and 2 won't be used by any application or job other than steps 2 and 3. If you have to store the intermediate results, then the approach needs to be rethought.

5 Minutes Spark Batch Job vs Streaming Job

I am trying to figure out which would be the better approach.
I have a Spark batch job which is scheduled to run every 5 minutes, and it takes 2-3 minutes to execute.
Since Spark 2.0 added support for dynamic allocation via spark.streaming.dynamicAllocation.enabled, is it a good idea to make it a streaming job which pulls data from the source every 5 minutes?
What should I keep in mind when choosing between a streaming and a batch job?
Spark Streaming is an outdated technology; its successor is Structured Streaming.
If you do processing every 5 minutes, you are doing batch processing. You can use the Structured Streaming framework and trigger it every 5 minutes to imitate batch processing (see the sketch below), but I usually wouldn't do that.
Structured Streaming has many more limitations than normal Spark. For example, you can only write to Kafka or to a file; otherwise you need to implement the sink yourself using the Foreach sink. Also, if you use a file sink you cannot update it, only append to it. There are also operations that are not supported in Structured Streaming, and actions you cannot perform unless you do an aggregation first.
I might use Structured Streaming for batch processing if I read from or write to Kafka, because they work well together and everything is pre-implemented. Another advantage of Structured Streaming is that you automatically continue reading from the place where you stopped.
For more information refer to the Structured Streaming Programming Guide.
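A minimal sketch of such a 5-minute trigger, assuming Kafka on both ends; the broker, topics and the transform function are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("five-minute-micro-batch").getOrCreate()

source = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
          .option("subscribe", "input_topic")                 # placeholder topic
          .load())

result = transform(source)  # hypothetical business logic

(result.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker1:9092")
 .option("topic", "output_topic")                             # placeholder topic
 .option("checkpointLocation", "/checkpoints/five-minute-job")
 .trigger(processingTime="5 minutes")  # fires one micro-batch every 5 minutes
 .start()
 .awaitTermination())
```

Without the trigger, Structured Streaming would instead start a new micro-batch as soon as the previous one finishes.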
When deciding between streaming and batch, you need to look at various factors. I am listing some below; based on your use case, you can decide which is more suitable.
1) Input Data Characteristics - Continuous input vs batch input
If the input data arrives in batches, use batch processing.
If the input data arrives continuously, stream processing may be more useful; consider the other factors before reaching a conclusion.
2) Output Latency
If the required output latency is very low, consider stream processing.
If output latency does not matter, choose batch processing.
3) Batch size (time)
A general rule of thumb is to use batch processing if the batch size is > 1 minute; otherwise stream processing is required. This is because triggering/spawning a batch process adds latency to the overall processing time.
4) Resource Usage
What is the usage pattern of resources in your cluster?
Are there other batch jobs that execute when this one is done? If several batch jobs run one after another and use the cluster resources optimally, then batch jobs are the better option.
A batch job runs at its scheduled time, and the cluster resources sit idle afterwards. Consider running a streaming job if data arrives continuously; fewer resources may be required for processing, and the output becomes available with lower latency.
There are other things to consider as well: replay, manageability (streaming is more complex), the existing skills of the team, etc.
Regarding spark.streaming.dynamicAllocation.enabled, I would avoid using it, because if the input rate varies a lot, executors will be killed and created very frequently, which adds to the latency.

How to avoid Code Redundancy in Lambda Architecture?

We have an existing batch process which works as described below:
Hive SQL is used for daily batch processing.
Data is ingested either from files or from an RDBMS.
Data is ingested as Raw --> Staging --> Mart, where staging-to-mart holds all the business transformations and raw-to-staging is just cleansing and formatting of the data.
Now, to get real or near-real-time data, I am evaluating the Lambda Architecture, and this is the plan:
All the source systems will land on Kafka.
The same batch-processing system will consume the Kafka topics.
A new Spark application will consume the Kafka topics for streaming.
The serving layer will create views which combine the aggregate data from both streaming and batch for real (near-real) time processing.
The problem is that the logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid this or minimize it?
You can build your processing stages using Spark SQL and Spark Structured Streaming: https://spark.apache.org/docs/2.2.0/structured-streaming-programming-guide.html. Depending on your needs there may be some incompatibilities, but I'd try to build the Spark aggregations + transformations using the Dataset[_] API and then run them both ways, batch and streaming.
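A rough PySpark sketch of that idea (the DataFrame equivalent of the Dataset approach); the table names, Kafka settings, the aggregation itself and parse_events are placeholders:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("shared-logic")
         .enableHiveSupport()
         .getOrCreate())

def business_transform(df: DataFrame) -> DataFrame:
    # Hypothetical staging -> mart logic, written exactly once.
    return (df.withColumn("event_date", F.to_date("event_ts"))
              .groupBy("customer_id", "event_date")
              .agg(F.sum("amount").alias("daily_amount")))

# Batch path (what the daily HiveQL job does today).
staging = spark.table("staging.events")
business_transform(staging).write.mode("overwrite").saveAsTable("mart.daily_amounts")

# Streaming path reuses the same function on a streaming DataFrame.
events_stream = (spark.readStream.format("kafka")
                 .option("kafka.bootstrap.servers", "broker1:9092")
                 .option("subscribe", "events")
                 .load())
parsed = parse_events(events_stream)  # hypothetical: Kafka value bytes -> staging schema
(business_transform(parsed)
 .writeStream.outputMode("complete")
 .format("memory").queryName("daily_amounts_rt")  # placeholder speed-layer sink
 .start())
```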
The problem of a duplicated code base is inherent in the lambda architecture; it gets a mention in the 'Criticism' section of the Wikipedia page.
Another issue is that the data in batch and stream are not in sync, which can lead to unexpected results when bringing the data together, for example when joining across stream and batch while keys do not yet exist on the batch side.
I believe the lambda architecture comes from a belief that streaming is complex and expensive, so keep batch as much as possible and add streaming only for the elements that require near real time: we already have batch, so let's add a few streaming things.
An alternative architecture is to use streaming for everything. It is based on the realization that batch is a special case of streaming, so you do both your batch and your stream processing on a single streaming platform.
Further reading:
use Spark Structured Streaming for batch
lambda architecture issues and how using only streaming solves them
questioning the lambda architecture

Avoid chunk / batch processing in Spark

I often encounter a pattern of dividing big processing steps into batches when those steps can't be processed in their entirety on our Big Data Spark cluster.
For instance, we have a large cross join or some calculation that fails when done with all the input data, so we usually divide the Spark task into chunks so that the resulting mini-tasks can complete.
I particularly doubt this is the right way to do it in Spark.
Is there a recipe to solve this issue? Or, even with Spark, are we back to the old way of chunking/batching the work so that it can be completed on a small cluster?
Is this merely a question of repartitioning the input data so that Spark can do more sequential processing instead of parallel processing?
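For illustration, a hedged sketch of the repartitioning angle raised above; the paths and partition count are placeholders, and whether this actually avoids the failures depends on the job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-join-partitioning").getOrCreate()

# Many small shuffle partitions so each task handles only a small slice of the work.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

left = spark.read.parquet("/data/left").repartition(2000)  # placeholder path and count
right = spark.read.parquet("/data/right")                  # placeholder path

# The scheduler now does the "chunking": each task works on its own slice of
# `left` instead of a hand-written loop splitting the input.
result = left.crossJoin(right)
result.write.mode("overwrite").parquet("/data/cross_join_result")
```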
