Spark realtime processing with 1 or multiple jobs in one application - apache-spark

I'm curious to find out what the best practice is for designing Spark Streaming applications.
We have a number of data sources we want to ingest, clean and transform via Kafka using Spark Streaming.
The processing is broken down into 3 steps, each writing to a new topic with a new structure, e.g. raw, standardised and logical.
The question relates to the design of the Spark Streaming applications. I see 3 options:
1. One streaming application per step, meaning 3 running Spark jobs per source.
2. One streaming application per source, meaning 1 running Spark job that reads and writes multiple topics for the same source.
3. One streaming application for all sources and topics.
My intuition tells me that option 2 is the best tradeoff, as option 1 results in far too many running Spark jobs and option 3 puts too much complexity in a single job.
However, is it actually a good idea to have a single Spark job do more than 1 step in the pipeline at all? If the job were to stop or fail, could it be less reliable or result in data loss of some sort?

As confirmed in the comments section, the flow looks something like the following:
sources -> step1(raw) -> topic1 -> step2(standardized) -> topic2 -> step3(logical) -> target
I would keep the entire streaming pipeline in a single application (i.e. the 3rd option you mentioned). The benefits of this approach are:
No need to write the intermediate results (of steps 1 and 2) to disk (either to a Kafka topic or to files). Why involve disk IO when the entire computation can be done in memory? That is the whole
A single application is easier to maintain, i.e. all your transformation logic lives in one place. Adding a new transformation (step) to the same application is also easier than spawning a new application for it.
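To make the "all steps in one application" idea concrete, here is a minimal sketch in plain Python (no Spark involved). The three step functions and the record fields are made up for illustration. The point is only that each step's output is handed to the next step in memory, with no intermediate topic in between.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Step = Callable[[Record], Record]


def to_raw(record: Record) -> Record:
    # Step 1 (hypothetical): keep the fields as received, trimming whitespace.
    return {"source": record["source"], "payload": record["payload"].strip()}


def to_standardised(record: Record) -> Record:
    # Step 2 (hypothetical): normalise the payload.
    return {**record, "payload": record["payload"].lower()}


def to_logical(record: Record) -> Record:
    # Step 3 (hypothetical): build the final logical structure.
    return {"key": f"{record['source']}:{record['payload']}"}


def run_pipeline(records: Iterable[Record], steps: List[Step]) -> List[Record]:
    # Feed every record through all steps in order, entirely in memory.
    return [reduce(lambda rec, step: step(rec), steps, record) for record in records]


result = run_pipeline(
    [{"source": "A", "payload": "  Hello "}],
    [to_raw, to_standardised, to_logical],
)
print(result)
```

In Spark the steps would be DataFrame-to-DataFrame functions chained inside one query (for example with DataFrame.transform), but the shape is the same: adding a step means adding one function to the chain.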
Regarding your concern of data loss:
I am not quite sure about DStream-based streaming, but for Structured Streaming, if your streaming application fails for whatever reason, Spark will reprocess the data of the most recent batch (the one for which the job failed), as long as your source is replayable. So there won't be data loss, but there could be duplicate data. Check this link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#fault-tolerance-semantics
For DStream-based streaming I also believe there is a zero data loss guarantee. Check this link: https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
However, I don't have much hands-on experience with the DStream-based model, so I won't comment much on that.
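The "no data loss, but possible duplicates" point can be illustrated with a small plain-Python simulation (not Spark code). It assumes every event carries a unique id, here the hypothetical event_id, and compares an append-only sink with a sink keyed by that id when the same batch is replayed after a failure.

```python
from typing import Dict, List, Tuple

# (event_id, value); event_id is a hypothetical unique key carried by each event.
Event = Tuple[str, str]


def append_sink(store: List[Event], batch: List[Event]) -> None:
    # Append-only sink: a replayed batch is written a second time.
    store.extend(batch)


def idempotent_sink(store: Dict[str, str], batch: List[Event]) -> None:
    # Keyed sink: writing the same event again overwrites the earlier copy.
    for event_id, value in batch:
        store[event_id] = value


batch = [("e1", "a"), ("e2", "b")]
appended: List[Event] = []
keyed: Dict[str, str] = {}

# Attempt 1 writes the batch but fails before progress is committed,
# so the same batch is replayed as attempt 2.
for _attempt in range(2):
    append_sink(appended, batch)
    idempotent_sink(keyed, batch)

print(len(appended), len(keyed))
```

With the append-only sink the replay doubles the rows, while the keyed sink ends up with one row per event. So if duplicates matter, the sink (or a downstream step) has to be idempotent or deduplicate on a key.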
Note: I have assumed that the intermediate results of step 1 and step 2 won't be used by any application or job other than step 2 and step 3. If you have to store the intermediate results, then we need to rethink the approach.

Related

Best deduplication strategy to be used with spark

What is the best de-duplication strategy to be used with spark?
I have a Kafka source that is continuously fed with structured information (say JSON) from various producers.
I have an HDInsight Spark cluster that picks up messages from this Kafka source in real time, processes them and writes them to a destination Kafka topic in real time.
The information received from the source may contain duplicates, which need to be eliminated. Duplicates have to be checked against, say, the last 24 hours.
My attempt:
I tried using the dropDuplicates method in Spark along with watermarking, but I think it's not the best thing to do, since the data for a single day's window may exceed 50 GB in my use case.
I also looked for a Bloom filter implementation that can be used with Spark, but couldn't find a good one.
My question:
What are the possible approaches to eliminating duplicates in general for a large-scale Spark streaming application?
Which of these can be used with HDInsight clusters on Azure?
What fault tolerance capabilities do such services offer?

5 Minutes Spark Batch Job vs Streaming Job

I am trying to figure out which approach is better.
I have a Spark batch job which is scheduled to run every 5 minutes and takes 2-3 minutes to execute.
Since Spark 2.0 added support for dynamic allocation in streaming (spark.streaming.dynamicAllocation.enabled), is it a good idea to make it a streaming job which pulls data from the source every 5 minutes?
What should I keep in mind when choosing between a streaming and a batch job?
Spark Streaming is an outdated technology. Its successor is Structured Streaming.
If you process every 5 minutes, you are doing batch processing. You can use the Structured Streaming framework and trigger it every 5 minutes to imitate batch processing, but I usually wouldn't do that.
Structured Streaming has a lot more limitations than normal Spark. For example, the built-in sinks are limited to Kafka and files (plus console and memory sinks meant for debugging); for anything else you need to implement the sink yourself using the Foreach sink. Also, if you use a File sink you cannot update it, only append to it. Some operations are not supported in Structured Streaming at all, and others can only be performed after an aggregation.
I might use Structured Streaming for batch processing if I read from or write to Kafka, because they work well together and everything is pre-implemented. Another advantage of using Structured Streaming is that you automatically continue reading from the place where you stopped.
For more information refer to Structured Streaming Programming Guide.
When deciding between streaming and batch, one needs to look at various factors. I am listing some below; based on your use case, you can decide which is more suitable.
1) Input Data Characteristics - Continuous input vs batch input
If the input data arrives in batches, use batch processing.
If the input data arrives continuously, stream processing may be more useful. Consider the other factors to reach a conclusion.
2) Output Latency
If the required output latency is very low, consider stream processing.
If output latency does not matter, choose batch processing.
3) Batch size (time)
A general rule of thumb is to use batch processing if the batch size is > 1 min; otherwise stream processing is required. This is because triggering/spawning a batch process adds latency to the overall processing time.
4) Resource Usage
What's the usage pattern of resources in your cluster?
Are there other batch jobs that run once earlier batch jobs are done? If several batch jobs run one after another and together use the cluster resources optimally, then batch jobs are the better option.
If a batch job runs at its scheduled time and the cluster resources sit idle after that, consider running a streaming job when data is arriving continuously: fewer resources may be required for processing, and the output will become available with less latency.
There are other things to consider: replay, manageability (streaming is more complex), existing skills of the team, etc.
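As a rough illustration only, the first three factors above can be written down as a tiny decision helper. The function name, its parameters and the 60-second threshold (taken from the rule of thumb in point 3) are just a sketch of the reasoning, not a definitive rule. Resource usage and the other considerations are left out.

```python
def suggest_mode(continuous_input: bool, low_latency_needed: bool, interval_seconds: float) -> str:
    """Hypothetical helper encoding factors 1-3; resource usage is not modelled."""
    # Factor 1: input that arrives in batches is processed in batches.
    if not continuous_input:
        return "batch"
    # Factors 2 and 3: tight latency, or an interval of a minute or less, favours streaming.
    if low_latency_needed or interval_seconds <= 60:
        return "streaming"
    return "batch"


# The job from the question: runs every 5 minutes, no strict latency requirement stated.
print(suggest_mode(continuous_input=True, low_latency_needed=False, interval_seconds=300))
```

For the 5-minute job in the question this suggests staying with batch, unless the latency requirement changes.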
Regarding spark.streaming.dynamicAllocation.enabled, I would avoid using it: if the input rate varies a lot, executors will be killed and created very frequently, which adds latency.

How to avoid Code Redundancy in Lambda Architecture?

We have an existing batch processing system which works as described below:
Hive SQL is used for daily batch processing.
Data is ingested either from files or from an RDBMS.
Data flows Raw --> Staging --> Mart; staging to mart contains all the business transformations, while raw to staging is just cleansing and formatting of data.
Now, as part of getting real or near real time data, I am evaluating the Lambda Architecture, and this is the plan:
All the source systems are going to land on Kafka.
The same batch processing system will consume the Kafka topics.
A new Spark application will consume the Kafka topics for streaming.
The serving layer will create views that combine the aggregate data from streaming and batch for real (near real) time processing.
The problem is that the logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid or minimize this?
You can build your processing stages using Spark SQL and Spark Structured Streaming: https://spark.apache.org/docs/2.2.0/structured-streaming-programming-guide.html. Depending on your needs there can be some incompatibilities, but I'd try to build the Spark aggregations and transformations using the Dataset[_] API and then run them both ways, batch and streaming.
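The idea of writing the business logic once and running it both ways can be sketched in plain Python, without Spark. The transformation rule and the field names below are invented for illustration. A list stands in for the bounded batch input and an iterator for the unbounded stream.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def business_transform(rows: Iterable[Row]) -> Iterator[Row]:
    # Single definition of the staging-to-mart logic (hypothetical rule):
    # drop rows without an amount, normalise the customer name.
    for row in rows:
        if row["amount"] is not None:
            yield {"customer": str(row["customer"]).upper(), "amount": float(row["amount"])}


def run_batch(table: List[Row]) -> List[Row]:
    # Batch path: apply the logic to a complete, bounded table.
    return list(business_transform(table))


def run_stream(events: Iterable[Row]) -> Iterator[Row]:
    # Streaming path: apply the very same logic lazily, event by event.
    return business_transform(events)


daily = [{"customer": "acme", "amount": "10.5"}, {"customer": "zen", "amount": None}]
batch_out = run_batch(daily)
stream_out = list(run_stream(iter(daily)))
print(batch_out == stream_out)
```

In Spark terms, business_transform would be a function from Dataset to Dataset that is applied to both a batch read and a streaming read, so only the reading and writing code differs between the two paths.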
The problem of a duplicated code base is inherent in the lambda architecture. It gets a mention in the 'criticism' section of the Wikipedia page.
Another issue is that the data in batch and stream are not in sync, which can lead to unexpected results when bringing the data together. For example, joining across stream and batch when keys do not yet exist in batch.
I believe the lambda architecture comes from a belief that streaming is complex and expensive, so you keep batch as much as possible and add streaming only for those elements that require near-real time: "We already have batch, let's add a few streaming things."
An alternative architecture is to use streaming for everything. This is based on the realization that batch is a special case of streaming, so you do your batch and stream processing on a single streaming platform.
use spark structured streaming for batch
lambda architecture issues and how only using streaming solves them
questioning the lambda architecture

How to do multiple Kafka topics to multiple Spark jobs in parallel

Please forgive if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good fit for doing real time analytics on streaming data, which can then be pushed to a downstream sink such as HDFS/Hive/HBase etc.
I have 2 questions about that. I am not clear whether there is only 1 Spark streaming job running or multiple at any given time. Say I have different analytics I need to perform for each topic from Kafka, or each source that is streaming into Kafka, and then push the results of those downstream.
Does Spark allow you to run multiple streaming jobs in parallel so you can keep aggregate analytics separate for each stream, or in this case each Kafka topic? If so, how is that done? Is there any documentation you could point me to?
Just to be clear, my use case is to stream from different sources, and each source could have potentially different analytics I need to perform as well as different data structure. I want to be able to have multiple Kafka topics and partitions. I understand each Kafka partition maps to a Spark partition, and it can be parallelized.
I am not sure how you run multiple Spark streaming jobs in parallel though, to be able to read from multiple Kafka topics, and tabulate separate analytics on those topics/streams.
If not Spark, is this something that's possible to do in Flink?
Second, how does one get started with Spark? It seems there is a company and/or distro to choose for each component: Confluent-Kafka, Databricks-Spark, Hadoop-HW/CDH/MAPR. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipeline while limiting the number of vendors? It seems like such a huge task to even start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I would like to add that if you want to share data between streams, you can use the Spark SQL API. You can register your RDD as a SQL table and access the same table in all the streams. This is possible since all the streams share the same SparkContext.

Spark: processing multiple kafka topic in parallel

I am using spark 1.5.2. I need to run spark streaming job with kafka as the streaming source. I need to read from multiple topics within kafka and process each topic differently.
Is it a good idea to do this in the same job? If so, should I create a single stream with multiple partitions or different streams for each topic?
I am using the Kafka direct stream. As far as I know, Spark launches long-running receivers for each partition. I have a relatively small cluster, 6 nodes with 4 cores each. If I have many topics and many partitions in each topic, would the efficiency be impacted as most executors are busy with long-running receivers? Please correct me if my understanding is wrong here.
I made the following observations, in case it's helpful for someone:
With the Kafka direct stream there are no long-running receivers. At the beginning of each batch interval, the data is first read from Kafka in the executors. Once read, the processing part takes over.
If we create a single stream with multiple topics, the topics are read one after the other. Also, filtering the DStream in order to apply different processing logic would add another step to the job.
Creating multiple streams would help in two ways: 1. You don't need to apply the filter operation to process different topics differently. 2. You can read multiple streams in parallel (as opposed to one by one in the case of a single stream). To do so, there is an undocumented config parameter, spark.streaming.concurrentJobs. So, I decided to create multiple streams.
sparkConf.set("spark.streaming.concurrentJobs", "4");
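A back-of-the-envelope way to see why this helps, as a plain Python sketch: the per-topic processing times below are invented, and the concurrent case assumes that spark.streaming.concurrentJobs is at least the number of streams and that there are enough free cores, so the jobs do not slow each other down.

```python
from typing import Dict


def batch_time_sequential(per_topic_seconds: Dict[str, float]) -> float:
    # Topics handled one after the other: the times add up.
    return sum(per_topic_seconds.values())


def batch_time_concurrent(per_topic_seconds: Dict[str, float]) -> float:
    # Topics handled in parallel (idealised): the slowest topic dominates.
    return max(per_topic_seconds.values())


# Hypothetical per-batch processing times per topic, in seconds.
times = {"orders": 4.0, "clicks": 9.0, "logs": 2.0}
print(batch_time_sequential(times), batch_time_concurrent(times))
```

On a small cluster the real gain will be smaller than this idealised difference, because concurrent jobs compete for the same cores.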
I think the right solution depends on your use case.
If your processing logic is the same for data from all topics, then processing them together in a single stream is without doubt the better approach.
If the processing logic is different, I guess you get a single RDD from all the topics and have to create a paired RDD for each processing logic and handle it separately. The problem is that this groups the processing together, so the overall processing speed is determined by the topic that takes the longest to process. Topics with less data have to wait until the data from all topics has been processed. One advantage is that if it's time series data, the processing proceeds together, which might be a good thing.
Another advantage of running independent jobs is that you get better control and can adjust your resource sharing. For example, jobs that process a high-throughput topic can be allocated more CPU/memory.

Resources