Can Kafka-Spark Streaming pair be used for both batch+real time data? - apache-spark

Hi All,
I am currently working on developing an architecture which should be able to handle both real-time and batch data (coming from disparate sources and point solutions - third-party tools). The existing architecture is old school and uses mostly RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
But I have been told to use the Kafka-Spark Streaming pair for handling all kinds of data.
If anyone has experience working with the Kafka-Spark Streaming pair for handling all kinds of data, could you please briefly tell me whether this would be a viable solution, and better than having two different pipelines?
Thanks in advance!

What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
Pipeline 1: Sqoop is a good choice for batch load, but it will be slow, because the underlying architecture is still MapReduce. There are options to run Sqoop on Spark, though I haven't tried them. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. Having said that, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time. You can also do batch transformations in Spark. I would say this is a good solution.
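For illustration, a minimal sketch of replacing Sqoop with a parallel Spark JDBC read that lands Parquet in HDFS for Hive. The connection URL, table, column names and paths here are hypothetical, so adapt them to your own RDBMS:

```scala
import org.apache.spark.sql.SparkSession

object JdbcBatchIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdbms-batch-ingest")
      .getOrCreate()

    // Read a table from the RDBMS in parallel; partitionColumn and the
    // bounds split the fetch across executors instead of using a single
    // JDBC connection.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/sales")  // hypothetical source
      .option("dbtable", "orders")
      .option("user", "etl_user")
      .option("password", sys.env("DB_PASSWORD"))
      .option("partitionColumn", "order_id")             // numeric key column
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "8")
      .load()

    // Land the data in HDFS as Parquet, partitioned for downstream
    // Hive/Spark batch queries.
    orders.write
      .mode("overwrite")
      .partitionBy("order_date")
      .parquet("hdfs:///warehouse/orders")

    spark.stop()
  }
}
```

This needs the JDBC driver for your database on the classpath (e.g. via `--jars`), and the partition bounds should roughly cover the range of the key column.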
Pipeline 2: Kafka and Spark Streaming are the most obvious choice, and a good one. But if you are using the Confluent distribution of Kafka, then you could replace most of the Spark transformations with KSQL or Kafka Streams, which perform real-time transformations.
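The Kafka side of the pipeline could be sketched with Structured Streaming roughly like this (broker addresses, the topic name and the HDFS paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-ingest")
      .getOrCreate()

    // Subscribe to a topic; "events" and the broker list are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      .load()

    // Kafka rows expose key/value as binary; cast before transforming.
    val events = raw.selectExpr("CAST(value AS STRING) AS json")

    // Continuously land micro-batches as Parquet; the checkpoint directory
    // is what lets the query resume after a restart.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///landing/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()

    query.awaitTermination()
  }
}
```

This requires the `spark-sql-kafka-0-10` connector package matching your Spark version.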
I would say it's good to have one system for batching and one for real-time. This is what the Lambda architecture is. But if you are looking for a more unified framework, you can try Apache Beam, which provides a unified framework for both batch and real-time processing and lets you choose from multiple runners to execute your pipeline.
Hope this helps :)

Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.

Related

Streaming analytics using Apache Kafka

We are collecting streaming data from devices (Android, iOS). The data flow is: websocket -> logstash -> kafka -> spark -> cassandra. RAM is 16 GB. Our app is an OTT platform, and when a video is streaming it sends events to Kafka for analytics purposes. Currently, memory overflows quickly when 4 or 5 videos are playing in parallel.
What might be the issue? Is it any configuration mistake? Is there any other better approach for our requirement?
I'll answer your broad question with a broad answer.
Is Logstash / Kafka / Spark / Cassandra a 'correct' architecture?
There's nothing particularly wrong with that approach. It depends on what processing you're doing, and why you're landing it to Cassandra. You'll find plenty of people taking this approach, whilst others may use different stream processing e.g. Kafka Streams, as well as not always using a data store (since Apache Kafka persists data) - depends on what's consuming the data afterwards.
Can my system handle more than 10,000 user activities at a time with this architecture?
Yes. No. It depends on way too many factors to give an answer. 10,000 users doing a simple activity with small volumes of data is hugely different from 10,000 users requiring complex processing on large volumes of data.
The only way to get an answer to this, and to evaluate your architectural choice in general, is to analyse the behaviour of your system as you increase [simulated] user numbers. Do particular bottlenecks appear that indicate the need for greater hardware scale, or even different technology choices?

Spark Structured Streaming Checkpoint Compatibility

Am I safe to use Kafka and Spark Structured Streaming (SSS) (>=v2.2) with checkpointing on HDFS in cases where I have to upgrade the Spark library or when changing the query? I'd like to seamlessly continue with the offset left behind even in those cases.
I've found different answers when searching the net for compatibility issues in SSS's (>=2.2) checkpoint mechanism. Maybe someone out there can shed some light on the situation - in the best case backed up with facts/references or first-person experience?
In Spark's programming guide (current=v2.3) they just say it "..should be a directory in an HDFS-compatible" file system, but don't leave a single word about constraints in terms of compatibility.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks at least gives some hints that this is an issue at all.
https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-after-changes-in-a-streaming-query
A Cloudera blog recommends storing the offsets in ZooKeeper instead, but this actually refers to the "old" Spark Streaming implementation. Whether it applies to Structured Streaming, too, is unclear.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Someone in this conversation claims that there is no problem in that regard anymore... but without pointing to facts.
How to get Kafka offsets for structured query for manual and reliable offset management?
Help is highly appreciated.
Checkpoints are great when you don't need to change the code; fire-and-forget procedures are the perfect use case.
I read the post from Databricks you posted; the truth is that you can't know what kind of changes you'll be called on to make until you have to make them. I wonder how they can predict the future.
About the Cloudera link: yes, they are talking about the old procedure, but with Structured Streaming, code changes still void your checkpoints.
So, in my opinion, that much automation is only good for fire-and-forget procedures.
If that is not your case, saving the Kafka offsets elsewhere is a good way to restart from where you left off last time; Kafka can retain a lot of data, and neither restarting from zero (to avoid data loss) nor restarting from the latest offset is always acceptable.
Remember: any stream logic change will be ignored as long as there are checkpoints, so you can't make changes to your job once deployed unless you accept the idea of throwing away the checkpoints.
By throwing away the checkpoints you force the job to either reprocess the entire Kafka topic (earliest) or start right at the end (latest), skipping unprocessed data.
It's great, is it not?
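One way to save the Kafka offsets elsewhere is a `StreamingQueryListener` that records the end offsets of every completed micro-batch; a rewritten job can then resume via the `startingOffsets` option even after the checkpoint directory has been thrown away. A minimal sketch, where `saveToExternalStore` is a placeholder for whatever durable store you choose (a DB table, ZooKeeper, an HDFS file...):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class OffsetRecorder extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { source =>
      // endOffset is a JSON string like {"events":{"0":42,"1":17}},
      // directly usable as the "startingOffsets" option on restart.
      saveToExternalStore(source.description, source.endOffset)
    }
  }

  private def saveToExternalStore(source: String, offsetsJson: String): Unit = {
    // hypothetical: persist (source, offsetsJson) somewhere durable
    println(s"$source -> $offsetsJson")
  }
}

// registration, once per SparkSession:
// spark.streams.addListener(new OffsetRecorder)
```

Note the offsets are recorded after each batch completes, so on restart you may reprocess the last in-flight batch; your sink should be idempotent.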

Processing log files: Apache Storm or Spark

I have a requirement to process log file data. It is relatively trivial. I have 4 servers with 2 web applications running on each for a total of 8 log files. These get rotated on a regular basis. I'm writing data in the following format into these log files
Source Timestamp :9340398;39048039;930483;3940830
Where the numbers are identifiers in a data store. I want to set up a process to read these logs and for each id it will update a count depending on the number of times its id has been logged. It can either be real time or batch. My interface language to the datastore is Java. The process runs in production so needs to be robust but also needs to have a relatively simple architecture so it is maintainable. We also run zookeeper.
My initial thought was to do this in a batch whenever the log file is rotated, running Apache Spark on each server. However, I then looked at log aggregators such as Apache Flume, Kafka and Storm, but this seems like overkill.
Given the multitude of choices has anyone got any good suggestions as to which tools to use to handle this problem based on experience?
8 log files don't seem to warrant any "big data" technology. If you do want to play with / get started with this type of technology, I'd recommend you start with Spark and/or Flink - both have relatively similar programming models and both can handle "business real-time" (Flink is better at streaming, but both would seem to work in your case). Storm is relatively rigid (hard to change topologies) and has a more complex programming model.
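If you do go with Spark, a batch job over the rotated files could be sketched like this. The path and the assumption that everything after the first ':' is the id list are taken from the format shown in the question and may need adjusting:

```scala
import org.apache.spark.sql.SparkSession

object LogIdCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("log-id-counts")
      .getOrCreate()
    val sc = spark.sparkContext

    // Lines look like: "Source Timestamp :9340398;39048039;930483;3940830"
    val counts = sc.textFile("hdfs:///logs/app-*.log")  // hypothetical path
      .flatMap { line =>
        line.split(":").lastOption.toSeq   // keep the id list after ':'
          .flatMap(_.split(";"))
          .map(_.trim)
          .filter(_.nonEmpty)
      }
      .map(id => (id, 1L))
      .reduceByKey(_ + _)                  // distributed count per id

    // The per-id counts can then be pushed to your datastore through its
    // Java client, e.g. in foreachPartition to batch the updates.
    counts.collect().foreach { case (id, n) => println(s"$id -> $n") }
    spark.stop()
  }
}
```

Honestly, at 8 files a plain Java program reading the rotated files and updating the counts directly would also be perfectly adequate.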

Spark: Importing Data

I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double check that since all machines on the cluster can access the files (which is a requirement by spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic like I would in any ordinary program?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.
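Putting the two points together in a spark-shell style sketch (`treeDF` is from the question; the path, the filter and the column index are made-up examples):

```scala
// Distributed read: each of the 64 partitions is read by whichever
// executor runs that task, not by a single machine.
val lines = sc.textFile("hdfs:///data/trees.csv", minPartitions = 64)

// Still distributed: the filter and count run on the executors.
val oakCount = lines.filter(_.contains("oak")).count()

// collect() pulls every row back to the driver as a local Array. From
// this point on nothing is distributed (and a large dataset can OOM the
// driver), so keep the heavy work on the RDD/DataFrame side.
val treeArr = treeDF.collect()
val heights = treeArr.map(row => row.getDouble(2))  // plain local Scala
```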

Parallelism of Streams in Spark Streaming Context

I have multiple input sources (~200) coming in on Kafka topics - the data for each is similar, but each must be run separately because there are differences in schemas - and we need to perform aggregate health checks on the feeds (so we can't throw them all into 1 topic in a simple way, without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it is only running the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time - is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that didn't work as desired. Any design suggestions are also welcome, if there is not an easy technical solution.
Thanks
The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default it is FIFO... it only worked for me once I explicitly wrote the file (I couldn't set it programmatically for some reason).
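For reference, a minimal fairscheduler.xml of the kind described (the pool name is arbitrary):

```xml
<?xml version="1.0"?>
<!-- fairscheduler.xml: a pool so concurrent streams share executors fairly -->
<allocations>
  <pool name="streams">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

It takes effect with `--conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml`, and each thread that submits jobs opts into the pool with `sc.setLocalProperty("spark.scheduler.pool", "streams")`.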
