Parallelism of Streams in Spark Streaming Context - apache-spark

I have multiple input sources (~200) coming in on Kafka topics. The data for each is similar, but each must be processed separately because the schemas differ and we need to perform aggregate health checks on the feeds (so we can't simply put them all into one topic without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it only runs the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time. Is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that didn't work as desired. Any design suggestions are also welcome if there is no easy technical solution.
Thanks

The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default it is FIFO... it only worked for me once I explicitly wrote the file (couldn't set it programmatically for some reason).
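For illustration, here is a minimal sketch of generating such an allocation file with one pool per feed. The pool names, weights and the helper function are assumptions made up for this example. Only the element names (allocations, pool, schedulingMode, weight, minShare) follow the format described on the linked job-scheduling page.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pool_names, weight=1, min_share=0):
    """Return the text of an allocation file with one FAIR pool per name."""
    root = ET.Element("allocations")
    for name in pool_names:
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical feed names; a real job would list its ~200 feeds here.
xml_text = build_fair_scheduler_xml(["feed_a", "feed_b"])
print(xml_text)
```

The generated text would be saved as fairscheduler.xml, with spark.scheduler.mode set to FAIR and spark.scheduler.allocation.file pointing at the file. Each stream's jobs can then be assigned to a pool through the spark.scheduler.pool local property.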

Related

Can Kafka-Spark Streaming pair be used for both batch+real time data?

Hi all,
I am currently developing an architecture which should be able to handle both real-time and batch data (coming from disparate sources and point solutions, i.e. third-party tools). The existing architecture is old school and mostly uses RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines: one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
But I have been told to use the Kafka-Spark Streaming pair for handling all kinds of data.
If anyone has experience using the Kafka-Spark Streaming pair for all kinds of data, could you please briefly explain whether this is a viable solution and whether it is better than having two different pipelines?
Thanks in advance!
Pipeline 1: Sqoop is a good choice for batch loads, but it will be slow because the underlying architecture is still MapReduce. There are options to run Sqoop on Spark, but I haven't tried them. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. Having said that, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time. You can also do batch transformations in Spark. I would say this is a good solution.
Pipeline 2: Kafka and Spark Streaming are the most obvious choice, and a good one. But if you are using the Confluent distribution of Kafka, you could replace most of the Spark transformations with KSQL or Kafka Streams, which perform the transformations in real time.
I would say it's good to have one system for batch and a separate one for real-time. This is what the lambda architecture is. But if you are looking for a more unified framework, you can try Apache Beam, which provides a unified framework for both batch and real-time processing. You can choose from multiple runners to execute your query.
Hope this helps :)
Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.

Spark Structured Streaming Checkpoint Compatibility

Is it safe to use Kafka and Spark Structured Streaming (SSS) (>=v2.2) with checkpointing on HDFS in cases where I have to upgrade the Spark library or change the query? I'd like to continue seamlessly from the last offset even in those cases.
I've found different answers when searching the net for compatibility issues in SSS's (>=2.2) checkpoint mechanism. Maybe someone out there can shed some light on the situation, ideally backed up with facts/references or first-hand experience?
Spark's programming guide (current=v2.3) just says "..should be a directory in an HDFS-compatible" but doesn't say a single word about compatibility constraints.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks at least gives some hints that this is an issue at all.
https://docs.databricks.com/spark/latest/structured-streaming/production.html#recover-after-changes-in-a-streaming-query
A Cloudera blog recommends storing the offsets in ZooKeeper instead, but this actually refers to the "old" Spark Streaming implementation. Whether this also applies to Structured Streaming is unclear.
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
Someone in this conversation claims that there is no longer a problem in that regard, but without pointing to any facts.
How to get Kafka offsets for structured query for manual and reliable offset management?
Help is highly appreciated.
Checkpoints are great when you don't need to change the code; fire-and-forget jobs are the perfect use case.
I read the Databricks post you linked. The truth is that you can't know what kind of changes you will be called on to make until you have to make them. I wonder how they can predict the future.
About the Cloudera link: yes, they are talking about the old procedure, but with Structured Streaming code changes still invalidate your checkpoints.
So, in my opinion, this much automation is good for fire-and-forget jobs.
If this is not your case, saving the Kafka offsets elsewhere is a good way to restart from where you left off last time. Kafka can hold a lot of data, so neither restarting from the earliest offset to avoid data loss nor restarting from the latest offset and skipping data is always acceptable.
Remember: any change to the stream logic will be ignored as long as there are checkpoints, so you can't change your job once it is deployed unless you accept throwing away the checkpoints.
If you throw away the checkpoints, you must either force the job to reprocess the entire Kafka topic (earliest) or start right at the end (latest), skipping unprocessed data.
It's great, is it not?
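As a rough sketch of what "saving the Kafka offsets elsewhere" could look like, the snippet below keeps offsets in a small JSON file. The file location, the function names and the feed name are all hypothetical, and a real job would more likely use a database or ZooKeeper. The nested topic -> partition -> offset layout is chosen because it matches the JSON shape that the Structured Streaming Kafka source accepts for its startingOffsets option, which only takes effect when a query starts without an existing checkpoint.

```python
import json
import os
import tempfile


def save_offsets(path, offsets):
    """Atomically write {topic: {partition: next_offset}} to path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(offsets, handle)
    # os.replace swaps the file in one step, so readers never see a partial write.
    os.replace(tmp_path, path)


def load_offsets(path):
    """Return the stored offsets, or an empty dict if nothing was saved yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


with tempfile.TemporaryDirectory() as workdir:
    offsets_path = os.path.join(workdir, "offsets.json")
    save_offsets(offsets_path, {"feed": {"0": 42, "1": 7}})
    restored = load_offsets(offsets_path)
    # JSON string that could be handed to the source when starting a fresh query.
    starting_offsets = json.dumps(restored)
```

Offsets should be saved only after the corresponding batch has been written out successfully; otherwise a restart could skip data.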

What is the most simple way to write to kafka from spark stream

I would like to write Spark Streaming data to Kafka.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and its sample code.
Is the above sample code the simplest way to write to Kafka?
If I follow the approach in that sample, I have to create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your options, which appear in different variations in the link you provided.
If we look at your task in a straightforward way, we can make several assumptions:
Your output data is divided to several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory / sink. The essential operation remains the same: you still request a producer object for each partition and use it to send the partition's records.
I suggest you continue with one of the examples in the provided link. The code is pretty short, and any library you find would most probably do the exact same thing behind the scenes.
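The one-producer-per-partition pattern can be sketched in plain Python as follows. RecordingProducer is a hypothetical stand-in that only records what it is asked to send; it is neither the Kafka nor the Spark API. In a real job, a function like send_partition would be handed to foreachPartition and would create an actual Kafka producer instead.

```python
class RecordingProducer:
    """Hypothetical stand-in for a Kafka producer; it only records sends."""

    created = 0  # number of producers constructed so far

    def __init__(self):
        RecordingProducer.created += 1
        self.sent = []
        self.closed = False

    def send(self, topic, value):
        self.sent.append((topic, value))

    def close(self):
        self.closed = True


def send_partition(records, topic, producer_factory=RecordingProducer):
    """Send every record of one partition through a single producer."""
    producer = producer_factory()
    try:
        for record in records:
            producer.send(topic, record)
    finally:
        # Always release the producer, even if a send fails.
        producer.close()
    return producer


# Three partitions of output data, as they might sit on different machines.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
producers = [send_partition(part, "output-topic") for part in partitions]
```

Creating the producer inside the per-partition function means it is built on the machine that holds the data, so no records move between machines before they are sent.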

Processing log files: Apache Storm or Spark

I have a requirement to process log file data. It is relatively trivial. I have 4 servers with 2 web applications running on each for a total of 8 log files. These get rotated on a regular basis. I'm writing data in the following format into these log files
Source Timestamp :9340398;39048039;930483;3940830
Where the numbers are identifiers in a data store. I want to set up a process to read these logs and for each id it will update a count depending on the number of times its id has been logged. It can either be real time or batch. My interface language to the datastore is Java. The process runs in production so needs to be robust but also needs to have a relatively simple architecture so it is maintainable. We also run zookeeper.
My initial thought was to do this as a batch job whenever the log file is rotated, running Apache Spark on each server. However, I then started looking at log aggregators such as Apache Flume, Kafka and Storm, but these seem like overkill.
Given the multitude of choices has anyone got any good suggestions as to which tools to use to handle this problem based on experience?
8 log files don't seem to warrant any "big data" technology. If you do want to play with or get started with this type of technology, I'd recommend starting with Spark and/or Flink. Both have relatively similar programming models, and both can handle "business real-time" (Flink is better at streaming, but both would seem to work in your case). Storm is relatively rigid (it is hard to change topologies) and has a more complex programming model.
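To show how small the core of the task is without any framework, here is a minimal sketch that counts ids in lines of the format given in the question. It assumes the id list follows the last colon on each line; the sample lines, including the source names and timestamps, are invented for illustration.

```python
from collections import Counter


def count_ids(lines):
    """Count how often each id appears after the last ':' of each line."""
    counts = Counter()
    for line in lines:
        _, separator, id_part = line.rpartition(":")
        if not separator:
            continue  # no colon at all, so there is no id list to read
        tokens = (token.strip() for token in id_part.split(";"))
        counts.update(token for token in tokens if token.isdigit())
    return counts


# Invented sample lines in the "Source Timestamp :id;id;..." format.
sample = [
    "web1 2015-01-01T10:00:00 :9340398;39048039;930483;3940830",
    "web2 2015-01-01T10:00:05 :930483;9340398",
    "malformed line without ids",
]
totals = count_ids(sample)
```

The resulting counts could then be applied to the data store once per rotated file.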

Per-user stream processing

I need to process data from a set of streams, applying the same elaboration to each stream independently from the other streams.
I've already looked at frameworks like Storm, but it appears to allow processing of static streams only (e.g. tweets from Twitter), while I need to process data from each user separately.
A simple example of what I mean could be a system where each user can track his gps location and see statistics like average velocity, acceleration, burnt calories and so on in real time. Of course, each user would have his own stream(s) and the system should process the stream of each user separately, as if each user had its own dedicated topology processing his data.
Is there a way to achieve this with a framework like storm, spark streaming or samza?
It would be even better if python is supported, since I already have a lot of code I'd like to reuse.
Thank you very much for your help
Using Storm, you can group data with the fields-grouping connection pattern if you have a user-id in your tuples. This ensures that data is partitioned by user-id, so you get logical substreams. Your code only needs to be able to process multiple groups/substreams, because a single bolt instance receives multiple groups for processing. But Storm supports your use case for sure. It can also run Python code.
In Samza, similar to Storm, one would partition the individual streams on some user ID. This would guarantee that the same processor would see all the events for some particular user (as well as other user IDs that the partition function [a hash, for instance] assigns to that processor). Your description sounds like something that would more likely run on the client's system rather than being a server-side operation, however.
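The partition-by-user idea shared by both answers can be sketched in plain Python. This is not Storm or Samza code: the worker list, the SpeedTracker class and the event tuples are assumptions for illustration. A stable hash of the user id picks the worker, so every event of a user reaches the same worker, which keeps separate state for each user it is assigned.

```python
import zlib
from collections import defaultdict


def partition_for(user_id, num_partitions):
    """Map a user id to a partition with a hash that is stable across runs."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


class SpeedTracker:
    """Per-user state: running average of reported speeds."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, speed):
        self.count += 1
        self.total += speed
        return self.total / self.count


def run(events, num_partitions=4):
    """Route (user_id, speed) events to workers and return the latest averages."""
    # Each worker holds state only for the users hashed to it.
    workers = [defaultdict(SpeedTracker) for _ in range(num_partitions)]
    averages = {}
    for user_id, speed in events:
        worker = workers[partition_for(user_id, num_partitions)]
        averages[user_id] = worker[user_id].update(speed)
    return averages


events = [("alice", 10.0), ("bob", 4.0), ("alice", 20.0), ("bob", 6.0)]
averages = run(events)
```

Because one worker may serve several users, the per-user state is keyed by user id inside the worker, which is the "multiple groups per bolt instance" point made above.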
Non-JVM language support has been proposed for Samza, but not yet implemented.
You can use WSO2 Stream Processor to achieve this. You can partition the input stream by user name and process the events pertaining to each user separately. The processing logic has to be written in Siddhi QL, which is an SQL-like language.
WSO2 SP also has a Python wrapper, which allows you to perform administrative tasks such as submitting and editing jobs. But you can't write the processing logic in Python.

Resources