Use Akka with Apache Spark Streaming & Kafka? - apache-spark

Below is the high-level use case I'm trying to work on.
We have a stream of student data published to a Kafka topic. Our module has to read the student IDs as a stream, fetch the associated data from multiple sources for each student, perform a calculation per student, and publish the result for each student to a Kafka topic.
So the question is: is it better to write a single big Spark job, or to use Akka with a separate service for each source, so that actors can work in parallel, each taking a batch of student IDs, fetching the data from its source, and applying transformations and actions before the final per-student calculation?
Or do I really need Akka here? Will Spark handle this efficiently internally?
I'd appreciate any thoughts.

If your transformations take data from Kafka as input and write output back to Kafka, the most natural fit appears to be Kafka Streams. I'd look at that first. Kafka Streams takes advantage of the partitioning of data in Kafka to process partition groups in parallel with each other while processing messages sequentially within each group, much as Akka actors run in parallel with each other while each actor processes its own messages sequentially.
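The "parallel across partitions, sequential within a partition" model described above can be sketched in plain Python. This is only an illustration of the idea, not the Kafka Streams API: the partition count, the `partition_for` helper, and the CRC32 hash are assumptions made for the example (Kafka's default partitioner uses murmur2 on the key).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (stand-in for Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def process_partition(messages):
    """Handle one partition's messages strictly in arrival order."""
    return [f"{student_id}:{seq}" for student_id, seq in messages]


def run(messages):
    # Route each message to its partition, preserving arrival order.
    partitions = defaultdict(list)
    for student_id, seq in messages:
        partitions[partition_for(student_id)].append((student_id, seq))

    # One task per partition: partitions run in parallel,
    # messages inside a partition are processed sequentially.
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        futures = {p: pool.submit(process_partition, msgs)
                   for p, msgs in partitions.items()}
        return {p: f.result() for p, f in futures.items()}


messages = [("s1", 0), ("s2", 0), ("s1", 1), ("s3", 0), ("s2", 1), ("s1", 2)]
results = run(messages)
```

Because every message for a given student ID lands in the same partition, the per-student order is preserved even though partitions are processed concurrently.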
However, if your calculation requires, e.g., machine learning or, more generally, iterative data processing that repartitions the data between iterations (shuffling, in Spark lingo), then Kafka Streams would no longer be such a good fit, I think. Then I'd consider Spark or Flink.
Akka is really powerful, and you can use it in both of these cases and more. However, it's a lower-level library than Kafka Streams, Spark, or Flink, which means you have more power but also more to think about. If using Akka, I'd go for akka-streams. It has good integration with Kafka via the akka-stream-kafka (aka reactive-kafka) library.

Related

Why is there no JDBC Spark Streaming receiver?

I suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I assume no monitoring of new rows in the table, just reading the table once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing a Receiver doesn't look difficult.
Could you explain why such a receiver doesn't exist (is this approach a bad idea?) or provide links to implementations?
I've found Stratio/datasource-receiver, but it reads all the data into a DataFrame before Spark Streaming processes it.
Thanks!
First of all, an actual streaming source would require a reliable mechanism for monitoring updates, which is simply not part of the JDBC interface, nor is it a standardized (if available at all) feature of major RDBMSs, not to mention other platforms that can be accessed through JDBC. It means that streaming from a source like this typically requires log replication or similar facilities and is highly resource dependent.
At the same time, what you describe:
"I suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I assume no monitoring of new rows in the table, just reading the table once."
is really not a use case for streaming. Streaming deals with infinite streams of data, while what you ask about is simply a scenario for partitioning, and such capabilities are already part of the standard JDBC connector (either by range or by predicate).
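To make "partitioning by range or by predicate" concrete, here is a minimal sketch in plain Python with the standard-library `sqlite3` module standing in for a JDBC source. It is not Spark's implementation: the `range_predicates` helper and the `students` table are hypothetical, and NULL values in the partition column are ignored. In Spark itself the equivalent knobs are the JDBC reader options `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions`, or an explicit list of predicates.

```python
import sqlite3


def range_predicates(column, lower, upper, num_partitions):
    """Split a numeric column into contiguous WHERE clauses, one per partition.

    The first and last clauses are open-ended so that rows outside
    [lower, upper) are still read by some partition.
    """
    if num_partitions < 1 or upper <= lower:
        raise ValueError("need num_partitions >= 1 and upper > lower")
    stride = (upper - lower) // num_partitions
    if num_partitions == 1 or stride == 0:
        return ["1 = 1"]
    cuts = [lower + i * stride for i in range(1, num_partitions)]
    predicates = [f"{column} < {cuts[0]}"]
    for lo, hi in zip(cuts, cuts[1:]):
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    predicates.append(f"{column} >= {cuts[-1]}")
    return predicates


# Hypothetical table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(i, f"student-{i}") for i in range(100)])

predicates = range_predicates("id", 0, 100, 4)
# Each predicate could be handed to a separate worker; here we only count rows.
counts = [conn.execute(f"SELECT COUNT(*) FROM students WHERE {p}").fetchone()[0]
          for p in predicates]
```

Each predicate selects a disjoint slice of the table, so workers can read their slices independently and no single node has to hold all rows in memory.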
Additionally, receiver-based solutions simply don't scale well and effectively model a sequential process. As a result, their applications are fairly limited, and they would be even less appealing if the data were bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).
I don't think it is a bad idea, since in some cases you have constraints that are outside your control, e.g. legacy systems to which you cannot apply strategies such as CDC but which you still have to consume as a source of stream data.
On the other hand, the Spark Structured Streaming engine, in micro-batch mode, requires the definition of an offset that can be advanced, as you can see in this class. So if your table has a column that can be used as an offset, you can definitely stream from it, although RDBMSs are not "streaming-friendly" as far as I know.
I have developed Jdbc2s, which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central, if you need it. Coordinates are in the documentation.
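The "offset that can be advanced" idea can be illustrated without Spark. The sketch below is not how Jdbc2s or Structured Streaming is implemented; it only shows the polling pattern, assuming a hypothetical `events` table whose monotonically increasing `id` column serves as the offset, with `sqlite3` standing in for the database.

```python
import sqlite3


def read_batch(conn, last_offset, batch_size):
    """Fetch up to batch_size rows whose offset column is greater than last_offset."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_offset, batch_size),
    ).fetchall()
    # Advance the offset only if something was read.
    new_offset = rows[-1][0] if rows else last_offset
    return rows, new_offset


# Hypothetical table with an auto-assigned integer key used as the offset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"e{i}",) for i in range(7)])

offset = 0
batches = []
while True:
    rows, offset = read_batch(conn, offset, batch_size=3)
    if not rows:
        break  # a real stream would sleep and poll again instead of stopping
    batches.append([payload for _, payload in rows])
```

The pattern only works if the offset column never decreases and rows are not updated in place; otherwise changes are silently missed, which is one reason log-based CDC is usually preferred when it is available.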

How to process multiple Kafka topics with multiple Spark jobs in parallel

Please forgive if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good fit for real-time analytics on streaming data, with results then pushed to a downstream sink such as HDFS, Hive, or HBase.
I have two questions about that. First, I am not clear whether only one Spark Streaming job runs at any given time or several. Say I have different analytics to perform for each Kafka topic, or for each source that streams into Kafka, and then need to push the results downstream.
Does Spark allow you to run multiple streaming jobs in parallel, so you can keep aggregate analytics separate for each stream (in this case, each Kafka topic)? If so, how is that done, and is there any documentation you could point me to?
Just to be clear, my use case is to stream from different sources, where each source could have different analytics to perform as well as a different data structure. I want to be able to have multiple Kafka topics and partitions. I understand that each Kafka partition maps to a Spark partition and that it can be parallelized.
I am not sure how to run multiple Spark Streaming jobs in parallel, though, so that I can read from multiple Kafka topics and compute separate analytics on those topics/streams.
If not Spark, is this something that's possible to do in Flink?
Second, how does one get started with Spark? It seems there is a company or distro to choose for each component: Confluent for Kafka, Databricks for Spark, HW/CDH/MapR for Hadoop. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipeline while limiting the number of vendors? It seems like a huge task even to start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I'd like to add that if you want to share data between streams, you can use the Spark SQL API: register your RDD as a SQL table and access the same table from all the streams. This is possible because all the streams share the same SparkContext.

Spark or Storm (Trident)

I am trying to scale a component in our system and am deciding between Storm (Trident) and Spark.
We have two large sets, S1 and S2, stored in a Redis cluster, each of which can contain up to millions of events.
We read a message from a messaging queue (Kafka) and need to find all the elements that are present in both S1 and S2 (basically, find S1∩S2). For small sets Redis itself can do the intersection efficiently, but we anticipate that these sets can grow to millions of elements.
To solve this, we are exploring distributed computation frameworks (namely Storm and Spark).
I have a little experience with basic spouts and bolts in Storm and think it will not work efficiently here, as we would have to write the intersection logic inside one of our bolts. I am exploring whether Trident can be of some use, but it looks to me like it may not be adequate.
On the other hand, Spark provides RDDs at its core, which offer operations like intersection and union that are processed in parallel out of the box. My guess is that we would read a message from the queue and submit a task to the Spark cluster, which would read from Redis and compute S1∩S2 efficiently. So I think Spark can be a good fit for our use case.
If both Storm and Spark can help, I would lean toward Storm.
Can anyone here provide some perspective?
Disclaimer: I am a committer at Flink and Storm, and work as a software engineer at Confluent focusing on Kafka Streams.
I am not familiar with Spark details, but "intersect" sounds like a batch processing operator, so I am not sure whether it is available in Spark Streaming. You should double-check this (I assume you want to use Spark Streaming, as you compare Spark to Storm). If you want to do batch processing, it sounds reasonable to go with Spark and exploit the "intersect" operator.
Doing "intersect" in stream processing is different from doing it in batch processing. However, it is basically a join operation, and it should not be hard to implement (as long as the system provides a proper join operator).
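The claim that a streaming intersect is basically a join can be sketched as a symmetric hash join in plain Python. This is an illustration only, not the API of Kafka Streams, Flink, Storm, or Spark; the `streaming_intersect` function and the `"S1"`/`"S2"` side labels are assumptions made for the example.

```python
def streaming_intersect(events):
    """Yield each element once, as soon as it has been seen on both sides.

    `events` is an iterable of (side, element) pairs, side being "S1" or "S2".
    """
    seen = {"S1": set(), "S2": set()}
    emitted = set()
    for side, element in events:
        other = "S2" if side == "S1" else "S1"
        seen[side].add(element)
        # Join condition: the same element already arrived on the other side.
        if element in seen[other] and element not in emitted:
            emitted.add(element)
            yield element


events = [("S1", "a"), ("S2", "b"), ("S2", "a"),
          ("S1", "c"), ("S1", "b"), ("S2", "d")]
result = list(streaming_intersect(events))
```

Note that the state kept per side grows without bound here. A real stream processor would bound it with windows or retention, which is why windowed joins are the usual building block.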
As you mention that you will consume messages from Kafka, it might be worth trying Kafka Streams, Kafka's stream processing library. That way, you do not need to run an additional system. Kafka Streams offers a rich DSL, including sliding-window joins.
If you want to go with a stream processing framework, I would rather use Flink, which is (IMHO) better than Storm (or Spark).
See also Confluent's Kafka Streams docs that are more detailed than Apache Kafka's docs of Kafka Streams: http://docs.confluent.io/current/streams/index.html

Spark: processing multiple Kafka topics in parallel

I am using Spark 1.5.2. I need to run a Spark Streaming job with Kafka as the streaming source. I need to read from multiple topics within Kafka and process each topic differently.
Is it a good idea to do this in the same job? If so, should I create a single stream with multiple partitions or a different stream for each topic?
I am using the Kafka direct stream. As far as I know, Spark launches long-running receivers for each partition. I have a relatively small cluster: 6 nodes with 4 cores each. If I have many topics and many partitions in each topic, would efficiency suffer because most executors are busy with long-running receivers? Please correct me if my understanding is wrong here.
I made the following observations, in case they are helpful for someone:
In the Kafka direct stream, the receivers are not run as long-running tasks. At the beginning of each batch interval, the data is first read from Kafka in the executors. Once read, the processing part takes over.
If we create a single stream with multiple topics, the topics are read one after the other. Also, filtering the DStream to apply different processing logic would add another step to the job.
Creating multiple streams would help in two ways: 1. You don't need to apply the filter operation to process different topics differently. 2. You can read multiple streams in parallel (as opposed to one by one in the case of a single stream). To do so, there is an undocumented config parameter, spark.streaming.concurrentJobs. So I decided to create multiple streams.
sparkConf.set("spark.streaming.concurrentJobs", "4");
I think the right solution depends on your use case.
If your processing logic is the same for data from all topics, then processing them together in one job is without doubt the better approach.
If the processing logic is different, I guess you get a single RDD from all the topics, and you have to create a pair RDD for each processing logic and handle each one separately. The problem is that this couples the processing together, so the overall processing speed is determined by the topic that takes the longest to process. Topics with less data have to wait until data from all topics has been processed. One advantage is that if it is time-series data, the processing proceeds in step across topics, which might be a good thing.
Another advantage of running independent jobs is that you get better control and can adjust your resource sharing. For example, jobs that process high-throughput topics can be allocated more CPU and memory.
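The extra dispatch step that a single combined stream needs can be sketched in plain Python. This is not Spark code: the topic names, the `HANDLERS` table, and the batch of `(topic, value)` pairs are hypothetical, and the batch merely stands in for one micro-batch read from all topics.

```python
from collections import defaultdict


def split_by_topic(batch):
    """Group the records of one combined micro-batch by their source topic."""
    by_topic = defaultdict(list)
    for topic, value in batch:
        by_topic[topic].append(value)
    return by_topic


# Hypothetical per-topic processing logic.
HANDLERS = {
    "clicks": lambda values: len(values),   # count events
    "orders": lambda values: sum(values),   # total order amount
}


def process_batch(batch):
    # The batch is finished only when every topic's handler has finished,
    # so the slowest topic sets the pace for all of them.
    return {topic: HANDLERS[topic](values)
            for topic, values in split_by_topic(batch).items()}


batch = [("clicks", 1), ("orders", 20), ("clicks", 1), ("orders", 5)]
output = process_batch(batch)
```

With one stream or job per topic, the grouping step disappears and each handler runs on its own schedule, at the cost of managing more streams.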

Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are creating a real-time stream processing system with Spark Streaming that applies a large number (millions) of analytic models to RDDs from many different types of incoming metric data streams (more than 100,000). These streams are either original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which Spark cluster node will process which RDDs from which streams, we need to make ALL these models available at each Spark compute node. This will create a huge overhead at each Spark node. We are considering using in-memory data grids to provide these models to the Spark compute nodes. Is this the right approach?
Or
Should we avoid Spark Streaming altogether and just use in-memory data grids like Redis (with pub/sub) to solve this problem? In that case we would stream data to the specific Redis nodes that contain the specific models. Of course, we would then have to do all the binning, windowing, etc. ourselves.
Please suggest.
It sounds to me like you need a combination of a stream processing engine and a distributed data store. I would design the system like this.
1. The distributed data store (Redis, Cassandra, etc.) holds the data you want to access from all the nodes.
2. Receive the data streams through a data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process them in the stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
3. In the functions that are used to process the stream records, the necessary data will have to be pulled from the data store and maybe cached locally as appropriate.
4. You may also have to update the data store from Spark Streaming as the application needs. In that case you will also have to worry about versioning of the data that you want to pull in step 3.
Hopefully that made sense. It's hard to give any more specifics of the implementation without the exact computation model. Hope this helps!
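Step 3 above (pull from the data store, cache locally) can be sketched in plain Python. This is a minimal illustration under stated assumptions: `MODEL_STORE` is an in-process dictionary standing in for Redis or Cassandra, a "model" is reduced to a single multiplier, and `load_model` and `score` are hypothetical names.

```python
from functools import lru_cache

# Stand-in for a remote store such as Redis or Cassandra (hypothetical contents).
MODEL_STORE = {"cpu": 2.0, "memory": 0.5}
store_reads = []  # records every lookup that actually reaches the store


@lru_cache(maxsize=1024)
def load_model(metric_type):
    """Fetch a model from the store; repeated calls are served from the local cache."""
    store_reads.append(metric_type)
    return MODEL_STORE[metric_type]


def score(record):
    """Apply the model for the record's metric type to its value."""
    metric_type, value = record
    return metric_type, load_model(metric_type) * value


records = [("cpu", 10.0), ("memory", 4.0), ("cpu", 3.0)]
scored = [score(r) for r in records]
```

Only the models a node actually needs are loaded, and each is fetched once per process. The cache here never refreshes, so if models are updated in the store, the versioning concern from step 4 applies: the cache key would need to include a version, or entries would need to expire.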
