Spark or Storm (Trident)?

I am trying to scale a component in our system and am weighing Storm (Trident) against Spark.
We have two large sets, S1 and S2, stored in a Redis cluster; each can contain up to a million events.
We read a message from a messaging queue (Kafka) and then need to find all the elements that are present in both S1 and S2, i.e. compute S1∩S2. For small sets Redis can do the intersection efficiently on its own, but we anticipate these sets growing into the millions.
To solve this, we are looking at distributed computation frameworks, namely Storm and Spark.
I have a little experience with basic Spouts and Bolts in Storm, and I think it will not work efficiently here because we would have to write the intersection logic inside one of our bolts. I am exploring whether Trident could help, but it does not look like it provides adequate support for this either.
Spark, on the other hand, provides RDDs at its core, which offer operations like intersection and union that are processed in parallel out of the box. My thinking is that we read a message from the queue and submit a job to the Spark cluster, which reads the sets from Redis and computes S1∩S2 efficiently. So Spark looks like a good fit for our use case.
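For illustration, roughly the kind of job I have in mind is sketched below. How the sets are pulled out of Redis is left as a hypothetical loadSetFromRedis helper, since we have not settled on a connector yet; the intersection() call is the point of the sketch.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SetIntersectionJob {

  // Hypothetical helper: how S1/S2 are read from Redis (SMEMBERS, SSCAN, a
  // connector library, ...) is still an open question, so this is just a stub.
  def loadSetFromRedis(sc: SparkContext, key: String): RDD[String] = {
    val members: Seq[String] = Seq.empty // placeholder for the Redis set members
    sc.parallelize(members)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s1-intersect-s2"))

    val s1 = loadSetFromRedis(sc, "S1")
    val s2 = loadSetFromRedis(sc, "S2")

    // intersection() shuffles both RDDs and keeps only the common elements,
    // so the work is spread across the cluster instead of a single bolt.
    val common = s1.intersection(s2)

    common.collect().foreach(println) // or write the result back to Redis/Kafka
    sc.stop()
  }
}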
If both Storm and Spark can work here, I would lean towards Storm.
Can anyone here provide some perspective?

Disclaimer: I am a committer at Flink and Storm, and work as a software engineer at Confluent focusing on Kafka Streams.
I am not familiar with Spark details, but "intersect" sounds like a batch processing operator, so I am not sure whether it is available in Spark Streaming -- you should double-check this (I assume you want to use Spark Streaming, since you compare Spark to Storm). If you want to do batch processing, it sounds reasonable to go with Spark and exploit its intersect operator.
Doing an intersection in stream processing is different from batch processing. However, it is basically a join operation, and it should not be hard to implement as long as the system provides a proper join operator.
As you mention that you will consume messages from Kafka, it might be worth trying out Kafka Streams, Kafka's stream processing library; that way you do not need to run an additional system. Kafka Streams offers a rich DSL including sliding-window joins.
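As a rough sketch of what such a windowed join could look like with the Kafka Streams Scala DSL (topic names, the window size and the exact Serdes import path depend on your Kafka version and setup, so treat this as an assumption-laden illustration rather than a tested implementation):

import java.time.Duration
import java.util.Properties

import org.apache.kafka.streams.kstream.JoinWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object StreamIntersection extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "s1-s2-intersection") // made-up id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Assumes the two sets are also available as keyed Kafka topics.
  val s1 = builder.stream[String, String]("s1-events")
  val s2 = builder.stream[String, String]("s2-events")

  // Elements that appear in both streams within the window show up in the
  // join result, which approximates the intersection in a streaming setting.
  val common = s1.join(s2)(
    (v1, _) => v1,
    JoinWindows.of(Duration.ofMinutes(5)) // newer versions: ofTimeDifferenceWithNoGrace
  )

  common.to("s1-intersect-s2")

  new KafkaStreams(builder.build(), props).start()
}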
If you want to go with a stream processing framework, I would rather use Flink, which is (IMHO) better than Storm (or Spark).
See also Confluent's Kafka Streams docs, which are more detailed than Apache Kafka's own docs for Kafka Streams: http://docs.confluent.io/current/streams/index.html

Related

Why is there no JDBC Spark Streaming receiver?

I think it would be a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach does not require reading all rows into memory. I am not thinking of monitoring the table for new rows, just reading it once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing a Receiver doesn't look difficult.
Could you describe why such a receiver doesn't exist (is this approach a bad idea?) or provide links to implementations?
I've found Stratio/datasource-receiver, but it reads all the data into a DataFrame before processing by Spark Streaming.
Thanks!
First of all, an actual streaming source would require a reliable mechanism for monitoring updates, which is not part of the JDBC interface, nor is it a standardized feature (if present at all) of the major RDBMSs, not to mention the other platforms that can be accessed through JDBC. That means streaming from a source like this typically requires log replication or similar facilities and is highly dependent on the specific data source.
At the same time, what you describe:
I think it would be a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach does not require reading all rows into memory. I am not thinking of monitoring the table for new rows, just reading it once.
is really not a use case for streaming. Streaming deals with unbounded streams of data, whereas what you ask for is simply a scenario for partitioning, and such capabilities are already part of the standard JDBC connector (either by range or by predicate), for example as sketched below.
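A minimal sketch of such a partitioned read (connection string, table name and bounds are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-batch-read").getOrCreate()

// Spark issues one query per partition, each covering a slice of the id range,
// so no single executor has to hold the whole table in memory.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // made-up connection string
  .option("dbtable", "big_table")
  .option("partitionColumn", "id") // any numeric/date/timestamp column with known bounds
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "64")
  .load()

df.write.mode("overwrite").parquet("/data/big_table") // or any other sink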
Additionally, receiver-based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and they would be even less appealing if the data were bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).
I don't think it is a bad idea, since in some cases you have constraints that are outside your control, e.g. legacy systems to which you cannot apply strategies such as CDC but which you still have to consume as a source of streaming data.
On the other hand, Spark's Structured Streaming engine, in micro-batch mode, requires the definition of an offset that can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDBMSs are not that streaming-friendly as far as I know.
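Outside the source API, the same idea can be approximated with a simple polling loop that remembers the last offset it has seen. A rough sketch, assuming an ever-increasing id column (table, column and paths are illustrative, and the offset should really be persisted somewhere durable):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("jdbc-polling").getOrCreate()

var lastId = 0L // in practice, persist this, e.g. in a checkpoint table

while (true) {
  // Only rows beyond the last observed offset are fetched in each pass.
  val batch = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb") // made-up connection string
    .option("query", s"SELECT * FROM events WHERE id > $lastId")
    .load()

  if (!batch.isEmpty) {
    batch.write.mode("append").parquet("/data/events") // downstream processing
    lastId = batch.agg(max("id")).first().getLong(0)
  }

  Thread.sleep(30000) // poll interval
}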
I have developed Jdbc2s, which is a DataSource V1 streaming source for Spark. It is also deployed to Maven Central if you need it; the coordinates are in the documentation.

How to use Flink and Spark together, with Spark used just for transformation?

Let's say there is a collection "goods" in MongoDB like this:
{ "name": "A", "attr": { "location": "us" }, "eventTime": "2018-01-01" }
{ "name": "B", "attr": { "brand": "nike" }, "eventTime": "2018-01-01" }
In the past, I used Spark to flatten it and save it to Hive:
goodsDF.select($"name", explode($"attr"))
But now we need to handle incremental data; for example, the next day a new good appears as the third line:
{ "name": "A", "attr": { "location": "us" }, "eventTime": "2018-01-01" }
{ "name": "B", "attr": { "brand": "nike" }, "eventTime": "2018-01-01" }
{ "name": "C", "attr": { "location": "uk" }, "eventTime": "2018-02-01" }
Some of our team think Flink is better for streaming, because Flink supports event-driven applications, streaming pipelines and batch, while Spark is just micro-batch.
So we want to switch to Flink, but a lot of code has already been written in Spark, for example the "explode" above. So my questions are:
Is it possible to use Flink to read from the source and write to the sink, but use Spark in the middle to transform the dataset?
If that is not possible, how about saving to a temporary sink, say some JSON files, and then having Spark read the files, transform them and save to Hive? But I am afraid this makes no sense, because for Spark it is still incremental data, and using Flink followed by Spark would be the same as using Spark Structured Streaming directly.
No. Apache Spark code cannot be used in Flink without changes; they are different processing frameworks, and the APIs they provide and their syntax differ from each other. The choice of framework should really be driven by the use case and not by generic statements like "Flink is better than Spark". A framework may work great for your use case and perform poorly in another. By the way, Spark is not just micro-batch: it has batch, streaming, graph, ML and other components. Since the complete use case is not described in the question, it is hard to say which one is better for this scenario, but if your use case can tolerate sub-second latency, I would not waste my time moving to another framework.
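For instance, if the incremental documents arrived on a Kafka topic, the existing explode logic could be reused almost verbatim with Structured Streaming. A hedged sketch (topic name, schema and paths are assumptions, not a tested pipeline):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType, StructType}

val spark = SparkSession.builder().appName("goods-stream").getOrCreate()

// Assumed shape of the incoming documents: name, attr (a map), eventTime.
val goodsSchema = new StructType()
  .add("name", StringType)
  .add("attr", MapType(StringType, StringType))
  .add("eventTime", StringType)

val flattened = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "goods") // made-up topic name
  .load()
  .select(from_json(col("value").cast("string"), goodsSchema).as("g"))
  .select(col("g.name"), explode(col("g.attr"))) // same flattening as the batch job

flattened.writeStream
  .format("parquet") // or a Hive-backed table
  .option("path", "/warehouse/goods_flat")
  .option("checkpointLocation", "/checkpoints/goods_flat")
  .start()
  .awaitTermination()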
Also, if things are dynamic and it is anticipated that the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most of the processing engines. Using the Apache Beam processing APIs gives you the flexibility to change the underlying processing engine at any time. Here is the link to read more about Beam: https://beam.apache.org/.

Kafka streaming or Spark Streaming?

I am now using Kafka in Python.
I was wondering whether Spark with Kafka is needed, or whether we can just use Kafka through pyKafka.
My concern is that Spark (PySpark) adds overhead to the process, and if we don't use any Spark functions, plain Kafka streaming should be all that is required.
What are the drawbacks of using PySpark and Spark's Kafka integration?
It totally depends on the use case at hand, as mentioned in the comments. However, I went through the same situation a couple of months ago, so I will try to share what I learned and how I decided to move to kafka-streams instead of spark-streaming.
In my use case, we only used Spark to do real-time streaming from Kafka, and did not do any sort of map-reduce, windowing, filtering or aggregation.
Given the above, I made the comparison along three dimensions:
Technicality
DevOps
Cost
The image below shows the comparison table I used to convince my team to migrate to kafka-streams and drop Spark. Cost is not included in the image, as it depends entirely on your cluster size (head node / worker nodes).
Important note:
Again, this is based on your case; I just tried to give you a pointer on how to do the comparison. Spark itself has lots of benefits, but describing them is beyond the scope of this question.

Use Akka with Apache Spark Streaming & Kafka?

Below is the high-level use case I am trying to work on.
We have a stream of student data published into a Kafka topic, and our module has to read the student IDs as a stream, fetch associated data from multiple sources for each student, perform some calculation per student, and publish the resulting calculation for each student into a Kafka topic.
So the question is whether it is better to write a single big Spark job, or to use Akka and have a separate service for each source, so that actors can work in parallel, take a batch of student IDs, get the data from their respective sources, perform a bunch of transformations and actions, and finally produce the calculation associated with each student.
Or do I really need to use Akka here? Will Spark handle this efficiently internally?
I appreciate any thoughts here.
If your transformations take data from Kafka as input and produce output back into Kafka, the most natural fit appears to be Kafka Streams; I'd look at that first. Kafka Streams takes advantage of the partitioning of data in Kafka to process partition groups in parallel with each other, while processing messages sequentially within each group, similar to how Akka actors work in parallel with each other but each actor internally processes messages sequentially.
However, if your calculation requires e.g. machine learning or, in general, some iterative data processing that re-partitions (shuffles, in Spark lingo) the data between iterations, then Kafka Streams would no longer be that good a fit, I think. Then I'd consider Spark or Flink.
Akka is really powerful and you can use it in both of these cases and more. However, it is a lower-level library than Kafka Streams, Spark or Flink, which means you have more power but also more things to consider. If using Akka, I'd go for akka-streams; it has good integration with Kafka via the akka-stream-kafka (aka reactive-kafka) library.
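If the akka-streams route is chosen, a minimal consume-enrich-produce pipeline with akka-stream-kafka could look roughly like this (topic names and the enrichment call are made-up placeholders; Akka 2.6+ is assumed so the materializer comes from the implicit ActorSystem):

import akka.actor.ActorSystem
import akka.kafka.scaladsl.{Consumer, Producer}
import akka.kafka.{ConsumerSettings, ProducerSettings, Subscriptions}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

import scala.concurrent.Future

object StudentEnrichment extends App {
  implicit val system: ActorSystem = ActorSystem("student-enrichment")

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("student-calculator")

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Placeholder for the per-student lookups against the external sources
  // and the calculation itself.
  def enrichAndCalculate(studentId: String): Future[String] =
    Future.successful(s"result-for-$studentId")

  // Read student ids, enrich/calculate with bounded parallelism, write results back.
  Consumer.plainSource(consumerSettings, Subscriptions.topics("student-ids"))
    .mapAsync(parallelism = 8)(record => enrichAndCalculate(record.value()))
    .map(result => new ProducerRecord[String, String]("student-results", result))
    .runWith(Producer.plainSink(producerSettings))
}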

How to process multiple Kafka topics with multiple Spark jobs in parallel

Please forgive me if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good fit for doing real-time analytics on streaming data, which can then be pushed to a downstream sink such as HDFS/Hive/HBase etc.
I have two questions about that. I am not clear on whether only one Spark streaming job can run at any given time, or multiple. Say I have different analytics I need to perform for each Kafka topic, or for each source that streams into Kafka, and then push the results of those downstream.
Does Spark allow you to run multiple streaming jobs in parallel, so you can keep the aggregate analytics separate for each stream, or in this case each Kafka topic? If so, how is that done? Is there any documentation you could point me to?
Just to be clear, my use case is to stream from different sources, and each source could have potentially different analytics I need to perform, as well as a different data structure. I want to be able to have multiple Kafka topics and partitions. I understand that each Kafka partition maps to a Spark partition and can be processed in parallel.
I am not sure how you run multiple Spark streaming jobs in parallel, though, so as to read from multiple Kafka topics and tabulate separate analytics on those topics/streams.
If not Spark, is this something that is possible to do in Flink?
Second, how does one get started with Spark? It seems there is a company and/or distro to choose for each component: Confluent for Kafka, Databricks for Spark, Hortonworks/CDH/MapR for Hadoop. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipeline while limiting the number of vendors? It seems like such a huge task to even start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I would like to add that if you want to share data between streams, you can use the Spark SQL API: register your RDD as a SQL table and access the same table in all the streams. This is possible because all the streams share the same SparkContext.
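As an illustration of running several independent streaming queries inside one application, here is a sketch using the newer Structured Streaming API (topic names, the per-topic analytics and the sinks are assumptions; the same pattern also works with one application per topic):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-topic-streams").getOrCreate()

def kafkaStream(topic: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    .load()

// Each topic gets its own query, its own analytics and its own checkpoint,
// and all queries run concurrently inside the same Spark application.
val ordersQuery = kafkaStream("orders")
  .groupBy("key").count() // placeholder analytics for this topic
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/checkpoints/orders")
  .start()

val clicksQuery = kafkaStream("clicks")
  .groupBy("key").count() // different analytics would go here
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/checkpoints/clicks")
  .start()

spark.streams.awaitAnyTermination()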
