I want to check whether it is a good idea to invoke Spark code from a Storm bolt. We have a stream-based system in Storm, and per message we would like to do some ML, for which we are thinking of using Spark. So I wanted to check if it is a good idea to do so. Are there any runtime issues we might encounter?
Thanks
ap
If you already have a system in place with Storm, then why do you want to use Spark?
IMHO Spark and Storm are different beasts; you may want to run them in parallel for the same or different use cases, but do not tightly integrate them with each other.
What do you mean by ML per message? ML on a single message doesn't make much sense. Do you mean ML on a stream? Sure, you can do it with Spark, but then you need to either use Spark Streaming (and then you have two streaming architectures...) or save the data somewhere and do batch ML with Spark.
Why not use trident-ml instead?
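If you do decide to keep the ML in Spark rather than Storm, one common pattern is to train a model offline and only apply it to the stream. The sketch below scores a Kafka stream with a pre-trained Spark ML pipeline; the model path, broker address, topic name, and payload schema are all illustrative assumptions, and the Kafka source needs the spark-sql-kafka package on the classpath.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

object StreamScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stream-scoring").getOrCreate()
    import spark.implicits._

    // Assumed JSON payload schema; adjust to whatever your messages actually carry.
    val schema = new StructType()
      .add("amount", DoubleType)
      .add("count", IntegerType)

    // Hypothetical pipeline (feature assembly + model) trained offline and saved to disk.
    val model = PipelineModel.load("/models/scoring-pipeline")

    // Broker address and topic are assumptions for the example.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    // Score every micro-batch as it arrives and print the predictions.
    model.transform(events)
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}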
I'm learning how to use Kubeflow Pipelines for Apache Spark jobs and have a question. I'd appreciate it if you could share your thoughts!
It is my understanding that data cannot be shared between SparkSessions, and that in each pipeline step/component you need to instantiate a new SparkSession (please correct me if I'm wrong). Does that mean that in order to use the output of Spark jobs from previous pipeline steps, we need to save it somewhere? I suspect this will cause a disk read/write burden and slow down the whole process. If so, how helpful is it really to use a pipeline for Spark work?
I'm imagining a potential use case where one would like to ingest data in PySpark, preprocess it, select features for an ML job, then try different ML models and select the best one. In a non-Spark situation, I would probably set up separate components for each of the "loading data", "preprocessing data", and "feature engineering" steps. Given the issue above, however, would it be better to complete all of these within one pipeline step, save the output somewhere, and then dedicate a separate pipeline component to each model and train them in parallel?
Can you share any other potential use case? Thanks a lot in advance!
Spark is, in general, an in-memory processing framework, so you want to avoid unnecessary writing and reading of files. I believe it's better to have one complete Spark job per task, so you don't need to share a SparkSession or pass the "middle" results between tasks. The data produced by loading, preprocessing, and feature engineering is better serialised and stored anyway, with or without Kubeflow (think bronze/silver/gold layers).
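For illustration, here is a minimal sketch of one self-contained preprocessing component written as its own Spark job. The paths, bucket name, and "cleaning" logic are all assumptions for the example; in a real Kubeflow pipeline the input and output locations would usually be passed in as component parameters.

import org.apache.spark.sql.SparkSession

object PreprocessStep {
  def main(args: Array[String]): Unit = {
    // Each Kubeflow component runs its own short-lived SparkSession.
    val spark = SparkSession.builder.appName("preprocess-step").getOrCreate()

    // Illustrative input location (the "bronze" layer of raw data).
    val raw = spark.read.parquet("s3a://my-bucket/bronze/events/")

    // Toy preprocessing; replace with your real cleaning / feature logic.
    val cleaned = raw.dropDuplicates().na.drop()

    // Persist the intermediate result so that downstream components (e.g. several
    // model-training steps running in parallel) can load it with their own sessions.
    cleaned.write.mode("overwrite").parquet("s3a://my-bucket/silver/events_clean/")

    spark.stop()
  }
}

A downstream training component would then simply read the silver path with its own SparkSession.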
Let's say there is a collection "goods" in MongoDB like this:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
In the past, I used Spark to flatten it and save it to Hive:
import org.apache.spark.sql.functions.explode
goodsDF.select($"name", explode($"attr"))
But now we need to handle incremental data. For example, a new good appears as a third line the next day:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
{name:"C",attr:["location":"uk"],"eventTime":"2018-02-01"}
Some of our team think Flink is better for streaming, because Flink has event-driven applications, streaming pipelines, and batch, whereas Spark is just micro-batch.
So we changed to Flink, but a lot of code has already been written in Spark, for example the "explode" above, so my question is:
Is it possible to use Flink to fetch from the source and save to the sink, but in the middle use Spark to transform the dataset?
If that is not possible, how about saving to a temporary sink, let's say some JSON files, and then having Spark read the files, transform them, and save to Hive? But I am afraid this makes no sense, because for Spark it is also incremental data; using Flink and then Spark is the same as using Spark Structured Streaming directly.
No, Apache Spark code cannot be used in Flink without changes: they are two different processing frameworks, and the APIs they provide and their syntax differ from each other. The choice of framework should really be driven by the use case, not by generic statements like "Flink is better than Spark." A framework may work great for one use case and perform poorly in another. By the way, Spark is not just micro-batch; it has batch, streaming, graph, ML, and other things. Since the complete use case is not given in the question, it is hard to say which one is better for this scenario, but if your use case can tolerate sub-second latency, I would not waste my time moving to another framework.
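For example, staying within Spark, the incremental case from the question can be handled as a filtered batch job that appends only the new rows to Hive. This is only a sketch: the MongoDB format string and options depend on the connector version you use, and the bookkeeping of the last processed eventTime is assumed to live elsewhere.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object IncrementalFlatten {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("incremental-flatten")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hypothetical bookkeeping value: the latest eventTime already written to Hive.
    val lastProcessed = "2018-01-01"

    // Placeholder load; the format string and options depend on your MongoDB connector.
    val goodsDF = spark.read.format("mongodb").load()

    goodsDF
      .filter($"eventTime" > lastProcessed)   // keep only the newly arrived rows
      .select($"name", explode($"attr"))      // same flattening as the original batch job
      .write
      .mode("append")                         // append the increment to the Hive table
      .saveAsTable("goods_flat")
  }
}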
Also, if things are dynamic and it is anticipated that the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most of the processing engines. Using the Apache Beam processing APIs gives you the flexibility to change the underlying processing engine at any time. Here is the link to read more about Beam: https://beam.apache.org/.
I am now using Kafka in Python.
I was wondering whether Spark's Kafka integration is needed, or whether we can just use Kafka through pyKafka.
My concern is that Spark (PySpark) adds overhead to the process, and we don't use any Spark functions; we just need Kafka streaming.
What are the drawbacks of using PySpark and Spark's Kafka integration?
It totally depends on the use case at hand, as mentioned in the comments. However, I went through the same situation a couple of months ago, so I will try to pass on what I learned and how I decided to move to Kafka Streams instead of Spark Streaming.
In my use case, we only used Spark to do real-time streaming from Kafka, and didn't do any sort of map-reduce, windowing, filtering, or aggregation.
Given the above, I did the comparison along 3 dimensions:
Technicality
DevOps
Cost
The image below shows the comparison table I used to convince my team to migrate to Kafka Streams and drop Spark. Cost is not included in the image, as it totally depends on your cluster size (head node/worker nodes).
Very important note:
Again, this is based on your case; I just tried to give you a pointer on how to do the comparison. Spark itself has lots of benefits, but describing them is beyond the scope of this question.
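For a case like the one above, where nothing is transformed and records are just moved from one topic to another, a minimal Kafka Streams application can look like the sketch below (Scala DSL; the application id, broker address, and topic names are placeholders, and the serde import path differs slightly between Kafka versions).

import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object ForwardingApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-forwarder")   // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // assumed broker

    val builder = new StreamsBuilder()
    // Read records from an input topic and forward them unchanged to an output topic.
    builder.stream[String, String]("events-in").to("events-out")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}

No cluster is needed beyond the Kafka brokers themselves, which was the main operational win in my comparison.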
Please forgive me if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good fit for doing real-time analytics on streaming data, which can then be pushed to a downstream sink such as HDFS/Hive/HBase, etc.
I have two questions about that. I am not clear on whether there is only one Spark streaming job running, or multiple, at any given time. Say I have different analytics I need to perform for each Kafka topic, or for each source that is streaming into Kafka, and then need to push the results downstream.
Does Spark allow you to run multiple streaming jobs in parallel, so you can keep the aggregate analytics separate for each stream, or in this case each Kafka topic? If so, how is that done? Is there any documentation you could point me to?
Just to be clear, my use case is to stream from different sources, and each source could potentially have different analytics I need to perform, as well as a different data structure. I want to be able to have multiple Kafka topics and partitions. I understand each Kafka partition maps to a Spark partition, and it can be parallelized.
I am not sure how you run multiple Spark streaming jobs in parallel though, to be able to read from multiple Kafka topics, and tabulate separate analytics on those topics/streams.
If not Spark, is this something that's possible to do in Flink?
Second, how does one get started with Spark? It seems there is a company and/or distro to choose for each component: Confluent for Kafka, Databricks for Spark, Hortonworks/CDH/MapR for Hadoop. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipeline while limiting the number of vendors? It seems like such a huge task to even start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes (see the sketch after this answer).
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
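To make the "yes" above concrete, here is a minimal sketch of one Spark application running two independent Structured Streaming queries, one per Kafka topic, inside the same SparkSession. The broker address, topic names, and the trivial "analytics" are placeholders, and the Kafka source requires the spark-sql-kafka package.

import org.apache.spark.sql.SparkSession

object MultiTopicAnalytics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("multi-topic-analytics").getOrCreate()

    // Helper that reads one Kafka topic as a streaming DataFrame.
    def readTopic(topic: String) = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", topic)
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // One query per topic, each with its own analytics and sink.
    val ordersQuery = readTopic("orders")
      .groupBy()
      .count()                      // placeholder global count; replace with real analytics
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    val clicksQuery = readTopic("clicks")
      .writeStream
      .outputMode("append")
      .format("console")
      .start()

    // Both queries run concurrently within the same SparkSession.
    spark.streams.awaitAnyTermination()
  }
}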
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I would like to add that if you want to share data between streams, you can use the Spark SQL API: you can register your RDD as a SQL table and access the same table in all the streams. This is possible since all the streams share the same SparkContext.
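As a rough sketch of that idea with the DStream API (socket streams stand in for real sources, and the column names are made up), one stream can publish its latest micro-batch as a temp view and the other can query it, because both run on the same SparkSession:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedTableStreams {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("shared-table").master("local[*]").getOrCreate()
    import spark.implicits._
    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

    // Two socket streams stand in for two different sources/topics.
    val scores = ssc.socketTextStream("localhost", 9001)
    val events = ssc.socketTextStream("localhost", 9002)

    // Stream 1 publishes each micro-batch as a temp view visible to the whole session.
    scores.foreachRDD { rdd =>
      rdd.map(_.split(",")).map(a => (a(0), a(1).toDouble)).toDF("user", "score")
        .createOrReplaceTempView("latest_scores")
    }

    // Stream 2 can query that view once it exists, since both share the SparkSession.
    events.foreachRDD { _ =>
      if (spark.catalog.tableExists("latest_scores"))
        spark.sql("SELECT * FROM latest_scores").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}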
I am trying to scale a component in our system and am thinking about which would be the better way to go between Storm (Trident) and Spark.
So, we have two large sets, say S1 and S2, each of which can contain up to millions of events, stored inside a Redis cluster.
Now, we read a message from a messaging queue (Kafka) and need to find all the elements which are present in both S1 and S2 (basically find S1 ∩ S2). For small sets Redis itself can do the intersection efficiently, but we anticipate the size of these sets can be in the millions.
To solve the above, we are looking to explore some distributed computation frameworks (namely Storm and Spark).
I have a little experience with basic spouts and bolts in Storm, and I think it will not work efficiently here, as we would have to write the intersection logic inside one of our bolts. I am exploring whether Trident can be of some use, but it looks to me like it may not be adequate.
On the other hand, Spark provides RDDs at its core, which support operations like intersection and union in parallel out of the box. My guess is that we would read a message from the queue and submit a task to the Spark cluster, which would read from Redis and compute S1 ∩ S2 efficiently. So I think Spark can be a good fit for our use case.
If both Storm and Spark can help, I would lean towards Storm.
Can anyone here provide some perspective?
Disclaimer: I am a committer at Flink and Storm, and work as a software engineer at Confluent focusing on Kafka Streams.
I am not familiar with the details of Spark, but "intersect" sounds like a batch processing operator, so I am not sure whether it will be available in Spark Streaming; you should double-check this (I assume you want to use Spark Streaming since you compare Spark to Storm). If you want to do batch processing, it sounds reasonable to go with Spark and exploit the "intersect" operator.
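For the batch route, the RDD intersection really is a one-liner. The sketch below uses hard-coded collections in place of the Redis sets, since how you load S1 and S2 (a Redis connector, an export job, etc.) is a separate choice:

import org.apache.spark.sql.SparkSession

object SetIntersection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("set-intersection").getOrCreate()
    val sc = spark.sparkContext

    // These collections stand in for the members of the Redis sets S1 and S2.
    val s1 = sc.parallelize(Seq("a", "b", "c", "d"))
    val s2 = sc.parallelize(Seq("c", "d", "e"))

    // RDD.intersection shuffles both sides and keeps only the common elements, in parallel.
    val common = s1.intersection(s2)

    common.collect().foreach(println)   // prints c and d in some order
    spark.stop()
  }
}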
Doing "intersect" in stream processing is different than for batch processing. However, it it basically a join operation and it should not be hard to implement this (as long as there is a proper join operator provided by the system).
As you mention that you will consume messages from Kafka, it might be worth trying out Kafka Streams, Kafka's stream processing library; that way you do not need to run an additional system. Kafka Streams offers a rich DSL, including sliding-window joins.
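As a sketch of that idea, a windowed KStream-KStream join keeps exactly the keys that appear on both streams within the window, which is the streaming counterpart of the intersection. Topic names, the application id, and the broker address are placeholders, and the exact DSL signatures vary a little between Kafka versions:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.streams.kstream.JoinWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object StreamIntersection {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-intersection")  // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")    // assumed broker

    val builder = new StreamsBuilder()
    val s1 = builder.stream[String, String]("s1-events")   // assumed topic names
    val s2 = builder.stream[String, String]("s2-events")

    // A key that appears on both streams within the 5-minute window ends up in the output.
    val joined = s1.join(s2)(
      (v1, v2) => s"$v1|$v2",
      JoinWindows.of(Duration.ofMinutes(5))
    )
    joined.to("s1-and-s2")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}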
If you want to go with a stream processing framework, I would rather use Flink, which is (IMHO) better than Storm (or Spark).
See also Confluent's Kafka Streams docs, which are more detailed than Apache Kafka's own Kafka Streams docs: http://docs.confluent.io/current/streams/index.html