Apache Spark Structured Streaming vs Apache Flink: what is the difference?

We have discussed the questions below:
What is the difference between Apache Spark and Apache Flink? [closed]
What does “streaming” mean in Apache Spark and Apache Flink?
What is the difference between mini-batch vs real time streaming in practice (not theory)?
But Spark Structured Streaming became production-ready in Spark 2.2; it brings a lot of changes for streaming, and it is outstanding.
Can we say Spark Structured Streaming is stream processing, or is it still batch processing?
Now what is the big difference between Apache Flink and Apache Spark Structured Streaming?

Currently:
Spark Structured Streaming still uses micro-batches under the hood. However, it supports event-time processing and quite low latency (though not as low as Flink's), and it supports SQL and type-safe queries on streams in one API: there is no distinction, as every Dataset can be queried both with SQL and with type-safe operators. It has end-to-end exactly-once semantics (at least that's what they say ;) ). Throughput is better than Flink's (there have been benchmarks with differing results, but look at the Databricks post about them).
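To make the "one API" point concrete, here is a minimal sketch (Scala) of an event-time windowed aggregation where the same streaming Dataset is queried both with typed operators and with SQL. The Kafka topic, servers, and column names are illustrative assumptions, not anything from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("event-time-demo").getOrCreate()
import spark.implicits._

// Hypothetical Kafka source; topic and servers are assumptions.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS word", "timestamp AS eventTime")

// Typed/DataFrame operators: event-time window with a watermark for late data.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"word")
  .count()

counts.writeStream.outputMode("update").format("console").start()

// The very same Dataset is also queryable with SQL -- no separate streaming API.
events.createOrReplaceTempView("events")
val sqlCounts = spark.sql(
  """SELECT window(eventTime, '5 minutes') AS w, word, count(*) AS cnt
    |FROM events
    |GROUP BY window(eventTime, '5 minutes'), word""".stripMargin)
```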
In near future:
Spark Continuous Processing Mode is in progress, and it will give Spark ~1 ms latency, comparable to Flink's. However, as I said, it's still in progress. The API is ready for non-batch jobs, so this is easier to achieve than with the previous Spark Streaming.
The main difference:
Spark relies on micro-batching for now, while Flink has pre-scheduled operators. That means Flink's latency is lower, but the Spark community is working on Continuous Processing Mode, which (as far as I understand) will work similarly to receivers.

Related

Spark Streaming vs Structured Streaming

For the last few months I've been using Structured Streaming quite a lot for implementing stream jobs (after using Kafka a lot). After reading the book Stream Processing with Apache Spark, I was left with this question: are there any use cases where I would use Spark Streaming instead of Structured Streaming? Should I invest some time getting into it, or, since I'm already using Spark Structured Streaming, should I stick with it, there being no benefit in the previous API?
I would appreciate any opinion/insight.
Hi, sharing my personal experience.
Structured Streaming is the future for Spark-based streaming implementations. It provides a higher level of abstraction and other great features. However, there are a few restrictions.
I have had to switch to Spark Streaming on a few occasions due to the flexibility it offers. One recent example: we had to perform joins with static reference data, but outer joins are not supported in Structured Streaming. This can be accomplished with Spark Streaming.
With the newer Spark version 2.4, Structured Streaming is much improved, with support for the foreachBatch sink, which gives flexibility similar to that offered by Spark Streaming.
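To illustrate, a hedged sketch of what foreachBatch enables: each micro-batch arrives as a regular DataFrame, so any batch operation (here, a left outer join against static reference data, the case mentioned above) becomes possible. Paths, topics, and column names are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("foreachBatch-demo").getOrCreate()

// Static reference data (assumed path).
val reference = spark.read.parquet("/data/reference")

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Inside foreachBatch the full batch API applies, including outer joins.
    batch.join(reference, Seq("id"), "left_outer")
      .write.mode("append").parquet("/data/enriched")
  }
  .option("checkpointLocation", "/tmp/ckpt-enrich")
  .start()
```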
My personal view is that having knowledge of Spark Streaming is helpful, and you might have to use it depending on your use case.

Are Spark Streaming, Structured Streaming and Kafka Streaming the same thing?

I have come across three popular streaming techniques that are Spark Streaming, Structured Streaming and Kafka Streaming.
I have gone through various sites but haven't found the answer: are these three the same thing or different?
If they're not the same, what is the basic difference?
I am not looking for an in-depth answer, just an answer to the above question (yes or no) and a little intro to each of them so that I can explore more. :)
Thanks in advance
Subrat
I guess you are referring to Kafka Streams when you say "Kafka Streaming".
Kafka Streams is a JVM library, part of Apache Kafka. It is a way of processing data in Kafka topics that provides an abstraction layer. Applications using the Kafka Streams library can run anywhere (not just in the Kafka cluster; in fact, running them there is not recommended). They consume, process, and produce data to/from the Kafka cluster.
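As a rough sketch of what that looks like in practice (Scala over the Java API, assuming Scala 2.12+ for the lambda-to-SAM conversion; topic names and servers are illustrative assumptions), a Kafka Streams application is just an ordinary JVM program, not a cluster job:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

val builder = new StreamsBuilder()
builder.stream[String, String]("input-topic")     // consume from Kafka
  .mapValues(value => value.toUpperCase)          // record-by-record processing
  .to("output-topic")                             // produce back to Kafka

// Runs wherever this application runs -- not on the Kafka brokers.
new KafkaStreams(builder.build(), props).start()
```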
Spark Streaming is a part of the Apache Spark distributed data processing library that provides stream (as opposed to batch) processing. Spark initially provided batch computation only, so a specific layer, Spark Streaming, was added for stream processing. Spark Streaming can be fed with Kafka data, but it can be connected to other sources as well.
Structured Streaming, within the realm of Apache Spark, is a different approach that came to overcome certain limitations of the stream processing approach Spark Streaming was using. It was added to Spark from a certain version onwards (2.0, IIRC).

Do Spark Streaming and Spark Structured Streaming use same micro-batch engine?

Do Spark Streaming and Spark Structured Streaming use the same micro-batch scheduler engine? Does Spark Structured Streaming have lower latency than Spark Streaming?
Do Spark Streaming and Spark Structured Streaming use the same micro-batch scheduler engine?
Certainly not. They're different internally, but share the same high-level concepts of a stream and a record.
That said, in Spark Structured Streaming you can get close to how things were in Spark Streaming by using the DataStreamWriter.foreach or DataStreamWriter.foreachBatch methods.
The main difference is how the streaming pipeline is described. In Spark Structured Streaming you use Spark SQL's Dataset API, while Spark Streaming bet on Spark Core's RDD API. Both end up as RDD-based computations, but Spark SQL uses higher-level abstractions (e.g. the Dataset API).
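A quick side-by-side sketch of the two APIs (a word count over a socket source in both); the host/port and local SparkSession are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("two-apis").master("local[2]").getOrCreate()
import spark.implicits._

// Spark Streaming: the RDD-based DStream API.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()
ssc.start()

// Structured Streaming: Spark SQL's Dataset API over the same kind of source.
spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()
  .as[String]
  .flatMap(_.split(" "))
  .groupBy("value").count()
  .writeStream.outputMode("complete").format("console").start()
```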
Do they both use a "micro-batch scheduler engine"? Yes, but Spark Structured Streaming is also trying to leverage data sources that can be queried continuously (with no micro-batching).
Does Spark Structured Streaming have lower latency than Spark Streaming?
That'd be hard to answer. The creators of Spark Streaming decided to develop Spark Structured Streaming in the hope of improving query performance and expressiveness. Spark Streaming is no longer recommended.
Structured Streaming is mostly a higher-level abstraction that lets you define your streaming logic and then uses the Spark SQL engine to execute it on the same micro-batch engine.
By default Structured Streaming uses the micro-batch engine; however, if you are using Spark 2.3+, you can use continuous mode and get down to ~1 millisecond latency.
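For reference, a hedged sketch of switching to continuous mode (assuming an existing SparkSession named spark; Kafka endpoints are illustrative): the only change from the default micro-batch engine is the trigger. Continuous processing is experimental and limited to map-like operations and a small set of sources/sinks.

```scala
import org.apache.spark.sql.streaming.Trigger

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "in")
  .load()
  .selectExpr("key", "value")                 // map-like transformations only
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out")
  .option("checkpointLocation", "/tmp/ckpt-continuous")
  .trigger(Trigger.Continuous("1 second"))    // checkpoint interval, not a batch interval
  .start()
```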

Flink or Spark? when streaming is not important

I have recently been comparing Spark and Flink for a brand new project. In this project, the streaming feature is not so important; batch analysis of ~90 TB of data is what matters most. Later I will apply ML and data mining in the data analysis.
When searching around, I find lots of articles, presentations and videos claiming Flink is the next-generation analysis solution, and not many articles defending Spark. On the other hand, Spark is (or was?) very popular and widely deployed in very large production systems.
My question is: for my use case, i.e. where streaming is not important, should I embrace Flink or start with Spark 2?
BTW, I've read through this thread. It doesn't give me a good answer.
Update, April 2018: eventually we chose Spark. Apparently there are more questions to address than just performance. Cloudera, Hortonworks and HDInsight give enterprise architects and security reviewers a good level of confidence/proof in security, stability, scale, roadmap, etc.
As per your requirements, Apache Spark is best. Both Spark and Flink are advanced big data processing technologies. In terms of features, stability, ecosystem, community, integrations with other systems and adaptability, Spark is well ahead of Flink.
The major difference between Spark and Flink: Spark is a batch processing system with a streaming abstraction, whereas Flink is a stream processing system for unbounded datasets, with a batch abstraction for processing bounded datasets in batch style.
Spark is best for ETL, machine learning, streaming, data warehousing and graph processing on large volumes of data. Flink is best for stream processing on large, unbounded datasets.
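To illustrate the "batch system with a streaming abstraction" point on the Spark side: the same DataFrame query can run over a bounded source and, nearly unchanged, over an unbounded one. The path, the column name, and the SparkSession named spark are illustrative assumptions.

```scala
import org.apache.spark.sql.DataFrame

val batchDF  = spark.read.json("/data/events")                                // bounded dataset
val streamDF = spark.readStream.schema(batchDF.schema).json("/data/events")  // unbounded view

// One transformation, two execution modes.
def byCountry(df: DataFrame): DataFrame = df.groupBy("country").count()

byCountry(batchDF).show()                                     // batch execution
byCountry(streamDF).writeStream
  .outputMode("complete").format("console").start()           // streaming execution
```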

KStreams + Spark Streaming + Machine Learning

I'm doing a POC for running a machine learning algorithm on a stream of data.
My initial idea was to take the data and use
Spark Streaming --> aggregate data from several tables --> run MLlib on the stream of data --> produce output.
But I came across KStreams. Now I'm confused!!!
Questions:
1. What is the difference between Spark Streaming and Kafka Streaming?
2. How can I marry KStreams + Spark Streaming + Machine Learning?
3. My idea is to train on the data continuously rather than doing batch training.
First of all, the term "Confluent's Kafka Streaming" is technically not correct.
it's called Kafka's Streams API (aka Kafka Streams)
it's part of Apache Kafka and thus "owned" by the Apache Software Foundation (and not by Confluent)
there is Confluent Open Source and Confluent Enterprise -- two offers from Confluent that both leverage Apache Kafka (and thus, Kafka Streams)
However, Confluent contributes a lot of code to Apache Kafka, including Kafka Streams.
About the differences (I only highlight some main differences; refer to the Internet and the documentation for further details: http://docs.confluent.io/current/streams/index.html and http://spark.apache.org/streaming/):
Spark Streaming:
micro-batching (no real record-by-record stream processing)
no sub-second latency
limited window operations
no event-time processing
processing framework (difficult to operate and to deploy)
part of Apache Spark -- a data processing framework
exactly-once processing
Kafka Streams
record-by-record stream processing
ms latency
rich window operations
stream/table duality
event time, ingestion time, and processing time semantics
Java library (easy to run and deploy -- it's just a Java application as any other)
part of Apache Kafka -- a Stream Processing Platform (ie, it offers storage and processing at once)
at-least-once processing (exactly-once processing is WIP; cf KIP-98 and KIP-129)
elastic, ie, dynamically scalable
Thus, there is no reason to "marry" both -- it's a question of choice which one you want to use.
My personal take is that Spark is not a good solution for stream processing. Whether you want to use a library like Kafka Streams or a framework like Apache Flink, Apache Storm, or Apache Apex (which are all good options for stream processing) depends on your use case (and maybe personal taste) and cannot be answered on SO.
A main differentiator of Kafka Streams is that it is a library and does not require a processing cluster. And because it is part of Apache Kafka, if you already have Apache Kafka in place, this might simplify your overall deployment, as you do not need to run an extra processing cluster.
I have recently presented at a conference about this topic.
Apache Kafka Streams or Spark Streaming is typically used to apply a machine learning model in real time to new events via stream processing (processing data while it is in motion). Matthias' answer already discusses their differences.
On the other side, you first use things like Apache Spark MLlib (or H2O.ai or XYZ) to build the analytic models on historical data sets.
Kafka Streams can be used for online training of models, too, though I think online training has various caveats.
All of this is discussed in more details in my slide deck "Apache Kafka Streams and Machine Learning / Deep Learning for Real Time Stream Processing".
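As a hedged sketch of that train-offline/score-online pattern on the Spark side (the model path, topic, and the assumption that the saved pipeline expects a text column are all illustrative): a fitted PipelineModel transforms a streaming DataFrame the same way it transforms a batch one.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("score-stream").getOrCreate()

// Model trained offline on historical data (e.g. with MLlib) and saved to disk.
val model = PipelineModel.load("/models/text-pipeline")

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS text")   // assumed to match the pipeline's input column

model.transform(input)                           // scores each incoming record
  .select("text", "prediction")
  .writeStream.format("console").start()
```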
Apache Kafka Streams is a library providing an embeddable stream processing engine; it is easy to use in Java applications for stream processing, and it is not a framework.
I found some use cases on when to use Kafka Streams, and also a good comparison with Apache Flink from a Kafka author.
Here are Spark Streaming and KStreams compared from a stream processing point of view.
I've highlighted the significant advantages of Spark Streaming and KStreams below to keep the answer short.
Spark Streaming Advantages over KStreams:
Easy to integrate Spark ML models and graph computing in the same application without writing data out of the application, which means processing is much quicker than writing to Kafka again and reprocessing.
Join non-streaming sources, like the file system and other non-Kafka sources, with stream sources in the same application (see the sketch after this list).
Messages with a schema can easily be processed with everyone's favourite SQL (Structured Streaming).
Graph analysis over streaming data is possible with the built-in GraphX library.
Spark apps can be deployed on an existing YARN or Mesos cluster.
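A minimal sketch of the non-Kafka-source point from the list above (assuming an existing SparkSession named spark; paths, topics, and column names are assumptions): a Kafka stream enriched with a static DataFrame read from the file system, all in one application.

```scala
// Static, non-Kafka source read from the file system.
val users = spark.read.parquet("/warehouse/users")

// Streaming source from Kafka.
val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "clicks")
  .load()
  .selectExpr("CAST(key AS STRING) AS userId", "CAST(value AS STRING) AS url")

// Stream-static join in the same application, no round trip through Kafka.
clicks.join(users, "userId")
  .writeStream.format("console").start()
```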
KStreams Advantages:
A compact library for ETL processing and ML model serving/training on messages, with rich features. So far, both source and target have to be Kafka topics only.
Easy to achieve exactly once semantics.
No separate processing cluster required.
Easy to deploy with Docker, since it's a plain Java application.
