Flink or Spark? When streaming is not important - apache-spark

I have recently been comparing Spark and Flink for a brand new project. In this project, the streaming feature is not that important; batch analysis of ~90 TB of data matters most. Later I will apply ML and data mining to the analysis.
When searching around, I find lots of articles, presentations and videos claiming Flink is the next-generation analysis solution, and not many articles defending Spark. On the other hand, Spark is (or was?) very popular and widely deployed in very large production systems.
My question is: For my use case, i.e. streaming is not important, shall I embrace Flink or start with Spark 2?
BTW, I've read through this thread. It doesn't give me a good answer.
Update, April 2018: Eventually we chose Spark. Apparently there are more questions to address than just performance. Cloudera, Hortonworks and HDInsight give enterprise architects and security reviewers a good level of confidence/proof regarding security, stability, scale, roadmap, etc.

As per your requirements, Apache Spark is the better fit. Both Spark and Flink are advanced big data processing technologies, but in terms of features, stability, ecosystem, community, integrations with other systems and adaptability, Spark is well ahead of Flink.
The major difference between Spark and Flink is that Spark is a batch processing system with a streaming abstraction, whereas Flink is a stream processing system for unbounded datasets with a batch abstraction for processing bounded datasets in batch style.
Spark is best for ETL, Machine Learning, Streaming, data warehousing and Graph processing on large volumes of datasets. Flink is best for stream processing on large and unbounded datasets.

Related

Spark Streaming vs Structured Streaming

For the last few months I've been using Structured Streaming quite a lot to implement streaming jobs (after using Kafka a lot). After reading the book Stream Processing with Apache Spark I had this question: is there any point or use case where I would use Spark Streaming instead of Structured Streaming? Should I invest some time getting into it, or, since I'm already using Structured Streaming, should I stick with it because there is no benefit in the previous API?
Would appreciate any opinion/insight
Sharing my personal experience.
Structured Streaming is the future for Spark-based streaming implementations. It provides a higher level of abstraction and other great features. However, there are a few restrictions.
I have had to switch to Spark Streaming on a few occasions because of the flexibility it offers. One recent example: we had to perform joins with static reference data, but outer joins are not supported in Structured Streaming; this can be accomplished with Spark Streaming.
With the newer Spark version 2.4, Structured Streaming is much improved, with support for the foreachBatch sink, which gives flexibility similar to Spark Streaming (see the sketch below).
My personal view is that having knowledge of Spark Streaming is helpful, and you might have to use it depending on your use case.
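To make the foreachBatch point concrete, here is a minimal PySpark sketch, assuming made-up paths, a toy rate source and an invented join key: inside foreachBatch the micro-batch is a plain DataFrame, so join types that the streaming API does not support can be done with ordinary batch code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-batch-demo").getOrCreate()

# Static reference data and a toy streaming source (both are placeholders).
reference = spark.read.parquet("/data/reference")            # assumed path
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

def join_and_write(batch_df, batch_id):
    # Within foreachBatch the micro-batch is an ordinary DataFrame, so any
    # batch operation (including a left outer join) is available here.
    joined = batch_df.join(reference, batch_df["value"] == reference["id"], "left_outer")
    joined.write.mode("append").parquet("/output/joined")     # assumed sink

query = events.writeStream.foreachBatch(join_and_write).start()
query.awaitTermination()
```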

Apache Spark Structured Streaming vs Apache Flink: what is the difference?

We have discussed the questions below:
What is the difference between Apache Spark and Apache Flink? [closed]
What does “streaming” mean in Apache Spark and Apache Flink?
What is the difference between mini-batch vs real time streaming in practice (not theory)?
But Spark Structured Streaming became production-ready in Spark 2.2; it brings a lot of changes for streaming, and it is outstanding.
Can we say Spark Structured Streaming is true stream processing, or is it still batch processing?
Now what is the big difference between Apache Flink and Apache Spark Structured Streaming?
Currently:
Spark Structured Streaming still uses micro-batches under the hood. However, it supports event-time processing and quite low latency (though not as low as Flink's), and it supports SQL and type-safe queries on streams in a single API; there is no distinction, and every Dataset can be queried both with SQL and with type-safe operators. It has end-to-end exactly-once semantics (at least they say so ;) ). Throughput is better than Flink's (there have been benchmarks with different results, but look at the Databricks post about them).
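To illustrate that single API, here is a hedged PySpark sketch with a made-up rate source: the same streaming DataFrame gets an event-time window with a watermark and is also queried through plain SQL (the type-safe Dataset operators mentioned above live in the Scala/Java API).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# Toy streaming source; any source with an event-time column works the same way.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumnRenamed("timestamp", "event_time"))

# Event-time windowed aggregation with a watermark bounding late data.
windowed_counts = (events
                   .withWatermark("event_time", "10 minutes")
                   .groupBy(window(col("event_time"), "5 minutes"))
                   .count())

# The very same streaming DataFrame can also be queried with SQL.
events.createOrReplaceTempView("events")
sql_counts = spark.sql("SELECT COUNT(*) AS c FROM events")

query = windowed_counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```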
In the near future:
Spark Continuous Processing Mode is in progress, and it will give Spark ~1 ms latency, comparable to Flink's. However, as I said, it's still in progress. The API is ready for non-batch jobs, so this is easier to do than with the previous Spark Streaming.
The main difference:
Spark relies on micro-batching for now, while Flink has pre-scheduled operators. That means Flink's latency is lower, but the Spark community is working on Continuous Processing Mode, which (as far as I understand) will work similarly to receivers.

KStreams + Spark Streaming + Machine Learning

I'm doing a POC for running Machine Learning algorithm on stream of data.
My initial idea was to take data, use
Spark Streaming --> Aggregate Data from several tables --> run MLLib on Stream of Data --> Produce Output.
But then I came across KStreams. Now I'm confused!!!
Questions :
1. What is difference between Spark Streaming and Kafka Streaming ?
2. How can I marry KStreams + Spark Streaming + Machine Learning ?
3. My idea is to train the model continuously on incoming data rather than doing batch training.
First of all, the term "Confluent's Kafka Streaming" is technically not correct.
it's called Kafka's Streams API (aka Kafka Streams)
it's part of Apache Kafka and thus "owned" by the Apache Software Foundation (and not by Confluent)
there is Confluent Open Source and Confluent Enterprise -- two offers from Confluent that both leverage Apache Kafka (and thus, Kafka Streams)
However, Confluent contributes a lot of code to Apache Kafka, including Kafka Streams.
About the differences (I only highlight some main differences and refer to the Internet and documentation for further details: http://docs.confluent.io/current/streams/index.html and http://spark.apache.org/streaming/)
Spark Streaming:
micro-batching (no real record-by-record stream processing)
no sub-second latency
limited window operations
no event-time processing
processing framework (difficult to operate and to deploy)
part of Apache Spark -- a data processing framework
exactly-once processing
Kafka Streams
record-by-record stream processing
ms latency
rich window operations
stream/table duality
event time, ingestion time, and processing time semantics
Java library (easy to run and deploy -- it's just a Java application as any other)
part of Apache Kafka -- a Stream Processing Platform (ie, it offers storage and processing at once)
at-least-once processing (exactly-once processing is WIP; cf KIP-98 and KIP-129)
elastic, ie, dynamically scalable
Thus, there is no reason to "marry" the two; it's a question of which one you want to use.
My personal take is that Spark is not a good solution for stream processing. Whether you want to use a library like Kafka Streams or a framework like Apache Flink, Apache Storm, or Apache Apex (which are all good options for stream processing) depends on your use case (and maybe personal taste) and cannot be answered on SO.
A main differentiator of Kafka Streams is that it is a library and does not require a processing cluster. And because it is part of Apache Kafka, if you already have Apache Kafka in place this might simplify your overall deployment, as you do not need to run an extra processing cluster.
I have recently presented at a conference about this topic.
Apache Kafka Streams or Spark Streaming are typically used to apply a machine learning model in real time to new events via stream processing (process data while it is in motion). Matthias answer already discusses their differences.
On the other side, you first use things like Apache Spark MLlib (or H2O.ai or XYZ) to build the analytic models, using historical data sets.
Kafka Streams can be used for online training of models, too, though I think online training has various caveats.
All of this is discussed in more details in my slide deck "Apache Kafka Streams and Machine Learning / Deep Learning for Real Time Stream Processing".
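A minimal PySpark sketch of that split, assuming an invented historical dataset, feature columns and Kafka topic (the Kafka source additionally requires the spark-sql-kafka package on the classpath): the model is built offline with MLlib, saved, and then applied to a stream of new events.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train-offline-score-online").getOrCreate()

# 1) Offline: train an analytic model on a historical data set (path/columns assumed).
history = spark.read.parquet("/data/history")   # columns: f1, f2, label
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])
pipeline.fit(history).write().overwrite().save("/models/demo")

# 2) Online: apply the saved model to new events arriving on a Kafka topic.
model = PipelineModel.load("/models/demo")
schema = StructType([StructField("f1", DoubleType()), StructField("f2", DoubleType())])
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "events")                      # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

scored = model.transform(stream)
scored.writeStream.format("console").outputMode("append").start().awaitTermination()
```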
Apache Kafka Streams is a library, not a framework: it provides an embeddable stream processing engine that is easy to use in Java applications for stream processing.
I found some use cases describing when to use Kafka Streams, and also a good comparison with Apache Flink from a Kafka author.
Here are Spark Streaming and KStreams in one picture, from a stream processing point of view.
I have highlighted the significant advantages of Spark Streaming and KStreams here to keep the answer short.
Spark Streaming Advantages over KStreams:
Easy to integrate Spark ML models and graph computing in the same application without writing data outside of the application, which means you process it much more quickly than writing to Kafka again and processing from there.
Join non-streaming sources such as file systems and other non-Kafka sources with streaming sources in the same application (see the sketch after this list).
Messages with a schema can easily be processed with familiar SQL (Structured Streaming).
Graph analysis over streaming data is possible with the built-in GraphX library.
Spark apps can be deployed on an existing YARN or Mesos cluster (if you have one).
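Here is a hedged PySpark sketch of two of those advantages together (broker, topic, file path and column names are invented, and the Kafka source needs the spark-sql-kafka package): a Kafka stream is joined with a static, non-Kafka source and the result is aggregated with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# Static, non-Kafka source: a hypothetical parquet file of user profiles.
users = spark.read.parquet("/data/users")        # columns: user_id, country

# Streaming source: a Kafka topic of click events keyed by user id.
clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .selectExpr("CAST(key AS STRING) AS user_id"))

# Stream-static join, then ordinary SQL over the enriched stream via a temp view.
clicks.join(users, "user_id").createOrReplaceTempView("enriched_clicks")
per_country = spark.sql(
    "SELECT country, COUNT(*) AS clicks FROM enriched_clicks GROUP BY country")

query = per_country.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```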
KStreams Advantages:
Compact library for ETL processing and ML model serving/training on messages, with rich features. So far, both source and target have to be Kafka topics only.
Easy to achieve exactly-once semantics.
No separate processing cluster required.
Easy to deploy on Docker, since it's a plain Java application.

Apache Spark vs Apache Ignite [closed]

Currently I'm studying the Apache Spark and Apache Ignite frameworks.
Some principal differences between them are described in this article, ignite vs spark, but I realized that I still don't understand their purposes.
I mean, for which problems is Spark preferable over Ignite, and vice versa?
I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations.
Another common use for Ignite is distributed caching, which is often used to improve performance of applications that interact with relational databases or any other data sources.
Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time. Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability.
Apache Spark (a cluster computing framework) is a fast, in-memory data processing engine with expressive development APIs that allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited for high-performance computing and machine learning algorithms.
Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
Conclusion:
Ignite and Spark are both in-memory computing solutions but they target different use cases.
In many cases, they are used together to achieve superior results:
Ignite can provide shared storage, so the state can be passed from one Spark application or job to another.
Ignite can provide SQL with indexing so Spark SQL can be accelerated over 1,000x (spark doesn’t index the data)
When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications
Do Spark and Ignite work together?
Yes, Spark and Ignite work together.
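As a sketch of that combination, assuming the ignite-spark integration module is on the Spark classpath and an Ignite node is running (the config path and table name below are placeholders), Spark can read an Ignite SQL table as a DataFrame and let filters be served by Ignite's indexes:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-reads-ignite").getOrCreate()

# Option keys follow Ignite's DataFrame integration (IgniteDataFrameSettings);
# the XML config file and the table are placeholders.
persons = (spark.read
           .format("ignite")
           .option("config", "/etc/ignite/default-config.xml")
           .option("table", "PERSON")
           .load())

# Filters on indexed columns can be pushed down to Ignite's own SQL indexes,
# which is where the claimed speed-up over a plain Spark scan comes from.
persons.filter("AGE > 30").show()
```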
In short
Ignite vs. Spark
Ignite is an in-memory distributed database more focused on data storage; it handles transactional updates on data and then serves client requests. Apache Spark is an MPP compute engine which is more inclined towards analytics, ML, graph, and ETL-specific payloads.
In detail
Apache Spark is an OLAP tool
Apache Spark is a general-purpose cluster computing system. It's an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
(Diagrams omitted: Spark with other components; deployment topology.)
Spark on YARN topology is discussed here.
Apache Ignite is an OLTP tool
Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources.
(Diagrams omitted: SQL overview; deployment topology.)
Apache Spark is a processing framework. You tell it where to get data, provide some code about how to process that data, and then tell it where to put the results. It's a way to easily reliably run computing logic across a bunch of nodes in a cluster on data from any source (which is then kept in-memory during processing). It's primarily meant for large-scale analysis on data from various sources (even from multiple databases at once), or from streaming sources like Kafka. It can also be used for ETL, like transforming and joining data together before putting the final results in some other database system.
Apache Ignite is more of an in-memory distributed database, at least that's how it started. It has a key/value and SQL API, so you can store and read data in various ways, and run queries like you would any other SQL database. It also supports running your own code (similar to Spark) so you can do processing that wouldn't really work with SQL, while also reading and writing the data all in the same system. It also can read/write data to other database systems while acting as a cache layer in the middle. Eventually, as of 2018, it also supports on-disk storage so now you can use it as an all-in-one distributed database, cache, and processing framework.
Apache Spark is still better for more complex analytics, and you can have Spark read data from Apache Ignite, but for many scenarios it's now possible to consolidate processing and storage into a single system with Apache Ignite.
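A small sketch of the key/value plus SQL usage described above, using the Python thin client (pyignite); the host, cache name and table are assumptions, and an Ignite node must be listening on the default thin-client port.

```python
from pyignite import Client

# Connect to a local Ignite node on the default thin-client port (assumed).
client = Client()
client.connect("127.0.0.1", 10800)

# Key/value API: Ignite as a distributed cache / data grid.
cache = client.get_or_create_cache("my_cache")
cache.put(1, "hello")
print(cache.get(1))

# SQL API over the same cluster (table name is hypothetical).
client.sql("CREATE TABLE IF NOT EXISTS city (id INT PRIMARY KEY, name VARCHAR)")
client.sql("INSERT INTO city (id, name) VALUES (?, ?)", query_args=[1, "Oslo"])
for row in client.sql("SELECT id, name FROM city"):
    print(row)

client.close()
```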
Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
By Nikita Ivanov: http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/
Although both Apache Spark and Apache Ignite utilize the power of in-memory computing, they address somewhat different use cases and rarely “compete” for the same task. Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
I am late to answer this question, but let me try to share my view on this.
Ignite may not be ready for production use in enterprise applications, as some important features such as security are only available in GridGain (a wrapper over Ignite).
A complete list of features can be found at the link below:
https://www.gridgain.com/products/gridgain-vs-ignite

Google Dataflow vs Apache Spark

I am surveying Google Dataflow and Apache Spark to decide which is the more suitable solution for our big data analysis needs.
I found that the Spark platform has Spark SQL and MLlib for structured data queries and machine learning.
I wonder whether there is a corresponding solution on the Google Dataflow platform?
It would help if you could expand a bit on your specific use case(s). What are you trying to accomplish in relation to "Bigdata analysis"? The short answer... it depends :-)
Here are some key architectural points to consider in relation to Google Cloud Dataflow v. Spark and Hadoop MR.
Resource Mgmt: Cloud Dataflow is a completely on-demand execution environment. Specifically, when you execute a job in Dataflow, the resources are allocated on demand for that job only. There is no sharing/contention of resources across jobs. In comparison, with a Spark or MapReduce cluster you would typically deploy a cluster of X nodes, then submit jobs and tune the node resources across jobs. Of course you can build up and tear down these clusters, but the Dataflow model is geared towards hands-free dev ops in relation to resource management. If you want to optimize resource usage to job demands, Dataflow is a solid model to control cost and nearly forget about resource tuning. If you prefer a multi-tenant style cluster, I'd suggest you look at Google Cloud Dataproc, as it provides the on-demand cluster management aspects like Dataflow, but focused on classic Hadoop workloads like MR, Spark, Pig, ...
Interactivity: Currently Cloud Dataflow does not provide an interactive mode. Once you submit a job, the worker resources are bound to the graph that was submitted, and the majority of the data is loaded into resources as needed. Spark can be a better model if you want to load data into the cluster via in-memory RDDs and then dynamically execute queries. The challenge is that as your data sizes and query complexity increase, you will have to handle the devOps. Now, if most of your queries can be expressed in SQL syntax, you may want to look at BigQuery. BigQuery provides the "on demand" aspects of Dataflow and enables you to interactively execute queries over massive amounts of data, e.g. petabytes. The biggest advantage of BigQuery, in my opinion, is that you do not have to think/worry about hardware allocation to deal with your data sizes. Meaning as your data sizes grow you don't have to think about hardware (memory and disk size) configuration.
Programming Model: Dataflow's programming model is functionally biased vs. a classic MapReduce model. There are many similarities between Spark and Dataflow in terms of API primitives. Things to consider: 1) Dataflow's primary programming language is Java; a Python SDK is in the works. The Dataflow Java SDK is open sourced and has been ported to Scala. Today, Spark has more SDK surface choice with GraphX, Streaming, Spark SQL, and ML. 2) Dataflow is a unified programming model for batch- and streaming-based DAG development. The goal was to remove the complexity and cost of switching when moving between batch and streaming models; the same graph can seamlessly run in either mode. 3) Today, Cloud Dataflow does not support converging/iterative graph execution. If you need the power of something like MLlib, then Spark is the way to go. Keep in mind this is the state of things today.
Streaming & Windowing: Dataflow (building on top of the unified programming model) was architected to be a highly reliable, durable, and scalable execution environment for streaming. One of the key differences between Dataflow and Spark is that Dataflow enables you to easily process data in terms of its true event time rather than solely processing it at its arrival time into the graph. You can window data into fixed, sliding, session or custom windows based on event time or arrival time. Dataflow also provides Triggers (applied to Windows) that enable you to control how you want to handle late-arriving data. Net-net, you dial in the level of correctness control to meet the needs of your analysis. For example, let's say you have a mobile game that interacts with 100 edge nodes. These nodes create tens of thousands of events per second related to game play. Let's say a group of nodes can't communicate with your back-end streaming analysis system. In the case of Dataflow, once that data does arrive, you can control how you'd like to handle it in relation to your query correctness needs. Dataflow also provides the ability to upgrade your streaming jobs while they are in flight. For example, let's say you discover a logical bug in a transform. You can upgrade your in-flight job without losing your existing windowed state. Net-net, you can keep your business running.
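For a feel of that model, here is a minimal sketch in today's Apache Beam Python SDK (the programming model behind Dataflow); the Pub/Sub topic is hypothetical, and the GCP extras of the SDK are assumed to be installed.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# streaming=True makes this a streaming pipeline; the same graph could run in batch mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read"   >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/game-events")
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute fixed windows
     | "Pair"   >> beam.Map(lambda msg: (msg.decode("utf-8"), 1))
     | "Count"  >> beam.CombinePerKey(sum)
     | "Print"  >> beam.Map(print))
```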
Net-net:
- if you are really primarily doing ETL-style work (filtering, shaping, joining, ...) or batch-style MapReduce, Dataflow is a great path if you want minimal devOps.
- if you need to implement ML-style graphs, go the Spark path and give Dataproc a try
- if you are doing ML and you first need to do ETL to clean up your training data, implement a hybrid with Dataflow and Dataproc
- if you need interactivity, Spark is a solid choice, but so is BigQuery if you can express your queries in SQL
- if you need to process your ETL and/or MR jobs over streams, Dataflow is a solid choice.
So... what are your scenarios?
I've tried both:
Dataflow is still very young; there is no "out-of-the-box" solution for doing ML with it (even though you could implement algorithms in transforms), but you could output the processed data to cloud storage and read it later with another tool.
Spark would be recommended, but you would have to manage your cluster yourself.
However, there is a good alternative: Google Dataproc.
You can develop analysis tools with Spark and deploy them with one command on your cluster; Dataproc will manage the cluster itself without you having to tweak the configuration.
I have built code using Spark and Dataflow. Let me share my thoughts.
Spark/Dataproc: I have used Spark (PySpark) a lot for ETL. You can use SQL and any programming language of your choice. Lots of functions are available (including window functions). Build your DataFrame, write your transformations, and it can be super fast. Once data is cached, any operation on the DataFrame will be quick.
You can simply build a Hive external table over data on GCS, then use Spark for ETL and load the data into BigQuery. This is for batch processing.
For streaming, you can use Spark Streaming and load the data into BigQuery.
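A hedged sketch of that batch path, assuming a Dataproc-style cluster where Hive support and the spark-bigquery connector are available, and using invented table, bucket and column names:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gcs-hive-to-bigquery")
         .enableHiveSupport()          # needed to read the Hive external table
         .getOrCreate())

# Hive external table defined over files in GCS (table and columns are invented).
events = spark.sql(
    "SELECT user_id, event_type, ts FROM ext_events WHERE dt = '2018-04-01'")

# Write the transformed result to BigQuery via the spark-bigquery connector;
# the connector stages data through the temporary GCS bucket given below.
(events.write
 .format("bigquery")
 .option("table", "analytics.daily_events")
 .option("temporaryGcsBucket", "my-temp-bucket")
 .mode("append")
 .save())
```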
Now, if you already have a cluster, you have to think about whether to move to Google Cloud or not. I found the Dataproc (Google Cloud Hadoop/Spark) offering better, as you don't have to worry about much of the cluster management.
Dataflow: it's known as Apache Beam. Here you can write your code in Java/Python or any other supported language, and you can execute it on any supported runner (Spark/MR/Flink). It is a unified model, where you can do both batch processing and stream data processing.
Google now offers both programming models, MapReduce and Spark: they are Cloud Dataflow and Cloud Dataproc respectively.
