Apache Spark vs Apache Ignite [closed] - apache-spark

Currently I'm studying the Apache Spark and Apache Ignite frameworks.
Some principal differences between them are described in this article (ignite vs spark), but I realized that I still don't understand their purposes.
I mean, for which problems is Spark preferable to Ignite, and vice versa?

I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations.
Another common use for Ignite is distributed caching, which is often used to improve performance of applications that interact with relational databases or any other data sources.
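To make the caching use case concrete, here is a minimal single-process sketch of the read-through pattern that a distributed cache applies in front of a slower data source. It does not use the Ignite API; the class ReadThroughCache, the function slow_db_lookup and the sample rows are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional


class ReadThroughCache:
    """Toy stand-in for a cache that sits in front of a slower data source."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self._loader = loader  # called only when a key is not cached yet
        self._entries: Dict[str, Optional[str]] = {}
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._loader(key)
        return self._entries[key]


# Hypothetical "database": a dict, plus a log of every lookup that reaches it.
db_rows = {"user:1": "alice", "user:2": "bob"}
db_calls: List[str] = []


def slow_db_lookup(key: str) -> Optional[str]:
    db_calls.append(key)
    return db_rows.get(key)


cache = ReadThroughCache(slow_db_lookup)
first = cache.get("user:1")   # miss: goes to the database
second = cache.get("user:1")  # hit: served from memory
```

Only the first read reaches the backing store. A real distributed cache adds partitioning, replication and eviction on top of this idea.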

Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time. Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability.
Apache Spark (a cluster computing framework) is a fast, in-memory data processing engine with expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited for high-performance computing and machine learning algorithms.
Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
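The contrast between read-only datasets and transactional updates can be sketched in plain Python. This is a toy, single-process illustration under stated assumptions, not Spark or Ignite code: ToyTransactionalStore and its transfer method are hypothetical, and the snapshot-based rollback only mimics the all-or-nothing property of a real ACID transaction.

```python
from typing import Dict

# Spark-style: the dataset is immutable; a transformation yields a new dataset.
balances = (("alice", 100), ("bob", 50))
with_bonus = tuple((name, amount + 10) for name, amount in balances)


# Ignite-style (toy): values are updated in place, all or nothing.
class ToyTransactionalStore:
    def __init__(self, data: Dict[str, int]) -> None:
        self._data = dict(data)

    def get(self, key: str) -> int:
        return self._data[key]

    def transfer(self, src: str, dst: str, amount: int) -> bool:
        snapshot = dict(self._data)  # kept so a failed transfer can be undone
        try:
            self._data[src] -= amount
            if self._data[src] < 0:
                raise ValueError("insufficient funds")
            self._data[dst] += amount
            return True
        except (ValueError, KeyError):
            self._data = snapshot  # roll back every partial change
            return False


store = ToyTransactionalStore({"alice": 100, "bob": 50})
ok = store.transfer("alice", "bob", 30)         # commits
rejected = store.transfer("alice", "bob", 500)  # rolled back, nothing changes
```

The first half never changes balances; the second half either applies both sides of a transfer or neither.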
Conclusion:
Ignite and Spark are both in-memory computing solutions but they target different use cases.
In many cases, they are used together to achieve superior results:
Ignite can provide shared storage, so the state can be passed from one Spark application or job to another.
Ignite can provide SQL with indexing, so Spark SQL can be accelerated by over 1,000x (Spark doesn’t index the data).
When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications

Do Spark and Ignite work together?
Yes, Spark and Ignite work together.
In short
Ignite vs. Spark
Ignite is an in-memory distributed database that is more focused on data storage: it handles transactional updates on data and then serves client requests. Apache Spark is an MPP compute engine that is more inclined towards analytics, ML, graph, and ETL-specific payloads.
In detail
Apache Spark is an OLAP tool
Apache Spark is a general-purpose cluster computing system. It's an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark with other components
Deployment topology
Spark on YARN topology is discussed here.
Apache Ignite is an OLTP tool
Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources.
SQL Overview
Deployment topology

Apache Spark is a processing framework. You tell it where to get data, provide some code about how to process that data, and then tell it where to put the results. It's a way to easily and reliably run computing logic across a bunch of nodes in a cluster on data from any source (which is then kept in memory during processing). It's primarily meant for large-scale analysis of data from various sources (even from multiple databases at once), or from streaming sources like Kafka. It can also be used for ETL, like transforming and joining data together before putting the final results in some other database system.
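The three steps described here (where to get the data, how to process it, where to put the results) can be sketched with the Python standard library alone. This is not Spark code and runs in a single process; SOURCE_CSV, read_source, total_by_region and write_sink are hypothetical names, and the CSV text stands in for whatever real source and sink a job would use.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input; a real job would read files, a database, or a stream.
SOURCE_CSV = """region,amount
eu,10
us,25
eu,5
"""


def read_source(text):
    """Step 1: where to get the data."""
    return csv.DictReader(io.StringIO(text))


def total_by_region(rows):
    """Step 2: how to process it (a group-and-sum aggregation)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += int(row["amount"])
    return dict(totals)


def write_sink(totals):
    """Step 3: where to put the results (here, CSV text)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["region", "total"])
    for region in sorted(totals):
        writer.writerow([region, totals[region]])
    return out.getvalue()


result = write_sink(total_by_region(read_source(SOURCE_CSV)))
```

A processing framework keeps this shape but runs the middle step in parallel across the nodes of a cluster.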
Apache Ignite is more of an in-memory distributed database, at least that's how it started. It has a key/value and SQL API, so you can store and read data in various ways, and run queries like you would any other SQL database. It also supports running your own code (similar to Spark) so you can do processing that wouldn't really work with SQL, while also reading and writing the data all in the same system. It also can read/write data to other database systems while acting as a cache layer in the middle. Eventually, as of 2018, it also supports on-disk storage so now you can use it as an all-in-one distributed database, cache, and processing framework.
Apache Spark is still better for more complex analytics, and you can have Spark read data from Apache Ignite, but for many scenarios it's now possible to consolidate processing and storage into a single system with Apache Ignite.

Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
By Nikita Ivanov: http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/

Although both Apache Spark and Apache Ignite utilize the power of in-memory computing, they address somewhat different use cases and rarely “compete” for the same task. Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don’t support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads as well as fully ACID compliant transactions (OLTP)
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.

I am late to answer this question, but let me try to share my view on this.
Ignite may not be ready for production use in enterprise applications, as some important features such as security are only available in GridGain (a wrapper over Ignite).
The complete list of features can be found at the link below:
https://www.gridgain.com/products/gridgain-vs-ignite

Related

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is, as long as I can use a Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if your schema is not designed for the queries you need.
If you want to use your data without restrictions, run analytical workloads on your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data, whereas Spark is for performing some computation on top of it. Analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
Especially with computations, when using the DataStax Cassandra connector, data locality can be exploited. If you need to do some computation which modifies a row (but doesn't really depend on anything else), then that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the actions (some function which modifies the data) are done locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any operations on the data, you can still use Spark for other purposes, like the ones I mention below.
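The locality idea can be sketched without any cluster: each node summarizes its own rows, and only the small summaries travel to the client. This is a plain-Python illustration, not the DataStax connector API; the partitions layout and the functions local_summary and combine are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: each node holds its own slice of sensor readings.
partitions: Dict[str, List[float]] = {
    "node-a": [20.5, 21.0, 19.5],
    "node-b": [30.0, 31.0],
    "node-c": [25.0],
}


def local_summary(readings: List[float]) -> Tuple[float, int]:
    """Runs where the data lives; returns only (sum, count), not the rows."""
    return (sum(readings), len(readings))


def combine(summaries: Iterable[Tuple[float, int]]) -> float:
    """Runs on the client; merges the small per-node results into one mean."""
    summaries = list(summaries)
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count


summaries = [local_summary(rows) for rows in partitions.values()]
mean_reading = combine(summaries)
```

Six readings stay on their nodes; three two-element tuples are all that would cross the network.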
Spark Streaming can be used to ingest data into or export data from Cassandra (I used it a lot personally). The same data import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra contains fewer than 20 lines of code, with multi-machine multithreading built in and an admin UI where I can see the job progress.
With Spark+Zeppelin, we can visualize Cassandra data using Spark; we can build beautiful UIs with little Spark code where users can even enter input and see the result as a graph, table, etc.
Note: Actually, visualization can be better with Kibana/ElasticSearch or Solr/Banana when used with Cassandra, but they are very hard to set up and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra offers fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra without legacy.
As a use case, consider a video application that uploads video via streaming and stores it directly in a Cassandra blob.

Can we use Apache Spark to store Data? or is it only a Data processing tool?

I am new to Apache Spark. I would like to know whether it is possible to store data using Apache Spark, or whether it is only a processing tool.
Thanks for spending your time,
Satya
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
In a real-life use case you usually have a database or data repository from which you access data in Spark.
Spark can access data that's in:
SQL Databases (Anything that can be connected using JDBC driver)
Local files
Cloud storage (eg. Amazon S3)
NoSQL databases.
Hadoop File System (HDFS)
and many more...
Detailed description can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
Apache Spark is primarily a processing engine. It works with underlying file systems such as HDFS, S3 and other supported file systems. It can read data from relational databases as well. But primarily it is an in-memory distributed processing tool.
As you can read on Wikipedia, Apache Spark is defined as:
an open source cluster computing framework
When we talk about computing, we mean a processing tool. In essence, it lets you work in a pipeline scheme (somewhat like ETL): you read the dataset, you process the data, and then you store the processed data, or models that describe the data.
If your main objective is to distribute your data, there are some good alternatives like HDFS (Hadoop File System), and others.

HBase or Cassandra?

In my lambda architecture, I am debating whether to use HDFS or Cassandra to store my immutable data. I need Cassandra to serve the online requests etc., so it is a mandatory part of the tech stack. Now, I do not want to introduce a new tool (HDFS) into the stack if I don't have to. So my question is, what will I be missing if I don't use HDFS and use Cassandra to host my immutable data as well?
EDIT:
I understand HDFS is a distributed filesystem and Cassandra is a NoSQL DB. Still, both support data replication and both support high-throughput writes. In addition, Cassandra supports low-latency data retrieval. So am I right in saying that HDFS isn't going to provide me much lift?
As I understand it, you are trying to clarify the serving layer of your Lambda Architecture.
If that is true, you want to store your batch views and real-time views in a database.
And as I understand it, you do not have a Hadoop cluster in your batch layer.
So your batch views are not being computed in HDFS.
At this point your architecture is outside of HDFS.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
If you don't want a Hadoop cluster, omit HBase.
Cassandra is a distributed NoSQL database (column-oriented), and it works outside the Hadoop cluster and HDFS.
If I understand your architecture and your needs right, I think Cassandra is best for you.
Additionally, you can get quick info about the Lambda Architecture from this link:
http://artofbigdata.blogspot.com.tr/2016/01/lambda-architecture.html
HDFS supports different file formats for storage, for example sequence files, Avro and Parquet, so you can choose a file format suitable to your application needs.
Also note that you can efficiently read the data using SQL-like queries.
So HDFS offers more data models than Cassandra for hosting the data.

Google Dataflow vs Apache Spark

I am surveying Google Dataflow and Apache Spark to decide which one is the more suitable solution for our big data analysis business needs.
I found that the Spark platform has Spark SQL and MLlib for structured data queries and machine learning.
I wonder whether there is any corresponding solution in the Google Dataflow platform?
It would help if you could expand a bit on your specific use case(s). What are you trying to accomplish in relation to "Bigdata analysis"? The short answer... it depends :-)
Here are some key architectural points to consider in relation to Google Cloud Dataflow v. Spark and Hadoop MR.
Resource Mgmt: Cloud Dataflow is a completely on-demand execution environment. Specifically, when you execute a job in Dataflow, the resources are allocated on demand for that job only. There is no sharing/contention of resources across jobs. By comparison, with a Spark or MapReduce cluster you would typically deploy a cluster of X nodes, submit jobs, and then tune the node resources across jobs. Of course you can build up and tear down these clusters, but the Dataflow model is geared towards hands-free DevOps in relation to resource management. If you want to match resource usage to job demands, Dataflow is a solid model to control cost and nearly forget about resource tuning. If you prefer a multi-tenant style cluster, I'd suggest you look at Google Cloud Dataproc, as it provides the on-demand cluster management aspects of Dataflow but is focused on classic Hadoop workloads like MR, Spark, Pig, ...
Interactivity: Currently Cloud Dataflow does not provide an interactive mode. This means that once you submit a job, the work resources are bound to the graph that was submitted AND the majority of the data is loaded into resources as needed. Spark can be a better model if you want to load data into the cluster via in-memory RDDs and then dynamically execute queries. The challenge is that as your data sizes and query complexity increase, you will have to handle the DevOps. Now, if most of your queries can be expressed in SQL syntax, you may want to look at BigQuery. BigQuery provides the "on demand" aspects of Dataflow and enables you to interactively execute queries over massive amounts of data, e.g. petabytes. The biggest advantage of BigQuery, in my opinion, is that you do not have to think or worry about hardware allocation to deal with your data sizes. As your data sizes grow, you don't have to think about hardware (memory and disk size) configuration.
Programming Model: Dataflow's programming model is functionally biased vs. a classic MapReduce model. There are many similarities between Spark and Dataflow in terms of API primitives. Things to consider: 1) Dataflow's primary programming language is Java. There is a Python SDK in the works. The Dataflow Java SDK is open sourced and has been ported to Scala. Today, Spark has more SDK surface choice with GraphX, Streaming, Spark SQL, and ML. 2) Dataflow is a unified programming model for batch and streaming based DAG development. The goal was to remove the complexity and switching cost of moving between batch and streaming models. The same graph can seamlessly run in either mode. 3) Today, Cloud Dataflow does not support converging/iterative graph execution. If you need the power of something like MLlib, then Spark is the way to go. Keep in mind this is the state of things today.
Streaming & Windowing: Dataflow (building on top of the unified programming model) was architected to be a highly reliable, durable, and scalable execution environment for streaming. One of the key differences between Dataflow and Spark is that Dataflow enables you to easily process data in terms of its true event time vs. solely processing it at its arrival time into the graph. You can window data into fixed, sliding, session or custom windows based on event time or arrival time. Dataflow also provides Triggers (applied to Windows) that enable you to control how you want to handle late-arriving data. Net-net, you dial in the level of correctness control to meet the needs of your analysis. For example, let's say you have a mobile game that interacts with 100 edge nodes. These nodes create tens of thousands of events per second related to game play. Let's say a group of nodes can't communicate with your back-end streaming analysis system. In the case of Dataflow, once that data does arrive, you can control how you'd like to handle the data in relation to your query correctness needs. Dataflow also provides the ability to upgrade your streaming jobs while they are in flight. For example, let's say you discover a logical bug in a transform. You can upgrade your in-flight job without losing your existing windowed state. Net-net, you can keep your business running.
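The difference between event time and arrival time is easy to see in a small sketch. The following is plain Python, not the Dataflow SDK; the sample events, WINDOW_SECONDS and count_per_window are assumptions made for illustration, and it shows fixed windows only (no triggers or late-data policies).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60  # fixed window size; an arbitrary choice for this sketch

# (event_time, arrival_time, node) in seconds. The last event happened at
# t=50 but only arrived at t=200, e.g. because its node was offline.
events: List[Tuple[int, int, str]] = [
    (10, 11, "node-1"),
    (55, 56, "node-2"),
    (70, 71, "node-1"),
    (50, 200, "node-3"),
]


def count_per_window(timestamps: List[int]) -> Dict[int, int]:
    """Counts events per fixed window, keyed by the window start time."""
    counts: Dict[int, int] = defaultdict(int)
    for ts in timestamps:
        counts[(ts // WINDOW_SECONDS) * WINDOW_SECONDS] += 1
    return dict(counts)


by_event_time = count_per_window([event for event, _, _ in events])
by_arrival_time = count_per_window([arrival for _, arrival, _ in events])
```

Grouped by event time, the late event is counted in the first window, where it actually happened; grouped by arrival time, it lands in a much later window and the first window is undercounted.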
Net-net:
- if you are primarily doing ETL-style work (filtering, shaping, joining, ...) or batch-style MapReduce, Dataflow is a great path if you want minimal DevOps.
- if you need to implement ML-style graphs, go the Spark path and give Dataproc a try
- if you are doing ML and you first need to do ETL to clean up your training data, implement a hybrid with Dataflow and Dataproc
- if you need interactivity, Spark is a solid choice, but so is BigQuery if you can express your queries in SQL
- if you need to process your ETL and/or MR jobs over streams, Dataflow is a solid choice.
So... what are your scenarios?
I've tried both :
Dataflow is still very young, and there is no "out-of-the-box" solution for doing ML with it (even though you could implement algorithms in transforms). You could output the processed data to cloud storage and read it later with another tool.
Spark would be recommended, but you would have to manage your cluster yourself.
However, there is a good alternative: Google Dataproc
You can develop analysis tools with Spark and deploy them with one command on your cluster; Dataproc will manage the cluster itself, so you don't have to tweak the configuration.
I have built code using Spark and Dataflow. Let me share my thoughts.
Spark/Dataproc: I have used Spark (PySpark) a lot for ETL. You can use SQL and the programming language of your choice. A lot of functions are available (including window functions). Build your DataFrame and write your transformation, and it can be super fast. Once data is cached, any operation on the DataFrame will be quick.
You can simply build a Hive external table on GCS. Then you can use Spark for ETL and load the data into BigQuery. This is for batch processing.
For streaming you can use Spark Streaming and load data into BigQuery.
Now, if you already have a cluster, then you have to think about whether to move to Google Cloud or not. I found the Dataproc (Google Cloud Hadoop/Spark) offering better, as you don't have to worry about much cluster management.
Dataflow: It's known as Apache Beam. Here you can write your code in Java, Python, or another supported language. You can execute the code on different frameworks (Spark/MR/Flink). This is a unified model. Here you can do both batch processing and stream data processing.
Google now offers both programming models, MapReduce and Spark.
They are Cloud Dataflow and Cloud Dataproc, respectively.

What is difference between distributed cache and Tachyon?

A distributed cache is a method that stores common requests and enables quick retrieval.
Tachyon is a memory-centric distributed storage file system that avoids going to disk to load datasets that are frequently read.
What is the difference between these two?
The main difference is in the programming paradigm. Note that by your definition, Tachyon is almost certainly a distributed cache.
Most distributed caches are some form of key-value store; while higher-level data structures can be built atop this, the core paradigm tends to be key-value.
Tachyon is designed to function as a software file system that is compatible with the HDFS interface prevalent in the big data analytics space. The point of doing this is that it can be used as a drop in accelerator rather than having to adapt each framework to use a distributed caching layer explicitly.
Note that both Apache Ignite and Apache Geode (Incubating) are related projects that offer both key-value and file system style APIs making them arguably more flexible.
Tachyon (known as Alluxio now) is located between the computation layer (Apache Spark, Apache Flink, Apache MapReduce) and the storage layer (HDFS, Amazon S3, OpenStack Swift, ...).
It is basically an in-memory file system used to abstract the user from the storage systems underneath (one or multiple).
For the computation frameworks or jobs above it, Tachyon is the data storage where the data to be computed is kept.
It can't carry out advanced distributed computing features and doesn't natively support SQL queries, as some distributed caches do (Apache Ignite or Hazelcast).

Resources