My application crawls users' mailboxes and saves them to an RDBMS. I started using Redis as a cache (a simple key-value store) in front of the RDBMS, but gradually I began storing crawler state and other data in Redis that needs to be persistent. Losing this data means a few hours of downtime, so I must ensure airtight consistency: the data must not be lost in node failures or split-brain scenarios. Strong consistency is a must. Sharding is done by my application. One Redis process runs on each of ten EC2 m4.large instances, and I am doing up to 20K IOPS against Redis on each of them. I do more writes than reads, though I have not measured the actual ratio. All my data is held entirely in memory, not backed by disk.
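To make the setup concrete, here is a minimal sketch of the application-side sharding described above, assuming the Jedis client; the hostnames and helper class are hypothetical:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Hypothetical sketch: each key is hashed to pick the Redis node that owns it.
public class ShardedRedisClient {
    private final JedisPool[] pools;

    public ShardedRedisClient(String[] hosts) {
        pools = new JedisPool[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            pools[i] = new JedisPool(hosts[i], 6379);
        }
    }

    // Pick a shard by hashing the key modulo the number of nodes.
    private JedisPool poolFor(String key) {
        return pools[Math.floorMod(key.hashCode(), pools.length)];
    }

    public void set(String key, String value) {
        try (Jedis jedis = poolFor(key).getResource()) {
            jedis.set(key, value);
        }
    }

    public String get(String key) {
        try (Jedis jedis = poolFor(key).getResource()) {
            return jedis.get(key);
        }
    }
}
```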
My only problem is that each of these instances is a SPOF. I cannot use Redis Cluster, as it does not guarantee consistency. I have evaluated a few other tools, such as Aerospike, but none gives a 'no data loss' guarantee.
Cassandra looks promising because I can tune the consistency level I want. I plan to use Cassandra with a replication factor of 2, where a write must reach both replicas before it is considered committed. This gives a 'no data loss' guarantee.
By launching enough Cassandra nodes (SSD-backed), can I replace my Redis key-value store and still get similar read/write IOPS and latency? Will open-source Cassandra suffice for my use case? If not, will the DataStax Enterprise in-memory option solve it?
EDIT 1:
A bit of clarification:
I think I need to use write consistency level ALL and read consistency level ONE. I understand that at this consistency level my cluster will not tolerate any failure. That is OK for me; a few minutes of occasional downtime is not a problem, as long as my data stays consistent. In my present setup, one Redis instance failure causes a few hours of downtime.
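For illustration, a hedged sketch of that plan with the DataStax Java driver 4.x; the keyspace, table, and values are hypothetical:

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class WriteAllReadOne {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS crawler WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 2}");
            session.execute(
                "CREATE TABLE IF NOT EXISTS crawler.state (id text PRIMARY KEY, value text)");

            // The write must reach both replicas (RF = 2) before it is acknowledged.
            SimpleStatement write = SimpleStatement
                .newInstance("INSERT INTO crawler.state (id, value) VALUES (?, ?)",
                    "job-1", "running")
                .setConsistencyLevel(ConsistencyLevel.ALL);
            session.execute(write);

            // A read from a single replica is then guaranteed to see that write.
            SimpleStatement read = SimpleStatement
                .newInstance("SELECT value FROM crawler.state WHERE id = ?", "job-1")
                .setConsistencyLevel(ConsistencyLevel.ONE);
            System.out.println(session.execute(read).one().getString("value"));
        }
    }
}
```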
I must ensure airtight consistency for this data.
Cassandra deals with failure better when there are more nodes. Assuming your case allows for having more nodes, this is my suggestion.
So, if you have 5 nodes with a replication factor of 5, use a CL of QUORUM for both reads and writes. This means you always write to at least 3 replicas and read from 3 replicas (for RF = 5, QUORUM is floor(5/2) + 1 = 3).
This ensures a very high level of consistency.
It also ensures limited downtime: even if a node is down, your writes and reads won't break.
If you use CL ALL instead, then even if just one node is down or overloaded, you will have to take full downtime.
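A minimal sketch of this suggestion with the DataStax Java driver, reusing a hypothetical crawler.state table:

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

// With RF = 5, QUORUM = floor(5 / 2) + 1 = 3, so the cluster tolerates
// two unavailable replicas for any given key.
public class QuorumExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            SimpleStatement write = SimpleStatement
                .newInstance("UPDATE crawler.state SET value = ? WHERE id = ?",
                    "done", "job-1")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            SimpleStatement read = SimpleStatement
                .newInstance("SELECT value FROM crawler.state WHERE id = ?", "job-1")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
            // Overlapping write and read quorums (3 + 3 > 5) guarantee the read
            // sees the latest acknowledged write.
            System.out.println(session.execute(read).one().getString("value"));
        }
    }
}
```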
I hope this helps!
Currently I'm studying the Apache Spark and Apache Ignite frameworks.
Some principal differences between them are described in this article: ignite vs spark. But I realized that I still don't understand their purposes.
I mean, for which problems is Spark preferable to Ignite, and vice versa?
I would say that Spark is a good product for interactive analytics, while Ignite is better for real-time analytics and high performance transactional processing. Ignite achieves this by providing efficient and scalable in-memory key-value storage, as well as rich capabilities for indexing, querying the data and running computations.
Another common use for Ignite is distributed caching, which is often used to improve performance of applications that interact with relational databases or any other data sources.
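As a hedged illustration, here is a minimal sketch of Ignite as a distributed key-value cache; the cache name and data are hypothetical, and read/write-through to an RDBMS would be added via a CacheStore implementation (omitted here):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteCacheExample {
    public static void main(String[] args) {
        // Starts a node with default configuration and joins the cluster.
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, String> users = ignite.getOrCreateCache("users");
            users.put(1L, "alice@example.com");
            System.out.println(users.get(1L)); // served from memory
        }
    }
}
```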
Apache Ignite is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time. Ignite is a data-source-agnostic platform and can distribute and cache data across multiple servers in RAM to deliver unprecedented processing speed and massive application scalability.
Apache Spark (a cluster computing framework) is a fast, in-memory data processing engine with expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited for high-performance computing and machine learning algorithms.
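For example, a minimal sketch of this load-once, query-repeatedly pattern with Spark's Java API; the file path and schema are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkCacheExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("cache-example")
            .master("local[*]")
            .getOrCreate();

        Dataset<Row> events = spark.read().json("/data/events.json");
        events.cache(); // pin the dataset in cluster memory

        // Subsequent queries reuse the in-memory copy instead of re-reading disk.
        long errors = events.filter("level = 'ERROR'").count();
        long warnings = events.filter("level = 'WARN'").count();
        System.out.println(errors + " errors, " + warnings + " warnings");

        spark.stop();
    }
}
```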
Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don't support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads and fully ACID-compliant transactions (OLTP).
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
Conclusion:
Ignite and Spark are both in-memory computing solutions but they target different use cases.
In many cases, they are used together to achieve superior results:
Ignite can provide shared storage, so the state can be passed from one Spark application or job to another (see the sketch after this list).
Ignite can provide SQL with indexing, so Spark SQL can be accelerated over 1,000x (Spark doesn't index the data).
When working with files instead of RDDs, the Apache Ignite In-Memory File System (IGFS) can also share state between Spark jobs and applications.
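As a hedged sketch of the shared-storage point, the ignite-spark integration exposes an Ignite cache to Spark as an RDD; the configuration path, cache name, and data are hypothetical:

```java
import java.util.Arrays;

import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SharedStateExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("ignite-shared-state").setMaster("local[*]"));
        // Points Spark at an Ignite cluster via a Spring XML config (hypothetical path).
        JavaIgniteContext<String, Double> ic =
            new JavaIgniteContext<>(sc, "ignite-config.xml");

        // The RDD is a live view of an Ignite cache: one job writes results...
        JavaIgniteRDD<String, Double> scores = ic.fromCache("scores");
        scores.savePairs(sc.parallelizePairs(Arrays.asList(new Tuple2<>("user-1", 0.97))));

        // ...and a later job, or a different application, reads them back.
        scores.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        sc.stop();
    }
}
```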
Do Spark and Ignite work together?
Yes, they do.
In short
Ignite vs. Spark
Ignite is an in-memory distributed database more focused on data storage; it handles transactional updates on data and then serves client requests. Apache Spark is an MPP compute engine which is more inclined towards analytics, ML, graph, and ETL-specific payloads.
In detail
Apache Spark is an OLAP tool
Apache Spark is a general-purpose cluster computing system. It's an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark with other components
Deployment topology
Spark on YARN topology is discussed here.
Apache Ignite is an OLTP tool
Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speed at petabyte scale. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources.
SQL Overview
Deployment topology
Apache Spark is a processing framework. You tell it where to get data, provide some code about how to process that data, and then tell it where to put the results. It's a way to easily and reliably run computing logic across a bunch of nodes in a cluster on data from any source (which is then kept in memory during processing). It's primarily meant for large-scale analysis on data from various sources (even from multiple databases at once), or from streaming sources like Kafka. It can also be used for ETL, like transforming and joining data together before putting the final results in some other database system.
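A minimal sketch of that source-process-sink pattern with Spark's Java API; the paths and column name are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkEtlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("etl-example")
            .master("local[*]")
            .getOrCreate();

        // Source: raw CSV files.
        Dataset<Row> orders = spark.read()
            .option("header", "true")
            .csv("/data/raw/orders");

        // Transform: aggregate in memory across the cluster.
        Dataset<Row> totals = orders.groupBy("customer_id").count();

        // Sink: write the result somewhere else, here as Parquet files.
        totals.write().mode("overwrite").parquet("/data/curated/order_counts");

        spark.stop();
    }
}
```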
Apache Ignite is more of an in-memory distributed database, at least that's how it started. It has a key/value and SQL API, so you can store and read data in various ways, and run queries like you would any other SQL database. It also supports running your own code (similar to Spark) so you can do processing that wouldn't really work with SQL, while also reading and writing the data all in the same system. It also can read/write data to other database systems while acting as a cache layer in the middle. Eventually, as of 2018, it also supports on-disk storage so now you can use it as an all-in-one distributed database, cache, and processing framework.
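To illustrate the dual API, a hedged sketch of writing through Ignite's key-value API and reading the same data back through its SQL API; the cache name and data are hypothetical:

```java
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteSqlExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Registering indexed types exposes the cache to SQL as a table.
            CacheConfiguration<Long, String> cfg =
                new CacheConfiguration<Long, String>("cities")
                    .setIndexedTypes(Long.class, String.class);
            IgniteCache<Long, String> cities = ignite.getOrCreateCache(cfg);

            // Key-value writes...
            cities.put(1L, "London");
            cities.put(2L, "Tokyo");

            // ...and SQL reads over the same data.
            List<List<?>> rows = cities
                .query(new SqlFieldsQuery("SELECT _key, _val FROM String ORDER BY _key"))
                .getAll();
            rows.forEach(r -> System.out.println(r.get(0) + " -> " + r.get(1)));
        }
    }
}
```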
Apache Spark is still better for more complex analytics, and you can have Spark read data from Apache Ignite, but for many scenarios it's now possible to consolidate processing and storage into a single system with Apache Ignite.
Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
By Nikita Ivanov: http://www.odbms.org/blog/2017/06/on-apache-ignite-apache-spark-and-mysql-interview-with-nikita-ivanov/
Although both Apache Spark and Apache Ignite utilize the power of in-memory computing, they address somewhat different use cases and rarely “compete” for the same task. Some conceptual differences:
Spark doesn’t store data, it loads data for processing from other storages, usually disk-based, and then discards the data when the processing is finished. Ignite, on the other hand, provides a distributed in-memory key-value store (distributed cache or data grid) with ACID transactions and SQL querying capabilities.
Spark is for non-transactional, read-only data (RDDs don't support in-place mutation), while Ignite supports both non-transactional (OLAP) payloads and fully ACID-compliant transactions (OLTP).
Ignite fully supports pure computational payloads (HPC/MPP) that can be “dataless”. Spark is based on RDDs and works only on data-driven payloads.
I am late to answer this question, but let me try to share my view on this.
Ignite may not be ready for production use in enterprise applications, as some important features such as security are only available in GridGain (a commercial wrapper over Ignite).
The complete feature comparison can be found at the link below:
https://www.gridgain.com/products/gridgain-vs-ignite
How is Cassandra's eventual consistency model different from HBase's? It seems Facebook moved from Cassandra to HBase because of consistency issues. Which of these NoSQL DBs is ideal for scale and performance with consistency as near as possible to 'immediate'? And by what factor does performance degrade when we try to improve consistency?
Here's Facebook's original post on why they chose HBase for Messenger. At the time they decided HBase was "ideal for scale and performance with consistency as near as possible to 'immediate'", however they reached its limits and later developed a new service called Iris that handles the most recent week of messages, while storing the older messages in HBase.
Cassandra's consistency model provides a lot of flexibility. The biggest difference is that Cassandra is a shared-nothing architecture: each server is designed to be able to function independently, which gives high availability and partition tolerance at the cost of consistency.
With HBase however there is a single source of truth, at the (apparent) cost of availability and partition tolerance. The read process, from the client's perspective, involves finding the location of that data and reading it from that server. Any updates to that data are atomic.
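As a hedged illustration of that single-source-of-truth model with the HBase Java client (the table, column family, and row key are hypothetical): each row lives on exactly one region server, so per-row updates are atomic.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table messages = conn.getTable(TableName.valueOf("messages"))) {

            // Atomic update to a single row, owned by exactly one region server.
            Put put = new Put(Bytes.toBytes("user1#msg42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
            messages.put(put);

            // A subsequent read of that row sees the update immediately.
            Result row = messages.get(new Get(Bytes.toBytes("user1#msg42")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"))));
        }
    }
}
```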
Here's one HBase vs Cassandra benchmark that shows HBase outperforming Cassandra on nearly every test on (mostly) default settings, and here's another benchmark that shows Cassandra outperforming HBase on certain tests. I think the conclusion here is that the answer to your question is highly dependent on your use case.
Here's a good article that sums up the plusses and minuses of each, and can help you decide which one is best for your needs.
A distributed cache is a system that stores the results of common requests, enabling quick retrieval.
Tachyon is a memory-centric distributed storage file system that avoids going to disk to load datasets that are frequently read.
What is the difference between these two?
The main difference is in the programming paradigm; note that by your definition, Tachyon is almost certainly a distributed cache.
Most distributed caches are typically some form of key-value store; while higher-level data structures can be built atop this, the core paradigm tends to be key-value.
Tachyon is designed to function as a software file system compatible with the HDFS interface prevalent in the big data analytics space. The point of doing this is that it can be used as a drop-in accelerator rather than requiring each framework to be adapted to use a distributed caching layer explicitly.
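A minimal sketch of the drop-in idea from Spark's side: because the file system speaks the HDFS interface, switching onto it is just a URI change. The hostnames, ports, and paths here are hypothetical:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DropInExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("drop-in").setMaster("local[*]"));

        // Reading straight from HDFS:
        JavaRDD<String> fromHdfs = sc.textFile("hdfs://namenode:9000/logs/app.log");

        // The same code against the in-memory file system, no framework changes:
        JavaRDD<String> fromAlluxio = sc.textFile("alluxio://master:19998/logs/app.log");

        System.out.println(fromHdfs.count() + " / " + fromAlluxio.count());
        sc.stop();
    }
}
```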
Note that both Apache Ignite and Apache Geode (Incubating) are related projects that offer both key-value and file system style APIs making them arguably more flexible.
Tachyon (known as Alluxio now) is located between the computation layer (Apache Spark, Apache Flink, Apache MapReduce) and the storage layer (HDFS, Amazon S3, OpenStack Swift, ...).
It is basically an in-memory file system used to abstract the user from the storage systems underneath (one or multiple).
For the computation frameworks or jobs above it, Tachyon is the data storage layer where the data to be computed is kept.
It can't carry out advanced distributed computing features and doesn't natively provide SQL query support the way some distributed caches do (Apache Ignite or Hazelcast).
I have checked the main features of Cassandra and Infinispan. They seem to deliver pretty similar characteristics and functionality:
NoSQL data store
persistence
decentralized
support replication
scalability
fault tolerant
MapReduce support
Queries
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
When learning about Infinispan I came across the Cassandra Cache Store (http://infinispan.org/docs/cachestores/cassandra/). It provides persistence of data.
But then why would I still want to use Infinispan and not Cassandra directly?
Do these solutions complement each other or they are more competing on the same level?
Infinispan is mainly used as a distributed cache, like Memcached, Hazelcast, and so on.
Natively, data is written in memory, but you can persist it into what Infinispan calls "cache stores": there are many ready-made cache stores (for File/Cassandra/HBase/MongoDB), or you can write your own implementation.
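For example, a hedged sketch of an embedded Infinispan cache backed by a single-file cache store (the Cassandra store is wired up the same way with its own configuration builder); the cache name and location are hypothetical:

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class InfinispanStoreExample {
    public static void main(String[] args) {
        DefaultCacheManager cm = new DefaultCacheManager();
        try {
            // Attach a cache store so entries survive restarts.
            ConfigurationBuilder builder = new ConfigurationBuilder();
            builder.persistence()
                   .addSingleFileStore()
                   .location("/tmp/infinispan-data");
            cm.defineConfiguration("sessions", builder.build());

            Cache<String, String> sessions = cm.getCache("sessions");
            sessions.put("session-1", "user-42"); // held in memory, persisted to the store
            System.out.println(sessions.get("session-1"));
        } finally {
            cm.stop();
        }
    }
}
```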
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
Tunable consistency and data distribution are two different things. It's not true that "every node has the same data"; it depends on how you choose to cluster the data. Infinispan, like others, offers both replication (all nodes store the same cache) and distribution (each node is responsible for a range of tokens). Tunable consistency in Cassandra means you can choose how many nodes must acknowledge your read or write operation before control returns to the client.
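A minimal sketch of those two clustering modes with Infinispan's embedded configuration API, assuming a recent Infinispan version:

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class ClusteringModes {
    // Distribution: each entry is stored on numOwners nodes, not on all of them.
    static Configuration distributed() {
        return new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)
            .hash().numOwners(2)
            .build();
    }

    // Replication: every node holds a full copy of the cache.
    static Configuration replicated() {
        return new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.REPL_SYNC)
            .build();
    }
}
```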
You might want to use Infinispan rather than Cassandra directly for many reasons. If, for instance, you have a huge amount of memory in your application servers, you may want to keep a bigger or different cache there than what you can store inside your Cassandra nodes. Another feature you might need is the infinispan-query module, which lets you perform full-text searches without installing a Solr/Elasticsearch/whatever cluster, or the transactional capabilities within it.
IMHO these two products don't compare directly; they were born for different use cases and offer different features. You can use either one or both, depending on your application's architecture and needs.
HTH,
Carlo