Maximum parallel queries in Cassandra

I'm new to Cassandra, and I have a very basic question: how many parallel queries can I run without compromising performance? The queries are going to be like
SELECT data FROM table WHERE id='asdasdasd';
It's a server in a datacenter; should it work properly with 3000 read queries? Sorry for the poor information, but it's all I have.

It all depends on the capacity of the servers where your Cassandra cluster is installed, and on how you have configured the nodes.
There is a configuration parameter in cassandra.yaml called concurrent_reads.
Tune it to get a better read rate.
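As a minimal sketch, the relevant cassandra.yaml entries look like this. The values are only illustrative, and the sizing rules in the comments are common community guidance, not something derived from your hardware:
# cassandra.yaml -- illustrative values; tune for your own hardware
concurrent_reads: 32     # rule of thumb: 16 * number of data disks
concurrent_writes: 32    # rule of thumb: 8 * number of CPU cores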

Related

Cassandra: how to properly implement "global" back-pressure with multiple applications?

As you know, with Cassandra, when nodes are overloaded it can seriously hurt your production, depending on the required consistency: nodes might become unresponsive, the entire daemon might crash, hints might fill up your data mount point, and so on.
So the keyword here is back-pressure.
To apply appropriate back-pressure with Spark on Cassandra, the following properties are particularly relevant:
--conf "spark.cassandra.output.throughputMBPerSec=2"
--total-executor-cores 24
(There are also similar back-pressure options with the DataStax driver, or cqlsh. You basically limit the throughput per core to apply some back-pressure.)
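To make the arithmetic concrete: the connector throttles writes per executor core, so the effective ceiling for one job is roughly throughputMBPerSec times the total cores. A hypothetical submit line (the job file name is made up):
spark-submit \
  --total-executor-cores 24 \
  --conf "spark.cassandra.output.throughputMBPerSec=2" \
  my_job.py
# caps this one job at about 2 MB/s * 24 cores = 48 MB/s of writes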
Let's say I found the global write throughput my Cassandra cluster can take, and I set appropriate settings for my application1, which works fine.
BUT the challenge remains that there are many developers on a Cassandra cluster. So at a given time, I may have Spark application1, application2, application3, ... all running concurrently.
Question: what are my options to ensure that the write throughput at a given time (no matter how many applications run concurrently) does not globally put too much pressure on Cassandra, thus hurting my production workload?
Thank you
What I recommend folks do to separate analytical workloads is to spin up another (logical) data center. Sure, it could be in the same physical data center. But what you want is separate compute and storage to keep the analytics load from interfering with the production traffic.
First, make sure that you're running with the GossipingPropertyFileSnitch (cassandra.yaml) and that your keyspaces are using the NetworkTopologyStrategy. You'll also want to make sure that your keyspace definition contains a named data center and that your production applications/services are configured to use that data center (e.g., dc1 below) as their default DC:
ALTER KEYSPACE product_data WITH
REPLICATION={'class':'NetworkTopologyStrategy',
'dc1':'3'};
Once the new infra is up, install Cassandra and join the nodes to the cluster as a new DC by specifying the new name in the cassandra-rackdc.properties file. Something like:
dc=dc1_analytics
Next, adjust your keyspace(s) to replicate data to the new DC.
ALTER KEYSPACE product_data WITH
REPLICATION={'class':'NetworkTopologyStrategy',
'dc1':'3','dc1_analytics':'3'};
Run a repair/rebuild on the new DC, and then configure the Spark jobs to only use dc1_analytics.
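As a sketch of those last two steps, assuming the DC names above: run the rebuild on each node in the new DC to stream the existing data from dc1, then pin the Spark Cassandra Connector to the analytics DC (verify the property name against your connector version):
nodetool rebuild -- dc1
--conf "spark.cassandra.connection.localDC=dc1_analytics"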

Cassandra (with Hadoop) performance with Spark

We are running Spark/Hadoop on a different set of nodes than Cassandra. We have 10 Cassandra nodes and multiple Spark cores, but Cassandra is not running on Hadoop. Performance in fetching data from Cassandra through Spark (in YARN client mode) is not very good, and bulk data reads from HDFS are faster (6 minutes from Cassandra vs. 2 minutes from HDFS). Changing Spark-Cassandra parameters is not helping much either.
Will deploying Hadoop on top of Cassandra solve this issue and significantly improve read performance?
Without looking at your code: bulk reads in an analytics/Spark capacity are always going to be faster when going directly to files vs. reading from a database. The database offers other advantages such as schema enforcement, availability, and distribution control, but I think the performance differences you're seeing are normal.

Can a Cassandra cluster serve as a replacement for an in-memory Redis key-value store?

My application crawls users' mailboxes and saves them to an RDBMS. I started using Redis as a cache (a simple key-value store) for the RDBMS. But gradually I started storing crawler states and other data in Redis that needs to be persistent. Losing this data means a few hours of downtime. I must ensure airtight consistency for this data. The data should not be lost in node failures or split-brain scenarios; strong consistency is a must. Sharding is done by my application. One Redis process runs on each of ten EC2 m4.large instances. On each of these instances, I am doing up to 20K IOPS to Redis. I am doing more writes than reads, though I have not determined the actual percentage of either. All my data is completely in memory, not backed by disk.
My only problem is that each of these instances is a SPOF. I cannot use Redis Cluster as it does not guarantee consistency. I have evaluated a few more tools like Aerospike; none gives a 'no data loss' guarantee.
Cassandra looks promising as I can tune the consistency level I want. I plan to use Cassandra with a replication factor of 2, where a write must be written to both replicas before it is considered committed. This gives a 'no data loss' guarantee.
By launching enough Cassandra nodes (SSD-backed), can I replace my Redis key-value store and still get similar read/write IOPS and latency? Will open-source Cassandra suffice for my use case? If not, will the DataStax Enterprise in-memory option solve it?
EDIT 1:
A bit of clarification:
I think I need to use write consistency level ALL and read consistency level ONE. I understand that with this consistency level my cluster will not tolerate any failure. That is OK for me; a few minutes of occasional downtime is not a problem, as long as my data is consistent. In my present setup, one Redis instance failure causes a few hours of downtime.
I must ensure airtight consistency for this data.
Cassandra deals with failure better when there are more nodes. Assuming your case allows for more nodes, here is my suggestion.
So, if you have 5 nodes (and a replication factor of 5), use a CL of QUORUM for both reads and writes. This means you always write to at least 3 nodes and read from 3 nodes: quorum is floor(RF/2) + 1, which is 3 for an RF of 5.
This ensures a very high level of consistency.
It also limits downtime: even if a node is down, your writes and reads won't break.
If you use CL ALL, then even if one node is down or overloaded, you will have to take full downtime.
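As a quick cqlsh illustration under the 5-node, RF 5 assumption (the keyspace and DC names are hypothetical):
CREATE KEYSPACE crawler_state
  WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'dc1':'5'};
-- session-level setting: subsequent reads and writes use QUORUM (3 of 5 replicas)
CONSISTENCY QUORUM;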
I hope it helps!

Configure Cassandra to use different network interfaces for data streaming and client connections?

I have a Cassandra cluster deployed with 3 nodes and a replication factor of 3. I have a lot of data being written to Cassandra on a daily basis (10-15 GB). I have provisioned these Cassandra nodes on commodity hardware, as suggested by the big data community, and I am expecting the nodes to go down frequently, which is handled using the redundancy Cassandra provides.
My problem is that I have observed Cassandra writes slow down when a new node is provisioned and data is being streamed during bootstrapping. So, to overcome this hurdle, we have decided to use a separate network interface for inter-node communication and another for client applications writing data to Cassandra. My question is: how can this be configured, if it is possible at all?
Any help is appreciated.
I think you are chasing the wrong solution.
I am confused by the fact that you only have 3 nodes, yet your concern is around slow writes while bootstrapping. Why? Are you planning to grow your cluster regularly? What is your consistency level on write, as this has a big impact on performance? Obviously if you only have 2 or 3 nodes and you're trying to bootstrap, you will see a slowdown, because you're tying up a significant percentage of your cluster to do the streaming.
Note that "commodity hardware" doesn't mean cheap, low-performance hardware. It just means you don't need the super high-end database-class machines used for databases like Oracle. You should still use really good commodity hardware. You may also need more nodes, as setting RF equal to cluster size is not typically a great idea.
Having said that, you can set your listen_address to the inter-node interface and rpc_address to the client address if you feel that will help.
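A minimal cassandra.yaml sketch, assuming hypothetical addresses 10.0.0.5 on the private inter-node network and 192.168.1.5 on the client-facing one:
listen_address: 10.0.0.5      # gossip, streaming, and bootstrap traffic
rpc_address: 192.168.1.5      # client (CQL) connections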

Strategies for Cassandra backups

We are considering additional backup strategies for our Cassandra database.
Currently we use:
nodetool snapshot - allows us to take a snapshot of all data on a given node at any time (see the example after this list)
replication level in itself is a backup - if any node ever goes down we still have additional copies of the data
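For the snapshot route, a typical invocation looks like this; the tag and keyspace names are made up for illustration:
nodetool snapshot -t backup_2017_01 my_keyspace
# snapshots are hard links under each node's data directories; copy them off-node for a real backup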
Is there any other efficient strategy? The biggest issue is that we need a copy of live data in an additional datacenter, which has just one server. So, for example, let's say we have 3 Cassandra servers that we want to replicate to one backup Cassandra server in a second datacenter. The only way I can think we can do it is to set the replication level to 4. That means all the data will have to be written to all servers at all times.
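For reference, NetworkTopologyStrategy lets you set the replica count per data center rather than one cluster-wide number, so the single backup server could hold one copy instead of forcing four replicas everywhere; the keyspace and DC names here are hypothetical:
ALTER KEYSPACE my_data WITH
REPLICATION={'class':'NetworkTopologyStrategy',
'dc_main':'3','dc_backup':'1'};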
We are also considering running Cassandra on top of a network-distributed file system like HDFS. Correct me if I'm wrong, but wouldn't that be a very inefficient use of Cassandra?
Are there any other efficient backup strategies for Cassandra that can back up the current data without impairing its performance?
