Cassandra prototype, latency issue - cassandra

We are trying to build a prototype using the DataStax Community Edition of Cassandra and the Java driver.
I've tried to measure the latency of simple retrieve and update using the Sample from Cassandra Java Driver (simplex keyspace).
I have two data centers with one Rack per data center. Each Rack contains 3 nodes.
I have 6 nodes (VMs) in total.
I've set key_cache_size_in_mb to 10 in order to tune the retrieve/update operations.
In summary, we are trying to tune the sample operations to get around 5 ms latency for each read/update operation.
These are the latencies we have managed to achieve so far:
19 milliseconds elapsed to retrieve playlist table.
title album artist
Memo From Turner Performance Mick Jager
Updating simplex.playlist
14 milliseconds elapsed to update songs table.
14 milliseconds elapsed to retrieve songs table.
title album artist tags
La Petite Tonkinoise' Bye Bye Blackbird' Joséphine Baker
What tuning should be done to improve performance and achieve better latency than the above?
Your direction/insight would be highly appreciated.
Thanks in advance,
Erwin

Some performance optimization tips/best practices:
The larger the number of nodes, the better the data distribution and the better C* performs
64-bit JVMs perform better than 32-bit ones (use Oracle JVM 1.6, at least u22)
In physical environments, the minimum is 8 GB of RAM, but 16-32 GB with 8-core processors is preferable
at least two disks, one for the commit log and the other for the data directories
Commit Log + data directory on same volumes – avoid this. The biggest performance gain for write is to put commit log in a separate disk drive. Commit log is 100% sequential, while data reads are random from data directories. I/O contention between commit log & SSTables may deteriorate commit log writes and SSTable reads. But this does not apply to SSDs or EC2.
JVM parameters tuning (on a 8GB RAM system)
Heap tuning
-Xms${MAX_HEAP_SIZE}
-Xmx${MAX_HEAP_SIZE} – default to 40-50% of available physical memory – 4 GB
-Xmn${HEAP_NEWSIZE} - default to 25% of java heap – 1GB
GC tuning
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+UseParNewGC
-XX:SurvivorRatio=4
-XX:MaxTenuringThreshold=0
Sync the clocks on all nodes – as C* adds a timestamp to each column value, it is a must to sync clocks across the ring using an NTP daemon or script. Even with NTP, clocks are known to drift across data centers.
Use the key cache: it offers the highest possible performance gain for the smallest memory footprint, as it stores only the key and the data location. A hit saves one file I/O seek.
update column family my_column_family with keys_cached=50000;
Use RF=3, it's a best practice; a write/read consistency level of QUORUM is also a best practice
On Linux, you can locate cassandra-env.sh, which is used when starting the Cassandra process. This is where we add the GC params as well as the JVM memory settings (back up the file first). Assuming you have 8 GB of system memory, allocate 4 GB to the Cassandra process with -Xmx4096m.
https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh?source=cc
You can find the tuning options in the section "# GC tuning options".
key_cache_size_in_mb - this setting can be found in the cassandra.yaml file and applies to all column families in your keyspace; alternatively, set it at the CF level. You need to know the approximate size of your rows and work out the calculation. E.g. to cache 1 million rows with an average row size of 100 bytes (25 columns of 4 bytes each), you need to set it to 100 MB (1 million * 100 bytes).
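The sizing rules of thumb in this answer (heap at roughly half of physical RAM, young generation at a quarter of the heap, cache size as rows times average size) can be written out as a small calculation. This is only a sketch of the arithmetic: the function names are made up for illustration, and the 50% and 25% fractions are the answer's own defaults, not values read from any Cassandra file.

```python
def heap_settings(physical_ram_mb, heap_fraction=0.5, new_fraction=0.25):
    """Return (max_heap_mb, heap_newsize_mb) from the rules of thumb above.

    heap_fraction and new_fraction are assumptions taken from the answer
    (40-50% of RAM for the heap, 25% of the heap for the young generation).
    """
    max_heap_mb = int(physical_ram_mb * heap_fraction)
    heap_newsize_mb = int(max_heap_mb * new_fraction)
    return max_heap_mb, heap_newsize_mb


def key_cache_size_mb(rows_to_cache, avg_bytes_per_row):
    """Cache size estimate in MB, counting 1 MB as 10**6 bytes as the answer does."""
    return rows_to_cache * avg_bytes_per_row / 10**6


# 8 GB machine, as in the answer.
max_heap, new_size = heap_settings(8192)
print(f"-Xms{max_heap}m -Xmx{max_heap}m -Xmn{new_size}m")

# 1 million rows of about 100 bytes each.
print(key_cache_size_mb(1_000_000, 100), "MB")
```

For the 8 GB example this gives a 4096 MB heap with a 1024 MB young generation, and 100 MB for the cache, matching the figures quoted in the answer.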

Related

How can I estimate how many Cassandra nodes I need for a given read requests per second load?

I wish to estimate how many cassandra storage nodes I would need to serve a specific number of reads per second.
My node specs are 32 cores, 256GB ram, 10Gbps NIC, 10x 6TB HDDs. Obviously SSDs would be much preferable, but these are not available in this instance.
I have around 5x10^11 values of 1kB each = 500TB of values to serve, at a rate of 100,000 read requests per second. The distribution of these requests is completely even, ie, ram capacity caching will have no effect.
If we assume that each HDD can sustain ~100 IOps, then I could expect that I need at least ~ 100 nodes to serve this read load - correct?
I also estimate that I would need at least ~ 20 machines for the total storage with a replication factor of 2, plus overhead.
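The back-of-the-envelope estimate in this question can be reproduced with a few lines of arithmetic. This sketch assumes, as the question implicitly does, that every request costs exactly one disk I/O on one node and that a node's full raw disk capacity is usable; both function names are hypothetical.

```python
import math


def nodes_for_reads(reads_per_sec, disks_per_node, iops_per_disk):
    """Nodes needed if each read costs one I/O and load spreads evenly."""
    node_iops = disks_per_node * iops_per_disk
    return math.ceil(reads_per_sec / node_iops)


def nodes_for_storage(data_tb, replication_factor, disks_per_node, tb_per_disk):
    """Nodes needed to hold all replicas, ignoring compaction and free-space overhead."""
    raw_tb = data_tb * replication_factor
    return math.ceil(raw_tb / (disks_per_node * tb_per_disk))


# Figures from the question: 100,000 reads/s, 10 HDDs per node at ~100 IOPS each,
# 500 TB of data, replication factor 2, 6 TB per disk.
print(nodes_for_reads(100_000, 10, 100))   # 100
print(nodes_for_storage(500, 2, 10, 6))    # 17
```

The read load, not the storage, is the binding constraint under these assumptions, which is why the question arrives at about 100 nodes rather than about 20.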
It's a really broad question - you need to test your machines with tools like NoSQLBench, which was built specifically for such tasks.
The typical recommendation is to store ~1 TB of data per Cassandra node (including replication). You need to take other factors into account, such as how long it will take to replace a node in the cluster or add a new one - the time streaming takes is directly proportional to the size of the data on disk.
HDDs are really not recommended if you want low-latency answers. I have a client with ~150 TB spread over ~30 machines with HDDs, admittedly with a lot of writes, and read latencies regularly go above 0.5 seconds. You need to take into account that Cassandra requires random access to data, and the head of an HDD simply can't move fast enough to serve the requests.

Cassandra CPU imbalance in Azure

We have a 30+ node Cassandra cluster (3.11.2) in 4 data centers. One of the centers consists of 8 nodes in Azure running on Standard DS12 v2 (4cpu, 28gb) nodes with a 500GB premium SSD drive. All in the same data center (central US).
We are seeing a dramatic CPU imbalance in the node activity when pushed to the max. We have a keyspace with about 200 million records, and we're running a process to check and refresh the records if necessary from another data stream.
What's happening is that 4 nodes are running at 70-90% CPU compared to 15-25% on the other 4. The CPU is being measured on the nodes themselves, because Azure's own metrics are broken and never represent what is actually happening.
Digging into a pair of nodes (one low CPU and one high), the difference is the iowait% of the two. The data in the keyspace is balanced (within reason - they are all within 5% of one another in record count and size). It looks like the number of reads is balanced, and even the read latency as reported by Cassandra is similar.
When I do an iostat compare of the nodes, the high CPU node is reporting a much higher (by 50 to 100%) rKB/s numbers... which is likely leading to the difference in iowait% time.
These nodes are configured 100% the same, running the same version of everything (OS, libraries, everything) that I can think to look at. I cannot figure out why some nodes are deciding to do more disk reads than the others, resulting in the cluster as a whole slowing down.
Anybody have any suggestions on where I can look for differences?
The only thing that is a pattern, is the nodes that are slower are the 4 nodes that were added later in our expansion. We started with 4 nodes for a while and added 4 more when we needed space. All the appropriate repairs and other tasks required with node additions were done - the fact that the records and physical size of the data files on disk being equal should attest to that.
When we shut down our refresh process, all the nodes settle down to an even 5% or less CPU across the board. No compaction or any other maintenance is happening that would indicate something different.
plz help... :)
Our final solution for this - to fix ONLY the unbalanced problem was to cleanup, full repair and compact. At that point the nodes are relatively equally used. We suspect expanding the cluster (adding nodes) may have left elements of data on the older nodes that were not compacted out based on regular compaction events.
We are still working to try to solve the load issue; but now at least all the nodes are feeling the same CPU crunch.

Cassandra cluster - data density (data size per node) - looking for feedback and advice

I am considering the design of a Cassandra cluster.
The use case would be storing large rows of tiny samples for time series data (using KairosDB), data will be almost immutable (very rare delete, no updates). That part is working very well.
However, after several years the data will be quite large (it will reach a maximum size of several hundred terabytes - over one petabyte considering the replication factor).
I am aware of advice not to use more than 5TB of data per Cassandra node because of high I/O loads during compactions and repairs (which is apparently already quite high for spinning disks).
Since we don't want to build an entire datacenter with hundreds of nodes for this use case, I am investigating if this would be workable to have high density servers on spinning disks (e.g. at least 10TB or 20TB per node using spinning disks in RAID10 or JBOD, servers would have good CPU and RAM so the system will be I/O bound).
The number of reads/writes per second in Cassandra will be manageable by a small cluster without any stress. I can also mention that this is not a high-performance transactional system but a datastore for storage, retrieval and some analysis, and the data will be almost immutable - so even if a compaction or a repair/reconstruction takes several days on several servers at the same time, it's probably not going to be an issue at all.
I am wondering if anyone has experience with high-density servers on spinning disks, and what configuration you are using (Cassandra version, data size per node, disk size per node, disk config: JBOD/RAID, type of hardware).
Thanks in advance for your feedback.
Best regards.
The risk of super dense nodes isn't necessarily maxing IO during repair and compaction - it's the inability to reliably resolve a total node failure. In your reply to Jim Meyer, you note that RAID5 is discouraged because the probability of failure during rebuild is too high - that same potential failure is the primary argument against super dense nodes.
In the days pre-vnodes, if you had a 20T node that died, and you had to restore it, you'd have to stream 20T from the neighboring (2-4) nodes, which would max out all of those nodes, increase their likelihood of failure, and it would take (hours/days) to restore the down node. In that time, you're running with reduced redundancy, which is a likely risk if you value your data.
One of the reasons vnodes were appreciated by many people is that it distributes load across more neighbors - now, streaming operations to bootstrap your replacement node come from dozens of machines, spreading the load. However, you still have the fundamental problem: you have to get 20T of data onto the node without bootstrap failing. Streaming has long been more fragile than desired, and the odds of streaming 20T without failure on cloud networks are not fantastic (though again, it's getting better and better).
Can you run 20T nodes? Sure. But what's the point? Why not run 5 4T nodes - you get more redundancy, you can scale down the CPU/memory accordingly, and you don't have to worry about re-bootstrapping 20T all at once.
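To put rough numbers on the restore-time argument above, the time to re-stream a dead node scales linearly with how much data it held. The sketch below is illustrative only: the 200 MB/s sustained streaming rate is an assumed figure, not a measurement from this answer, and real bootstraps are further slowed by retries and compaction.

```python
def restore_hours(node_tb, stream_mb_per_sec):
    """Hours to stream node_tb terabytes at a sustained rate (1 TB = 10**6 MB)."""
    total_mb = node_tb * 1_000_000
    return total_mb / stream_mb_per_sec / 3600


ASSUMED_STREAM_MB_PER_SEC = 200  # hypothetical sustained throughput

for size_tb in (4, 20):
    hours = restore_hours(size_tb, ASSUMED_STREAM_MB_PER_SEC)
    print(f"{size_tb} TB node: about {hours:.1f} hours with reduced redundancy")
```

Whatever the real throughput turns out to be, the 20T node spends five times as long in the degraded state as a 4T node does.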
Our "dense" nodes are 4T GP2 EBS volumes with Cassandra 2.1.x (x >= 7 to avoid the OOMs in 2.1.5/6). We use a single volume, because while you suggest "cassandra now supports JBOD quite well", our experience is that relying on Cassandra's balancing algorithms is unlikely to give you quite what you think it will - IO will thundering herd between devices (overwhelm one, then overwhelm the next, and so on), they'll fill asymmetrically. That, to me, is a great argument against lots of small volumes - I'd rather just see consistent usage on a single volume.
I haven't used KairosDB, but if it gives you some control over how Cassandra is used, you could look into a few things:
See if you can use incremental repairs instead of full repairs. Since your data is an immutable time series, you won't often need to repair old SSTables, so incremental repairs would just repair recent data.
Archive old data in a different keyspace, and only repair that keyspace infrequently such as when there is a topology change. For routine repairs, only repair the "hot" keyspace you use for recent data.
Experiment with using a different compaction strategy, perhaps DateTiered. This might reduce the amount of time spent on compaction since it would spend less time compacting old data.
There are other repair options that might help. For example, I've found that the -local option speeds up repairs significantly if you are running multiple data centers. Or perhaps you could run limited repairs more frequently rather than performance-killing full repairs on everything.
I have some Cassandra clusters that use RAID5. This has worked fine so far, but if two disks in the array fail then the node becomes unusable since writes to the array are disabled. Then someone must manually intervene to fix the failed disks or remove the node from the cluster. If you have a lot of nodes, then disk failures will be a fairly common occurrence.
If no one gives you an answer about running 20 TB nodes, I'd suggest running some experiments on your own dataset. Set up a single 20 TB node and fill it with your data. As you fill it, monitor the write throughput and see if there are intolerable drops in throughput when compactions happen, and at how many TB it becomes intolerable. Then have an empty 20 TB node join the cluster and run a full repair on the new node and see how long it takes to migrate its half of the dataset to it. This would give you an idea of how long it would take to replace a failed node in your cluster.
Hope that helps.
I would recommend thinking about the data model of your application and how to partition your data. For time series data it would probably make sense to use a composite key [1], which consists of a partition key + one or more columns. Partitions are distributed across multiple servers according to the hash of the partition key (depending on the Cassandra Partitioner that you use, see cassandra.yaml).
For example, you could partition your data by the device that generates it (Pattern 1 in [2]) or by a period of time (e.g., per day) as shown in Pattern 2 in [2].
You should also be aware that the max number of values per partition is limited to 2 billion [3]. So, partitioning is highly recommended. Don't store your entire time series on a single Cassandra node in a single partition.
[1] http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/
[2] https://academy.datastax.com/demos/getting-started-time-series-data-modeling
[3] http://wiki.apache.org/cassandra/CassandraLimitations
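The partitioning advice above (device plus a time bucket, staying under the 2 billion values per partition cited in [3]) can be sketched as follows. This is plain Python for illustration, not driver code: the function names, the `sensor-42` device id, and the one-day bucket are all assumptions.

```python
from datetime import datetime, timezone

MAX_VALUES_PER_PARTITION = 2_000_000_000  # limit cited in the answer, see [3]


def partition_key(device_id, timestamp):
    """Composite partition key: (device, UTC day bucket)."""
    day = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (device_id, day)


def values_per_partition(samples_per_sec, columns_per_sample, bucket_seconds=86_400):
    """How many values one device writes into a single time bucket."""
    return samples_per_sec * columns_per_sample * bucket_seconds


ts = datetime(2015, 3, 1, 12, 30, tzinfo=timezone.utc)
print(partition_key("sensor-42", ts))  # ('sensor-42', '2015-03-01')

# A device writing 1000 single-column samples per second into daily buckets.
print(values_per_partition(1000, 1) < MAX_VALUES_PER_PARTITION)  # True
```

If the check comes out false for your sample rate, a shorter bucket (for example hourly) brings each partition back under the limit.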

Cassandra In-Memory option

DataStax Enterprise 4.0 introduced an In-Memory option for Cassandra:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/inMemory.html
But an in-memory table is limited to 1GB in size.
Does anyone know why it is limited to 1GB? And is it possible to extend an in-memory table to a larger size, such as 64GB?
To answer your question: today it's not possible to bypass this limitation.
In-Memory tables are stored within the JVM heap. Regardless of the amount of memory available on a single node, allocating more than 8GB to the JVM heap is not recommended.
The main reason for this limitation is that the Java garbage collector slows down when dealing with huge amounts of memory.
However if you consider Cassandra as a distributed system 1GB is not the real limitation.
(nodes*allocated_memory)/ReplicationFactor
allocated_memory is max 1GB -- so your table may contain many GB in memory, allocated across different nodes.
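As a quick illustration of the formula above, here is the calculation for a hypothetical cluster. The node count and replication factor are made-up inputs; only the 1GB per-node cap comes from the answer.

```python
def cluster_in_memory_gb(nodes, allocated_gb_per_node, replication_factor):
    """Unique in-memory data the cluster can hold: (nodes * allocated_memory) / RF."""
    return nodes * allocated_gb_per_node / replication_factor


# Hypothetical clusters, each node capped at 1 GB per in-memory table.
print(cluster_in_memory_gb(6, 1, 3))     # 2.0
print(cluster_in_memory_gb(192, 1, 3))   # 64.0
```

Under this formula, holding 64GB of unique in-memory data at RF=3 would take 192 nodes.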
I think that in future something will improve, but dealing with 64GB in memory could be a real problem when you need to flush data to disk. One more consideration that creates a limitation: avoid TTL when working with In-Memory tables. TTL creates tombstones, and a tombstone is not deallocated until the GCGraceSeconds period passes. With the default value of 10 days, each tombstone will keep its portion of memory busy and unavailable, possibly for a long time.
HTH,
Carlo

What is the maximum number of keyspaces in Cassandra?

What is the maximum number of keyspaces allowed in a Cassandra cluster? The wiki page on limitations doesn't mention one. Is there such a limit?
A keyspace is basically just a Map entry to Cassandra... you can have as many as you have memory for. Millions, easily.
ColumnFamilies are more expensive, since Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
You should have a look at: https://community.datastax.com/questions/12579/limit-on-number-of-cassandra-tables.html
We recommend a maximum of 200 tables total per cluster across all
keyspaces (regardless of the number of keyspaces). Each table uses 1MB
of memory to hold metadata about the tables so in your case where 1GB
is allocated to the heap, 500-600MB is used just for table metadata
with hardly any heap space left for other operations.
It is a recommendation and there is no hard-limit on the number of tables you can create in a cluster. You can create thousands if you were so inclined.
More importantly, applications take a long time to startup since the
drivers request the cluster metadata (including the schema) during the
initialisation/discovery phase. Retrieving the schema for 200 tables
is significantly less than it would take to load 500, 1000 or 3000.
This may not be important to you but there are lots of use cases where
short startup times are crucial, most notably for short-lived
serverless functions where execution time costs money and reducing
execution where possible results in thousands of dollars in savings.
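The 1MB-per-table figure quoted above makes it easy to estimate how much heap goes to table metadata alone. This is a sketch of that arithmetic only; the function names are hypothetical and the 1MB cost is the quoted rule of thumb, not a measured value.

```python
def table_metadata_mb(table_count, mb_per_table=1):
    """Heap used for table metadata, at the quoted ~1 MB per table."""
    return table_count * mb_per_table


def metadata_heap_fraction(table_count, heap_mb, mb_per_table=1):
    """Fraction of the heap consumed by table metadata."""
    return table_metadata_mb(table_count, mb_per_table) / heap_mb


# A 1 GB heap, as in the quoted case.
for tables in (200, 500, 600):
    share = metadata_heap_fraction(tables, 1024)
    print(f"{tables} tables: {table_metadata_mb(tables)} MB, {share:.0%} of heap")
```

With 500-600 tables on a 1GB heap, roughly half or more of the heap is gone before any request is served, which is the situation the quoted answer describes.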

Resources