There is an in-memory option introduced in Cassandra by DataStax Enterprise 4.0:
http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/inMemory.html
But the size of an in-memory table is limited to 1GB.
Does anyone know the reasoning behind the 1GB limit? And is it possible to extend it to a larger in-memory table size, such as 64GB?
To answer your question: today it's not possible to bypass this limitation.
In-memory tables are stored within the JVM heap, and regardless of the amount of memory available on a single node, allocating more than 8GB to the JVM heap is not recommended.
The main reason for this limitation is that the Java garbage collector slows down when dealing with a very large heap.
However, if you consider Cassandra as a distributed system, 1GB is not the real limit. The effective capacity is roughly:
(nodes*allocated_memory)/ReplicationFactor
allocated_memory is at most 1GB per node -- so your table may contain many GB in memory, allocated across different nodes.
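For example (hypothetical figures): a 6-node cluster with 1GB allocated per node and a replication factor of 3 gives (6 * 1GB) / 3 = 2GB of effective in-memory table capacity.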
I think this will improve in the future, but dealing with 64GB in memory could be a real problem when you need to flush data to disk. One more consideration that creates a limitation: avoid TTL when working with in-memory tables. TTL creates tombstones, and a tombstone is not deallocated until the GCGraceSeconds period passes -- so, considering the default value of 10 days, each tombstone will keep its portion of memory busy and unavailable, possibly for a long time.
HTH,
Carlo
Related
Is it okay to have a large number of partitions in Cassandra?
Will heap memory bloat?
Mathematically speaking, Cassandra (with the Murmur3 partitioner) supports a token range from -2^63 to 2^63 - 1, which comes out to a total of about 18.4 quintillion possible partitions. So no worries there, that’s perfectly fine.
Will heap memory bloat?
No. One, there’s a pre-set size for your heap, and it won’t get any bigger than that. Two, Cassandra doesn’t keep all data or even all keys in memory.
You can configure how many keys/rows are cached on a per-table basis. Just make sure not to cache more than the heap new-gen size, or the heap can “thrash” (constantly running GC).
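For example, here is a minimal sketch of adjusting the per-table cache through the DataStax Java driver (the contact point and keyspace/table names are placeholders, and the caching map syntax assumes Cassandra 2.1 or later):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TuneTableCache {
    public static void main(String[] args) {
        // contact point and table name below are placeholders for your own cluster/schema
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // cache all partition keys but at most 100 rows per partition,
        // keeping the cache comfortably below the heap new-gen size
        session.execute("ALTER TABLE my_keyspace.my_table "
                + "WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}");

        cluster.close();
    }
}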
How do you pin a table in the cache so it will not be swapped out of memory?
Situation: we are using MicroStrategy BI reporting and the semantic layer is built. We wanted to cache heavily used tables using Spark SQL CACHE TABLE; we cached them in the Spark (Thrift server) context. Initially everything was in memory; now some of it is in memory and some on disk. That disk may be local disk, which for us is relatively more expensive to read from than S3, so queries may take longer and response times become inconsistent from a user-experience perspective. When more queries run against the cached tables, copies of the cached table images are made and those copies do not stay in memory, causing reports to run longer. So how do we pin these tables so they are not swapped to disk? Spark memory management uses dynamic allocation, so how can we pin these few tables in memory?
There are a couple of different ways to tackle this. First, keep in mind that Spark's unified memory is shared between storage and execution, so big joins and other operations that need temporary space compete with cached data for memory. You may want to look at "spark.memory.storageFraction", which currently defaults to 0.5; consider 0.75, but this will likely slow down your queries.
Also consider applying good data engineering to the problem: reduce the amount of data that needs to be stored. Create a temp view with old records removed and unneeded columns pruned, then cache that. Consider using smaller datatypes for improved storage; e.g. ints are more space-efficient than big strings.
Lastly, consider switching to an instance type that has more memory available, or to an instance with fast local disks. In some situations disk storage isn't that much slower than memory; this is particularly true if you're running big, complicated analytical queries where the cluster is CPU-bound rather than I/O-bound.
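As a minimal sketch of the first two suggestions (the application, table and column names are hypothetical, and spark.memory.storageFraction has to be set before the session and its executors start):
import org.apache.spark.sql.SparkSession;

public class PinReportTables {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cache-report-tables")
                // reserve more of unified memory for storage, which execution
                // cannot evict; 0.5 is the default, raising it may slow heavy joins
                .config("spark.memory.storageFraction", "0.75")
                .getOrCreate();

        // prune rows and columns first and use narrow types, so less has to fit in memory
        spark.sql("CREATE OR REPLACE TEMPORARY VIEW sales_recent AS "
                + "SELECT order_id, store_id, CAST(amount AS INT) AS amount "
                + "FROM sales WHERE order_date >= date_sub(current_date(), 90)");

        // note: Spark SQL caching defaults to MEMORY_AND_DISK, so blocks that
        // do not fit spill to local disk instead of being recomputed
        spark.sql("CACHE TABLE sales_recent");

        // reports now hit the small cached view rather than the full source table
        spark.sql("SELECT store_id, SUM(amount) FROM sales_recent GROUP BY store_id").show();
    }
}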
What is the maximum value that one can set for transaction_buffer in memsql.cnf? I assume there is a correlation with the RAM allocated on the server. My leaves have 32G each and at the moment we have transaction_buffer set to 0. We are past the design phase on our cluster and we would like to do some performance tuning, and one parameter that needs to be set accordingly is this one.
The transaction_buffer size is an amount of memory reserved per database partition - i.e. each leaf node will need transaction_buffer size * partitions per leaf * number of databases of memory. The default is 128 MB and this should generally be sufficient.
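For example (hypothetical numbers): at the default 128 MB, with 8 partitions per leaf and 4 databases, a leaf would reserve 128 MB * 8 * 4 = 4 GB, i.e. an eighth of a 32 GB leaf, just for transaction buffers.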
Basically, it's a balancing act - data in transaction_buffer will exist in memory before being written to disk. A transaction_buffer of 0 may save you some memory, but it's not taking full advantage of the speed of being in memory. If you have a lot of databases that are updated infrequently a low transaction_buffer may be the right balance as it is a per database cost (keeping in mind that each partition is a database itself).
Transaction_buffer may also be valuable for you as a "get out of jail free" card - since if your workload becomes more and more memory intensive it's possible to get into a situation where your OS is killing MemSQL too frequently to reduce memory consumption. Once you get stuck in a vicious cycle like that, restarting with a reduced transaction buffer can reduce memory overhead enough to keep the system from being OOM-killed long enough to troubleshoot and correct the issue on your end.
Eventually, it might become adaptive, and you'll be left without that easy way to get some wiggle-room. Which is why it is essential to make sure the maximum_memory is low enough that your system doesn't begin to OOM kill processes. https://docs.memsql.com/docs/memory-management
We are trying to create a prototype with the DataStax Community edition of Cassandra and the Java driver.
I've tried to measure the latency of simple retrieve and update operations using the sample from the Cassandra Java driver (the simplex keyspace).
I have two data centers with one Rack per data center. Each Rack contains 3 nodes.
I have 6 nodes (VMs) in total.
I've configured key_cache_size_in_mb to 10 in order to tune the retrieve/update operations.
In summary, we are trying to tune the sample operations to get around 5 ms latency per read/update operation.
Here are the latencies we managed to achieve:
19 milliseconds elapsed to retrieve playlist table.
title album artist
Memo From Turner Performance Mick Jager
Updating simplex.playlist
14 milliseconds elapsed to update songs table.
14 milliseconds elapsed to retrieve songs table.
title album artist tags
La Petite Tonkinoise' Bye Bye Blackbird' Joséphine Baker
What tuning should be done in order to improve performance and achieve better latency than the above?
Your direction/insight would be highly appreciated.
Thanks in advance,
Erwin
Some performance optimization tips/best practices:
The larger the number of nodes, the better the distribution, and the better C* performs.
64-bit JVMs perform better than 32-bit (use Oracle JVM 1.6, at least u22).
For physical environments, the minimum is 8GB of RAM, but anything between 16-32 GB is recommended, along with 8-core processors.
Use at least two disks, one for the commit log and the other for the data directories.
Commit log + data directory on the same volume – avoid this. The biggest performance gain for writes is to put the commit log on a separate disk drive. The commit log is 100% sequential, while data reads from the data directories are random. I/O contention between the commit log & SSTables may degrade commit log writes and SSTable reads. But this does not apply to SSDs or EC2.
JVM parameter tuning (on an 8GB RAM system)
Heap tuning
-Xms${MAX_HEAP_SIZE}
-Xmx${MAX_HEAP_SIZE} – default to 40-50% of available physical memory – 4 GB
-Xmn${HEAP_NEWSIZE} - default to 25% of java heap – 1GB
GC tuning
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+UseParNewGC
-XX:SurvivorRatio=4
-XX:MaxTenuringThreshold=0
Sync the clocks on all nodes – as C* adds a timestamp to each column value, it is a must to sync clocks across the ring using an NTP daemon or script. Clocks are known to drift across datacenters.
Use the key cache sparingly; it has the highest possible performance gain with the least memory footprint, as it stores only the key and data location, saving one file I/O seek.
update column family my_column_family with keys_cached=50000;
Use RF=3 (it's a best practice), and a write/read consistency level of QUORUM is also a best practice.
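Tying this back to the Java driver prototype in the question, here is a minimal sketch (driver 2.x-era API; the contact point and song id are placeholders) that sets QUORUM explicitly, prepares the statement once, and averages the latency over repeated reads so that connection setup and statement preparation are not counted in the per-read figure:
import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SimplexLatencyCheck {
    public static void main(String[] args) {
        // contact point and song id below are placeholders for your own cluster and data
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("simplex");

        PreparedStatement select = session.prepare(
                "SELECT title, album, artist, tags FROM songs WHERE id = ?");
        select.setConsistencyLevel(ConsistencyLevel.QUORUM);

        UUID songId = UUID.fromString("00000000-0000-0000-0000-000000000000"); // placeholder id

        // time repeated executions, not the first call
        int runs = 100;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            Row row = session.execute(select.bind(songId)).one();
        }
        double avgMs = (System.nanoTime() - start) / 1_000_000.0 / runs;
        System.out.printf("average read latency: %.2f ms%n", avgMs);

        cluster.close();
    }
}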
On Linux, you can locate cassandra-env.sh, which is used when starting the Cassandra process. This is where we add the GC params as well as the JVM memory settings (back up the file first). I assume you have 4GB allocated to the Cassandra process: assuming you have 8GB of system memory, allocate -Xmx4096m to the Cassandra process.
https://github.com/apache/cassandra/blob/trunk/conf/cassandra-env.sh?source=cc
You can tune the options coded in the section "# GC tuning options".
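For example, on the 8GB machine assumed above, the relevant lines in cassandra-env.sh would look roughly like this (the values simply restate the settings listed earlier; adjust them to your own hardware):
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="1G"
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=0"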
key_cache_size_in_mb - this setting can be found in the cassandra.yaml file and is applicable to all column families in your keyspace, or it can be set at the CF level. You need to know the approximate size of your rows and work out the calculation, e.g. for 1 million rows to be cached with an average row size of 100 bytes (25 columns of 4 bytes each), you need to set it to 100 MB (1 million * 100 bytes), i.e. key_cache_size_in_mb: 100.
What is the maximum number of keyspaces allowed in a Cassandra cluster? The wiki page on limitations doesn't mention one. Is there such a limit?
A keyspace is basically just a Map entry to Cassandra... you can have as many as you have memory for. Millions, easily.
ColumnFamilies are more expensive, since Cassandra will reserve a minimum of 1MB for each CF's memtable: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-performance
You should have a look at: https://community.datastax.com/questions/12579/limit-on-number-of-cassandra-tables.html
We recommend a maximum of 200 tables total per cluster across all keyspaces (regardless of the number of keyspaces). Each table uses 1MB of memory to hold metadata about the tables, so in your case where 1GB is allocated to the heap, 500-600MB is used just for table metadata, with hardly any heap space left for other operations.
It is a recommendation and there is no hard limit on the number of tables you can create in a cluster. You could create thousands if you were so inclined.
More importantly, applications take a long time to start up, since the drivers request the cluster metadata (including the schema) during the initialisation/discovery phase. Retrieving the schema for 200 tables takes significantly less time than it would for 500, 1000 or 3000. This may not be important to you, but there are lots of use cases where short startup times are crucial, most notably for short-lived serverless functions, where execution time costs money and reducing execution time where possible results in thousands of dollars in savings.