What's the best/most reliable method of estimating the space required in Cassandra? My cluster consists of 2 nodes (RHEL 6.5) on Cassandra 3.11.2. I want to estimate the average size each row in every table will take in my database so that I can plan accordingly. I know about some methods, such as the nodetool status command, du -sh on the data directory, nodetool cfstats, etc. However, each of these gives a different value, so I'm not sure which one I should use in my calculations.
Also, I found out that apart from the actual data, Cassandra stores various metadata in system-specific tables like size_estimates, sstable_activity, etc. Does this metadata also keep growing with the data? What is the ratio of the space occupied by such metadata to the space occupied by the actual data in the database? And which particular configurations in the YAML (if any) should I keep in mind that might affect the size of the data?
A similar question was asked before, but I wasn't satisfied with the answer.
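For context, here is roughly how I have been trying to reconcile those numbers so far. It is only a sketch; the cfstats figures below are invented placeholders, not values from my cluster.
# Average partition size from `nodetool cfstats <keyspace>.<table>` (hypothetical numbers).
space_used_live_bytes = 1250000000      # "Space used (live)" reported on this node
number_of_keys_estimate = 3400000       # "Number of keys (estimate)" reported on this node
avg_partition_bytes = space_used_live_bytes / float(number_of_keys_estimate)
print("~%.0f bytes per partition on this node (compressed, excludes other replicas)" % avg_partition_bytes)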
If you are expecting 20 GB of data per day, here is the calculation.
1 day = 20 GB, 1 month = 600 GB, 1 year = 7.2 TB, so your raw data size for one year is 7.2 TB; with a replication factor of 3 it would be around 21.6 TB of data for one year.
Taking compaction into consideration, and your use case being write-heavy, if you go with size-tiered compaction you would need twice the space of your raw data.
So you would need around 43 TB to 45 TB of disk space.
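A minimal sketch of that arithmetic; the 20 GB/day, RF 3 and the 2x size-tiered headroom are the assumptions stated above, and the year is rounded to 360 days to match the figures:
daily_raw_gb = 20
replication_factor = 3
stcs_headroom = 2                                   # size-tiered compaction can temporarily need ~2x the data size
raw_year_tb = daily_raw_gb * 30 * 12 / 1000.0       # 7.2 TB of raw data per year
replicated_tb = raw_year_tb * replication_factor    # 21.6 TB with RF 3
disk_needed_tb = replicated_tb * stcs_headroom      # ~43.2 TB of disk space
print(raw_year_tb, replicated_tb, disk_needed_tb)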
Related
DB used: DataStax Cassandra Community 3.0.9
Cluster: 3 x AWS c4.2xlarge (8 cores, 15 GB RAM) with 300 GB io1 volumes at 3000 IOPS
Write consistency: QUORUM, read consistency: ONE
Replication factor: 3
Problem:
I loaded our servers with 50,000 users, and each user had 1000 records initially; after some time, 20 more records were added for each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123), where userID and timestamp are part of the primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20 GB of dummy data, the performance of the same query, i.e. fetching the 20 additional records for the 50,000 users, dropped significantly. Read performance is degrading as the data grows. As far as I have read, this should not happen, because keys get cached and the additional data should not matter.
What could be the possible cause of this? CPU and RAM utilisation is negligible, and I can't find out what is causing the query time to increase.
I have tried changing the compaction strategy to "LeveledCompaction", but that didn't work either.
EDIT 1
EDIT 2
The heap size is 8 GB. The 20 GB of data was added in a way similar to how the initial 4 GB of data (the 50k userIDs) was added, to simulate a real-world scenario; the userID and timestamp values for the 20 GB of data are different and generated randomly. The scenario is that I have 50k userIDs with 1020 rows each, where 1000 rows were added first and the additional 20 rows were added after some timestamp, and I am fetching these 20 rows. It works fine when only the 50k userIDs are present, but once more userIDs exist (the additional 20 GB) and I try to fetch those same 20 rows (for the initial 50k userIDs), the performance degrades.
EDIT 3
cassandra.yaml
Read performance is getting degraded with increase in data.
This should only happen when you add a lot of records to the same partition.
From what I can understand, your table may look like:
CREATE TABLE tbl (
    userID text,
    timestamp timestamp,
    ....
    PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of data in a single partition is "bounded" (e.g. you have at most 10k rows in a single partition). The reason is that the coordinator comes under a lot of pressure when dealing with "unbounded" queries (that's why very large partitions are a big no-no).
That "rule" is easy to overlook, and the net result is an overall slowdown that can be explained simply: C* needs to read more and more data (all of it served by one node only) to satisfy your query, keeping the coordinator busy and slowing down the entire cluster. Data growth usually means slower query responses, and after a certain threshold comes the infamous read timeout error.
That being said, it would be interesting to see whether your disk usage is "normal" or something is wrong. Give dstat -lrvn a shot to monitor your servers.
A final tip: depending on how many fields you query with SELECT * and on the amount of data retrieved, being served by an SSD may not be a big deal, because you won't exploit the IOPS of your SSDs. In such cases, preferring an ordinary HDD could lower the cost of the solution, and you wouldn't incur any penalty.
We have a table that stores our data partitioned by files. One file is 200 MB to 8 GB of JSON, but there is obviously a lot of overhead; compacting the raw data lowers this drastically. I ingested about 35 GB of JSON data and only one node got slightly more than 800 MB of data. This is possibly due to "write hotspots", but we write once and then only read; we do not update data. Currently, we have one partition per file.
By using secondary indexes, we search for partitions in the database that contain a specific geolocation (= first query) and then take the result of this query to run a range query over a time range within the found partitions (= second query). This might even be the whole file if needed, but in 95% of the queries only chunks of a partition are queried.
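To make the access pattern concrete, here is a rough sketch of the two queries using the Python driver; the keyspace, table and column names (data_by_file, geohash, file_id, ts) are placeholders, not our real schema:
from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])                 # any contact point of the 6-node cluster
session = cluster.connect('mykeyspace')         # placeholder keyspace

# First query: secondary index lookup to find the partitions (files) containing a geolocation.
files = session.execute(
    "SELECT file_id FROM data_by_file WHERE geohash = %s", ('u4pruyd',))

# Second query: range query over a time window inside each partition found above.
start_ts, end_ts = 1450000000, 1450003600
for row in files:
    chunk = session.execute(
        "SELECT * FROM data_by_file WHERE file_id = %s AND ts >= %s AND ts <= %s",
        (row.file_id, start_ts, end_ts))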
We have a replication factor of 2 on a 6-node cluster. Data is fairly evenly distributed; every node owns 31.9% to 35.7% (effective) of the data according to nodetool status *tablename*.
Good read performance is key for us.
My questions:
How big is too big for a partition in terms of volume or row size? Is there a rule of thumb for this?
For Range Query performance: Is it better to split up our "big" partitions to have more smaller partitions? We built our schema with "big" partitions because we thought that when we do range queries on a partition, it would be good to have it all on one node so data can be fetched easily. Note that the data is also available on one replica due to RF 2.
C* supports very wide partitions, but that doesn't mean it is a good idea to go to that level. The right limit depends on the specific use case, but a good ballpark value could be between 10k and 50k rows per partition. Of course, everything is a compromise: if you have "huge" (in terms of bytes) rows, then heavily limit the number of rows in each partition; if you have "small" (in terms of bytes) rows, then you can relax that limit a bit. This is because a partition lives entirely on one node (well, on two replicas with your RF 2), so every query for a specific partition is served by a single node.
Range queries should ideally go to one partition only. A range query means a sequential scan of your partition on the node that receives the query; however, you then limit yourself to the throughput of that node. If you split your range queries across more nodes (that is, you change the way you partition your data by adding something like a bucket), you can fetch data from different nodes in parallel, directly increasing the total throughput. Of course, you'd lose the ordering of your records across different buckets, so if the order within your partition matters, that might not be feasible.
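A rough sketch of both points, with made-up numbers and a hypothetical helper, just to illustrate the idea:
# 1) Bound the partition: cap the number of rows given a byte budget per partition.
target_partition_mb = 100                 # assumed comfortable upper bound per partition
avg_row_bytes = 2048                      # assumed average row size
max_rows_per_partition = target_partition_mb * 1024 * 1024 // avg_row_bytes
print(max_rows_per_partition)             # 51200 rows under these assumptions

# 2) Bucket the partition key: spreading one "file" over several time buckets means several
#    partitions (and nodes) can serve range queries in parallel, at the cost of merging and
#    re-ordering the results on the client side.
def partition_key(file_id, ts_seconds, bucket_seconds=86400):
    # e.g. one bucket per day; the bucket number becomes part of the partition key
    return (file_id, ts_seconds // bucket_seconds)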
As per the DataStax cassandra.yaml documentation (https://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html):
compaction_throughput_mb_per_sec
(Default: 16) Throttles compaction to the specified total throughput across the entire system. The faster you insert data, the faster you need to compact in order to keep the SSTable count down. The recommended value is 16 to 32 times the rate of write throughput (in MB/second). Setting the value to 0 disables compaction throttling.
My literal interpretation of the above text is: if you observe disk I/O of, say, 38 MB/s (for now considering only the write load on the Cassandra nodes), then compaction_throughput_mb_per_sec should be set to 38 * 16 = 608 or 38 * 32 = 1216, irrespective of the compaction strategy.
If the above interpretation is correct, then kindly help me understand what the value 608 or 1216 actually means in the context of throttling compaction and total throughput across the system for the (default) size-tiered compaction strategy, perhaps by extending the example mentioned below.
The plot:
As per the documentation, the min_threshold value for SizeTieredCompactionStrategy is 6; in our case it is unchanged. On average, disk I/O per node is observed to be around 38 MB/s (only writes, no read operations happening). The compaction_throughput_mb_per_sec value is 16.
What would the compaction workflow be with the value 16? If we change it to 608, what exactly is going to change, what is going to be impacted, and how?
Let's take another look at the meaning of compaction.
the compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable.
...
The compaction_throughput_mb_per_sec parameter is designed for use with large partitions because compaction is throttled to the specified total throughput across the entire system.
Refer: Configuring compaction
To preserve read performance in a mixed read-write workload, you need to mitigate the tendency of small SSTables to accumulate during a single long-running compaction.
Refer: concurrent_compactors
So when you update compaction_throughput_mb_per_sec, you update the rate at which new consolidated SSTables are written, and that in turn helps you mitigate the tendency of small SSTables to accumulate during compaction.
So, in short, when you increase the value of compaction_throughput_mb_per_sec from 16 to 608, you raise the throughput allowed for writing consolidated SSTables, which in turn reduces the chance of small SSTables accumulating, and finally improves read performance.
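For example, plugging the question's numbers into that rule (just the arithmetic, not a measured result):
observed_write_rate_mb_s = 38                      # from the question
default_cap_mb_s = 16                              # default compaction_throughput_mb_per_sec
recommended_low = observed_write_rate_mb_s * 16    # 608 MB/s
recommended_high = observed_write_rate_mb_s * 32   # 1216 MB/s
print(default_cap_mb_s, recommended_low, recommended_high)
# With the default cap of 16 MB/s, compaction output is throttled well below the 38 MB/s of
# incoming writes, so SSTables accumulate faster than they are merged; raising the cap to
# 608 (or disabling the throttle with 0) lets compaction keep pace with the write load.
For what it's worth, the value can also be changed at runtime with nodetool setcompactionthroughput, without restarting the node.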
We are currently in the process of deploying a larger Cassandra cluster and are looking for ways to estimate the best size of the key cache, or more accurately, for a way of finding out the size of one row in the key cache.
I have tried tying into the integrated metrics system using Graphite, but I wasn't able to get a clear answer. I also tried putting my own debugging code into org.cassandra.io.sstable, but this didn't yield any concrete results either.
We are using Cassandra 1.2.10. Are there any foolproof ways of getting the size of one row in the key cache?
With best regards,
Ben
Check out jamm. It's a library for measuring the size of an object in memory.
You need to add -javaagent:"/path/to/jamm.jar" to your startup parameters, but Cassandra is already configured to start with jamm, so if you are changing internal Cassandra code this is already done for you.
To get the size of an object (in bytes):
import org.github.jamm.MemoryMeter;

MemoryMeter meter = new MemoryMeter();
long bytes = meter.measureDeep(object);   // deep size of `object` in bytes
measureDeep is more costly, but gives a much more accurate measurement of an object's memory size.
To estimate the key size, let's assume you intend to store 1 million keys in the cache, each key 60 bytes long on average. There will be some overhead to store the key; let's say it is 40 bytes, which means the key size per row = 100 bytes.
Since we need to cache 1 million keys:
total key cache = 1,000,000 * 100 bytes = 100 MB
Perform this calculation for each CF in your keyspace.
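The same estimate, spelled out; the 1 million keys, 60-byte keys and 40-byte overhead are the assumptions above, not measured values:
keys_to_cache = 1000000
avg_key_bytes = 60
overhead_bytes = 40
key_cache_bytes = keys_to_cache * (avg_key_bytes + overhead_bytes)
print(key_cache_bytes / (1024.0 * 1024.0))   # ~95 MiB, i.e. roughly 100 MB per column family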
Here is the situation:
I am trying to fetch around 10k keys from a CF.
Size of cluster: 10 nodes
Data per node: 250 GB
Heap allotted: 12 GB
Snitch used: PropertyFileSnitch with 2 racks in the same data center
No. of SSTables for the CF per node: around 8 to 10
I am using the super column approach. Each row contains around 300 super columns, which in turn contain 5-10 columns each. I am firing a multiget with 10k row keys and 1 super column.
When I fire the call the first time, it takes around 30 to 50 seconds to return the result. After that Cassandra serves the data from the key cache and returns the result in 2-4 seconds.
So Cassandra read performance is hampering our project. I am using phpcassa. Is there any way I can tweak the Cassandra servers so that I can get results faster?
Does the super column approach affect read performance?
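For reference, the access pattern looks roughly like this; I actually use phpcassa, but the sketch below uses pycassa, and the keyspace, CF and key names are made up:
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace', ['node1:9160'])     # placeholder keyspace and host
cf = ColumnFamily(pool, 'UserData')                     # placeholder super column family

row_keys = ['user-%d' % i for i in range(10000)]        # ~10k row keys per call
# Each row has ~300 super columns with 5-10 subcolumns; we ask for a single super column.
result = cf.multiget(row_keys, super_column='sc_bucket_1')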
The use of super columns is best suited to use cases where the number of sub-columns is relatively small. Read more here:
http://www.datastax.com/docs/0.8/ddl/column_family
Just in case you haven't done this already: since you're using the phpcassa library, make sure that you've compiled the Thrift C extension. Per the "INSTALLING" text file in the phpcassa library folder:
Using the C Extension
The C extension is crucial for phpcassa's performance.
You need to configure and make to be able to use the C extension.
cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install
Add the following line to your php.ini file:
extension=thrift_protocol.so
After doing a lot of R&D on this, we figured out there is no way to get this working optimally.
When Cassandra fetches these 10k rows for the first time, it is going to take time, and there is no way to optimize this.
1) However, in practice the probability of people accessing the same records again is higher, so we take maximum advantage of the key cache. The default setting for the key cache is 2 MB, and we can afford to increase it to 128 MB with no memory problems.
After loading the data, run the expected queries to warm up the key cache.
2) The JVM works optimally with an 8-10 GB heap (I don't have numbers to prove it, just observation).
3) Most important: if you are using physical machines (not cloud or virtual machines), check which disk scheduler you are using. Set it to NOOP, which is good for Cassandra as it reads all keys from one section, reducing disk head movement.
The above changes helped bring the query time down to within acceptable limits.
Along with the above changes, if you have CFs which are small in size but frequently accessed, enable row caching for them.
Hope the above info is useful.