Cassandra nodetool "compactionstats" meaning of displayed values

I cannot find documentation on nodetool "compactionstats".
When running nodetool compactionstats, what do the numerical values in the "completed" and "total" columns mean?
My column family has a total data size of about 360 GB but my compaction status displays:
pending tasks: 7
compaction type   keyspace   column family   completed      total           unit    progress
Compaction        Test       Message         161257707087   2475323941809   bytes   6.51%
I can see "completed" increasing slowly (and the progress too ;-).
But how is "total" computed? Why is it 2.5 TB when I have only 360 GB of data?

You must have compression on. total is the total number of uncompressed bytes comprising the set of sstables that are being compacted together. If you grep the cassandra log file for lines containing Compacting you will find the sstables that are part of a compaction. If you sum these sizes and multiply by the inverse of your compression ratio for the column family you will get pretty close to the total. By default this can be a bit difficult to verify on a multi-core system because the number of simultaneous compactions defaults to the number of cores.
You can also verify this answer by looking at the code:
AbstractCompactionIterable.getCompactionInfo() uses the bytesRead and totalBytes fields from that class. totalBytes is final and is computed in the constructor by summing getLengthInBytes() over each file that is part of the compaction.
The scanners vary, but the length in bytes returned by CompressedRandomAccessReader is the uncompressed size of the file.
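The relationship described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the compression ratio is assumed to be compressed size divided by uncompressed size and uniform across sstables, and the sstable sizes in the last call are made-up inputs.

```python
def estimate_compaction_total(sstable_sizes_on_disk, compression_ratio):
    """Estimate the uncompressed "total" shown by nodetool compactionstats.

    sstable_sizes_on_disk: on-disk (compressed) sizes in bytes of the
        sstables listed in one "Compacting" log line.
    compression_ratio: compressed size / uncompressed size for the column
        family (assumption: the same for every sstable).
    """
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    return sum(sstable_sizes_on_disk) / compression_ratio


def progress_percent(completed, total):
    """The progress column: completed bytes as a percentage of total bytes."""
    return 100.0 * completed / total


# Values taken from the compactionstats output in the question.
print(f"{progress_percent(161257707087, 2475323941809):.2f}%")

# Hypothetical input: three 100 GB sstables at a ratio of 0.15.
print(estimate_compaction_total([100e9, 100e9, 100e9], 0.15))
```

The first call reproduces the 6.51% shown in the question, which confirms that progress is simply completed divided by total, both in uncompressed bytes.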


DSE Cluster node disk gets filled

I have a 6-node cluster, and each node has 1000 GB of disk. The disk on one node unexpectedly filled up to 1000 GB. On analysis I found that only one keyspace was growing, and within it only one table, whose size increased from 200 GB to 800 GB in 24 hours, which means someone executed operations on this table only. I want to figure out which operations were performed on this node that led to this increase.
Are there any logs which can be looked at to see what operations were performed?
I would start with "nodetool tablehistograms" to confirm that you have large partitions in the table. Then I would go to the table directory and run "sstablemetadata" on some of the data files, looking for ones that display large partition sizes.
One trick you could do once you find sstables that have larger partitions is:
sstabledump <sstable> | grep -n "\"key\" :"
That shows the line number each time the key changes. The larger the gap between line numbers, the more rows the partition has.
Here is an example:
sstabledump aa-483-bti-Data.db | grep -n "\"key\" :"
4: "key" : [ "PROCESSING" ],
65605: "key" : [ "PENDING" ],
8552007: "key" : [ "COMPLETED" ],
As you can see, the gap between PENDING and COMPLETED is much larger than the gap between PROCESSING and PENDING (about 8.5M lines vs. 65k lines). So this tells me that the PROCESSING partition is relatively small compared to PENDING. The only mystery is how large the COMPLETED one is, as there is no "ending" line. To get the total line count, run:
sstabledump aa-483-bti-Data.db | wc -l
16316029
The total line count is about 16.3M, so COMPLETED runs from line 8.5M to line 16.3M, or about 7.8M lines. So the COMPLETED partition is large as well, nearly as large as the PENDING partition.
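The gap arithmetic above can be automated. The following is a rough sketch, not part of any Cassandra tooling: the function name is hypothetical, and it assumes the `grep -n` output has the shape shown in the example and that the last partition ends at the `wc -l` total.

```python
import re


def partition_line_spans(grep_lines, total_lines):
    """Return (key_text, line_count) for each partition.

    grep_lines: lines from `sstabledump <sstable> | grep -n "\"key\" :"`,
        for example '65605: "key" : [ "PENDING" ],'
    total_lines: the number printed by `sstabledump <sstable> | wc -l`
    """
    starts = []
    for line in grep_lines:
        match = re.match(r'\s*(\d+):\s*"key"\s*:\s*\[(.*)\]', line)
        if match:
            # Keep the key text exactly as printed, including its quotes.
            starts.append((int(match.group(1)), match.group(2).strip()))

    spans = []
    for i, (start, key) in enumerate(starts):
        # A partition runs until the next key line, or to the end of the dump.
        end = starts[i + 1][0] if i + 1 < len(starts) else total_lines
        spans.append((key, end - start))
    return spans


example = [
    '4: "key" : [ "PROCESSING" ],',
    '65605: "key" : [ "PENDING" ],',
    '8552007: "key" : [ "COMPLETED" ],',
]
for key, lines in partition_line_spans(example, 16316029):
    print(key, lines)
```

Line counts are only a proxy for partition size, so the result is best cross-checked against the sstablemetadata histogram.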
Looking at sstablemetadata to see if that matches up with the output, I see that it does:
sstablemetadata aa-483-bti-Data.db
Partition Size:
   Size (bytes)         | Count (%) Histogram
   943127 (921.0 kB)    | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
   129557750 (123.6 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
   155469300 (148.3 MB) | 1 ( 33) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
I see two relatively large partitions and one small one. Bingo.
Maybe some of those can help you get to the bottom of your large partition(s).
With DataStax Enterprise, you should be able to turn on the Database Auditing feature. If you configure the logger class CassandraAuditWriter, all activity gets written to the audit_log table in the dse_audit keyspace.
The data is organized by this PRIMARY KEY: ((date, node, day_partition), event_time), and has columns like username, table_name, keyspace_name, operation and others.
Check out the DataStax docs on that for configuration and query options.
As for (open source) Apache Cassandra, we use Ericsson's Cassandra Audit plugin for this functionality. By adding in the project's JAR, and making a couple of adjustments to the cassandra.yaml file, you can view the audit.logs for records like:
15:42:41.655 - client:'10.0.110.1'|user:'flynn'|status:'ATTEMPT'|operation:'DELETE FROM ecks.ectbl WHERE partk = ?'

What is the byte size of common Cassandra data types - To be used when calculating partition disk usage?

I am trying to calculate the partition size for each row in a table with an arbitrary number of columns and types, using a formula from the DataStax Academy Data Modeling course.
In order to do that I need to know the "size in bytes" of some common Cassandra data types. I tried to google this, but I found many conflicting suggestions, so I am puzzled.
The data types I would like to know the byte size of are:
A single Cassandra TEXT character (I googled answers from 2 - 4 bytes)
A Cassandra DECIMAL
A Cassandra INT (I suppose it is 4 bytes)
A Cassandra BIGINT (I suppose it is 8 bytes)
A Cassandra BOOLEAN (I suppose it is 1 byte, or is it a single bit?)
Any other considerations would of course also be appreciated regarding data types sizes in Cassandra.
Adding more info, since there seems to be some confusion: I am only trying to estimate the worst-case disk usage the data would occupy, without any compression or other optimizations done by Cassandra behind the scenes.
I am following the DataStax Academy course DS220 (see link at end), implementing its formula, and will use the info from the answers here as variables in that formula.
https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size
I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope estimate of worst case using the formulae in the ds220 course up-front at design time. The effect of compression often varies depending on algorithms and patterns in the data. From ds220 and http://cassandra.apache.org/doc/latest/cql/types.html:
uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPv6)
date: 4 bytes
float: 4 bytes
int: 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully.. no source for this)
ascii: requires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for language)
map/list/set/blob: an estimate
Hope it helps.
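As a back-of-the-envelope aid, the sizes listed above can be put into a small lookup table. This sketch only sums raw value sizes for one row. It is not the DS220 formula, it ignores all per-cell and per-row overhead, and the column names and the 40-byte text average in the example are made up.

```python
# Fixed-width sizes in bytes, copied from the list above.
FIXED_SIZES = {
    "uuid": 16, "timeuuid": 16, "timestamp": 8, "bigint": 8, "counter": 8,
    "double": 8, "time": 8, "date": 4, "float": 4, "int": 4,
    "smallint": 2, "tinyint": 1, "boolean": 1,
}


def estimate_value_bytes(columns, avg_variable_bytes=None):
    """Sum the raw value sizes of one row.

    columns: mapping of column name -> CQL type name (lowercase).
    avg_variable_bytes: mapping of column name -> estimated average size in
        bytes for variable-width types (text, ascii, blob, collections, inet).
    """
    avg_variable_bytes = avg_variable_bytes or {}
    total = 0
    for name, cql_type in columns.items():
        if cql_type in FIXED_SIZES:
            total += FIXED_SIZES[cql_type]
        elif name in avg_variable_bytes:
            total += avg_variable_bytes[name]
        else:
            raise ValueError(f"no size estimate for {name} ({cql_type})")
    return total


# Hypothetical table: one text column estimated at 40 bytes on average.
row = {"id": "uuid", "ts": "timestamp", "amount": "bigint", "note": "text"}
print(estimate_value_bytes(row, {"note": 40}))
```

The result is an input for whatever sizing formula you use, not a final disk usage figure.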
The only reliable way to estimate the overhead of something is to actually measure it. You can't take the individual data types and generalize from them. If you have 4 bigint columns and you assume that your overhead is X, then with 400 bigint columns your overhead probably won't be 100X. That's because Cassandra compresses everything before storing data on disk (by default; it's a setting tunable per column family).
Try loading some data, I mean production data, into the cluster, and then let us know your results and compression configuration. You may find some surprises.
Know your data.

YCSB low read throughput cassandra

The YCSB Endpoint benchmark would have you believe that Cassandra is the golden child of NoSQL databases. However, recreating the results on our own boxes (8 cores with hyperthreading, 60 GB memory, two 500 GB SSDs), we are seeing dismal read throughput for workload b (read mostly, i.e. 95% read, 5% update).
The cassandra.yaml settings are exactly the same as the Endpoint settings, barring the different IP addresses and our disk configuration (1 SSD for data, 1 for the commit log). While their throughput is ~38,000 operations per second, ours is ~16,000, (relatively) regardless of the number of threads or client nodes. I.e. one worker node with 256 threads will report ~16,000 ops/sec, while 4 nodes will each report ~4,000 ops/sec.
I've set the readahead value to 8KB for the SSD data drive. I'll put the custom workload file below.
When analyzing disk I/O and CPU usage with iostat, the read throughput is consistently ~200,000 KB/s, which seems to suggest that the YCSB cluster throughput should be higher (records are 100 bytes). ~25-30% of CPU time is in %iowait, and 10-25% is in use by user processes.
top and nload show no obvious bottleneck (<50% memory usage, and 10-50 Mbit/s on a 10 Gb/s link).
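To make the mismatch between disk throughput and client throughput concrete, here is a quick calculation using only the approximate figures quoted in this question. The function name is hypothetical, and 1 KB is assumed to be 1024 bytes.

```python
def bytes_read_per_operation(disk_read_kb_per_s, ops_per_s):
    """Average bytes read from disk per client operation (1 KB = 1024 bytes)."""
    return disk_read_kb_per_s * 1024 / ops_per_s


# fieldcount * fieldlength from the workload file: 10 fields of 10 bytes.
record_bytes = 10 * 10

per_op = bytes_read_per_operation(200_000, 16_000)
print(per_op)                 # bytes read from disk per operation
print(per_op / record_bytes)  # disk bytes read per byte of record returned
```

With these inputs each operation costs about 12.8 KB of disk reads for a 100-byte record, so the raw disk rate alone does not indicate how many operations per second the cluster should sustain.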
# The name of the workload class to use
workload=com.yahoo.ycsb.workloads.CoreWorkload
# There is no default setting for recordcount but it is
# required to be set.
# The number of records in the table to be inserted in
# the load phase or the number of records already in the
# table before the run phase.
recordcount=2000000000
# There is no default setting for operationcount but it is
# required to be set.
# The number of operations to use during the run phase.
operationcount=9000000
# The offset of the first insertion
insertstart=0
insertcount=500000000
core_workload_insertion_retry_limit = 10
core_workload_insertion_retry_interval = 1
# The number of fields in a record
fieldcount=10
# The size of each field (in bytes)
fieldlength=10
# Should read all fields
readallfields=true
# Should write all fields on update
writeallfields=false
fieldlengthdistribution=constant
readproportion=0.95
updateproportion=0.05
insertproportion=0
readmodifywriteproportion=0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=usertable
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
The key was increasing native_transport_max_threads in the cassandra.yaml file.
Along with the increased settings in the comment (increasing connections in ycsb client as well as concurrent read/writes in cassandra), Cassandra jumped to ~80,000 ops/sec.

Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test(
    ID uuid,
    Time timestamp,
    Value double,
    Date timestamp,
    PRIMARY KEY ((ID,Date), Time)
) WITH COMPACT STORAGE;
I have parsed this information in a custom yaml file and used parameters n=10000, threads=100 and the rest are default options (cl=one, mode=native cql3, etc). The Cassandra cluster is a 3 node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspecs:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
The number of row-keys/partitions is 10000 (i.e. n), of which 100 partitions are taken at a time (which means 100 * 1000 = 100000 key-value pairs), and 50000 of those key-value pairs are processed at a time (because of select: fixed(1)/2, i.e. ~50%).
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of [100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run   Total_ops   Op_rate   Partition_rate   Row_Rate   Time
1     56          19        1885             943246     3.0
2     46          46        4648             2325498    1.0
3     27          30        2982             1489870    0.9
4     59          19        1932             966034     3.1
5     100         17        1730             865182     5.8
What I need to understand is the following:
Which of these metrics is the throughput, i.e. the number of records inserted per second? Is it the Row_rate, Op_rate or Partition_rate? If it's the Row_rate, can I safely conclude that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why does Total_ops vary so drastically between runs? Does the number of threads have anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL Rows that you have inserted into your database. For your table a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of partitions C* had to construct. A partition is the data structure which holds and orders data in Cassandra; data with the same partition key ends up located on the same node. The Partition Rate is equal to the number of unique partition key values that were inserted in the time window. For your table this would be unique values of (ID, Date).
Op Rate is the number of actual CQL operations that had to be done. From your settings, it is running unlogged batches to insert the data. Each batch contains approximately 100 partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate.
Total Ops should include all operations, read and write. So if you have any read operations, those would also be included.
I would suggest changing your batch size to match your workload, or keep it at 1 depending on your actual database usage. This should provide a more realistic scenario. Also it's important to run much longer than just 100 total operations to really get a sense of your system's capabilities. Some of the biggest difficulties come when the size of the dataset increases beyond the amount of RAM in the machine.
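The relationship between the three rates can be checked against the numbers in the question. This is a small sketch with a hypothetical function name. The batch shape (100 partitions and 50000 rows per operation) comes from the "Generating batches" message, and the reported Op_rate is rounded, so only an approximate match is expected.

```python
def expected_rates(op_rate, partitions_per_batch, rows_per_batch):
    """Derive partition and row rates from the operation (batch) rate."""
    partition_rate = op_rate * partitions_per_batch
    row_rate = op_rate * rows_per_batch
    return partition_rate, row_rate


# Run 1 from the table: Op_rate 19, reported Partition_rate 1885,
# reported Row_Rate 943246.
partition_rate, row_rate = expected_rates(19, 100, 50000)
print(partition_rate, row_rate)
```

Row rate is therefore the closest thing to "records inserted per second", while Op rate counts whole batches.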

Cassandra 2.0 eating disk space

I am using Cassandra in my app, and it started eating up disk space much faster than I expected and much faster than described in the manual. Consider this very simple example:
CREATE TABLE sizer (
    id ascii,
    time timestamp,
    value float,
    PRIMARY KEY (id,time)
) WITH compression={'sstable_compression': ''};
I am turning off compression on purpose to see how many bytes each record takes.
Then I insert a few values, run nodetool flush, and check the size of the data file on disk to see how much space it took.
The results show a huge waste of space. Each record takes 67 bytes, and I am not sure how that is possible.
My id is 13 bytes long and it is saved only once in the data file, since it is always the same for testing purposes.
According to: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/architecture/architecturePlanningUserData_t.html
Size should be:
timestamp should be 8 bytes
value as column name takes 6 bytes
column value float takes 4 bytes
column overhead 15 bytes
TOTAL: 33 bytes
For testing's sake, my id is always the same, so I actually have only 1 row, if I understood correctly.
So my question is: how do I end up using 67 bytes instead of 33?
The data file size is correct; I tried inserting 100, 1000 and 10000 records. The size is always 67 bytes per record.
There are 3 overheads discussed in that document. One is the column overhead, which you have accounted for. The second is the row overhead. And if you have a replication_factor greater than 1, there's an overhead for that as well.
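To keep the bookkeeping straight, here is the question's own arithmetic as a short sketch. The component sizes are the ones listed in the question, the function name is hypothetical, and the script only quantifies the gap; it does not say which overhead accounts for it.

```python
# Per-record components as listed in the question (bytes).
EXPECTED_COMPONENTS = {
    "timestamp": 8,
    "column name 'value'": 6,
    "float value": 4,
    "column overhead": 15,
}


def unexplained_bytes(measured_per_record, components):
    """Bytes per record not covered by the listed components."""
    return measured_per_record - sum(components.values())


print(sum(EXPECTED_COMPONENTS.values()))           # expected bytes per record
print(unexplained_bytes(67, EXPECTED_COMPONENTS))  # gap left to explain
```

The remaining gap is what the row-level overhead mentioned above would have to cover. Measuring on a single node avoids mixing replication into the number.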
