Cassandra - Number of disk seeks in a read request - cassandra

I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is for reading the partition index and another is to read the actual data from the compressed partition. The index of the data in compressed partitions is obtained from the compression offset tables (which is stored in memory). Am I on the right track here? Will there ever be a case when more than 1 disk seek is required to read the data?

I'm posting the answer here which I received from Cassandra user community thread in case someone else needs it:
youre right – one seek with hit in the partition key cache and two if not.
Thats the theory – but two thinge to mention:
First, you need two seeks per sstable not per entire read. So if you data is spread over multiple sstables on disk you obviously need more then two reads. Think of often updated partition keys – in combination with memory preassure you can easily end up with maaany sstables (ok they will be compacted some time in the future).
Second, there could be fragmentation on disk which leads to seeks during sequential reads.
Note: Each SSTable has it's own partition index.

Related

Cassandra Compacting wide rows large partitions

I have been searching some docs online to get good understanding of how to tackle large partitions in cassandra.
I followed a document on the below link:
https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch13s10.html.
Regarding "LARGE ROWS WITH COMPACTION LIMITS", below is metioned:
"The default value for in_memory_compaction_limit_in_mb is 64. This value is set in conf/cassandra.yaml. For use cases that have fixed columns, the limit should never be exceeded. Setting this value can work as a sanity check to ensure that processes are not inadvertently writing to many columns to the same key.
Keys with many columns can also be problematic when using the row cache because it requires the entire row to be stored in memory."
In the /conf/cassandra.yaml, I did find a configuration named "in_memory_compaction_limit_in_mb".
The Definition in the cassandra.yaml goes as below:
In Cassandra 2.0:
in_memory_compaction_limit_in_mb
(Default: 64) Size limit for rows being compacted in memory. Larger rows spill to disk and use a slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The recommended value is 5 to 10 percent of the available Java heap size.
In Cassandra 3.0: (No such entries found in cassandra.yaml)
compaction_large_partition_warning_threshold_mb
(Default: 100) Cassandra logs a warning when compacting partitions larger than the set value
I have searching lot on what exactly the setting in_memory_compaction_limit_in_mb does.
It mentions some compaction is done in memory and some compaction is done on disk.
As per my understanding goes, When Compaction process runs:
SSTABLE is being read from disk---->(compared,tombstones removed,stale data removed) all happens in memory--->new sstable written to disk-->old table being removed
This operations accounts to high Disc space requirements and Disk I/O(Bandwidth).
Do help me with,if my understanding of compaction is wrong. Is there anything in compaction that happens in memory.
In my environment the
in_memory_compaction_limit_in_mb is set to 800.
I need to understand the purpose and implications.
Thanks in advance
in_memory_compaction_limit_in_mb is no longer necessary since the size doesn't need to be known before writing. There is no longer a 2 pass compaction so can be ignored. You don't have to do the entire partition at once, just a row at a time.
Now the primary cost is in deserializing the large index at the beginning of the partition that occurs in memory. You can increase the column_index_size_in_kb to reduce the size of that index (at cost of more IO during reads, but likely insignificant compared to the deserialization). Also if you use a newer version (3.11+) the index is lazy loaded after exceeding a certain size which improves things quite a bit.

One bigger partition or few smaller but more distributed partitions for Range Queries in Cassandra?

We have a table that stores our data partitioned by files. One file is 200MB to 8GB in json - but theres a lot of overhead obviously. Compacting the raw data will lower this drastically. I ingested about 35 GB of json data and only one node got slightly more than 800 MB data. This is possibly due to "write hotspots" -- but we only write once and read only. We do not update data. Currently, we have one partition per file.
By using secondary indexes, we search for partitions in the database that contain a specific geolocation (= first query) and then take the result of this query to range query a time range of the found partitions (= second query). This might even be the whole file if needed but in 95% of the queries only chunks of a partition are queried.
We have a replication factor of 2 on a 6 node cluster. Data is fairly even distributed, every node owns 31,9% to 35,7% (effective) data according to nodetool status *tablename*.
Good read performance is key for us.
My questions:
How big is too big for a partition in terms of volume or row size? Is there a rule of thumb for this?
For Range Query performance: Is it better to split up our "big" partitions to have more smaller partitions? We built our schema with "big" partitions because we thought that when we do range queries on a partition, it would be good to have it all on one node so data can be fetched easily. Note that the data is also available on one replica due to RF 2.
C* supports very huge rows, but it doesn't mean it is a good idea to go to that level. The right limit depends on specific use cases, but a good ballpark value could be between 10k and 50k. Of course, everything is a compromise, so if you have "huge" (in terms of bytes) rows then heavily limit the numbers of rows in each partition. If you have "small" (in terms of bytes) rows them you can relax that limit a bit. This is because one partition means one node only due to your RF=1, so all your query for a specific partition will hit only one node.
Range queries should ideally go to one partition only. A range query means a sequential scan on your partition on the node getting the query. However, you will limit yourself to the throughput of that node. If you split your range queries between more nodes (that is you change the way you partition your data by adding something like a bucket) you need to get data from different nodes as well performing parallel queries, directly increasing the total throughput. Of course you'd lose the order of your records within different buckets, so if the order in your partition matters, then that could not be feasible.

Writing to two cassandra tables with time overlap

I am writing to two cassandra tables, the tables have different keyspaces. I am wondering about how the write actually happens.
I see this explanation at: https://academy.datastax.com/demos/brief-introduction-apache-cassandra
Cassandra is well known for its impressive performance in both reading
and writing data. Data is written to Cassandra in a way that provides
both full data durability and high performance. Data written to a
Cassandra node is first recorded in an on-disk commit log and then
written to a memory-based structure called a memtable. When a
memtable’s size exceeds a configurable threshold, the data is written
to an immutable file on disk called an SSTable. Buffering writes in
memory in this way allows writes always to be a fully sequential
operation, with many megabytes of disk I/O happening at the same time,
rather than one at a time over a long period. This architecture gives
Cassandra its legendary write performance
But this does not explain what happens if I write to two tables in overlapping time period.
Let's say I am writing to Table 1 and Table 2 at the same time. The entries that I want to write would still be stored in the same memtable, correct? They would essentially be mixed, right?
Let's say I am writing 100,000,000 entries for Table 1 and 10 minutes later I started to write entries 100 for Table 2. The 100 for Table 2 would still have to wait for entries for Table 1 to be processed, since they are sharing the same memtable right?
Is my understanding about how memtable is shared correct? Is there a way for different keyspaces to have their own memtable. For example, if I really want to make sure that entries for Table 2 get written without a delay, is that possible?
.
Each table have its own memtable. Cassandra does not mix things. That is why it can easily and efficiently flush data on the disk when memtables total space is full.
This Datastax document is a good summary of how writing in Cassandra is performed from commitlog to sstable and compaction.

Cassandra: Storing and retrieving large sized values (50MB to 100 MB)

I want to store and retrieve values from Cassandra which ranges from 50MB to 100MB.
As per documentation, Cassandra works well when the column value size is less than 10MB. Refer here
My table is as below. Is there a different approach to this ?
CREATE TABLE analysis (
prod_id text,
analyzed_time timestamp,
analysis text,
PRIMARY KEY (slno, analyzed_time)
) WITH CLUSTERING ORDER BY (analyzed_time DESC)
As for my own experience, although in theory Cassandra can handle large blobs, in practise it may be really painful. As for one of my past projects, we stored protobuf blobs in C* ranged from 3kb to 100kb, but there were some (~0.001%) of them with size up to 150mb. This caused problems:
Write timeouts. By default C* has 10s write timeout which is really not enough for large blobs.
Read timeouts. The same issue with read timeout, read repair, hinted handoff timeouts and so on. You have to debug all these possible failures and raise all these timeouts. C* has to read the whole heavy row to RAM from disk which is slow.
I personally suggest not to use C* for large blobs as it's not very effective. There are alternatives:
Distributed filesystems like HDFS. Store an URL of the file in C* and file contents in HDFS.
DSE (Commercial C* distro) has it's own distributed FS called CFS on top of C* which can handle large files well.
Rethink your schema in a way to have much lighter rows. But it really depends of your current task (and there's not enough information in original question about it)
Large values can be problematic, as the coordinator needs to buffer each row on heap before returning them to a client to answer a query. There's no way to stream the analysis_text value.
Internally Cassandra is also not optimized to handle such use case very well and you'll have to tweak a lot of settings to avoid problems such as described by shutty.

physical disk space management of cassandra

Recently I have been looking into Cassandra from our new project's perspective and learned a lot from this community and its wiki too. But I have not found anything about about how updates are managed in Cassandra in terms of physical disk space management though it seems to be very much similar to record delete management using compaction.
Suppose there are 100 records with 5 column values each so when all changes would be flushed disk all records will be written adjacently and when delete operation is done then its marked in Memory table first and physically record is deleted after some time as set in configuration or when its full. And the compaction process claims the space.
Now question is that at one side being schema less there is no fixed number of columns at the the beginning but on the other side when compaction process takes place then.. does it put records adjacently on disk like traditional RDBMS to speed up the read process as for RDBMS its easy because they have to allocate fixed amount of space as per declaration of columns datatype.
But how Cassandra exactly makes the records placement on disk in compaction process (both for update/delete) to speed up the reads?
One more question related to compaction is that when there is no delete queries but there is an update query which updates an existent record with some variable length data or insert altogether a new column then how compaction makes its space available on disk between already existent data rows?
Rows and columns are stored in sorted order in an SSTable. This allows a compaction of multiple SSTables to output a new, (sorted) SSTable, with only sequential disk IO. This new SSTable will be outputted into a new file and freespace on the disks. This process doesn't depend on the number of rows of columns, just on them being stored in a sorted order. So yes, in all SSTables (even those resulting form compactions) rows and columns will be arranged in a sorted order on disk.
Whats more, as you hint at in your question, updates are no different from inserts - they do not overwrite the value on disk, but instead get buffered in a Memtable, then get flushed into a new SSTable. When the new SSTable eventually gets compacted with the SSTable containing the original value, the newer value will annihilate the old one - ie the old value will not be outputted from the compaction. Timestamps are used to decide which values is newest.
Deletes are handled in the same fashion, effectively inserted an "anti-value", or tombstone. The limitation of this process is that is can require significant space overhead. Deletes are effectively 'lazy, so the space doesn't get freed until some time later. Also, while the output of the compaction can be the same size as the input, the old SSTables cannot be deleted until the new one is completed, so this can reduce disk utilisation to 50%.
In the system described above, new values for an existing key can be a different size to the existing key without padding to some pre-determined length, as the new value does not get written over the old value on update, but to a new SSTable.

Resources