Cassandra: Storing and retrieving large values (50 MB to 100 MB)

I want to store and retrieve values in Cassandra that range from 50 MB to 100 MB.
As per the documentation, Cassandra works well when the column value size is less than 10 MB. Refer here.
My table is shown below. Is there a different approach to this?
CREATE TABLE analysis (
    prod_id text,
    analyzed_time timestamp,
    analysis text,
    PRIMARY KEY (prod_id, analyzed_time)
) WITH CLUSTERING ORDER BY (analyzed_time DESC);

From my own experience: although in theory Cassandra can handle large blobs, in practice it can be really painful. On one of my past projects we stored protobuf blobs in C* ranging from 3 KB to 100 KB, but a small fraction (~0.001%) of them were up to 150 MB in size. This caused problems:
Write timeouts. By default C* has a 10 s write timeout, which is really not enough for large blobs.
Read timeouts. The same issue applies to the read timeout, read repair, hinted handoff timeouts and so on. You have to debug all these possible failures and raise all these timeouts (see the sketch just below for the client-side part). C* has to read the whole heavy row from disk into RAM, which is slow.
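A minimal sketch of the client-side part of that tuning, assuming the DataStax Java driver 3.x (the 60-second value is just an example, not a recommendation; the server-side limits such as read_request_timeout_in_ms and write_request_timeout_in_ms are raised separately in cassandra.yaml):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SocketOptions;

public class LargeBlobClient {
    public static void main(String[] args) {
        // How long the driver waits for a response to each request (default: 12 s).
        SocketOptions socketOptions = new SocketOptions()
                .setReadTimeoutMillis(60_000);

        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withSocketOptions(socketOptions)
                .build();
             Session session = cluster.connect()) {
            // ... reads and writes of large values go here ...
        }
    }
}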
I personally suggest not using C* for large blobs, as it's not very effective. There are alternatives:
Distributed filesystems like HDFS. Store a URL of the file in C* and the file contents in HDFS.
DSE (the commercial C* distribution) has its own distributed FS called CFS on top of C*, which can handle large files well.
Rethink your schema to have much lighter rows (one way to do that is sketched below). But it really depends on your current task (and there's not enough information about it in the original question).
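One way to get much lighter rows is to split each large value into fixed-size chunks and store one chunk per clustering row, so that no single cell exceeds a few megabytes. The sketch below only illustrates that idea, using the DataStax Java driver 3.x; the analysis_chunks table, the keyspace name and the 1 MB chunk size are assumptions, not taken from the original question.

// Hypothetical chunked layout:
//   CREATE TABLE analysis_chunks (
//       prod_id       text,
//       analyzed_time timestamp,
//       chunk_no      int,
//       data          blob,
//       PRIMARY KEY ((prod_id, analyzed_time), chunk_no)
//   );
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.nio.ByteBuffer;
import java.util.Date;

public class ChunkedWriter {
    private static final int CHUNK_SIZE = 1 << 20; // 1 MB per cell

    // Splits the value into CHUNK_SIZE pieces and writes each piece as its own row.
    static void writeChunked(Session session, PreparedStatement insert,
                             String prodId, Date analyzedTime, byte[] analysis) {
        int chunkNo = 0;
        for (int offset = 0; offset < analysis.length; offset += CHUNK_SIZE, chunkNo++) {
            int length = Math.min(CHUNK_SIZE, analysis.length - offset);
            ByteBuffer chunk = ByteBuffer.wrap(analysis, offset, length);
            session.execute(insert.bind(prodId, analyzedTime, chunkNo, chunk));
        }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO analysis_chunks (prod_id, analyzed_time, chunk_no, data) "
                    + "VALUES (?, ?, ?, ?)");
            writeChunked(session, insert, "prod-1", new Date(), new byte[60 * CHUNK_SIZE]);
        }
    }
}

Reading the value back is the reverse: select all chunks of the partition in chunk_no order and concatenate them.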

Large values can be problematic, as the coordinator needs to buffer each row on the heap before returning it to the client to answer a query. There's no way to stream the analysis value.
Internally, Cassandra is also not optimized to handle such a use case very well, and you'll have to tweak a lot of settings to avoid problems such as those described by shutty.

Related

Why do Tombstones affect read performance but not updates?

From the articles I have read, tombstones affect read performance in Cassandra. I'm reading about how data is updated in Cassandra, and it looks like data is written with a timestamp without modifying or reading the current data.
So when a read is performed before compaction is done, filtering needs to be done to take the latest value, right? If that's the case, aren't tombstones the same thing, and why do they affect read performance negatively while updates to a row do not?
In Cassandra, an update is a mutation, just like an insert or a delete. Except for the LWT use case and some of the list operations, all mutations are just appended to the memtable/commit log without reading the data on disk, so they are very fast: no checks are performed.
A read operation, in contrast, needs to get all versions of the data from disk/memtable and then build the actual version of the data based on the timestamps. Tombstones have to be kept in memory during the read, because we may read some data from disk with an older timestamp, and we need to detect that it has been deleted.
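A small illustration of that "no read before write" behaviour, assuming the DataStax Java driver 3.x and a hypothetical demo.kv (k text PRIMARY KEY, v text) table: an UPDATE on a key that was never inserted still succeeds, because it is just another appended mutation, and the read reconciles the versions by timestamp.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UpsertDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {
            // No prior INSERT for 'k1': each UPDATE is a blind append to the commit log/memtable.
            session.execute("UPDATE kv SET v = 'first'  WHERE k = 'k1'");
            session.execute("UPDATE kv SET v = 'second' WHERE k = 'k1'");
            // The read gathers both mutations and keeps the one with the newest timestamp.
            String v = session.execute("SELECT v FROM kv WHERE k = 'k1'").one().getString("v");
            System.out.println(v); // prints "second"
        }
    }
}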

Cassandra vs Cassandra+Ignite

(Single-node cluster.) I've got a table with two columns, one of type 'text' and the other a 'blob'. I'm using DataStax's C++ driver to perform read/write requests in Cassandra.
The blob stores a C++ structure (size: 7 KB).
Since I was getting lower than desirable throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, hoping for a significant improvement in performance, as the data would now be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, the performance actually dropped (by roughly 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since I am storing a C++ structure in Cassandra's blob, the Ignite API performs serialization/deserialization while writing/reading the data. Is this the reason for the drop in performance (considering the size of the structure, i.e. 7 KB), or is this drop not expected at all and something is wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
The configurations for Ignite are the same as given here.
I got a significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. The throughput has now increased from 9000 rows/second to 23000 rows/second. Still, it's not significantly better than Cassandra alone. I'm hopeful that a few more tweaks will improve this further.
I've added some more details about the configuration and client code on GitHub.
It looks like you do one get per key in this benchmark for Ignite, and you didn't invoke loadCache before it. In that case, on each get, Ignite will go to Cassandra to fetch the value and only then store it in the cache. So I'd recommend invoking loadCache before benchmarking, or at least testing gets on the same keys, to give Ignite a chance to keep those keys in the cache. If you think you already have all the data in the caches, please also share the code where you write data to Ignite.
Also, you invoke "grid.GetCache" in each thread. It won't take a lot of time, but you should definitely avoid such calls inside the benchmark, once you are already measuring time.
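A minimal Java-side sketch of that suggestion (the cache name, the configuration file, and the key/value types are assumptions, not taken from the question): look the cache up once and warm it with loadCache before the timed loop starts.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class WarmupBeforeBenchmark {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start("ignite-cassandra-config.xml")) {
            // Look the cache up once, outside the worker threads and outside the measured section.
            IgniteCache<String, byte[]> cache = ignite.cache("analysisCache");

            // Pull the data through the Cassandra cache store into Ignite up front,
            // so the benchmark measures cache hits rather than read-through misses.
            cache.loadCache(null);

            long start = System.nanoTime();
            // ... benchmark loop: only cache.get(key) calls in here ...
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Benchmark took " + elapsedMs + " ms");
        }
    }
}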

Cassandra - Number of disk seeks in a read request

I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case: one to read the partition index, and another to read the actual data from the compressed partition. The offset of the data in the compressed partition is obtained from the compression offset tables (which are stored in memory). Am I on the right track here? Will there ever be a case when more than one disk seek is required to read the data?
I'm posting the answer I received from the Cassandra user community thread here, in case someone else needs it:
You're right – one seek with a hit in the partition key cache, and two if not.
That's the theory – but two things to mention:
First, you need up to two seeks per SSTable, not per entire read. So if your data is spread over multiple SSTables on disk, you obviously need more than two seeks. Think of frequently updated partition keys – in combination with memory pressure you can easily end up with many SSTables (ok, they will be compacted some time in the future).
Second, there could be fragmentation on disk, which leads to extra seeks during sequential reads.
Note: each SSTable has its own partition index.

Writing to two Cassandra tables with time overlap

I am writing to two Cassandra tables; the tables are in different keyspaces. I am wondering how the write actually happens.
I see this explanation at: https://academy.datastax.com/demos/brief-introduction-apache-cassandra
Cassandra is well known for its impressive performance in both reading and writing data. Data is written to Cassandra in a way that provides both full data durability and high performance. Data written to a Cassandra node is first recorded in an on-disk commit log and then written to a memory-based structure called a memtable. When a memtable's size exceeds a configurable threshold, the data is written to an immutable file on disk called an SSTable. Buffering writes in memory in this way allows writes always to be a fully sequential operation, with many megabytes of disk I/O happening at the same time, rather than one at a time over a long period. This architecture gives Cassandra its legendary write performance.
But this does not explain what happens if I write to two tables in overlapping time periods.
Let's say I am writing to Table 1 and Table 2 at the same time. The entries that I want to write would still be stored in the same memtable, correct? They would essentially be mixed, right?
Let's say I am writing 100,000,000 entries to Table 1, and 10 minutes later I start writing 100 entries to Table 2. The 100 entries for Table 2 would still have to wait for the Table 1 entries to be processed, since they share the same memtable, right?
Is my understanding of how the memtable is shared correct? Is there a way for different keyspaces to have their own memtables? For example, if I really want to make sure that entries for Table 2 get written without delay, is that possible?
Each table has its own memtable. Cassandra does not mix them; that is why it can easily and efficiently flush data to disk when the total memtable space is full.
This DataStax document is a good summary of how writing in Cassandra is performed, from commit log to SSTable and compaction.

Cassandra repair - lots of streaming in case of incremental repair with Leveled Compaction enabled

I use Cassandra for gathering time-series measurements. To enable nice partitioning, besides device-id I added day-from-UTC-beginning and a bucket derived from the written measurement. The time is added as a clustering key. The final key can be written as
((device-id, day-from-UTC-beginning, bucket), measurement-uuid)
Queries against this schema in the majority of cases fetch whole rows for a given device-id and day-from-UTC-beginning, using IN for the buckets. Because of this query pattern, Leveled Compaction looked like a perfect match, as it ensures with high probability that a row is held by one SSTable.
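For concreteness, the schema and query pattern described above might look roughly like this (a sketch only; the table and column names, the keyspace, and the bucket count are guesses, shown here through the DataStax Java driver 3.x):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class MeasurementsSchema {
    // ((device-id, day-from-UTC-beginning, bucket), measurement-uuid) rendered as CQL.
    static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS measurements ("
            + " device_id text,"
            + " day_no int,"              // days since the UTC epoch
            + " bucket int,"
            + " measurement_uuid timeuuid,"
            + " value blob,"
            + " PRIMARY KEY ((device_id, day_no, bucket), measurement_uuid))";

    // Typical query: whole rows for one device and day, IN over the buckets.
    static final String SELECT_DAY =
            "SELECT * FROM measurements"
            + " WHERE device_id = ? AND day_no = ? AND bucket IN (0, 1, 2, 3)";

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("metrics")) {
            session.execute(CREATE_TABLE);
            session.execute(session.prepare(SELECT_DAY).bind("device-1", 17000));
        }
    }
}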
Running incremental repair was fine when appending to the table was disabled. Once the repair was run under write pressure, lots of streaming was involved. It looked like more data was streamed than had been appended since the last repair.
I've tried using multiple tables, one for each day. When a day ended and no further writes were made to a given table, repair ran smoothly. I'm aware of the overhead of thousands of tables, though it looks like the only feasible solution.
What's the correct way of combining Leveled Compaction with incremental repairs under a heavy-write scenario?
Leveled Compaction is not a good idea when you have a write-heavy workload. It is better suited to a mixed read/write workload where read latency matters. Also, if your cluster is already pressed for I/O, switching to Leveled Compaction will almost certainly only worsen the problem, so make sure you have SSDs.
At this time, Size Tiered is the better choice for a write-heavy workload. There are some improvements for this in 2.1, though.
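For reference, switching an existing table to size-tiered compaction is a single ALTER TABLE statement; the sketch below runs it through the DataStax Java driver 3.x against a hypothetical measurements.readings table.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SwitchToSizeTiered {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("measurements")) {
            // Move the table from Leveled to Size Tiered compaction.
            session.execute("ALTER TABLE readings WITH compaction = "
                    + "{'class': 'SizeTieredCompactionStrategy'}");
        }
    }
}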
