Writing to two Cassandra tables with time overlap

I am writing to two Cassandra tables in different keyspaces, and I am wondering how the write actually happens.
I see this explanation at: https://academy.datastax.com/demos/brief-introduction-apache-cassandra
Cassandra is well known for its impressive performance in both reading
and writing data. Data is written to Cassandra in a way that provides
both full data durability and high performance. Data written to a
Cassandra node is first recorded in an on-disk commit log and then
written to a memory-based structure called a memtable. When a
memtable’s size exceeds a configurable threshold, the data is written
to an immutable file on disk called an SSTable. Buffering writes in
memory in this way allows writes always to be a fully sequential
operation, with many megabytes of disk I/O happening at the same time,
rather than one at a time over a long period. This architecture gives
Cassandra its legendary write performance
But this does not explain what happens if I write to two tables during overlapping time periods.
Let's say I am writing to Table 1 and Table 2 at the same time. The entries that I want to write would still be stored in the same memtable, correct? They would essentially be mixed, right?
Let's say I am writing 100,000,000 entries for Table 1, and 10 minutes later I start writing 100 entries for Table 2. The 100 entries for Table 2 would still have to wait for the Table 1 entries to be processed, since they share the same memtable, right?
Is my understanding of how the memtable is shared correct? Is there a way for different keyspaces to have their own memtable? For example, if I really want to make sure that entries for Table 2 get written without a delay, is that possible?

Each table has its own memtable; Cassandra does not mix data from different tables. That is why it can flush data to disk easily and efficiently when the total memtable space is full.
This DataStax document is a good summary of how a write in Cassandra proceeds from the commit log to the SSTable and on to compaction.
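The point about per-table memtables can be illustrated with a toy model. This is only a sketch under stated assumptions, not Cassandra's implementation: the `Node` class, the `flush_threshold` parameter (counted in entries rather than bytes), and the keyspace and table names are all hypothetical.

```python
from collections import defaultdict


class Node:
    """Toy model of one node: a separate memtable per (keyspace, table)."""

    def __init__(self, flush_threshold):
        # Hypothetical threshold, counted in entries instead of bytes.
        self.flush_threshold = flush_threshold
        self.memtables = defaultdict(dict)  # (keyspace, table) -> {key: value}
        self.sstables = defaultdict(list)   # (keyspace, table) -> flushed snapshots

    def write(self, keyspace, table, key, value):
        target = (keyspace, table)
        memtable = self.memtables[target]
        memtable[key] = value
        if len(memtable) >= self.flush_threshold:
            # Only this table's memtable is flushed; other tables are untouched.
            self.sstables[target].append(sorted(memtable.items()))
            self.memtables[target] = {}


node = Node(flush_threshold=1000)
for i in range(2500):
    node.write("ks1", "table1", i, "big")
for i in range(100):
    node.write("ks2", "table2", i, "small")

print(len(node.sstables[("ks1", "table1")]))   # flushed snapshots for table1
print(len(node.memtables[("ks2", "table2")]))  # entries buffered for table2
```

In this model the 100 entries for the second table land in their own memtable and never sit behind the first table's entries.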

Related

How does Cassandra (or Scylla) sort clustering columns?

One of the benefits of Cassandra (or Scylla) is that:
When a table has multiple clustering columns, the data is stored in nested sort order.
https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/whereClustering.html
Because of this I think reading the data back in that same sorted order should be very fast.
If data is written in a different order than the clustering columns specify, when does Cassandra (or Scylla) actually re-order the data?
Is it when the memtables are flushed to SSTables?
What if a memtable has already been flushed, and I add a new record that should be before records in an existing SSTable?
Does it keep the data out of order on disk for a while and re-order it during compaction?
If so, what steps does it take to make sure reads are in the correct order?
Data is always sorted in any given sstable.
When a memtable is flushed to disk, that will create a new sstable, which is sorted within itself. This happens naturally since memtables store data in sorted order, so no extra sorting is needed at that point. Sorting happens on insertion into the memtable.
A read, which is using natural ordering, will have to read from all sstables which are relevant for the read, merging multiple sorted results into one sorted result. This merging happens in memory on-the-fly.
Compaction, when it kicks in, will replace multiple sstables with one, creating a merged stream much like a regular read would do.
This technique of storing data is known as a log-structured merge tree.
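The merge described above can be sketched in a few lines. This is a simplified illustration, not the actual storage engine: each "sstable" here is assumed to be a Python list of `(clustering_key, write_timestamp, value)` tuples already sorted by key, and `merged_read` is a hypothetical helper name.

```python
import heapq
from itertools import groupby

# Each source is sorted by clustering key, as every SSTable and memtable is.
sstable_old = [(1, 10, "a"), (5, 10, "e"), (9, 10, "i")]
sstable_new = [(3, 20, "c"), (5, 20, "E")]
memtable = [(0, 30, "zero")]


def merged_read(*sources):
    """Merge sorted sources into one sorted result; the newest write wins per key."""
    stream = heapq.merge(*sources)  # lazy k-way merge, no full re-sort needed
    result = []
    for key, versions in groupby(stream, key=lambda row: row[0]):
        newest = max(versions, key=lambda row: row[1])
        result.append((key, newest[2]))
    return result


rows = merged_read(sstable_old, sstable_new, memtable)
print(rows)  # [(0, 'zero'), (1, 'a'), (3, 'c'), (5, 'E'), (9, 'i')]

# Compaction runs the same kind of merge over the SSTables only
# and writes the result out as one new SSTable.
compacted = merged_read(sstable_old, sstable_new)
```

Key 3 was written after key 5 had already been flushed, yet the read still returns it in the right position because the merge interleaves the sorted inputs.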
Data spread across different SSTables is merged back into a single sorted order during compaction.
Basically, any write is just an append, in order to be very fast. There are no reads or seeks involved.
When reading data, Cassandra reads from the active memtable and from one or more SSTables. The results are merged to satisfy the query.
Since a read may have to access a growing number of SSTables, compaction's role is to reorganize the data on disk and reduce the overhead of reading from multiple SSTables. It is worth mentioning that SSTables are immutable: compaction creates new SSTables and the old ones are discarded.
The process is similar in both Scylla and Cassandra.

Getting database for Cassandra or building one from scratch?

So, I'm new to Cassandra and I was wondering what the best approach would be to learn Cassandra.
Should I first focus on the design of a database and build one from scratch?
And as I was reading that Cassandra is great for writing. How can one observe that? Is there open source data that one can use? (I didn't really know where to look.)
A good starting point for Cassandra is the free online courses from DataStax (which offers an enterprise-grade Cassandra distribution): https://academy.datastax.com/courses
And as for Cassandra being good at writing data, have a look here: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
The write path comes down to these points:
write the data into the commitlog (append-only and sequential, no random I/O; on spinning disks it should therefore be on its own disk to prevent head movements, with SSDs this is not an issue)
write the data into memtables (kept in memory, so very fast)
So in terms of disk, a write is initially just a simple append to the commitlog. No data is written directly to the sstables: it goes to the commitlog and the memtable, and the memtable is flushed to disk as an sstable from time to time. Updates do not change an sstable on disk (sstables are immutable, so an update is written separately with a new timestamp), and a delete does not remove data from sstables (instead a tombstone is written).
All updates and deletes produce new entries in the memtable and in sstables. To remove deleted data and get rid of old versions left behind by updates, sstables on disk are compacted from time to time into a new one.
Also read about the different compaction strategies (they can help you get good performance), the replication factor (how many copies of your data the cluster should keep) and consistency levels (how Cassandra determines when a write or read is successful; hint: ALL is almost always the wrong choice, look at QUORUM).
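The write path above can be mimicked with a small in-memory model. This is a sketch under stated assumptions: the `Table` class, its method names, and the integer logical clock are hypothetical, and the compaction step drops tombstones immediately, whereas real Cassandra keeps them for a grace period (`gc_grace_seconds`) first.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class Table:
    """Toy write path: commit log append, memtable, immutable sstables."""

    def __init__(self):
        self.commitlog = []  # append-only list of (timestamp, key, value)
        self.memtable = {}   # key -> (timestamp, value)
        self.sstables = []   # each sstable is an immutable, sorted tuple
        self.clock = 0       # hypothetical logical clock for timestamps

    def _append(self, key, value):
        self.clock += 1
        self.commitlog.append((self.clock, key, value))  # durability first
        self.memtable[key] = (self.clock, value)         # then memory

    def write(self, key, value):
        """Inserts and updates are the same operation."""
        self._append(key, value)

    def delete(self, key):
        """A delete writes a tombstone; nothing is removed from sstables."""
        self._append(key, TOMBSTONE)

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        """Merge all sstables, keep the newest version, drop deleted keys."""
        latest = {}
        for sstable in self.sstables:
            for key, (ts, value) in sstable:
                if key not in latest or ts > latest[key][0]:
                    latest[key] = (ts, value)
        live = {k: v for k, v in latest.items() if v[1] is not TOMBSTONE}
        self.sstables = [tuple(sorted(live.items()))]


t = Table()
t.write("a", 1)
t.write("b", 2)
t.flush()
t.write("a", 10)  # update: a new entry, the first sstable is untouched
t.delete("b")     # delete: a tombstone, the first sstable is untouched
t.flush()
before = len(t.sstables)
t.compact()
print(before, t.sstables)
```

Until `compact()` runs, the old value of "a" and the deleted "b" are both still present in the first sstable, which is why compaction is needed to reclaim them.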

Cassandra - Number of disk seeks in a read request

I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is for reading the partition index and another is to read the actual data from the compressed partition. The index of the data in compressed partitions is obtained from the compression offset tables (which is stored in memory). Am I on the right track here? Will there ever be a case when more than 1 disk seek is required to read the data?
I'm posting the answer I received from the Cassandra user community thread here, in case someone else needs it:
You're right: one seek with a hit in the partition key cache, and two if not.
That's the theory, but there are two things to mention:
First, you need two seeks per SSTable, not per entire read. So if your data is spread over multiple SSTables on disk, you obviously need more than two reads. Think of frequently updated partition keys: in combination with memory pressure, you can easily end up with many SSTables (they will be compacted at some point in the future).
Second, there could be fragmentation on disk, which leads to seeks during sequential reads.
Note: Each SSTable has its own partition index.
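As a rough worked example of the per-SSTable arithmetic in that answer (the function name and parameters are hypothetical, and disk fragmentation is ignored):

```python
def worst_case_seeks(sstables_touched, key_cache_hits=0):
    """Estimate disk seeks for one read.

    Two seeks per SSTable (partition index, then data), or one seek
    for each SSTable where the partition key cache already has the offset.
    """
    if not 0 <= key_cache_hits <= sstables_touched:
        raise ValueError("key_cache_hits must be between 0 and sstables_touched")
    misses = sstables_touched - key_cache_hits
    return 2 * misses + key_cache_hits


print(worst_case_seeks(1))                    # 2
print(worst_case_seeks(1, key_cache_hits=1))  # 1
print(worst_case_seeks(8, key_cache_hits=3))  # 13
```

So the "two seeks" figure only holds when the partition lives in a single SSTable; a partition spread over eight SSTables with three key cache hits would need 13 seeks under this simple model.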

Cassandra: Storing and retrieving large sized values (50MB to 100 MB)

I want to store and retrieve values from Cassandra which ranges from 50MB to 100MB.
As per documentation, Cassandra works well when the column value size is less than 10MB. Refer here
My table is as below. Is there a different approach to this ?
CREATE TABLE analysis (
    prod_id text,
    analyzed_time timestamp,
    analysis text,
    PRIMARY KEY (prod_id, analyzed_time)
) WITH CLUSTERING ORDER BY (analyzed_time DESC);
In my own experience, although Cassandra can handle large blobs in theory, in practice it can be really painful. In one of my past projects we stored protobuf blobs in C* ranging from 3 KB to 100 KB, but a few of them (~0.001%) were up to 150 MB in size. This caused problems:
Write timeouts. By default C* has a short write timeout (2 s on the server side, set by write_request_timeout_in_ms in cassandra.yaml), which is really not enough for large blobs.
Read timeouts. The same issue applies to the read timeout and to the read repair, hinted handoff, and similar timeouts. You have to debug all these possible failures and raise all of these timeouts. C* has to read the whole heavy row from disk into RAM, which is slow.
I personally suggest not using C* for large blobs, as it is not very effective. There are alternatives:
Distributed filesystems like HDFS. Store a URL of the file in C* and the file contents in HDFS.
DSE (a commercial C* distro) has its own distributed FS called CFS on top of C*, which can handle large files well.
Rethink your schema so that it has much lighter rows. But this really depends on your current task (and there is not enough information about it in the original question).
Large values can be problematic, as the coordinator needs to buffer each row on heap before returning it to a client to answer a query. There's no way to stream the analysis value.
Internally, Cassandra is also not optimized to handle such a use case very well, and you'll have to tweak a lot of settings to avoid problems such as those described by shutty.
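One way to get "much lighter rows", which neither answer spells out, is to split each large value into fixed-size chunks on the client and store one chunk per row. The sketch below is an assumption for illustration only: the helper names, the 1 MiB chunk size, and the `(blob_id, chunk_index, chunk)` row shape are hypothetical, and no Cassandra driver calls are shown.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk; an arbitrary, hypothetical choice


def split_blob(blob_id, data, chunk_size=CHUNK_SIZE):
    """Turn one large value into many small (blob_id, chunk_index, chunk) rows."""
    return [
        (blob_id, index, data[offset:offset + chunk_size])
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


def join_chunks(rows):
    """Reassemble the original value from its chunk rows, in chunk_index order."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))


data = bytes(range(256)) * 10_000  # 2,560,000 bytes of sample data
rows = split_blob("prod-1", data)
print(len(rows))                   # 3
restored = join_chunks(rows)
```

With a layout like this, `blob_id` would be a natural partition key and `chunk_index` a clustering column, so no single row has to be buffered at full blob size. Whether that fits depends on the task, as the answer above notes.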

Is update in place possible in Cassandra?

I have a table in Cassandra where I populate some rows with thousands of entries (each row has 10,000+ columns). The entries in the rows are updated very frequently; basically just one field (an integer) is updated with different values. All other column values remain unmodified. My question is: will the updates be done in place? How good is Cassandra at frequent updates of entries?
First of all, every update is also a sequential write for Cassandra, so as far as Cassandra is concerned it makes no difference whether you update or write.
The real question is how soon those writes need to be available for reading. As @john suggested, all writes first go to a mutable memtable, which resides in memory. So every update is essentially added as a new entry to the memtable for that particular CQL table. It is also appended to the commit log for durability, and the commit log is synced to disk periodically (every 10 seconds by default).
When the memtable is full, or the total commit log size limit is reached, Cassandra flushes the data to an immutable Sorted String Table (SSTable). After the flush, compaction is the procedure that keeps the latest column values for each primary key and removes all the previous values (from before the update).
Frequent flushing brings the overhead of frequent sequential writes to disk and of compaction, which can take a lot of I/O and have a serious impact on Cassandra's performance.
As far as reads go, Cassandra first tries the row cache (if it is enabled) and the memtable. If the data is not there, it goes to the bloom filter, key cache, partition summary, partition index, and finally the SSTable, in that order. Once the data for all the column values has been collected, it is aggregated in memory, the column values with the latest timestamps are returned to the client, and an entry is made in the row cache for that partition key.
So yes, when you query a partition key, Cassandra looks across all the SSTables for that particular CQL table, plus the memtable for any column values that have not been flushed to disk yet.
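The "latest timestamp wins" aggregation can be sketched as follows. This is a simplified model, not Cassandra's read path: each fragment is assumed to be a dict mapping a column name to a `(timestamp, value)` pair, and `read_partition` and the sample column names are hypothetical.

```python
def read_partition(fragments):
    """Combine fragments of one partition; per column, the newest timestamp wins."""
    merged = {}
    for fragment in fragments:
        for column, (ts, value) in fragment.items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {column: value for column, (ts, value) in merged.items()}


# The same partition, scattered over two SSTables and the memtable.
sstable_1 = {"name": (1, "sensor-7"), "count": (1, 0)}
sstable_2 = {"count": (5, 41)}
memtable = {"count": (9, 42)}

row = read_partition([sstable_1, sstable_2, memtable])
print(row)  # {'name': 'sensor-7', 'count': 42}
```

The frequently updated integer appears three times on the node, once per update, and only the read (or a later compaction) collapses those versions into one.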
Initially these updates are stored in an in-memory data structure called a memtable. Memtables are flushed to immutable SSTables at regular intervals.
So a single wide row will be read from various SSTables. During a process called 'compaction', the different SSTables are merged into a bigger SSTable on disk.
Increasing the thresholds for flushing memtables is one way to optimize. If updates arrive very fast, before the memtable is flushed to disk, I think the update should happen in place in memory, though I am not sure.
Also, each read operation checks memtables first; if the data is still there, it is simply returned, which is the fastest possible access.
Cassandra read path:
When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question
No, in place updates are not possible.
As @john suggested, if you have frequent writes then you should delay the flush process. During the flush, multiple writes to the same partition that are stored in the memtable are written as a single partition in the newly created SSTable.
C* is fine for heavy writes. However, you'll need to monitor the number of SSTables accessed per read. If that number is too high, you'll need to review your compaction strategy.
