I know that incoming data to the system is first put in memory (memtable or memstore). In that buffer, data is sorted by row key and column name. When the buffer size reaches a certain limit, the data is flushed to disk. If the buffer size limit is configured to a large value (~256 MB), the number of data points can be very large (~tens of millions). What data structures and sorting algorithms are used for this purpose?
The unit of storage in HBase is the KeyValue. It consists of a pointer into a byte array where the actual values are stored, plus the length and offset. So KeyValues are tightly packed into byte arrays. To index them, KeyValueSkipListSet (older versions) or CellSkipListSet (newer versions) is used. Both classes are built on top of ConcurrentSkipListMap, the Java implementation of a skip list.
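For a concrete picture, here is a minimal sketch (not HBase's actual code) of how a skip-list map keeps cells sorted by row key and column name as they arrive, so no separate sort step is needed before a flush; the CellKey type and comparator are hypothetical simplifications:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListMap;

public class MemstoreSketch {

    // A toy cell key: row key bytes plus column name bytes (hypothetical, not HBase's KeyValue).
    record CellKey(byte[] row, byte[] column) {}

    // Lexicographic byte[] order: compare the row key first, then the column name.
    static final Comparator<CellKey> CELL_ORDER =
            Comparator.<CellKey, byte[]>comparing(CellKey::row, Arrays::compare)
                      .thenComparing(CellKey::column, Arrays::compare);

    // The skip-list map keeps entries sorted on insert; no explicit sort ever runs.
    private final ConcurrentSkipListMap<CellKey, byte[]> cells =
            new ConcurrentSkipListMap<>(CELL_ORDER);

    public void put(byte[] row, byte[] column, byte[] value) {
        cells.put(new CellKey(row, column), value);
    }

    // Iteration is already in key order, which is exactly what a flush needs.
    public Iterable<byte[]> valuesInOrder() {
        return cells.values();
    }
}
```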
The internal storage structure behind HBase Store files/HFiles is the LSM (Log-Structured Merge) tree. LSM trees are similar to B+ trees, but they allow better scalability and distributed usage because they combine an on-disk log file with an in-memory store. So, once the memstore reaches its limit, it gets flushed to disk in a sorted, B+-tree-like file. Later on, that file gets merged with other Store files to form a bigger Store file.
The benefit of this data structure over a B+ tree is that disk I/O is not required for every update/delete, which results in a significant performance improvement.
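As a rough illustration of the merge step mentioned above, the following sketch (plain longs instead of real store-file keys, nothing HBase-specific) shows how several already-sorted runs can be combined into one bigger sorted run with a k-way merge:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {

    static List<Long> mergeSortedRuns(List<List<Long>> runs) {
        // One cursor per input "file", kept in a heap ordered by the current key.
        record Cursor(long key, Iterator<Long> rest) {}
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.key(), b.key()));
        for (List<Long> run : runs) {
            Iterator<Long> it = run.iterator();
            if (it.hasNext()) heap.add(new Cursor(it.next(), it));
        }

        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();           // smallest current key across all runs
            merged.add(c.key());
            if (c.rest().hasNext()) {         // advance that run and re-insert its cursor
                heap.add(new Cursor(c.rest().next(), c.rest()));
            }
        }
        return merged;                        // one big sorted "store file"
    }
}
```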
Implementation-wise, how exactly does the memtable (in Cassandra, RocksDB, LevelDB, or any LSM tree) flush to an SSTable?
I get that a memtable is some sorted data structure, like a red-black tree, but how do we turn that into a file of sorted key/value pairs? Do we iterate through the tree from the smallest key to the largest key in a for-loop and insert the data one by one into a memory buffer (in SSTable format), and then write that to disk? Do we use some sort of tree-serialization method (and if so, how is the result still in SSTable format)? Could we just use a min-heap for the memtable and, when flushing, keep extracting the minimum element and appending it to the array we flush?
I'm trying to understand the super specific details. I was looking at this file but was having a hard time understanding it: https://github.com/facebook/rocksdb/blob/fbfcf5cbcd3b09b6de0924d3c52a744a626135c0/db/flush_job.cc
You are correct.
The memtable is iterated over from the smallest key to the largest and written out to a file.
In practice, other things are written to the file as well, such as Bloom filters, sparse seek indices, and other metadata (entry count, max key, min key), but the foundation of the file is the section containing all the keys that were previously in the memtable.
You don't need a min-heap, as the data is already sorted in the skip list.
RocksDB's default memtable is implemented using a skip list, which is a linked list with binary-search capability, similar in spirit to a B+ tree. When writing out to an SST file, it iterates over all the keys in sorted order.
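To make the flush loop concrete, here is a minimal sketch assuming a skip-list memtable of strings; it is not RocksDB's flush_job.cc, and real engines also emit index blocks, Bloom filters, and a footer, but the core "iterate in order and append" step looks roughly like this:

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

public class FlushSketch {

    static void flush(ConcurrentSkipListMap<String, String> memtable, String path)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            // entrySet() of a skip-list map iterates in ascending key order,
            // so no extra sort (and no min-heap) is needed.
            for (var e : memtable.entrySet()) {
                byte[] k = e.getKey().getBytes(StandardCharsets.UTF_8);
                byte[] v = e.getValue().getBytes(StandardCharsets.UTF_8);
                out.writeInt(k.length);   // length-prefixed key
                out.write(k);
                out.writeInt(v.length);   // length-prefixed value
                out.write(v);
            }
        }
    }
}
```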
My assumption is that Cassandra stores fixed-length data in a column family, e.g. a column family with id (bigint), age (int), description (text), picture (blob). Now description and picture have no length limit. How does Cassandra store them? Does it externalize them through an ID -> location indirection?
For example, it looks like relational databases use a pointer that points to the actual location of large text values. See how it is done.
Also, it looks like in MySQL it is recommended to use char instead of varchar for better performance, I guess simply because there is no need for an "id lookup". See: mysql char vs varchar
Cassandra stores individual cells (column values) in its on-disk files ("sstables") as a 32-bit length followed by the data bytes. So string values do not need to have a fixed size, nor are they stored as pointers to other locations - the complete string appears as-is inside the data file.
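A sketch of that encoding (not Cassandra's actual serializer) might look like the following; the signed 32-bit length prefix is exactly what caps a single cell at 2GB:

```java
import java.nio.ByteBuffer;

public class CellEncodingSketch {

    // Encode one cell value as a 32-bit length prefix followed by the raw bytes,
    // stored inline with no pointer indirection.
    static ByteBuffer encodeCell(byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(4 + value.length);
        buf.putInt(value.length);
        buf.put(value);
        buf.flip();
        return buf;
    }

    // Decode: read the length, then exactly that many bytes.
    static byte[] decodeCell(ByteBuffer buf) {
        int len = buf.getInt();
        byte[] value = new byte[len];
        buf.get(value);
        return value;
    }
}
```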
The 32-bit length limit means that each "text" or "blob" value is limited to 2GB in length, but in practice you shouldn't use anything even close to that - the Cassandra documentation suggests you shouldn't use more than 1MB. There are several problems with having very large values:
Because values are not stored as pointers to some other storage, but rather inline in the sstable files, these large strings get copied around every time sstable files get rewritten, namely during compaction. It would be more efficient to keep the huge string in a separate file on disk and just copy around a pointer to it - but Cassandra doesn't do this.
The Cassandra query language (CQL) does not have any mechanism for storing or retrieving a partial cell. So if you have a 2GB string, you have to retrieve it entirely - there is no way to "page" through it, nor a way to write it incrementally.
In Scylla, large cells will result in large latency spikes because Scylla handles a very large cell atomically and does not context-switch to do other work. In Cassandra this problem is less pronounced but will still likely cause issues (the thread stuck on the large cell will monopolize the CPU until preempted by the operating system).
As there is a size limit in Cosmos DB for a single entry of data, how can I add data larger than 2 MB as a single entry?
The 2MB limit is a hard limit, not expandable. You'll need to work out a different model for your storage. Also, depending on how your data is encoded, it's likely that the actual limit will be under 2MB (since data is often expanded when encoded).
If you have content within an array (the typical reason a document grows so large), consider refactoring this part of your data model (perhaps storing references to other documents within the array, instead of the subdocuments themselves), as sketched below. Also, with arrays, you have to deal with "unbounded growth": even with documents under 2MB, if the array can keep growing, you'll eventually run into a size-limit issue.
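As an illustration of that refactoring (all type names here are hypothetical, and this is not a Cosmos DB API), the idea is to move the unbounded array out of the document and keep only references:

```java
import java.util.List;

public class OrderModelSketch {

    // Before: unbounded embedded array -> the document grows without limit.
    record OrderEmbedded(String id, List<LineItem> items) {}

    // After: the order holds only ids; items live as their own small documents,
    // ideally sharing the order id as partition key so they can be queried together.
    record OrderWithRefs(String id, List<String> itemIds) {}

    record LineItem(String id, String orderId, String sku, int quantity) {}
}
```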
I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is for reading the partition index and another is to read the actual data from the compressed partition. The index of the data in compressed partitions is obtained from the compression offset tables (which is stored in memory). Am I on the right track here? Will there ever be a case when more than 1 disk seek is required to read the data?
I'm posting the answer here which I received from Cassandra user community thread in case someone else needs it:
You're right – one seek with a hit in the partition key cache and two if not.
That's the theory – but two things to mention:
First, you need two seeks per sstable, not per entire read. So if your data is spread over multiple sstables on disk, you obviously need more than two seeks. Think of frequently updated partition keys – in combination with memory pressure you can easily end up with many sstables (ok, they will be compacted at some time in the future).
Second, there could be fragmentation on disk which leads to seeks during sequential reads.
Note: Each SSTable has its own partition index.
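A back-of-the-envelope sketch of that arithmetic (my own simplification, not Cassandra code): per SSTable containing the partition, a key-cache hit costs one seek and a miss costs two, so the total grows with the number of SSTables the partition is spread across:

```java
public class SeekCountSketch {

    // Hit in the partition key cache: 1 seek (data only).
    // Miss: 2 seeks (partition index + data). Summed over the SSTables involved.
    static int estimateSeeks(int sstablesWithPartition, int keyCacheHits) {
        int misses = sstablesWithPartition - keyCacheHits;
        return keyCacheHits * 1 + misses * 2;
    }

    public static void main(String[] args) {
        // Example: partition spread over 4 SSTables, key cached for 1 of them.
        System.out.println(estimateSeeks(4, 1)); // prints 7
    }
}
```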
I am using Cassandra to store my parsed site logs. I have two column families with multiple secondary indices. The log data by itself is around 30 GB in size. However, the size of the Cassandra data dir is ~91 GB. Is there any way I can reduce the size of this store? Also, will having multiple secondary indices have a big impact on the datastore size?
Potentially, the secondary indices could have a big impact, but obviously it depends what you put in them! If most of your data entries appear in one or more indexes, then the indexes could form a significant proportion of your storage.
You can see how much space each column family is using with JConsole and/or 'nodetool cfstats'.
You can also look at the sizes of the disk data files to get some idea of usage.
It's also possible that data isn't being flushed to disk often enough - this can result in lots of commitlog files being left on disk for a long time, occupying extra space. This can happen if some of your column families are only lightly loaded. See http://wiki.apache.org/cassandra/MemtableThresholds for parameters to tune this.
If you have very large numbers of small columns, then the column names may use a significant proportion of the storage, so it may be worth shortening them where this makes sense (not if they are timestamps or other meaningful data!).