I'm trying to understand the Cassandra read path and can't work out why we need a compression offset map.
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
The partition index resides on disk and stores an index of all partition keys mapped to their offset.
The compression offset map stores pointers to the exact location on disk that the desired partition data will be found.
Why do we need both of them? Why can't the partition index store pointers to the exact location on disk?
Sorry for the awkward title, but that's what Stack Overflow required; it wouldn't let me use "Why do we need a compression offset map if we have a partition index?"
The data file is compressed in chunks: by default the first 64 KB of uncompressed data is compressed, then the next 64 KB, and so on. The offsets written in the index file are offsets into the uncompressed data, because as Cassandra writes it knows how many uncompressed bytes have been written so far and uses that count to mark where each new partition starts. The compression offset map records where each compressed chunk begins and which uncompressed position it corresponds to, so a read knows which chunk to start decompressing in order to reach the partition at a given uncompressed offset from the index.
If a partition sits in the middle of a 64 KB compressed chunk, you need to decompress that entire chunk; you cannot start reading in the middle of it because of how the compression algorithms work. This is why in some situations it makes sense to decrease the chunk size: it reduces the overhead of reading a tiny partition.
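Here is a minimal sketch of that two-step lookup (hypothetical names and structures, not Cassandra's actual classes), assuming a 64 KB uncompressed chunk size: the partition index yields an uncompressed offset, and the offset map tells you which compressed chunk to seek to and decompress.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: simplified stand-ins for the partition index and
// compression offset map on the read path.
public class OffsetMapSketch {
    static final int CHUNK_SIZE = 64 * 1024; // uncompressed bytes per compressed chunk

    public static void main(String[] args) {
        // Partition index: partition key -> offset into the *uncompressed* data.
        NavigableMap<String, Long> partitionIndex = new TreeMap<>();
        partitionIndex.put("key42", 200_000L);

        // Compression offset map: chunk number -> offset of that chunk in the *compressed* file.
        long[] compressedChunkOffsets = {0L, 31_000L, 66_500L, 99_800L};

        long uncompressedOffset = partitionIndex.get("key42");
        int chunkIndex = (int) (uncompressedOffset / CHUNK_SIZE);   // which 64 KB chunk holds the partition
        long seekTo = compressedChunkOffsets[chunkIndex];           // where that chunk starts on disk
        long skipAfterDecompress = uncompressedOffset % CHUNK_SIZE; // bytes to discard inside the decompressed chunk

        System.out.printf("seek to compressed offset %d, decompress chunk %d, skip %d bytes%n",
                seekTo, chunkIndex, skipAfterDecompress);
    }
}
```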
Related
For example, I have a file A.bin with a compression ratio of 50%; how can I replace it with a similar A.bin (exact same size) from outside and keep the ratio at exactly 50%?
The compression ratio is defined as the size of the compressed file divided by the size of the original file. Unfortunately, it is not always possible to achieve a given compression ratio for a certain file.
A good, tangible example of this is text versus images: compressing repetitive text such as log files is an easy problem and will typically yield a great compression ratio even with fairly relaxed compression levels, whereas an image of static noise is very random data and therefore can be very hard to compress.
All this is to say that the compression ratio you achieve depends heavily on what the data is and on the algorithm used to compress it; it is not necessarily an easily tunable factor. If changing the file adds more 'randomness' to the data, it will not be possible to get a better compression ratio.
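As a rough, self-contained illustration (using java.util.zip rather than any particular archive tool), compressing the same number of bytes of repetitive text and of random noise gives very different compressed sizes:

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

// Compresses repetitive text and random bytes of the same length
// to show that the achievable ratio depends entirely on the data.
public class RatioDemo {
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        byte[] text = "2024-01-01 INFO request ok\n".repeat(4000).getBytes();
        byte[] noise = new byte[text.length];
        new Random(42).nextBytes(noise);

        System.out.printf("text:  %d -> %d bytes%n", text.length, compressedSize(text));
        System.out.printf("noise: %d -> %d bytes%n", noise.length, compressedSize(noise));
    }
}
```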
Now, if the compressed sizes of your files are the same, it should be possible to replace the file in the archive, but as you can see, getting two compressed files to have the same size is non-trivial.
If you can reliably guarantee that your transformation to the file decreases entropy then you might be able to guarantee that the compressed file will be smaller and add padding null bytes to keep the size of the compressed data the same (artificially increase the compression ratio).
I have a large gzipped file. I want to read a few bytes from a specific offset of the uncompressed data.
For example, I have a file whose original size is 10 GB; gzipped, it is 1 GB. I want to read a few bytes at the 5 GB uncompressed offset using that 1 GB gzipped file.
You will need to decompress everything up to the 5 GB uncompressed offset in order to get just those bytes.
If you are frequently accessing just a few bytes from the same large gzip file, then it can be indexed for more rapid random access. You would read the entire file once to build the index. See zran.h and zran.c.
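For contrast, here is a small sketch of what reading at an uncompressed offset looks like without such an index, using java.util.zip.GZIPInputStream on a hypothetical file big.gz: skip() still decompresses and discards everything before the offset, which is exactly the cost the zran-style index avoids.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Reads a few bytes at a given uncompressed offset of a .gz file.
// Everything before the offset is still decompressed and thrown away.
public class GzipSeek {
    public static void main(String[] args) throws IOException {
        long offset = 5L * 1024 * 1024 * 1024; // 5 GB into the uncompressed stream
        byte[] result = new byte[16];

        try (InputStream in = new GZIPInputStream(new FileInputStream("big.gz"))) {
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining); // decompresses under the hood
                if (skipped <= 0) throw new IOException("offset past end of data");
                remaining -= skipped;
            }
            int read = in.read(result);
            System.out.println("read " + read + " bytes at uncompressed offset " + offset);
        }
    }
}
```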
I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is for reading the partition index and another is for reading the actual data from the compressed partition. The location of the data within the compressed partition is obtained from the compression offset map (which is stored in memory). Am I on the right track here? Will there ever be a case where more disk seeks than that are required to read the data?
I'm posting the answer here which I received from Cassandra user community thread in case someone else needs it:
You're right – one seek with a hit in the partition key cache and two if not.
That's the theory – but two things to mention:
First, you need two seeks per SSTable, not per entire read. So if your data is spread over multiple SSTables on disk, you obviously need more than two seeks. Think of frequently updated partition keys – in combination with memory pressure you can easily end up with many SSTables (ok, they will be compacted at some point in the future).
Second, there could be fragmentation on disk which leads to seeks during sequential reads.
Note: Each SSTable has its own partition index.
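A back-of-the-envelope sketch of that arithmetic (the numbers are made up for illustration):

```java
// Worst-case seek estimate for one read, per the answer above:
// one seek per SSTable on a key-cache hit, two on a miss,
// ignoring any extra seeks caused by on-disk fragmentation.
public class SeekEstimate {
    public static void main(String[] args) {
        int sstablesTouched = 4;        // partition spread across 4 SSTables
        boolean keyCacheHit = false;    // partition key not in the key cache

        int seeksPerSSTable = keyCacheHit ? 1 : 2; // index seek + data seek
        System.out.println("estimated seeks: " + sstablesTouched * seeksPerSSTable); // 8
    }
}
```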
I know that incoming data is first put in memory (a memtable or memstore). In that buffer, data is sorted by row key and column name. When the buffer reaches a certain size limit, the data is flushed to disk. If the buffer size limit is configured to a large value (~256 MB), the number of data points can be very large (tens of millions). What data structures and sorting algorithms are used for this purpose?
The unit of storage in HBase is the KeyValue. It consists of a pointer into a byte array where the actual values are stored, plus a length and offset, so KeyValues are tightly packed into byte arrays. To index them, KeyValueSkipListSet (older versions) or CellSkipListSet (newer versions) is used. Both classes are built on top of ConcurrentSkipListMap, Java's implementation of a skip list.
The internal storage data structure for HBase store files (HFiles) is the LSM (Log-Structured Merge) tree. LSM trees are similar to B+ trees, but they allow better scalability and distributed usage because they combine an on-disk file component with an in-memory store. So, once the memstore reaches its limit, it gets flushed to disk as a sorted file (organized much like a B+ tree), and later on it gets merged with other store files to form a bigger store file.
The benefit of this data structure over a B+ tree is that disk I/O is not required for every update/delete, which results in a significant performance improvement.
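A toy sketch of both ideas together (hypothetical names, not HBase's real KeyValue/CellSkipListSet classes): a ConcurrentSkipListMap keeps cells sorted by row key and column as they arrive, and a size threshold triggers a flush of the already-sorted contents.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy memstore: a skip list keeps cells sorted by (row key, column),
// and once the tracked size crosses a limit the sorted contents are
// "flushed" (here just printed in order).
public class ToyMemstore {
    private final ConcurrentSkipListMap<String, byte[]> cells = new ConcurrentSkipListMap<>();
    private final AtomicLong size = new AtomicLong();
    private final long flushLimit;

    ToyMemstore(long flushLimitBytes) { this.flushLimit = flushLimitBytes; }

    void put(String rowKey, String column, byte[] value) {
        String cellKey = rowKey + "/" + column; // sort by row key, then column
        cells.put(cellKey, value);
        if (size.addAndGet(cellKey.length() + value.length) >= flushLimit) {
            flush();
        }
    }

    private void flush() {
        // Skip list iteration is already in sorted order, so the flush
        // produces a sorted output without an extra sort pass.
        for (Map.Entry<String, byte[]> e : cells.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
        cells.clear();
        size.set(0);
    }

    public static void main(String[] args) {
        ToyMemstore m = new ToyMemstore(64);
        m.put("row2", "colA", new byte[10]);
        m.put("row1", "colB", new byte[10]);
        m.put("row1", "colA", new byte[40]); // crosses the limit, triggers a flush
    }
}
```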
I want to know exactly how many bytes are stored on disk when I insert a new column into a Cassandra column family.
My main problem is that I need to know this when columns are compressed with Snappy. I know how to calculate the raw bytes, but due to the variability of the data I cannot properly approximate the compression ratio.
Any information about where to find this number of bytes in the Cassandra codebase is welcome.
Thanks in advance.
Compression can never give guaranteed compression ratios. The best you can get is an average ratio for sample data.
So get a load of sample data, insert it into a test instance, and measure the disk usage.
You might have data that compresses very poorly with Snappy and actually results in more on-disk usage than storing raw bytes.
When it comes to compression of your data there is one and only one rule: MEASURE
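If a rough estimate is wanted before setting up a test instance, one option is to compress representative sample values directly and compare sizes. The sketch below assumes the snappy-java library (org.xerial.snappy) is on the classpath; note that the real SSTable compresses whole chunks (64 KB by default), not individual column values, so this only approximates the on-disk size.

```java
import java.nio.charset.StandardCharsets;
import org.xerial.snappy.Snappy; // assumes the snappy-java library is available

// Compresses a few representative sample values and prints raw vs. compressed
// sizes, as a quick feel for how well the data compresses with Snappy.
public class SnappySample {
    public static void main(String[] args) throws Exception {
        String[] samples = {
            "2024-01-01T00:00:00Z,sensor-17,21.5",
            "2024-01-01T00:00:01Z,sensor-17,21.6",
            "k9f2Zq77randomlookingpayloadx71Lm"
        };
        for (String s : samples) {
            byte[] raw = s.getBytes(StandardCharsets.UTF_8);
            byte[] compressed = Snappy.compress(raw);
            System.out.printf("%3d raw -> %3d compressed%n", raw.length, compressed.length);
        }
    }
}
```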