Cassandra compression codebase

I want to know exactly how many bytes are stored on disk when I insert a new column into a Cassandra column family.
My main problem is that I need this information when columns are compressed with Snappy. I know how to calculate the raw byte count, but due to the variability of the data I cannot reliably approximate the compression ratio.
Any pointers to where this byte count can be found in the Cassandra codebase would be welcome.
Thanks in advance.

Compression can never give guaranteed compression ratios. The best you can get is an average ratio for sample data.
So get a load of sample data, insert it into a test instance, and measure the disk usage.
You might have data that compresses very poorly with Snappy and actually results in more on-disk usage than storing raw bytes.
When it comes to compressing your data there is one and only one rule: MEASURE.
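To get a rough feel for how your own column values compress before measuring a real test instance, here is a minimal sketch using a Java Snappy binding (org.xerial snappy-java); the sample values are purely illustrative, and real on-disk sizes will differ because Cassandra compresses SSTable chunks rather than individual columns:

import org.xerial.snappy.Snappy;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SnappyRatioCheck {
    public static void main(String[] args) throws IOException {
        // Illustrative sample values; replace with data representative of your real columns.
        String[] samples = {
            "{\"sensor\":\"s1\",\"reading\":23.4,\"unit\":\"C\"}",
            "{\"sensor\":\"s2\",\"reading\":23.5,\"unit\":\"C\"}",
            "{\"sensor\":\"s3\",\"reading\":97.1,\"unit\":\"F\"}"
        };

        long raw = 0;
        long compressed = 0;
        for (String sample : samples) {
            byte[] bytes = sample.getBytes(StandardCharsets.UTF_8);
            raw += bytes.length;
            compressed += Snappy.compress(bytes).length;
        }

        System.out.printf("raw=%d bytes, compressed=%d bytes, ratio=%.2f%n",
                raw, compressed, (double) compressed / raw);
    }
}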

Related

Sorted parquet files for query optimization

Question Purpose
Sorting parquet files provides a number of benefits:
more efficient filtering using file metadata
a better compression ratio
There may be other benefits as well, and there is plenty of discussion about them on the Internet, so this question is not about why to sort. Rather, it is about how to sort, which online resources explain only briefly and whose practical challenges are hardly mentioned at all. The purpose of this question is to get help from people who are expert and experienced in this field and to determine the best method (in terms of cost and benefit) for sorting.
Brief explanation about Apache parquet library
Before discussing Spark, I will explain the tool used to produce parquet files. The parquet-mr library (I use Java as an example, but this probably extends to other languages) writes to disk and to memory at the same time while a parquet file is being created. The library also has a method called getDataSize() that returns the exact final size of the file once it is completely closed on disk, so we can use it to satisfy the following two conditions when writing parquet files:
Do not produce parquet files that are too small (which is bad for query engines)
Produce every parquet file with a guaranteed minimum or fixed size (for example, 1 GB per file)
Since the library writes to disk and memory at the same time, it cannot write data in sorted order unless all the data is sorted in memory first and then handed to the library, which is not feasible for large volumes of data. We also implicitly assume that the data is generated as a stream that we intend to store. (With a fixed, static dataset the problem in this question would be meaningless: the whole dataset could be sorted once and the problem would be solved. We assume instead that data keeps flowing in, which is why an efficient way to sort it matters.)
One advantage of the Apache parquet library mentioned above is that we can fix the exact size of the output parquet file, which I consider an advantage. For example, if I know that the Hadoop block size is 128 MB and the parquet row-group size is 128 MB, I can fix the parquet file size at 1 GB. Then I know that every parquet file will occupy exactly 8 blocks, HDFS storage will be used as well as possible, and all parquet files will be the same size. (In HDFS, with a 128 MB block size, a smaller file would still take up the same amount of space.) This may not be an advantage for everyone, and I would be happy for experienced people to critique it if needed.
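As an illustration of the getDataSize() idea, here is a minimal sketch (my own, assuming the Avro binding of parquet-mr; the target size, file naming and record iterator are placeholders) that rolls over to a new file once the writer approaches a target size:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.IOException;
import java.util.Iterator;

public class RollingParquetWriter {
    // Target size per file; an assumption matching the 1 GB example above.
    private static final long TARGET_BYTES = 1024L * 1024 * 1024;

    public static void writeRolling(Iterator<GenericRecord> records, Schema schema,
                                    String dir) throws IOException {
        int fileIndex = 0;
        ParquetWriter<GenericRecord> writer = open(schema, dir, fileIndex++);
        while (records.hasNext()) {
            writer.write(records.next());
            // getDataSize() reports the bytes written and buffered so far, so we can
            // close the current file near the target size and start a new one.
            if (writer.getDataSize() >= TARGET_BYTES) {
                writer.close();
                writer = open(schema, dir, fileIndex++);
            }
        }
        writer.close();
    }

    private static ParquetWriter<GenericRecord> open(Schema schema, String dir, int i)
            throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(new Path(dir, "part-" + i + ".parquet"))
                .withSchema(schema)
                .build();
    }
}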
Parquet File Sorting Challenges
One point before we start: we are looking for a permanent sort order, because the data will be used by the next thousands of queries. The description above already identifies some of the challenges of sorting, but I will describe all of them below:
Parquet tools do not let you write data in sorted order directly. So one option is to keep all the data in memory and, after sorting, hand it to the parquet library to be written to a parquet file. This method has two drawbacks: 1) it is not possible to keep all the data in memory, and 2) because all the data sits in memory, the size of the resulting parquet file is unknown in advance and may end up below or above 1 GB (or any other target) once written, so the advantage of a fixed parquet file size is lost.
Suppose we instead do this sorting in a separate batch process rather than in real time on the stream. If we still want to use the parquet library, we have the same problem of bringing all the data into memory for sorting, which is not possible. So suppose we use a tool like Spark for sorting. One concrete cost of this approach is that cluster resources are spent on sorting, and in practice every record is written twice (once when the parquet file is first written and once after sorting). The next point is that, even setting those two issues aside, after sorting the data the parquet compression ratio for the sorted column, and for the data as a whole, may change up or down depending on the other columns in the file. As a result, after the parquet file is written, small files may appear or the fixed size (for example, 1 GB) may no longer hold. Unfortunately, Spark does not provide a way to control the output file size (it may not even be possible in practice), so if we want to restore a fixed file size we may need techniques such as the one in the following link, which are not free (the file is written several times, on top of the cluster resources consumed, and the exact file size still will not be fixed): How do you control the size of the output file
Maybe there is no other way and the only approaches are the ones mentioned above. If so, I would be glad to have experts say so explicitly, so that others know there is currently no alternative.
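To make the Spark-based option above concrete, here is a minimal sketch in the Spark Java API of the usual batch approach (the column name event_time, the paths, and the maxRecordsPerFile value are assumptions; as noted above, maxRecordsPerFile bounds the row count per file, not its exact byte size):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SortAndWriteParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sorted-parquet")
                .getOrCreate();

        // Hypothetical unsorted output produced earlier by the streaming writer.
        Dataset<Row> df = spark.read().parquet("/data/unsorted");

        df
          // Range-partition so each output file covers a contiguous range of the sort key...
          .repartitionByRange(col("event_time"))
          // ...then sort rows within every partition, giving a global order across files.
          .sortWithinPartitions("event_time")
          .write()
          // Caps rows per output file; only an indirect, approximate control over file size.
          .option("maxRecordsPerFile", 5_000_000)
          .parquet("/data/sorted");

        spark.stop();
    }
}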
Challenges In Summary
In summary, these solutions run into two kinds of problems:
How to sort at a reasonable cost and in reasonable time (on streaming data)
How to keep the size of the parquet files fixed
So although sorting is recommended everywhere (and surveys, both on the Internet and my own, show that it really is useful), its methods and challenges are barely discussed. I ask experienced and expert people in this field to help me with this (hoping it will help others as well), and if any approaches or points are missing from this explanation, please point them out.
Apologies for any typos; English is not my first language. Thanks.

Spark output JSON vs Parquet file size discrepancy

New Spark user here. I wasn't able to find any information comparing the file sizes of the JSON and parquet output of the same DataFrame in Spark.
I'm testing with a very small data set for now. Doing df.toJSON().collect() and then writing the result to disk creates a 15 kB file, but df.write.parquet creates 105 files of around 1.1 kB each. Why is the total file size so much larger with parquet in this case than with JSON?
Thanks in advance.
What you're doing with df.toJSON().collect() is collecting all your data into a single JSON string on the driver (15 kB in your case) and saving that to disk; this does not scale to the situations where you'd actually want to use Spark.
For parquet you are using Spark's built-in writer, and it seems that for some reason you have 105 partitions (probably the result of the transformations you applied), so you get 105 files. Each of these files carries the overhead of the parquet file structure and probably stores 0, 1 or 2 records. If you want a single file, call coalesce(1) before you save (again, only for this toy example) and you'll get one file. Note that it may still be larger than the JSON due to the file format overhead (i.e. the overhead may outweigh the compression benefit).
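A minimal sketch of the coalesce(1) pattern in the Spark Java API, assuming a small hypothetical input at /tmp/input.json:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CoalesceExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("coalesce-example")
                .getOrCreate();

        // Hypothetical small input; any toy DataFrame works here.
        Dataset<Row> df = spark.read().json("/tmp/input.json");

        // Collapse to a single partition so exactly one parquet file is written.
        // Only sensible for small data: coalesce(1) funnels everything through one task.
        df.coalesce(1)
          .write()
          .parquet("/tmp/output_parquet");

        spark.stop();
    }
}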
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even say how many rows your DataFrame has), but let me speculate.
First: text files containing JSON usually take more space on disk than parquet, at least once you store millions to billions of rows. The reason is that parquet is a highly optimized, column-based storage format that uses binary encoding to store your data.
Second: I would guess that you have a very small DataFrame with 105 partitions (and probably 105 rows). When you store something that small, the disk footprint should not bother you; if it does, be aware that each parquet file carries a relatively sizeable block of metadata describing the data it stores.

Cassandra - Number of disk seeks in a read request

I'm trying to understand the maximum number of disk seeks required in a read operation in Cassandra. I looked at several online articles including this one: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
As per my understanding, two disk seeks are required in the worst case. One is for reading the partition index and another is to read the actual data from the compressed partition. The index of the data in compressed partitions is obtained from the compression offset tables (which is stored in memory). Am I on the right track here? Will there ever be a case when more than 1 disk seek is required to read the data?
I'm posting the answer I received from a Cassandra user community thread here, in case someone else needs it:
You're right – one seek if there is a hit in the partition key cache, and two if not.
That's the theory – but two things to mention:
First, you need up to two seeks per SSTable, not per entire read. So if your data is spread over multiple SSTables on disk, you obviously need more than two seeks. Think of frequently updated partition keys – in combination with memory pressure you can easily end up with many SSTables (though they will be compacted at some point in the future).
Second, there could be fragmentation on disk, which leads to additional seeks during sequential reads.
Note: each SSTable has its own partition index.

Cassandra: Storing and retrieving large sized values (50MB to 100 MB)

I want to store and retrieve values in Cassandra ranging from 50 MB to 100 MB.
As per the documentation, Cassandra works well when the column value size is less than 10 MB. Refer here
My table is shown below. Is there a different approach I should take?
CREATE TABLE analysis (
    prod_id text,
    analyzed_time timestamp,
    analysis text,
    PRIMARY KEY (prod_id, analyzed_time)
) WITH CLUSTERING ORDER BY (analyzed_time DESC);
In my own experience, although in theory Cassandra can handle large blobs, in practice it can be really painful. On one of my past projects we stored protobuf blobs in C* ranging from 3 kB to 100 kB, but a small fraction of them (~0.001%) were up to 150 MB in size. This caused problems:
Write timeouts. By default C* has a 10s write timeout, which is really not enough for large blobs.
Read timeouts. The same issue applies to the read timeout, read repair, hinted handoff timeouts and so on. You have to debug all of these possible failures and raise all of these timeouts. C* has to read the whole heavy row from disk into RAM, which is slow.
I personally suggest not using C* for large blobs, as it's not very effective. There are alternatives:
Distributed filesystems like HDFS: store the URL of the file in C* and the file contents in HDFS.
DSE (the commercial C* distribution) has its own distributed FS called CFS on top of C*, which can handle large files well.
Rethink your schema so that the rows are much lighter, for example by splitting a large value into chunks (see the sketch below). But it really depends on your task (and there is not enough information in the original question about it).
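As an illustration of the lighter-rows option, here is a minimal sketch (my own, not from the original answer) that splits a large value into fixed-size chunks with the DataStax Java driver; the table analysis_chunks, its layout, and the 1 MB chunk size are assumptions for the example:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

import java.nio.ByteBuffer;
import java.time.Instant;
import java.util.Arrays;

public class ChunkedBlobWriter {
    // Hypothetical chunk size; keeps each column value well below the ~10 MB guideline.
    private static final int CHUNK_SIZE = 1_000_000;

    // Assumed table:
    // CREATE TABLE analysis_chunks (
    //     prod_id text, analyzed_time timestamp, chunk_id int, data blob,
    //     PRIMARY KEY ((prod_id, analyzed_time), chunk_id));
    public static void writeChunks(CqlSession session, String prodId,
                                   Instant analyzedTime, byte[] blob) {
        PreparedStatement ps = session.prepare(
                "INSERT INTO analysis_chunks (prod_id, analyzed_time, chunk_id, data) "
                + "VALUES (?, ?, ?, ?)");

        int chunkId = 0;
        for (int offset = 0; offset < blob.length; offset += CHUNK_SIZE, chunkId++) {
            int end = Math.min(offset + CHUNK_SIZE, blob.length);
            ByteBuffer chunk = ByteBuffer.wrap(Arrays.copyOfRange(blob, offset, end));
            session.execute(ps.bind(prodId, analyzedTime, chunkId, chunk));
        }
    }
}

A reader would then fetch all chunks for a partition in clustering order and reassemble the original value.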
Large values can be problematic, as the coordinator needs to buffer each row on the heap before returning it to the client to answer the query. There is no way to stream the analysis value.
Internally, Cassandra is also not optimized to handle such a use case very well, and you'll have to tweak a lot of settings to avoid problems like those described by shutty.

Does VoltDB compress data on disk?

I am curious whether VoltDB compresses the data on disk/at rest.
If it does, what algorithm is used, and are there options for third-party compression methods (e.g. a lossy, proprietary video-stream compression algorithm)?
VoltDB uses Snappy compression when writing snapshots to disk. Snappy is an algorithm optimized for speed, but it still has pretty good compression. There aren't any options for configuring or customizing a different compression method.
Data stored in VoltDB (e.g. when you insert records) is kept 100% in RAM and is not compressed. There is a sizing worksheet built into the web interface that can help estimate the RAM required based on the specific datatypes of the tables and whatever indexes you may define.
One of the supported datatypes is VARBINARY, which stores byte arrays, i.e. any binary data. You could store pre-compressed data in VARBINARY columns, or use a third-party Java compression library within stored procedures to compress and decompress inputs. There is a maximum size limit of 1 MB per column and 2 MB per record; however, a procedure could store larger binary data by splitting it across multiple records. There is also a maximum size of 50 MB for the inputs to or the results from a stored procedure, so you could potentially store and retrieve larger binary data using multiple transactions.
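A minimal sketch of the compress-inside-a-procedure idea (my own, not an official VoltDB example), using the JDK's java.util.zip rather than a third-party library; the table media, its columns, and the procedure name are assumptions:

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class StoreCompressed extends VoltProcedure {
    // Assumed table: CREATE TABLE media (id BIGINT NOT NULL, payload VARBINARY(1048576));
    public final SQLStmt insert =
            new SQLStmt("INSERT INTO media (id, payload) VALUES (?, ?);");

    public long run(long id, byte[] rawPayload) {
        // Gzip the input before it lands in the VARBINARY column, keeping it
        // under the 1 MB per-column limit mentioned above (for compressible data).
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(rawPayload);
        } catch (IOException e) {
            throw new VoltAbortException("compression failed: " + e.getMessage());
        }
        byte[] compressed = buffer.toByteArray();

        voltQueueSQL(insert, id, compressed);
        voltExecuteSQL(true);
        return compressed.length;
    }
}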
I saw you posted the same question in our forum, if you'd like to discuss more back and forth, that is the best place. We'd also like to talk to you about your solution, so if you like I can contact you at the email address from your VoltDB Forum account.
