Efficient numeric storage in Cassandra - cassandra

I'm storing many small numbers in a Cassandra table with 7.5 billion rows. Many of the numbers can be represented as a tinyint (1 byte), but Cassandra doesn't seem to support any numeric data types which are smaller than 4 bytes. https://docs.datastax.com/en/cql/3.0/cql/cql_reference/cql_data_types_c.html
My table is about 4 TB and I'm looking to cut down the size. Is varint my answer ("Arbitrary-precision integer")? How is varint represented in memory and what is its smallest size?
Or alternatively, is there a preferred compression configuration that can help this specific case?

You are looking at an old version of the documentation. Since Cassandra 2.2, smallint and tinyint are supported.
If you are worried about your disk usage, I would also recommend using Cassandra 3.x, whose newer storage engine is more compact on disk.
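For example, a minimal CQL sketch (table and column names are hypothetical) of a table that uses the smaller integer types available since 2.2:

    CREATE TABLE readings (
        sensor_id bigint,
        ts        timestamp,
        level     tinyint,     -- 1 byte, range -128 to 127
        code      smallint,    -- 2 bytes, range -32768 to 32767
        PRIMARY KEY (sensor_id, ts)
    );

    INSERT INTO readings (sensor_id, ts, level, code)
    VALUES (42, '2016-01-01 00:00:00+0000', 7, 300);

As far as I know, an existing int column cannot be altered to tinyint in place, so shrinking an existing table usually means adding a new column (or a new table) and migrating the data.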

Related

How does Cassandra store variable data types like text

My assumption is that Cassandra stores fixed-length data in a column family, e.g. a column family with id (bigint), age (int), description (text), picture (blob). But description and picture have no fixed length limit. How does Cassandra store them? Does it externalize them through an id -> location indirection?
For example, it looks like relational databases use a pointer to point to the actual location of large texts. See how it is done.
Also, it looks like MySQL recommends using char instead of varchar for better performance, I guess simply because there is no need for an "id lookup". See: mysql char vs varchar
Cassandra stores individual cells (column values) in its on-disk files ("sstables") as a 32-bit length followed by the data bytes. So string values do not need to have a fixed size, nor are they stored as pointers to other locations; the complete string appears as-is inside the data file.
The 32-bit length limit means that each "text" or "blob" value is limited to 2GB in length, but in practice, you shouldn't use anything even close to that - with Cassandra documentation suggesting you shouldn't use more than 1MB. There are several problems with having very large values:
Because values are not stored as pointers to some other storage, but rather stored inline in the sstable files, these large strings get copied around every time sstable files get rewritten, namely during compaction. It would be more efficient to keep the huge string on disk in a separate file and just copy around pointers to it, but Cassandra doesn't do this.
The Cassandra query language (CQL) does not have any mechanism for storing or retrieving a partial cell. So if you have a 2GB string, you have to retrieve it entirely: there is no way to "page" through it, nor a way to write it incrementally (a common workaround is to split large values into chunks at the application level; see the sketch after this list).
In Scylla, large cells will result in large latency spikes because Scylla will handle the very large cell atomically and not context-switch to do other work. In Cassandra this problem will be less pronounced but will still likely cause problems (the thread stuck on the large cell will monopolize the CPU until preempted by the operating system).
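The chunking workaround mentioned above is an application-level pattern, not a built-in Cassandra feature: a large value is split into pieces stored as separate clustering rows, so the pieces can be written and read individually. A minimal CQL sketch with hypothetical names:

    -- the application splits each large value into chunks (e.g. up to 1 MB each)
    CREATE TABLE blob_chunks (
        blob_id  uuid,
        chunk_no int,      -- 0, 1, 2, ... in order
        data     blob,
        PRIMARY KEY (blob_id, chunk_no)
    );

    -- each chunk is written as its own row ...
    INSERT INTO blob_chunks (blob_id, chunk_no, data)
    VALUES (0d2cfbf6-9fd5-4a43-b0f2-4a2c1f0a1111, 0, 0xcafebabe);

    -- ... and the chunks can be paged through on read
    SELECT chunk_no, data FROM blob_chunks
    WHERE blob_id = 0d2cfbf6-9fd5-4a43-b0f2-4a2c1f0a1111;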

Hive/Impala performance with string partition key vs Integer partition key

Are numeric columns recommended for partition keys? Will there be any performance difference when we do a select query on numeric column partitions vs string column partitions?
Well, it makes a difference if you look up the official Impala documentation.
Instead of elaborating, I will paste the section from the doc, as I think it states it quite well:
"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."
Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html
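To illustrate the quoted advice, a hypothetical Impala table that declares its partition key columns as numeric types rather than strings:

    CREATE TABLE events (
        event_id BIGINT,
        payload  STRING
    )
    PARTITIONED BY (year INT, month TINYINT, day TINYINT)
    STORED AS PARQUET;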
No, there is no such recommendation. Consider this:
The thing is that a partition in Hive is represented as a folder with a name like 'key=value' (or sometimes just 'value'), so either way it is a string folder name. The partition key is therefore stored as a string and cast during read/write. The partition key value is not packed inside the data files and is not compressed.
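For example (hypothetical table and default warehouse paths), the partition directories are named as strings no matter what type the partition columns are declared as:

    CREATE TABLE logs (msg STRING)
    PARTITIONED BY (year INT, month INT);

    -- resulting HDFS layout still uses string folder names:
    --   /user/hive/warehouse/logs/year=2021/month=3/
    --   /user/hive/warehouse/logs/year=2021/month=4/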
Due to the distributed/parallel nature of MapReduce and Impala, you will never notice a difference in query processing performance. Also, all data is serialized to be passed between processing stages, then deserialized and cast to some type again; this can happen many times for the same query.
There is a lot of overhead created by distributed processing and by serializing/deserializing data. Practically, only the size of the data matters: the smaller the table (its file size), the faster it works. But you will not improve performance just by restricting types.
Big string values used as partition keys can affect metadata DB performance, and the number of partitions being processed can also affect performance. Again the same: only the size of the data matters here, not the types.
1 and 0 can be better than 'Yes' and 'No' simply because of size, and compression and parallelism can make even this difference negligible in many cases.

How much disk space is allocated for Cassandra column having TEXT / VARCHAR data type?

It seems there is no option to specify the maximum number of characters for a TEXT or VARCHAR column in Cassandra v3, so how much disk space is allocated for that type of column? I need that info to evaluate my disk space usage, as I have many TEXT columns.
I can't find any relevant information on the net. Please give some useful links in your answers if you have any.
Thanks in advance.
I am using the DataStax C client to do insertions into the Cassandra cluster. When inserting a text column (the string data type), you pass the value in a variable of type "const char *". Only the characters actually used are stored; the terminating NULL character is not counted. So the text will occupy space based on how many characters you insert (that many bytes). Moreover, when we retrieved the data back through read queries, we checked the size and it equals the number of characters inserted, with no NULL appended at the end.
If Cassandra uses some mechanism (like the LZ4 compression technique) to compress the data, it will be reduced further. But there is no possibility that it will take more than the size in characters of your insertions. Correct me if I am wrong.
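A small CQL sketch of what this describes (hypothetical table and values); the actual on-disk usage of a table can be checked with nodetool tablestats:

    CREATE TABLE notes (
        id   int PRIMARY KEY,
        body text
    );

    -- each value takes space roughly proportional to its UTF-8 byte length
    INSERT INTO notes (id, body) VALUES (1, 'hi');            -- ~2 bytes of payload
    INSERT INTO notes (id, body) VALUES (2, 'a longer note'); -- ~13 bytes of payload
    -- plus some per-cell overhead (length, timestamp, etc.), minus whatever
    -- sstable compression (LZ4 by default) saves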
Datastax has a really good video on how you can estimate your data size. I recommend watching it.

Why MemSQL is slower than SQL Server for Select SQL with Substring Operations on Binary Columns

I have a table with two binary columns used to store strings that are 64 bytes long maximum and two integer columns. This table has 10+ million rows and uses 2GB of memory out of 7GB of available memory, so there is plenty of memory left. I also configured MemSQL based on http://docs.memsql.com/latest/setup/best_practices/.
For simple select SQL where binary columns are compared to certain values, MemSQL is about 3 times faster than SQL Server, so we could rule out issues such as configuration or hardware with MemSQL.
For complex SQLs that use
substring operations in the Select clause and
substring and length operations in the where clause
MemSQL is about 10 times slower than SQL Server. The performance of these SQLs on MemSQL was measured after the first few runs to make sure that the SQL compilation durations were not included. It looks like MemSQL's performance issue has to do with how it handles binary columns and substring and string length operations.
Has anyone seen similar performance issues with MemSQL? If so, what were the column types and SQL operations?
Has anyone seen similar performance issues with MemSQL for substring and length operations on varchar columns?
Thanks.
Michael,
My recommendation: go back to rowstore, but with VARBINARY instead of BINARY, consider putting indexes on the columns or creating persisted columns, and try rewriting your predicate with like.
If you paste an example query, I can help you transform it.
The relevant docs are here
dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
docs.memsql.com/4.0/concepts/computed_columns
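A rough sketch of that suggestion (table, column names, and sizes are hypothetical; persisted computed column syntax as described in the linked MemSQL 4.0 docs):

    CREATE TABLE messages (
        id  INT PRIMARY KEY,
        tag VARBINARY(64),                                      -- instead of BINARY(64)
        tag_prefix AS SUBSTRING(tag, 1, 8) PERSISTED VARBINARY(8),
        KEY (tag_prefix)                                        -- index on the persisted column
    );

    -- a leading-substring predicate can often be rewritten as a prefix LIKE,
    -- which may allow an index to be used instead of evaluating SUBSTRING per row
    SELECT id FROM messages WHERE tag LIKE 'abc%';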
Good luck.
It's hard to give general answers to perf questions, but in your case I would try a MemSQL columnstore table as opposed to an in-memory rowstore table. Since you are doing full scans anyway, you'll get the benefit of having the column data stacked up right next to each other.
http://docs.memsql.com/4.0/concepts/columnar/
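A minimal sketch of the columnstore variant, assuming MemSQL 4.x syntax and hypothetical names:

    CREATE TABLE messages_cs (
        id  INT,
        tag VARBINARY(64),
        KEY (id) USING CLUSTERED COLUMNSTORE   -- columnstore instead of the in-memory rowstore
    );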

Is there any row key partition limit in KairosDB

KairosDB is built on top of Cassandra, but Cassandra has a row key partition limit. Does that partition limit apply to KairosDB as well?
Yes, it has the same limitation -- in the Getting Started section you can read the following:
Using with Cassandra [...]
The default configuration for Cassandra is to use wide rows. Each row is set to contain 3 weeks of data. The reason behind setting it to 3 weeks is if you wrote a metric every millisecond for 3 weeks it would be just over 1 billion columns. Cassandra has a 2 billion column limit.
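(Three weeks at one data point per millisecond works out to 21 × 24 × 3,600 × 1,000 = 1,814,400,000 columns, still under Cassandra's 2 billion column limit.)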
HTH,
Carlo
We already discussed this via the KairosDB discussion group, but you raise an interesting question here.
I am sure that if you go over 2^64 you will have much bigger problems than Cassandra's indexing capability. Imagine you're only using 1 byte per series: that means 1.84e19 bytes... Only Google or Facebook currently know how to store 18 exabytes, which is a cosmic size.
