What is the largest key-value cache/lookup table that has been built? Has a lookup table of, say, 2**128 elements ever been built? - hashmap

Let's say each key is large, say 128 bytes, and each value is 8-10 bytes.
I looked at general-purpose hashmaps in C++ and elsewhere, and the largest key-value counts I found are around 2^25 entries.
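For scale, a quick back-of-the-envelope calculation (using the ~128-byte keys and ~10-byte values from the question, and ignoring all bucket and index overhead) shows why a table of 2^128 entries is physically out of reach:

import java.math.BigInteger;

public class TableSize {
    public static void main(String[] args) {
        // Sizes taken from the question: 128-byte keys, ~10-byte values, no overhead.
        BigInteger entries = BigInteger.ONE.shiftLeft(128);        // 2^128 entries
        BigInteger bytesPerEntry = BigInteger.valueOf(128 + 10);
        BigInteger totalBytes = entries.multiply(bytesPerEntry);

        // Prints roughly 4.7 * 10^40 bytes, far beyond all storage ever manufactured.
        System.out.println(totalBytes);
    }
}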

Related

How to store Bert embeddings in Cassandra

I want to use Cassandra as a feature store for precomputed Bert embeddings.
Each row would consist of roughly 800 values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we want every value in a row. I'm not sure which would be better for serialization speed.
Having everything as separate columns will be quite inefficient: each value carries its own metadata (its writetime, for example), which adds significant overhead (at least 8 bytes per value). Storing the data as a string is also not very efficient and adds complexity on the application side.
I would suggest storing the data as a frozen list of ints/longs or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
rowid int primary key,
data frozen<list<int>>
);
In this case the whole list is effectively serialized as a single binary blob, occupying just one cell.
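For completeness, here is a rough sketch of writing and reading that frozen list with the DataStax Java driver 4.x; the table matches the CQL above, but the session setup is simplified and the embedding values are made up:

import java.util.Arrays;
import java.util.List;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class BertStore {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Write one embedding row; the whole list lands in a single cell.
            PreparedStatement insert =
                session.prepare("INSERT INTO ks.bert (rowid, data) VALUES (?, ?)");
            List<Integer> embedding = Arrays.asList(-18294, 50211, -777);  // made-up values
            session.execute(insert.bind(1, embedding));

            // Read the whole row back; the list deserializes in one go.
            Row row = session.execute("SELECT data FROM ks.bert WHERE rowid = 1").one();
            List<Integer> data = row.getList("data", Integer.class);
            System.out.println(data);
        }
    }
}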

How do I most effectively compress my highly-unique columns?

I have a Spark DataFrame consisting of many double columns that are measurements, but I want a way of annotating each unique row by computing a hash of several other non-measurement columns. This hash produces garbled, highly unique strings, and I've noticed my dataset size increases substantially when this column is present. How can I sort / lay out my data to decrease the overall dataset size?
I know that the Snappy codec used on my Parquet files compresses best over runs of similar data, so I think a sort on the primary key could be useful, but I also can't coalesce() the entire dataset into a single file (it's hundreds of GB in total before the primary-key creation step).
My hashing function is SHA2(128) FYI.
If you have a column that can be computed from the other columns, then simply omit that column before compression, and reconstruct it after decompression.
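A hedged sketch of that idea with Spark's Java API; the paths and column names (site, sensor, row_hash) are made up, and the question's SHA2(128) is replaced with SHA-256, one of the bit lengths Spark's sha2 function accepts:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.sha2;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DropRecomputeHash {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("drop-recompute-hash").getOrCreate();

        // Drop the derived hash column before writing, so Snappy never sees the
        // high-entropy strings that compress poorly.
        Dataset<Row> df = spark.read().parquet("/data/measurements_with_hash");
        df.drop("row_hash").write().parquet("/data/measurements_compact");

        // Recompute the hash after reading, from the same non-measurement columns.
        Dataset<Row> compact = spark.read().parquet("/data/measurements_compact");
        Dataset<Row> restored = compact.withColumn(
            "row_hash",
            sha2(concat_ws("|", col("site"), col("sensor")), 256));
        restored.show(5);

        spark.stop();
    }
}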

Cassandra column value size for collections

How do the following two Cassandra limitations interact with one another?
Cells in a partition: ~2 billion (2^31); single column value size: 2 GB (1 MB is recommended) [1]
Collection values may not be larger than 64KB. [2]
Are collections laid out inside a single column, and should one therefore limit the size of the entire collection to 1 MB?
[1] https://docs.datastax.com/en/cql/3.3/cql/cql_reference/refLimits.html
[2] https://wiki.apache.org/cassandra/CassandraLimitations
A collection is a single column value with:
a single value inside limited to 64 KB (the maximum value of an unsigned short)
the number of items in the collection limited to 64K (the maximum value of an unsigned short)
The 1 MB is a recommendation, not a hard limit; you can go higher if you need to, but as always, test before going to production. And since you can have 2^16 items of 2^16 bytes each, a single collection could in theory reach 2^32 bytes = 4 GiB, which would break the 2 GB limit per cell.
But collections should be kept small for performance reasons anyway, since they are always read in their entirety, and updates to collections are not very fast either.

How Bucket number is decided in HashMap?

When a HashMap is created, its initial capacity is 16, meaning there are 16 buckets in memory where key-value pairs are stored.
Now suppose my hashCode() implementation is such that it generates random numbers in the range of 1000 to 10000.
In that case, how is it determined under which bucket a key-value pair will be stored? How is a random number in that range transformed into a bucket number?
According to this article HOW DOES A HASHMAP WORK IN JAVA
All the keys with the same hash value are put in the same linked list (bucket). Keys with different hash values can end up in the same bucket.
You may also want to check the answers from a similar question What is meant by number of buckets in the HashMap?
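To make the transformation concrete: recent OpenJDK HashMap implementations (Java 8+) spread the high bits of hashCode() and then mask with capacity - 1, which works because the capacity is always a power of two. A minimal sketch of that index calculation:

public class BucketIndexDemo {
    // Roughly what java.util.HashMap does internally since Java 8:
    // XOR the high 16 bits of hashCode() into the low 16 bits, then mask
    // with (capacity - 1). This works because capacity is always a power of two.
    static int bucketIndex(int hashCode, int capacity) {
        int spread = hashCode ^ (hashCode >>> 16);
        return spread & (capacity - 1);
    }

    public static void main(String[] args) {
        int capacity = 16;  // HashMap's default initial capacity
        // Hash codes "randomly" chosen between 1000 and 10000, as in the question,
        // still map into buckets 0..15.
        for (int hash : new int[] {1000, 4567, 9999}) {
            System.out.println(hash + " -> bucket " + bucketIndex(hash, capacity));
        }
    }
}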

Cassandra Cell Number Limitation

Is this 2-billion-cells-per-partition limit still valid?
http://wiki.apache.org/cassandra/CassandraLimitations
Let's say you store 16 bytes on average per cell. Then you can "just" persist 16 * 2e9 bytes = 32 GB of data (plus column names) on one machine!?
Or, if you imagine a square table, you would be able to store 44721 rows with 44721 columns each!?
Doesn't really sound like Big Data.
Is this correct?
Thanks!
Malte
The 2 billion cell limit is still valid, and you most likely want to remodel your data if you start seeing that many cells per partition.
The maximum number of cells (rows x columns) in a single partition is
2 billion.
A partition is defined by the partition key in CQL and determines where a particular piece of data will live. For example, if I had two nodes with fictional token ranges of 0-100 and 100-200, partition keys that hashed to between 0 and 100 would reside on the first node, and those with a hashed value between 100 and 200 would reside on the second node. In reality, Cassandra uses the Murmur3 algorithm to hash partition keys, generating values between -2^63 and 2^63-1.
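As a toy illustration only (not Cassandra's actual code), mapping a hashed token onto nodes that own contiguous ranges might look like this:

public class ToyTokenRing {
    // Each node owns tokens from its start (inclusive) up to the next node's start.
    // The two-node 0-200 range mirrors the fictional example above; real Cassandra
    // uses Murmur3 tokens in the range -2^63 .. 2^63-1.
    static final long[] NODE_RANGE_STARTS = {0, 100};
    static final String[] NODE_NAMES = {"node1", "node2"};

    static String ownerOf(long token) {
        int owner = 0;
        for (int i = 0; i < NODE_RANGE_STARTS.length; i++) {
            if (token >= NODE_RANGE_STARTS[i]) {
                owner = i;
            }
        }
        return NODE_NAMES[owner];
    }

    public static void main(String[] args) {
        System.out.println(ownerOf(42));    // node1
        System.out.println(ownerOf(150));   // node2
    }
}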
The real limitation tends to be how many unique values you have for your partition key. If you don't have enough uniqueness within a single column, many users combine columns to generate more uniqueness (a composite primary key).
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html
More info on hashing and how C* holds data.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePartitionerAbout_c.html
