When a HashMap is created, its initial capacity is 16, which means there are 16 buckets in memory where key-value pairs will be stored.
Now suppose my hashCode() implementation returns a random number in the range 1000 to 10000.
In this case, how is it determined which bucket a key-value pair is stored in? How is a number in that range transformed into a bucket number?
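According to the article "How does a HashMap work in Java":
All the keys with the same hash value are put in the same linked list (bucket). Keys with different hash values can end up in the same bucket.
That is because the bucket index is computed from the hash code rather than being the hash code itself: in OpenJDK's HashMap the table size is a power of two and the index is the (bit-spread) hash masked with size - 1, so a hash in the range 1000 to 10000 is reduced to a value from 0 to 15.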
You may also want to check the answers to a similar question, "What is meant by number of buckets in the HashMap?"
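To make the hash-to-bucket step concrete, here is a minimal sketch of how OpenJDK 8 and later derive the index. The class and method names are mine, not the JDK's, and other JVMs or versions may do this differently.

```java
final class BucketIndexSketch {
    // Mirrors the bit-spreading step of OpenJDK 8+ HashMap: XOR the high
    // 16 bits into the low 16 bits so they also influence the bucket choice.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // For a power-of-two capacity, masking with (capacity - 1) keeps only
    // the low bits, which is the bucket index.
    static int bucketIndex(int hashCode, int capacity) {
        return spread(hashCode) & (capacity - 1);
    }

    public static void main(String[] args) {
        int capacity = 16; // default initial capacity
        for (int h : new int[] {1000, 1016, 9999}) {
            System.out.println(h + " -> bucket " + bucketIndex(h, capacity));
        }
        // prints:
        // 1000 -> bucket 8
        // 1016 -> bucket 8
        // 9999 -> bucket 15
    }
}
```

Note that 1000 and 1016 are different hash values but land in the same bucket, which is exactly the second case from the quoted article.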
Let's say each key is large, say 128 bytes, and each value is 8-10 bytes.
I looked at general-purpose hash maps in C++ and elsewhere, and the largest ones I found hold around 2^25 key-value pairs.
I want to use Cassandra as a feature store for precomputed BERT embeddings.
Each row would consist of roughly 800 floating-point values (e.g. -0.18294132). Should I store all 800 in one large string column or in 800 separate columns?
The read pattern is simple: on read we would want every value in a row. I'm not sure which would be better for serialization speed.
Having everything as a separate column will be quite inefficient: each value will have its own metadata (write time, for example), which adds significant overhead (at least 8 bytes per value). Storing the data as a string is also not very efficient and adds complexity on the application side.
I would suggest storing the data as a frozen list of integers/longs or doubles/floats, depending on your requirements. Something like:
create table ks.bert(
  rowid int primary key,
  data frozen<list<float>>  -- float, since the sample values are fractional
);
In this case, the whole list will effectively be serialized as a binary blob, occupying just one cell.
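As a rough back-of-the-envelope check of the overhead argument, the arithmetic can be sketched as below. It assumes 4-byte floats and takes the "at least 8 bytes per cell" figure from above as given. It is a lower bound for illustration only, not Cassandra's actual on-disk size, and it ignores the framing of the list itself.

```java
final class CellOverheadSketch {
    // Assumption taken from the answer above: at least 8 bytes of
    // metadata (e.g. the write time) per cell.
    static final int MIN_METADATA_BYTES_PER_CELL = 8;

    static long metadataBytes(int cells) {
        return (long) cells * MIN_METADATA_BYTES_PER_CELL;
    }

    static long payloadBytes(int values, int bytesPerValue) {
        return (long) values * bytesPerValue;
    }

    public static void main(String[] args) {
        int values = 800;
        long payload = payloadBytes(values, Float.BYTES); // 800 * 4
        System.out.println("payload bytes:            " + payload);
        System.out.println("metadata, 800 columns:    " + metadataBytes(values));
        System.out.println("metadata, one frozen cell: " + metadataBytes(1));
        // prints 3200, 6400 and 8 respectively
    }
}
```

Under these assumptions the per-column layout spends at least twice as many bytes on metadata as on the embedding itself.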
I'm inspecting the output of sstabledump to gain a better understanding of the Cassandra data model, and I have some questions.
From the output of sstabledump it seems that:
a table is a list of partitions (split by partition key)
a partition is a list of rows (split according to clustering key)
a row is a map of key-value pairs, where the keys belong in a predefined list
Question 1: For each partition, as well as for each row inside a partition, there is a position key. What does this value correspond to? Physical storage details? And how exactly?
Question 2: Each row inside each partition has a "type": "row" key-value pair. Could this type be anything else? If yes, what? If not:
why have a value that is always the same?
why is Cassandra classified as wide-column (and similar terms)? It looks more like two-level row storage.
The partition token is the Murmur3 hash of the partition key (the first part of whatever you declared as the primary key). Consistent hashing is used with that hash to determine which node in the cluster the partition belongs to, along with its replicas. Within each partition, data is sorted by clustering key, and then by cell name within the row. This structure is used so that redundant information, such as the timestamps of cells inserted into a row at once, is stored only once, as a vint delta sequence from the partition, to save space.
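The consistent-hashing lookup can be sketched with a sorted map as a simplified illustration. The tokens and node names below are made up, and the sketch ignores vnodes and replica placement. It only shows how a token finds its owning node on the ring.

```java
import java.util.Map;
import java.util.TreeMap;

final class TokenRingSketch {
    // Ring position (token) -> node name. A node owns the range that ends
    // at its own token, so the owner is the first node at or after a token.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    // Assumes at least one node has been added.
    String ownerOf(long partitionToken) {
        Map.Entry<Long, String> e = ring.ceilingEntry(partitionToken);
        // Past the highest token: wrap around to the start of the ring.
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TokenRingSketch r = new TokenRingSketch();
        r.addNode(-100L, "node-a"); // hypothetical tokens
        r.addNode(0L, "node-b");
        r.addNode(100L, "node-c");
        System.out.println(r.ownerOf(-50L)); // node-b
        System.out.println(r.ownerOf(101L)); // node-a (wrapped around)
    }
}
```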
On disk, the partitions are sorted by this hashed key. The position key in the output simply refers to where the item is located in the sstable's data file (as a decompressed byte offset). The type can also be a static block, which sits at the beginning of each partition and holds any static cells, or a range tombstone marker (beginning or end). Note that sstabledump sometimes repeats values in the JSON for readability even when they are not physically written on disk (e.g. repeated timestamps).
You can have many of these rows inside a partition; a common data model for time series, for example, uses the timestamp as the clustering key, which produces very wide partitions with millions of rows. Before 3.0, the storage format was also closer to Bigtable's design. It was essentially a Map<byte[], SortedMap<byte[], Cell>> where the Comparator of the sorted map changed based on the schema. It did not differentiate rows and columns within a partition, which led to massive amounts of redundant data, so it was redesigned to fit the query language better.
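To illustrate where the redundancy in the pre-3.0 layout came from, here is a toy version of that nested map. Strings stand in for the byte[] keys and Cell values, and the "clustering:column" cell naming is my simplification, not the real encoding.

```java
import java.util.Map;
import java.util.TreeMap;

final class Pre30LayoutSketch {
    // partition key -> (cell name -> value), cells kept sorted by name.
    static Map<String, TreeMap<String, String>> build() {
        Map<String, TreeMap<String, String>> table = new TreeMap<>();
        TreeMap<String, String> partition = new TreeMap<>();
        // Two logical rows (clustering values t1 and t2) with two columns
        // each. The clustering value is repeated in every cell name.
        partition.put("t1:humidity", "40");
        partition.put("t1:temp", "21.5");
        partition.put("t2:humidity", "42");
        partition.put("t2:temp", "21.7");
        table.put("sensor-1", partition);
        return table;
    }

    public static void main(String[] args) {
        TreeMap<String, String> partition = build().get("sensor-1");
        // A "row" is just a contiguous slice of the sorted cells:
        // ';' is the character right after ':', so this selects "t1:*".
        System.out.println(partition.subMap("t1:", "t1;"));
        // prints {t1:humidity=40, t1:temp=21.5}
    }
}
```

The map itself has no notion of a row; rows only exist as a naming convention over flat cells, which is the mismatch with the query language mentioned above.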
Some more references:
An explanation of the motivation for the 3.0 change, by DataStax
A blog post by TLP with a good, detailed explanation of the new disk format
I'm trying to reproduce the Murmur3 hashing in Cassandra. Does anyone know how to get at the actual hash values used in the row keys? I just need some key - hash value pairs from my data to check that my implementation of the hashing is correct.
Ask Cassandra! Insert some data in your table. Afterwards, you can use the token function in a select query to get the used token values. For example:
select token(id), id from myTable;
A composite partition key is serialized as a sequence of n components, one per key element. Each component consists of a short giving the length, then a byte array containing the serialized value of that key element, then a closing 0 byte. It's unclear to me what these closing zeros are for; it has something to do with SuperColumns...
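That layout can be sketched as follows, if you want bytes to feed into your own Murmur3 implementation. I am assuming big-endian lengths, and the encoding of each component (UTF-8 for text, 4 bytes big-endian for an int) is also an assumption here, so compare the result against token() before relying on it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CompositeKeySketch {
    // Per component: [2-byte length][component bytes][one 0 byte].
    static byte[] serialize(byte[]... components) {
        int size = 0;
        for (byte[] c : components) {
            size += 2 + c.length + 1;
        }
        ByteBuffer out = ByteBuffer.allocate(size); // big-endian by default
        for (byte[] c : components) {
            out.putShort((short) c.length);
            out.put(c);
            out.put((byte) 0); // the closing zero
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] text = "ab".getBytes(StandardCharsets.UTF_8);
        byte[] number = ByteBuffer.allocate(4).putInt(7).array();
        StringBuilder hex = new StringBuilder();
        for (byte b : serialize(text, number)) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim());
        // prints: 00 02 61 62 00 00 04 00 00 00 07 00
    }
}
```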
My entity is a key-value pair. 90% of the time I'll be retrieving the entity by key, but 10% of the time I'll also do a reverse lookup, i.e. search by value and get the key.
The key and value are both guaranteed to be unique, and hence their combination is also guaranteed to be unique.
Is it correct to use Key as PartitionKey and Value as RowKey?
I believe this will also ensure that my data is perfectly load-balanced between servers, since the PartitionKey is unique.
Are there any problems in the above decision?
Is it ever practical to have a hard-coded partition key, i.e. all rows share the same partition key while the RowKey is kept unique?
Is it doable? Yes, but depending on the size of your data, I'm not sure it's a good idea. When you query on the partition key, Table Storage can go directly to the exact partition and retrieve your records. If you query on RowKey alone, Table Storage has to check whether the row exists in every partition of the table. So if you have 1000 key-value pairs, searching by your key will read a single partition/row. If you search by your value alone, it will read all 1000 partitions!
I faced a similar problem and solved it in two ways:
Have two different tables: one with your key as the PartitionKey, the other with your value as the PartitionKey. Storage is cheap, so duplicating the data shouldn't cost much.
(What I finally did) If you're effectively returning single entities based on a unique key, just stick them in blobs (partitioned and pivoted as in point 1), because you don't need to traverse a table, so don't.
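The two-table idea from point 1 boils down to maintaining a forward and a reverse index side by side. Here is an in-memory sketch of it, with two maps standing in for the two tables. The class and method names are hypothetical and no storage SDK calls are involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class TwoWayLookupSketch {
    private final Map<String, String> byKey = new HashMap<>();   // "table" 1
    private final Map<String, String> byValue = new HashMap<>(); // "table" 2

    // Writes go to both indexes; stale entries are removed so the two
    // sides never disagree.
    void put(String key, String value) {
        String oldValue = byKey.put(key, value);
        if (oldValue != null) {
            byValue.remove(oldValue);
        }
        String oldKey = byValue.put(value, key);
        if (oldKey != null && !oldKey.equals(key)) {
            byKey.remove(oldKey);
        }
    }

    // The common case: direct lookup by key.
    Optional<String> valueFor(String key) {
        return Optional.ofNullable(byKey.get(key));
    }

    // The reverse lookup is also a direct hit instead of a full scan.
    Optional<String> keyFor(String value) {
        return Optional.ofNullable(byValue.get(value));
    }

    public static void main(String[] args) {
        TwoWayLookupSketch store = new TwoWayLookupSketch();
        store.put("user-42", "alice@example.com");
        System.out.println(store.valueFor("user-42").orElse("?"));
        System.out.println(store.keyFor("alice@example.com").orElse("?"));
    }
}
```

With real tables the two writes are not atomic, so you would need to decide what happens if the second write fails; the sketch sidesteps that.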