HashMap value storage if the hash code exceeds the integer range

As many of us know, a HashMap initially allocates memory with a default initial capacity of 16 and a default load factor of 0.75. When we try to store a value in the HashMap, it first computes the bucket location by calling hashCode() on the map key. Suppose a user-defined hashCode() method returns a large value, far beyond the initially allocated capacity (number of buckets) of the HashMap; how will the value be stored in the HashMap?
or
Since hashCode() can return any value within the integer range, how does the JVM know that it has to store the map value within the initially allocated memory?

A HashMap can hold pairs up to Integer.MAX_VALUE. Exceeding that limit will cause unexpected behaviour in some methods, such as size(), but it will not affect methods such as remove(), get() and put(): remove() removes the mapping for the specified key from the map if it is present, get() returns the value to which the specified key is mapped (or null if the map contains no mapping for the key), and put() associates the specified value with the specified key, replacing the old value if the map previously contained a mapping for that key. However, there can be a collision between a new object and an existing object.
Reference: http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html

The index of the bucket to store an object in is computed from the result of hashCode, but is rarely actually equal to it.
e.g. if you use a power-of-two number of buckets, a common approach is
bucketIndex = key.hashCode() & (numberOfBuckets - 1);
I don't know what precise method HashMap uses to compute the bucket index.
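To make the masking idea concrete, here is a small self-contained sketch; the spread() helper is an assumption modelled on the common trick of XOR-ing in the high bits, not a claim about what any particular JDK version actually does:

// Minimal sketch: mapping an arbitrary int hashCode to a bucket index
// when the number of buckets is a power of two.
public final class BucketIndexDemo {

    // Hypothetical helper: mix the high bits into the low bits so the mask
    // below does not discard them entirely.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    static int bucketIndex(Object key, int numberOfBuckets) {
        // numberOfBuckets must be a power of two for this mask to work.
        return spread(key.hashCode()) & (numberOfBuckets - 1);
    }

    public static void main(String[] args) {
        // Even a huge or negative hash code lands in 0..15 for 16 buckets.
        System.out.println(bucketIndex("some key", 16));
        System.out.println(Integer.MAX_VALUE & 15);   // 15
        System.out.println(Integer.MIN_VALUE & 15);   // 0
    }
}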

Related

Storing arrays in Cassandra

I have lots of fast incoming data that is organised as follows:
Lots of 1D arrays, one per logical object, where the position of each element in the array is important; each element is calculated and produced individually, in parallel, and so not necessarily in order.
The data arrays themselves are not necessarily written in order.
The length of the arrays may vary.
The data is read as an entire array at a time, so it makes sense to store the whole thing together.
The way I see it, the issue is primarily caused by the way the data is made available for writing. If it was all available together I'd just store the entire lot together at the same time and be done with it.
For smaller data loads I can get away with the postgres array datatype. One row per logical object with a key and an array column. This allows me to scale by having one writer per array, writing the elements in any order without blocking any other writer. This is limited by the rate of a single postgres node.
In Cassandra/Scylla it looks like I have the following options:
Storing each element as its own row, which would be very fast for writing; reads would be more cumbersome but doable, involving potentially lots of very wide scans.
Or converting the array to JSON/string, reading the cell, tweaking the value and then re-writing it, which would be horribly slow and lead to lots of compaction overhead.
Or having the writer buffer until it receives all the array values and then writing the array in one go, except the writer won't know how long the array should be and will need a timeout to write out whatever it has by that time, which ultimately means I'll need to update it at some point in the future if the late data turns up.
What other options do I have?
Thanks
Option 1 seems to be a good match:
I assume each logical object has a unique id (or better, a uuid).
In such a case, you can create something like
CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
where id (a uuid) is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
This allows:
fast retrieval of the entire "array", even a big one, using paging
fast retrieval of a single index in an array
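For illustration, a rough sketch of how the out-of-order writes and the whole-array read could look; it assumes the DataStax Java driver 4.x, a locally reachable cluster, and that the tbl table above lives in a hypothetical keyspace named ks:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.UUID;

public class ArrayStoreDemo {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("ks").build()) {
            UUID objectId = UUID.randomUUID();

            // Elements can be written in any order; each becomes its own row.
            PreparedStatement insert =
                    session.prepare("INSERT INTO tbl (id, ord, v) VALUES (?, ?, ?)");
            session.execute(insert.bind(objectId, 2, "third element"));
            session.execute(insert.bind(objectId, 0, "first element"));
            session.execute(insert.bind(objectId, 1, "second element"));

            // Reading the whole "array": rows come back in clustering-key order,
            // and the driver pages through large partitions transparently.
            PreparedStatement select =
                    session.prepare("SELECT ord, v FROM tbl WHERE id = ?");
            ResultSet rs = session.execute(select.bind(objectId));
            for (Row row : rs) {
                System.out.println(row.getInt("ord") + " -> " + row.getString("v"));
            }
        }
    }
}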

Hazelcast Projection fetch single field

I have a question regarding the Hazelcast Projection API.
Let's say I want to fetch just a single field from an entry in the map, using Portable serialization.
In this example a name from an employee.
I'm guessing I will be getting better performance in relation to network traffic and deserialization by using a projection like this:
public String getName(Long key) {
    return map.project(
            (Projection<Entry<Long, Employee>, String>) entry -> entry.getValue().getName(),
            (Predicate<Long, Employee>) mapEntry -> mapEntry.getKey().equals(key))
        .stream()
        .findFirst()
        .orElse(null);
}
Instead of something like:
public String getName(Long key) {
    return map.get(key).getName();
}
Projection comes in handy if you want to return only a part of the value object across a result set of many value objects. For a single key-based lookup it is overkill; a map.get() is a lighter-weight operation than running a predicate.
Network traffic = not sure how much you will save, as that depends on your network bandwidth + the size of the object + the number of objects travelling concurrently.
Deserialization = not much saving unless the actual object stored as the value is monstrous and the field you are extracting is a tiny part of it.
If you are conscious of network bandwidth and ser/des cost, then keep the in-memory-format set to OBJECT and use an EntryProcessor to update. If you do not have anything to update, then use an ExecutorService.
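For reference, a minimal sketch of the EntryProcessor idea mentioned above; it assumes Hazelcast 4.x/5.x (where EntryProcessor has a single abstract method and can be used as a lambda), an embedded member, and a simplified Employee class standing in for the one from the question:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import java.io.Serializable;
import java.util.Map;

public class EntryProcessorDemo {

    // Simplified stand-in for the question's Employee type.
    public static class Employee implements Serializable {
        private String name;
        public Employee(String name) { this.name = name; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Long, Employee> map = hz.getMap("employees");
        map.put(1L, new Employee("Alice"));

        // Update a single field on the member that owns the key, instead of
        // shipping the whole value to the caller and back.
        map.executeOnKey(1L, (Map.Entry<Long, Employee> entry) -> {
            Employee e = entry.getValue();
            e.setName("Alice Smith");
            entry.setValue(e);   // write the modified value back
            return null;
        });

        // A plain get is still the simplest way to read a single entry.
        System.out.println(map.get(1L).getName());
        hz.shutdown();
    }
}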

Is there any alternative to 32K string limits?

I want to store WKT strings that can be quite large, but I'm running into the 32K limit while storing them in object values.
create table A (id integer, wkt object);
So there is a way to store longer strings in objects:
CREATE TABLE IF NOT EXISTS A (
"id" INTEGER,
"wkt" OBJECT (IGNORED)
)
By using IGNORED the entire object is not indexed, which also prevents it from being used properly in other parts of SQL queries (those will always do a full table scan).
However, subscripts work just fine.
For other readers: WKT can also be stored as the geo_shape type, or used with match directly.

Grouping related keys of an RDD together

I have a generated RDD with a set of key-value pairs. Assume that the keys are [10, 20, 25, 30, 40, 50]. The real keys are nearby geographic bins of size X.X meters that need to be aggregated to size 2X.2X.
So in this RDD I need to aggregate keys that have a relation between them, for example a key that is twice the current key, say 10 and 20. These will be added together to give 30, and their values will also be added together. Similarly, the result set would be [30, 25, 70, 50].
I am assuming that, since map and reduce work on the current element of an RDD, there is no way to do this using map, groupByKey or aggregateByKey, as the grouping I want needs the state of the previous key.
I was thinking the only way to do this is to iterate through the elements of the RDD using foreach, and for each element also pass in the entire RDD.
def group_rdds_together(rdd, rdd_list):
    key, val = rdd
    xbin, ybin = key
    rdd_list.foreach(group_similar_keys, xbin, ybin)

bin_rdd.map(lambda x: group_rdds_together(x, bin_rdd))
For that I have to pass the RDD into the map lambda, as well as custom parameters to the foreach function.
What I am doing is horribly wrong; I just wanted to illustrate where I am going with this. There should be a simpler and better way than this.

IMap Hazelcast size operation

If you have several nodes in Hazelcast, will IMap.size() return the size of the map on that node, or the total size of all objects associated with that map distributed across all nodes?
Looking through the javadocs, I don't see that method being overridden, so I imagine the call won't return what would normally be expected from a non-distributed map.
It returns the estimated size of the complete distributed map. Estimated, because it just gets the count of elements in each partition at a given point in time; since the partitions are queried one after another, the total does not have to be 100% exact, but it is exact if you get the size while not mutating the actual map.
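To see this in practice, here is a small sketch that assumes two embedded members started in the same JVM and a hypothetical map named demo; both members report the cluster-wide count, not their local share:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.Map;

public class MapSizeDemo {
    public static void main(String[] args) {
        // Two members form a cluster; the map's partitions are spread across both.
        HazelcastInstance member1 = Hazelcast.newHazelcastInstance();
        HazelcastInstance member2 = Hazelcast.newHazelcastInstance();

        Map<Integer, String> map = member1.getMap("demo");
        for (int i = 0; i < 1000; i++) {
            map.put(i, "value-" + i);
        }

        // size() aggregates the per-partition counts across the whole cluster,
        // so both members print 1000.
        System.out.println("size seen from member1: " + member1.getMap("demo").size());
        System.out.println("size seen from member2: " + member2.getMap("demo").size());

        Hazelcast.shutdownAll();
    }
}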

Resources