Understanding the map eviction algorithm in Hazelcast

I'm using Hazelcast IMDG 3.12.6 (yes, I know it's old) in my current project, and I'm trying to figure out how exactly the map eviction algorithm works.
First of all, I read these two sections of the manual:
https://docs.hazelcast.org/docs/3.12.6/manual/html-single/index.html#map-eviction
https://docs.hazelcast.org/docs/3.12.6/manual/html-single/index.html#eviction-algorithm
I found something confusing about map eviction, so let me describe it.
There is a class com.hazelcast.map.impl.eviction.EvictionChecker with a method public boolean checkEvictable. This method checks whether a RecordStore is evictable based on the configured max size policy:
switch (maxSizePolicy) {
    case PER_NODE:
        return recordStore.size() > toPerPartitionMaxSize(maxConfiguredSize, mapName);
    case PER_PARTITION:
        return recordStore.size() > maxConfiguredSize;
    // other cases...
I find it confusing that the PER_NODE policy checks toPerPartitionMaxSize while the PER_PARTITION policy checks maxConfiguredSize.
It seems to me it should be the other way around.
If we take a deeper look into the history of how EvictionChecker has changed, we can find something interesting.
According to git blame, this class has been changed twice: once 7 years ago and once 4 years ago.
I believe the PER_NODE and PER_PARTITION conditions should be swapped.
Can you please explain and confirm whether com.hazelcast.map.impl.eviction.EvictionChecker#checkEvictable behaves correctly when PER_NODE is used?
UPD
I've done some local tests. My configuration is:
<map name="ReportsPerNode">
    <eviction-policy>LRU</eviction-policy>
    <max-size policy="PER_NODE">500</max-size>
</map>
I tried to put 151 elements into the map. As a result, the map contains 112 elements and 39 elements are evicted by the max size policy. The calculation gives translatedPartitionSize = maxConfiguredSize * memberCount / partitionCount = 500 * 1 / 271 = 1.84.
If the PER_PARTITION policy is used, my test passes.
I analyzed how the data is distributed over the RecordStores: every RecordStore contains between 1 and 4 elements.
According to the formula maxConfiguredSize * memberCount / partitionCount, that means maxConfiguredSize would have to be 1084 elements: 1084 * 1 / 271 = 4.
As a result I have two working configurations:
PER_NODE works when max-size = 1084; the map contains 151 elements as expected.
PER_PARTITION works when max-size = 500, and also when max-size = 271; the map contains 151 elements as expected.
It seems that the data distribution over the RecordStores strongly depends on the key hash. Why can't we put 1 element per partition if there are 271 partitions by default?
It's also unclear why I should need a map capacity of 1084 to put only 151 elements.
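To illustrate the last two points, here is a back-of-the-envelope simulation (a sketch in Node.js; it is not Hazelcast's real partitioning, which hashes the serialized key, and it ignores how many entries Hazelcast actually removes once a RecordStore is flagged evictable). With 151 keys spread over 271 partitions, some partitions inevitably end up with 2-4 entries while others stay empty, and any partition holding 2 or more entries already exceeds the translated limit of 1.84:

// Back-of-the-envelope model: spread 151 keys uniformly at random over 271
// partitions and compare each partition's size with the PER_NODE limit that
// checkEvictable uses, i.e. maxConfiguredSize * memberCount / partitionCount.
const PARTITION_COUNT = 271;
const ENTRY_COUNT = 151;
const MEMBER_COUNT = 1;
const MAX_CONFIGURED_SIZE = 500;

const perPartitionLimit = (MAX_CONFIGURED_SIZE * MEMBER_COUNT) / PARTITION_COUNT; // ~1.84

const partitionSizes = new Array(PARTITION_COUNT).fill(0);
for (let i = 0; i < ENTRY_COUNT; i++) {
    // stand-in for Hazelcast's key hashing / partition assignment
    partitionSizes[Math.floor(Math.random() * PARTITION_COUNT)]++;
}

const overLimit = partitionSizes.filter((size) => size > perPartitionLimit);
const excess = overLimit.reduce((sum, size) => sum + size - Math.floor(perPartitionLimit), 0);

console.log(`per-partition limit: ${perPartitionLimit.toFixed(2)}`);
console.log(`partitions over the limit: ${overLimit.length}`);
console.log(`entries above the limit (roughly what gets evicted): ~${excess}`);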

Related

StormCrawler: setting "maxDepth": 0 prevents ES seed injection

With StormCrawler 2.3-SNAPSHOT, setting "maxDepth": 0 in the urlfilters.json prevents the seed injection into the ES index. Is that the expected behaviour? Or should it inject the seeds and do a closed crawl on the injected seeds only, with no link following at all (which is what I was expecting)?
The launch looks fine but the ES status index is empty.
See MaxDepthFilter: with a value of 0, everything gets filtered. Setting the filter to a value of 1 should do the trick; the seeds will be injected but their links won't be followed.
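The MaxDepthFilter entry in urlfilters.json would then look roughly like this (a sketch from memory; keep the rest of your existing filter chain as it is and double-check the class name against the urlfilters.json shipped with your version):

{
    "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
    "name": "MaxDepthFilter",
    "params": {
        "maxDepth": 1
    }
}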
In MaxDepthFilter,
private String filter(final int depth, final int max, final String url) {
    // deactivate the outlink no matter what the depth is
    if (max == 0) {
        return null;
    }
    if (depth >= max) {
        LOG.debug("filtered out {} - depth {} >= {}", url, depth, max);
        return null;
    }
    return url;
}
it turns out that URLs need to have a depth of at most max-1 to be kept; to put it differently, the actual maximum depth is max-1.
This doesn't feel right and is slightly confusing, I agree.
I think this is due to the sequence in which the outlinks get filtered. Often this is done in the StatusEmitterBolt.
At the moment they first get filtered and then inherit their metadata from the parent metadata. It is during that later step that their depth value gets incremented; I suspect this is why we do the max-1 trick.
There was probably a reason why the filtering was done first and the metadata inheritance second, but it has been a while and I can't remember it. I would be happy to change the order (get the metadata, then filter) and change the depth filtering so that it is more intuitive. Could you please open an issue on GitHub so that we can discuss it there?
Thanks!

Azure StorageException when using emulated storage (within documented constraints)

Our application performs several batches of TableBatchOperation. We ensure that each of these table batch operations has
100 or fewer table operations
table operations for one entity partition key only
Along the lines of the following:
foreach (var batch in batches)
{
    var operation = new TableBatchOperation();
    operation.AddRange(batch.Select(x => TableOperation.InsertOrReplace(x)));
    await table.ExecuteBatchAsync(operation);
}
When we use emulated storage we're hitting a Microsoft.WindowsAzure.Storage.StorageException - "Element 99 in the batch returned an unexpected response code."
When we use production Azure, everything works fine.
Emulated storage is configured as follows:
<add key="StorageConnectionString" value="UseDevelopmentStorage=true;" />
I'm concerned that although everything is working OK in production (where we use real Azure), the fact that it's blowing up with emulated storage may be symptomatic of us doing something we shouldn't be.
I've run it with a debugger (before it blows up) and verified that (as per API):
The entire operation is only 492093 characters when serialized to JSON (984186 bytes as UTF-16)
There are exactly 100 operations
All entities have the same partition key
See https://learn.microsoft.com/en-us/dotnet/api/microsoft.windowsazure.storage.table.tablebatchoperation?view=azurestorage-8.1.3
EDIT:
It looks like one of the items (#71/100) is causing this to fail. Structurally it is no different from the other items; however, it does have some rather long string properties, so perhaps there is an undocumented limitation / bug?
EDIT:
The following sequence of Unicode UTF-16 bytes (on a string property) is sufficient to cause the exception:
r e n U+0019 s space
114 0 101 0 110 0 25 0 115 0 32 0
(it's the bytes 25 0, i.e. the Unicode END OF MEDIUM character U+0019, that cause the exception).
EDIT:
Complete example of failing entity:
JSON:
{"SomeProperty":"ren\u0019s ","PartitionKey":"SomePartitionKey","RowKey":"SomeRowKey","Timestamp":"0001-01-01T00:00:00+00:00","ETag":null}
Entity class:
public class TestEntity : TableEntity
{
    public string SomeProperty { get; set; }
}
Entity object construction:
var entity = new TestEntity
{
    SomeProperty = Encoding.Unicode.GetString(new byte[]
        { 114, 0, 101, 0, 110, 0, 25, 0, 115, 0, 32, 0 }),
    PartitionKey = "SomePartitionKey",
    RowKey = "SomeRowKey"
};
Based on your description, I can also reproduce the issue you mentioned. After testing, I found that the special Unicode character 'END OF MEDIUM' (U+0019) does not seem to be supported by the Azure Storage Emulator. If replacing it is possible, please try using another Unicode character instead.
We could also give this feedback to the Azure Storage team.

Node.js - Multimap

I have the following data (example):
1 - "Value1A"
1 - "Value1B"
1 - "Value1C"
2 - "Value2A"
2 - "Value2B"
I'm using Multimaps for the above data, such that key 1 has 3 values (Value1A, Value1B, Value1C) and key 2 has 2 values (Value2A, Value2B).
When I try to retrieve all the values for a given key using the get function, it works. But I want to get the key given the value, i.e. if I have "Value1C", I want to use it to get its key, 1, from the Multimap. Is this possible? If so, how? And if not, what can I use other than a Multimap to achieve this?
Thanks for the help.
https://www.npmjs.com/package/multimap
It is not possible to do this with a single operation. You will need to choose between using some extra memory or consuming CPU resources.
Use more memory
In this case you need to store the data in a reverse mapping, so you will have another map that stores "Value1C" -> 1. This solution can cause consistency issues, since every operation will need to update both maps: the original one and the reverse one (a wrapper that keeps the two maps in sync is sketched after the example below).
The code for this approach is basic:
//insert
map.set(1, "Value1C");
reverseMap.set("Value1C", 1);
//search
console.log(map.get(reverseMap.get("Value1C")));
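A minimal sketch of such a wrapper, assuming (as in your example) that each value belongs to exactly one key and using the multimap package's set/get methods shown above:

const Multimap = require('multimap');

// Keeps a Multimap (key -> values) and a plain Map (value -> key) in sync,
// so lookups work in both directions without scanning.
class BiMultimap {
    constructor() {
        this.map = new Multimap();      // 1 -> ['Value1A', 'Value1B', ...]
        this.reverseMap = new Map();    // 'Value1A' -> 1
    }

    set(key, value) {
        this.map.set(key, value);
        this.reverseMap.set(value, key);
    }

    getValues(key) {                    // all values for a key
        return this.map.get(key);
    }

    getKey(value) {                     // the key a value belongs to
        return this.reverseMap.get(value);
    }
}

// Usage
const bi = new BiMultimap();
bi.set(1, 'Value1A');
bi.set(1, 'Value1C');
bi.set(2, 'Value2A');
console.log(bi.getKey('Value1C'));     // 1
console.log(bi.getValues(1));          // [ 'Value1A', 'Value1C' ]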
Use more CPU
In this case you will need to search through all the values, which is O(n). That is not good if your list is big, and it is even worse in a single-threaded environment like Node.js.
Check the code example below:
function findValueInMultiMap(map, value, callback){
    map.forEachEntry(function (entry, key) {
        for (var e in entry){
            if (entry[e] == value){
                callback(map.get(key));
            }
        }
    });
}
findValueInMultiMap(map, 'Value1C', function(values){
    console.log(values);
});

Short user-friendly ID for mongo

I am creating a real-time stock trading system and would like to provide the user with a human-readable, user-friendly way to refer to their orders. For example, the ID should be about 8 characters long and only contain upper-case characters, e.g. Z9CFL8BA. For obvious reasons the id needs to be unique in the system.
I am using MongoDB as the backend database and have evaluated the following projects which do not meet my requirements.
hashids.org - this looks good but it generates ids which are too long:
var mongoId = '507f191e810c19729de860ea';
var id = hashids.encodeHex(mongoId);
console.log(id)
which results in: 1E6Y3Y4D7RGYHQ7Z3XVM4NNM
github.com/dylang/shortid - this requires that you specify a 64 character alphabet, and as mentioned I only want to use uppercase characters.
I understand that the only way to achieve what I am looking for may well be by generating random codes that meet my requirements and then checking the database for collisions. If this is the case, what would be the most efficient way to do this in a nodejs / mongodb environment?
You're attempting to convert a base-16 (hexadecimal) number to base-36 (26 letters plus 10 digits). A simple way is to use parseInt's radix parameter to parse the hexadecimal id and then call .toString(36) to convert it into base-36. This turns "507f191e810c19729de860ea" into "VDFGUZEA49X1V50356", reducing the length from 24 to 18 characters. (The id is split in half in the code below because 24 hex digits exceed JavaScript's safe integer range.)
function toBase36(id) {
    var half = Math.floor(id.length / 2);
    var first = id.slice(0, half);
    var second = id.slice(half);
    return parseInt(first, 16).toString(36).toUpperCase()
        + parseInt(second, 16).toString(36).toUpperCase();
}
// Ignore everything below (for demo only)
function convert(e){ if (e.target.value.length % 2 === 0) base36.value = toBase36(e.target.value) }
var base36 = document.getElementById('base36');
var hex = document.getElementById('hex');
document.getElementById('hex').addEventListener('input', convert, false);
convert({ target: { value: hex.value } });
input { font-family: monospace; width: 15em; }
<input id="hex" value="507f191e810c19729de860ea">
<input id="base36" readonly>
I understand that the only way to achieve what I am looking for may well be by generating random codes that meet my requirements and then checking the database for collisions. If this is the case, what would be the most efficient way to do this in a nodejs / mongodb environment?
Given your description, you use 8 chars in the range [0-9A-Z] as "id". That is 36⁸ combinations (≈ 2.8211099E12). Assuming your trading system does not gain insanely huge popularity in the short to mid term, the chances of collision are rather low.
So you can take an optimistic approach, generating a random id with something along the lines of the code below (as noticed by @idbehold in a comment, be warned that Math.random is probably not random enough and so might increase the chances of collision; if you go that way, maybe you should investigate a better random generator [1]).
> rid = Math.floor(Math.random()*Math.pow(36, 8))
> rid.toString(36).toUpperCase()
30W13SW
Then, using a proper unique index on that field, you only have to loop, regenerating a new random ID until there is no collision when trying to insert the new transaction (a sketch of that loop is at the end of this answer). As the chances of collision are relatively small, this terminates quickly, and most of the time it will insert the new document on the first iteration because there was no collision.
If I'm not mistaken, even assuming 10 billion transactions, you still have only about a 0.3% chance of collision on the first attempt, and a little more than 0.001% on the second.
[1] On node, you might prefer using crypto.pseudoRandomBytes to generate your random id. You might build something around that, maybe:
> b = crypto.pseudoRandomBytes(6)
<SlowBuffer d3 9a 19 fe 08 e2>
> rid = b.readUInt32BE(0)*65536 + b.readUInt16BE(4)
232658814503138
> rid.toString(36).substr(0,8).toUpperCase()
'2AGXZF2Z'
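Putting the pieces together, here is a sketch of the insert-with-retry loop described above, using the official mongodb Node.js driver and crypto.randomBytes (the modern name for pseudoRandomBytes). The collection and field names are placeholders, and it assumes a unique index has been created on the shortId field:

const crypto = require('crypto');

// Generate an 8-character, upper-case, base-36 id from 48 random bits.
function generateShortId() {
    const n = crypto.randomBytes(6).readUIntBE(0, 6);
    return n.toString(36).toUpperCase().padStart(8, '0').slice(-8);
}

// Insert an order, retrying with a fresh id whenever the unique index on
// shortId reports a collision (duplicate key error, code 11000).
// Assumes: await collection.createIndex({ shortId: 1 }, { unique: true })
async function insertOrder(collection, order) {
    for (;;) {
        try {
            const shortId = generateShortId();
            await collection.insertOne({ ...order, shortId });
            return shortId;
        } catch (err) {
            if (err.code !== 11000) throw err;   // only retry on duplicate key
        }
    }
}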

Querying documents containing two tags with CouchDB?

Consider the following documents in a CouchDB:
{
    "name": "Foo1",
    "tags": ["tag1", "tag2", "tag3"],
    "otherTags": ["otherTag1", "otherTag2"]
}
{
    "name": "Foo2",
    "tags": ["tag2", "tag3", "tag4"],
    "otherTags": ["otherTag2", "otherTag3"]
}
{
    "name": "Foo3",
    "tags": ["tag3", "tag4", "tag5"],
    "otherTags": ["otherTag3", "otherTag4"]
}
I'd like to query all documents that contain ALL (not any!) tags given as the key.
For example, if I request using '["tag2", "tag3"]' I'd like to retrieve Foo1 and Foo2.
I'm currently doing this by querying by tag, first for "tag2", then for "tag3", and creating the intersection manually afterwards.
This seems to be awfully inefficient and I assume that there must be a better way.
My second question - but they are quite related, I think - would be:
How would I query for all documents that contain "tag2" AND "tag3" AND "otherTag3"?
I hope a question like this hasn't been asked/answered before. I searched for it and didn't find one.
Do you have a maximum number of:
tags per document, and
tags allowed in the query?
If so, you have an upper bound on the maximum number of tags to be indexed. For example, with a maximum of 5 tags per document, and 5 tags allowed in the AND query, you could simply output every 1-, 2-, 3-, 4-, and 5-tag combination into your index, for a maximum of 1 (five-tag combo) + 5 (four-tag combos) + 10 (three-tag combos) + 10 (two-tag combos) + 5 (one-tag combos) = 31 rows in the view for that document.
That may be acceptable to you, considering that it's quite a powerful query. The disk usage may be acceptable (especially if you simply emit(tags, {_id: doc._id}) to minimize the data in the view), and you can use ?include_docs=true to get the full document later. The final thing to remember is to always emit the key array sorted, and always query it the same way, because you are emitting only tag combinations, not permutations.
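For illustration, a map function along these lines would emit those combinations (a sketch; it assumes a modest number of tags per document, since the number of rows grows as 2^n - 1):

// View map function: emit every non-empty, sorted combination of doc.tags,
// so that key=["tag2","tag3"] matches every document containing both tags.
function (doc) {
    if (!doc.tags || doc.tags.length === 0) {
        return;
    }
    var tags = doc.tags.slice().sort();
    var n = tags.length;
    // Enumerate all non-empty subsets with a bitmask (31 subsets for 5 tags).
    for (var mask = 1; mask < (1 << n); mask++) {
        var combo = [];
        for (var i = 0; i < n; i++) {
            if (mask & (1 << i)) {
                combo.push(tags[i]);
            }
        }
        emit(combo, null);
    }
}

You would then query the view with a sorted key, e.g. ?key=["tag2","tag3"]&include_docs=true. For the second question, one option is to build the combinations over the concatenation of tags and otherTags, at the cost of even more rows per document.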
That can get you quite far; however, it does not scale indefinitely. For full-blown arbitrary AND queries, you will indeed need to split into multiple queries, or else look into CouchDB-Lucene.
