Two FULLTEXT searches on an ArangoDB cluster: V8 is involved

I am investigating an ArangoDB cluster and found out that when two FULLTEXT() searches are used in one query, one of them involves the V8 engine.
My data:
[
  {
    "TITL": "Attacks induced by bromocryptin in Parkinson patients",
    "WORD": [
      "hascites",
      "Six patients with Parkinson's disease"
    ],
    "ID": 1
  },
  {
    "TITL": "Linear modeling of possible mechanisms for Parkinson tremor generation",
    "WORD": [
      "hascites",
      "jsubsetIM"
    ],
    "ID": 2
  },
  {
    "TITL": "Drug-induced parkinsonism in the rat- a model for biochemical ...",
    "WORD": [
      "hascites",
      "Following treatment with reserpine or alternatively with ...",
      "hasabstract"
    ],
    "ID": 3
  }
]
Simplest query:
FOR title IN FULLTEXT(pmshort, "TITL", "parkinson")
  FOR word IN FULLTEXT(pmshort, "WORD", "hascites")
    FILTER title.ID == word.ID
    RETURN title
In other words, I am trying to find all documents that have "parkinson" in TITL and "hascites" in WORD. This example is heavily simplified, so using something like
FILTER word.WORD == 'hascites'
is not an option; two or more FULLTEXT searches are required to provide the necessary functionality.
The collection contains about 520,000 documents, and a fulltext index is set up on each field.
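For reference, this is roughly how I run the query and inspect its execution plan from Python (a sketch assuming the python-arango driver; host, database, and credentials are placeholders):

from arango import ArangoClient

# Host, database and credentials below are placeholders.
client = ArangoClient(hosts="http://coordinator:8529")
db = client.db("mydb", username="root", password="")

query = """
FOR title IN FULLTEXT(pmshort, "TITL", "parkinson")
  FOR word IN FULLTEXT(pmshort, "WORD", "hascites")
    FILTER title.ID == word.ID
    RETURN title
"""

# Inspect the execution plan (the same information as db._explain(...) in arangosh)
print(db.aql.explain(query))

# Run the query itself
for doc in db.aql.execute(query):
    print(doc["TITL"])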
I found out that each FULLTEXT query, when run separately, uses the index:
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode DBS 1 * ROOT
5 IndexNode DBS 526577 - FOR title IN pmshort /* fulltext index scan */
8 RemoteNode COOR 526577 - REMOTE
9 GatherNode COOR 526577 - GATHER
4 ReturnNode COOR 526577 - RETURN title
But when both FOR loops are used, the first one is processed by V8 (JavaScript) and runs on the coordinator, not on a DB server:
Execution plan:
Id NodeType Site Est. Comment
1 SingletonNode COOR 1 * ROOT
2 CalculationNode COOR 1 - LET #2 = FULLTEXT(pmshort /* all collection documents */, "TITL", "parkinson") /* v8 expression */
3 EnumerateListNode COOR 100 - FOR title IN #2 /* list iteration */
10 ScatterNode COOR 100 - SCATTER
11 RemoteNode DBS 100 - REMOTE
9 IndexNode DBS 52657700 - FOR word IN pmshort /* fulltext index scan */
6 CalculationNode DBS 52657700 - LET #6 = (title.`ID` == word.`ID`) /* simple expression */ /* collections used: word : pmshort */
7 FilterNode DBS 52657700 - FILTER #6
12 RemoteNode COOR 52657700 - REMOTE
13 GatherNode COOR 52657700 - GATHER
8 ReturnNode COOR 52657700 - RETURN title
Of course, this slows the system down a lot.
So my questions are:
1. Why can't the ArangoDB cluster process both conditions on the DB servers (DBS) instead of on the coordinator (COOR)?
2. How can I avoid this situation, since performance drops 300-500 times?
3. Maybe somebody can point me to some additional material to read about this.
Any help is appreciated.
Thanks!

It looks like the query optimizer stops looking for further fulltext improvements after having applied one fulltext transformation in each query/subquery.
A potential fix for this can be found in this pull request (which targets 3.3.10).

Thanks a lot!
It should be available in 3.3.10 and the upcoming 3.4, right?

Related

Understanding the map eviction algorithm in Hazelcast

I'm using Hazelcast IMDG 3.12.6 (yes, I know it's too old) in my current projects. I'm trying to figure out how exactly the map eviction algorithm works.
First of all, I read two things:
https://docs.hazelcast.org/docs/3.12.6/manual/html-single/index.html#map-eviction
https://docs.hazelcast.org/docs/3.12.6/manual/html-single/index.html#eviction-algorithm
I found something confusing about map eviction, so let me describe it.
There is a class com.hazelcast.map.impl.eviction.EvictionChecker containing the method public boolean checkEvictable. This method checks whether a recordStore is evictable based on the max size policy:
switch (maxSizePolicy) {
    case PER_NODE:
        return recordStore.size() > toPerPartitionMaxSize(maxConfiguredSize, mapName);
    case PER_PARTITION:
        return recordStore.size() > maxConfiguredSize;
    // other cases...
I find it confusing that the PER_NODE policy checks toPerPartitionMaxSize, while the PER_PARTITION policy checks maxConfiguredSize. I would expect it to be the other way around.
If we dig into the history of how EvictionChecker was changed, we find something interesting. Looking at git blame, this class has been changed twice:
7 years ago
4 years ago
I think the PER_NODE and PER_PARTITION conditions should be swapped.
Can you please explain and confirm whether com.hazelcast.map.impl.eviction.EvictionChecker#checkEvictable behaves correctly when PER_NODE is used?
UPD
I've done some local tests. My configuration is:
<map name="ReportsPerNode">
<eviction-policy>LRU</eviction-policy>
<max-size policy="PER_NODE">500</max-size>
</map>
I tried to put 151 elements into the map. As a result, the map contains 112 elements and 39 elements are evicted by the max size policy. The calculation gives translatedPartitionSize = maxConfiguredSize * memberCount / partitionCount = 500 * 1 / 271 = 1.84.
If the PER_PARTITION policy is used, my test completes as expected.
I analyzed how the data is distributed over the RecordStores: every RecordStore contains between 1 and 4 elements.
According to the formula maxConfiguredSize * memberCount / partitionCount, this means maxConfiguredSize should be 1084 elements: 1084 * 1 / 271 = 4.
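To make the arithmetic easier to follow, here is a small sketch of the translation as I understand it (plain Python mirroring the formula above; the names are mine, not the actual Hazelcast identifiers):

def to_per_partition_max_size(max_configured_size, member_count, partition_count):
    # PER_NODE max size translated to a per-partition limit, per the formula above
    return max_configured_size * member_count / partition_count

# My configuration: max-size = 500, 1 member, 271 partitions
print(to_per_partition_max_size(500, 1, 271))   # ~1.84 -> eviction kicks in once a RecordStore holds 2+ entries
print(to_per_partition_max_size(1084, 1, 271))  # 4.0  -> matches the observed 1-4 entries per RecordStore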
As a result, I have two configurations:
PER_NODE works well when max-size = 1084: the map contains 151 elements as expected.
PER_PARTITION works well when max-size = 500; it also works well when max-size = 271. The map contains 151 elements as expected.
It seems that the data distribution over the RecordStores strongly depends on the key hash. Why can't we put one element per partition, if there are 271 partitions by default?
It's also unclear why I should need a map capacity of 1084 to store only 151 elements.

Ignoring write queries from different clients for the same row in Cassandra

I'm making a simple survey program.
INSERT INTO sosang_survey_list (partner_id, survey_id, content, coupon_id, end_date, respon_cnt, start_date, state, title)
VALUES (56177c09-cc8e-47fa-bfd6-fa316655ddde, f1ce1520-cdbd-11eb-b799-2112347ce13a,
        [{question: 'test', choice_list: [{answer: 'test', cnt: 0}, {answer: 'test2', cnt: 0}]}],
        null, '2021-06-15', 0, '2021-06-15', false, 'testeste');
I run the above write query from a client (Node.js) server that uses a different IP than mine. (The 'content' column is a nested list of user-defined types.)
Then my Spring server increments the 'cnt' inside the 'content' column by 1 through an UPDATE query for that row, but it doesn't work: the value appears incremented in the query response shown by Spring Boot, but 'cnt' is still 0 in the actual DB.
I thought it was a consistency problem, so I tried setting the consistency level to QUORUM on the Node server and ALL on my Spring server, but the symptom is the same.
The current server configuration has three nodes, each running on a different IP, with SimpleStrategy and a replication_factor of 3.
Please give me a hint.
Below is the 'content' column, composed of user-defined types:
[ {
"question" : "title",
"choice_list" : [ {
"answer" : "answer1",
"cnt" : 0
}, {
"answer" : "answer2",
"cnt" : 0
} ]
} ]
The reason cnt = 0 in the database is that you are explicitly setting the value to zero here:
INSERT INTO sosang_survey_list
(...)
VALUES (..., [{question:'test',choice_list:[{answer:'test',cnt:0},{answer:'test2',cnt:0}]}],...)
If you have multiple clients/app instances writing to the database, they could be overwriting the row with zero over and over again. Cheers!

Limit on the Gremlin query result set size in Cosmos DB

In a Cosmos DB graph I have a vertex labeled "student" which has the following properties.
Query:
g.V().hasLabel('student').valueMap().limit(1)
Output:
{
"StudentID": [
10000
],
"StudentName": [
"RITM0809903"
],
"Student_Age": [
"ritm0809903"
],
"Student_class": [
"Awaiting User Training"
],
"Student_long_description": [
"*******************HUGE STUDENT DESCRIPTION*****************************"
]
}
Note: "HUGE STUDENT DESCRIPTION" is a huge description about a student.
Total number of student vertices available are 9.
i am using gremlinpython module to hit the query on cosmosdb and fetch the query results.
but when i try to do valueMap('StudentID','StudentName','Student_long_description') and get all the 9 vertices("g.V().hasLabel('student').valueMap('StudentID','StudentName','Student_long_description')") in output i am only able to see the 7 vertices , but when i exclude the property ""Student_long_description" i am able to see all 9 vertices.
is it because of the limit on the result set size.
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/querylimits
But when I append fold() at the end (g.V().hasLabel('student').valueMap('StudentID','StudentName','Student_long_description').fold()), I can see all 9 vertices along with the property "Student_long_description", but folded.
Please let me know if there is any option I can use to get all 9 vertices with all their properties without using fold in the query.
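For context, this is roughly how the queries are submitted with gremlinpython (a sketch; the endpoint, database, graph, and key below are placeholders):

from gremlin_python.driver import client, serializer

# Endpoint, database, graph and key are placeholders, not real values.
gremlin_client = client.Client(
    "wss://<account>.gremlin.cosmosdb.azure.com:443/", "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# The query that only returns 7 of the 9 vertices:
query = ("g.V().hasLabel('student')"
         ".valueMap('StudentID','StudentName','Student_long_description')")
results = gremlin_client.submit(query).all().result()
print(len(results))

# The fold() variant that returns all 9 vertices, folded into a single list:
folded = gremlin_client.submit(query + ".fold()").all().result()
print(len(folded[0]))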

Trending hot and popular topics in an application

I created a simple mobile app; the backend is built with Node.js, Express, and MongoDB (Mongoose).
My app has topics with fields such as:
_id, title, subtitle, details,...,tags[ ], viewCount, likeCount, shareCount, commentsCount, updatedAt, createdAt
Now I want to find trending topics (topics that are currently hot and popular), mainly based on viewCount, likeCount, shareCount, and commentsCount (other suggestions are welcome).
Currently I am using the following formula:
popularity = viewCount * 1 + likeCount * 2 + shareCount * 2 + commentsCount * 3
But this is naive, as it does not take into account the main point of trending, which is recency (what is hot right now).
Any suggestions on how to improve my formula to get the desired results?
Note: I am willing to add or modify fields in my database
This is what I finally came up with:
popularity =
(viewCount * 1 + likeCount * 2 + shareCount * 2 + commentsCount * 3) / actionTimestamp
where actionTimestamp is updated every time one of viewCount, likeCount, shareCount, or commentsCount changes.
I know this is not solid; any help is appreciated.
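For clarity, here is a small Python sketch of both scoring functions described above (field names follow the schema in the question; this just restates the formulas, it is not a recommendation):

import time

def popularity(topic):
    # Original weighted formula from the question
    return (topic["viewCount"] * 1
            + topic["likeCount"] * 2
            + topic["shareCount"] * 2
            + topic["commentsCount"] * 3)

def popularity_with_recency(topic):
    # actionTimestamp is assumed to be a Unix timestamp (seconds) refreshed on every
    # view/like/share/comment, as described in the update above.
    return popularity(topic) / topic["actionTimestamp"]

topic = {"viewCount": 120, "likeCount": 15, "shareCount": 4,
         "commentsCount": 9, "actionTimestamp": int(time.time())}
print(popularity(topic), popularity_with_recency(topic))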

MongoDB: How to retrieve a large number of documents in successive queries

This is what I want to achieve:
I have a collection with a large number of documents. A user will query a certain field and this will return a large number of documents.
But for bandwidth/processing reasons, I don't want to fetch all the documents and send them to the user (browser) at once.
Let's say the user makes a GET request with a search field that yields 1000 results, but I want to send only the first 10 to the user. The user would then request the next 10, and so on.
Is there a way to achieve this in MongoDB? Can I query with a certain counter and increment it in successive queries?
What is a good way to achieve this mechanism?
Thank you.
MongoDB natively supports the paging operation using the skip() and limit() commands.
// first 10 results -> page 1
collection.find().limit(10)
// next 10 results -> page 2
collection.find().skip(10).limit(10)
// next 10 results -> page 3
collection.find().skip(20).limit(10)
Here, number represents the page (1, 2, 3, ...); number = 1 returns the first 10 records:
db.students.find().skip(number > 0 ? ((number-1)*10) : 0).limit(10);
// first 10 records, number = 1
db.students.find().skip(1 > 0 ? ((1-1)*10) : 0).limit(10);
// next 10 records, number = 2
db.students.find().skip(2 > 0 ? ((2-1)*10) : 0).limit(10);
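If the client is in Python, the same skip/limit paging could look like this with pymongo (a sketch; the connection string, database, and collection names are assumptions):

from pymongo import MongoClient

# Connection string, database and collection names below are assumptions.
client = MongoClient("mongodb://localhost:27017")
students = client["school"]["students"]

def get_page(page_number, page_size=10):
    # page_number = 1 -> documents 1-10, page_number = 2 -> documents 11-20, ...
    skip = (page_number - 1) * page_size if page_number > 0 else 0
    return list(students.find().skip(skip).limit(page_size))

page1 = get_page(1)
page2 = get_page(2)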
