Counting string occurrences with ArangoDB AQL - arangodb

To count the number of objects containing a specific attribute value I can do something like:
FOR t IN thing
  COLLECT other = t.name == "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount
But how can I count three other occurrences within the same query, without resorting to subqueries that run through the same dataset multiple times?
I've tried something like:
FOR t IN thing
  COLLECT
    other = t.name == "Other",
    some = t.name == "Some",
    thing = t.name == "Thing"
  WITH COUNT INTO count
  RETURN {
    other, some, thing,
    count
  }
But I can't make sense of the results: am I approaching this in the wrong way?

Split and count
You could split the string by the phrase and subtract 1 from the count. This works for any substring, which on the other hand means it does not respect word boundaries.
LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some")) - 1
  LET Other = LENGTH(SPLIT(t.name, "Other")) - 1
  LET Thing = LENGTH(SPLIT(t.name, "Thing")) - 1
  RETURN {
    Some, Other, Thing
  }
Result:
[
  {
    "Some": 3,
    "Other": 2,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]
You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.
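The split-and-count trick is easy to mirror outside AQL as well; a minimal JavaScript sketch (the function name is mine):

```javascript
// Splitting a string by a needle yields (number of occurrences + 1) parts,
// so subtracting 1 gives the substring count, same idea as AQL's SPLIT().
function countOccurrences(haystack, needle) {
  return haystack.split(needle).length - 1;
}

countOccurrences("Here are SomeSome and Some Other Things, brOther!", "Some"); // 3
```

As with the AQL version, lowercasing both arguments first makes it case-insensitive.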
COLLECT words
The TOKENS() function can be utilized to split the input into word arrays, which can then be grouped and counted. Note that I changed the input slightly: an input "SomeSome" will not be counted, because "somesome" != "some" (this variant is word-based, not substring-based).
LET things = [
  {name: "Here are SOME some and Some Other Things. More Other!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")

FOR t IN things
  LET whitelisted = (
    FOR w IN TOKENS(t.name, "text_en")
      FILTER w IN whitelist
      RETURN w
  )
  LET counts = MERGE(
    FOR w IN whitelisted
      COLLECT word = w WITH COUNT INTO count
      RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things || 0
  }
Result:
[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]
This does use a subquery for the COLLECT, otherwise it would count the total number of occurrences for the entire input.
The whitelist step is not strictly necessary; you could also let it count all words. For larger input strings it might save some memory not to do this for words you are not interested in anyway.
You might want to create a separate Analyzer with stemming disabled for the language if you want to match the words precisely. You can also turn off normalization ("accent": true, "case": "none"). An alternative would be to use REGEX_SPLIT() for typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.
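For comparison, the same word-based counting can be done outside the database. This JavaScript sketch lowercases and splits on non-word characters instead of using an Analyzer, so there is no stemming ("Things" stays "things", so the numbers differ slightly from the text_en results above); the function name is mine:

```javascript
// Tokenize, keep only whitelisted words, and tally them into an object.
function countWords(text, whitelist) {
  const allowed = new Set(whitelist.map(w => w.toLowerCase()));
  const counts = {};
  for (const w of allowed) counts[w] = 0; // start every whitelisted word at 0
  for (const word of text.toLowerCase().split(/\W+/)) {
    if (allowed.has(word)) counts[word]++;
  }
  return counts;
}
```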
Other solutions
I don't think that it's possible to count each input object independently with COLLECT without a subquery, unless you want a total count.
Splitting is a bit of a hack, but you could substitute SPLIT() with REGEX_SPLIT() and wrap the search phrases in \b to only match if word boundaries are on both sides. Then it should only match words (more or less):
LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b")) - 1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b")) - 1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b")) - 1
  RETURN {
    Some, Other, Thing
  }
Result:
[
  {
    "Some": 1,
    "Other": 1,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]
A more elegant solution would be to utilize ArangoSearch for word counting, but it doesn't have a feature to let you retrieve how often a word occurs. It might keep track of that already internally (Analyzer feature "frequency"), but it's definitely not exposed at this point in time.
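The \b word-boundary idea also translates directly to a regex match in JavaScript (a sketch; it assumes the search word contains no regex metacharacters):

```javascript
// Count whole-word matches only: "SomeSome" and "brOther" don't count.
function countWord(haystack, word) {
  const matches = haystack.match(new RegExp(`\\b${word}\\b`, "g"));
  return matches ? matches.length : 0;
}
```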

Related

force error when parsing "01" from string to number in rust

I have strings like this:
"32" or "28", "01", "001"
and I want to parse them to numbers.
However, it should not parse a string that starts with 0.
Currently, I'm doing this:
let num = s.parse().unwrap_or(-1);
With this implementation it converts "01" to 1, but I want to force -1 when the string starts with 0.
As mentioned in the comments, you could use this:
let num = if s.len() > 1 && s.starts_with('0') {
    -1
} else {
    s.parse().unwrap_or(-1)
};

Terraform Range function to start from 1 instead of 0

Is there a way to make Terraform's range function start from 1 instead of 0, or is there any other function or way to achieve the end result?
Let's say I have code as seen below.
variable "nodes" {
  default = 1
}

locals {
  node_range = range(var.nodes)
}
This returns the following output.
[
0
]
I would like to be able to get the output as shown below (pseudo code)
[
1
]
The reason I would like to have it this way is that we cannot use count.index + 1 in for_each resources. Hence, if I get a list from the range function that starts from 1, I can simply use it in other places.
I have name tags that should start from myec2instance01, myec2instance02, etc. But if the range starts from 0, the tag numbering starts from 00 (myec2instance00).
Any other way to achieve the end result is also accepted as a valid solution.
The first argument to range is start. So you could do the following for example:
variable "nodes" {
  default = 5
}

locals {
  node_range = range(1, var.nodes + 1)
}

output "out" {
  value = local.node_range
}
which gives:
out = [
  1,
  2,
  3,
  4,
  5,
]

Query ArangoDB for Arrays

I am having a problem querying ArangoDB from Java for a value among arrays. I have tried with both String[] and ArrayList, both with no success.
My query:
FOR document IN documents FILTER @categoriesArray IN document.categories[*].title RETURN document
BindParams:
Map<String, Object> bindVars = new MapBuilder().put("categoriesArray", categoriesArray).get();
categoriesArray contains a bunch of Strings. I'm not sure why it isn't returning any results, because if I query using:
FOR document IN documents FILTER "Politics" IN document.categories[*].title RETURN document
I get the results I am looking for. Just not when using an Array or ArrayList.
I also tried querying for:
FOR document IN documents FILTER ["Politics", "Law"] IN document.categories[*].title RETURN document
in order to emulate an ArrayList, but this doesn't return any results. I would query using a bunch of individual Strings, but there are too many and I get an error from the Java driver when querying with a String that long. Thus, I must query using an Array or ArrayList.
An example of the categoriesArray:
["Politics", "Law", "Nature"]
The reason is that the IN operator works by searching for the value on its left-hand side in each member of the array on the right side.
The following query will work if "Politics" is a member of document.categories[*].title:
FOR document IN documents FILTER "Politics" IN document.categories[*].title RETURN document
However, the following query will not work even if "Politics" is a member of document.categories[*].title:
FOR document IN documents FILTER [ "Politics", "Law" ] IN document.categories[*].title RETURN document
This is because it searches for the exact value [ "Politics", "Law" ] among the members on the right side, and that value will not be present. What you are probably looking for is a comparison that looks for "Politics" and "Law" separately, e.g.:
FOR document IN documents
  LET contained = (
    FOR title IN [ "Politics", "Law" ] /* or @categoriesArray */
      FILTER title IN document.categories[*].title
      RETURN title
  )
  FILTER LENGTH(contained) > 0
  RETURN document
ArangoDB also (now) has array comparison operators, which allow searching with ALL IN, ANY IN, or NONE IN:
[ 1, 2, 3 ] ALL IN [ 2, 3, 4 ] // false
[ 1, 2, 3 ] ALL IN [ 1, 2, 3 ] // true
[ 1, 2, 3 ] NONE IN [ 3 ] // false
[ 1, 2, 3 ] NONE IN [ 23, 42 ] // true
[ 1, 2, 3 ] ANY IN [ 4, 5, 6 ] // false
[ 1, 2, 3 ] ANY IN [ 1, 42 ] // true
[ 1, 2, 3 ] ANY == 2 // true
[ 1, 2, 3 ] ANY == 4 // false
[ 1, 2, 3 ] ANY > 0 // true
[ 1, 2, 3 ] ANY <= 1 // true
[ 1, 2, 3 ] NONE < 99 // false
[ 1, 2, 3 ] NONE > 10 // true
[ 1, 2, 3 ] ALL > 2 // false
[ 1, 2, 3 ] ALL > 0 // true
[ 1, 2, 3 ] ALL >= 3 // false
["foo", "bar"] ALL != "moo" // true
["foo", "bar"] NONE == "bar" // false
["foo", "bar"] ANY == "foo" // true
So you could now filter by:
FOR document IN documents
  FILTER ["Politics", "Law"] ANY IN (document.categories[*].title)[**]
  RETURN document
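If it helps to reason about these operators, their semantics map onto plain JavaScript array methods (the variable names are mine):

```javascript
const titles = ["Politics", "Law", "Nature"]; // like document.categories[*].title
const wanted = ["Politics", "Law"];

const anyIn  = wanted.some(w => titles.includes(w));   // wanted ANY IN titles  -> true
const allIn  = wanted.every(w => titles.includes(w));  // wanted ALL IN titles  -> true
const noneIn = wanted.every(w => !titles.includes(w)); // wanted NONE IN titles -> false
```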

MongoDB Filter data points within an interval

I have a database query that selects all documents having a timestamp field (tmp) falling in a certain range, like so
{ tmp: { '$gte': 1411929000000, '$lte': 1419010200000 } }
This query returns a large number of records, say 10000.
Objective:
To fetch documents in the same interval range, but separated by, say, a 1-hour timestamp interval in between, and hence reduce the number of records fetched.
Is there a way of doing this entirely using MongoDB query system?
Due to an NDA I cannot show the code, but it basically contains stock exchange data (say in 1-minute intervals). The objective is to send a sample of these data between two endpoints (times). But the client can ask for 5-minute interval data, or 10-minute, or 1-hour, etc., so from these 1-minute interval data I need to sample and send only the relevant ones. Hope that makes it clearer.
Any comments would be very helpful. Thanks.
There's no way to accomplish your objective directly, but you can do something very close. Given a range of time [s, t] and a separation p, you're looking for approximately (t - s) / p documents evenly spread over the range, to give a "zoomed-out" sense of the data. Pick x, ideally small compared to p, large enough to contain documents but small enough not to contain very many, and look for documents within an interval of width x around evenly spaced points separated by p. You can do this with a single $or query or with a series of queries. For example, simplifying using integers instead of dates, if I have a field score with values in the range [0, 50] and want a resolution of p = 10, I'll look at intervals of width x = 1 around points separated by 10:
db.test.find({ "$or" : [
  { "score" : { "$gte" : 0, "$lte" : 1 } },
  { "score" : { "$gte" : 9, "$lte" : 11 } },
  { "score" : { "$gte" : 19, "$lte" : 21 } },
  { "score" : { "$gte" : 29, "$lte" : 31 } },
  { "score" : { "$gte" : 39, "$lte" : 41 } },
  { "score" : { "$gte" : 49, "$lte" : 50 } },
] })
Alternatively, you could break this into 6 queries ((t - s) / p + 1 in general) and limit each to 1 result.
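Generating that $or clause by hand gets tedious; here is a small JavaScript sketch that builds it for arbitrary s, t, p and x (the function name is mine):

```javascript
// Build intervals of width 2*x around points spaced p apart across [s, t],
// clamped to the range, as the body of a $or query.
function sampledRangeQuery(field, s, t, p, x) {
  const clauses = [];
  for (let point = s; point <= t; point += p) {
    clauses.push({
      [field]: { $gte: Math.max(s, point - x), $lte: Math.min(t, point + x) }
    });
  }
  return { $or: clauses };
}
```

sampledRangeQuery("score", 0, 50, 10, 1) reproduces the six intervals of the example above.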
There are a couple of other higher-level ways to approach your problem. I'd suggest looking at the following two schema design articles from the MongoDB Manual:
Pre-Aggregated Reports
Hierarchical Aggregation

Longest Substring Pair Sequence is it Longest Common Subsequence or what?

I have a pair of strings, for example abcabcabc and abcxxxabc, and a List of Common Substring Pairs (LCSP); in this case the LCSP has 6 pairs, because the three abc in the first string map to the two abc in the second string. Now I need to find the longest valid (incrementing) sequence of pairs. In this case there are three equally long solutions: 0:0,3:6; 0:0,6:6; 3:0,6:6 (those numbers are the starting positions of each pair in the original strings; the length of the substrings is 3, the length of "abc"). I would call it the Longest Substring Pair Sequence, or LSPQ (the Q is to avoid confusing String and Sequence).
Here is the LCSP for this example:
LCSP('abcabcabc', 'abcxxxabc') =
[ [ 6, 6, 3 ],
[ 6, 0, 3 ],
[ 3, 6, 3 ],
[ 0, 6, 3 ],
[ 3, 0, 3 ],
[ 0, 0, 3 ] ]
LSPQ(LCSP('abcabcabc', 'abcxxxabc'), 0, 0, 0) =
[ { a: 0, b: 0, size: 3 }, { a: 3, b: 6, size: 3 } ]
Now I find it with brute force, recursively trying all combinations, so I am limited to about 25 pairs; beyond that it is impractical (sizes [10, 15, 20, 25, 26, 30] take roughly [0, 15, 300, 1000, 2000, 19000] ms).
Is there a way to do this in linear time, or at least with less than quadratic complexity, so that longer input LCSPs (Lists of Common Substring Pairs) could be used?
This problem is similar to the "Longest Common Subsequence", but not exactly it, because the input is not two strings but a list of common substrings sorted by their length. So I do not know where to look for an existing solutions or even if they exist.
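One direction worth trying (not from the original post): treat this as a two-dimensional variant of weighted interval scheduling and solve it with an O(n²) dynamic program. That is still quadratic, not the linear time asked for, but it replaces the exponential branching of the brute force. A sketch:

```javascript
// pairs are [startA, startB, length] triples, as in the LCSP list above.
// dp[i] = best total chained length of any valid sequence ending at P[i].
function bestChainLength(pairs) {
  const P = [...pairs].sort((a, b) => a[0] - b[0] || a[1] - b[1]);
  const dp = P.map(p => p[2]);
  for (let i = 0; i < P.length; i++) {
    for (let j = 0; j < i; j++) {
      // P[j] may precede P[i] if it ends before P[i] starts in both strings.
      if (P[j][0] + P[j][2] <= P[i][0] && P[j][1] + P[j][2] <= P[i][1]) {
        dp[i] = Math.max(dp[i], dp[j] + P[i][2]);
      }
    }
  }
  return pairs.length ? Math.max(...dp) : 0;
}
```

For the LCSP('abcabcabc', 'abcxxxabc') example this yields 6, matching the two-pair chains listed above (recording a predecessor index alongside dp[i] would recover the chain itself).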
Here is my particular code (JavaScript):
function getChainSize(T) {
  var R = 0
  for (var i = 0; i < T.length; i++) R += T[i].size
  return R
}

function LSPQ(T, X, Y, id) {
  // X, Y are the first unused characters in str1, str2
  // id is the current pair
  function findNextPossible() {
    var x = id
    while (x < T.length) {
      if (T[x][0] >= X && T[x][1] >= Y) return x
      x++
    }
    return -1
  }
  var id = findNextPossible()
  if (id < 0) return []
  var C = [{ a: T[id][0], b: T[id][1], size: T[id][2] }]
  // with current
  var o = T[id]
  var A = C.concat(LSPQ(T, o[0] + o[2], o[1] + o[2], id + 1))
  // without current
  var B = LSPQ(T, X, Y, id + 1)
  if (getChainSize(A) < getChainSize(B)) return B
  return A
}
