Finding documents whose time range intersects a given time range - CouchDB
There is a set of documents P, each carrying two timestamps that represent the time range [P_s, P_e] in which the document is valid. An index over these intervals was created:
function (doc) {
emit([doc.start, doc.end], someStuff(doc));
}
We want to retrieve the documents P which start at or before some end timestamp E and end at or after some start timestamp S:
P(S, E) = { P | P_s <= E && P_e >= S }
For instance, in a picture like this
<-- TIME -->
..------------------S-------------------------------------E----------------------..
.. P0 ][ P1 ][ P2 ][ P3 ][ P4 ][ P5 ][ P6 ..
we expect the subset {P1, P2, P3, P4} as the result. We tried to get the desired result using the following key range:
_view/range?descending=false&startkey=[0,S]&endkey=[E,{}]
The result P(S, E) = {P0, P1, P2, P3, P4} is wrong, which makes sense when checking the following example for S=17 and E=30:
key        check: startkey <= key <= endkey            accept
_________________________________________________________________________
[10,15] [0,17] <= [10,15] <= [30, {}] -> True <- This is wrong
[15,25] [0,17] <= [15,25] <= [30, {}] -> True OK
[25,30] [0,17] <= [25,30] <= [30, {}] -> True OK
[25,50] [0,17] <= [25,50] <= [30, {}] -> True OK
[35,50] [0,17] <= [35,50] <= [30, {}] -> False OK
Is it possible to define a range such that we get the desired result?
This is much easier to achieve using the POST /db/_find endpoint. You can express your query as a selector:
{
  "selector": {
    "start": { "$lt": 100 },
    "end": { "$gt": 300 }
  },
  "sort": ["start"]
}
This is the equivalent of the SQL SELECT * FROM db WHERE start < 100 AND end > 300 ORDER BY start.
You will almost certainly need an index on "start" as well to speed things up.
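Note that the question's condition is inclusive (P_s <= E and P_e >= S), so with those bounds the operators would be "$lte" and "$gte" rather than "$lt" and "$gt". To create the index, you can POST to the /db/_index endpoint; a minimal sketch (the index name "start-index" is only an example):
{
  "index": { "fields": ["start"] },
  "name": "start-index",
  "type": "json"
}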
Related
Transpose CSV data using nodejs
"A",1,2,3,4 "B",1,2,3,4 "C",1,2,3,4 I want to transpose and get the output as "A""B""C" 111 222 333 444
Hi sai kiran bandari! Please provide more information for your next question. I expect you to have a two-dimensional array, for which I made a solution. You iterate through the array using two loops and split the values into a new two-dimensional array according to the first array index i, in this case.

const data = [['A', 1, 2, 3, 4], ['B', 1, 2, 3, 4], ['C', 1, 2, 3, 4]]
const transposed = []

// Iterate through the 2-dimensional array..
for (var i = 0; i < data.length; i++) {
  var arr = data[i]
  // Iterate through the inner array
  for (let p = 0; p < arr.length; p++) {
    // Create a new inner array if there is not already one at the destination..
    if (Array.isArray(transposed[p]) === false) {
      transposed[p] = []
    }
    // We want to take all the values of the first array and split them
    // up into a single array each. In the second iteration, all the
    // values will be split again and you have your transpose.
    transposed[p].push(arr[p])
  }
}

console.log(transposed)
/* Output => [ [ 'A', 'B', 'C' ], [ 1, 1, 1 ], [ 2, 2, 2 ], [ 3, 3, 3 ], [ 4, 4, 4 ] ] */
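For comparison, here is a more compact sketch of the same transpose using Array.prototype.map. It is not from the original answer; it assumes the input is a rectangular two-dimensional array like data above, and the helper name transpose is made up here.

// transpose() is a hypothetical helper, not part of the original answer.
function transpose(data) {
  // For each column index of the first row, collect that column from every row.
  return data[0].map((_, col) => data.map(row => row[col]))
}

console.log(transpose([['A', 1, 2, 3, 4], ['B', 1, 2, 3, 4], ['C', 1, 2, 3, 4]]))
// => [ [ 'A', 'B', 'C' ], [ 1, 1, 1 ], [ 2, 2, 2 ], [ 3, 3, 3 ], [ 4, 4, 4 ] ]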
Counting string occurrences with ArangoDB AQL
To count the number of objects containing a specific attribute value I can do something like:

FOR t IN thing
  COLLECT other = t.name == "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount

But how can I count three other occurrences within the same query, without resorting to subqueries running through the same dataset multiple times? I've tried something like:

FOR t IN thing
  COLLECT other = t.name == "Other",
          some = t.name == "Some",
          thing = t.name == "Thing"
  WITH COUNT INTO count
  RETURN { other, some, thing, count }

But I can't make sense of the results: I must be approaching this in the wrong way?
Split and count

You could split the string by the phrase and subtract 1 from the count. This works for any substring, which on the other hand means it does not respect word boundaries.

LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some"))-1
  LET Other = LENGTH(SPLIT(t.name, "Other"))-1
  LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
  RETURN { Some, Other, Thing }

Result:

[
  { "Some": 3, "Other": 2, "Thing": 1 },
  { "Some": 0, "Other": 0, "Thing": 0 },
  { "Some": 0, "Other": 1, "Thing": 0 }
]

You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.

COLLECT words

The TOKENS() function can be utilized to split the input into word arrays, which can then be grouped and counted. Note that I changed the input slightly. An input "SomeSome" will not be counted because "somesome" != "some" (this variant is word-based and not substring-based).

LET things = [
  {name: "Here are SOME some and Some Other Things. More Other!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]

LET whitelist = TOKENS("Some Other Things", "text_en")

FOR t IN things
  LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
  LET counts = MERGE(
    FOR w IN whitelisted
      COLLECT word = w WITH COUNT INTO count
      RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things || 0
  }

Result:

[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]

This does use a subquery for the COLLECT, otherwise it would count the total number of occurrences for the entire input.

The whitelist step is not strictly necessary, you could also let it count all words. For larger input strings it might save some memory to not do this for words you are not interested in anyway.

You might want to create a separate Analyzer with stemming disabled for the language if you want to match the words precisely. You can also turn off normalization ("accent": true, "case": "none"). An alternative would be to use REGEX_SPLIT() for typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.

Other solutions

I don't think that it's possible to count each input object independently with COLLECT without a subquery, unless you want a total count.

Splitting is a bit of a hack, but you could substitute SPLIT() with REGEX_SPLIT() and wrap the search phrases in \b to only match if word boundaries are on both sides. Then it should only match words (more or less):

LET things = [
  {name: "Here are SomeSome and Some Other Things, brOther!"},
  {name: "There are no such substrings in here."},
  {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
  RETURN { Some, Other, Thing }

Result:

[
  { "Some": 1, "Other": 1, "Thing": 1 },
  { "Some": 0, "Other": 0, "Thing": 0 },
  { "Some": 0, "Other": 1, "Thing": 0 }
]

A more elegant solution would be to utilize ArangoSearch for word counting, but it doesn't have a feature to let you retrieve how often a word occurs.
It might keep track of that already internally (Analyzer feature "frequency"), but it's definitely not exposed at this point in time.
Longest Substring Pair Sequence: is it Longest Common Subsequence or what?
I have a pair of strings, for example abcabcabc and abcxxxabc, and a List of Common Substring Pairs (LCSP). In this case the LCSP is 6 pairs, because three abc in the first string map to two abc in the second string. Now I need to find the longest valid (incrementing) sequence of pairs; in this case there are three equally long solutions: 0:0,3:6; 0:0,6:6; 3:0,6:6 (those numbers are the starting positions of each pair in the original strings, and the length of the substrings is 3, the length of "abc"). I would call it the Longest Substring Pair Sequence or LSPQ. (Q is so as not to confuse String and Sequence.)

Here is the LCSP for this example:

LCSP('abcabcabc', 'abcxxxabc') = [
  [ 6, 6, 3 ],
  [ 6, 0, 3 ],
  [ 3, 6, 3 ],
  [ 0, 6, 3 ],
  [ 3, 0, 3 ],
  [ 0, 0, 3 ]
]

LSPQ(LCSP('abcabcabc', 'abcxxxabc'), 0, 0, 0) = [
  { a: 0, b: 0, size: 3 },
  { a: 3, b: 6, size: 3 }
]

Now I find it with brute force, recursively trying all combinations, so I am limited to about 25 pairs, otherwise it is impractical. Size = [10, 15, 20, 25, 26, 30], time ms = [0, 15, 300, 1000, 2000, 19000].

Is there a way to do that in linear time, or at least not quadratic complexity, so that a longer input LCSP (List of Common Substring Pairs) could be used? This problem is similar to the "Longest Common Subsequence" problem, but not exactly it, because the input is not two strings but a list of common substrings sorted by their length. So I do not know where to look for existing solutions or even whether they exist.

Here is my particular code (JavaScript):

function getChainSize(T) {
  var R = 0
  for (var i = 0; i < T.length; i++) R += T[i].size
  return R
}

function LSPQ(T, X, Y, id) {
  // X,Y are the first unused characters in str1,str2
  // id is the current pair
  function findNextPossible() {
    var x = id
    while (x < T.length) {
      if (T[x][0] >= X && T[x][1] >= Y) return x
      x++
    }
    return -1
  }

  var id = findNextPossible()
  if (id < 0) return []

  var C = [{ a: T[id][0], b: T[id][1], size: T[id][2] }]

  // with current
  var o = T[id]
  var A = C.concat(LSPQ(T, o[0] + o[2], o[1] + o[2], id + 1))

  // without current
  var B = LSPQ(T, X, Y, id + 1)

  if (getChainSize(A) < getChainSize(B)) return B
  return A
}
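Not from the original question, but as an illustration of a standard alternative: this can be treated as a maximum-weight chain problem over the pairs and solved with dynamic programming in O(n^2) time, which is quadratic rather than linear but avoids the exponential with/without branching above. The sketch below assumes the same [a, b, size] triples that LCSP returns, and the name lspqDP is made up here.

function lspqDP(pairs) {
  if (pairs.length === 0) return []
  // Sort by start position in the first string, then by start in the second.
  var P = pairs.slice().sort(function (x, y) { return x[0] - y[0] || x[1] - y[1] })
  var best = P.map(function (p) { return p[2] }) // best[i]: largest total size of a chain ending with pair i
  var prev = P.map(function () { return -1 })    // predecessor of pair i in that chain
  for (var i = 0; i < P.length; i++) {
    for (var j = 0; j < i; j++) {
      // Pair j may precede pair i only if it ends before pair i starts in BOTH strings.
      if (P[j][0] + P[j][2] <= P[i][0] &&
          P[j][1] + P[j][2] <= P[i][1] &&
          best[j] + P[i][2] > best[i]) {
        best[i] = best[j] + P[i][2]
        prev[i] = j
      }
    }
  }
  // Find the best chain end and walk the predecessors back to reconstruct it.
  var end = 0
  for (var k = 1; k < P.length; k++) if (best[k] > best[end]) end = k
  var chain = []
  for (var c = end; c !== -1; c = prev[c]) {
    chain.unshift({ a: P[c][0], b: P[c][1], size: P[c][2] })
  }
  return chain
}

// Example with the LCSP from above:
// lspqDP([[6,6,3],[6,0,3],[3,6,3],[0,6,3],[3,0,3],[0,0,3]])
// => [ { a: 0, b: 0, size: 3 }, { a: 3, b: 6, size: 3 } ]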
How to find "nearest" value in a large list in Erlang
Suppose I have a large collection of integers (say 50,000,000 of them). I would like to write a function that returns me the largest integer in the collection that doesn't exceed a value passed as a parameter to the function. E.g. if the values were: Values = [ 10, 20, 30, 40, 50, 60] then find(Values, 25) should return 20. The function will be called many times a second and the collection is large. Assuming that the performance of a brute-force search is too slow, what would be an efficient way to do it? The integers would rarely change, so they can be stored in a data structure that would give the fastest access. I've looked at gb_trees but I don't think you can obtain the "insertion point" and then get the previous entry. I realise I could do this from scratch by building my own tree structure, or binary chopping a sorted array, but is there some built-in way to do it that I've overlooked?
To find the nearest value in a large unsorted list, I'd suggest you use a divide and conquer strategy and process different parts of the list in parallel. Small enough parts of the list may be processed sequentially. Here is code for you:

-module( finder ).
-export( [ nearest/2 ] ).
-define( THRESHOLD, 1000 ).

%%
%% sequential finding of nearest value
%%
%% if nearest value doesn't exist - return null
%%
nearest( Val, List ) when length(List) =< ?THRESHOLD ->
    lists:foldl(
        fun
            ( X, null ) when X < Val -> X;
            ( _X, null ) -> null;
            ( X, Nearest ) when X < Val, X > Nearest -> X;
            ( _X, Nearest ) -> Nearest
        end,
        null, List );
%%
%% split large lists and process each part in parallel
%%
nearest( Val, List ) ->
    { Left, Right } = lists:split( length(List) div 2, List ),
    Ref1 = spawn_nearest( Val, Left ),
    Ref2 = spawn_nearest( Val, Right ),
    Nearest1 = receive_nearest( Ref1 ),
    Nearest2 = receive_nearest( Ref2 ),
    %%
    %% compare nearest values from each part
    %%
    case { Nearest1, Nearest2 } of
        { null, null } -> null;
        { null, Nearest2 } -> Nearest2;
        { Nearest1, null } -> Nearest1;
        { Nearest1, Nearest2 } when Nearest2 > Nearest1 -> Nearest2;
        { Nearest1, Nearest2 } when Nearest2 =< Nearest1 -> Nearest1
    end.

spawn_nearest( Val, List ) ->
    Ref = make_ref(),
    SelfPid = self(),
    spawn( fun() -> SelfPid ! { Ref, nearest( Val, List ) } end ),
    Ref.

receive_nearest( Ref ) ->
    receive { Ref, Nearest } -> Nearest end.

Testing in shell:

1> c(finder).
{ok,finder}
2> List = [ random:uniform(1000) || _X <- lists:seq(1,100000) ].
[444,724,946,502,312,598,916,667,478,597,143,210,698,160,
 559,215,458,422,6,563,476,401,310,59,579,990,331,184,203|...]
3> finder:nearest( 500, List ).
499
4> finder:nearest( -100, lists:seq(1,100000) ).
null
5> finder:nearest( 40000, lists:seq(1,100000) ).
39999
6> finder:nearest( 4000000, lists:seq(1,100000) ).
100000

Performance (single node):

7> timer:tc( finder, nearest, [ 40000, lists:seq(1,10000) ] ).
{3434,10000}
8> timer:tc( finder, nearest, [ 40000, lists:seq(1,100000) ] ).
{21736,39999}
9> timer:tc( finder, nearest, [ 40000, lists:seq(1,1000000) ] ).
{314399,39999}

Versus plain iterating:

1> timer:tc( lists, foldl, [ fun(_X, Acc) -> Acc end, null, lists:seq(1,10000) ] ).
{14994,null}
2> timer:tc( lists, foldl, [ fun(_X, Acc) -> Acc end, null, lists:seq(1,100000) ] ).
{141951,null}
3> timer:tc( lists, foldl, [ fun(_X, Acc) -> Acc end, null, lists:seq(1,1000000) ] ).
{1374426,null}

So, you may see that on a list with 1000000 elements, the function finder:nearest is faster than plainly iterating through the list with lists:foldl. You may find the optimal value of THRESHOLD for your case. You may also improve performance if you spawn processes on different nodes.
Here is another code sample that uses ets. I believe a lookup would be made in about constant time:

1> ets:new(tab,[named_table, ordered_set, public]).
2> lists:foreach(fun(N) -> ets:insert(tab,{N,[]}) end, lists:seq(1,50000000)).
3> timer:tc(fun() -> ets:prev(tab, 500000) end).
{21,499999}
4> timer:tc(fun() -> ets:prev(tab, 41230000) end).
{26,41229999}

The surrounding code would be a bit more than this, of course, but it is rather neat.
So if the input isn't sorted, you can get a linear version by doing:

closest(Target, [Hd | Tl ]) ->
    closest(Target, Tl, Hd).

closest(_Target, [], Best) ->
    Best;
closest(Target, [ Target | _ ], _) ->
    Target;
closest(Target, [ N | Rest ], Best) ->
    CurEps = erlang:abs(Target - Best),
    NewEps = erlang:abs(Target - N),
    if NewEps < CurEps -> closest(Target, Rest, N);
       true            -> closest(Target, Rest, Best)
    end.

You should be able to do better if the input is sorted. I invented my own metric for 'closest' here, as I allow the closest value to be higher than the target value - you could change it to be 'closest but not greater than' if you liked.
In my opinion, if you have a huge collection of data that does not change often, you should think about organising it. I have written a simple module based on an ordered list, including insertion and deletion functions. It gives good results for both inserting and searching.

-module(finder).
-export([test/1,find/2,insert/2,remove/2,new/0]).
-compile(export_all).

new() -> [].

insert(V,L) ->
    {R,P} = locate(V,L,undefined,-1),
    insert(V,R,P,L).

find(V,L) ->
    locate(V,L,undefined,-1).

remove(V,L) ->
    {R,P} = locate(V,L,undefined,-1),
    remove(V,R,P,L).

test(Max) ->
    {A,B,C} = erlang:now(),
    random:seed(A,B,C),
    L = lists:seq(0,100*Max,100),
    S = random:uniform(100000000),
    I = random:uniform(100000000),
    io:format("start insert at ~p~n",[erlang:now()]),
    L1 = insert(I,L),
    io:format("start find at ~p~n",[erlang:now()]),
    R = find(S,L1),
    io:format("end at ~p~n result is ~p~n",[erlang:now(),R]).

remove(_,_,-1,L) -> L;
remove(V,V,P,L) ->
    {L1,[V|L2]} = lists:split(P,L),
    L1 ++ L2;
remove(_,_,_,L) -> L.

insert(V,V,_,L) -> L;
insert(V,_,-1,L) -> [V|L];
insert(V,_,P,L) ->
    {L1,L2} = lists:split(P+1,L),
    L1 ++ [V] ++ L2.

locate(_,[],R,P) -> {R,P};
locate (V,L,R,P) ->
    %% io:format("locate, value = ~p, liste = ~p, current result = ~p, current pos = ~p~n",[V,L,R,P]),
    {L1,[M|L2]} = lists:split(Le1 = (length(L) div 2), L),
    locate(V,R,P,Le1+1,L1,M,L2).

locate(V,_,P,Le,_,V,_) -> {V,P+Le};
locate(V,_,P,Le,_,M,L2) when V > M -> locate(V,L2,M,P+Le);
locate(V,R,P,_,L1,_,_) -> locate(V,L1,R,P).

which gives the following results:

(exec#WXFRB1824L)6> finder:test(10000000).
start insert at {1347,28177,618000}
start find at {1347,28178,322000}
end at {1347,28178,728000}
 result is {72983500,729836}

That is 704ms to insert a new value into a list of 10 000 000 elements and 406ms to find the nearest value in the same list.
I tried to get more accurate information about the performance of the algorithm I proposed above, and after reading the very interesting solution of Stemm, I decided to use the timer:tc/3 function. Big deception :o). On my laptop I got very poor accuracy of the timings, so I decided to leave my corei5 (2 cores * 2 threads) + 2Gb DDR3 + Windows XP 32bit and use my home PC: Phantom (6 cores) + 8Gb + Linux 64bit.

Now timer:tc works as expected, and I am able to manipulate lists of 100 000 000 integers. I was able to see that I was losing a lot of time calling the length function at each step, so I refactored the code a little to avoid it:

-module(finder).
-export([test/2,find/2,insert/2,remove/2,new/0]).

%% interface

new() -> {0,[]}.

insert(V,{S,L}) ->
    {R,P} = locate(V,L,S,undefined,-1),
    insert(V,R,P,L,S).

find(V,{S,L}) ->
    locate(V,L,S,undefined,-1).

remove(V,{S,L}) ->
    {R,P} = locate(V,L,S,undefined,-1),
    remove(V,R,P,L,S).

remove(_,_,-1,L,S) -> {S,L};
remove(V,V,P,L,S) ->
    {L1,[V|L2]} = lists:split(P,L),
    {S-1,L1 ++ L2};
remove(_,_,_,L,S) -> {S,L}.

%% local

insert(V,V,_,L,S) -> {S,L};
insert(V,_,-1,L,S) -> {S+1,[V|L]};
insert(V,_,P,L,S) ->
    {L1,L2} = lists:split(P+1,L),
    {S+1,L1 ++ [V] ++ L2}.

locate(_,[],_,R,P) -> {R,P};
locate (V,L,S,R,P) ->
    S1 = S div 2,
    S2 = S - S1 -1,
    {L1,[M|L2]} = lists:split(S1, L),
    locate(V,R,P,S1+1,L1,S1,M,L2,S2).

locate(V,_,P,Le,_,_,V,_,_) -> {V,P+Le};
locate(V,_,P,Le,_,_,M,L2,S2) when V > M -> locate(V,L2,S2,M,P+Le);
locate(V,R,P,_,L1,S1,_,_,_) -> locate(V,L1,S1,R,P).

%% test

test(Max,Iter) ->
    {A,B,C} = erlang:now(),
    random:seed(A,B,C),
    L = {Max+1,lists:seq(0,100*Max,100)},
    Ins = test_insert(L,Iter,[]),
    io:format("insert:~n~s~n",[stat(Ins,Iter)]),
    Fin = test_find(L,Iter,[]),
    io:format("find:~n ~s~n",[stat(Fin,Iter)]).

test_insert(_L,0,Res) -> Res;
test_insert(L,I,Res) ->
    V = random:uniform(1000000000),
    {T,_} = timer:tc(finder,insert,[V,L]),
    test_insert(L,I-1,[T|Res]).

test_find(_L,0,Res) -> Res;
test_find(L,I,Res) ->
    V = random:uniform(1000000000),
    {T,_} = timer:tc(finder,find,[V,L]),
    test_find(L,I-1,[T|Res]).

stat(L,N) ->
    Aver = lists:sum(L)/N,
    {Min,Max,Var} = lists:foldl(fun (X,{Mi,Ma,Va}) ->
                                    {min(X,Mi),max(X,Ma),Va+(X-Aver)*(X-Aver)}
                                end,
                                {999999999999999999999999999,0,0}, L),
    Sig = math:sqrt(Var/N),
    io_lib:format(" average: ~p,~n minimum: ~p,~n maximum: ~p,~n sigma : ~p.~n",[Aver,Min,Max,Sig]).

Here are some results:

1> finder:test(1000,10).
insert:
 average: 266.7,
 minimum: 216,
 maximum: 324,
 sigma : 36.98121144581393.
find:
 average: 136.1,
 minimum: 105,
 maximum: 162,
 sigma : 15.378231367748375.
ok
2> finder:test(100000,10).
insert:
 average: 10096.5,
 minimum: 9541,
 maximum: 12222,
 sigma : 762.5642595873478.
find:
 average: 5077.4,
 minimum: 4666,
 maximum: 6937,
 sigma : 627.126494417195.
ok
3> finder:test(1000000,10).
insert:
 average: 109871.1,
 minimum: 94747,
 maximum: 139916,
 sigma : 13852.211285206417.
find:
 average: 40428.0,
 minimum: 31297,
 maximum: 56965,
 sigma : 7797.425562325042.
ok
4> finder:test(100000000,10).
insert:
 average: 8067547.8,
 minimum: 6265625,
 maximum: 16590349,
 sigma : 3199868.809140206.
find:
 average: 8484876.4,
 minimum: 5158504,
 maximum: 15950944,
 sigma : 4044848.707872872.
ok

On the 100 000 000 element list it is slow, and the multi-process solution cannot help with this dichotomy algorithm... It is a weak point of this solution, but if you have several processes requesting a nearest value in parallel, it will be able to use the multiple cores anyway.

Pascal.
In Groovy, how do I add up the values for a certain property in a map?
I have the following map:

def map = [];
map.add([ item: "Shampoo", count: 5 ])
map.add([ item: "Soap", count: 3 ])

I would like to get the sum of all the count properties in the map. In C# using LINQ, it would be something like:

map.Sum(x => x.count)

How do I do the same in Groovy?
Assuming you have a list like so:

List list = [ [item: "foo", count: 5],
              [item: "bar", count: 3] ]

Then there are multiple ways of doing it. The most readable is probably:

int a = list.count.sum()

Or you could use the Closure form of sum on the whole list:

int b = list.sum { it.count }

Or you could even use a more complex route such as inject:

int c = list.count.inject { tot, ele -> tot + ele } // Groovy 2.0
// c = list.count.inject( 0 ) { tot, ele -> tot + ele } // Groovy < 2.0

All of these give the same result:

assert ( a == b ) && ( b == c ) && ( c == 8 )

I would use the first one.
You want to use the collect operator. I checked the following code with groovysh:

list1 = []
total = 0

list1[0] = [item: "foo", count: 5]
list1[1] = [item: "bar", count: 3]

list1.collect{ total += it.count }

println "total = ${total}"
First of all, you're confusing map and list syntax in your example. Anyhow, Groovy injects a .sum(closure) method to all collections.

Example:

[[a:1,b:2], [a:5,b:4]].sum { it.a }
===> 6