Summing the max values when grouping other than exact - CouchDB

I will explain step by step. I'm trying to get one single integer value for a certain key set.
So let's say that I have the following map function:
function(doc) {
  emit([doc.key1, doc.key2, doc.key3], doc.integerValue);
}
The doc.integerValue in this case is a sequence. So let's say that someone makes 0, 10, 20, 30 progress, and after that starts on a new sessionId with, once again, 0, 10, 20, 30, 40.
Now I was trying to achieve the following:
1) When grouping exact, it should output the max integerValue found (30 for the first sequence, 40 for the second).
2) When grouping on level 2, it should output the sum of the max values from scenario 1, so that I can count the total value. So, assuming key1 and key2 were the same in both examples, it would give me 70.
3) When grouping on level 1, it should also give 70, as they are the same two rows as in the last scenario.
The _stats reduce function does not do the job, as it only gives the max of the whole group or the sum of the whole group, and what I need is a combination of both.
Besides that, I know I have the option to leave the grouping out and select with startkey and endkey instead, but I want to check whether it is possible with the grouping option.
Thanks in advance for any ideas!

If you are not interested in stats, why don't you just output the single highest value of integerValue* in your view and use _sum as the reduce function?
// map function
function(doc) {
  if (!doc.integerValue || !(doc.integerValue instanceof Array)) return;
  // sort numerically, descending; a plain .sort() would compare the values as strings
  emit([doc.key1, doc.key2, doc.key3], doc.integerValue.sort(function(a, b) { return b - a; })[0]);
}
// reduce function
_sum
This will mean:
group=true (exact grouping) will output the highest value of the sequence
group_level=2 will output the sum of those single highest values of each sequence
group_level=1 will do the same.
*Assuming integerValue is an array -- I wasn't 100% sure from your question.
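In practice (a minimal sketch; the database name progress and the design document/view names are made up for illustration), the three scenarios then map onto the standard view query parameters:
GET /progress/_design/stats/_view/max_progress?group=true
GET /progress/_design/stats/_view/max_progress?group_level=2
GET /progress/_design/stats/_view/max_progress?group_level=1
The first returns one row per exact [key1, key2, key3] key; the other two return the _sum-reduced rows per key prefix.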

Related

Counting if part of string is within interval

I am currently trying to check whether a number in a comma-separated string is within a number interval: specifically, whether an area code (from the comma-separated string) falls within the interval of an area.
The data:
AREAS
Area interval    Name      Number of locations
1000-1499        Area 1    ?
1500-1799        Area 2    ?
1800-1999        Area 3    ?
GEOLOCATIONS
Name          Areas List
Location A    1200, 1400
Location B    1020, 1720
Location C    1700, 1920
Location D    1940, 1950, 1730
The result I want here is the number of unique locations in the Areas List within each area interval. So Location D should only count ONCE in the 1800-1999 area, and Location A likewise in 1000-1499. But Location B should count as one in both 1000-1499 and 1500-1799 (because a number from each interval is in the comma-separated string in its Areas List):
Area interval    Name      Number of locations
1000-1499        Area 1    2
1500-1799        Area 2    3
1800-1999        Area 3    2
How is this possible?
I have tried COUNTIFS, but it doesn't seem to do the job.
Here is one option using FILTERXML():
Formula in C2:
=SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0]"))
Where:
"<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>" - Is the part where we construct a valid piece of XML. The theory here is that we use three axes here. Each t-node will be named a literal 1 to make sure that once we return them with xpath we can sum the result. The outer x-nodes are there to make sure Excel will handle the inner axes correctly. If you are curious to know how this xml-syntax looks at the end, it's best to step through using the 'Evaluate Formula' function on the Data-tab;
//t[count(.//*[.>="&SUBSTITUTE(A2,"-","][.<=")&"])>0] - This basically means that we collect all t-nodes where the count of child s-nodes that are >= the leftmost number and <= the rightmost number is larger than zero. For A2, the XPath would look like //t[count(.//*[.>=1000][.<=1499])>0] after substitution. In short: //t selects the t-nodes, and count(.//*[.>=1000][.<=1499]) counts the child nodes that fulfill both requirements; we keep the t-nodes where that count is larger than zero;
Since all t-nodes equal the number 1, the SUM() of these t-nodes equals the number of unique locations that have at least one area code within the interval in their Areas List;
It is important to note that FILTERXML() will result in an error if no t-nodes can be found. That means we would need to wrap the FILTERXML() in IFERROR(...., 0) to counter that and keep the SUM() working correctly.
Or, wrap the above in BYROW():
Formula in C2:
=BYROW(A2:A4,LAMBDA(a,SUM(FILTERXML("<x><t>"&TEXTJOIN("</s></t><t>",,"1<s>"&SUBSTITUTE(B$7:B$10,", ","</s><s>"))&"</s></t></x>","//t[count(.//*[.>="&SUBSTITUTE(a,"-","][.<=")&"])>0]"))))
Using MMULT and TEXTSPLIT:
=LET(rng,TEXTSPLIT(D2,"-"),
tarr,IFERROR(--TRIM(TEXTSPLIT(TEXTJOIN(";",,$B$2:$B$5),",",";")),0),
SUM(--(MMULT((tarr>=--TAKE(rng,,1))*(tarr<=--TAKE(rng,,-1)),SEQUENCE(COLUMNS(tarr),,1,0))>0)))
I am in very distinguished company, but I will add my version anyway, as BYROW is probably a slightly different approach:
=LET(range,B$2:B$5,
lowerLimit,--INDEX(TEXTSPLIT(E2,"-"),1),
upperLimit,--INDEX(TEXTSPLIT(E2,"-"),2),
counts,BYROW(range,LAMBDA(r,SUM((--TEXTSPLIT(r,",")>=lowerLimit)*(--TEXTSPLIT(r,",")<=upperLimit)))),
SUM(--(counts>0))
)
Here's the ugly way to do it, with A LOT of helper columns. But not so complicated 🙂
F4:  =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(B4;",";"</r><r>")&"</r></m>";"//r"))
F11: =TRANSPOSE(FILTERXML("<m><r>"&SUBSTITUTE(A11;"-";"</r><r>")&"</r></m>";"//r"))
F16: =SUM(F18:F21)
F18: =IF(SUM(($F4:$O4>=$F$11)*($F4:$O4<=$G$11))>0;1;"")
G18: =IF(SUM(($F4:$O4>=$F$12)*($F4:$O4<=$G$12))>0;1;"")
H18: =IF(SUM(($F4:$O4>=$F$13)*($F4:$O4<=$G$13))>0;1;"")

Get Top(N) or single value using BQL

How is it possible to use a PXSelect statement so that it retrieves the Top(N) rows, or the first value, for a particular DAC?
Let's say that I have a table with a sequence number and I want to obtain the record with the largest sequence number. How can I do that?
Of course, for performance reasons, I would like SQL to send just one record.
You can use SelectWindowed in place of Select on your PXSelect to get the top N records. In the example below it will get the top 1. If you change the totalRows value from 1 to 5, it will get the top 5 (except you would then have to loop over the PXResultSet to use all 5 records retrieved).
Top 1 Example:
DiscountSequence firstRow = PXSelect<DiscountSequence,
    Where<DiscountSequence.discountID, Equal<Required<DiscountSequence.discountID>>>>
    .SelectWindowed(this, 0, 1, someDiscountID);
Top 5 Example:
foreach (DiscountSequence row in PXSelect<DiscountSequence,
    Where<DiscountSequence.discountID, Equal<Required<DiscountSequence.discountID>>>>
    .SelectWindowed(this, 0, 5, someDiscountID))
{
    // 5 rows returned
}
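Since SelectWindowed simply takes the first N rows in whatever order the query produces, getting the record with the largest sequence number also needs a descending OrderBy. A sketch (the sequence field, here called lineNbr, is only an illustration; substitute your DAC's actual field):
DiscountSequence lastRow = PXSelect<DiscountSequence,
    Where<DiscountSequence.discountID, Equal<Required<DiscountSequence.discountID>>>,
    OrderBy<Desc<DiscountSequence.lineNbr>>>
    .SelectWindowed(this, 0, 1, someDiscountID);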

Detailed Explanation for DenseRank

I am trying to understand the DenseRank function when multiple arguments are used. Could someone help me understand it with the below example, or any other example?
Thanks a ton in advance!
Calculated Column: DenseRank([Country],[Event Identifier])
From the Help Pages
DenseRank(Arg1, Arg2, Arg3...)
Returns an integer value ranking of the values in the selected column.
The first argument is the column to be ranked. An optional argument is
a string determining whether to use an ascending (default) or a
descending ranking. For the highest value to retrieve rank 1, use the
argument "desc", for the lowest value to retrieve rank 1, use "asc".
Ties are given the same rank value and the highest ranking number
equals the number of unique values in the column.
Additional column arguments (optional) can be used when the column
should be split into separately ranked categories.
Examples:
DenseRank([Sales])
DenseRank([Sales], "desc", [Region])
So, in your example you are ranking Country, grouped by / partitioned by Event Identifier, with the default "asc" order. This is done alphabetically; thus, if we look at Interim 1, we will see 4 ranks, 1-4, since you have 4 countries for Interim 1, in alphabetical (ascending) order. Each "group" (the extra column argument, which in your case is Event Identifier) gets its own set of rankings from 1 to n, where n is the number of distinct values. If you remove this argument, the entire data set will be ranked without consideration of Event Identifier. A small worked example is shown below.
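A tiny illustration with made-up data: with DenseRank([Country],[Event Identifier]) and the default ascending order, each Event Identifier gets its own alphabetical ranking, and ties share a rank:
Event Identifier   Country   DenseRank([Country],[Event Identifier])
Interim 1          Belgium   1
Interim 1          Denmark   2
Interim 1          Denmark   2
Interim 1          France    3
Interim 1          Germany   4
Interim 2          Denmark   1
Interim 2          France    2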

Binary search - worst/avg case

I'm finding it difficult to understand why/how the worst and average case for searching for a key in an array/list using binary search is O(log(n)).
log(1,000,000) is only 6. log(1,000,000,000) is only 9 - I get that, but I don't understand the explanation. If one did not test it, how do we know that the avg/worst case is actually log(n)?
I hope you guys understand what I'm trying to say. If not, please let me know and I'll try to explain it differently.
Worst case
Every time the binary search code makes a decision, it eliminates half of the remaining elements from consideration. So you're dividing the number of elements by 2 with each decision.
How many times can you divide by 2 before you are down to only a single element? If n is the starting number of elements and x is the number of times you divide by 2, we can write this as:
n / (2 * 2 * 2 * ... * 2) = 1 [the '2' is repeated x times]
or, equivalently,
n / 2^x = 1
or, equivalently,
n = 2^x
So log base 2 of n gives you x, which is the number of decisions being made.
Finally, you might ask, if I used log base 2, why is it also OK to write it as log base 10, as you have done? The base does not matter because the difference is only a constant factor which is "ignored" by Big O notation.
Average case
I see that you also asked about the average case. Consider:
There is only one element in the array that can be found on the first try.
There are only two elements that can be found on the second try. (Because after the first try, we chose either the right half or the left half.)
There are only four elements that can be found on the third try.
You can see the pattern: 1, 2, 4, 8, ... , n/2. To express the same pattern going in the other direction:
Half the elements take the maximum number of decisions to find.
A quarter of the elements take one fewer decision to find.
etc.
Since half of the elements take the maximum amount of time, it doesn't matter how much less time the other elements take. We could assume that all elements take the maximum amount of time, and even if half of them actually take 0 time, our assumption would not be more than double whatever the true average is. We can ignore "double" since it is a constant factor. So the average case is the same as the worst case, as far as Big O notation is concerned.
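To make this concrete, here is the average worked out exactly (a sketch, assuming n = 2^x - 1 so that every level of the implicit decision tree is full; 2^(k-1) elements are found on the k-th decision):
average = (1*1 + 2*2 + 3*4 + ... + x*2^(x-1)) / n
        = ((x - 1) * 2^x + 1) / (2^x - 1)
        ≈ x - 1,   where x = log2(n + 1)
So the average number of decisions is about one less than the worst case, which is still O(log n).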
For binary search, the array should be arranged in ascending or descending order.
In each step, the algorithm compares the search key value with the key value of the middle element of the array.
If the keys match, then a matching element has been found and its index, or position, is returned.
Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element.
Or, if the search key is greater, then the algorithm repeats its action on the sub-array to the right of the middle element.
If the remaining array to be searched is empty, then the key cannot be found in the array and a special "not found" indication is returned.
So, a binary search is a dichotomic divide-and-conquer search algorithm. It therefore takes logarithmic time to perform the search, as the number of remaining elements is halved in each iteration.
For sorted lists, on which we can do a binary search, each "decision" compares your key to the middle element: if your key is greater, the search takes the right half of the list; if less, the left half (if it's a match, it returns the element at that position). You effectively reduce your list by half with every decision, yielding O(log n).
Binary search, however, only works for sorted lists. For unsorted lists you can do a straight search starting with the first element, yielding a complexity of O(n).
O(log n) < O(n)
Which approach is best, though, entirely depends on how many searches you'll be doing, your inputs, etc.
For binary search, the prerequisite is a sorted array as input.
• As the list is sorted:
• Certainly we don't have to check every word in the dictionary to look up a word.
• A basic strategy is to repeatedly halve our search range until we find the value.
• For example, look for 5 in the sorted list of 9 numbers below: v = 1 1 3 5 8 10 18 33 42
• We would first start in the middle: 8
• Since 5 < 8, we know we can look at just the first half: 1 1 3 5
• Looking at the middle number again, narrow down to 3 5
• Then we stop when we're down to one number: 5
How many comparisons are needed? 4 = ceil(log2(9)), which is O(log2(n)).
#include <vector>

// Searches the sorted vector v for val; returns its index, or -1 if absent.
int binary_search (const std::vector<int>& v, int val) {
    int from = 0;
    int to = static_cast<int>(v.size()) - 1;
    while (from <= to) {
        int mid = from + (to - from) / 2;  // middle of the current range
        if (val == v[mid])
            return mid;        // found the key
        else if (val > v[mid])
            from = mid + 1;    // continue in the right half
        else
            to = mid - 1;      // continue in the left half
    }
    return -1;                 // not found
}
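A quick usage check for the function above (the list is the 9-number example from earlier):
#include <iostream>

int main() {
    std::vector<int> v = {1, 1, 3, 5, 8, 10, 18, 33, 42};
    std::cout << binary_search(v, 5) << "\n";  // prints 3 (the index of 5)
    std::cout << binary_search(v, 7) << "\n";  // prints -1 (not present)
    return 0;
}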

Extend value to arithmetic mean

This might be quite a stupid question, and I'm not sure whether it belongs here or on Math.
My problem:
I have several elements of type X which have a boolean attribute Y.
To calculate the percentage of elements where Y is true, I count all X where Y is true and divide it by the number of elements.
But I don't want to iterate over all the elements every time just to update that percentage value.
My idea was:
If I had 33% for 3 elements, and am adding a fourth one where Y is true:
(0.33 * 3 + 1) / 4 = 0.4975
Obviously that does not work well because of the 0.33.
Is there any way to get an accurate result without iterating or saving the number of items where Y is true?
Keep a count of the total number of elements and of the "true" ones. Global vars, object member variables, whatever. I assume that back when the program starts, you have zero elements. Every time an element is added, removed, or has its boolean attribute changed, increment or decrement those counts as appropriate. You'll never have to iterate over the list (except maybe for testing), but at the cost that every change to the list has to include fiddling with those variables.
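A minimal sketch of that bookkeeping (the class and method names are made up for illustration):
#include <cstddef>

// Tracks the fraction of elements whose boolean attribute Y is true,
// without ever iterating over the collection itself.
class TrueRatioCounter {
    std::size_t total = 0;
    std::size_t trueCount = 0;
public:
    void add(bool y)     { ++total; if (y) ++trueCount; }
    void remove(bool y)  { --total; if (y) --trueCount; }
    void flip(bool newY) { if (newY) ++trueCount; else --trueCount; }  // Y changed on an existing element
    double ratio() const { return total ? static_cast<double>(trueCount) / total : 0.0; }
};
Because ratio() recomputes from the exact integer counts on demand, there is no rounding drift like the 0.33 in the question.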
Your idea doesn't work because 0.33 does not equal 1/3. It's an approximation. If you take the exact value, you get the right answer:
(1/3 * 3 + 1) / 4 = (1 + 1) / 4 = 1/2
My question is: if you can store the value of 33% without iterating, why not just store the values 1 and 3 and calculate from them? That is, just keep a running total of the number of true values and the number of objects. Increment them when you get new ones. Calculate on demand. It's not necessary to iterate every time this way.
