How to find 90th% value? - groovy

I have this list:
def grades = [5,4,3,2,1,1]
where the index is the grade and the value is the number of occurrences of that grade:
Grade  Occurrence
0      5
1      4
2      3
3      2
4      1
5      1
How can I calculate the 90th percentile for the grades?

This gets me a whole grade. I was hoping to find an exact value of 90th percentile, but this will do for me.
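def grades = [5, 4, 3, 2, 1, 1]
def sum = grades.sum()        // 16 grades in total
def per_value = sum * 0.9     // 14.4 grades lie below the 90th percentile
def per_grade = -1            // -1 means "not found yet"
grades.eachWithIndex { grade_count, grade ->
    // eachWithIndex cannot break early, so skip the work once a grade is found
    if (per_grade < 0) {
        per_value -= grade_count
        // the first grade whose cumulative count reaches the threshold is the result
        if (per_value <= 0) { per_grade = grade }
    }
}
println per_grade             // 4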

A total of 16 grades have been given. 90% of 16 is 14.4, so discard the lowest 14 grades and take the smallest remaining one (in your example that is 4).
There are a couple of ways to code this:
You can count through the array you already have. Start at 14 and subtract 5 (= 9), then 4, then 3, then 2. Once you reach zero, the 14 lowest grades are used up, and the next index with a non-zero count (4) is the 90th percentile.
Perhaps easier to understand, but needing a few more lines of code: put all 16 grades into an array (or list): [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]. Since the array is sorted, the 90th percentile is found at (zero-based) index 14.
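Both approaches can be sketched in a few lines. The sketch below is in Python rather than Groovy, and the function names are made up for illustration. It assumes the "discard the lowest floor(90% of total) grades" definition described above and returns a whole grade without interpolation.

```python
from itertools import accumulate
import math

def percentile_grade(counts, fraction=0.9):
    """Walk the cumulative counts; counts[g] is how often grade g occurred."""
    total = sum(counts)
    skip = math.floor(total * fraction)  # number of lowest grades to discard
    for grade, covered in enumerate(accumulate(counts)):
        if covered > skip:               # this grade holds the first kept value
            return grade
    return len(counts) - 1               # only reached when fraction is 1.0

def percentile_grade_expanded(counts, fraction=0.9):
    """Expand the counts into a sorted list and index into it."""
    expanded = [grade for grade, n in enumerate(counts) for _ in range(n)]
    index = min(math.floor(len(expanded) * fraction), len(expanded) - 1)
    return expanded[index]

grades = [5, 4, 3, 2, 1, 1]
print(percentile_grade(grades))           # 4
print(percentile_grade_expanded(grades))  # 4
```

The first function never builds the expanded list, which matters only if the counts are large.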

Related

Nuanced Excel Question; calculating proportions

Fellow overflowers, all help is appreciated;
I have the following rows of values (always 7 values per row) of data in Excel (3 examples below), where data is coded as 1 or 2. I am interested in the 1's.
2, 2, 1, 2, 2, 1, 1.
1, 2, 2, 2, 2, 1, 2.
2, 2, 2, 1, 1, 1, 2.
I use =MATCH(1,A1:G1,0) to tell me WHEN the first 1 appears, BUT now I want to calculate the proportion that 1's make up of the remaining values in the row.
For example;
2, 2, 1, 2, 2, 1, 1. (1 first appears at point 3, but then 1's make up 2 out of 4 remaining points; 50%).
1, 2, 2, 2, 2, 1, 2. (1 first appears at point 1, but then 1's make up 1 out of the 6 remaining points; about 17%).
2, 2, 2, 1, 1, 1, 2. (1 first appears at point 4, but then 1's make up 2 out of the 3 remaining points; about 67%).
Please help me calculate this proportion!
You could use this one
=(LEN(SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",",""))
-LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",",""),1,""))
)/LEN(SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",",""))
This assumes the whole row is stored as text in A1. The part
SUBSTITUTE(SUBSTITUTE(MID(A1,SEARCH(1,A1)+3,1000)," ",""),",","")
gets the string after the first 1, with spaces and commas removed. The single 1 in the middle part is the character whose percentage you want to calculate. So to adapt the formula to another character, change that single 1 in the middle part and the three 1s in the three SEARCH calls.
EDIT: thank you for the hint, #foxfire
A solution for values in columns would be
=COUNTIF(INDEX(A1:G1,1,MATCH(1,A1:G1,0)+1):G1,1)/(COUNT(A1:G1)-MATCH(1,A1:G1,0))
You can do it with SUMPRODUCT.
My formula in column H is a MATCH like yours:
=MATCH(1;A3:G3;0)
My formula for calculating the % of 1's among the remaining numbers after the first 1 is:
=SUMPRODUCT((A3:G3=1)*(COLUMN(A3:G3)>H3))/(7-H3)
This is how it works:
1. (A3:G3=1) returns an array of 1s and 0s, depending on whether each cell value is 1 or not. For row 3 that is {0;0;1;0;0;1;1}.
2. COLUMN(A3:G3)>H3 returns an array of 1s and 0s, depending on whether the cell's column number is higher than the column number of the first 1 found (which matches its position inside the array, because the data starts in column A). For row 3 that is {0;0;0;1;1;1;1}.
3. We multiply both arrays. For row 3 that is {0;0;1;0;0;1;1} * {0;0;0;1;1;1;1} = {0;0;0;0;0;1;1}.
4. SUMPRODUCT sums up the array of 1s and 0s from the previous step. For row 3 we obtain 2, meaning there are 2 cells with value 1 after the first 1 found.
5. (7-H3) returns how many cells come after the first 1 found. For row 3, there are 4 cells after the first 1.
6. We divide the value from step 4 by the value from step 5, and that's the % you want. For row 3 it is 2/4 = 0,50, which means 50%.
Update: I used 2 columns in case you need to show where the first 1 is. If you want a single column with the %, the formula would be:
=SUMPRODUCT((A3:G3=1)*(COLUMN(A3:G3)>MATCH(1;A3:G3;0)))/(7-MATCH(1;A3:G3;0))
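To check the spreadsheet results outside Excel, the same calculation can be sketched in Python. The function name is made up for illustration, and returning None when nothing follows the first 1 (or when there is no 1 at all) is an assumption; the formulas above would return an error in those cases.

```python
def proportion_after_first(row, target=1):
    """Share of `target` among the values that follow its first occurrence."""
    if target not in row:
        return None                # no occurrence at all
    first = row.index(target)      # zero-based position of the first target
    rest = row[first + 1:]         # values after the first occurrence
    if not rest:
        return None                # the first target is the last value
    return rest.count(target) / len(rest)

rows = [
    [2, 2, 1, 2, 2, 1, 1],  # 2 of the 4 remaining values
    [1, 2, 2, 2, 2, 1, 2],  # 1 of the 6 remaining values
    [2, 2, 2, 1, 1, 1, 2],  # 2 of the 3 remaining values
]
for row in rows:
    print(proportion_after_first(row))
```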

Pyspark: How to count the number of each equal distance interval in RDD

I have an RDD[Double]. I want to divide its value range into k equal intervals, then count the number of elements that fall into each interval.
For example, the RDD is like [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10].
As you can see, each element of the RDD will be in one of the intervals. Then I want to count the elements in each interval. Here, there is one element in each of [0,1),[1,2),[2,3),[3,4),[4,5),[5,6), and both [6,7) and [7,8) have two elements. [8,9) has none and [9,10] has one element.
Finally, I expect an array like array([1,1,1,1,1,1,2,2,0,1]).
Try this. I have assumed that the first element of each range is inclusive and the last exclusive. Please confirm this. For example, for the range [0,1) and the element 0, the condition is element >= 0 and element < 1. With this assumption the value 10 falls outside the last range, so the last count is 0.
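# Assumes an existing SparkContext `sc`; the input and ranges are taken from the question.
rdd = sc.parallelize([0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10])
array_range = [(i, i + 1) for i in range(10)]  # half-open ranges [lower, upper)
elements = rdd.collect()  # collect once instead of once per range
countElementsWithinRange = []
for lower, upper in array_range:
    counter = 0
    for element in elements:
        if lower <= element < upper:
            counter += 1
    countElementsWithinRange.append(counter)
print(elements)
# [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
print(countElementsWithinRange)
# [1, 1, 1, 1, 1, 1, 2, 2, 0, 0]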
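To get the output the question asks for, the last interval has to be closed on the right. A minimal plain-Python sketch of that bucketing rule follows; the function name is hypothetical and the values are assumed to fit in memory. PySpark's RDD.histogram(buckets) is documented to count this way as well (all buckets open to the right except the last), so it may be worth checking against your version before hand-rolling anything.

```python
def bucket_counts(values, k):
    """Count values in k equal-width intervals; only the last one is closed."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for x in values:
        if width == 0:
            idx = 0  # all values equal: put everything in the first bucket
        else:
            # the maximum would land at index k, so clamp it into the last bucket
            idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return counts

print(bucket_counts([0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10], 10))
# [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]
```

The index computation is a pure function of one element, so the same expression could be applied per element on the RDD and the results counted per index, instead of collecting everything to the driver.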

Pandas - Least frequent value in column

I have a Pandas series of integers, 'win'. I want the values most_common and least_common to be the most and least frequent values in the column. For example, with the following numbers, I would want most_common to be 2 and least_common to be 1. If there is a tie (either way), it can be broken arbitrarily.
0 1 2 2 2 0 0 2 2 0
I can find most_common using the following code:
win.mode()[0]
How can I find the least common? I tried the following code, but it did not work, and in any case I was not sure if this was the best way to go about this:
lowest =valid_loss.value_counts().tail(1)[0]
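I think you need the last value of the index for the least frequent value and the first value of the index for the most frequent one. Your attempt fails because .tail(1)[0] looks up the label 0, while the only label left in that one-row series is 1; and even with the right label it would return the count rather than the value.
import pandas as pd

valid_loss = pd.Series([0, 1, 2, 2, 2, 0, 0, 2, 2, 0])
# value_counts() sorts by frequency, most frequent first
s = valid_loss.value_counts()
print(s)
# 2    5
# 0    4
# 1    1
# dtype: int64
highest = s.index[0]   # most frequent value
print(highest)
# 2
lowest = s.index[-1]   # least frequent value
print(lowest)
# 1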
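For comparison, the same "first and last entry of a frequency table" idea can be sketched without pandas, using collections.Counter from the standard library. This is an alternative illustration, not the approach used above; ties are broken by whichever value was seen first.

```python
from collections import Counter

values = [0, 1, 2, 2, 2, 0, 0, 2, 2, 0]
# most_common() returns (value, count) pairs sorted by count, highest first
ranked = Counter(values).most_common()
most_common = ranked[0][0]
least_common = ranked[-1][0]
print(most_common)   # 2
print(least_common)  # 1
```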

Dynamic Programming: Finding the number of ways in which an order-dependent sum of numbers is less than or equal to a number

Given a number N and a set S of numbers, find the number of ways in which an order-dependent sum of numbers of S is less than or equal to N. The numbers in S can occur more than once. For example, when N = 3 and S={1, 2}, the answer is 6. In this example, 1, 1+1, 2, 1+1+1, 1+2, 2+1 are less than or equal to 3.
When S = {1, 2}, the answers for N = 0, 1, 2... are 0, 1, 3, 6, 11, 19, 32.... These are the Fibonacci numbers 2, 3, 5, 8, 13, 21, 34, ... with 2 subtracted from each; think about why that might be.
When S={n1, n2, …, nk}, let f(N) be the number of order-dependent sums that equal exactly N, with f(0)=1 and f(N)=0 for N<0. Then f(N)=f(N-n1)+f(N-n2)+…+f(N-nk), and the answer to the question is f(1)+f(2)+…+f(N). So you just have to compute f(i) for i < nk (taking nk as the largest element of S) and then you can compute f(n) with the formula (f(n),f(n+1),…,f(n+nk-1))=(f(0),f(1),…,f(nk-1))*A^n where A is the companion matrix of the recurrence.
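The recurrence can be checked with a small table-filling sketch. It is the plain O(N·k) dynamic program, not the matrix-power method; the function name is made up, and S is assumed to contain distinct positive integers. Sums are counted by their exact total first and then added up over all totals from 1 to N.

```python
def count_ordered_sums(n, s):
    """Count order-dependent sums of elements of s (repeats allowed) with total in 1..n."""
    # exact[t] = number of ordered sums equal to exactly t; the empty sum gives exact[0] = 1
    exact = [0] * (n + 1)
    exact[0] = 1
    for total in range(1, n + 1):
        # the last term of the sum is some x in s; the rest must add up to total - x
        exact[total] = sum(exact[total - x] for x in s if x <= total)
    return sum(exact[1:])  # leave out the empty sum

print([count_ordered_sums(n, [1, 2]) for n in range(7)])
# [0, 1, 3, 6, 11, 19, 32]
```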

find the longest increasing subsequence (LIS)

Given A= {1,4,2,9,7,5,8,2}, find the LIS. Show the filled dynamic programming table and how the solution is found.
My book doesn't cover LIS, so I'm a bit lost on how to start. For the DP table, I've done something similar with Longest Common Subsequences. Any help on how to start this would be much appreciated.
There are already plenty of answers on this topic, but here's my walkthrough. I view this site as a repository of answers for posterity, and this is just the additional insight I gained when working through it myself.
The Longest Increasing Subsequence (LIS) problem is to find the length of the longest subsequence of a given sequence such that all elements of the subsequence are sorted in increasing order. For example, the length of the LIS for { 10, 22, 9, 33, 21, 50, 41, 60, 80 } is 6 and the LIS is {10, 22, 33, 50, 60, 80}.
Let S[pos] be defined as the smallest integer that ends an increasing sequence of length pos.
Now iterate through every integer X of the input set and do the following:
1. If S is empty or X > last element in S, then append X to the end of S. This essentially means we have found a new longest increasing subsequence.
2. Otherwise find the smallest element in S which is >= X, and change it to X. Because S is sorted at any time, the element can be found using binary search in O(log N).
Total runtime: N integers and a binary search for each of them, so N * log(N) = O(N log N).
Now let's do a real example:
Set of integers: 2 6 3 4 1 2 9 5 8
Steps:
0. S = {} - Initialize S to the empty set
1. S = {2} - New longest LIS
2. S = {2, 6} - 6 > 2 so append it to S
3. S = {2, 3} - 6 is the smallest element >= 3 so replace 6 with 3
4. S = {2, 3, 4} - 4 > 3 so append it to S
5. S = {1, 3, 4} - 2 is the smallest element >= 1 so replace 2 with 1
6. S = {1, 2, 4} - 3 is the smallest element >= 2 so replace 3 with 2
7. S = {1, 2, 4, 9} - 9 > 4 so append it to S
8. S = {1, 2, 4, 5} - 9 is the smallest element >= 5 so replace 9 with 5
9. S = {1, 2, 4, 5, 8} - 8 > 5 so append it to S
So the length of the LIS is 5 (the size of S). Note that S itself is not necessarily an increasing subsequence of the input (here 1 comes after 4 in the input); only its size is meaningful.
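The procedure traced above can be sketched in Python with the standard-library bisect module. The function name is made up for illustration; bisect_left returns the index of the smallest element >= X, which is exactly the replacement position used in the steps, and the sketch returns only the length.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence, in O(N log N)."""
    tails = []  # tails[i] = smallest value that ends an increasing run of length i + 1
    for x in seq:
        pos = bisect_left(tails, x)  # index of the smallest element >= x
        if pos == len(tails):
            tails.append(x)          # x is larger than everything: a longer run
        else:
            tails[pos] = x           # x is a smaller ending for a run of that length
    return len(tails)

print(lis_length([2, 6, 3, 4, 1, 2, 9, 5, 8]))  # 5
print(lis_length([1, 4, 2, 9, 7, 5, 8, 2]))     # 4
```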
Let's take some other sequences to see how this handles the tricky cases; each presents its own issue.
Say we have 1,2,3,4,9,2,3,4,5,6,7,8,10.
It builds out 1,2,3,4,9 first. Then 2, 3 and 4 each replace the equal element already in S (so nothing changes), 5 replaces 9, and 6,7,8,10 are appended,
so S ends up as 1,2,3,4,5,6,7,8,10.
Take another case: 1,2,3,4,5,9,2,10.
This gives us 1,2,3,4,5,9,10.
Or take the case 1,2,3,4,5,9,6,7,8,10.
This gives us 1,2,3,4,5,6,7,8,10.
That illuminates what goes on. In the first case the critical juncture is what happens when you hit the 2 after the 9: how do you deal with it? The block of 2,3,4 doesn't really do anything. When you hit 5 you replace the 9, because for the purpose of ending an increasing run of length 5 the two are interchangeable, and 5 is smaller, so there is greater potential to hit something > 5 later on. But you only ever replace the smallest element >= X. For example, in the last case, if 6 replaced 1 instead of 9, 7 replaced 2 and 8 replaced 3, we would get a final array of 7 elements instead of 9. So just work through a couple of these and figure out the pattern; this logic isn't the easiest to put on paper.
There's a very strong relation between LIS and LCS.
http://en.wikipedia.org/wiki/Longest_increasing_subsequence
This article explains it pretty well I think. Basically the idea is, you can reduce one problem to the other (this is the case in many situations involving Dynamic programming).
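One way to see the reduction, as a sketch: for a strictly increasing subsequence, the LIS of A is a longest common subsequence of A and the sorted list of A's distinct values. The code below fills the usual LCS table and backtracks through it; the function names are hypothetical, and when several subsequences of maximal length exist, which one is returned depends on the tie-breaking in the backtrack.

```python
def lcs_table(a, b):
    """table[i][j] = length of the LCS of a[:i] and b[:j]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

def lis_via_lcs(seq):
    """Recover one longest strictly increasing subsequence via the LCS table."""
    target = sorted(set(seq))  # distinct values in increasing order
    table = lcs_table(seq, target)
    i, j = len(seq), len(target)
    result = []
    while i > 0 and j > 0:     # walk back from the bottom-right corner
        if seq[i - 1] == target[j - 1]:
            result.append(seq[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

A = [1, 4, 2, 9, 7, 5, 8, 2]
print(len(lis_via_lcs(A)))  # 4
```

This takes O(N^2) time and space, so it is mainly useful for small inputs where the filled table itself is what you want to show.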

Resources