How to check maximum of consecutive values greater than 5 - python-3.x

I want to find the maximum number of consecutive rows in a DataFrame whose values are greater than 5.
df=pd.DataFrame({'A':[3,4,7,8,11,6,15,3,15,16,87]})
Expected output:
df=pd.DataFrame({'count_greater_5_max':[5]})

Use:
# compare: greater than 5
a = df.A.gt(5)
# running sum
b = a.cumsum()
# counter that restarts after each value that is not greater than 5;
# fillna(0) handles a column that starts with a value greater than 5
out = b - b.mask(a).ffill().fillna(0)
# maximum value of counter
print (int(out.max()))
5
df=pd.DataFrame({'count_greater_5_max':[int(out.max())]})
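The same steps can be wrapped in a small helper. This is only a sketch: the function name longest_run_greater_than is made up for illustration, and it assumes a non-empty numeric Series. The fillna(0) call covers the case where the Series starts inside a run, which would otherwise leave NaN in the counter.

```python
import pandas as pd

def longest_run_greater_than(values: pd.Series, threshold: float) -> int:
    """Length of the longest run of consecutive values above threshold.

    Hypothetical helper; assumes a non-empty numeric Series.
    """
    above = values.gt(threshold)
    running = above.cumsum()
    # running total at the most recent row that failed the test;
    # fillna(0) covers a Series that starts inside a run
    baseline = running.mask(above).ffill().fillna(0)
    return int((running - baseline).max())

df = pd.DataFrame({'A': [3, 4, 7, 8, 11, 6, 15, 3, 15, 16, 87]})
print(longest_run_greater_than(df['A'], 5))  # 5
```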

I want to improve speed of my algorithm with multiple rows input. Python. Find average of consecutive elements in list

I need to find the average of consecutive elements of a list.
First I am given the length of the list,
then the list of numbers,
then the number of tests I need to perform (several rows of input),
then the inputs for those tests (I need to print one row of results per test).
Each test row consists of the start and end element in the list.
My algorithm:
nu = int(input())  # first I am given the length of the list
numbers = input().split()  # then the list of numbers
num = input()  # number of rows with inputs
k = [float(i) for i in numbers]  # the numbers in the list are floats
i = 0
while i < int(num):
    a, b = input().split()  # start and end element in list
    i += 1
    print(round(sum(k[int(a):(int(b)+1)])/(-int(a)+int(b)+1), 6))  # round to 6 decimals
But it's not fast enough. I was told it's better to get rid of the "while" but I don't know how. I'd appreciate any help.
Example:
Input:
8 - len(list)
79.02 36.68 79.83 76.00 95.48 48.84 49.95 91.91 - list
10 - number of test
0 0 - a1,b1
0 1
0 2
0 3
0 4
0 5
0 6
0 7
1 7
2 7
Output:
79.020000
57.850000
65.176667
67.882500
73.402000
69.308333
66.542857
69.713750
68.384286
73.668333
i = 0
while i < int(num):
    a, b = input().split()  # start and end element in list
    i += 1
Replace your while loop with a for loop. You could also get rid of the multiple int calls in the print statement:
for _ in range(int(num)):
    a, b = [int(j) for j in input().split()]
You didn't spell out the constraints, but I am guessing that the ranges to be averaged could be quite large, so computing sum(k[int(a):(int(b)+1)]) for every query may take a while.
However, if you precompute the partial sums of the input list, each query can be answered in constant time (the sum of the numbers in a range is the difference of the corresponding partial sums).
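A minimal sketch of the partial-sums idea follows. The function name range_averages is made up for illustration, and the queries are passed in as a list rather than read from input() so the example is self-contained. Indices are assumed to be 0-based and inclusive, as in the question.

```python
from itertools import accumulate

def range_averages(numbers, queries):
    """Average of numbers[a..b] (inclusive, 0-based) for each (a, b) query."""
    # prefix[i] holds the sum of the first i numbers, so prefix[0] == 0
    prefix = [0.0, *accumulate(numbers)]
    return [(prefix[b + 1] - prefix[a]) / (b - a + 1) for a, b in queries]

numbers = [79.02, 36.68, 79.83, 76.00, 95.48, 48.84, 49.95, 91.91]
queries = [(0, 0), (0, 1), (0, 7), (2, 7)]
for avg in range_averages(numbers, queries):
    print(f"{avg:.6f}")
```

Building the prefix list costs one pass over the input; after that each query is a subtraction and a division, regardless of how wide the range is.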

calculate percentage of occurrences in column pandas

I have a column with thousands of rows. I want to select the most significant ones. Let's say I want to select all the rows that would represent 90% of my sample. How would I do that?
I have a dataframe with 2 columns, one for product_id and one showing whether it was purchased or not (the value is either 0 or 1).
product_id purchased
a 1
b 0
c 0
d 1
a 1
. .
. .
With df['product_id'].value_counts() I can get all my product_ids ranked by number of occurrences.
Let's say I now want to get the number of product_ids that I should consider in my future analysis, i.e. those that would represent 90% of the total number of occurrences.
Is there a way to do that?
If you want all product_id values whose cumulative share of occurrences is under 0.9, use:
s = df['product_id'].value_counts(normalize=True).cumsum()
df1 = df[df['product_id'].isin(s.index[s < 0.9])]
Or, if you want all rows sorted by product count and then the first 90% of them:
s1 = df['product_id'].map(df['product_id'].value_counts()).sort_values(ascending=False)
df2 = df.loc[s1.index[:int(len(df) * 0.9)]]
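To see what the first variant selects, here is a small sketch on made-up data (the 16-row sample is an assumption, chosen so the cumulative shares are 0.5, 0.75, 0.9375 and 1.0):

```python
import pandas as pd

# hypothetical sample: 16 rows, product 'a' dominates
df = pd.DataFrame({
    'product_id': list('aaaaaaaa' 'bbbb' 'ccc' 'd'),
    'purchased': [1, 0] * 8,
})

# cumulative share of rows, most frequent product first
s = df['product_id'].value_counts(normalize=True).cumsum()
top_ids = s.index[s < 0.9]
df1 = df[df['product_id'].isin(top_ids)]
print(list(top_ids), len(df1))  # ['a', 'b'] 12
```

Here len(top_ids) gives the number of product_ids asked for in the question. Note that s < 0.9 stops before the product that crosses the 90% mark; whether that product should be included is a choice the question leaves open.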

How to find the min and max elements in a list that has nan in between

I have a dataframe that has a column named "score". I am extracting all the elements from that column into a list. It has 'nan's in between. I wish to identify the min and max of elements before every 'nan' occurs.
I was looking into converting the column into a list and traversing the list until I encounter a "nan". But how do I traverse back to find the min and max elements right before the nan?
This is the code I wrote to convert a column of a dataframe into a list and then identify the "nan".
import math
score_list = description_df['score'].tolist()
for i in score_list:
    print(i)
    if math.isnan(i):
        print("\n")
Suppose my data looks like this,
11.03680137760893
5.351482041139766
10.10019513222711
nan
0.960990030082931
nan
6.46983084276682
32.46794015293125
nan
Then I should be able to identify 11.03680137760893 as the max and 5.351482041139766 as the min before the first "nan"; 0.960990030082931 as both the min and the max between the first and second nan; and 32.46794015293125 as the max and 6.46983084276682 as the min between the second and third nan.
You can create groups by testing for missing values with Series.isna and Series.cumsum, aggregate with GroupBy.agg using min and max, and finally remove the rows that contain only missing values with DataFrame.dropna:
df = df.groupby(df['score'].isna().cumsum())['score'].agg(['min','max']).dropna()
print (df)
            min        max
score
0      5.351482  11.036801
1      0.960990   0.960990
2      6.469831  32.467940
You can keep two variables for the running max and min, reset them to a default value each time you find a nan, and print (or store) them. Note that sys.float_info.min is the smallest positive float, not the most negative one, so infinities are safer defaults; the names cur_max and cur_min also avoid shadowing the built-in max and min.
import math
score_list = description_df['score'].tolist()
cur_max = -math.inf
cur_min = math.inf
for i in score_list:
    print(i)
    if math.isnan(i):
        print("max =", cur_max, "min =", cur_min, "\n")
        cur_max = -math.inf
        cur_min = math.inf
    else:
        if i > cur_max:
            cur_max = i
        if i < cur_min:
            cur_min = i
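If the results should be stored rather than printed, one possible sketch is the following. The function name min_max_per_segment is made up for illustration; it assumes that values after the last nan are not reported, since the question only asks for the min and max before each nan.

```python
import math

def min_max_per_segment(scores):
    """Return (min, max) for each run of numbers that ends at a NaN."""
    results = []
    segment = []
    for value in scores:
        if math.isnan(value):
            if segment:  # skip empty runs, e.g. two NaNs in a row
                results.append((min(segment), max(segment)))
            segment = []
        else:
            segment.append(value)
    return results

nan = float('nan')
scores = [11.03680137760893, 5.351482041139766, 10.10019513222711, nan,
          0.960990030082931, nan,
          6.46983084276682, 32.46794015293125, nan]
for lo, hi in min_max_per_segment(scores):
    print("min =", lo, "max =", hi)
```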

Intermediate steps in evaluation of Frequency formula

This refers to the SO question "Counting unique list of items from range based on criteria from other ranges".
The formula suggested by Scott Craner is:
=SUM(--(FREQUENCY(IF(B2:B7<=25,IF(C2:C7<=35,COUNTIF(A2:A7,"<"&A2:A7),""),""),COUNTIF(A2:A7,"<"&A2:A7))>0))
I have been able to understand clearly the logic and evaluation of the formula except for this step shown in the attached snapshots.
As per MS Office document:
FREQUENCY(data_array, bins_array)
The FREQUENCY function syntax has the following arguments:
Data_array Required. An array of or reference to a set of values for which you want to count frequencies. If data_array contains no values, FREQUENCY returns an array of zeros.
Bins_array Required. An array of or reference to intervals into which you want to group the values in data_array. If bins_array contains no values, FREQUENCY returns the number of elements in data_array.
It is clear to me how {1;1;4;0;"";""} arises as data_array and how {1;1;4;0;5;3} arises as bins_array. But how this evaluates to {2;0;1;1;0;0;0} is not clear to me.
I would appreciate it if someone could explain it clearly.
So you want to know how
FREQUENCY({1;1;4;0;"";""},{1;1;4;0;5;3}) evaluates to {2;0;1;1;0;0;0}?
The point is that bins_array does not need to be sorted for FREQUENCY to work. Internally, of course, it must sort bins_array to get the intervals into which to group the values in data_array. It then groups and counts, and returns the counts in the same order in which the bins were given in bins_array, followed by one extra count for the values above the largest bin.
Scores  Bins
1       1
1       1
4       4
0       0
""      5
""      3

Bins sorted
0   (<=0)
1   (>0, <=1)
1   (>1, <=1) == not possible
3   (>1, <=3)
4   (>3, <=4)
5   (>4, <=5)
    (>5)

Bin  Description                                  Result
1    Number of scores (>0, <=1)                   2
1    Number of scores (>1, <=1) == not possible   0
4    Number of scores (>3, <=4)                   1
0    Number of scores (<=0)                       1
5    Number of scores (>4, <=5)                   0
3    Number of scores (>1, <=3)                   0
     Number of scores (>5)                        0
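The evaluation described above can be modelled in a few lines of Python. This is only a rough model of the described behaviour, not Excel's actual implementation; the function name frequency and the rule that non-numeric entries such as "" are skipped are assumptions made for the sketch.

```python
import math

def frequency(data, bins):
    """Rough model of the FREQUENCY behaviour described above."""
    # assumption: non-numeric entries such as "" are ignored
    values = [v for v in data if isinstance(v, (int, float))]
    result = [0] * (len(bins) + 1)
    lower = -math.inf
    # visit bins in ascending order, but remember their original positions
    for pos in sorted(range(len(bins)), key=lambda i: bins[i]):
        upper = bins[pos]
        result[pos] = sum(lower < v <= upper for v in values)
        lower = upper
    result[-1] = sum(v > lower for v in values)  # values above the largest bin
    return result

print(frequency([1, 1, 4, 0, "", ""], [1, 1, 4, 0, 5, 3]))  # [2, 0, 1, 1, 0, 0, 0]
```

The duplicate bin 1 gets the empty interval (>1, <=1), which is why the second entry is 0.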

What is wrong with formula: =IF(A1>B4:D4,1,0)?

Question 1
What is wrong with this formula?
=IF(A1>B4:D4,1,0)
I want it to return 1 if cell A1 is greater than the set of cells B4:D4.
Answered
Question 2:
How can I select/identify the two MAX values from a set of cells? I want to count the two max values.
For example:
(header) A B C D E
(row1) 1 5 4 1 3
It should return
(header) F G H I J
(row1) 0 1 1 0 0
For the top 2 values I would use the LARGE function:
=IF(A1>LARGE(B4:D4,2),1,0)
The LARGE function returns the nth largest value, so LARGE(B4:D4,1) would be equivalent to MAX(B4:D4), but LARGE(B4:D4,2) returns the 2nd largest value.
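The expected output from Question 2 can be modelled in Python as a sanity check. This is a sketch, not a spreadsheet formula: the function name top_two_flags is made up, and it compares with >= on the assumption that each cell is itself part of the range being ranked, as in the question's example row.

```python
import heapq

def top_two_flags(row):
    """1 for cells at least as large as the 2nd largest value, else 0."""
    second_largest = heapq.nlargest(2, row)[-1]  # plays the role of LARGE(range, 2)
    return [1 if value >= second_largest else 0 for value in row]

print(top_two_flags([1, 5, 4, 1, 3]))  # [0, 1, 1, 0, 0]
```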
I guess you mean bigger than the max of them:
=IF(A1>MAX(B4:D4),1,0)
