How to count values greater than or equal to 0.5 occurring in 5 or more consecutive rows in Python - python-3.x

I am trying to count values in column x that are greater than or equal to 0.5 for 5 or more consecutive rows. I also need to use groupby on my data.
I used this line, which works fine, but it cannot count consecutive occurrences; it just counts all values greater than or equal to 0.5:
data['points_greater_0.5'] = data[abs(data['x'])>=0.5].groupby(['y','z','n'])['x'].count()
but I want to count a value greater than or equal to 0.5 only when it occurs 5 or more times in a row.

As the source DataFrame I took:
x y z n
0 0.1 1.0 1.0 1.0
1 0.5 1.0 1.0 1.0
2 0.6 1.0 1.0 1.0
3 0.7 1.0 1.0 1.0
4 0.6 1.0 1.0 1.0
5 0.5 1.0 1.0 1.0
6 0.1 1.0 1.0 1.0
7 0.5 1.0 1.0 1.0
8 0.6 1.0 1.0 1.0
9 0.7 1.0 1.0 1.0
10 0.1 1.0 1.0 1.0
11 0.5 1.0 1.0 1.0
12 0.6 1.0 1.0 1.0
13 0.7 1.0 1.0 1.0
14 0.7 1.0 1.0 1.0
15 0.6 1.0 1.0 1.0
16 0.5 1.0 1.0 1.0
17 0.1 1.0 1.0 1.0
18 0.5 2.0 1.0 1.0
19 0.6 2.0 1.0 1.0
20 0.7 2.0 1.0 1.0
21 0.6 2.0 1.0 1.0
22 0.5 2.0 1.0 1.0
(one group for (y, z, n) == (1.0, 1.0, 1.0) and another for (2.0, 1.0, 1.0)).
Start from import itertools as it.
Then define the following function to get the count of your "wanted"
elements from the current group:
def getCnt(grp):
    return sum(filter(lambda x: x >= 5,
                      [ len(list(group))
                        for key, group in it.groupby(grp.x, lambda elem: elem >= 0.5)
                        if key ]))
Note that it contains it.groupby, i.e. groupby function from itertools
(not the pandasonic version of it).
The difference is that the itertools version starts a new group on each change
of the grouping key (by default, the value of the source element).
Steps:
it.groupby(grp.x, lambda elem: elem >= 0.5) - creates an iterator over the x column of the current group, returning (key, group) pairs. The key states whether the run consists of "wanted" elements (>= 0.5), and the group contains those elements.
[ len(list(group)) for key, group in … if key ] - builds a list of run lengths, skipping runs of "smaller" elements.
filter(lambda x: x >= 5, …) - keeps only the counts of runs with 5 or more members.
sum(…) - sums these counts.
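For intuition, here is a tiny standalone illustration (made-up values, not from the question's data) of how the itertools version splits a sequence into consecutive runs:
import itertools as it

vals = [0.1, 0.5, 0.6, 0.7, 0.6, 0.5, 0.1]   # made-up values
runs = [(key, len(list(group)))
        for key, group in it.groupby(vals, lambda v: v >= 0.5)]
print(runs)   # [(False, 1), (True, 5), (False, 1)] - one pair per consecutive run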
Then, to get your expected result, as a DataFrame, apply this function to
each group of rows, this time grouping with the pandasonic version of
groupby.
Then set the name of the resulting Series (it will be the column name
in the final result) and reset the index, to convert it to a DataFrame.
The code to do it is:
result = df.groupby(['y','z','n']).apply(getCnt).rename('Cnt').reset_index()
The result is:
y z n Cnt
0 1.0 1.0 1.0 11
1 2.0 1.0 1.0 5
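For comparison, a pure-pandas variant of the same run-length idea (my own sketch, not part of the answer above): mark the wanted rows, label each consecutive run with a cumulative id, and count only members of runs of length 5 or more. It gives the same Cnt values on this data.
def getCntPandas(grp):
    mask = grp['x'] >= 0.5
    run_id = (mask != mask.shift()).cumsum()          # new id whenever the mask flips
    run_len = mask.groupby(run_id).transform('size')  # length of the run each row belongs to
    return int((mask & (run_len >= 5)).sum())

result2 = df.groupby(['y','z','n']).apply(getCntPandas).rename('Cnt').reset_index()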

Related

DataFrame of Dates into sequential dates

I would like to turn a dataframe as follows into a data frame of sequential dates.
Date
01/25/1995
01/20/1995
01/20/1995
01/23/1995
into
Date Value Cumsum
01/20/1995 2 2
01/21/1995 0 2
01/22/1995 0 2
01/23/1995 1 3
01/24/1995 0 3
01/25/1995 1 4
Try this:
df['Date'] = pd.to_datetime(df['Date'])
df_out = df.assign(Value=1).set_index('Date').resample('D').asfreq().fillna(0)
df_out = df_out.assign(Cumsum=df_out['Value'].cumsum())
print(df_out)
Output:
Value Cumsum
Date
1995-01-20 1.0 1.0
1995-01-21 0.0 1.0
1995-01-22 0.0 1.0
1995-01-23 1.0 2.0
1995-01-24 0.0 2.0
1995-01-25 1.0 3.0
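Note that the expected output in the question counts the duplicated 01/20/1995 as 2, while asfreq keeps a single row per day. A variant that sums duplicate dates first (a sketch, assuming Date has already been converted with pd.to_datetime as above) could look like:
daily = (df.assign(Value=1)
           .groupby('Date')['Value'].sum()       # duplicate dates add up, so 01/20 -> 2
           .resample('D').asfreq(fill_value=0))  # insert the missing days as 0
df_out = daily.to_frame().assign(Cumsum=daily.cumsum())
print(df_out)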

Perform arithmetic operation mainly subtraction and division over a pandas series on null values

Simply put, I want subtraction/division with a null value to return the non-null value, e.g. 3 / np.nan = 3 or 2 - np.nan = 2.
Using np.nansum and np.nanprod I have handled addition and multiplication, but I don't know how to do the same for subtraction and division.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
The desired output is:
   a    b  c=a-b  d=a/b
0  1  1.0    0.0    1.0
1  2  2.0    0.0    1.0
2  3  NaN    3.0    3.0
3  4  NaN    4.0    4.0
#Use fill value of 0 for subtraction operation
df['c']=df.a.sub(df.b,fill_value=0)
#Use fill value of 1 for division operation
df['d']=df.a.div(df.b,fill_value=1)
IIUC, using sub with fill_value:
df.a.sub(df.b,fill_value=0)
Out[251]:
0 0.0
1 0.0
2 3.0
3 4.0
dtype: float64
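Division works the same way, with 1 as the fill value (the neutral element for division); on the frame above this gives:
df.a.div(df.b, fill_value=1)   # -> 1.0, 1.0, 3.0, 4.0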

Similar random variation for two columns in pandas

import random
import pandas as pd

data = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABCD'))
data[['B', 'C']] = data[['B', 'C']].apply(lambda x: x + (-1)**random.randrange(2)*1)
I want to randomly vary columns B and C such that the variation is the same for both columns: if column B increases by one, column C must increase by one too. However, for each row the value can increase or decrease randomly. The code above doesn't work. Then I tried this:
data['B'] = data['B'].apply(lambda x: x + (-1)**random.randrange(2)*1)
data['C'] = data['C'].apply(lambda x: x + (-1)**random.randrange(2)*1)
Each row varies randomly, but the changes in columns B and C are not the same. How do I do this?
expected output
A B C D
1 1.0 1.0 1.0 1.0
2 1.0 2.0 2.0 1.0
3 1.0 2.0 2.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 0.0 0.0 1.0
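One way to keep B and C in sync (a sketch, not from the original thread) is to draw the random step once per row and add the same step to both columns:
import numpy as np
import pandas as pd

data = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABCD'))
step = np.random.choice([-1, 1], size=len(data))   # one +1/-1 draw per row
data['B'] = data['B'] + step                       # B and C receive the same step,
data['C'] = data['C'] + step                       # so they always move together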

Get number of rows for all combinations of attribute levels in Pandas

I have a dataframe with a bunch of categorical variables; each row corresponds to a product. I wanted to find the number of rows for every combination of attribute levels and decided to run the following:
att1=list(frame_base.columns.values)
f1=att.groupby(att1,as_index=False).size().rename('counts').to_frame()
att1 is the list of all attributes. f1 does not seem to provide the correct value, as f1.counts.sum() is not equal to the number of rows before the groupby. Why doesn't this work?
One possible problem is the NaN row, but maybe there is a typo - you need att instead of frame_base:
import numpy as np
import pandas as pd

att = pd.DataFrame({'A':[1,1,3,np.nan],
                    'B':[1,1,6,np.nan],
                    'C':[2,2,9,np.nan],
                    'D':[1,1,5,np.nan],
                    'E':[1,1,6,np.nan],
                    'F':[1,1,3,np.nan]})
print (att)
A B C D E F
0 1.0 1.0 2.0 1.0 1.0 1.0
1 1.0 1.0 2.0 1.0 1.0 1.0
2 3.0 6.0 9.0 5.0 6.0 3.0
3 NaN NaN NaN NaN NaN NaN
att1=list(att.columns.values)
f1=att.groupby(att1).size().reset_index(name='counts')
print (f1)
A B C D E F counts
0 1.0 1.0 2.0 1.0 1.0 1.0 2
1 3.0 6.0 9.0 5.0 6.0 3.0 1
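If the all-NaN row should be counted as well, newer pandas versions (1.1+) accept dropna=False in groupby (a sketch):
f1 = att.groupby(att1, dropna=False).size().reset_index(name='counts')
# now f1['counts'].sum() equals len(att)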

Pandas Pivot and Summarize For Multiple Rows Vertically

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site':['a','a','a','b','b','b'],
                   'x':[1,1,0,1,0,0],
                   'y':[1,np.nan,0,1,1,0]})
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 0
5 b 0.0 0
I am looking for the most efficient way, for each numerical column (y and x), to produce a percent per group, label the column name, and stack them in one column.
Here's how I accomplish this for 'y':
df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
Now all I need is a way to add 2 more rows to this: the results for 'x', as if I had pivoted with its values above, like this:
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1 2 x 50.0
b 1 3 x 33.3
In reality, I have 48 such numerical data columns that need to be stacked as such.
Thanks in advance!
First you can use notnull. Then omit the values parameter in pivot_table, stack, and sort_values by the new column Item. Last, you can use the pandas round function:
df=df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
Another solution, if it is necessary to define the values columns in pivot_table:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', values=['y', 'x'], aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
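With 48 value columns, a melt-based sketch (my own variant, not one of the answers above) may be easier to extend: reshape to long form and aggregate per (Site, Item). Applied to the df already filtered with notnull as above, it reproduces the same numbers:
# assumes df has already been filtered with df.loc[df['y'].notnull()] as above
long_df = df.melt(id_vars='Site', var_name='Item', value_name='val')
t = (long_df.groupby(['Site', 'Item'])['val']
            .agg(['sum', 'size'])          # per-group sum and row count
            .rename(columns={'size': 'len'})
            .reset_index()
            .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
print(t)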
