groupby and ranking based on the string in one column

I am working on a data frame that contains over 70 actions. One column groups those actions, and I want to create a new column that ranks the strings in an existing column. The following is a sample of the data frame:
import pandas as pd

DF = pd.DataFrame()
DF['template'] = ['Attk','Attk','Attk','Attk','Attk','Attk','Def','Def','Def','Def','Def','Def','Accuracy','Accuracy','Accuracy','Accuracy','Accuracy','Accuracy']
DF['Stats'] = ['Goal','xG','xA','Goal','xG','xA','Block','interception','tackles','Block','interception','tackles','Acc.passes','Acc.actions','Acc.crosses','Acc.passes','Acc.actions','Acc.crosses']
DF = DF.sort_values(['template','Stats'])
The new column I want to create should group by [template] and rank the Stats values in alphabetical order. I have 10 to 15 Stats under each template.

Use GroupBy.transform with a lambda function and factorize; because Python counts from 0, 1 is added:
f = lambda x: pd.factorize(x)[0]
DF['Order'] = DF.groupby('template')['Stats'].transform(f) + 1
print (DF)
    template        Stats  Order
13  Accuracy  Acc.actions      1
16  Accuracy  Acc.actions      1
14  Accuracy  Acc.crosses      2
17  Accuracy  Acc.crosses      2
12  Accuracy   Acc.passes      3
15  Accuracy   Acc.passes      3
0       Attk         Goal      1
3       Attk         Goal      1
2       Attk           xA      2
5       Attk           xA      2
1       Attk           xG      3
4       Attk           xG      3
6        Def        Block      1
9        Def        Block      1
7        Def  interception     2
10       Def  interception     2
8        Def      tackles      3
11       Def      tackles      3
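If you prefer to avoid the explicit pre-sort, a minimal alternative sketch on the same DF uses Series.rank with method='dense', which orders strings alphabetically within each group:
# dense rank per group: equal strings share a rank and ranks follow
# alphabetical order, so no prior sort_values is required
DF['Order'] = (DF.groupby('template')['Stats']
                 .transform(lambda x: x.rank(method='dense'))
                 .astype(int))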


loops application in dataframe to find output

I have the following data:
dict = {'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238], ...}
and
df = pd.DataFrame(dict)
In this manner I have 20 columns with 5 numerical entries in each column.
I want a new column whose values follow this logic:
0    A[0]*B[0] + A[0]*C[0] + A[0]*D[0] + ...
1    A[1]*B[1] + A[1]*C[1] + A[1]*D[1] + ...
2    A[2]*B[2] + A[2]*C[2] + A[2]*D[2] + ...
I tried the following, but I cannot write out all 20 columns manually, so I want to know how to use a loop to get the desired output:
lst = []
for i in range(0, 5):
    j = df.A[i]*df.B[i] + df.A[i]*df.C[i] + ...
    lst.append(j)
A potential solution is the following. I am only using the example you posted, but it works fine for more columns. Your data is df:
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column D by first subsetting your dataframe:
add = df.loc[:, df.columns != 'A']
and then multiplying column A by the row-wise sum of the remaining columns, which by the distributive law equals the sum of all the products:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
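For completeness, a minimal loop-free sketch that scales to any number of columns (assuming every column other than A should enter the sum):
# distributive law: A*B + A*C + ... == A * (B + C + ...),
# so sum the other columns once and multiply by A
other_cols = [c for c in df.columns if c != 'A']
df['D'] = df['A'] * df[other_cols].sum(axis=1)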

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" in a dataframe from the columns "accumulated" and "weekly". The formula is: accumulated at t = weekly at t + accumulated at t-1.
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0,3] = 0  # I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes; it looks like {0: df1, 1: df2, 2: df3}. All the dataframes have the same size and the same column names, so I use the for loop to run the same calculation on every dataframe in the dictionary.
EDIT 2: I tried doing the computation outside the for loop and it still does not work. What I am doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1,df2,df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value in the first row. The shift-based attempt fails because the whole right-hand side is evaluated at once against the old 'accumulated' column (all zeros), so each row adds only one previous value instead of accumulating recursively; cumsum computes the running total in a single vectorized step. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
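Applied back to the dictionary from the question, a sketch assuming df_dic holds dataframes that each have a 'weekly' column, as described:
# run the same cumsum fix on every dataframe in the dictionary;
# subtracting the first weekly value makes the series start at 0
for df_aux in df_dic.values():
    df_aux['accumulated'] = df_aux['weekly'].cumsum() - df_aux['weekly'].iloc[0]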

want to calculate the count of pass instances of data set using python pandas

import pandas as pd

x = []
y1 = []
r1 = len(df)
L1 = len(df.columns)
for i in range(r1):
    ll = df.loc[i, 'LL']
    ul = df.loc[i, 'UL']
    count1 = 0
    for j in range(5, L1):
        # replace stray strings with 0 before comparing (iloc, not loc,
        # since j is a positional index)
        if isinstance(df.iloc[i, j], str):
            df.iloc[i, j] = 0
        if ll <= df.iloc[i, j] <= ul:
            count1 = count1 + 1
    if count1 == (L1 - 5):
        x.append('Pass')
    else:
        x.append('Fail')
    y1.append(count1)
se = pd.Series(x)
se1 = pd.Series(y1)
# row-wise summary statistics over the data columns
# (these must be computed before they are assigned below)
min1 = df.iloc[:, 5:].min(axis=1)
mean1 = df.iloc[:, 5:].astype(float).mean(axis=1, skipna=True)
median1 = df.iloc[:, 5:].astype(float).median(axis=1, skipna=True)
max1 = df.iloc[:, 5:].max(axis=1)
count1 = df.iloc[:, 5:].count(axis=1)
df['Min'] = min1.values
df['Mean'] = mean1.values
df['Median'] = median1.values
df['Max'] = max1.values
df['Pass Count'] = se1.values
df['Result'] = se.values
yield1 = []
for i in range(len(se1)):
    yd1 = (se1[i] / (L1 - 3)) * 100
    yield1.append(yd1)
se2 = pd.Series(yield1)
df['Yield'] = se2.values
df1 = df.loc[:, ['PARAMETER','Min','Mean','Median','Max','Result','Pass Count','Yield']]
df1
Below is my data set; it is daily sensor data. Each day's reading should fall between the Lower Limit (LL) and Upper Limit (UL). I want to count, for each sensor, how many days the data is within LL and UL.
I have not been able to calculate this with pandas. How can I calculate the number of days the sensor data is within LL and UL?
Take a few key ideas:
- build a list of the columns that go into the calculation (daycols)
- transpose these columns, then test them against the limits, which gives a boolean array
- sum this boolean array and you have your desired count
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""sensor location,LL,UL,day1,day2,day3,day4,day5,day6,day7,number of days sensor data within LL and UL
A,1,10,12,6,9,4,9,7,15,5
B,1,12,4,15,7,1,11,1,7,6
C,1,15,13,13,13,10,7,13,13,7
D,1,10,12,1,14,12,15,4,4,3
E,1,20,11,15,8,14,1,14,14,7"""))

daycols = [d for d in df.columns if "day" in d and "number" not in d]

df = df.assign(
    # use the fact that True is 1, so summing a boolean array gives the count
    daysBetween=lambda dfa: ((dfa.loc[:,daycols].T >= dfa["LL"]) &
                             (dfa.loc[:,daycols].T <= dfa["UL"])).sum()
)
print(df.to_string(index=False))
output
sensor location LL UL day1 day2 day3 day4 day5 day6 day7 number of days sensor data within LL and UL daysBetween
A 1 10 12 6 9 4 9 7 15 5 5
B 1 12 4 15 7 1 11 1 7 6 6
C 1 15 13 13 13 10 7 13 13 7 7
D 1 10 12 1 14 12 15 4 4 3 3
E 1 20 11 15 8 14 1 14 14 7 7
speed up
If you have many columns you can use slicing to identify them and turn them into integer positions so iloc can be used. With the flex comparison methods .ge and .le (axis=0), the transpose is not necessary either.
dayi = [df.columns.get_loc(c) for c in df.columns[3:-1]]

df = df.assign(
    # .ge/.le with axis=0 align LL/UL by row, so no transpose is needed;
    # summing across the columns gives the per-sensor count
    daysBetween=lambda dfa: (dfa.iloc[:,dayi].ge(dfa["LL"], axis=0) &
                             dfa.iloc[:,dayi].le(dfa["UL"], axis=0)).sum(axis=1)
)
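To tie this back to the Pass/Fail logic in the question, a hedged sketch (it assumes a row passes only when every day column is within the limits; numpy is an extra import here):
import numpy as np

# 'Pass' when all day columns fall inside [LL, UL], else 'Fail'
df['Result'] = np.where(df['daysBetween'] == len(dayi), 'Pass', 'Fail')
# percentage of days in range
df['Yield'] = df['daysBetween'] / len(dayi) * 100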

How to remove Initial rows in a dataframe in python

I have 4 dataframes with weekly sales values for a year for 4 products. Some of the initial rows are 0 because there were no sales, and there are other 0 values in between the weeks as well.
I want to remove those initial 0 rows while keeping the in-between 0s.
For example
Week Sales(prod 1)
1 0
2 0
3 100
4 120
5 55
6 0
7 60
Week Sales(prod 2)
1 0
2 0
3 0
4 120
5 0
6 30
7 60
I want to remove rows 1 and 2 from the first table and rows 1, 2 and 3 from the second.
A few assumptions based on your example dataframe:
- the DataFrame is created using pandas
- week always starts with 1
- only the initial consecutive weeks with 0 sales are removed
Solution:
Python libraries required: pandas, more_itertools
Example DataFrame (df):
Week Sales
1 0
2 0
3 0
4 120
5 0
6 30
7 60
Python Code:
import pandas as pd
import more_itertools as mit
filter_col = 'Sales'
filter_val = 0
## function which returns the index labels to be removed
def return_initial_week_index_with_zero_sales(df, filter_col, filter_val):
    index_wzs = [False]
    # check the first row (week 1): trim only when it has zero sales
    if df[filter_col].iloc[0] == filter_val:
        index_list = df[df[filter_col] == filter_val].index.tolist()
        # group consecutive zero-sales weeks; the first group is the initial run
        index_wzs = [list(group) for group in mit.consecutive_groups(index_list)]
    return index_wzs[0]
## calling the above function and removing the index labels from the dataframe
df = df.set_index('Week')
weeks_to_be_removed = return_initial_week_index_with_zero_sales(df, filter_col, filter_val)
if weeks_to_be_removed:
    print('Initial weeks with 0 sales are {}'.format(weeks_to_be_removed))
    df = df.drop(index=weeks_to_be_removed)
else:
    print('No initial week has 0 sales')
df.reset_index(inplace=True)
Result: df
Week Sales
4 120
5 0
6 30
7 60
I hope it helps, you can modify the function as per your requirement.
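A shorter alternative sketch, assuming the same Sales column: a boolean cummax marks every row from the first nonzero sale onward, so only the leading zeros are trimmed.
# keep rows from the first nonzero sale onward; in-between zeros survive
df = df[df['Sales'].ne(0).cummax()]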

Creating a sub-index in pandas dataframe [duplicate]

This question already has answers here: Add a sequential counter column on groups to a pandas dataframe (4 answers)
Okay, this is tricky. I have a pandas dataframe with machine log data. The data has an index, but the dataframe contains various jobs, and I want to give each job an index of its own so I can compare them with each other. So I want another column with a counter that starts at zero, runs to the end of the job, and then resets to zero for the next job. Or do I have to do this line by line?
I think you need set_index with cumcount to count within the categories:
df = df.set_index(df.groupby('Job Columns').cumcount(), append=True)
Sample:
import numpy as np
import pandas as pd

np.random.seed(456)
df = pd.DataFrame({'Jobs':np.random.choice(['a','b','c'], size=10)})
#solution with sorting
df1 = df.sort_values('Jobs').reset_index(drop=True)
df1 = df1.set_index(df1.groupby('Jobs').cumcount(), append=True)
print (df1)
Jobs
0 0 a
1 1 a
2 2 a
3 0 b
4 1 b
5 2 b
6 3 b
7 0 c
8 1 c
9 2 c
#solution with no sorting
df2 = df.set_index(df.groupby('Jobs').cumcount(), append=True)
print (df2)
Jobs
0 0 b
1 1 b
2 0 c
3 0 a
4 1 c
5 2 c
6 1 a
7 2 b
8 2 a
9 3 b
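If you would rather have a plain column than an extra index level, a minimal variant on the same sample (the column name 'sub_index' is hypothetical):
# cumcount restarts at 0 for each group, giving a per-job sub-index
df['sub_index'] = df.groupby('Jobs').cumcount()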
