Analysis on dataframe with python - python-3.x

I want to be able to calculate the average 'goal', 'shot', and 'miss' per shooterName to use for further analysis and visualization.
The code below gives me the count of the three event types (SHOT, GOAL, MISS) in the 'event' column, grouped by 'shooterName'.
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to be able to calculate each 'event' value as a percentage of the total events per 'shooterName'. In the desired output above I added a column '%totalshooterNameevents', which is simply the 'goal', 'shot', and 'miss' counts divided by the total of goals, shots, and misses for each 'shooterName'.

Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
dft['%total'] = dft.groupby('shooterName')['count'].transform(lambda x: x / x.sum())
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
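As a hedged aside: if you only need the percentage column, SeriesGroupBy.value_counts can compute the per-shooter shares directly (assuming a pandas version where it supports normalize):
dft = (df.groupby('shooterName')['event']
         .value_counts(normalize=True)
         .rename('%total')
         .reset_index())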
Without a sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np
# Setup a Minimal Reproducible Example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
                   'event': np.random.choice(['shot', 'goal', 'miss'], 20)})

# Create an empty dataframe indexed by the unique shooter names
dft = pd.DataFrame(index=df['shooterName'].unique())

# Count events per shooter, then join each event's share of that total
grp = df.groupby('shooterName')
dft['count'] = grp['event'].count()
dft = dft.join(grp['event'].value_counts().unstack('event')
                           .div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000
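As a follow-up note, a hedged alternative sketch: pd.crosstab with normalize='index' produces the same per-shooter shares in a single call, which can then be joined to the counts:
shares = pd.crosstab(df['shooterName'], df['event'], normalize='index')
counts = df.groupby('shooterName')['event'].count().rename('count')
dft = counts.to_frame().join(shares)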

Related

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns, listed by date. What I'm trying to do is find out the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject):
    # Clean up special characters in the column names and make the
    # Date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List the files in the data directory with a .csv extension
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code, but I got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in lotimport()['ozlotto'].columns.tolist():
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
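For what it's worth, the odd loop result is likely because DataFrame.append treats a named Series as a single row (the Series index becomes the columns), so each appended column turned into one wide row instead of being stacked vertically; appending to a Series, as in the final solution further down, avoids this.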
This solution might be the one you are looking for.
import numpy as np

freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
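As a hedged alternative sketch, stacking the frame into one long Series and using value_counts avoids the detour through numpy (value_counts already sorts by frequency in descending order):
df2 = df.stack().value_counts().rename_axis('Numbers').to_frame('Frequency')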
df.max(axis=0)  # for columns
df.max(axis=1)  # for rows (index)
OK, so the final answer I came up with was a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file, clean up the dates and the column names, and convert it to a pandas dataframe.
Then create a new pandas series and append each column to it, ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get the counts of unique values and turn those values into the index, then sort the Frequency column in descending order and return the top 10 values.
Below is the resulting code; I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject):
    # Clean up special characters in the column names and make the
    # Date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List the files in the data directory with a .csv extension
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in lotimport()['ozlotto'].columns.tolist():
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)
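One caveat worth flagging: Series.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the loop above can be replaced with a hedged pd.concat equivalent:
ozlotto = lotimport()['ozlotto']
lotcomb = pd.concat([ozlotto[c] for c in ozlotto.columns], ignore_index=True)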

How to do mean centering on a dataframe in pandas?

I want to use two transformation techniques on a data frame, mean centering and standardization. How can I perform the mean centering method on my dataframe?
I have performed standardization using StandardScaler() from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
standard.iloc[:,1:-1] = StandardScaler().fit_transform(standard.iloc[:,1:-1])
I am expecting a transformed data frame which is mean-centered
You can mean-center a column by subtracting its mean directly:
import pandas as pd

dataxx = {'Name': ['Tom', 'gik', 'Tom', 'Tom', 'Terry', 'Jerry', 'Abel', 'Dula', 'Abel'],
          'Age': [20, 21, 19, 18, 88, 89, 95, 96, 97],
          'gg': [1, 1, 1, 30, 30, 30, 40, 40, 40]}
dfxx = pd.DataFrame(dataxx)
dfxx["meancentered"] = dfxx.Age - dfxx.Age.mean()
Index  Name   Age  gg  meancentered
0      Tom     20   1    -40.333333
1      gik     21   1    -39.333333
2      Tom     19   1    -41.333333
3      Tom     18  30    -42.333333
4      Terry   88  30     27.666667
5      Jerry   89  30     28.666667
6      Abel    95  40     34.666667
7      Dula    96  40     35.666667
8      Abel    97  40     36.666667

How to do cumulative mean and count in an easy way

I have the following dataframe in pandas:
import pandas as pd

data = {'call_put': ['C', 'C', 'P', 'C', 'P'],
        'price': [10, 20, 30, 40, 50],
        'qty': [11, 12, 11, 14, 9]}
df = pd.DataFrame(data)
df['amt'] = df.price * df.qty
call_put price qty amt
0 C 10 11 110
1 C 20 12 240
2 P 30 11 330
3 C 40 14 560
4 P 50 9 450
I want output something like the following, based on whether the call_put value is 'C' or 'P', with the count, median and sum calculated cumulatively as follows:
call_put price qty amt cummcount cummmedian cummsum
C 10 11 110 1 110 110
C 20 12 240 2 175 ((110+240)/2 ) 350
P 30 11 330 1 330 680
C 40 14 560 3 303.33 (110+240+560)/3 1240
P 50 9 450 2 390 ((330+450)/2) 1690
Can it be done in some easy way without creating additional dataframes and functions?
Create a grouped element named g and use df.assign to assign values:
g = df.groupby('call_put')
final = df.assign(cum_count=g.cumcount().add(1),
                  cummedian=g['amt'].expanding().mean().reset_index(drop=True),
                  cum_sum=df.amt.cumsum())
call_put price qty amt cum_count cummedian cum_sum
0 C 10 11 110 1 110.000000 110
1 C 20 12 240 2 175.000000 350
2 P 30 11 330 1 303.333333 680
3 C 40 14 560 3 330.000000 1240
4 P 50 9 450 2 390.000000 1690
Note: for P, the cummedian should be 390, since (330 + 450) / 2 = 390.
For cum_count look at df.groupby.cumcount(), for cummedian check how expanding() works, and for cum_sum check df.cumsum().
IIUC, this should work:
df['cumcount'] = df.groupby('call_put').cumcount() + 1
# transform keeps the expanding mean aligned with the original rows
df['cummedian'] = df.groupby('call_put')['amt'].transform(lambda s: s.expanding().mean())
df['cumsum'] = df['amt'].cumsum()
Thanks, the following solution is fine:
g = df.groupby('call_put')
final = df.assign(cum_count=g.cumcount().add(1),
                  cummedian=g['amt'].expanding().mean().reset_index(drop=True),
                  cum_sum=df.amt.cumsum())
If I run the following without drop=True:
g['amt'].expanding().mean().reset_index()
why is the output showing level_1?
call_put level_1 amt
0 C 0 110.000000
1 C 1 175.000000
2 C 3 303.333333
3 P 2 330.000000
4 P 4 390.000000
g['amt'].expanding().mean().reset_index(drop=True)
0 110.000000
1 175.000000
2 303.333333
3 330.000000
4 390.000000
Name: amt, dtype: float64
Can you please explain in more detail?
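In case a short explanation helps (this is my reading of the pandas behavior): g['amt'].expanding().mean() returns a Series indexed by a MultiIndex of ('call_put', original row label). reset_index() turns every index level into a column, and because the second level has no name, pandas labels it level_1. reset_index(drop=True) discards both levels instead, leaving the plain RangeIndex you see in the second output.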
How do you add one more condition to the groupby clause?
g = df.groupby('call_put', 'price' < 50)
TypeError: '<' not supported between instances of 'str' and 'int'
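As a hedged sketch: groupby does not accept a boolean condition as a second positional argument (which is why 'price' < 50 ends up comparing the string 'price' with the integer 50); filter the rows first, then group:
g = df[df['price'] < 50].groupby('call_put')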

Plot individuals' home-range with Adehabitat

I am trying to put the names of the individuals from my research into a polygon home-range plot, but after many attempts I still cannot achieve it.
Here is an example of my data: X and Y are coordinates and id identifies the individuals.
X Y id
29 29 4
44 28 7
57 57 5
60 81 11
32 41 4
43 29 7
57 57 5
46 83 11
32 41 4
43 29 7
57 56 5
60 82 11
35 40 4
43 28 7
62 55 5
54 73 11
27 40 4
43 28 7
61 54 5
First, I calculated the home range of my data with MCP (minimum convex polygon):
cp <- mcp((data)[,1],percent=95, unin = c("m"), unout = c("m2"))
And then plotted it:
plot(cp, axes=TRUE, border = rainbow(12))
But I don't know which polygon corresponds to each individual, and if possible I need to include the id of my individuals inside each polygon.
Any help would be appreciated!!
Thanks
Juan
Here is an example using the example data from the adehabitatHR package, since you did not provide a reproducible example.
library(adehabitatHR)
data("puechabonsp")
cp <- mcp(puechabonsp$relocs[, 1], percent=95, unin = c("m"), unout = c("m2"))
One way would be to use ggplot2 and sf:
library(sf)
library(tidyverse)
st_as_sf(cp) %>% ggplot(., aes(fill = id)) + geom_sf(alpha = 0.5) +
  scale_fill_discrete(name = "Animal id")
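If you also want the id drawn inside each polygon, ggplot2's geom_sf_text should work here as a hedged addition, e.g. appending + geom_sf_text(aes(label = id)) to the plot above; by default it places each label on the polygon's surface.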

Find the total number of links to and from a node based on data in a CSV

I have a CSV with the following info:
Src Rx LinkId Weight
===================================
2 1 4000 10
2 1 4056 15
3 1 4100 10
3 1 4156 15
28 1 10650 8
113 2 15051 205
113 3 15058 205
1 4 3952 9
1 4 3951 5
1 4 3950 34
2 4 4052 9
47 4 18672 44
47 4 18670 38
69 4 4701 11
69 4 4700 21
70 4 4801 11
The LinkId is unique. Each row represents the link between two devices; for example, Src 2 and Rx 1 means that a link goes from 2 to 1.
I intend to compute the total weight of all the links originating from each device and coming into each device, like so:
Device Out weight In weight
=============================
2 25 205
1 48 58
and so on.
I would like to know if doing this is possible in Excel. If yes, how?
Using a pivot table may be the best solution here; if you select this table and insert a pivot table, it should give you your answer.
Alternatively, you can make a column each for the in and out weights and use =SUMIF(Src, 1, Weight) for the out weight of device 1 (and correspondingly =SUMIF(Rx, 1, Weight) for its in weight), then use the totals at the bottom of each column.
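If you are open to doing it outside Excel, a minimal pandas sketch (assuming the CSV has the four columns shown above; the filename links.csv is hypothetical):
import pandas as pd

df = pd.read_csv('links.csv')  # hypothetical filename
out_w = df.groupby('Src')['Weight'].sum().rename('Out weight')
in_w = df.groupby('Rx')['Weight'].sum().rename('In weight')
# Align the two sums on the device id; devices missing from one side get 0.
result = pd.concat([out_w, in_w], axis=1).fillna(0).rename_axis('Device')
print(result)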
