How to do mean centering on a dataframe in pandas? - python-3.x

I want to use two transformation techniques on a data frame, mean centering and standardization. How can I perform the mean centering method on my dataframe?
I have performed standardization using StandardScaler() from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler()
standard.iloc[:,1:-1] = StandardScaler().fit_transform(standard.iloc[:,1:-1])
I am expecting a transformed data frame which is mean-centered

dataxx = {'Name':['Tom', 'gik','Tom','Tom','Terry','Jerry','Abel','Dula','Abel'],
'Age':[20, 21, 19, 18,88,89,95,96,97],'gg':[1, 1,1, 30, 30,30,40,40,40]}
dfxx = pd.DataFrame(dataxx)
dfxx["meancentered"] = dfxx.Age - dfxx.Age.mean()
Index
Name
Age
gg
meancentered
0
Tom
20
1
-40.333333
1
gik
21
1
-39.333333
2
Tom
19
1
-41.333333
3
Tom
18
30
-42.333333
4
Terry
88
30
27.666667
5
Jerry
89
30
28.666667
6
Abel
95
40
34.666667
7
Dula
96
40
35.666667
8
Abel
97
40
36.666667

Related

Analysis on dataframe with python

I want to be able to calculate the average 'goal','shot',and 'miss' per shooterName to use for further analysis and visualization
The code below gives me the count of the 3 attributes(shot,goal,miss) in the 'event' column sorted by 'shooterName'
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft['count'] = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to be able to calculate each 'event' attribute as a percentage of the total 'event' by 'shooterName'. Below I added a column '%totalshooterNameevents' which is 'simply goal', 'shot', and 'miss' calculated by the sum of the 'goal, shot, and miss' per each 'shooterName'
Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
dft['%total'] = dft.groupby('shooterName')['count'].apply(lambda x: x / sum(x))
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
Without sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np
# Setup a Minimal Reproducible Example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
'event': np.random.choice(['shot', 'goal', 'miss'], 20)})
# Create an empty dataframe?
dft = pd.DataFrame(index=df['shooterName'].unique())
# Do stuff
grp = df.groupby('shooterName')
dft['count'] = grp.count()
dft = dft.join(grp['event'].value_counts().unstack('event')
.div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000

Creating a list from series of pandas

Click here for the imageI m trying to create a list from 3 different series which will be of the shape "({A} {B} {C})" where A denotes the 1st element from series 1, B is for 1st element from series 2, C is for 1st element from series 3 and this way it should create a list containing 600 element.
List 1 List 2 List 3
u_p0 1 v_p0 2 w_p0 7
u_p1 21 v_p1 11 w_p1 45
u_p2 32 v_p2 25 w_p2 32
u_p3 45 v_p3 76 w_p3 49
... .... ....
u_p599 56 v_p599 78 w_599 98
Now I want the output list as follows
(1 2 7)
(21 11 45)
(32 25 32)
(45 76 49)
.....
These are the 3 series I created from a dataframe
r1=turb_1.iloc[qw1] #List1
r2=turb_1.iloc[qw2] #List2
r3=turb_1.iloc[qw3] #List3
Pic of the seriesFor the output I think formatted string python method will be useful but I m quite not sure how to proceed.
turb_3= ["({A} {B} {C})".format(A=i,B=j,C=k) for i in r1 for j in r2 for k in r3]
Any kind of help will be useful.
Use pandas.DataFrame.itertuples with str.format:
# Sample data
print(df)
col1 col2 col3
0 1 2 7
1 21 11 45
2 32 25 32
3 45 76 49
fmt = "({} {} {})"
[fmt.format(*tup) for tup in df[["col1", "col2", "col3"]].itertuples(False, None)]
Output:
['(1 2 7)', '(21 11 45)', '(32 25 32)', '(45 76 49)']

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns listed by date, what I'm trying to do is find out the most frequently occurring numbers across the whole data set, also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject) :
# clean up special characters in the column names and make the date column the index as a date type.
pdobject["Date"] = pd.to_datetime(pdobject["Date"])
pdobject = pdobject.set_index('Date')
for column in pdobject:
if '#' in column:
pdobject = pdobject.rename(columns={column:column.replace('#','')})
return pdobject
def lotimport() :
lotret = {}
# list files in data directory with csv filename
for lotpath in [f for f in glob.glob("data/*.csv")]:
lotname = lotpath.split('\\')[1].split('.')[0]
lotret[lotname] = lotnorm(pd.read_csv(lotpath))
return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
print(f"{i} - {type(i)}")
lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)
for columns
df.max(axis=1)
for index
Ok so the final answer I came up with was a mix of a few things including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and then turn the values into the index, after that sort the column by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject) :
# clean up special characters in the column names and make the date column the index as a date type.
pdobject["Date"] = pd.to_datetime(pdobject["Date"])
pdobject = pdobject.set_index('Date')
for column in pdobject:
if '#' in column:
pdobject = pdobject.rename(columns={column:column.replace('#','')})
return pdobject
def lotimport() :
lotret = {}
# list files in data directory with csv filename
for lotpath in [f for f in glob.glob("data/*.csv")]:
lotname = lotpath.split('\\')[1].split('.')[0]
lotret[lotname] = lotnorm(pd.read_csv(lotpath))
return lotret
lotcomb = pd.Series([],dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)

How can I add an X axis showing plot data seconds to a matplotlib pyplot price volume graph?

The code below plots a price volume chart using data from a tab separated csv file. Each row contains values for those columns: IDX, TRD, TIMESTAMPMS, VOLUME and PRICE. As is, the X axis shows the IDX value. I would like the X axis to display the seconds computed from the timestamp in milliseconds attached to each row. How can this be obtained ?
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
data = pd.read_csv('secondary-2018-08-12-21-32-56.csv', index_col=0, sep='\t')
print(data.head(50))
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(10,5))
ax[0].plot(data.index, data['PRICE'])
ax[1].bar(data.index, data['VOLUME'])
plt.show()
The drawn graph looks like this:
Here are the data as displayed by the
print(data.head(50))
instruction:
TRD TIMESTAMPMS VOLUME PRICE
IDX
1 4 1534102380000 0.363583 6330.41
2 20 1534102381000 5.509219 6329.13
3 3 1534102382000 0.199049 6328.69
4 5 1534102383000 1.055055 6327.36
5 2 1534102384000 0.006343 6328.26
6 4 1534102385000 0.167502 6330.38
7 1 1534102386000 0.002039 6326.69
8 0 1534102387000 0.000000 6326.69
9 4 1534102388000 0.163813 6327.62
10 2 1534102389000 0.007060 6326.66
11 4 1534102390000 0.015489 6327.64
12 5 1534102391000 0.035618 6328.35
13 2 1534102392000 0.006003 6330.12
14 5 1534102393000 0.172913 6328.77
15 1 1534102394000 0.019972 6328.03
16 3 1534102395000 0.007429 6328.03
17 1 1534102396000 0.000181 6328.03
18 3 1534102397000 1.041483 6328.03
19 2 1534102398000 0.992897 6328.74
20 3 1534102399000 0.061871 6328.11
21 2 1534102400000 0.000123 6328.77
22 4 1534102401000 0.028650 6330.25
23 2 1534102402000 0.035504 6330.01
24 3 1534102403000 0.982527 6330.11
25 5 1534102404000 0.298366 6329.11
26 2 1534102405000 0.071119 6330.06
27 3 1534102406000 0.025547 6330.02
28 2 1534102407000 0.003413 6330.11
29 4 1534102408000 0.431217 6330.05
30 3 1534102409000 0.021627 6330.23
31 1 1534102410000 0.009661 6330.28
32 1 1534102411000 0.004209 6330.27
33 1 1534102412000 0.000603 6328.07
34 6 1534102413000 0.655872 6330.31
35 1 1534102414000 0.000452 6328.09
36 7 1534102415000 0.277340 6328.07
37 8 1534102416000 0.768351 6328.04
38 1 1534102417000 0.078893 6328.20
39 2 1534102418000 0.000446 6326.24
40 2 1534102419000 0.317381 6326.83
41 2 1534102420000 0.100009 6326.24
42 2 1534102421000 0.000298 6326.25
43 6 1534102422000 0.566820 6330.00
44 1 1534102423000 0.000060 6326.30
45 2 1534102424000 0.047524 6326.30
46 4 1534102425000 0.748773 6326.61
47 3 1534102426000 0.007656 6330.23
48 1 1534102427000 0.000019 6326.32
49 1 1534102428000 0.000014 6326.34
50 0 1534102429000 0.000000 6326.34
I believe you need to data.setindex('TIMESTAMPMS') to get the axis to autoscale
I dont know if i understood you correctly, try with:
data['TIMESTAMPMS'] = data['TIMESTAMPMS']/1000
ax[0].plot(data['TIMESTAMPMS'], data['PRICE'])
ax[1].bar(data['TIMESTAMPMS'], data['VOLUME'])

Grouping and Multiindexing a pandas dataframe

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want that it and should have a multi-index in the shape of ["army","company"] only.
How will I proceed?
If I understand correctly:
You can use map to find values in a dictionary (using dictionary comprehension to swap key/value pairs since they are backwards):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
regiment name preTestScore postTestScore
army company
Majors 1st Nighthawks Miller 4 25
1st Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
2nd Nighthawks Milner 2 62
Captains 1st Dragoons Cooze 3 70

Resources