How to add another column when looping over a dataframe [duplicate] - python-3.x

My pandas dataframe looks like this:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 32917 271 88172 Male
2 18273 552 90291 Female
I want to replicate every row 3 times and reset the index to get:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
I tried solutions such as:
pd.concat([df[:5]]*3, ignore_index=True)
And:
df.reindex(np.repeat(df.index.values, df['ID']), method='ffill')
But neither worked: the concat call stacks whole copies of the frame (giving block order rather than consecutive repeats), and the reindex call repeats each row by its ID value (882, 271, 552 times) instead of 3 times.

Solutions:
Use np.repeat:
Version 1:
Try using np.repeat:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)
The above code will output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
np.repeat repeats the values of df 3 times along axis 0 (row-wise).
Then we restore the column names by assigning newdf.columns = df.columns.
Version 2:
You could also assign the column names in the first line, like below:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Version 3:
You could shorten it with loc and only repeat the index, like below:
newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
I use reset_index to replace the old index with a fresh sequential one (0, 1, 2, 3, 4, ...).
Without np.repeat:
Version 4:
You could use the built-in Index.repeat method on df.index, like below:
newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Remember to chain reset_index to renumber the repeated index.
Version 5:
Or by using concat with sort_index, like below:
newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Version 6:
You could also use loc with Python list multiplication and sorted, like below:
newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Timings:
Timed with the following code:
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame({'Person': {0: 12345, 1: 32917, 2: 18273}, 'ID': {0: 882, 1: 271, 2: 552}, 'ZipCode': {0: 38182, 1: 88172, 2: 90291}, 'Gender': {0: 'Female', 1: 'Male', 2: 'Female'}})
def version1():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
    newdf.columns = df.columns

def version2():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)

def version3():
    newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)

def version4():
    newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)

def version5():
    newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)

def version6():
    newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print('Version 1 Speed:', timeit.timeit('version1()', 'from __main__ import version1', number=20000))
print('Version 2 Speed:', timeit.timeit('version2()', 'from __main__ import version2', number=20000))
print('Version 3 Speed:', timeit.timeit('version3()', 'from __main__ import version3', number=20000))
print('Version 4 Speed:', timeit.timeit('version4()', 'from __main__ import version4', number=20000))
print('Version 5 Speed:', timeit.timeit('version5()', 'from __main__ import version5', number=20000))
print('Version 6 Speed:', timeit.timeit('version6()', 'from __main__ import version6', number=20000))
Output:
Version 1 Speed: 9.879425965991686
Version 2 Speed: 7.752138633004506
Version 3 Speed: 7.078321029010112
Version 4 Speed: 8.01169377300539
Version 5 Speed: 19.853051771002356
Version 6 Speed: 9.801617017001263
We can see that Versions 2 & 3 are faster than the others. The reason is that both lean on np.repeat, and NumPy functions are very fast because they are implemented in C.
Version 3 edges out Version 2 marginally because it repeats only the index and selects with loc, instead of constructing a new DataFrame from df.values.
Version 5 is significantly slower because of concat and sort_index: concat makes a full copy of the data on every call, and sort_index adds a sort on top.
Fastest version: Version 3.

These repeat each row by position and preserve the columns, as the OP demonstrated:
iloc version 1
df.iloc[np.arange(len(df)).repeat(3)]
iloc version 2
df.iloc[np.arange(len(df) * 3) // 3]
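Both iloc variants build the same positional indexer; a quick illustration of the arrays they produce (plain NumPy, using the 3-row frame from the question):
import numpy as np
print(np.arange(3).repeat(3))   # [0 0 0 1 1 1 2 2 2] -- repeat each position 3 times
print(np.arange(3 * 3) // 3)    # [0 0 0 1 1 1 2 2 2] -- integer division gives the same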

Using concat:
pd.concat([df]*3).sort_index()
Out[129]:
Person ID ZipCode Gender
0 12345 882 38182 Female
0 12345 882 38182 Female
0 12345 882 38182 Female
1 32917 271 88172 Male
1 32917 271 88172 Male
1 32917 271 88172 Male
2 18273 552 90291 Female
2 18273 552 90291 Female
2 18273 552 90291 Female

I'm not sure why this was never proposed, but you can easily use df.index.repeat in conjunction with .loc:
new_df = df.loc[df.index.repeat(3)]
Output:
>>> new_df
Person ID ZipCode Gender
0 12345 882 38182 Female
0 12345 882 38182 Female
0 12345 882 38182 Female
1 32917 271 88172 Male
1 32917 271 88172 Male
1 32917 271 88172 Male
2 18273 552 90291 Female
2 18273 552 90291 Female
2 18273 552 90291 Female
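If you also want a fresh 0..8 index as in the expected output, chain reset_index(drop=True), exactly as in Version 4 above:
new_df = df.loc[df.index.repeat(3)].reset_index(drop=True)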

You can try the following code:
df = df.iloc[df.index.repeat(3), :].reset_index(drop=True)
df.index.repeat(3) creates an index in which each original index value is repeated 3 times, and df.iloc[df.index.repeat(3), :] builds a dataframe with the rows in exactly the order given by that index. Passing drop=True to reset_index renumbers the rows without keeping the old index as a column.

You can do it like this (note: DataFrame.append was removed in pandas 2.0, so pd.concat is used here in its place):
import numpy as np
import pandas as pd

def do_things(df, n_times):
    # Append n_times repeats of each name to the original frame.
    repeats = pd.DataFrame({'name': np.repeat(df.name.values, n_times)})
    ndf = pd.concat([df, repeats])
    ndf = ndf.sort_values(by='name')
    ndf = ndf.reset_index(drop=True)
    return ndf

if __name__ == '__main__':
    df = pd.DataFrame({'name': ['Peter', 'Quill', 'Jackson']})
    n_times = 3
    print(do_things(df, n_times))
Note that this keeps the original rows too, so each name ends up n_times + 1 times in the result.
And with explanation...
import pandas as pd
import numpy as np
n_times = 3
df = pd.DataFrame({'name' : ['Peter', 'Quill', 'Jackson']})
# name
# 0 Peter
# 1 Quill
# 2 Jackson
# Duplicating the data: the original 3 rows plus 3 repeats of each name.
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it.)
df = pd.concat([df, pd.DataFrame({'name': np.repeat(df.name.values, n_times)})])
# name
# 0 Peter
# 1 Quill
# 2 Jackson
# 0 Peter
# 1 Peter
# 2 Peter
# 3 Quill
# 4 Quill
# 5 Quill
# 6 Jackson
# 7 Jackson
# 8 Jackson
# The DataFrame is sorted by 'name' column.
df = df.sort_values(by=['name'])
# name
# 2 Jackson
# 6 Jackson
# 7 Jackson
# 8 Jackson
# 0 Peter
# 0 Peter
# 1 Peter
# 2 Peter
# 1 Quill
# 3 Quill
# 4 Quill
# 5 Quill
# Resetting the index.
# You can play with drop=True and drop=False as parameters of reset_index().
df = df.reset_index()
# index name
# 0 2 Jackson
# 1 6 Jackson
# 2 7 Jackson
# 3 8 Jackson
# 4 0 Peter
# 5 0 Peter
# 6 1 Peter
# 7 2 Peter
# 8 1 Quill
# 9 3 Quill
# 10 4 Quill
# 11 5 Quill

If you need to index your repeats (e.g. for a multi-index) and also base the number of repeats on a value in a column, you can do this:
# Build a list [0, 1, ..., n-1] per row, where n comes from the RepeatBasis column.
someDF["RepeatIndex"] = someDF["RepeatBasis"].fillna(value=0).apply(lambda x: list(range(int(x))) if x > 0 else [])
# explode() emits one row per list element; rows with an empty list become NaN and are dropped.
superDF = someDF.explode("RepeatIndex").dropna(subset="RepeatIndex")
This gives a DataFrame in which each record is repeated however many times is indicated in the "RepeatBasis" column. The DataFrame also gets a "RepeatIndex" column, which you can combine with the existing index to make into a multi-index, preserving index uniqueness.
If anyone's wondering why you'd want to do such a thing, in my case it's when I get data in which frequencies have already been summarized and for whatever reason, I need to work with singular observations. (think of reverse-engineering a histogram)
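A minimal worked sketch of the idea (hypothetical column names, mirroring the snippet above):
import pandas as pd
# Summarized frequencies, as in the reverse-histogram example.
someDF = pd.DataFrame({"value": [10, 20, 30], "RepeatBasis": [2, 0, 3]})
someDF["RepeatIndex"] = someDF["RepeatBasis"].fillna(0).apply(lambda x: list(range(int(x))) if x > 0 else [])
superDF = someDF.explode("RepeatIndex").dropna(subset="RepeatIndex")
print(superDF)  # value 10 appears twice, value 30 three times, value 20 not at all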

This question doesn't have enough answers yet! Here are some more ways to do this that are still missing and that allow chaining :)
# SQL-style cross-join
# (one line and counts replicas)
(
    data
    .join(pd.DataFrame(range(3), columns=["replica"]), how="cross")
    .drop(columns="replica")  # remove if you want to count replicas
)
# DataFrame.apply + Series.repeat
# (most readable, but potentially slow)
(
    data
    .apply(lambda x: x.repeat(3))
    .reset_index(drop=True)
)
# DataFrame.explode
# (fun to have explosions in your code)
(
    data
    .assign(replica=lambda df: [[x for x in range(3)]] * len(df))
    .explode("replica", ignore_index=True)
    .drop(columns="replica")  # or keep if you want to know which copy it is
)
(Edit: On a more serious note, using explode is useful if you need to count replicas and have a dynamic replica count per row. For example, if you have per-customer usage data with a start and end date, you can use the above to transform the data into monthly per-customer usage data.)
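For instance, here is a minimal sketch of that start/end-date transformation (hypothetical columns; pd.period_range builds the per-row list of months, so the replica count varies by row):
import pandas as pd
usage = pd.DataFrame({
    "customer": ["A", "B"],
    "start": pd.to_datetime(["2023-01-15", "2023-03-01"]),
    "end": pd.to_datetime(["2023-03-10", "2023-04-20"]),
})
monthly = (
    usage
    .assign(month=lambda df: [list(pd.period_range(s, e, freq="M"))
                              for s, e in zip(df["start"], df["end"])])
    .explode("month", ignore_index=True)
)
print(monthly)  # customer A gets 3 rows (Jan-Mar), customer B gets 2 (Mar-Apr)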
And of course here is the snippet to create the data for testing:
data = pd.DataFrame(
    [
        [12345, 882, 38182, "Female"],
        [32917, 271, 88172, "Male"],
        [18273, 552, 90291, "Female"],
    ],
    columns=["Person", "ID", "ZipCode", "Gender"],
)

Use pd.concat: stack three copies of the same DataFrame; it doesn't take a lot of code:
df = pd.concat([df] * 3, ignore_index=True)
print(df)
Person ID ZipCode Gender
0 12345 882 38182 Female
1 32917 271 88172 Male
2 18273 552 90291 Female
3 12345 882 38182 Female
4 32917 271 88172 Male
5 18273 552 90291 Female
6 12345 882 38182 Female
7 32917 271 88172 Male
8 18273 552 90291 Female
Note: ignore_index=True resets the index. The copies come out in block order (the whole frame three times) rather than with each row repeated consecutively as in the question's expected output.
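If you need each row repeated consecutively instead, sort on the original index before dropping it (the same trick as Version 5 above):
df = pd.concat([df] * 3).sort_index().reset_index(drop=True)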

You could also use np.tile(); because tile lays the index out as 0, 1, 2, 0, 1, 2, ..., sort_index is needed to group the repeats together:
df.loc[np.tile(df.index,3)].sort_index().reset_index(drop=True)
Output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
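For contrast, a quick sketch of why the sort is needed with tile but not with repeat:
import numpy as np
idx = np.array([0, 1, 2])
print(np.repeat(idx, 3))  # [0 0 0 1 1 1 2 2 2] -- already grouped
print(np.tile(idx, 3))    # [0 1 2 0 1 2 0 1 2] -- interleaved, hence sort_index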

Related

Changing multiindex in a pandas series?

I have a dataframe like this:
mainid pidl pidw score
0 Austria 1 533
1 Canada 2 754
2 Canada 3 267
3 Austria 4 852
4 Taiwan 5 124
5 Slovakia 6 344
6 Spain 7 1556
7 Taiwan 8 127
I want to select top 5 pidw for each pidl.
After grouping by the 'pidl' column and sorting the score in descending order within each group, I got the following series, s:
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5)
pidl pidl pidw score
Austria Austria 49 948
47 859
48 855
50 807
46 727
Belgium Belgium 15 2339
14 1861
45 1692
16 1626
46 1423
Name: score, dtype: float64
The result looks correct, but I wish I could remove the duplicated 'pidl' level from this series.
I have tried
s.reset_index('pidl')
to get 'ValueError: The name pidl occurs multiple times, use a level number'.
and
s.to_frame().reset_index()
ValueError: cannot insert pidl, already exists.
so I am not sure how to proceed about it.
Use the group_keys=False parameter in DataFrame.groupby:
s= df.set_index(['pidl', 'pidw']).groupby('pidl', group_keys=False)['score'].nlargest(5)
print (s)
pidl pidw
Austria 4 852
1 533
Canada 2 754
3 267
Slovakia 6 344
Spain 7 1556
Taiwan 8 127
5 124
Name: score, dtype: int64
Or add Series.droplevel to remove the first level (pandas counts levels from 0, hence droplevel(0)):
s= df.set_index(['pidl', 'pidw']).groupby('pidl')['score'].nlargest(5).droplevel(0)

Merge 2 pandas dataframes - python

I have 2 pandas dataframes:
data1 =
sample ID  name  sex
0          a     male
1          b     male
2          c     male
3          d     male
4          e     male
data2 =
samples  Diabetic  age
0        yes       43
1        yes       50
2        no        63
3        no        21
4        yes       44
I want to merge both dataframes to end up with the following dataframe:
samples  Diabetic  age  name  sex
0        yes       43   a     male
1        yes       50   b     male
2        no        63   c     male
3        no        21   d     male
4        yes       44   e     male
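A minimal sketch of one way to do this, assuming 'sample ID' in data1 and 'samples' in data2 refer to the same key:
import pandas as pd
merged = pd.merge(data2, data1, left_on='samples', right_on='sample ID').drop(columns='sample ID')
print(merged)  # samples, Diabetic, age, name, sex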

How can I find the sum of certain columns and find avg of other columns in python?

I've combined 10 Excel files, each with 1 yr of NFL passing stats. There are certain columns (games played, completions, attempts, etc.) that I have summed, but there are others (passer rating and QBR) that I'd like to see the avg for.
df3 = df3.groupby(['Player'],as_index=False).agg({'GS':'sum' ,'Cmp': 'sum', 'Att': 'sum','Cmp%': 'sum','Yds': 'sum','TD': 'sum','TD%': 'sum', 'Int': 'sum', 'Int%': 'sum','Y/A': 'sum', 'AY/A': 'sum','Y/C': 'sum','Y/G':'sum','Rate':'sum','QBR':'sum','Sk':'sum','Yds.1':'sum','NY/A': 'sum','ANY/A': 'sum','Sk%':'sum','4QC':'sum','GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the issue. No one is really going to take the time to reconstruct your dataset from a photo (or I should say rarely, as I did do that here... because I love working with sports data, and I could grab it relatively quickly).
But to get averages instead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd

df = pd.DataFrame()
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    df = pd.concat([df, pd.read_html(url)[0]], sort=False)
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
# regex=True is required in newer pandas, where str.replace defaults to literal matching.
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()
strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]
df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)
df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000

ValueError in onehotencoding

I'm not able to encode this column
Sex
male
female
female
female
male
male
male
male
female
female
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder=LabelEncoder()
X[:,2]=labelencoder.fit_transform(X[:,2])
ohe=OneHotEncoder(categorical_features=X[2])
ohe.fit_transform(X)
I'm getting this error.
could not convert string to float: 'male'
Can anyone help me with this?
Demo:
In [6]: df
Out[6]:
Sex
0 male
1 female
2 female
3 female
4 male
5 male
6 male
7 male
8 female
9 female
In [7]: le = LabelEncoder()
In [8]: df['Sex'] = le.fit_transform(df['Sex'])
In [9]: df
Out[9]:
Sex
0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
9 0
In [10]: df.dtypes
Out[10]:
Sex int64
dtype: object
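Worth noting: the categorical_features argument was removed from OneHotEncoder in newer scikit-learn versions. A minimal sketch of the current approach, assuming a DataFrame with a 'Sex' column and using ColumnTransformer to target it (no LabelEncoder step needed):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})
ct = ColumnTransformer([('onehot', OneHotEncoder(), ['Sex'])], remainder='passthrough')
encoded = ct.fit_transform(df)  # one 0/1 column per category
print(encoded)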

Create column on two conditions with pandas

I'm using pandas for an analysis exercise. I want to create a new column whose value is the sum of two rows. The original data set is as follows...
Admit Gender Dept Freq
0 Admitted Male A 512
1 Rejected Male A 313
2 Admitted Female A 89
3 Rejected Female A 19
4 Admitted Male B 353
5 Rejected Male B 207
6 Admitted Female B 17
7 Rejected Female B 8
8 Admitted Male C 120
9 Rejected Male C 205
10 Admitted Female C 202
11 Rejected Female C 391
12 Admitted Male D 138
13 Rejected Male D 279
14 Admitted Female D 131
15 Rejected Female D 244
16 Admitted Male E 53
17 Rejected Male E 138
18 Admitted Female E 94
19 Rejected Female E 299
20 Admitted Male F 22
21 Rejected Male F 351
22 Admitted Female F 24
23 Rejected Female F 317
I want to create a new column utilizing the following data frame...
Dept Gender Freq
0 A Female 108
1 A Male 825
2 B Female 25
3 B Male 560
4 C Female 593
5 C Male 325
6 D Female 375
7 D Male 417
8 E Female 393
9 E Male 191
10 F Female 341
11 F Male 373
I want to create a new column in the first data frame utilizing the Freq column of the second data frame. I need to insert the value 108 if Dept and Gender are the same in both data frames. The new data frame should look like this...
Admit Gender Dept Freq Total
0 Admitted Male A 512 825
1 Rejected Male A 313 825
2 Admitted Female A 89 108
3 Rejected Female A 19 108
4 Admitted Male B 353 560
5 Rejected Male B 207 560
6 Admitted Female B 17 25
7 Rejected Female B 8 25
I have tried the following code...
for i in data.iterrows():
    for j in total_freq.iterrows():
        if i[1].Gender == total_freq.Gender & i[1].Dept == total_freq.Dept:
            data['Total'] = total_freq.Freq
I get the following error... TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
Any help to create the column with the correct values?
You can use transform, which computes each group's sum and broadcasts it back onto every row of that group:
df['Total'] = df.groupby(['Dept', 'Gender']).Freq.transform('sum')
You get
Admit Gender Dept Freq Total
0 Admitted Male A 512 825
1 Rejected Male A 313 825
2 Admitted Female A 89 108
3 Rejected Female A 19 108
4 Admitted Male B 353 560
5 Rejected Male B 207 560
6 Admitted Female B 17 25
7 Rejected Female B 8 25
8 Admitted Male C 120 325
9 Rejected Male C 205 325
10 Admitted Female C 202 593
11 Rejected Female C 391 593
12 Admitted Male D 138 417
13 Rejected Male D 279 417
14 Admitted Female D 131 375
15 Rejected Female D 244 375
16 Admitted Male E 53 191
17 Rejected Male E 138 191
18 Admitted Female E 94 393
19 Rejected Female E 299 393
20 Admitted Male F 22 373
21 Rejected Male F 351 373
22 Admitted Female F 24 341
23 Rejected Female F 317 341
You can use pandas.DataFrame.merge() to left join your totals from the second dataframe onto the first. First, rename Freq in the totals df, and keep the key columns on the right-hand side so the join keys exist in both frames:
df1 = df1.rename(columns={'Freq': 'Total'})
df_totals = pd.merge(df, df1[['Gender', 'Dept', 'Total']], how='left', on=['Gender', 'Dept'])
