How to sum columns in pandas and add the result into a new row? - python-3.x

In this code I want to sum each column and add it as a new row.
It does the sum but it does not show the new row.
df = pd.DataFrame(g, columns=('AWA', 'REM', 'S1', 'S2'))
df['xSujeto'] = df.sum(axis=1)
xEstado = df.sum(axis=0)
df.append(xEstado, ignore_index=True)
df

I think you can use loc:
df = pd.DataFrame({'AWA':[1,2,3],
                   'REM':[4,5,6],
                   'S1':[7,8,9],
                   'S2':[1,3,5]})
#add 1 to last index value
print (df.index[-1] + 1)
3
df.loc[df.index[-1] + 1] = df.sum()
print (df)
AWA REM S1 S2
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
3 6 15 24 9
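A small aside (my addition, not part of the original answer): when the DataFrame has the default RangeIndex, len(df) is the next free label, so the same line can also be written as:
# equivalent when the index is the default 0..n-1 RangeIndex
df.loc[len(df)] = df.sum()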
Or use append, as suggested in the comment by Nickil Maveli:
xEstado = df.sum()
df = df.append(xEstado, ignore_index=True)
print (df)
AWA REM S1 S2
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
3 6 15 24 9
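Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea can be written with pd.concat (a minimal sketch):
# equivalent on pandas >= 2.0, where DataFrame.append no longer exists
xEstado = df.sum()
df = pd.concat([df, xEstado.to_frame().T], ignore_index=True)
print(df)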

Related

Create new column with a list of max frequency values for each row of a pandas dataframe

Given this Dataframe:
df2 = pd.DataFrame([[3,3,3,3,3,3,5,5,5,5],[2,2,2,2,8,8,8,8,6,6]], columns=list('ABCDEFGHIJ'))
A B C D E F G H I J
0 3 3 3 3 3 3 5 5 5 5
1 2 2 2 2 8 8 8 8 6 6
I created 2 new columns which give, for each row, the max_freq and the max_freq_val:
df2["max_freq_val"] = df2.apply(lambda x: x.mode().agg(list), axis=1)
df2["max_freq"] = df2.loc[:, df2.columns != "max_freq_val"].apply(lambda x: x.value_counts().max(), axis=1)
A B C D E F G H I J max_freq_val max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
EDIT: I've edited my code inspired by the answer given by #rhug123.
Thanks to all of you for your answers.
Try this; it uses mode():
df2.assign(max_freq=pd.Series(df2.mode(axis=1).stack().groupby(level=0).agg(list)),
           max_freq_value=df2.eq(df2.mode(axis=1)[0].squeeze(), axis=0).sum(axis=1))
or
# the walrus operator (:=) requires Python 3.8+
df2.assign(freq=df2.eq((s := df2.mode(axis=1).stack().groupby(level=0).agg(list)).str[0], axis=0).sum(axis=1),
           val=s)
We can try stack, then compute the frequency per row and use agg to collect multiple modes into a list:
s = df2.stack().groupby(level=0).value_counts()
# note: Series.max(level=0) is deprecated in recent pandas versions
s = s[s.eq(s.max(level=0), level=0)].reset_index(level=1).groupby(level=0).agg(val=('level_1', list), fre=(0, 'first'))
df2 = df2.join(s)
df2
Out[156]:
A B C D E F G H I J val fre
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Perhaps you could use this function:
import numpy as np

def give_back_maximums(a=[2,2,2,2,8,8,8,8,6,6]):
    values, counts = np.unique(a, return_counts=True)
    return values[counts >= counts.max()].tolist()
Note: when computing max_freq, exclude the newly added list column, otherwise value_counts fails on the unhashable lists:
df2["max_freq_value"] = df2.apply(lambda x: give_back_maximums(x), axis=1)
df2["max_freq"] = df2.drop(columns="max_freq_value").apply(lambda x: x.value_counts().max(), axis=1)
print(df2)
A B C D E F G H I J max_freq_value max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Hope it helps : )
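For completeness, a single pass with one apply can also return both columns at once. This is a sketch of my own, not one of the original answers, and it assumes df2 still holds only the original A–J columns:
def modes_and_count(row):
    # count each value in the row, keep every value tied for the highest count
    counts = row.value_counts()
    top = counts.max()
    return pd.Series({"max_freq_val": counts[counts == top].index.tolist(),
                      "max_freq": top})

df2 = df2.join(df2.apply(modes_and_count, axis=1))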

How to replenish a data frame based on another one?

Given two data frames: one contains a column of repeated values (a, in this case), and the other records what each of those values corresponds to (here, some "d" values). How do I efficiently extend the first data frame with a new column whose values are derived from an existing column, according to the mapping recorded in the other data frame? Here is example code that works, but really slowly:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i, 'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is d1 original:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is a resultant d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map, which looks up only the values you need.
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
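One caveat worth noting (my addition, not part of the original answers): the lookup Series built with set_index('a') must have a unique index, otherwise map raises an error. If d2 could contain duplicate 'a' values, drop them first, for example:
# keep the first 'd' for each 'a' before building the lookup Series
lookup = d2.drop_duplicates(subset='a').set_index('a')['d']
d1['d'] = d1['a'].map(lookup)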

Why does `drop column by id` result in all same-named columns being removed from the dataframe?

import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's look at the concatenated df: the first and third columns share the same column name, A.
df
A B A C
0 14 1 14 5
1 4 2 4 6
2 5 3 5 7
3 4 4 4 8
I want to get the following format.
df
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
Drop column by id.
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what I expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It seems strange that both the first and the third column are removed when dropping the specified column by id.
1. Please tell me the reason for this behaviour.
2. How can I remove the third column while keeping the first column?
Here's a way using indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
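To address the "why": df.columns[2] is not a positional id but the label 'A', and DataFrame.drop removes every column carrying that label, which is why both A columns disappear. An alternative sketch (mine, not from the answer above) keeps only the first occurrence of each duplicated column name:
# keep the first column for every duplicated label, preserving A, B, C
df = df.loc[:, ~df.columns.duplicated()]
print(df)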

How to check value change in column

My dataframe has three columns: value, ID and distance. Whenever the ID column changes from 2 to any other value, I want to count the rows of that run, record the first and last entries of the value column within it, and also save the corresponding first entry of the distance column.
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],'ID':[2,2,8,8,8,2,2,2,5,5],'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 2 0
1 4 2 0
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
required results:
df_out=pd.DataFrame({'rows_Count':[3,2],'value_first':[7,15],'value_last':[11,16],'distance_first':[1,0]})
print(df_out)
rows_Count value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Use:
#compare with 2
m = df['ID'].eq(2)
#filter out data before the first 2 (not present in the sample data, but possible in real data)
df = df[m.cumsum().ne(0)]
#create unique groups for the non-2 runs, add missing values back by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with the helper Series s
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten the MultiIndex columns
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Verify with changed data (where the data does not start with a 2 group):
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],
                 'ID':[1,7,8,8,8,2,2,2,5,5],
                 'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 1 0 <- changed ID
1 4 7 0 <- changed ID
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
#compare with 2
m = df['ID'].eq(2)
#filter out data before the first 2 (not present in the sample data, but possible in real data)
df = df[m.cumsum().ne(0)]
#create unique groups for the non-2 runs, add missing values back by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with the helper Series s
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten the MultiIndex columns
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 2 15 16 0
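If the column names from the required result are wanted directly (rows_Count instead of ID_size), the same aggregation can be written with named aggregation; this is a sketch assuming pandas >= 0.25:
df1 = (df.groupby(s)
         .agg(rows_Count=('ID', 'size'),
              value_first=('value', 'first'),
              value_last=('value', 'last'),
              distance_first=('distance', 'first'))
         .reset_index(drop=True))
print(df1)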

How can I calculate population in pandas?

I have a data set like this:-
S.No.,Year of birth,year of death
1, 1, 5
2, 3, 6
3, 2, -
4, 5, 7
I need to calculate the population up to each year, like this:
year,population
1 1
2 2
3 3
4 3
5 4
6 3
7 2
8 1
How can I solve it in pandas? I am not very good at pandas, so any help would be appreciated.
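For reference, the answers that follow assume the sample data is already loaded into a DataFrame df. One way to build it from the text above (a sketch of mine, treating '-' as a missing death year):
import pandas as pd
from io import StringIO

data = """S.No.,Year of birth,year of death
1, 1, 5
2, 3, 6
3, 2, -
4, 5, 7"""

# skipinitialspace drops the blanks after the commas; '-' stays as text for now
df = pd.read_csv(StringIO(data), skipinitialspace=True)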
First it is necessary to choose a maximum year for rows where year of death does not exist; in this solution 8 is used.
Then convert year of death to numeric and replace the missing values with this year. The first solution repeats each row once per year lived, using the difference between the death and birth columns with Index.repeat and GroupBy.cumcount, and counts the years with Series.value_counts:
#if need working with years
#today_year = pd.to_datetime('now').year
today_year = 8
df['year of death'] = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
df = df.loc[df.index.repeat(df['year of death'].add(1).sub(df['Year of birth']).astype(int))]
df['Year of birth'] += df.groupby(level=0).cumcount()
df1 = (df['Year of birth'].value_counts()
                          .sort_index()
                          .rename_axis('year')
                          .reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Another solution uses a list comprehension with range to repeat the years:
#if need working with years
#today_year = pd.to_datetime('now').year
today_year = 8
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
#the filled death years are floats, so cast to int before building the range
L = [x for b, e in zip(df['Year of birth'], s) for x in range(b, int(e) + 1)]
df1 = (pd.Series(L).value_counts()
                   .sort_index()
                   .rename_axis('year')
                   .reset_index(name='population'))
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
Similar to before, only Counter is used to build the dictionary for the final DataFrame:
from collections import Counter
#if need working with years
#today_year = pd.to_datetime('now').year
today_year = 8
s = pd.to_numeric(df['year of death'], errors='coerce').fillna(today_year)
d = Counter([x for b, e in zip(df['Year of birth'], s) for x in range(b, int(e) + 1)])
print (d)
Counter({5: 4, 3: 3, 4: 3, 6: 3, 2: 2, 7: 2, 1: 1, 8: 1})
df1 = pd.DataFrame({'year': list(d.keys()),
                    'population': list(d.values())})
print (df1)
year population
0 1 1
1 2 2
2 3 3
3 4 3
4 5 4
5 6 3
6 7 2
7 8 1
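As one more option (my own sketch, not part of the original answers), the same table can be produced without repeating rows, by counting births and removals per year and taking a running total:
today_year = 8
births = df['Year of birth'].value_counts()
# a person born in year b and dying in year d is counted through year d, so the removal takes effect at d + 1
removals = (pd.to_numeric(df['year of death'], errors='coerce')
              .dropna().astype(int).add(1).value_counts())
years = pd.RangeIndex(1, today_year + 1, name='year')
population = (births.reindex(years, fill_value=0)
                    .sub(removals.reindex(years, fill_value=0))
                    .cumsum())
df1 = population.reset_index(name='population')
print(df1)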
