Is there a Python function to sum all of the columns of a particular row? If not, what would be the best way to go about this? [duplicate] - python-3.x

I'm going through the Khan Academy course on Statistics as a bit of a refresher from my college days, and as a way to get me up to speed on pandas & other scientific Python.
I've got a table that looks like this from Khan Academy:
             | Undergraduate | Graduate | Total
-------------+---------------+----------+-------
Straight A's |           240 |       60 |   300
-------------+---------------+----------+-------
Not          |         3,760 |      440 | 4,200
-------------+---------------+----------+-------
Total        |         4,000 |      500 | 4,500
I would like to recreate this table using pandas. Of course I could create a DataFrame using something like
"Graduate": {...},
"Undergraduate": {...},
"Total": {...},
But that seems like a naive approach that would both fall over quickly and just not really be extensible.
I've got the non-totals part of the table like this:
df = pd.DataFrame(
    {
        "Undergraduate": {"Straight A's": 240, "Not": 3_760},
        "Graduate": {"Straight A's": 60, "Not": 440},
    }
)
df
I've been looking and found a couple of promising things, like:
df['Total'] = df.sum(axis=1)
But I didn't find anything terribly elegant.
I did find the crosstab function that looks like it should do what I want, but it seems like in order to do that I'd have to create a dataframe consisting of 1/0 for all of these values, which seems silly because I've already got an aggregate.
I have found some approaches that seem to manually build a new totals row, but it seems like there should be a better way, something like:
totals(df, rows=True, columns=True)
or something.
Does this exist in pandas, or do I have to just cobble together my own approach?

Or in two steps, using the .sum() function as you suggested (which might be a bit more readable as well):
import pandas as pd
df = pd.DataFrame(
    {
        "Undergraduate": {"Straight A's": 240, "Not": 3_760},
        "Graduate": {"Straight A's": 60, "Not": 440},
    }
)

# Total of each column, added as a new 'Total' row:
df.loc['Total', :] = df.sum(axis=0)
# Total of each row, added as a new 'Total' column:
df.loc[:, 'Total'] = df.sum(axis=1)
Output:
              Graduate  Undergraduate  Total
Not                440           3760   4200
Straight A's        60            240    300
Total              500           4000   4500

append and assign
The point of this answer is to provide an in-line, not an in-place, solution.
append
I use append to stack a Series or DataFrame vertically. It also creates a copy so that I can continue to chain.
assign
I use assign to add a column. However, the DataFrame I'm working on is the intermediate result in the chain, so I use a lambda in the assign argument, which tells pandas to apply it to the calling DataFrame.
df.append(df.sum().rename('Total')).assign(Total=lambda d: d.sum(1))
              Graduate  Undergraduate  Total
Not                440           3760   4200
Straight A's        60            240    300
Total              500           4000   4500
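Note (not part of the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same in-line chain can be sketched with pd.concat instead:

import pandas as pd

df = pd.DataFrame(
    {
        "Undergraduate": {"Straight A's": 240, "Not": 3_760},
        "Graduate": {"Straight A's": 60, "Not": 440},
    }
)

# Stack the column totals on as a 'Total' row, then add a 'Total' column.
out = (
    pd.concat([df, df.sum().rename('Total').to_frame().T])
      .assign(Total=lambda d: d.sum(axis=1))
)
print(out)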
Fun alternative
It uses drop with errors='ignore' to get rid of potentially pre-existing Total rows and columns. Also, it is still in-line.
def tc(d):
    return d.assign(Total=d.drop('Total', errors='ignore', axis=1).sum(1))

df.pipe(tc).T.pipe(tc).T
              Graduate  Undergraduate  Total
Not                440           3760   4200
Straight A's        60            240    300
Total              500           4000   4500

To use crosstab on the original data: based just on your input, you only need to melt before calling crosstab.
s = df.reset_index().melt('index')
pd.crosstab(index=s['index'], columns=s.variable, values=s.value, aggfunc='sum', margins=True)
Out[33]:
variable      Graduate  Undergraduate   All
index
Not                440           3760  4200
Straight A's        60            240   300
All                500           4000  4500
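As a small aside (my addition, not part of the original answer), crosstab's margins_name parameter lets you label the margins 'Total' instead of the default 'All', to match the target table:

pd.crosstab(index=s['index'], columns=s.variable, values=s.value,
            aggfunc='sum', margins=True, margins_name='Total')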
Toy data
df = pd.DataFrame({'c1': [1, 2, 2, 3, 4], 'c2': [2, 2, 3, 3, 3], 'c3': [1, 2, 3, 4, 5]})
# I think your posted input is already the aggregated result of a groupby on raw data like this
df
Out[37]:
   c1  c2  c3
0   1   2   1
1   2   2   2
2   2   3   3
3   3   3   4
4   4   3   5
pd.crosstab(df.c1, df.c2, df.c3, aggfunc='sum', margins=True)
Out[38]:
c2     2     3  All
c1
1    1.0   NaN    1
2    2.0   3.0    5
3    NaN   4.0    4
4    NaN   5.0    5
All  3.0  12.0   15

The original data is:
>>> df = pd.DataFrame(dict(Undergraduate=[240, 3760], Graduate=[60, 440]), index=["Straight A's", "Not"])
>>> df
Out:
              Graduate  Undergraduate
Straight A's        60            240
Not                440           3760
You can simply use df.T to recreate this table (transposed):
>>> df_new = df.T
>>> df_new
Out:
               Straight A's   Not
Graduate                 60   440
Undergraduate           240  3760
Then compute the totals by row and by column:
>>> df_new.loc['Total',:]= df_new.sum(axis=0)
>>> df_new.loc[:,'Total'] = df_new.sum(axis=1)
>>> df_new
Out:
               Straight A's     Not   Total
Graduate               60.0   440.0   500.0
Undergraduate         240.0  3760.0  4000.0
Total                 300.0  4200.0  4500.0
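A small follow-up (my note, not from the original answer): the loc enlargement upcasts everything to float; if you want integers and the original orientation back, something like this (hypothetical name df_final) would do it:

# Transpose back and restore integer dtype after the float upcast.
df_final = df_new.T.astype(int)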

Related

Perform unique row operation after a groupby

I have been stuck on a problem: I have done all the groupby operations and got the resulting dataframe shown below, but the problem comes in the last step, the calculation of one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional column, operation, which calculates the ratio of the count for industry 'Retail' to the total count for that code.
For example: code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below:
O/P:
code  industry  category  count  duration  operation
2     Retail    Mobile    4      7         0.8
3     Retail    Tab       2      33        2/7 = 0.285
3     Health    Mobile    5      103       -
2     Food      TV        1      88        -
Help here as well: if I just do a groupby I will miss out on the category and duration information. Also, what would be a better way to represent the output df? There can be multiple industries, and operation is limited to just Retail.
I can't think of a single operation, but the route via a dictionary should work. And, for the benefit of the other answerers, here is the code to create the example dataframe:
st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt:
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
df['operation'] = df.apply(lambda x: x['count']/sums[x['code']], axis=1)
You can create a new column with the total count of each code using groupby.transform(), and then use loc to select only the rows whose industry is 'Retail' and perform your division:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN
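A variant without the temporary column (my sketch, same idea as the answer above): compute the per-code totals with transform and mask the non-Retail rows with where. The name totals is just illustrative.

# Per-code totals, broadcast back to every row.
totals = df.groupby('code')['count'].transform('sum')
# Ratio only for Retail rows; everything else stays NaN.
df['operation'] = df['count'].div(totals).where(df['industry'].eq('Retail'))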

Using FuzzyWuzzy with pandas

I am trying to calculate the similarity between cities in my dataframe, and 1 static city name. (eventually I want to iterate through a dataframe and choose the best matching city name from that data frame, but I am testing my code on this simplified scenario).
I am using fuzzywuzzy token set ratio.
For some reason it calculates the first row correctly, and it seems it assigns the same value for all rows.
code:
from fuzzywuzzy import fuzz
test_df= pd.DataFrame( {"City" : ["Amsterdam","Amsterdam","Rotterdam","Zurich","Vienna","Prague"]})
test_df = test_df.assign(Score = lambda d: fuzz.token_set_ratio("amsterdam",test_df["City"]))
print (test_df.shape)
test_df.head()
Result:
City Score
0 Amsterdam 100
1 Amsterdam 100
2 Rotterdam 100
3 Zurich 100
4 Vienna 100
If I do the comparison one by one it works:
print (fuzz.token_set_ratio("amsterdam","Amsterdam"))
print (fuzz.token_set_ratio("amsterdam","Rotterdam"))
print (fuzz.token_set_ratio("amsterdam","Zurich"))
print (fuzz.token_set_ratio("amsterdam","Vienna"))
Results:
100
67
13
13
Thank you in advance!
I managed to solve it by iterating through the rows:
for index, row in test_df.iterrows():
    test_df.loc[index, "Score"] = fuzz.token_set_ratio("amsterdam", test_df.loc[index, "City"])
The result is:
City Country Code Score
0 Amsterdam NL 100
1 Amsterdam NL 100
2 Rotterdam NL 67
3 Zurich NL 13
4 Vienna NL 13
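For what it's worth (my addition, not part of the accepted answer), the original assign fails because fuzz.token_set_ratio receives the whole Series as a single argument, so one score gets broadcast to every row; applying the scorer element-wise avoids the explicit loop:

from fuzzywuzzy import fuzz
import pandas as pd

test_df = pd.DataFrame({"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]})

# Compare each city individually instead of passing the whole Series at once.
test_df["Score"] = test_df["City"].apply(lambda city: fuzz.token_set_ratio("amsterdam", city))
print(test_df)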

Pandas - number of occurrences of IDs from a column in one dataframe in several columns of a second dataframe

I'm new to python and pandas, and trying to "learn by doing."
I'm currently working with two football/soccer (depending on where you're from!) dataframes:
player_table has several columns, among others 'player_name' and 'player_id'
player_id player_name
0 223 Lionel Messi
1 157 Cristiano Ronaldo
2 962 Neymar
match_table also has several columns, among others 'home_player_1', '..._2', '..._3' and so on, as well as the corresponding 'away_player_1', '...2' , '..._3' and so on. The content of these columns is a player_id, such that you can tell which 22 (2x11) players participated in a given match through their respective unique IDs.
I'll just post a 2 vs. 2 example here, because that works just as well:
match_id home_player_1 home_player_2 away_player_1 away_player_2
0 321 223 852 729 853
1 322 223 858 157 159
2 323 680 742 223 412
What I would like to do now is add a new column, player_table['appearances'], which counts the number of times each player_id is mentioned in the part of match_table bounded horizontally by (home_player_1, away_player_2) and vertically by (first match, last match).
Desired result:
player_id player_name appearances
0 223 Lionel Messi 3
1 157 Cristiano Ronaldo 1
2 962 Neymar 0
Coming from other programming languages I think my standard solution would be a nested for loop, but I understand that is frowned upon in python...
I have tried several solutions but none really work; this at least seems to give the number of appearances as 'home_player_1':
player_table['appearances'] = player_table['player_id'].map(match_table['home_player_1'].value_counts())
Is there a way to expand the map function to include several columns in a dataframe? Or do I have to stack the 22 columns on top of one another in a new dataframe, and then map? Or is map not the appropriate function?
Would really appreciate your support, thanks!
Philipp
Edit: added specific input and desired output as requested
What you could do is use .melt() on the match_table player columns (it turns your wide table into a tall/long table with a single value column). Then do a .value_counts() on that one column. Finally, join it to player_table on the 'player_id' column.
import pandas as pd

player_table = pd.DataFrame({'player_id': [223, 157, 962],
                             'player_name': ['Lionel Messi', 'Cristiano Ronaldo', 'Neymar']})
match_table = pd.DataFrame({
    'match_id': [321, 322, 323],
    'home_player_1': [223, 223, 680],
    'home_player_2': [852, 858, 742],
    'away_player_1': [729, 157, 223],
    'away_player_2': [853, 159, 412]})

player_cols = [x for x in match_table.columns if 'player_' in x]
match_table[player_cols].value_counts(sort=True)

df1 = (match_table[player_cols]
       .melt(var_name='columns', value_name='appearances')['appearances']
       .value_counts(sort=True)
       .reset_index(drop=False)
       .rename(columns={'index': 'player_id'}))
appearances_df = (df1.merge(player_table, how='right', on='player_id')
                  [['player_id', 'player_name', 'appearances']]
                  .fillna(0))
Output:
print(appearances_df)
player_id player_name appearances
0 223 Lionel Messi 3.0
1 157 Cristiano Ronaldo 1.0
2 962 Neymar 0.0
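To answer the "can map handle several columns" part directly (my sketch, not part of the answer above, variable name counts is illustrative): you can stack all player columns into one Series first and then map the counts, which is essentially the approach you guessed at:

# One long Series of all player ids across the 22 columns, then count occurrences.
counts = match_table[player_cols].stack().value_counts()
# Players who never appear get NaN from map, so fill with 0.
player_table['appearances'] = player_table['player_id'].map(counts).fillna(0).astype(int)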

Speed Up Pandas Iterations

I have a DataFrame which consists of 3 columns: CustomerID, Amount and Status (Success or Failed).
The DataFrame is not sorted in any way. A CustomerId can repeat multiple times in DataFrame.
I want to introduce new columns into this DataFrame with below logic:
df['totalamount'] = sum of Amount for each customer where Status was Success.
I already have running code using df.iterrows, but it takes too much time, so I'm asking for alternative methods such as pandas or numpy vectorization.
For Example, I want to create the 'totalamount' column from the first three columns:
   CustomerID  Amount   Status  totalamount
0           1       5  Success          105  # since both transactions were successful
1           2      10   Failed           80  # since one transaction was successful
2           3      50  Success           50
3           1     100  Success          105
4           2      80  Success           80
5           4      60   Failed            0
Use where to mask the 'Failed' rows with NaN while preserving the length of the DataFrame. Then groupby the CustomerID and transform the sum of 'Amount' column to bring the result back to every row.
df['totalamount'] = (df.where(df['Status'].eq('Success'))
                       .groupby(df['CustomerID'])['Amount']
                       .transform('sum'))
   CustomerID  Amount   Status  totalamount
0           1       5  Success        105.0
1           2      10   Failed         80.0
2           3      50  Success         50.0
3           1     100  Success        105.0
4           2      80  Success         80.0
5           4      60   Failed          0.0
The reason for using where (as opposed to subsetting the DataFrame) is that groupby + sum sums an entirely NaN group to 0 by default, so we don't need anything extra to deal with CustomerID 4, for instance.
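A tiny check of that claim (my addition): summing an all-NaN group with the default min_count=0 gives 0 rather than NaN, which is what handles CustomerID 4:

import numpy as np
import pandas as pd

# An all-NaN Series sums to 0.0 with the default skipna=True and min_count=0.
print(pd.Series([np.nan, np.nan]).sum())  # 0.0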
df_new = df.groupby(['CustomerID', 'Status'], sort=False)['Amount'].sum().reset_index()
df_new = (df_new[df_new['Status'] == 'Success']
          .drop(columns='Status')
          .rename(columns={'Amount': 'totalamount'}))
df = pd.merge(df, df_new, on=['CustomerID'], how='left')
I'm not sure at all but I think this may work
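One caveat with this merge-based variant (my note, not part of the answer): customers with no successful transactions, such as CustomerID 4, end up with NaN instead of 0, so a final fillna would be needed to match the expected output:

# Replace NaN (no successful transactions) with 0 to match the expected output.
df['totalamount'] = df['totalamount'].fillna(0)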

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001  Feb-2001  Jan-2002  Feb-2002  ....
     100        30        10  ...
     110        25         1  ...
      40         5        50
      70        11         4
     120        35         2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems as I see it:
1. Printing the highlighting depends on the output method you're trying to target, be it STDOUT, a file, or some specific program.
2. Identification of outliers based on the column data. It's hard to tell whether you want it based on the entire dataset or on the preceding data in the column, like a rolling outlier, i.e. the previous data is used to decide whether the next value is out of whack.
In the instance below I provide a method that uses standard deviation / z-scoring based on the mean of the entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies in dealing with this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The pandas Styler approach (https://pandas.pydata.org/pandas-docs/stable/style.html) works in a few environments, such as Jupyter notebooks.
To get at the original question, identifying outliers in your data, you could use something like the code below, which flags values based on standard deviation and z-score.
Sample Code:
df = pd.read_csv("full.txt")
original = df.columns
print(df)

for col in df.columns:
    col_zscore = col + "_zscore"
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])

print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
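To actually color the flagged cells (my sketch, not part of the answer above, assuming pandas >= 1.3 for Styler.to_html; the helper name highlight_outliers and the output filename are illustrative), the same z-score rule can be fed into the Styler mentioned earlier:

import pandas as pd

df = pd.read_csv("full.txt")  # the same toy data as above

def highlight_outliers(col, upper=1.5, lower=-0.5):
    # Return one CSS string per cell, highlighting values whose z-score is outside the thresholds.
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if (v > upper) or (v < lower) else '' for v in z]

styled = df.style.apply(highlight_outliers, axis=0)  # applied column by column
styled.to_html("highlighted.html")  # also renders inline in a Jupyter notebook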
