In a pandas DataFrame, how do I make one column act on all the others?

Consider the following small dataframe:
import pandas as pd
value1 = [15, 20, 50, 70]
value2 = [15, 80, 45, 30]
base = [175, 150, 200, 125]
df = pd.DataFrame({"val1": value1, "val2": value2, "base": base})
df
   val1  val2  base
0    15    15   175
1    20    80   150
2    50    45   200
3    70    30   125
In reality, there are many more rows and many more val*** columns...
I would like to express the figures in the val*** columns as percentages of their corresponding base (in the same row). For example, 70 (the last value in val1) should become (70/125)*100 = 56, and 30 (the last value in val2) should become (30/125)*100 = 24; and so on for every figure.
I am sure the solution lies in a correct use of assign or apply and lambda, but I can't figure out how to do it...

We can filter the val-like columns, then divide them by the base column along axis=0, and multiply by 100 to calculate the percentages:
df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
       val1%      val2%
0   8.571429   8.571429
1  13.333333  53.333333
2  25.000000  22.500000
3  56.000000  24.000000
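If you also want to keep the percentages alongside the original columns, one way (a minimal sketch building on the answer above; the pct name is just for illustration) is to join the result back onto the frame:

import pandas as pd

value1 = [15, 20, 50, 70]
value2 = [15, 80, 45, 30]
base = [175, 150, 200, 125]
df = pd.DataFrame({"val1": value1, "val2": value2, "base": base})

# compute the percentages once, then attach them to the original frame
pct = df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
df = df.join(pct)
print(df)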

Related

In Pandas, how to compute value counts on bins and sum values in another column

I have a Pandas dataframe like:
df =
   col1  col2
     23    75
     25    78
     22   120
I want to specify two bins, 0-100 and 100-200, divide col2 into those bins, compute the value counts, and sum col1 for the rows that fall in each bin.
So:
df_output:
col2_range  count  col1_cum
     0-100      2        48
   100-200      1        22
Getting the col2_range and count is pretty simple:
import numpy as np
bins = np.arange(0, 300, 100).tolist()  # [0, 100, 200]
counts = df['col2'].value_counts(bins=bins, sort=False)
How do I sum col1 per bin, though?
IIUC, try using pd.cut to create bins and groupby those bins:
g = pd.cut(df['col2'],
           bins=[0, 100, 200, 300, 400],
           labels=['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True)['col1'].agg(['count', 'sum']).reset_index()
Output:
      col2  count  sum
0     0-99      2   48
1  100-199      1   22
I think I misread the original post; here is a version that aggregates both columns:
g = pd.cut(df['col2'],
           bins=[0, 100, 200, 300, 400],
           labels=['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True).agg(col1_count=('col1', 'count'),
                                 col2_sum=('col2', 'sum'),
                                 col1_sum=('col1', 'sum')).reset_index()
Output:
      col2  col1_count  col2_sum  col1_sum
0     0-99           2       153        48
1  100-199           1       120        22
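For reference, a minimal self-contained version that uses only the two bins from the question (the bin edges and labels here are assumed from the desired output):

import pandas as pd

df = pd.DataFrame({'col1': [23, 25, 22],
                   'col2': [75, 78, 120]})

# bin col2, then count rows and sum col1 per bin
g = pd.cut(df['col2'], bins=[0, 100, 200], labels=['0-100', '100-200'])
print(df.groupby(g, observed=True)['col1'].agg(['count', 'sum']).reset_index())
#       col2  count  sum
# 0    0-100      2   48
# 1  100-200      1   22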

Adding reversed columns to a dataframe [duplicate]

This question already has an answer here: Reversing the order of values in a single column of a Dataframe (1 answer). Closed 1 year ago.
I am trying to add a reversed column to a dataframe, but the values are added in their original order. It looks like the assignment just follows the index of the dataframe. Is it possible to reorder the index?
df_reversed = df['Buy'].iloc[::-1]
df["newColumn"] = df_reversed
(Screenshots of the current output, of df_reversed, and of the desired output omitted.)
A slight modification of @Chicodelarose's answer: you can reverse just the values and get the result you want as follows:
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

# .values strips the index, so the reversed values are assigned by position
df["calories_reversed"] = df["calories"].values[::-1]
print(df)
Output will be:
   calories  duration
0       420        50
1       380        40
2       390        45

   calories  duration  calories_reversed
0       420        50                390
1       380        40                380
2       390        45                420
You need to call reset_index before assigning the values to the new column so that they are added to the data frame in reverse order:
Example:
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

# reset_index(drop=True) renumbers the reversed values 0..n-1, so they are assigned in reversed order
df["calories_reversed"] = df["calories"][::-1].reset_index(drop=True)
print(df)
Output:
   calories  duration
0       420        50
1       380        40
2       390        45

   calories  duration  calories_reversed
0       420        50                390
1       380        40                380
2       390        45                420
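As background on why the original attempt appears to do nothing: assigning a Series aligns on the index, not on position. A small demonstration (the Buy values here are made up):

import pandas as pd

df = pd.DataFrame({"Buy": [1, 2, 3]})
reversed_series = df["Buy"].iloc[::-1]
print(reversed_series.index.tolist())  # [2, 1, 0] -- the values keep their original labels

# assignment aligns on those labels, so each value lands back in its original row
df["newColumn"] = reversed_series
print(df)  # newColumn is identical to Buy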

Getting columns by list of substring values

I have the dataframe shown below. I am working with a large dataset and want to create different dataframes based on substrings of the column names.
df
  ID    ex_srr123  ex2_srr124  ex3_srr125  ex4_srr1234  ex23_srr5323
  san          12          43           0           34             0
  mat          53           0          34           76           656
  jon          82         223          23           32            21
  jack          0          12           2            0             0
I have lists of column-name substrings:
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I want:
df2 =
  ID    ex_srr123  ex2_srr12
  san          12         43
  mat          53          0
  jon          82        223
  jack          0         12
I tried:
df2 = df[coln1]
but I didn't get what I wanted. How can I get the desired output?
Statically
df2 = df.filter(regex="srr123$|srr124$").copy()
Dynamically
coln1 = ['srr123', 'srr124']
df2 = df.filter(regex=f"{coln1[0]}$|{coln1[1]}$").copy()
The $ signifies the end of the string, so that the column ex4_srr1234 isn't also included in your result.
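If the lists can hold more than two substrings, the same idea generalizes by joining the anchored patterns (a sketch using the question's coln1):

# build 'srr123$|srr124$' from an arbitrary-length list of substrings
pattern = '|'.join(f'{sub}$' for sub in coln1)
df2 = df.filter(regex=pattern).copy()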
Look into the filter method:
df.filter(regex="srr123|srr124").copy()
(Without the $ anchor, this also matches ex4_srr1234; see the note above.)
I am making a few assumptions:
'ID' is a column and not the index.
The third column in df2 should read 'ex2_srr124' instead of 'ex2_srr12'.
You do not want to include columns of 'df' in 'df2' if the substring does not match everything after the underscore (since 'srr123' is a substring of 'ex4_srr1234' but you did not include it in 'df2').
import pandas as pd

# set up the provided dataframe
df = pd.DataFrame([['san', 12, 43, 0, 34, 0],
                   ['mat', 53, 0, 34, 76, 656],
                   ['jon', 82, 223, 23, 32, 21],
                   ['jack', 0, 12, 2, 0, 0]],
                  columns=['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125',
                           'ex4_srr1234', 'ex23_srr5323'])

# set the lists of column-substrings
coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']
I suggest solving this as follows:
# create df2 and add the ID column
df2 = pd.DataFrame()
df2['ID'] = df['ID']

# iterate over each substring in the list of column-substrings
for substring in coln1:
    # iterate over each column name in the df columns
    for column_name in df.columns.values:
        # check if the column name ends with the substring
        if column_name.endswith(substring):
            # assign the matching column to df2
            df2[column_name] = df[column_name]
This yields the desired dataframe df2:
     ID  ex_srr123  ex2_srr124
0   san         12          43
1   mat         53           0
2   jon         82         223
3  jack          0          12
df.filter(regex='|'.join(['ID'] + [col + '$' for col in coln1])).copy()
     ID  ex_srr123  ex2_srr124
0   san         12          43
1   mat         53           0
2   jon         82         223
3  jack          0          12
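Another option, assuming a pandas version recent enough (1.4+) for str.endswith to accept a tuple, is a boolean mask over the column names (a sketch):

# keep 'ID' plus every column whose name ends with one of the substrings
mask = (df.columns == 'ID') | df.columns.str.endswith(tuple(coln1))
df2 = df.loc[:, mask].copy()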

Apply a function to every row of a dataframe and store the data to a list/Dataframe in Python

I have the following simplified version of the code:
import pandas as pd

def myFunction(portf, Val):
    mydata = {portf: [Val, Val * 2, Val * 3, Val * 4]}
    df = pd.DataFrame(mydata, columns=[portf])
    return df
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])

df_output = myFunction(df_input['Portfolio'][0], df_input['Value'][0])
df_output1 = myFunction(df_input['Portfolio'][1], df_input['Value'][1])
df_output2 = myFunction(df_input['Portfolio'][2], df_input['Value'][2])
df_output3 = myFunction(df_input['Portfolio'][3], df_input['Value'][3])
What I would like is to concatenate all the df_output frames into a single list, or better still into one dataframe, in an efficient way, since the real df_input will have 100+ columns.
I tried to apply the following:
df_input.apply(lambda row: myFunction(row['Portfolio'], row['Value']), axis=1)
However, all the results end up in a single column.
Any idea how to achieve that?
Thanks
You can use pd.concat to store all results in a single dataframe:
pd.concat([myFunction(row['Portfolio'], row['Value'])
           for _, row in df_input.iterrows()], axis=1)
First you build a list of pd.DataFrames with a list comprehension (you could also use a normal loop). Then you concat all DataFrames along axis=1.
Output:
   Book1  Book2  Book1  Book2
0     10      5      6     11
1     20     10     12     22
2     30     15     18     33
3     40     20     24     44
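If you would rather keep the apply call from the question, note that it returns a Series whose elements are DataFrames, which is exactly why everything appeared to land in a single column; a sketch of flattening it:

# apply returns a Series of DataFrames; collect and concatenate them
results = df_input.apply(lambda row: myFunction(row['Portfolio'], row['Value']), axis=1)
df_output_all = pd.concat(results.tolist(), axis=1)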
You mentioned df_input has many more columns in the original dataframe. To account for this you need another loop (minimal example):
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_input['Value2'] = df_input['Value'] * 100

pd.concat([myFunction(row['Portfolio'], row[col])
           for col in df_input.columns if col != 'Portfolio'
           for (_, row) in df_input.iterrows()], axis=1)
Output:
   Book1  Book2  Book1  Book2  Book1  Book2  Book1  Book2
0     10      5      6     11   1000    500    600   1100
1     20     10     12     22   2000   1000   1200   2200
2     30     15     18     33   3000   1500   1800   3300
3     40     20     24     44   4000   2000   2400   4400
You might want to rename the columns or aggregate the resulting dataframe in some other way. But for this I had to guess (and I try not to guess in the face of ambiguity).

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once and identify a customer. The remaining columns represent repeated encounters with the customer, with an underscore and the encounter number appended to the name. Every additional encounter adds a new column, so there is NOT a fixed number of columns; the set grows over time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()

# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
#    id  dob  gender  pro_1  pro_10  pro_11  pro_2  pro_9  pre_1  pre_10
# 0   0    1       2      3       4       5      6      7      8       9
# number of columns excluded from sorting
N = 3

# get a list of columns from the dataframe
cols = df.columns.tolist()

# split each name into a tuple of (column_name, prefix, number), sort by the
# 2nd and 3rd items of the tuple, then retrieve the first item.
# adjust to key=lambda x: x[2] to group columns by number only
cols_new = cols[:N] + [a[0] for a in sorted(
    [(c, p, int(n)) for c in cols[N:] for p, n in [c.split('_')]],
    key=lambda x: (x[1], x[2]))]

# get the new dataframe based on cols_new
df_new = df[cols_new]
#    id  dob  gender  pre_1  pre_10  pro_1  pro_2  pro_9  pro_10  pro_11
# 0   0    1       2      8       9      3      6      7       4       5
Luckily there is a one-liner in Python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For example, let's say you had this dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]})
print(df)
Output:
   Name  ID  Prod3  Prod1  Prod_1  Pre7  Pre2  Pre_2  Pre_9
0     2   2     10      2       2     2    10     10     10
1     4   0      2      4       4     0     2      2      2
2     8   0      1      8       8     0     1      1      1
3     0   0      8      0       0     0     8      8      8
Then use:
df = df.reindex(sorted(df.columns), axis=1)
The dataframe will then look like this:
   ID  Name  Pre2  Pre7  Pre_2  Pre_9  Prod1  Prod3  Prod_1
0   2     2    10     2     10     10      2     10       2
1   0     4     2     0      2      2      4      2       4
2   0     8     1     0      1      1      8      1       8
3   0     0     8     0      8      8      0      8       0
As you can see, the columns without an underscore come first, followed by an ordering based on the number after the underscore. However, this also sorts the column names alphabetically, so the names that come first in the alphabet will be first.
You need to split your column names on '_' and then convert the numeric part to int:
import numpy as np
import pandas as pd

c = ['A_1', 'A_10', 'A_2', 'A_3', 'B_1', 'B_10', 'B_2', 'B_3']
df = pd.DataFrame(np.random.randint(0, 100, (2, 8)), columns=c)

df.reindex(sorted(df.columns, key=lambda x: int(x.split('_')[1])), axis=1)
Output:
   A_1  B_1  A_2  B_2  A_3  B_3  A_10  B_10
0   68   11   59   69   37   68    76    17
1   19   37   52   54   23   93    85     3
For the next case, you need human (natural) sorting:
import re

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]

df.reindex(sorted(df.columns, key=natural_keys), axis=1)
Output:
   A_1  A_2  A_3  A_10  B_1  B_2  B_3  B_10
0   68   59   37    76   11   69   68    17
1   19   52   23    85   37   54   93     3
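To combine this with the question's fixed leading columns, one sketch (it assumes the first N columns, id/dob/gender in the question, should stay put and that every remaining name contains a number) is to natural-sort only the tail of the column list:

# keep the first N columns in place and natural-sort the rest
N = 3
cols = df.columns[:N].tolist() + sorted(df.columns[N:], key=natural_keys)
df = df[cols]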
Try this.
To re-order the columns based on the number at the end of the column name:
cols_fixed = df.columns[:3]     # change index no based on your df
cols_variable = df.columns[3:]  # change index no based on your df
# sort based on the number after '_'
cols_variable = sorted(cols_variable, key=lambda x: int(x.split('_')[1]))
cols_new = cols_fixed.tolist() + cols_variable
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names:
cols_fixed = df.columns[:3]     # change index no based on your df
cols_variable = df.columns[3:]  # change index no based on your df
# sort by the string part, then by the numeric part
cols_variable = sorted(cols_variable, key=lambda x: (x.split('_')[0], int(x.split('_')[1])))
cols_new = cols_fixed.tolist() + cols_variable
new_df = pd.DataFrame(df[cols_new])
