Get column value from the column that is dynamically selected depending on row value of another column - python-3.x

I have a dataframe as below.
month fe_month_OCT re_month_APR fe_month_MAY
0 OCT 1 1 2
1 APR 4 2 2
2 MAY 1 4 3
Im trying to create a new column that gets me the value from any of the fe_month_ or re_month_ columns based on what month the row of data corresponds to (for the SAME month however we will not see 2 columns - i.e. we will never see both fe_month_APR and re_month_APR in the same df - it will either be fe or re).
Output example - for the first row, I would want this new column to have the value coming from fe_month_OCT, because month=OCT, for the second row, the value should come from re_month_APR etc.
Expected output:
month fe_month_OCT re_month_APR fe_month_MAY d_month
0 OCT 1 1 2 1
1 APR 4 2 2 2
2 MAY 1 4 3 3
Code to create input dataframe:
data = {'month': ['OCT', 'APR', 'MAY'], 'fe_month_OCT': [1, 4, 1], 're_month_APR': [1, 2, 4],'fe_month_MAY': [2, 2, 3] }
db = pd.DataFrame(data)

Assuming all the column names are in the form "fe_month_" plus the string in db["month"], you can use apply().
get_value = lambda row: row[ "fe_month_" + row["month"] ]
db["d_month"] = db.apply( get_value, axis=1 )

Related

Python: Fetch value from dataframe list with min value from another column list of same dataframe

input dataframe
Flow row count
Apple [45,46] [2,1]
Orange [13,14] [1,5]
need to find min value of each list column 'count' and fetch respective row value from row column.
Expected output:
Flow row count
Apple 46 1
Orange 13 1
A possible solution (the part .astype('int') may be unnecessary in your case):
df['row'] = list(df.explode(['row', 'count']).reset_index().groupby('flow')
.apply(lambda x: x['row'][x['count'].astype('int').idxmin()]))
df['count'] = df['count'].map(min)
A shorter solution than my previous one, based on sorted with key:
df.assign(row=df.apply(
lambda x: sorted(x['row'], key=lambda z: x['count'][x['row'].index(z)])[0],axis=1),
count=df['count'].map(min))
Output:
flow row count
0 apple 46 1
1 orange 13 1
In Python, the list has a function called index where you can get the position of the value you want to find. So, by utilizing this function with a min, you can get your desired result.
df['min_index'] = df['count'].apply(lambda x: x.index(min(x)))
df[['row_res','count_res']] = [[row[j],count[j]] for row, count, j in zip(df['row'], df['count'], df['min_index'])]
Flow row count min_index row_res count_res
0 Apple [45, 46] [2, 1] 1 46 1
1 Orange [13, 14] [1, 5] 0 13 1

Python pandas move cell value to another cell in same row

I have a dataFrame like this:
id Description Price Unit
1 Test Only 1254 12
2 Data test Fresher 4
3 Sample 3569 1
4 Sample Onces Code test
5 Sample 245 2
I want to move to the left Description column from Price column if not integer then become NaN. I have no specific word to call in or match, the only thing is if Price column have Non-integer value, that string value move to Description column.
I already tried pandas replace and concat but it doesn't work.
Desired output is like this:
id Description Price Unit
1 Test Only 1254 12
2 Fresher 4
3 Sample 3569 1
4 Code test
5 Sample 245 2
This should work
# data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'Description': ['Test Only', 'Data test', 'Sample', 'Sample Onces', 'Sample'],
'Price': ['1254', 'Fresher', '3569', 'Code test', '245'],
'Unit': [12, 4, 1, np.nan, 2]})
# convert price column to numeric and coerce errors
price = pd.to_numeric(df.Price, errors='coerce')
# for rows where price is not numeric, replace description with these values
df.Description = df.Description.mask(price.isna(), df.Price)
# assign numeric price to price column
df.Price = price
df
Use:
#convert valeus to numeric
price = pd.to_numeric(df['Price'], errors='coerce')
#test missing values
m = price.isna()
#shifted only matched rows
df.loc[m, ['Description','Price']] = df.loc[m, ['Description','Price']].shift(-1, axis=1)
print (df)
id Description Price
0 1 Test Only 1254
1 2 Fresher NaN
2 3 Sample 3569
3 4 Code test NaN
4 5 Sample 245
If need numeric values in ouput Price column:
df = df.assign(Price=price)
print (df)
id Description Price
0 1 Test Only 1254.0
1 2 Fresher NaN
2 3 Sample 3569.0
3 4 Code test NaN
4 5 Sample 245.0

How to extract data from data frame when value of column change

I want to extract part of the data frame when value change from 0 to 1.
logic1: when value change from 0 to 1, start to save data until value again change to 0. (also points before 1 and after 1)
logic2: when value change from 0 to 1, start to save data until value again change to 0. (don't need to save points before 1 and after 1)
only save data when the first time value of flag change from 0 to 1, after this if again value change from 0 to 1 don't need to do anything
df=pd.DataFrame({'value':[3,4,7,8,11,1,15,20,15,16,87],'flag':[0,0,0,1,1,1,0,0,1,1,0]})
Desired output:
df_out_1=pd.DataFrame({'value':[7,8,11,1,15]})
Desired output:
df_out_2=pd.DataFrame({'value':[8,11,1]})
Idea is get consecutive groups of 1 and 0 consecutive groups to s, filter only 1 groups and get first 1 group by compare by minimal value:
df = df.reset_index(drop=True)
s = df['flag'].ne(df['flag'].shift()).cumsum()
m = s.eq(s[df['flag'].eq(1)].min())
df2 = df.loc[m, ['value']]
print (df2)
value
3 8
4 11
5 1
And then filter values with aff and remove 1 to default RangeIndex:
df1 = df.loc[(df2.index + 1).union(df2.index - 1), ['value']]
print (df1)
value
2 7
3 8
4 11
5 1
6 15

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split, create an tuple of (column_name, prefix, number) and sorted based on the 2nd and 3rd item of the tuple, then retrieved the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one liner in python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For Example lets say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
'ID': [2, 0, 0, 0],
'Prod3': [10, 2, 1, 8],
'Prod1': [2, 4, 8, 0],
'Prod_1': [2, 4, 8, 0],
'Pre7': [2, 0, 0, 0],
'Pre2': [10, 2, 1, 8],
'Pre_2': [10, 2, 1, 8],
'Pre_9': [10, 2, 1, 8]}
)
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then used
df = df.reindex(sorted(df.columns), axis=1)
Then the dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, the columns without underscore will come first, followed by an ordering based on the number after the underscore. However this also sorts of the column names, so the column names that come first in the alphabet will be first.
You need to split you column on '_' then convert to int:
c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
Next case, you need human sorting:
import re
def atoi(text):
return int(text) if text.isdigit() else text
def natural_keys(text):
'''
alist.sort(key=natural_keys) sorts in human order
http://nedbatchelder.com/blog/200712/human_sorting.html
(See Toothy's implementation in the comments)
'''
return [ atoi(c) for c in re.split(r'(\d+)', text) ]
df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1])) # split based on the number after '_'
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But returned this:
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions with groupby and filter. Is it possible to achieve this only with pivot_table command? If it is not, which is the best approach?
Thanks in advance
UPDATE
Just to make it clearer why I am avoinding filter solution. In my real and complex df, I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the column 'other', but when i filter I got this:
df=df[df['value']>0]
raw = pd.pivot_table(df, index='id',values=['value','other'],aggfunc={'value':len,'other':sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
Need sum for count Trues created by condition x>0:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:(x>0).sum())
print (raw)
value
id
a 2
b 1
As #Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')

Resources