Python: Fetch the value from a dataframe list column at the position of the min value in another list column of the same dataframe

Input dataframe:
Flow    row        count
Apple   [45, 46]   [2, 1]
Orange  [13, 14]   [1, 5]
I need to find the minimum value in each list of column 'count' and fetch the value at the matching position from column 'row'.
Expected output:
Flow    row   count
Apple   46    1
Orange  13    1
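For anyone wanting to reproduce the answers below, the input can be built like this (a sketch; the data's real source is not shown, so the lowercase column names 'flow', 'row', 'count' are assumed from the answers' output):

```python
import pandas as pd

# hypothetical constructor for the question's input; column names assumed
df = pd.DataFrame({
    'flow': ['Apple', 'Orange'],
    'row': [[45, 46], [13, 14]],
    'count': [[2, 1], [1, 5]],
})
print(df)
```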

A possible solution (the part .astype('int') may be unnecessary in your case):
df['row'] = list(df.explode(['row', 'count']).reset_index().groupby('flow')
                 .apply(lambda x: x['row'][x['count'].astype('int').idxmin()]))
df['count'] = df['count'].map(min)
A shorter solution than my previous one, based on sorted with key:
df.assign(row=df.apply(lambda x: sorted(x['row'],
                                        key=lambda z: x['count'][x['row'].index(z)])[0],
                       axis=1),
          count=df['count'].map(min))
Output:
flow row count
0 apple 46 1
1 orange 13 1

In Python, lists have an index method that returns the position of a given value. Combining it with min gives the desired result.
df['min_index'] = df['count'].apply(lambda x: x.index(min(x)))
df[['row_res', 'count_res']] = [[row[j], count[j]] for row, count, j
                                in zip(df['row'], df['count'], df['min_index'])]
Flow row count min_index row_res count_res
0 Apple [45, 46] [2, 1] 1 46 1
1 Orange [13, 14] [1, 5] 0 13 1
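Another compact variant (a sketch, assuming the same list columns): pair each count with its row and let min pick the pair with the smallest count.

```python
import pandas as pd

df = pd.DataFrame({
    'flow': ['Apple', 'Orange'],
    'row': [[45, 46], [13, 14]],
    'count': [[2, 1], [1, 5]],
})

# min() on (count, row) tuples compares counts first, so each min() call
# returns the smallest count together with its matching row value
pairs = [min(zip(c, r)) for c, r in zip(df['count'], df['row'])]
df['count'], df['row'] = zip(*pairs)
print(df)
```

On a tie in count, tuple comparison falls back to the row value, so the smaller row wins.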

Related

Python: How to use value_counts() inside .agg function in pandas?

Input dataframe df looks like:
item row
Apple 12
Apple 12
Apple 13
Orange 13
Orange 14
Lemon 14
The output dataframe needs to be:
item unique_row nunique_row count
Apple {12,13} 2 {2,1}
Orange {13,14} 2 {1,1}
Lemon {14} 1 {1}
Tried Code:
df.groupby('item', as_index=False)['row'].agg({'unique_row': lambda x: set(x),
                                               'nunique_row': lambda x: len(set(x))})
So here, I'm not sure how to add a condition inside the .agg function to generate the 'count' column. 'count' represents the value counts of each row value.
Any help will be appreciated. Thank You!
Solution
s = df.value_counts()
g = s.reset_index(name='count').groupby('item')
g.agg(list).join(g.size().rename('nunique_row'))
Working
Calculate the group size per item and row using value_counts
group the preceding counts by item
agg with list to get the list of unique rows and corresponding counts
agg with size to get the number of unique rows
Result
row count nunique_row
item
Apple [12, 13] [2, 1] 2
Lemon [14] [1] 1
Orange [13, 14] [1, 1] 2
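End to end, the solution runs like this (input reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Lemon'],
    'row': [12, 12, 13, 13, 14, 14],
})

# size of each (item, row) group
s = df.value_counts()
# turn the counts into columns, then group the pairs by item
g = s.reset_index(name='count').groupby('item')
# lists of unique rows and their counts, plus the number of unique rows
out = g.agg(list).join(g.size().rename('nunique_row'))
print(out)
```

Note that DataFrame.value_counts requires pandas >= 1.1, and it sorts by count descending, which is why 12 (count 2) precedes 13 (count 1) in Apple's list.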
You need to convert to list or set:
(df.groupby('item', as_index=False)['row']
   .agg({'unique_row': lambda x: list(x.unique()),
         'nunique_row': lambda x: len(set(x)),
         'count': lambda x: list(x.value_counts(sort=False)),  # or set(x.value_counts())
         })
)
output:
item unique_row nunique_row count
0 Apple [12, 13] 2 [2, 1]
1 Lemon [14] 1 [1]
2 Orange [13, 14] 2 [1, 1]

Get column value from the column that is dynamically selected depending on row value of another column

I have a dataframe as below.
month fe_month_OCT re_month_APR fe_month_MAY
0 OCT 1 1 2
1 APR 4 2 2
2 MAY 1 4 3
I'm trying to create a new column that picks up the value from whichever fe_month_ or re_month_ column corresponds to the month of that row. For the same month we will never see two columns -- i.e. both fe_month_APR and re_month_APR will never appear in the same df; it will be either fe or re.
Output example: for the first row, I would want this new column to have the value coming from fe_month_OCT, because month=OCT; for the second row, the value should come from re_month_APR, etc.
Expected output:
month fe_month_OCT re_month_APR fe_month_MAY d_month
0 OCT 1 1 2 1
1 APR 4 2 2 2
2 MAY 1 4 3 3
Code to create input dataframe:
data = {'month': ['OCT', 'APR', 'MAY'],
        'fe_month_OCT': [1, 4, 1],
        're_month_APR': [1, 2, 4],
        'fe_month_MAY': [2, 2, 3]}
db = pd.DataFrame(data)
Assuming all the relevant column names are in the form "fe_month_" plus the string in db["month"], you can use apply(). (Note that the sample data also contains a re_month_ column, for which this exact lambda would raise a KeyError.)
get_value = lambda row: row[ "fe_month_" + row["month"] ]
db["d_month"] = db.apply( get_value, axis=1 )
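Since the sample frame mixes fe_month_ and re_month_ prefixes, here is a variant that tries both (a sketch; it assumes exactly one of the two columns exists per month, as the question states):

```python
import pandas as pd

data = {'month': ['OCT', 'APR', 'MAY'],
        'fe_month_OCT': [1, 4, 1],
        're_month_APR': [1, 2, 4],
        'fe_month_MAY': [2, 2, 3]}
db = pd.DataFrame(data)

def get_value(row):
    # try the 'fe' column first, fall back to 're'; exactly one should exist
    for prefix in ('fe_month_', 're_month_'):
        col = prefix + row['month']
        if col in row.index:
            return row[col]
    raise KeyError(f"no fe/re column for month {row['month']}")

db['d_month'] = db.apply(get_value, axis=1)
print(db)
```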

Sorting one column by absolute value, ignoring zeros, while keeping rows with equal values of another column together

I have the following dataframe :
A B C
============
11 x 2
11 y 0
13 x -10
13 y 0
10 x 7
10 y 0
and I would like to sort C by absolute value, ignoring the zeros. But since I need to keep rows with the same A value together, the result would look like below (sorted by absolute value, with the 0 rows staying with their group):
A B C
============
13 x -10
13 y 0
10 x 7
10 y 0
11 x 2
11 y 0
I can't manage to obtain this with sort_values(). If I sort by C, the A values are no longer together.
Step 1: get absolute values
# creating a column with the absolute values
df["abs_c"] = df["c"].abs()
Step 2: sort values on absolute values of "c"
# sorting by absolute value of "c" & reseting the index & assigning it back to df
df = df.sort_values("abs_c",ascending=False).reset_index(drop=True)
Step 3: get the order of column "a" based on the sorted values. This uses drop_duplicates, which keeps the first instance of each value in column "a" (already sorted by absolute "c"); this order is used in the next step.
# getting the order of "a" based on sorted value of "c"
order_a = df["a"].drop_duplicates()
Step 4: based on the order of "a" and the sorted values of "c", create a data frame
# stack the per-"a" groups in the order given by order_a
# (DataFrame.append was removed in pandas 2.0, so use pd.concat;
# iterating order_a's values also avoids relying on its index labels)
sorted_df = pd.concat([df[df["a"] == val] for val in order_a])
Step 5: assign the sorted df back to df
# reset index of sorted values and assigning it back to df
df = sorted_df.reset_index(drop=True)
Output
a b c abs_c
0 13 x -10 10
1 13 y 0 0
2 10 x 7 7
3 10 y 0 0
4 11 x 2 2
5 11 y 0 0
Doc reference
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
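For reference, the same result can also be had without the intermediate steps (a sketch, not from the original answers): rank each "a" group by the largest |c| it contains, then do a single stable sort on that rank.

```python
import pandas as pd

df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})

# rank each 'a' group by the largest |c| it contains, then sort on that
# rank; mergesort is stable, so the original row order inside each group
# (x before y) is preserved
df['grp_key'] = df.groupby('a')['c'].transform(lambda s: s.abs().max())
out = (df.sort_values('grp_key', ascending=False, kind='mergesort')
         .drop(columns='grp_key')
         .reset_index(drop=True))
print(out)
```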
Sorry, it doesn't turn out very nice, but I almost never use pandas. I hope it works the way you want.
import pandas as pd

df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})

# take the nonzero rows and sort them by absolute value
# (.copy() avoids a SettingWithCopyWarning on the next line)
mask = df[df['c'] != 0].copy()
mask['abs'] = mask['c'].abs()
mask = mask.sort_values('abs', ascending=False)

# stitch the groups back together in the sorted order of their nonzero
# rows, so each zero row travels with its group
df = pd.concat([df[df['a'] == val] for val in mask['a']]).reset_index(drop=True)
print(df)

pandas data frame: efficiently remove duplicates and keep the record with the largest int value

I have a data frame with two columns NAME and VALUE, where NAME contains duplicates and VALUE contains ints. I would like to efficiently drop duplicate records of column NAME while keeping the record with the largest VALUE. I figured out how to do it with two steps, sort and drop duplicates, but I am new to pandas and am curious if there is a more efficient way to achieve this with the query function?
import pandas
import io
import json
input = """
KEY VALUE
apple 0
apple 1
apple 2
bannana 0
bannana 1
bannana 2
pear 0
pear 1
pear 2
pear 3
orange 0
orange 1
orange 2
orange 3
orange 4
"""
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df = df[['KEY','VALUE']].sort_values(by=['VALUE']).drop_duplicates(subset='KEY', keep='last')
dicty = dict(zip(df['KEY'], df['VALUE']))
print(json.dumps(dicty, indent=4))
running this yields the expected output:
{
"apple": 2,
"bannana": 2,
"pear": 3,
"orange": 4
}
Is there a more efficient way to achieve this transformation with pandas?
df = pandas.read_csv(io.StringIO(input), delim_whitespace=True, header=0)
df.groupby('KEY')['VALUE'].max()
If your input needs to be a dictionary, just add to_dict() :
df.groupby('KEY')['VALUE'].max().to_dict()
Also you can try (note this relies on the input being sorted ascending within each key, so that .last() picks the maximum):
[*df.groupby('KEY',sort=False).last().to_dict().values()][0]
{'apple': 2, 'bannana': 2, 'pear': 3, 'orange': 4}
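If you ever need the whole winning row rather than just the maximum value, idxmax per group works too (a sketch on a trimmed version of the data):

```python
import pandas as pd

df = pd.DataFrame({'KEY': ['apple', 'apple', 'pear', 'pear', 'orange'],
                   'VALUE': [0, 2, 1, 3, 4]})

# index label of the row holding the largest VALUE in each KEY group
winners = df.loc[df.groupby('KEY')['VALUE'].idxmax()]
print(winners)
```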

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split each name into a tuple (column_name, prefix, number), sort on the
# 2nd and 3rd items of the tuple, then retrieve the first item
# (adjust to "key=lambda x: x[2]" to group cols by numbers only)
cols_new = cols[:N] + [a[0] for a in sorted(
    [(c, p, int(n)) for c in cols[N:] for p, n in [c.split('_')]],
    key=lambda x: (x[1], x[2]))]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one liner in python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For Example lets say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]})
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then used
df = df.reindex(sorted(df.columns), axis=1)
Then the dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, the columns without an underscore come first, followed by the underscored ones. However, this sorts names alphabetically, so columns earlier in the alphabet come first, and the numeric suffixes are still compared as strings (e.g. Pre_10 would land before Pre_2).
You need to split your column on '_' then convert to int:
c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
Next case, you need human sorting:
import re
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]
df.reindex(sorted(df.columns, key=natural_keys), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name
cols_fixed = list(df.columns[:3])  # change index no based on your df
cols_variable = df.columns[3:]     # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x: int(x.split('_')[1]))  # sort by the number after '_'
cols_new = cols_fixed + cols_variable  # cols_fixed must be a plain list for + to concatenate
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names
cols_fixed = list(df.columns[:3])  # change index no based on your df
cols_variable = df.columns[3:]     # change index no based on your df
# sort by the string part, then by the number part as an int, so that
# pro_2 comes before pro_10
cols_variable = sorted(cols_variable, key=lambda x: (x.split('_')[0], int(x.split('_')[1])))
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
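Runnable end to end on the header from the question (a sketch; note that the fixed columns must be a plain list, because adding a pandas Index to a list with + does not concatenate):

```python
import pandas as pd

cols = ['id', 'dob', 'gender', 'pro_1', 'pro_10', 'pro_11', 'pro_2',
        'pro_9', 'pre_1', 'pre_10']
df = pd.DataFrame([range(len(cols))], columns=cols)

fixed = list(df.columns[:3])
variable = df.columns[3:]

# order by (prefix, numeric suffix) so pro_2 sorts before pro_10
by_both = fixed + sorted(variable,
                         key=lambda c: (c.split('_')[0], int(c.split('_')[1])))
df_new = df[by_both]
print(df_new.columns.tolist())
```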
