Pandas: find first occurrence of a matching column value including other criteria - python-3.x

I have a df:
item                 space  rem_spc
PP Orange Valencia   1.000    0.000
Mango Calypso        0.750    0.250
Grape White          0.625    0.375
Pineapple            0.500    0.500
Plum Other           0.375    0.625
Mango Kp             0.375    0.625
Mango Keitt          0.375    0.625
Plum Croc Egg        0.250    0.750
Mango Other          0.125    0.875
The conditions to be applied:
For every item, when its df['rem_spc'] equals any other item's df['space'] (taking the first occurrence), that item's name needs to be put into a new column df['nxt_item'] and the matched item's original row deleted, as given in the expected output.
If the df['rem_spc'] value doesn't equal any other single item's df['space'], then multiple items can be picked whose df['space'] values sum to it, taken in order (e.g., item Pineapple).
If df['rem_spc'] == 0, then df['nxt_item'] = ''.
The item order should not change throughout.
Expected Output:
item                 space  rem_spc  nxt_item
PP Orange Valencia   1.000    0.000
Mango Calypso        0.750    0.250  Plum Croc Egg
Grape White          0.625    0.375  Plum Other
Pineapple            0.500    0.500  Mango Kp,Mango Other
Mango Keitt          0.375    0.625
Kindly help me through this. Thanks in Advance!
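One way to read these rules is a single greedy pass over the rows; below is a minimal sketch under that reading (it assumes the exact df above and exact float equality, which holds for these 0.125-step values; real data may need np.isclose):

import pandas as pd

df = pd.DataFrame({
    'item': ['PP Orange Valencia', 'Mango Calypso', 'Grape White',
             'Pineapple', 'Plum Other', 'Mango Kp', 'Mango Keitt',
             'Plum Croc Egg', 'Mango Other'],
    'space':   [1.000, 0.750, 0.625, 0.500, 0.375, 0.375, 0.375, 0.250, 0.125],
    'rem_spc': [0.000, 0.250, 0.375, 0.500, 0.625, 0.625, 0.625, 0.750, 0.875],
})

used = set()   # rows already consumed as fillers for an earlier item
records = []
for i in df.index:
    if i in used:
        continue
    need = df.at[i, 'rem_spc']
    fillers = []
    if need > 0:
        # first occurrence of a later, unused item whose space matches exactly
        exact = [j for j in df.index
                 if j > i and j not in used and df.at[j, 'space'] == need]
        if exact:
            used.add(exact[0])
            fillers.append(df.at[exact[0], 'item'])
        else:
            # otherwise collect later unused items, in order, whose space
            # still fits into the remaining gap
            for j in df.index:
                if j > i and j not in used and df.at[j, 'space'] <= need:
                    used.add(j)
                    fillers.append(df.at[j, 'item'])
                    need -= df.at[j, 'space']
    records.append({'item': df.at[i, 'item'],
                    'space': df.at[i, 'space'],
                    'rem_spc': df.at[i, 'rem_spc'],
                    'nxt_item': ','.join(fillers)})

out = pd.DataFrame(records)
print(out)   # matches the expected output above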

Related

How to calculate percentage change in rows by two column criteria

I have a table like this
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
import numpy as np
import pandas as pd

randomNumbers = np.array([1, 3, 5, 4, np.nan, 6])
data = {
    "animal": ["dog", "dog", "cat", "cat", "bird", "bird"],
    "data": ["red", "blue", "red", "blue", "red", "blue"],
    2000: randomNumbers,
    2001: randomNumbers * 1.5,
    2002: randomNumbers * 1.5 * 1.75,
}
df = pd.DataFrame.from_dict(data)
I want to calculate the percentage difference year by year, for each unique 'animal', with data='red', and add it to the dataframe.
I expect a result like this
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
6 dog redPercent NaN 0.5 0.750
7 cat redPercent NaN 0.5 0.750
8 bird redPercent NaN NaN NaN
So I did this:
for animal in df["animal"].unique():
    animalRow = df[(df["animal"] == animal) & (df["data"] == "red")]
    percChange = animalRow.loc[:, 2000:2002].pct_change(axis=1)
    newRow = [animal, "redPercent"] + percChange.values.tolist()[0]
    df.loc[len(df)] = newRow
print(df)
But it looks non-pythonic. Is there a better way?
Use pct_change after slicing and temporarily setting the non-year columns aside as the index, then concat:
df = pd.concat(
    [df,
     (df[df['data'].eq('red')]
      .assign(data=lambda d: d['data'].add('Percent'))
      .set_index(['animal', 'data'])
      .pct_change(axis=1)
      .reset_index()
     )], ignore_index=True)
Output:
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
6 dog redPercent NaN 0.5 0.750
7 cat redPercent NaN 0.5 0.750
8 bird redPercent NaN NaN NaN

Python: How to find the nearest multiple value in pandas?

I have a dataframe
input_df
Item Space Size Max
Apple 0.375 0.375 0.375
Lemon 0.625 0.375 0.75
Melon 0.5 0.375 0.625
The 'Space' column value needs to become the nearest multiple of the 'Size' column value, with the constraint that the final input_df['Space'] <= input_df['Max']. An 'extra' column needs to be generated which gives how much space value was added/reduced.
Expected output_df
Item Space Size Max extra
Apple 0.375 0.375 0.375 0
Lemon 0.75 0.375 0.75 0.125 #Space value changed to nearest
Melon 0.375 0.375 0.625 -0.125 #Space value changed to nearest
You can use:
# get the max allowed multiple
MAX = input_df['Max'].div(input_df['Size'])

NEW = (input_df['Space']
       .div(input_df['Size'])        # get the multiple of Size
       .clip(upper=MAX.astype(int))  # clip to the integer part of MAX
       .round()                      # round to the nearest integer
       .mul(input_df['Size'])        # multiply back by the step
      )

# compute the difference to the original
input_df['extra'] = NEW.sub(input_df['Space'])
# update the original
input_df['Space'] = NEW
output:
Item Space Size Max extra
0 Apple 0.375 0.375 0.375 0.000
1 Lemon 0.750 0.375 0.750 0.125
2 Melon 0.375 0.375 0.625 -0.125
3 Peach 0.375 0.375 0.625 -0.375
4 Grape 0.375 0.375 0.750 0.075
used input:
Item Space Size Max
0 Apple 0.375 0.375 0.375
1 Lemon 0.625 0.375 0.750
2 Melon 0.500 0.375 0.625
3 Peach 0.750 0.375 0.625
4 Grape 0.300 0.375 0.750
You can find the nearest multiple by taking the element-wise min of a floor operation on Max/Size and a round operation on Space/Size, then multiplying back by Size:
pd.DataFrame([np.floor(df["Max"]/df["Size"]), np.round(df["Space"]/df["Size"])]).min() * df["Size"]
Full example:
import numpy as np
import pandas as pd

columns = ["Item", "Space", "Size", "Max"]
data = [("Apple", 0.375, 0.375, 0.375),
        ("Lemon", 0.500, 0.375, 0.750),
        ("Melon", 0.625, 0.375, 0.750),
        ("Peach", 0.700, 0.375, 0.625),
        ("Grape", 0.300, 0.375, 0.750)]
df = pd.DataFrame(data=data, columns=columns)
# Input
>> Item Space Size Max
>> 0 Apple 0.375 0.375 0.375
>> 1 Lemon 0.500 0.375 0.750
>> 2 Melon 0.625 0.375 0.750
>> 3 Peach 0.700 0.375 0.625
>> 4 Grape 0.300 0.375 0.750
df["Extra"] = df["Space"]
df["Space"] = pd.DataFrame([np.floor(df["Max"]/df["Size"]), np.round(df["Space"]/df["Size"])]).min() * df["Size"]
df["Extra"] = df["Space"] - df["Extra"]
# Output
>> Item Space Size Max Extra
>> 0 Apple 0.375 0.375 0.375 0.000
>> 1 Lemon 0.375 0.375 0.750 -0.125
>> 2 Melon 0.750 0.375 0.750 0.125
>> 3 Peach 0.375 0.375 0.625 -0.325
>> 4 Grape 0.375 0.375 0.750 0.075

How to melt a dataframe into a long form?

I have the following dataframe
recycling 1 metric tonne (1000 kilogram) per waste type Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 1 barrel oil is approximately 159 litres of oil NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 material Plastic Glass Ferrous Metal Non-Ferrous Metal Paper
3 energy_saved 5774 Kwh 42 Kwh 642 Kwh 14000 Kwh 4000 kWh
4 crude_oil saved 16 barrels NaN 1.8 barrels 40 barrels 1.7 barrels
What I want to do is get rows 2, 3 and 4 into columns in a new dataframe. It should look something like this:
material energy_saved crude_oil saved
plastic 5774Kwh 16 barrels
Glass 42 Kwh NaN
... ... ...
I tried using .melt but it was not working. If you notice, each column name and its values sit in a single row; I just want them in a new dataframe as a column and its values.
IIUC, is it just:
out = df.loc[[2,3,4],:].T.reset_index(drop=True)
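Note that the transpose keeps the label row ('material', 'energy_saved', 'crude_oil saved') as data in row 0. A small follow-up sketch (assuming the df above) promotes that row to the header:

out = df.loc[[2, 3, 4], :].T.reset_index(drop=True)
out.columns = out.iloc[0]                 # row 0 holds the column labels
out = out.drop(index=0).reset_index(drop=True)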

Find duplicate rows and move corresponding data to adjacent to original row

I have the following dataframe:
unique_id person_id fruit_name poduct guest
92 11 apple silver Miller
93 12 cherry bronze Gus
967 121 orange purple Mike
94 176 apple silver Miller
95 176 banana gold John
96 176 orange purple Mike
445 111 apple silver Miller
100 112 cherry bronze Gus
232 111 apple silver Miller
355 555 cherry bronze Gus
I want to grab any of the duplicate values found under the person_id column and move them adjacent to the original row. Here is an example of the expected output:
unique_id person_id fruit_name poduct guest unique_id_1 fruit_name poduct guest unique_id_2 fruit_name poduct guest
92 11 apple silver Miller
93 12 cherry bronze Gus
967 121 orange purple Mike
94 176 apple silver Miller 95 banana gold John 96 orange purple Mike
100 112 cherry bronze Gus
445 111 apple silver Miller 232 apple silver Miller
355 555 cherry bronze Gus
I'm not really sure what I should search for online in order to achieve this; any suggestion is greatly appreciated.
It's a "long to wide" transformation.
You can add a column to identify which group a row is part of.
df['group'] = df.groupby('person_id').cumcount() + 1
>>> df
unique_id person_id fruit_name poduct guest group
0 92 11 apple silver Miller 1
1 93 12 cherry bronze Gus 1
2 967 121 orange purple Mike 1
3 94 176 apple silver Miller 1
4 95 176 banana gold John 2
5 96 176 orange purple Mike 3
6 445 111 apple silver Miller 1
7 100 112 cherry bronze Gus 1
8 232 111 apple silver Miller 2
9 355 555 cherry bronze Gus 1
This then gets used in DataFrame.pivot()
>>> df.pivot(index='person_id', columns='group').sort_index(axis=1, level=1)
fruit_name guest poduct unique_id fruit_name guest poduct unique_id fruit_name guest poduct unique_id
group 1 1 1 1 2 2 2 2 3 3 3 3
person_id
11 apple Miller silver 92.0 NaN NaN NaN NaN NaN NaN NaN NaN
12 cherry Gus bronze 93.0 NaN NaN NaN NaN NaN NaN NaN NaN
111 apple Miller silver 445.0 apple Miller silver 232.0 NaN NaN NaN NaN
112 cherry Gus bronze 100.0 NaN NaN NaN NaN NaN NaN NaN NaN
121 orange Mike purple 967.0 NaN NaN NaN NaN NaN NaN NaN NaN
176 apple Miller silver 94.0 banana John gold 95.0 orange Mike purple 96.0
555 cherry Gus bronze 355.0 NaN NaN NaN NaN NaN NaN NaN NaN
Then you can rename the columns.
out = df.pivot(index='person_id', columns='group').sort_index(axis=1, level=1)
out.columns = [ f'{x}_{y}' for x, y in out.columns ]
>>> out.reset_index()
person_id fruit_name_1 guest_1 poduct_1 unique_id_1 fruit_name_2 guest_2 poduct_2 unique_id_2 fruit_name_3 guest_3 poduct_3 unique_id_3
0 11 apple Miller silver 92.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 12 cherry Gus bronze 93.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 111 apple Miller silver 445.0 apple Miller silver 232.0 NaN NaN NaN NaN
3 112 cherry Gus bronze 100.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 121 orange Mike purple 967.0 NaN NaN NaN NaN NaN NaN NaN NaN
5 176 apple Miller silver 94.0 banana John gold 95.0 orange Mike purple 96.0
6 555 cherry Gus bronze 355.0 NaN NaN NaN NaN NaN NaN NaN NaN
UPDATE
Custom column order example:
order = ['person_id', 'fruit_name', 'unique_id', 'guest', 'poduct']
out = df.pivot(index='person_id', columns='group')
out = out[sorted(out.columns, key=lambda idx: (idx[1], order.index(idx[0])))]
out.columns = [ f'{x}_{y}' for x, y in out.columns ]
>>> out.reset_index()
person_id fruit_name_1 unique_id_1 guest_1 poduct_1 fruit_name_2 unique_id_2 guest_2 poduct_2 fruit_name_3 unique_id_3 guest_3 poduct_3
0 11 apple 92.0 Miller silver NaN NaN NaN NaN NaN NaN NaN NaN
1 12 cherry 93.0 Gus bronze NaN NaN NaN NaN NaN NaN NaN NaN
2 111 apple 445.0 Miller silver apple 232.0 Miller silver NaN NaN NaN NaN
3 112 cherry 100.0 Gus bronze NaN NaN NaN NaN NaN NaN NaN NaN
4 121 orange 967.0 Mike purple NaN NaN NaN NaN NaN NaN NaN NaN
5 176 apple 94.0 Miller silver banana 95.0 John gold orange 96.0 Mike purple
6 555 cherry 355.0 Gus bronze NaN NaN NaN NaN NaN NaN NaN NaN
Try:
# Separate the duplicated rows from the rest
dup = df.duplicated(subset=['person_id'], keep='last')
rem = df[~dup]
# Merge back on "person_id"
new_df = pd.merge(
    left=df[dup],
    right=rem,
    how="outer",
    on=["person_id"],
    suffixes=("_0", "_1"),
)

Pandas DataFrame: Subtract one row from another taking into account the index (name and date)

Hi, I have a Python pandas DataFrame where I would like to see the changes between the latest 2 dates (when available) for 3 index columns (phonetype, memory and brand).
The dataframe looks like the raw data listed at the bottom.
I would like to have the latest change of customers' holdings per brand, memory and phonetype, so the result would be sorted by the latest change (when available).
Which means that the last change of holdings for iphone1/32go/apple was on the 17/10/19 and was a decrease of 0.11 (-0.11), the last change for iphone2/32go/apple was on the 19/03/19 and was a decrease of 0.09 (-0.09), and the last change for iphone3/64go/apple was on the 05/12/16 and was a decrease of 0.12 (-0.12).
So basically, subtract the second row from the 1st row, when the second row exists (meaning 2 records containing the same phonetype/memory/brand with different dates). If the second row doesn't exist, just show the 1st row unchanged (first row's [customers_holders] - 0), e.g.:
iphone4 32go Apple -0.50 01/11/2019
I don't know how to do this with pandas, without iterating through rows...
Any help would be much appreciated.
Thanks
Raw data are as below:
phonetype memory Brand customers_holders position_date
iphone1 32go Apple 0.77 17/10/2019
iphone1 32go Apple 0.88 10/10/2019
iphone1 32go Apple 0.98 26/09/2019
iphone1 32go Apple 1 15/08/2019
iphone1 32go Apple 0.9 06/08/2019
iphone1 32go Apple 0.8 18/07/2019
iphone1 32go Apple 0.8 18/07/2019
iphone1 32go Apple 0.74 20/06/2019
iphone1 32go Apple 0.61 11/06/2019
iphone1 32go Apple 0.5 21/05/2019
iphone2 32go Apple 0.5 19/03/2019
iphone2 32go Apple 0.59 16/01/2019
iphone2 32go Apple 0.68 04/12/2018
iphone3 64go Apple 0.5 05/12/2016
iphone3 64go Apple 0.62 11/11/2016
iphone3 64go Apple 0.79 12/11/2018
iphone4 32go Apple 0.50 01/11/2019
You can try this:
First, change the date column to the datetime type so that the latest date can be found.
import numpy as np
import pandas as pd

df['position_date'] = pd.to_datetime(df['position_date'], format='%d/%m/%Y')
print(df.head(10))
phonetype memory Brand customers_holders position_date
0 iphone1 32go Apple 0.77 2019-10-17
1 iphone1 32go Apple 0.88 2019-10-10
2 iphone1 32go Apple 0.98 2019-09-26
3 iphone1 32go Apple 1.00 2019-08-15
4 iphone1 32go Apple 0.90 2019-08-06
5 iphone1 32go Apple 0.80 2019-07-18
6 iphone1 32go Apple 0.80 2019-07-18
7 iphone1 32go Apple 0.74 2019-06-20
8 iphone1 32go Apple 0.61 2019-06-11
9 iphone1 32go Apple 0.50 2019-05-21
And then:
1. Sort in descending order by the key columns and the date column.
2. Use the groupby().diff() function to calculate the difference from the next row (the previous date) within each group.
3. I think you only need the difference between the latest date and the previous date, so use drop_duplicates to keep only the first row of each group.
Edit: then, if diff is NaN (the group has only one record), you can populate the value using np.where, like this:
key_col = ['phonetype','memory','Brand']
df = df.sort_values(by= key_col + ['position_date'], ascending=False)
df['diff'] = df.groupby(key_col)['customers_holders'].diff(periods=-1)
df = df.drop_duplicates(subset=key_col, keep='first')
# if diff is nan.
df['diff'] = np.where(df['diff'].isnull(), -df['customers_holders'], df['diff'])
print(df)
phonetype memory Brand customers_holders position_date diff
15 iphone3 64go Apple 0.79 2018-11-12 0.29
10 iphone2 32go Apple 0.50 2019-03-19 -0.09
0 iphone1 32go Apple 0.77 2019-10-17 -0.11
To make it look like your result:
df = df.drop('customers_holders', axis=1)\
       .rename({'diff': 'customers_holders'}, axis=1)\
       .sort_values(by='phonetype')\
       .reset_index(drop=True)
print(df)
phonetype memory Brand position_date customers_holders
0 iphone1 32go Apple 2019-10-17 -0.11
1 iphone2 32go Apple 2019-03-19 -0.09
2 iphone3 64go Apple 2018-11-12 0.29
