I have a dataframe, input_df:
Item Space Size Max
Apple 0.375 0.375 0.375
Lemon 0.625 0.375 0.75
Melon 0.5 0.375 0.625
The 'Space' column value needs to become the nearest multiple of the 'Size' column, with the constraint that the final input_df['Space'] <= input_df['Max']. An 'extra' column should be generated that gives how much space value was added/reduced.
Expected output_df
Item Space Size Max extra
Apple 0.375 0.375 0.375 0
Lemon 0.75 0.375 0.75 0.125 #Space value changed to nearest
Melon 0.375 0.375 0.625 -0.125 #Space value changed to nearest
You can use:
# get max multiple
MAX = input_df['Max'].div(input_df['Size'])
NEW = (input_df['Space']
.div(input_df['Size']) # get multiple
.clip(upper=MAX.astype(int)) # clip to MAX integer
.round() # round to nearest integer
.mul(input_df['Size']) # multiply again by step
)
# compute the difference to original
input_df['extra'] = NEW.sub(input_df['Space'])
# update original
input_df['Space'] = NEW
output:
Item Space Size Max extra
0 Apple 0.375 0.375 0.375 0.000
1 Lemon 0.750 0.375 0.750 0.125
2 Melon 0.375 0.375 0.625 -0.125
3 Peach 0.375 0.375 0.625 -0.375
4 Grape 0.375 0.375 0.750 0.075
used input:
Item Space Size Max
0 Apple 0.375 0.375 0.375
1 Lemon 0.625 0.375 0.750
2 Melon 0.500 0.375 0.625
3 Peach 0.750 0.375 0.625
4 Grape 0.300 0.375 0.750
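For reference, the snippet above can be run end to end like this (input_df reconstructed from the table shown; the intermediate names match the answer's):

```python
import pandas as pd

# reconstructing input_df from the "used input" table
input_df = pd.DataFrame({
    'Item': ['Apple', 'Lemon', 'Melon', 'Peach', 'Grape'],
    'Space': [0.375, 0.625, 0.500, 0.750, 0.300],
    'Size': [0.375, 0.375, 0.375, 0.375, 0.375],
    'Max': [0.375, 0.750, 0.625, 0.625, 0.750],
})

MAX = input_df['Max'].div(input_df['Size'])    # max allowed multiple
NEW = (input_df['Space']
       .div(input_df['Size'])                  # current multiple
       .clip(upper=MAX.astype(int))            # cap at the integer part of MAX
       .round()                                # nearest integer multiple
       .mul(input_df['Size'])                  # back to a space value
       )
input_df['extra'] = NEW.sub(input_df['Space'])
input_df['Space'] = NEW
```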
You can find the nearest multiple by taking the minimum of a floor operation on Max/Size and a round operation on Space/Size:
pd.DataFrame([np.floor(df["Max"]/df["Size"]), np.round(df["Space"]/df["Size"])]).min() * df["Size"]
Full example:
columns = ["Item","Space","Size","Max"]
data = [("Apple",0.375,0.375,0.375),
("Lemon",0.500,0.375,0.75),
("Melon",0.625,0.375,0.75),
("Peach",0.700,0.375,0.625),
("Grape",0.300,0.375,0.750)]
df = pd.DataFrame(data=data, columns=columns)
# Input
>> Item Space Size Max
>> 0 Apple 0.375 0.375 0.375
>> 1 Lemon 0.500 0.375 0.750
>> 2 Melon 0.625 0.375 0.750
>> 3 Peach 0.700 0.375 0.625
>> 4 Grape 0.300 0.375 0.750
df["Extra"] = df["Space"]
df["Space"] = pd.DataFrame([np.floor(df["Max"]/df["Size"]), np.round(df["Space"]/df["Size"])]).min() * df["Size"]
df["Extra"] = df["Space"] - df["Extra"]
# Output
>> Item Space Size Max Extra
>> 0 Apple 0.375 0.375 0.375 0.000
>> 1 Lemon 0.375 0.375 0.750 -0.125
>> 2 Melon 0.750 0.375 0.750 0.125
>> 3 Peach 0.375 0.375 0.625 -0.325
>> 4 Grape 0.375 0.375 0.750 0.075
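A slightly leaner variant of the same idea (an alternative sketch, not the answer's exact code) avoids building a temporary DataFrame by using np.minimum directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item': ['Apple', 'Lemon', 'Melon', 'Peach', 'Grape'],
    'Space': [0.375, 0.500, 0.625, 0.700, 0.300],
    'Size': 0.375,
    'Max': [0.375, 0.750, 0.750, 0.625, 0.750],
})

# smallest of: the largest multiple that fits under Max,
# and the multiple nearest to the current Space
multiple = np.minimum(np.floor(df['Max'] / df['Size']),
                      np.round(df['Space'] / df['Size']))
df['Extra'] = multiple * df['Size'] - df['Space']
df['Space'] = multiple * df['Size']
```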
I have a df,
item space rem_spc
PP Orange Valencia 1.000 0.000
Mango Calypso 0.750 0.250
Grape White 0.625 0.375
Pineapple 0.500 0.500
Plum Other 0.375 0.625
Mango Kp 0.375 0.625
Mango Keitt 0.375 0.625
Plum Croc Egg 0.250 0.750
Mango Other 0.125 0.875
The conditions to be applied:
For every item, when its df['rem_spc'] equals any other item's df['space'], at the first occurrence that item's name needs to be put into a new column df['nxt_item'] and its original row deleted, as shown in the expected output.
If df['rem_spc'] doesn't equal any other item's df['space'], then more items can be picked whose df['space'] values sum to it, taken largest first (e.g. item Pineapple).
If df['rem_spc'] == 0, then df['nxt_item'] = ''.
Item order should not be changed throughout.
Expected Output:
item space rem_spc nxt_item
PP Orange Valencia 1.000 0.000
Mango Calypso 0.750 0.250 Plum Croc Egg
Grape White 0.625 0.375 Plum Other
Pineapple 0.500 0.500 Mango Kp,Mango Other
Mango Keitt 0.375 0.625
Kindly help me through this. Thanks in Advance!
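One reading of the rules above, exact matches searched only among later rows, then a largest-first greedy fallback, can be sketched with a plain loop. The tie-breaking and the restriction to later rows are assumptions, not something the question pins down:

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['PP Orange Valencia', 'Mango Calypso', 'Grape White', 'Pineapple',
             'Plum Other', 'Mango Kp', 'Mango Keitt', 'Plum Croc Egg', 'Mango Other'],
    'space': [1.000, 0.750, 0.625, 0.500, 0.375, 0.375, 0.375, 0.250, 0.125],
    'rem_spc': [0.000, 0.250, 0.375, 0.500, 0.625, 0.625, 0.625, 0.750, 0.875],
})

consumed = set()          # indices of rows used as some item's nxt_item
out_rows = []
for i in df.index:
    if i in consumed:
        continue
    rem = df.at[i, 'rem_spc']
    nxt = ''
    if rem > 0:
        # 1) first later, unused item whose space matches rem exactly
        exact = [j for j in df.index
                 if j > i and j not in consumed and df.at[j, 'space'] == rem]
        if exact:
            consumed.add(exact[0])
            nxt = df.at[exact[0], 'item']
        else:
            # 2) greedy fallback: pick the largest remaining spaces that sum to rem
            cand = sorted((j for j in df.index if j > i and j not in consumed),
                          key=lambda j: -df.at[j, 'space'])
            picked, left = [], rem
            for j in cand:
                if df.at[j, 'space'] <= left:
                    picked.append(j)
                    left -= df.at[j, 'space']
                if left == 0:
                    break
            if picked and left == 0:
                consumed.update(picked)
                nxt = ','.join(df.at[j, 'item'] for j in sorted(picked))
    out_rows.append([df.at[i, 'item'], df.at[i, 'space'], rem, nxt])

out = pd.DataFrame(out_rows, columns=['item', 'space', 'rem_spc', 'nxt_item'])
```

On the sample data this reproduces the expected output, including the combined "Mango Kp,Mango Other" row for Pineapple and the blank nxt_item for Mango Keitt.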
I have a table like this
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
randomNumbers = np.array([1, 3, 5, 4, np.nan, 6])
data = {
"animal": ["dog", "dog", "cat", "cat", "bird", "bird"],
"data": ["red", "blue", "red", "blue", "red", "blue"],
2000: randomNumbers,
2001: randomNumbers * 1.5,
2002: randomNumbers * 1.5 * 1.75,
}
df = pd.DataFrame.from_dict(data)
I want to calculate the percentage difference year by year, for each unique 'animal', with data='red', and add it to the dataframe.
I expect a result like this
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
6 dog redPercent NaN 0.5 0.750
7 cat redPercent NaN 0.5 0.750
8 bird redPercent NaN NaN NaN
So I did this:
for animal in df["animal"].unique():
    animalRow = df[(df["animal"] == animal) & (df["data"] == "red")]
    percChange = animalRow.loc[:, 2000:2002].pct_change(axis=1)
    newRow = [animal, "redPercent"] + percChange.values.tolist()[0]
    df.loc[len(df)] = newRow
print(df)
But it looks non-pythonic. Is there a better way?
Use pct_change after slicing and temporarily setting the non-year columns aside, then concat:
df = pd.concat(
[df,
(df[df['data'].eq('red')]
.assign(data=lambda d: d['data'].add('Percent'))
.set_index(['animal', 'data'])
.pct_change(axis=1)
.reset_index()
)], ignore_index=True)
Output:
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
6 dog redPercent NaN 0.5 0.750
7 cat redPercent NaN 0.5 0.750
8 bird redPercent NaN NaN NaN
I have a requirement like below.
The initial information is a list of gross adds.
201910  201911  201912  202001  202002
 20000   30000   32000   40000   36000
I have a pivot table as below.
201910  201911  201912  202001  202002
  1000    2000    2400    3200    1800
   500     400     300     200     nan
   200     150     100     nan     nan
   200     100     nan     nan     nan
   160     nan     nan     nan     nan
I need to generate a report like below.
Cohort01: 5% 3% 3% 1% 1% 1%
From Cohort02 onwards, it will take the average of the last value of Cohort01. Similarly, for Cohort03, both nan values take the average of the corresponding values of Cohort01 and Cohort02. Again, while calculating Cohort04, it takes the average of the previous two cohorts (Cohort02 and Cohort03 values) to fill all three nan values.
Is there anyone who can provide me a solution for this in Python?
The report should be generated as below.
All cohorts should be created separately.
You could try it like this:
res = df.apply(lambda x: round(100/(df_gross.iloc[0]/x),1),axis=1)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.5 8.0 5.0
1 2.5 1.3 0.9 0.5 NaN
2 1.0 0.5 0.3 NaN NaN
3 1.0 0.3 NaN NaN NaN
4 0.8 NaN NaN NaN NaN
for idx, col in enumerate(res.columns[1:], 1):
    res[col] = res[col].fillna((res.iloc[:, max(idx-2, 0)] + res.iloc[:, idx-1]) / 2)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.50 8.000 5.0000
1 2.5 1.3 0.90 0.500 0.7000
2 1.0 0.5 0.30 0.400 0.3500
3 1.0 0.3 0.65 0.475 0.5625
4 0.8 0.8 0.80 0.800 0.8000
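For completeness, here is the same answer as a self-contained script, with df and df_gross reconstructed from the tables in the question:

```python
import numpy as np
import pandas as pd

cols = [201910, 201911, 201912, 202001, 202002]
df_gross = pd.DataFrame([[20000, 30000, 32000, 40000, 36000]], columns=cols)
df = pd.DataFrame([[1000, 2000, 2400, 3200, 1800],
                   [500, 400, 300, 200, np.nan],
                   [200, 150, 100, np.nan, np.nan],
                   [200, 100, np.nan, np.nan, np.nan],
                   [160, np.nan, np.nan, np.nan, np.nan]], columns=cols)

# percentage of gross adds, per cohort row
res = df.apply(lambda x: round(100 / (df_gross.iloc[0] / x), 1), axis=1)

# fill each nan with the average of the previous two columns
for idx, col in enumerate(res.columns[1:], 1):
    res[col] = res[col].fillna((res.iloc[:, max(idx - 2, 0)] + res.iloc[:, idx - 1]) / 2)
```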
Here is the data
import numpy as np
import pandas as pd
data = {
    'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
    'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
    'tag': [7, 6, 5, 4, 3, 2, 1],
}
df = pd.DataFrame(data)
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 NaN 0.000 5
3 NaN -0.001 4
4 NaN 0.050 3
5 NaN -0.100 2
6 NaN 0.003 1
I want to create the next cases' value as (next value) = (previous value) * (1 + current percent_change). Specifically, I want it done in rows that have a tag value less than 6 (and I must use a mask, i.e., df.loc, for this row selection). This should give me:
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 100.0 0.000 5
3 99.9 -0.001 4
4 104.9 0.050 3
5 94.4 -0.100 2
6 94.7 0.003 1
I tried this but it doesn't work:
df_index = np.where(df['tag'] == 6)
index = df_index[0][0]
df.loc[(df.tag<6), 'cases'] = (df.percent_change.shift(0).fillna(1) + 1).cumprod() * df.at[index, 'cases']
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 104.030000 0.000 5
3 103.925970 -0.001 4
4 109.122268 0.050 3
5 98.210042 -0.100 2
6 98.504672 0.003 1
I would do:
s = df.cases.isna()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
Output:
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 100.000000 0.000 5
3 99.900000 -0.001 4
4 104.895000 0.050 3
5 94.405500 -0.100 2
6 94.688716 0.003 1
Update: If you really insist on masking on the Tag==6:
s = df.tag.eq(6).shift()
s = s.where(s).ffill()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
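Putting the question's data and the answer together, a runnable end-to-end version looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
    'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
    'tag': [7, 6, 5, 4, 3, 2, 1],
})

s = df.cases.isna()                            # rows whose value must be derived
percents = df.percent_change.where(s, 0) + 1   # factor 1.0 where cases is known
df['cases'] = df.cases.ffill() * percents.cumprod()
```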
Suppose you had a DataFrame with a number of columns / Series, say five for example. If the fifth column (named 'Updated Col') had values in addition to NaNs, what would be the best way to insert values into 'Updated Col' from the other columns in place of the NaNs, based on a preferred column order?
e.g. my dataframe looks something like this;
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       Nan
12/03/2017 0:40  0.1                 Nan
12/03/2017 0:50  0.6            0.5  Nan
12/03/2017 1:00  0.4       0.3       Nan
12/03/2017 1:10  0.3            0.2  Nan
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
..and say, for example, I wanted the values from column 3 as a priority, followed by 2, then 1; I would expect the DataFrame to look like this;
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       0.7
12/03/2017 0:40  0.1                 0.1
12/03/2017 0:50  0.6            0.5  0.5
12/03/2017 1:00  0.4       0.3       0.3
12/03/2017 1:10  0.3            0.2  0.2
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
..values would be taken from the lower-priority columns only if the higher-priority columns were empty / NaN.
What would be the best way to do this?
I've tried numerous np.where attempts but can't work out what the best way would be.
Many thanks in advance.
You can use forward filling along columns (ffill(axis=1)) and then select the column:
updated_col = 'Updated Col'
# define columns to check; maybe [1,2,3,4] if the column names are integers
cols = ['1','2','3','4'] + [updated_col]
print (df[cols].ffill(axis=1))
1 2 3 4 Updated Col
0 0.4 0.4 0.4 0.4 0.9
1 0.4 0.4 0.4 0.4 0.1
2 0.4 0.4 0.4 0.4 0.6
3 0.9 0.9 0.7 0.7 0.7
4 0.1 0.1 0.1 0.1 0.1
5 0.6 0.6 0.6 0.5 0.5
6 0.4 0.4 0.3 0.3 0.3
7 0.3 0.3 0.3 0.2 0.2
8 0.9 0.9 0.9 0.9 0.8
9 0.9 0.9 0.9 0.9 0.8
10 0.0 0.0 0.0 0.0 0.9
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
EDIT:
Thank you shivsn for the comment.
If the DataFrame contains 'Nan' strings (which are not real NaN missing values) or empty strings, it is necessary to replace them first:
updated_col = 'Updated Col'
cols = ['1','2','3','4'] + ['Updated Col']
d = {'Nan':np.nan, '': np.nan}
df = df.replace(d)
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
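Note that ffill(axis=1) implicitly prefers the right-most populated column. If you need an explicit priority order instead (e.g. column 3 before column 4, as the question states), one hypothetical variant is to arrange the columns by priority and take the first non-NaN via bfill; the small frame and priority list below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '1': [0.9, 0.1, 0.6],
    '2': [np.nan, np.nan, np.nan],
    '3': [0.7, np.nan, np.nan],
    '4': [np.nan, np.nan, 0.5],
    'Updated Col': [np.nan, np.nan, np.nan],
})

# columns in the order they should be consulted; the first non-NaN wins
priority = ['Updated Col', '3', '2', '1', '4']
df['Updated Col'] = df[priority].bfill(axis=1).iloc[:, 0]
```

With this priority the last row takes 0.6 from column 1 rather than 0.5 from column 4, which is where it differs from the ffill(axis=1) result.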