I have a dataframe, input_df:
Item Space Size Max
Apple 0.375 0.375 0.375
Lemon 0.625 0.375 0.75
Melon 0.5 0.375 0.625
The 'Space' column value needs to become the nearest multiple of the 'Size' column, with the constraint that the final input_df['Space'] <= input_df['Max']. An 'extra' column should be generated that gives how much space value was added/reduced.
Expected output_df
Item Space Size Max extra
Apple 0.375 0.375 0.375 0
Lemon 0.75 0.375 0.75 0.125 #Space value changed to nearest
Melon 0.375 0.375 0.625 -0.125 #Space value changed to nearest
You can use:
# get max multiple
MAX = input_df['Max'].div(input_df['Size'])
NEW = (input_df['Space']
.div(input_df['Size']) # get multiple
.clip(upper=MAX.astype(int)) # clip to MAX integer
.round() # round to nearest integer
.mul(input_df['Size']) # multiply again by step
)
# compute the difference to original
input_df['extra'] = NEW.sub(input_df['Space'])
# update original
input_df['Space'] = NEW
output:
Item Space Size Max extra
0 Apple 0.375 0.375 0.375 0.000
1 Lemon 0.750 0.375 0.750 0.125
2 Melon 0.375 0.375 0.625 -0.125
3 Peach 0.375 0.375 0.625 -0.375
4 Grape 0.375 0.375 0.750 0.075
used input:
Item Space Size Max
0 Apple 0.375 0.375 0.375
1 Lemon 0.625 0.375 0.750
2 Melon 0.500 0.375 0.625
3 Peach 0.750 0.375 0.625
4 Grape 0.300 0.375 0.750
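For reference, the snippet above can be run end to end like this (input_df reconstructed from the table shown; the intermediate names match the answer's):

```python
import pandas as pd

# reconstructing input_df from the "used input" table
input_df = pd.DataFrame({
    'Item': ['Apple', 'Lemon', 'Melon', 'Peach', 'Grape'],
    'Space': [0.375, 0.625, 0.500, 0.750, 0.300],
    'Size': [0.375, 0.375, 0.375, 0.375, 0.375],
    'Max': [0.375, 0.750, 0.625, 0.625, 0.750],
})

MAX = input_df['Max'].div(input_df['Size'])    # max allowed multiple
NEW = (input_df['Space']
       .div(input_df['Size'])                  # current multiple
       .clip(upper=MAX.astype(int))            # cap at the integer part of MAX
       .round()                                # nearest integer multiple
       .mul(input_df['Size'])                  # back to a space value
       )
input_df['extra'] = NEW.sub(input_df['Space'])
input_df['Space'] = NEW
```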
You can find the nearest multiple by taking the minimum of a floor operation on Max/Size and a round operation on Space/Size:
pd.DataFrame([np.floor(df["Max"]/df["Size"]), np.round(df["Space"]/df["Size"])]).min() * df["Size"]
Full example:
columns = ["Item","Space","Size","Max"]
data = [("Apple",0.375,0.375,0.375),
("Lemon",0.500,0.375,0.75),
("Melon",0.625,0.375,0.75),
("Peach",0.700,0.375,0.625),
("Grape",0.300,0.375,0.750)]
df = pd.DataFrame(data=data, columns=columns)
# Input
>> Item Space Size Max
>> 0 Apple 0.375 0.375 0.375
>> 1 Lemon 0.500 0.375 0.750
>> 2 Melon 0.625 0.375 0.750
>> 3 Peach 0.700 0.375 0.625
>> 4 Grape 0.300 0.375 0.750
df["Extra"] = df["Space"]
df["Space"] = pd.DataFrame([np.floor(df["Max"]/df["Size"]), np.round(df["Space"]/df["Size"])]).min() * df["Size"]
df["Extra"] = df["Space"] - df["Extra"]
# Output
>> Item Space Size Max Extra
>> 0 Apple 0.375 0.375 0.375 0.000
>> 1 Lemon 0.375 0.375 0.750 -0.125
>> 2 Melon 0.750 0.375 0.750 0.125
>> 3 Peach 0.375 0.375 0.625 -0.325
>> 4 Grape 0.375 0.375 0.750 0.075
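A slightly leaner variant of the same idea (an alternative sketch, not the answer's exact code) avoids building a temporary DataFrame by using np.minimum directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item': ['Apple', 'Lemon', 'Melon', 'Peach', 'Grape'],
    'Space': [0.375, 0.500, 0.625, 0.700, 0.300],
    'Size': 0.375,
    'Max': [0.375, 0.750, 0.750, 0.625, 0.750],
})

# smallest of: the largest multiple that fits under Max,
# and the multiple nearest to the current Space
multiple = np.minimum(np.floor(df['Max'] / df['Size']),
                      np.round(df['Space'] / df['Size']))
df['Extra'] = multiple * df['Size'] - df['Space']
df['Space'] = multiple * df['Size']
```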
I have a df,
item space rem_spc
PP Orange Valencia 1.000 0.000
Mango Calypso 0.750 0.250
Grape White 0.625 0.375
Pineapple 0.500 0.500
Plum Other 0.375 0.625
Mango Kp 0.375 0.625
Mango Keitt 0.375 0.625
Plum Croc Egg 0.250 0.750
Mango Other 0.125 0.875
The conditions to be applied:
For every item, when its df['rem_spc'] equals any other item's df['space'], at the first occurrence that item's name needs to be put into a new column df['nxt_item'] and its original row deleted, as shown in the expected output.
If df['rem_spc'] doesn't equal any other item's df['space'], then more items can be picked whose df['space'] values sum to it, taken largest first (e.g. item Pineapple).
If df['rem_spc'] == 0, then df['nxt_item'] = ''.
Item order should not be changed throughout.
Expected Output:
item space rem_spc nxt_item
PP Orange Valencia 1.000 0.000
Mango Calypso 0.750 0.250 Plum Croc Egg
Grape White 0.625 0.375 Plum Other
Pineapple 0.500 0.500 Mango Kp,Mango Other
Mango Keitt 0.375 0.625
Kindly help me through this. Thanks in Advance!
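One reading of the rules above, exact matches searched only among later rows, then a largest-first greedy fallback, can be sketched with a plain loop. The tie-breaking and the restriction to later rows are assumptions, not something the question pins down:

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['PP Orange Valencia', 'Mango Calypso', 'Grape White', 'Pineapple',
             'Plum Other', 'Mango Kp', 'Mango Keitt', 'Plum Croc Egg', 'Mango Other'],
    'space': [1.000, 0.750, 0.625, 0.500, 0.375, 0.375, 0.375, 0.250, 0.125],
    'rem_spc': [0.000, 0.250, 0.375, 0.500, 0.625, 0.625, 0.625, 0.750, 0.875],
})

consumed = set()          # indices of rows used as some item's nxt_item
out_rows = []
for i in df.index:
    if i in consumed:
        continue
    rem = df.at[i, 'rem_spc']
    nxt = ''
    if rem > 0:
        # 1) first later, unused item whose space matches rem exactly
        exact = [j for j in df.index
                 if j > i and j not in consumed and df.at[j, 'space'] == rem]
        if exact:
            consumed.add(exact[0])
            nxt = df.at[exact[0], 'item']
        else:
            # 2) greedy fallback: pick the largest remaining spaces that sum to rem
            cand = sorted((j for j in df.index if j > i and j not in consumed),
                          key=lambda j: -df.at[j, 'space'])
            picked, left = [], rem
            for j in cand:
                if df.at[j, 'space'] <= left:
                    picked.append(j)
                    left -= df.at[j, 'space']
                if left == 0:
                    break
            if picked and left == 0:
                consumed.update(picked)
                nxt = ','.join(df.at[j, 'item'] for j in sorted(picked))
    out_rows.append([df.at[i, 'item'], df.at[i, 'space'], rem, nxt])

out = pd.DataFrame(out_rows, columns=['item', 'space', 'rem_spc', 'nxt_item'])
```

On the sample data this reproduces the expected output, including the combined "Mango Kp,Mango Other" row for Pineapple and the blank nxt_item for Mango Keitt.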
I have a table like this
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
randomNumbers = np.array([1, 3, 5, 4, np.nan, 6])
data = {
"animal": ["dog", "dog", "cat", "cat", "bird", "bird"],
"data": ["red", "blue", "red", "blue", "red", "blue"],
2000: randomNumbers,
2001: randomNumbers * 1.5,
2002: randomNumbers * 1.5 * 1.75,
}
df = pd.DataFrame.from_dict(data)
I want to calculate the percentage difference year by year, for each unique 'animal', with data='red', and add it to the dataframe.
I expect a result like this
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
6 dog redPercent NaN 0.5 0.750
7 cat redPercent NaN 0.5 0.750
8 bird redPercent NaN NaN NaN
So I did this:
for animal in df["animal"].unique():
    animalRow = df[(df["animal"] == animal) & (df["data"] == "red")]
    percChange = animalRow.loc[:, 2000:2002].pct_change(axis=1)
    newRow = [animal, "redPercent"] + percChange.values.tolist()[0]
    df.loc[len(df)] = newRow
print(df)
But it looks non-pythonic. Is there a better way?
Use pct_change after slicing and temporarily setting the non-year columns aside, then concat:
df = pd.concat(
[df,
(df[df['data'].eq('red')]
.assign(data=lambda d: d['data'].add('Percent'))
.set_index(['animal', 'data'])
.pct_change(axis=1)
.reset_index()
)], ignore_index=True)
Output:
animal data 2000 2001 2002
0 dog red 1.0 1.5 2.625
1 dog blue 3.0 4.5 7.875
2 cat red 5.0 7.5 13.125
3 cat blue 4.0 6.0 10.500
4 bird red NaN NaN NaN
5 bird blue 6.0 9.0 15.750
6 dog redPercent NaN 0.5 0.750
7 cat redPercent NaN 0.5 0.750
8 bird redPercent NaN NaN NaN
I have a requirement like below.
The initial information is a list of gross adds.
201910  201911  201912  202001  202002
 20000   30000   32000   40000   36000
I have a pivot table as below.
201910  201911  201912  202001  202002
  1000    2000    2400    3200    1800
   500     400     300     200     nan
   200     150     100     nan     nan
   200     100     nan     nan     nan
   160     nan     nan     nan     nan
I need to generate a report like below.
Cohort01: 5% 3% 3% 1% 1% 1%
From Cohort02 onwards, it will take the average of the last value of Cohort01. Similarly, for Cohort03, both nan values take the average of the corresponding values of Cohort01 and Cohort02. Again, while calculating Cohort04, it takes the average of the previous two cohorts (Cohort02 and Cohort03 values) to fill all three nan values.
Is there anyone who can provide me a solution for this in Python?
The report should be generated as below.
All cohorts should be created separately.
You could try it like this:
res = df.apply(lambda x: round(100/(df_gross.iloc[0]/x),1),axis=1)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.5 8.0 5.0
1 2.5 1.3 0.9 0.5 NaN
2 1.0 0.5 0.3 NaN NaN
3 1.0 0.3 NaN NaN NaN
4 0.8 NaN NaN NaN NaN
for idx, col in enumerate(res.columns[1:], 1):
    res[col] = res[col].fillna((res.iloc[:, max(idx-2, 0)] + res.iloc[:, idx-1]) / 2)
print(res)
201910 201911 201912 202001 202002
0 5.0 6.7 7.50 8.000 5.0000
1 2.5 1.3 0.90 0.500 0.7000
2 1.0 0.5 0.30 0.400 0.3500
3 1.0 0.3 0.65 0.475 0.5625
4 0.8 0.8 0.80 0.800 0.8000
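For completeness, here is the same answer as a self-contained script, with df and df_gross reconstructed from the tables in the question:

```python
import numpy as np
import pandas as pd

cols = [201910, 201911, 201912, 202001, 202002]
df_gross = pd.DataFrame([[20000, 30000, 32000, 40000, 36000]], columns=cols)
df = pd.DataFrame([[1000, 2000, 2400, 3200, 1800],
                   [500, 400, 300, 200, np.nan],
                   [200, 150, 100, np.nan, np.nan],
                   [200, 100, np.nan, np.nan, np.nan],
                   [160, np.nan, np.nan, np.nan, np.nan]], columns=cols)

# percentage of gross adds, per cohort row
res = df.apply(lambda x: round(100 / (df_gross.iloc[0] / x), 1), axis=1)

# fill each nan with the average of the previous two columns
for idx, col in enumerate(res.columns[1:], 1):
    res[col] = res[col].fillna((res.iloc[:, max(idx - 2, 0)] + res.iloc[:, idx - 1]) / 2)
```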
Here is the data
import numpy as np
import pandas as pd
data = {
    'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
    'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
    'tag': [7, 6, 5, 4, 3, 2, 1],
}
df = pd.DataFrame(data)
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 NaN 0.000 5
3 NaN -0.001 4
4 NaN 0.050 3
5 NaN -0.100 2
6 NaN 0.003 1
I want to create the next cases' value as (next value) = (previous value) * (1 + current percent_change). Specifically, I want it done in rows that have a tag value less than 6 (and I must use a mask, i.e., df.loc, for this row selection). This should give me:
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 100.0 0.000 5
3 99.9 -0.001 4
4 104.9 0.050 3
5 94.4 -0.100 2
6 94.7 0.003 1
I tried this but it doesn't work:
df_index = np.where(df['tag'] == 6)
index = df_index[0][0]
df.loc[(df.tag<6), 'cases'] = (df.percent_change.shift(0).fillna(1) + 1).cumprod() * df.at[index, 'cases']
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 104.030000 0.000 5
3 103.925970 -0.001 4
4 109.122268 0.050 3
5 98.210042 -0.100 2
6 98.504672 0.003 1
I would do:
s = df.cases.isna()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
Output:
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 100.000000 0.000 5
3 99.900000 -0.001 4
4 104.895000 0.050 3
5 94.405500 -0.100 2
6 94.688716 0.003 1
Update: If you really insist on masking on the Tag==6:
s = df.tag.eq(6).shift()
s = s.where(s).ffill()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
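Putting the question's data and the answer together, a runnable end-to-end version looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
    'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
    'tag': [7, 6, 5, 4, 3, 2, 1],
})

s = df.cases.isna()                            # rows whose value must be derived
percents = df.percent_change.where(s, 0) + 1   # factor 1.0 where cases is known
df['cases'] = df.cases.ffill() * percents.cumprod()
```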
Suppose you had a DataFrame with a number of columns / Series, say five for example. If the fifth column (named 'Updated Col') had values in addition to NaNs, what would be the best way to insert values into 'Updated Col' from the other columns in place of the NaNs, based on a preferred column order?
e.g. my dataframe looks something like this;
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       Nan
12/03/2017 0:40  0.1                 Nan
12/03/2017 0:50  0.6            0.5  Nan
12/03/2017 1:00  0.4       0.3       Nan
12/03/2017 1:10  0.3            0.2  Nan
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
..and say, for example, I wanted the values from column 3 as a priority, followed by 2, then 1; I would expect the DataFrame to look like this;
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       0.7
12/03/2017 0:40  0.1                 0.1
12/03/2017 0:50  0.6            0.5  0.5
12/03/2017 1:00  0.4       0.3       0.3
12/03/2017 1:10  0.3            0.2  0.2
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
..values would be taken from the lower-priority columns only if the higher-priority columns were empty / NaN.
What would be the best way to do this?
I've tried numerous np.where attempts but can't work out what the best way would be.
Many thanks in advance.
You can use forward filling along columns (ffill(axis=1)) and then select the column:
updated_col = 'Updated Col'
# define columns to check; maybe [1,2,3,4] if the column names are integers
cols = ['1','2','3','4'] + [updated_col]
print (df[cols].ffill(axis=1))
1 2 3 4 Updated Col
0 0.4 0.4 0.4 0.4 0.9
1 0.4 0.4 0.4 0.4 0.1
2 0.4 0.4 0.4 0.4 0.6
3 0.9 0.9 0.7 0.7 0.7
4 0.1 0.1 0.1 0.1 0.1
5 0.6 0.6 0.6 0.5 0.5
6 0.4 0.4 0.3 0.3 0.3
7 0.3 0.3 0.3 0.2 0.2
8 0.9 0.9 0.9 0.9 0.8
9 0.9 0.9 0.9 0.9 0.8
10 0.0 0.0 0.0 0.0 0.9
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
EDIT:
Thank you shivsn for the comment.
If the DataFrame contains 'Nan' strings (which are not real NaN missing values) or empty strings, it is necessary to replace them first:
updated_col = 'Updated Col'
cols = ['1','2','3','4'] + ['Updated Col']
d = {'Nan':np.nan, '': np.nan}
df = df.replace(d)
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
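Note that ffill(axis=1) implicitly prefers the right-most populated column. If you need an explicit priority order instead (e.g. column 3 before column 4, as the question states), one hypothetical variant is to arrange the columns by priority and take the first non-NaN via bfill; the small frame and priority list below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '1': [0.9, 0.1, 0.6],
    '2': [np.nan, np.nan, np.nan],
    '3': [0.7, np.nan, np.nan],
    '4': [np.nan, np.nan, 0.5],
    'Updated Col': [np.nan, np.nan, np.nan],
})

# columns in the order they should be consulted; the first non-NaN wins
priority = ['Updated Col', '3', '2', '1', '4']
df['Updated Col'] = df[priority].bfill(axis=1).iloc[:, 0]
```

With this priority the last row takes 0.6 from column 1 rather than 0.5 from column 4, which is where it differs from the ffill(axis=1) result.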