Given a dataframe as follows:
id area total_price unit
0 1 185.00 14840 m
1 2 215.00 15050 m
2 3 233.23 46799 d
3 4 122.00 17000 d
4 5 540.00 70000 d
5 6 415.00 78000 d
6 7 170.00 12270 m
7 8 410.00 30750 m
8 9 196.00 13787 m
9 10 55.00 3100 m
I would like to create a new column unit_price with a numeric value based on the following conditions:
a. if unit is m, then unit_price is calculated as total_price/area/30;
b. if unit is d, then unit_price is calculated as total_price/area.
This code works:
m = (df['unit'] == 'm')
df['unit_price'] = np.where(m, df['total_price']/df['area']/30, df['total_price']/df['area'])
I have also tried the code below, but it raises an error: ValueError: Wrong number of items passed 256, placement implies 1
def unit_price(x):
    if x['unit'] == 'm':
        return x['total_price']/x['area']/30
    if x['unit'] == 'd':
        return x['total_price']/x['area']

df['unit_price'] = df.apply(unit_price, axis=1)
Does anyone know why I get this error, and how to fix it? Thanks.
With np.select you can do:
c1, c2 = df['unit']=='m', df['unit']=='d'
df['unit_price'] = np.select((c1, c2),
                             (df['total_price']/df['area']/30, df['total_price']/df['area']),
                             np.nan)
However, in this case it's better to use a mapping:
units = {'m':30, 'd':1, 'y':365}
df['unit_price'] = df['total_price']/df['area'] / df['unit'].map(units)
Output:
id area total_price unit unit_price
0 1 185.00 14840 m 2.673874
1 2 215.00 15050 m 2.333333
2 3 233.23 46799 d 200.656005
3 4 122.00 17000 d 139.344262
4 5 540.00 70000 d 129.629630
5 6 415.00 78000 d 187.951807
6 7 170.00 12270 m 2.405882
7 8 410.00 30750 m 2.500000
8 9 196.00 13787 m 2.344728
9 10 55.00 3100 m 1.878788
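For reference, the row-wise apply from the question also works once the function is guaranteed to return a scalar for every row; a minimal, self-contained sketch (assuming the same columns as above, with NaN as a fallback for unknown units):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'area': [185.0, 215.0, 233.23],
    'total_price': [14840, 15050, 46799],
    'unit': ['m', 'm', 'd'],
})

def unit_price(row):
    # Monthly units are divided by 30; daily units are used as-is.
    if row['unit'] == 'm':
        return row['total_price'] / row['area'] / 30
    if row['unit'] == 'd':
        return row['total_price'] / row['area']
    return np.nan  # fallback so every row yields a scalar

df['unit_price'] = df.apply(unit_price, axis=1)
```

Note that apply with axis=1 is much slower than the vectorized np.where/np.select versions on large frames.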
I have a dataframe like below in which I need to replace the 0s with the mean of the rows where the parent_key matches the self_key.
Input DataFrame: df = pd.DataFrame({'self_key':['a','b','c','d','e','e','e','f','f','f'], 'parent_key':[np.nan,'a','b','b','c','c','c','d','d','d'], 'value':[0,0,0,0,4,6,14,12,8,22], 'level':[1,2,3,3,4,4,4,4,4,4]})
Row 3 has a self_key of 'd', so I need to replace its 0 in the 'value' column with the mean of rows 7, 8 and 9, which gives the correct value of 14. Since the lower levels feed into the higher levels, I need to work from the lowest level to the highest to fill out the dataframe. But when I run the code below, it doesn't work and I get the error "ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional". How can I fill in the 0s with the means, from the lowest level to the highest?
df['value']=np.where((df['value']==0) & (df['level']==3), df['value'].groupby(df.where(df['parent_key']==df['self_key'])).transform('mean'), df['value'])
Input
self_key parent_key value level
0 a NaN 0 1
1 b a 0 2
2 c b 0 3
3 d b 0 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
My approach is to repeat the above code three times, changing the level from 3 to 2 to 1, but it's not working even for level 3.
Expected Output:
self_key parent_key value level
0 a NaN 11 1
1 b a 11 2
2 c b 8 3
3 d b 14 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
If I understand your problem correctly, you are trying to compute the mean in a bottom-up fashion by filtering the dataframe on certain keys. If so, then the following should solve it:
for l in range(df["level"].max() - 1, 0, -1):
    df_sub = df[(df["level"] == l) & (df["value"] == 0)]
    self_keys = df_sub["self_key"].tolist()
    for k in self_keys:
        df.loc[df_sub[df_sub["self_key"] == k].index, "value"] = df[df["parent_key"] == k]["value"].mean()
[Out]:
self_key parent_key value level
0 a NaN 11 1
1 b a 11 2
2 c b 8 3
3 d b 14 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
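The loop can be checked end-to-end by rebuilding the question's frame (a self-contained repro of the answer above, not new logic):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'self_key': ['a','b','c','d','e','e','e','f','f','f'],
                   'parent_key': [np.nan,'a','b','b','c','c','c','d','d','d'],
                   'value': [0,0,0,0,4,6,14,12,8,22],
                   'level': [1,2,3,3,4,4,4,4,4,4]})

# Walk the levels bottom-up; each 0 is replaced by the mean of the
# rows whose parent_key equals its self_key (i.e. its children).
for l in range(df['level'].max() - 1, 0, -1):
    df_sub = df[(df['level'] == l) & (df['value'] == 0)]
    for k in df_sub['self_key'].tolist():
        df.loc[df_sub[df_sub['self_key'] == k].index, 'value'] = \
            df[df['parent_key'] == k]['value'].mean()
```

Because each level is finished before the next one is computed, the level-3 means (8 and 14) are already in place when level 2 is filled, which is what makes the bottom-up order work.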
Suppose the simplified dataframe below. (The actual df is much, much bigger.) How does one assign values to a new column f such that f is a function of another column (e.g. e)? I'm pretty sure one needs to use apply or map, but I've never done this with a dataframe that has MultiIndex columns.
df = pd.DataFrame([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
df.columns = pd.MultiIndex.from_tuples((("a", "d"), ("a", "e"), ("b", "d"), ("b","e")))
df
a b
d e d e
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
Desired output:
a b
d e f d e f
0 1 2 1 3 4 1
1 5 6 1 7 8 -1
2 9 10 -1 11 12 -1
3 13 14 -1 15 16 -1
I would like to be able to apply the following lines and assign the result to a new column f. Two problems: first, the last line containing the apply doesn't work, but hopefully my intent is clear. Second, I'm unsure how to assign values to a new column of a dataframe with a MultiIndex column structure. I would like to be able to use functional programming methods.
lt = df.loc(axis=1)[:,'e'] < 8
gt = df.loc(axis=1)[:,'e'] >= 8
conditions = [lt, gt]
choices = [1, -1]
df.loc(axis=1)[:,'f'] = df.loc(axis=1)[:,'e'].apply(np.select(conditions, choices))
nms = [(i, 'f') for i, j in df.columns if j == 'e']
df[nms] = (df.iloc[:, [j == 'e' for i, j in df.columns]] < 8) * 2 - 1
df = df.sort_index(axis=1)
df
a b
d e f d e f
0 1 2 1 3 4 1
1 5 6 1 7 8 -1
2 9 10 -1 11 12 -1
3 13 14 -1 15 16 -1
EDIT:
for a custom ordering:
d = {i:j for j, i in enumerate(df.columns.levels[0])}
df1 = df.loc[:, sorted(df.columns, key = lambda x: d[x[0]])]
If the whole data is symmetric in this way (every top-level group has the same sub-columns), you could do:
df.stack(0).assign(f = lambda x: 2*(x.e < 8) - 1).stack().unstack([1,2])
Out[]:
a b
d e f d e f
0 1 2 1 3 4 1
1 5 6 1 7 8 -1
2 9 10 -1 11 12 -1
3 13 14 -1 15 16 -1
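An alternative sketch that avoids the symmetry requirement: loop over the top-level column groups and assign each (x, 'f') column directly, with np.where standing in for np.select since there are only two outcomes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
df.columns = pd.MultiIndex.from_tuples((('a', 'd'), ('a', 'e'), ('b', 'd'), ('b', 'e')))

# One 'f' column per top-level group: 1 where e < 8, else -1.
for top in df.columns.levels[0]:
    df[(top, 'f')] = np.where(df[(top, 'e')] < 8, 1, -1)

# New columns are appended at the end, so re-sort to interleave them.
df = df.sort_index(axis=1)
```

Assigning with a full tuple key like df[(top, 'f')] is the direct way to create a single column under a MultiIndex.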
Here is the raw data:
Date Name Score
25/2/2021 A 10
25/2/2021 B 8
25/2/2021 C 8
25/2/2021 D 4
25/2/2021 E 1
24/2/2021 A 0
24/2/2021 B 20
24/2/2021 C 7
24/2/2021 D 10
24/2/2021 E 4
I would like to assign consecutive ranks (dense ranking: tied scores share a rank and no ranks are skipped) to the students within each date, with the highest score getting rank 1, as follows:
Date Name Score Rank
25/2/2021 A 10 1
25/2/2021 B 8 2
25/2/2021 C 8 2
25/2/2021 D 4 3
25/2/2021 E 1 4
24/2/2021 A 0 5
24/2/2021 B 20 1
24/2/2021 C 7 3
24/2/2021 D 10 2
24/2/2021 E 4 4
I've tried a customised rank function, but it's hard to produce this result. How could I do that? Thanks in advance!
You can try the formula below with Excel 365. It will also work on unsorted data.
=XMATCH(C2,SORT(FILTER($C$2:$C$11,$A$2:$A$11=A2),1,-1))
In D2 use:
=COUNTIFS(A$2:A$11,A2,C$2:C$11,">"&C2)+1
EDIT: Based on your comment, try:
Formula in D2:
=SUM(--(UNIQUE(FILTER(C$2:C$11,A$2:A$11=A2))>C2))+1
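If the data lives in pandas rather than Excel, the same per-date dense ranking can be sketched with groupby + rank (column names assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['25/2/2021'] * 5 + ['24/2/2021'] * 5,
                   'Name': list('ABCDE') * 2,
                   'Score': [10, 8, 8, 4, 1, 0, 20, 7, 10, 4]})

# method='dense' gives tied scores the same rank without skipping
# the next rank; ascending=False ranks the highest score as 1.
df['Rank'] = (df.groupby('Date')['Score']
                .rank(method='dense', ascending=False)
                .astype(int))
```

groupby keeps the original row order, so the Rank column lines up with the input rows as in the expected output.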
I have this data
ID Value1 Value2 Type Type2
1 3 1 A X
2 2 2 A X
3 5 3 B Y
4 2 4 B Z
5 6 8 C Z
6 7 9 C Z
7 8 0 C L
8 3 2 D M
9 4 3 D M
10 6 5 D M
11 8 7 D M
Right now I am able to generate this output using this code:
pandabook.groupby(['Type','Type2'],as_index=False)['Value1', 'Value2'].agg({'Value1': 'sum','Value2': 'sum'})
ID Value1 Value2 Type Type2
1 5 3 A X
2 5 3 B Y
3 2 4 B Z
4 13 17 C Z
5 8 0 C L
6 21 17 D M
I want to show the aggregated count as well, as shown in this example.
How can I achieve this output?
Add a new value to the dictionary with the size function, and remove as_index=False to prevent:
ValueError: cannot insert Type, already exists
Lastly, rename the column and reset_index:
df = pandabook.groupby(['Type','Type2']).agg({'Value1': 'sum','Value2': 'sum', 'Type':'size'})
df = df.rename(columns={'Type':'Count'}).reset_index()
print (df)
Type Type2 Value1 Value2 Count
0 A X 5 3 2
1 B Y 5 3 1
2 B Z 2 4 1
3 C L 8 0 1
4 C Z 13 17 2
5 D M 21 17 4
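With pandas 0.25+, named aggregation expresses the same thing without the separate rename step; a sketch on the question's data:

```python
import pandas as pd

pandabook = pd.DataFrame({'Value1': [3, 2, 5, 2, 6, 7, 8, 3, 4, 6, 8],
                          'Value2': [1, 2, 3, 4, 8, 9, 0, 2, 3, 5, 7],
                          'Type':  list('AABBCCCDDDD'),
                          'Type2': list('XXYZZZLMMMM')})

# Each output column is named directly: name=(source_column, aggfunc).
df = (pandabook.groupby(['Type', 'Type2'])
               .agg(Value1=('Value1', 'sum'),
                    Value2=('Value2', 'sum'),
                    Count=('Type', 'size'))
               .reset_index())
```

The Count column here is the group size, matching the rename-based version above.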
I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can give me the top 5 values of a column, together with the name of the column and the associated index of those values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8, b, A], [8, c, A], [6, i, A], [5, h, A], [4, g, A]].
Consider the dataframe df:
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and the 10 - 5 (also 5) largest values.
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
    i[v.argpartition(k, 0)[-k:]],
    np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
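A shorter route to the exact [[value, index, column]] format from the question is nlargest per column (ties keep the first occurrence, which matches the desired output):

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)),
                  list('abcdefghij'), list('ABCD'))

# Top 5 per column as [value, row label, column label] triples.
out = [[val, idx, col]
       for col in df.columns
       for idx, val in df[col].nlargest(5).items()]
```

Unlike argpartition, nlargest also returns the values in descending order, so no extra sort is needed.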
For a single column, you can also sort and slice (note that sort_values on a whole DataFrame needs a by= column, and [0:5] takes the top 5):
print(your_dataframe['your_column'].sort_values(ascending=False)[0:5])