I have the following data frame
item_id group price
0 1 A 10
1 3 A 30
2 4 A 40
3 6 A 60
4 2 B 20
5 5 B 50
I am looking to add a quantile column based on the price for each group like below:
  item_id group price quantile
0       1     A    10     0.25
1       3     A    30     0.50
2       4     A    40     0.75
3       6     A    60     1.00
4       2     B    20     0.50
5       5     B    50     1.00
I could loop over the entire data frame and perform the computation for each group. However, I am wondering whether there is a more elegant way to do this. Thanks!
You need df.rank() with pct=True:
pct : bool, default False
Whether or not to display the returned rankings in percentile form.
df['quantile'] = df.groupby('group')['price'].rank(pct=True)
print(df)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
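For reference, here is a self-contained version of the above, rebuilding the question's frame (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 3, 4, 6, 2, 5],
    "group": ["A", "A", "A", "A", "B", "B"],
    "price": [10, 30, 40, 60, 20, 50],
})

# Within each group, rank the prices and express each rank
# as a fraction of the group size (pct=True).
df["quantile"] = df.groupby("group")["price"].rank(pct=True)
print(df)
```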
Although the df.rank method above is probably the way to go for this problem, here's another solution using pd.qcut with GroupBy:
df['quantile'] = (
    df.groupby('group')['price']
      .apply(lambda x: pd.qcut(x, q=len(x), labels=False)
                         .add(1)
                         .div(len(x)))
)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
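A caveat with the qcut route: in recent pandas versions, groupby.apply prepends the group key to the result's index by default, which breaks the assignment back to df. A minimal runnable sketch, assuming pandas ≥ 1.5, that passes group_keys=False so the result keeps the original index:

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 3, 4, 6, 2, 5],
    "group": ["A", "A", "A", "A", "B", "B"],
    "price": [10, 30, 40, 60, 20, 50],
})

# qcut with q=len(x) puts each value in its own quantile bin (codes 0..len-1);
# adding 1 and dividing by the group size reproduces rank(pct=True) for unique prices.
# group_keys=False keeps the original index so the result aligns on assignment.
df["quantile"] = (
    df.groupby("group", group_keys=False)["price"]
      .apply(lambda x: pd.qcut(x, q=len(x), labels=False)
                         .add(1)
                         .div(len(x)))
)
```

Note that this only matches rank(pct=True) when prices are unique within a group; qcut raises on duplicate bin edges unless you pass duplicates='drop'.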
Input dataframes:
row_df
item uni_row count row
Apple 46 1 [45,46]
tab_df
row table_no table_bit item space
45 1 0 Lemon 0.25
45 1 1 Melon 0.2
45 1 2 Banana 0.1
45 1 3 Banana 0.1
45 2 1 Apple 0.25
45 2 2 Apple 0.25
46 1 1 Apple 0.25
46 1 2 Orange 0.3
Here, according to row_df, the row-46 Apple in tab_df needs to be moved to row 45 (the remaining value of row_df.row after comparing with row_df.uni_row), replacing as many tab_df items as the corresponding row_df.count value. The items to replace in tab_df should be selected in reverse order, starting from the point where the row-45 Apples begin.
Expected output:
row table_no table_bit item space
45 1 0 Lemon 0.25
45 1 1 Apple 0.25 #rearranged
45 1 2 Banana 0.1
45 1 3 Banana 0.1
45 2 1 Apple 0.25
45 2 2 Apple 0.25
46 1 1 Melon 0.2 #rearranged
46 1 2 Orange 0.3
Thanks in Advance!
I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0, 1, 0, 1, 0, 1, etc. I want to rename them back to a normal index starting from 0, 1, 2, 3, 4, ..., so I did the following:
dat = dat.reset_index(drop=True)
but the labels were not changed. How do I get the columns renamed in this case? Thanks in advance.
dat.columns = range(dat.shape[1])
There are quite a few ways:
dat = dat.rename(columns = lambda x: dat.columns.get_loc(x))
Or
dat = dat.rename(columns = dict(zip(dat.columns, range(dat.shape[1]))))
Or
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
(In older pandas you may need inplace=False here; the inplace argument of set_axis was removed in pandas 2.0.)
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2
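The simplest of these options can be checked end to end. A small sketch with a frame that has duplicated column labels, as in the question (the data values here are abbreviated for illustration):

```python
import pandas as pd

# A frame whose column labels repeat: 0, 1, 0, 1, 0, 1
dat = pd.DataFrame(
    [["A", 23, 0.10, 122, 56, 9],
     ["B", 24, 0.45, 564, 36, 3]],
    columns=[0, 1, 0, 1, 0, 1],
)

# reset_index only touches the row index; the column labels
# must be reassigned directly.
dat.columns = range(dat.shape[1])
```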
I have a dataset as shown:
student_id course_marks
1234 10
9887 30
9881 20
5634 40
5634 50
1234 60
1234 70
I want to sort them by course_marks, then rank them within each student_id.
expected:
student_id course_marks rank
1234 10 1
1234 60 2
1234 70 3
5634 40 1
5634 50 2
9887 20 1
9887 30 2
df['rank'] = df.groupby('student_id')['course_marks'].rank()
student_id course_marks rank
0 1234 10 1.0
1 9887 30 1.0
2 9881 20 1.0
3 5634 40 1.0
4 5634 50 2.0
5 1234 60 2.0
6 1234 70 3.0
or, sorted:
student_id course_marks rank
0 1234 10 1.0
5 1234 60 2.0
6 1234 70 3.0
3 5634 40 1.0
4 5634 50 2.0
2 9881 20 1.0
1 9887 30 1.0
(Note that you have 9881 and 9887 in your example data, and 9887 twice in your expected output.)
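The sorted view above can be produced with sort_values after ranking. A runnable sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": [1234, 9887, 9881, 5634, 5634, 1234, 1234],
    "course_marks": [10, 30, 20, 40, 50, 60, 70],
})

# Rank marks within each student, then sort so each student's
# rows appear together in mark order.
df["rank"] = df.groupby("student_id")["course_marks"].rank()
df = df.sort_values(["student_id", "course_marks"])
print(df)
```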
Below is my dataframe. I believe I need to use groupby or pivot, but I haven't gotten anything to work correctly.
     LOGIN MANAGER    7     8    9    10    11  UNITS  HOURS     UPH
0  joeblow  MSmith  1.0   NaN  NaN   NaN   NaN     21   1.00   47.01
1  joeblow  MSmith  NaN  0.25  NaN   NaN   NaN     18   0.25   75.83
2  joeblow  MSmith  NaN   NaN  1.0   NaN   NaN     12   1.00   87.05
3  joeblow  MSmith  NaN   NaN  NaN  0.26   NaN     13   0.26  206.90
4  joeblow  MSmith  NaN   NaN  NaN   NaN  0.43     23   0.43   53.18
My expected output would look like below, where UNITS and HOURS are summed, UPH is averaged, and the other columns are grouped:
LOGIN MANAGER 7 8 9 10 11 UNITS HOURS UPH
0 joeblow MSmith 1 0.25 1 0.26 0.43 66 2.94 93.994
First, create a dict mapping each column to its aggregation function:
d={'7':'first','8':'first','9':'first','10':'first','11':'first','UNITS':'sum','HOURS':'sum','UPH':'mean'}
Then aggregate with agg:
yourdf=df.groupby(['LOGIN','MANAGER']).agg(d)
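A self-contained sketch of this, reconstructing the frame with NaNs in the hour-bucket columns (an assumption about the original layout, which shows only one filled bucket per row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "LOGIN": ["joeblow"] * 5,
    "MANAGER": ["MSmith"] * 5,
    "7":  [1,      np.nan, np.nan, np.nan, np.nan],
    "8":  [np.nan, 0.25,   np.nan, np.nan, np.nan],
    "9":  [np.nan, np.nan, 1,      np.nan, np.nan],
    "10": [np.nan, np.nan, np.nan, 0.26,   np.nan],
    "11": [np.nan, np.nan, np.nan, np.nan, 0.43],
    "UNITS": [21, 18, 12, 13, 23],
    "HOURS": [1, 0.25, 1, 0.26, 0.43],
    "UPH": [47.01, 75.83, 87.05, 206.9, 53.18],
})

# 'first' skips NaN, so each hour-bucket column keeps its single filled value
d = {"7": "first", "8": "first", "9": "first", "10": "first", "11": "first",
     "UNITS": "sum", "HOURS": "sum", "UPH": "mean"}
out = df.groupby(["LOGIN", "MANAGER"], as_index=False).agg(d)
```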
I have an Excel file with normalized structure:
[table1]
id percentage region year
1 0.10 A 01-01-1995
2 0.61 A 01-02-1995
3 0.97 A 01-03-1995
4 0.11 B 01-01-1995
5 0.21 B 01-02-1995
6 0.99 B 01-03-1995
7 0.02 A 01-01-1996
8 0.61 A 01-02-1996
9 0.96 A 01-03-1996
10 0.05 B 01-01-1996
11 0.55 B 01-02-1996
12 0.99 B 01-03-1996
and a second table, with
[table2]
other_id region value year
1 A 99 1995
2 A 76 1996
3 B 102 1995
4 B 50 1996
and now I need to multiply the values from the first table by the values from the second table, to get the total by month and year. I tried a calculated field like
[table1] * [table2]
but I get the "cannot mix aggregate and non-aggregate" error. The calculations
avg[table1] * avg[table2]
or
attr[table1] * [table2]
are valid in Tableau, but the result is not correct. How can I accomplish this directly in Tableau?
EDIT: The expected result would be
[table1]
id percentage region year
1 0.10 * 99 A 01-01-1995
2 0.61 * 99 A 01-02-1995
3 0.97 * 99 A 01-03-1995
4 0.11 * 102 B 01-01-1995
5 0.21 * 102 B 01-02-1995
6 0.99 * 102 B 01-03-1995
7 0.02 * 76 A 01-01-1996
8 0.61 * 76 A 01-02-1996
9 0.96 * 76 A 01-03-1996
10 0.05 * 50 B 01-01-1996
11 0.55 * 50 B 01-02-1996
12 0.99 * 50 B 01-03-1996
You can try creating a formula. It would go something like this:
For each row in table1,
find the corresponding row in table2 (the one with the same year and region), and
multiply the percentage in table1 by the value in table2.
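This is a Tableau question, but the row-matching logic described above is easy to sketch outside Tableau. An illustrative pandas version using table1's 1995 region-A rows (the join_year helper column is an assumption, added to align table1's date strings with table2's integer years):

```python
import pandas as pd

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "percentage": [0.10, 0.61, 0.97],
    "region": ["A", "A", "A"],
    "year": ["01-01-1995", "01-02-1995", "01-03-1995"],
})
table2 = pd.DataFrame({"region": ["A"], "value": [99], "year": [1995]})

# Derive a join key with the same granularity on both sides:
# table1 stores full dates, table2 only the year.
table1["join_year"] = table1["year"].str[-4:].astype(int)

# Find the corresponding table2 row per table1 row, then multiply.
merged = table1.merge(table2, left_on=["region", "join_year"],
                      right_on=["region", "year"], suffixes=("", "_t2"))
merged["total"] = merged["percentage"] * merged["value"]
```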