How to keep only the rows that cumulate 80% of a column within each group of a pandas dataframe? - python-3.x

I have a dataframe like this:
import pandas as pd

df_dict = dict(
    group=[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3],
    model=['model 2','model 4','model 7','model 6','model 5',
           'model 2','model 5','model 7','model 3','model 4',
           'model 3','model 6','model 5','model 1','model 2','model 7','model 4'],
    value=[10,4.6,2,1.3,1,10,3,3,2,0.9,4,2.7,2,1,1,1,0.9],
)
df = pd.DataFrame(df_dict)
For each group, I want to keep the models that cover 80% of the total of the variable "value".
In this example, here is what the output should be:
group | model   | value
------|---------|------
1     | model 2 | 10
1     | model 4 | 4.6
1     | model 7 | 2
2     | model 2 | 10
2     | model 5 | 3
2     | model 7 | 3
3     | model 3 | 4
3     | model 6 | 2.7
3     | model 5 | 2
3     | model 1 | 1
3     | model 2 | 1

Let us try multiple groupby operations:
df = df.sort_values(['group', 'value'], ascending=[True, False])
g = df.groupby('group')['value']
# keep each row up to and including the first one whose cumulative
# share of the group's total exceeds 80%
df = df[df.index <= ((g.cumsum() / g.transform('sum')) > 0.8)
        .groupby(df['group']).transform('idxmax')]
df
Out[120]:
group model value
0 1 model 2 10.0
1 1 model 4 4.6
2 1 model 7 2.0
5 2 model 2 10.0
6 2 model 5 3.0
7 2 model 7 3.0
10 3 model 3 4.0
11 3 model 6 2.7
12 3 model 5 2.0
13 3 model 1 1.0
14 3 model 2 1.0
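A step-by-step variant of the same idea, spelling the cumulative share out explicitly (a sketch; which rows survive a tie in value at the cutoff depends on the sort order):

import pandas as pd

df = pd.DataFrame(dict(
    group=[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3],
    model=['model 2','model 4','model 7','model 6','model 5',
           'model 2','model 5','model 7','model 3','model 4',
           'model 3','model 6','model 5','model 1','model 2','model 7','model 4'],
    value=[10,4.6,2,1.3,1,10,3,3,2,0.9,4,2.7,2,1,1,1,0.9],
))

df = df.sort_values(['group', 'value'], ascending=[True, False])
# cumulative share of each group's total, largest values first
share = df.groupby('group')['value'].cumsum() / df.groupby('group')['value'].transform('sum')
# a row is kept while the share *before* it is still below 80%,
# so the row that crosses the threshold is included
prev_share = share.groupby(df['group']).shift(fill_value=0)
print(df[prev_share < 0.8])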

Related

Create a new column and calculate its values row-wise in Python

I need to create two new columns, Billing and Non-Billing, based on the Billable column. If Billable is 'Yes' then I should fill the Billing column, and if it is 'No' then the Non-Billing column. The calculation should be along the row axis.
Calculation for Billing in a row:
Billing = (row sum of skill hours) / 168 * 100
Calculation for Non-Billing in a row:
Non-Billing = (row sum of skill hours) / 168 * 100
Data
Employee Name | Java | Python | .Net | React | Billable
--------------|------|--------|------|-------|---------
Priya         | 10   |        | 5    |       | Yes
Krithi        |      | 10     | 20   |       | No
Surthi        |      | 5      |      |       | yes
Meena         |      | 20     |      | 10    | No
Manju         | 20   | 10     | 10   |       | Yes
I have tried using insert, but I cannot keep inserting columns that way. I tried append too, but it is not working:
Bill_amt = []
Non_Bill_amt = []
for i in df['Billable']:
    if i == "Yes" or i == None:
        Bill_amt = (df[Bill_amt].sum(axis=1) / 168 * 100).round(2)
        df.insert(len(df.columns), column='Billable Amount', value=Bill_amt)  # inserting the column and its name
        # CANNOT INSERT ROW AFTER IT AND CANNOT APPEND IT TOO
    else:
        Non_Bill_amt = (df[Non_Bill_amt].sum(axis=1) / 168 * 100).round(2)
        df.insert(len(df.columns), column='Non Billable Amount', value=Non_Bill_amt)  # inserting the column and its name
        # CANNOT INSERT ROW AFTER IT.
Use .sum(axis=1) and then np.where() to put the values in the respective columns. For example:
import numpy as np

x = df.loc[:, "Java":"React"].sum(axis=1) / 168 * 100
df["Bill"] = np.where(df["Billable"].str.lower() == "yes", x, "")
df["Non_Bill"] = np.where(df["Billable"].str.lower() == "no", x, "")
print(df)
Prints:
Employee_Name Java Python .Net React Billable Bill Non_Bill
0 Priya 10.0 NaN 5.0 NaN Yes 8.928571428571429
1 Krithi NaN 10.0 20.0 NaN No 17.857142857142858
2 Surthi NaN 5.0 NaN NaN yes 2.976190476190476
3 Meena NaN 20.0 NaN 10.0 No 17.857142857142858
4 Manju 20.0 10.0 10.0 NaN Yes 23.809523809523807
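If Bill and Non_Bill need to stay numeric for later aggregation, a variant that fills the other column with NaN rather than an empty string might look like this (a sketch; the frame mirrors the question's data, and the column names are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employee_Name': ['Priya', 'Krithi', 'Surthi', 'Meena', 'Manju'],
    'Java': [10, np.nan, np.nan, np.nan, 20],
    'Python': [np.nan, 10, 5, 20, 10],
    '.Net': [5, 20, np.nan, np.nan, 10],
    'React': [np.nan, np.nan, np.nan, 10, np.nan],
    'Billable': ['Yes', 'No', 'yes', 'No', 'Yes'],
})

x = df.loc[:, 'Java':'React'].sum(axis=1) / 168 * 100
is_billable = df['Billable'].str.lower() == 'yes'
# Series.where keeps the value where the condition holds and puts NaN elsewhere,
# so both columns stay float instead of object
df['Bill'] = x.where(is_billable)
df['Non_Bill'] = x.where(~is_billable)
print(df)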

I have a pandas dataframe with 3 columns and want output like this

DataFrame of 3 columns:
a b c
1 2 4
1 2 4
1 2 4
Desired output:
a b c a+b a+c b+c a+b+c
1 2 4 3 5 6 7
1 2 4 3 5 6 7
1 2 4 3 5 6 7
Create all combinations of length 2 or more from the columns and then assign the sums:
from itertools import chain, combinations
# https://stackoverflow.com/a/5898031
comb = chain(*map(lambda x: combinations(df.columns, x), range(2, len(df.columns) + 1)))
for c in comb:
    df[f'{"+".join(c)}'] = df.loc[:, list(c)].sum(axis=1)
print(df)
a b c a+b a+c b+c a+b+c
0 1 2 4 3 5 6 7
1 1 2 4 3 5 6 7
2 1 2 4 3 5 6 7
You should always post your approach when asking a question. However, here it goes. This is the easiest, but probably not the most elegant, way to solve it. For a more elegant approach, you should follow jezrael's answer.
Make your pandas dataframe here:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})
Now make your desired dataframe like this:
df["a+b"] = df["a"] + df["b"]
df["a+c"] = df["a"] + df["c"]
df["b+c"] = df["b"] + df["c"]
df["a" + "b" + "c"] = df["a"] + df["b"] + df["c"]
This gives you:
|    |   a |   b |   c |   a+b |   a+c |   b+c |   a+b+c |
|---:|----:|----:|----:|------:|------:|------:|--------:|
|  0 |   1 |   2 |   4 |     3 |     5 |     6 |       7 |
|  1 |   1 |   2 |   4 |     3 |     5 |     6 |       7 |
|  2 |   1 |   2 |   4 |     3 |     5 |     6 |       7 |
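A middle ground between the two answers is to build every combination sum into a dict first and attach all new columns in a single concat, so the frame is not grown column by column (a sketch on the same data):

from itertools import combinations
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": [2, 2, 2], "c": [4, 4, 4]})

# one Series per column combination, keyed by the new column's name
new_cols = {
    '+'.join(c): df.loc[:, list(c)].sum(axis=1)
    for r in range(2, len(df.columns) + 1)
    for c in combinations(df.columns, r)
}
df = pd.concat([df, pd.DataFrame(new_cols)], axis=1)
print(df)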

How can I get the count of sequential events pairs from a Pandas dataframe?

I have a dataframe that looks like this:
ID EVENT DATE
1 1 142
1 5 167
1 3 245
2 1 54
2 5 87
3 3 165
3 2 178
And I would like to generate something like this:
EVENT_1 EVENT_2 COUNT
1 5 2
5 3 1
3 2 1
The idea is to count how many items (ID) go from one event to the next one. I don't care about previous states; I just want to consider the next state from the current state (e.g. for ID 1, I don't want to count a transition from 1 to 3 because it first goes to event 5 and then to 3).
The date format is the number of days from a specific date (sort of like SAS format).
Is there a clean way to achieve this?
Let's try this:
(df.groupby([df['EVENT'].rename('EVENT_1'),
             df.groupby('ID')['EVENT'].shift(-1).rename('EVENT_2')])['ID']
   .count()).rename('COUNT').reset_index().astype(int)
Output:
| | EVENT_1 | EVENT_2 | COUNT |
|---:|----------:|----------:|--------:|
| 0 | 1 | 5 | 2 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 1 |
Details: Groupby on 'EVENT' and shifted 'EVENT' within each ID, then count.
You could use groupby and shift. We'll also use rename_axis and reset_index to tidy up the final output:
(pd.concat([f.groupby([f['EVENT'], f['EVENT'].shift(-1).astype('Int64')]).size()
            for _, f in df.groupby('ID')])
   .groupby(level=[0, 1]).sum()
   .rename_axis(['EVENT_1', 'EVENT_2']).reset_index(name='COUNT'))
[out]
EVENT_1 EVENT_2 COUNT
0 1 5 2
1 3 2 1
2 5 3 1
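An equivalent, perhaps more readable, spelling of the same idea, assuming rows should be ordered chronologically by DATE within each ID (a sketch):

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3],
                   'EVENT': [1, 5, 3, 1, 5, 3, 2],
                   'DATE': [142, 167, 245, 54, 87, 165, 178]})

# pair each event with the next event of the same ID, then count the pairs
out = (df.sort_values(['ID', 'DATE'])
         .assign(EVENT_2=lambda d: d.groupby('ID')['EVENT'].shift(-1))
         .dropna(subset=['EVENT_2'])
         .astype({'EVENT_2': int})
         .groupby(['EVENT', 'EVENT_2']).size()
         .reset_index(name='COUNT')
         .rename(columns={'EVENT': 'EVENT_1'}))
print(out)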

Creating a new column based on other columns' values with specific requirements in a Python dataframe

I want to create a new column in a Python dataframe, with specific requirements based on other columns. For example, my dataframe df:
A | B
-----------
5 | 0
5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
5 | 3
15 | 3
10 | 4
20 | 0
I want to create a new column C, with the requirements below:
When the value of B = 0, then C = 0.
Rows with the same value in B get the same value in C. The rows of each B-block are classified as start, middle, and end: for value 1 there is 1 start, 2 middle, and 1 end; for value 3 there is 1 start, 0 middle, and 1 end. I specify a threshold = 10. The calculation for each section, looking at B = 1, is:
Start:
C.loc[2] = min(threshold, A.loc[1]) + A.loc[2]
Middle:
C.loc[3] = A.loc[3]
C.loc[4] = A.loc[4]
End:
C.loc[5] = min(threshold, A.loc[6])
The output value of C is the sum of the above calculations.
When the value of B is unique and not 0, for example B = 4:
C.loc[10] = min(threshold, A.loc[9]) + min(threshold, A.loc[11])
I can solve the B = 0 case and the unique-B case, but I'm struggling with the start/middle/end case.
So, the final output will be:
A  | B | C
--------------
5  | 0 | 0
5  | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
5  | 3 | 25
15 | 3 | 25
10 | 4 | 20
20 | 0 | 0
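A minimal sketch of one way to implement these rules, assuming they behave exactly as stated above (consecutive equal B values form one block; block_total is a hypothetical helper):

import pandas as pd

df = pd.DataFrame({
    'A': [5, 5, 15, 10, 10, 20, 15, 10, 5, 15, 10, 20],
    'B': [0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 0],
})
threshold = 10

def block_total(first, last, a, thr):
    # total C for one run of equal, nonzero B values at positions first..last
    prev_clip = min(thr, a[first - 1]) if first > 0 else 0
    next_clip = min(thr, a[last + 1]) if last < len(a) - 1 else 0
    if first == last:                  # unique B value: clipped neighbours only
        return prev_clip + next_clip
    middle = sum(a[first + 1:last])    # rows strictly between start and end
    return prev_clip + a[first] + middle + next_clip

a = df['A'].tolist()
df['C'] = 0
# consecutive equal B values form one run
run_id = (df['B'] != df['B'].shift()).cumsum()
for _, g in df.groupby(run_id):
    if g['B'].iat[0] == 0:
        continue                       # B == 0 -> C stays 0
    # assumes the default RangeIndex, so labels equal positions
    df.loc[g.index, 'C'] = block_total(g.index[0], g.index[-1], a, threshold)
print(df)

On the sample data this reproduces C = 45, 50, 25, and 20 for the B = 1, 2, 3, and 4 blocks.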

Create "leakage-free" Variables in Python?

I have a pandas data frame with several thousand observations and I would like to create "leakage-free" variables in Python. So I am looking for a way to calculate e.g. a group-specific mean of a variable without the single observation in row i.
For example:
Group | Price | leakage-free Group Mean
------|-------|------------------------
1     | 20    | 25
1     | 40    | 15
1     | 10    | 30
2     | ...   | ...
I would like to do that for several variables, and I would like to compute the mean, median, and variance this way, so a computationally fast method would be good. If a group has only one row, I would like to enter 0 in the leakage-free variable.
As I am rather a beginner in Python, a piece of code would be very helpful. Thank you!
With a one-liner:
import pandas as pd

df = pd.DataFrame({'Group': [1,1,1,2], 'Price': [20,40,10,30]})
df['lfgm'] = df.groupby('Group').transform(lambda x: (x.sum()-x)/(len(x)-1)).fillna(0)
print(df)
Output:
Group Price lfgm
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
3 2 30 0.0
Update:
For median and variance (not one-liners unfortunately):
df = pd.DataFrame({'Group': [1,1,1,1,2], 'Price':[20,100,10,70,30]})
def f(x):
    for i in x.index:
        z = x.loc[x.index != i, 'Price']
        x.at[i, 'mean'] = z.mean()
        x.at[i, 'median'] = z.median()
        x.at[i, 'var'] = z.var()
    return x[['mean', 'median', 'var']]

df = df.join(df.groupby('Group').apply(f))
print(df)
Output:
Group Price mean median var
0 1 20 60.000000 70.0 2100.000000
1 1 100 33.333333 20.0 1033.333333
2 1 10 63.333333 70.0 1633.333333
3 1 70 43.333333 20.0 2433.333333
4 2 30 NaN NaN NaN
Use:
grp = df.groupby('Group')
n = grp['Price'].transform('count')
mean = grp['Price'].transform('mean')
df['new_col'] = (mean*n - df['Price'])/(n-1)
print(df)
Group Price new_col
0 1 20 25.0
1 1 40 15.0
2 1 10 30.0
Note: this solution will be faster than using apply; you can check with %%timeit on both versions.
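If apply is also too slow for the variance, the leave-one-out variance has a closed form built from per-group sums and sums of squares (the median has no such shortcut). A sketch, assuming pandas' default ddof=1 and 0 for single-row groups as requested:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2], 'Price': [20, 100, 10, 70, 30]})
g = df.groupby('Group')['Price']
n = g.transform('count')
s = g.transform('sum')
ss = df['Price'].pow(2).groupby(df['Group']).transform('sum')

loo_mean = (s - df['Price']) / (n - 1)
# sample variance of the remaining n-1 values:
# (their sum of squares - (n-1) * their mean^2) / (n-2)
loo_var = (ss - df['Price'].pow(2) - (n - 1) * loo_mean.pow(2)) / (n - 2)
df['loo_mean'] = loo_mean.fillna(0)
df['loo_var'] = loo_var.fillna(0)
print(df)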
