Extract specific value in pandas dataframe based on column condition - python-3.x

I am faced with a small problem whose solution is certainly very simple, but I cannot figure out how to do it.
Let's say I have the following pandas dataframe df:
import pandas as pd
X = [0.78, 0.82, 1.03, 1.06, 1.21]
Y = [0.0, 0.2521, 0.4905, 0.5003, 1.0]
df = pd.DataFrame({'X':X, 'Y':Y})
df
X Y
0 0.78 0.0000
1 0.82 0.2521
2 1.03 0.4905
3 1.06 0.5003
4 1.21 1.0000
I want to recover the value of X for which Y first exceeds 0.5; in other words, I am looking for a piece of code that creates a new variable val such that:
print (val)
1.06
I can only come up with complicated approaches, along these lines:
df['Z'] = df.apply(lambda row: 0 if row.Y <= 0.5 else 1, axis = 1)
df
X Y Z
0 0.78 0.0000 0
1 0.82 0.2521 0
2 1.03 0.4905 0
3 1.06 0.5003 1
4 1.21 1.0000 1
But this only shows me where the X value I want is (the first appearance of 1 in Z); it doesn't extract that value.
How could I do that in a simple way?

We can check with idxmax; note that this requires at least one Y value greater than 0.5, because idxmax returns the position of the first True in the boolean mask (and falls back to the first row if there is none).
df.loc[df.Y.gt(0.5).idxmax(), 'Z'] = 1
df['Z'] = df['Z'].fillna(0)
df
X Y Z
0 0.78 0.0000 0.0
1 0.82 0.2521 0.0
2 1.03 0.4905 0.0
3 1.06 0.5003 1.0
4 1.21 1.0000 0.0
If you would like a separate dataframe instead:
df1=df.loc[df.Y.gt(0.5)]
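To get the scalar val from the question directly, without a helper column, the same idxmax idea can be used for a lookup (a minimal sketch):
val = df.loc[df.Y.gt(0.5).idxmax(), 'X']   # X at the first row where Y > 0.5
print (val)
1.06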

Related

Pandas - group by function and sum columns to extract rows where sum of other columns is 0

I have a data frame with over three million rows. I am trying to group values by the Bar_Code column and extract only those rows where the sum of all rows in SOH, Cost and Sold_Date is zero.
My dataframe is as under:
Location Bar_Code SOH Cost Sold_Date
1 00000003589823 0 0.00 NULL
2 00000003589823 0 0.00 NULL
3 00000003589823 0 0.00 NULL
1 0000000151818 -102 0.00 NULL
2 0000000151818 0 8.00 NULL
3 0000000151818 0 0.00 2020-10-06T16:35:25.000
1 0000131604108 0 0.00 NULL
2 0000131604108 0 0.00 NULL
3 0000131604108 0 0.00 NULL
1 0000141073505 -53 3.00 2020-10-06T16:35:25.000
2 0000141073505 0 0.00 NULL
3 0000141073505 -20 20.00 2020-09-25T10:11:30.000
I have tried the below code:
df.groupby(['Bar_Code','SOH','Cost','Sold_Date']).sum()
but I am getting the below output:
Bar_Code SOH Cost Sold_Date
0000000151818 -102.0 0.0000 2021-12-13T10:01:59.000
0.0 8.0000 2020-10-06T16:35:25.000
0000131604108 0.0 0.0000 NULL
0000141073505 -53.0 0.0000 2021-11-28T16:57:59.000
3.0000 2021-12-05T11:23:02.000
0.0 0.0000 2020-04-14T08:02:45.000
0000161604109 -8.0 4.1000 2020-09-25T10:11:30.000
00000003589823 0 0.00 NULL
I need to check if it is possible to get the desired output below, i.e. only the specific rows where the sum of SOH, Cost & Sold_Date is 0 or NULL; it's safe for the code to ignore the first column (Location):
Bar_Code SOH Cost Sold_Date
00000003589823 0 0.00 NULL
0000131604108 0.0 0.0000 NULL
The idea is to filter out whole groups where any of SOH, Cost and Sold_Date is not 0 or NaN: first filter the rows that do not match, get their Bar_Code values, and last invert the mask to filter all remaining groups with isin:
g = df.loc[df[['SOH','Cost','Sold_Date']].fillna(0).ne(0).any(axis=1), 'Bar_Code']
df1 = df[~df['Bar_Code'].isin(g)].drop_duplicates('Bar_Code').drop('Location', axis=1)
print (df1)
Bar_Code SOH Cost Sold_Date
0 00000003589823 0 0.0 NaN
6 0000131604108 0 0.0 NaN
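A sketch of an equivalent GroupBy.filter approach (typically slower on millions of rows, but arguably more readable), keeping a group only when every value in the three columns is 0 or NaN:
all_zero = lambda g: g[['SOH','Cost','Sold_Date']].fillna(0).eq(0).all().all()
df1 = (df.groupby('Bar_Code').filter(all_zero)
         .drop_duplicates('Bar_Code')
         .drop('Location', axis=1))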

How to apply masking while creating next row value which is based on previous row's value and another column in Python Pandas?

Here is the data
import numpy as np
import pandas as pd
data = {
    'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
    'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
    'tag': [7, 6, 5, 4, 3, 2, 1],
}
df = pd.DataFrame(data)
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 NaN 0.000 5
3 NaN -0.001 4
4 NaN 0.050 3
5 NaN -0.100 2
6 NaN 0.003 1
I want to create the next cases value as (next value) = (previous value) * (1 + current percent_change). Specifically, I want this done only in rows that have a tag value less than 6, and I must use a mask (i.e., df.loc) for this row selection. This should give me:
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 100.0 0.000 5
3 99.9 -0.001 4
4 104.9 0.050 3
5 94.4 -0.100 2
6 94.7 0.003 1
I tried this but it doesn't work:
df_index = np.where(df['tag'] == 6)
index = df_index[0][0]
df.loc[(df.tag<6), 'cases'] = (df.percent_change.shift(0).fillna(1) + 1).cumprod() * df.at[index, 'cases']
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 104.030000 0.000 5
3 103.925970 -0.001 4
4 109.122268 0.050 3
5 98.210042 -0.100 2
6 98.504672 0.003 1
I would do:
s = df.cases.isna()                                  # rows where cases is missing
percents = df.percent_change.where(s, 0) + 1         # apply the change only on missing rows
df['cases'] = df.cases.ffill() * percents.cumprod()  # forward-fill the anchor, then compound
Output:
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 100.000000 0.000 5
3 99.900000 -0.001 4
4 104.895000 0.050 3
5 94.405500 -0.100 2
6 94.688716 0.003 1
Update: if you really insist on masking based on tag == 6:
s = df.tag.eq(6).shift()
s = s.where(s).ffill()
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
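For completeness, a sketch that uses the explicit df.loc mask the question insists on (assuming the tag == 6 row holds the last known cases value):
mask = df.tag < 6
base = df.loc[df.tag.eq(6), 'cases'].iloc[0]   # 100.0, the anchor value
df.loc[mask, 'cases'] = base * (df.loc[mask, 'percent_change'] + 1).cumprod()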

Get first and last Index of a Pandas DataFrame subset

I have some data in a pandas DataFrame that looks like this.
df =
A B
time
0.1 10.0 1
0.15 12.1 2
0.19 4.0 2
0.21 5.0 2
0.22 6.0 2
0.25 7.0 1
0.3 8.1 1
0.4 9.45 2
0.5 3.0 1
Based on the following condition, I am looking for a generic solution to find the first and last index of every consecutive subset.
cond = df.B == 2
So far I have tried the groupby concept, but without the expected result.
df_1 = cond.reset_index()
df_2 = df_1.groupby(df_1['B']).agg(['first','last']).reset_index()
This is the output I got.
B time
first last
0 False 0.1 0.5
1 True 0.15 0.4
This is the output I like to get.
B time
first last
0 False 0.1 0.1
1 True 0.15 0.22
2 False 0.25 0.3
3 True 0.4 0.4
4 False 0.5 0.5
How can I accomplish this by a more or less generic approach?
Create a helper Series with Series.shift, Series.ne and a cumulative sum via Series.cumsum to label groups of consecutive values, then aggregate with a dictionary:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg({'B':'first','time': ['first','last']}).reset_index(drop=True)
print (df_2)
B time
first first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
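For the sample data, the helper series g gives each consecutive run of equal B values its own label:
print (g.tolist())
[1, 2, 2, 2, 2, 3, 3, 4, 5]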
If you want to avoid the MultiIndex, use named aggregation:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg(B=('B','first'),
                           first=('time','first'),
                           last=('time','last')).reset_index(drop=True)
print (df_2)
B first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50

Make a pandas data-frame with percentage between two referential values in a set

I have a pandas data-frame (df) and a list (df_values).
I want to make another data-frame which contains, for each data point in df, the percentage weights it distributes over the values in the list df_values.
data-frame df is:
A
0 100
1 300
2 150
List df_values (set of referential values) is:
df_values = [[0,200,400,600]]
Here the number 100 in df is 0.50 towards 0 and 0.50 towards 200 in df_values. Similarly, 300 in df is 0.50 towards 200 and 0.50 towards 400, and so on.
Desired data-frame:
0 200 400 600
0 0.50 0.50 0.0 0
1 0.00 0.50 0.5 0
2 0.25 0.75 0.0 0
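No answer is quoted here, but one possible sketch uses np.searchsorted (assuming every value of A falls strictly between two reference points, as in the sample, and flattening df_values to a 1-D array):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 300, 150]})
refs = np.array([0, 200, 400, 600])        # df_values flattened to 1-D

idx = np.searchsorted(refs, df['A'].to_numpy())      # index of the right edge
left, right = refs[idx - 1], refs[idx]
frac = (df['A'].to_numpy() - left) / (right - left)  # fraction toward the right edge

w = np.zeros((len(df), len(refs)))
rows = np.arange(len(df))
w[rows, idx] = frac            # weight on the right reference value
w[rows, idx - 1] = 1 - frac    # remainder on the left reference value
print (pd.DataFrame(w, columns=refs))
      0   200  400  600
0  0.50  0.50  0.0  0.0
1  0.00  0.50  0.5  0.0
2  0.25  0.75  0.0  0.0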

Pandas: How to sum (dynamic) columns that are between two specific columns?

I'm working with dynamic .csvs, so I never know what the column names will be. Examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
Since I would like to create a new column with the SUM of all columns between META and %, I need to get the names of those columns, so I can create something like this:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above will only work for example 1). So I need to: 1) identify all the columns, and 2) sum them.
The solution has to work for all three examples above (1, 2 and 3).
Note that the only certainty is that the columns are between META and %, but even those are not fixed.
Select all columns except the first and last with DataFrame.iloc and then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select columns between META and %:
#META and % are not numeric, so they are ignored by the sum
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1)
#META is not numeric; the slice stops before %
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1)
#most general: only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
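As a quick check against example 3, the FILL and ETC columns are correctly excluded, because the last variant sums only the columns strictly between META and %:
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
print (df)
  FILL  ETC META     G      F      %  Total
0    T  2.0    A  0.00   6.70 -0.00%   6.70
1    F  4.2    B  2.90   0.00  3.55%   2.90
2    T  5.0    C  0.00  34.53  6.14%  34.53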
