I am faced with a small problem whose solution is certainly very simple, but I cannot figure out how to do it.
Let's say I have the following pandas dataframe df:
import pandas as pd
X = [0.78, 0.82, 1.03, 1.06, 1.21]
Y = [0.0, 0.2521, 0.4905, 0.5003, 1.0]
df = pd.DataFrame({'X':X, 'Y':Y})
df
X Y
0 0.78 0.0000
1 0.82 0.2521
2 1.03 0.4905
3 1.06 0.5003
4 1.21 1.0000
I want to recover the first value of X for which Y exceeds 0.5; in other words, I am looking for a piece of code that creates a new variable val such that:
print (val)
1.06
I can only come up with complicated approaches, in the style of:
df['Z'] = df.apply(lambda row: 0 if row.Y <= 0.5 else 1, axis = 1)
df
X Y Z
0 0.78 0.0000 0
1 0.82 0.2521 0
2 1.03 0.4905 0
3 1.06 0.5003 1
4 1.21 1.0000 1
This shows me where the X value I want is (the first appearance of 1 in Z), but it doesn't extract that value.
How could I do that in a simple way?
We can locate the row with idxmax; note that this assumes at least one Y value exceeds 0.5 (otherwise idxmax simply returns the first index):
df.loc[df.Y.gt(0.5).idxmax(),'Z']=1
df['Z'] = df['Z'].fillna(0)
df
X Y Z
0 0.78 0.0000 0.0
1 0.82 0.2521 0.0
2 1.03 0.4905 0.0
3 1.06 0.5003 1.0
4 1.21 1.0000 0.0
If you would like a separate DataFrame of the matching rows:
df1=df.loc[df.Y.gt(0.5)]
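To extract the single value val that the question asks for, the same idxmax trick can be used directly (a minimal sketch; idxmax returns the label of the first True in the boolean mask):
# X of the first row where Y exceeds 0.5
val = df.loc[df.Y.gt(0.5).idxmax(), 'X']
print (val)
1.06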
I have a data frame with over three million rows. I am trying to group values in the Bar_Code column and extract only those rows where the SOH, Cost and Sold_Date values of every row in the group are zero or NULL.
My dataframe is as follows:
Location Bar_Code SOH Cost Sold_Date
1 00000003589823 0 0.00 NULL
2 00000003589823 0 0.00 NULL
3 00000003589823 0 0.00 NULL
1 0000000151818 -102 0.00 NULL
2 0000000151818 0 8.00 NULL
3 0000000151818 0 0.00 2020-10-06T16:35:25.000
1 0000131604108 0 0.00 NULL
2 0000131604108 0 0.00 NULL
3 0000131604108 0 0.00 NULL
1 0000141073505 -53 3.00 2020-10-06T16:35:25.000
2 0000141073505 0 0.00 NULL
3 0000141073505 -20 20.00 2020-09-25T10:11:30.000
I have tried the below code:
df.groupby(['Bar_Code','SOH','Cost','Sold_Date']).sum()
but I am getting the below output:
Bar_Code SOH Cost Sold_Date
0000000151818 -102.0 0.0000 2021-12-13T10:01:59.000
0.0 8.0000 2020-10-06T16:35:25.000
0000131604108 0.0 0.0000 NULL
0000141073505 -53.0 0.0000 2021-11-28T16:57:59.000
3.0000 2021-12-05T11:23:02.000
0.0 0.0000 2020-04-14T08:02:45.000
0000161604109 -8.0 4.1000 2020-09-25T10:11:30.000
00000003589823 0 0.00 NULL
I need to check if it is possible to get the desired output below, keeping only the rows where SOH, Cost & Sold_Date are 0 or NULL across the whole group; it is safe for the code to ignore the first column (Location):
Bar_Code SOH Cost Sold_Date
00000003589823 0 0.00 NULL
0000131604108 0.0 0.0000 NULL
The idea is to keep only the groups where SOH, Cost and Sold_Date are all 0 or NaN: first filter the rows that do not match, take their Bar_Code values, and finally invert the mask in isin to filter out those groups:
# barcodes that have at least one non-zero / non-null value in the three columns
g = df.loc[df[['SOH','Cost','Sold_Date']].fillna(0).ne(0).any(axis=1), 'Bar_Code']
# keep only barcodes never flagged above, then drop duplicates and Location
df1 = df[~df['Bar_Code'].isin(g)].drop_duplicates('Bar_Code').drop('Location', axis=1)
print (df1)
Bar_Code SOH Cost Sold_Date
0 00000003589823 0 0.0 NaN
6 0000131604108 0 0.0 NaN
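An equivalent sketch using groupby.transform instead of the inverted isin mask (same assumptions about the three columns; transform('all') broadcasts the group-wise test back to every row):
# True for rows where all three columns are 0 or NaN
ok = df[['SOH','Cost','Sold_Date']].fillna(0).eq(0).all(axis=1)
# keep only barcodes whose every row passes the test
df1 = df[ok.groupby(df['Bar_Code']).transform('all')]
df1 = df1.drop_duplicates('Bar_Code').drop('Location', axis=1)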
Here is the data
import numpy as np
import pandas as pd
data = {
'cases': [120, 100, np.nan, np.nan, np.nan, np.nan, np.nan],
'percent_change': [0.03, 0.01, 0.00, -0.001, 0.05, -0.1, 0.003],
'tag': [7, 6, 5, 4, 3, 2, 1],
}
df = pd.DataFrame(data)
df
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 NaN 0.000 5
3 NaN -0.001 4
4 NaN 0.050 3
5 NaN -0.100 2
6 NaN 0.003 1
I want to fill in the next cases value as (next value) = (previous value) * (1 + current percent_change). Specifically, I want this done only in rows that have a tag value less than 6, and I must use a mask (i.e., df.loc) for this row selection. This should give me:
cases percent_change tag
0 120.0 0.030 7
1 100.0 0.010 6
2 100.0 0.000 5
3 99.9 -0.001 4
4 104.9 0.050 3
5 94.4 -0.100 2
6 94.7 0.003 1
I tried this but it doesn't work:
df_index = np.where(df['tag'] == 6)
index = df_index[0][0]
df.loc[(df.tag<6), 'cases'] = (df.percent_change.shift(0).fillna(1) + 1).cumprod() * df.at[index, 'cases']
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 104.030000 0.000 5
3 103.925970 -0.001 4
4 109.122268 0.050 3
5 98.210042 -0.100 2
6 98.504672 0.003 1
I would do:
# rows whose cases value is missing
s = df.cases.isna()
# growth factor: 1 where cases are known, 1 + percent_change where missing
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
Output:
cases percent_change tag
0 120.000000 0.030 7
1 100.000000 0.010 6
2 100.000000 0.000 5
3 99.900000 -0.001 4
4 104.895000 0.050 3
5 94.405500 -0.100 2
6 94.688716 0.003 1
Update: if you really insist on masking based on tag == 6:
s = df.tag.eq(6).shift()   # True on the row right after tag == 6
s = s.where(s).ffill()     # keep only True, then propagate it forward
percents = df.percent_change.where(s,0)+1
df['cases'] = df.cases.ffill()*percents.cumprod()
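Since the question requires selecting the rows through a df.loc mask, here is a sketch of the same computation written that way, starting from the original df (it anchors the cumulative growth at the last known cases value before the mask starts):
mask = df.tag < 6
# growth factor is 1 outside the mask, so the cumprod only kicks in at tag 5
growth = (df.percent_change.where(mask, 0) + 1).cumprod()
df.loc[mask, 'cases'] = (df.cases.ffill() * growth)[mask]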
I have some data in a pandas DataFrame that looks like this:
df =
A B
time
0.1 10.0 1
0.15 12.1 2
0.19 4.0 2
0.21 5.0 2
0.22 6.0 2
0.25 7.0 1
0.3 8.1 1
0.4 9.45 2
0.5 3.0 1
Based on the following condition, I am looking for a generic solution that finds the first and last index of every consecutive subset.
cond = df.B == 2
So far I have tried using groupby, but without the expected result.
df_1 = cond.reset_index()
df_2 = df_1.groupby(df_1['B']).agg(['first','last']).reset_index()
This is the output I got.
B time
first last
0 False 0.1 0.5
1 True 0.15 0.4
This is the output I would like to get:
B time
first last
0 False 0.1 0.1
1 True 0.15 0.22
2 False 0.25 0.3
3 True 0.4 0.4
4 False 0.5 0.5
How can I accomplish this with a more or less generic approach?
Create a helper Series with Series.shift, Series.ne and Series.cumsum to label groups of consecutive values, then aggregate with a dictionary:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
# consecutive equal values of B share the same group id
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg({'B':'first','time': ['first','last']}).reset_index(drop=True)
print (df_2)
B time
first first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
If you want to avoid a MultiIndex, use named aggregation:
df_1 = df.reset_index()
df_1.B = df_1.B == 2
g = df_1.B.ne(df_1.B.shift()).cumsum()
df_2 = df_1.groupby(g).agg(B=('B','first'),
first=('time','first'),
last=('time','last')).reset_index(drop=True)
print (df_2)
B first last
0 False 0.10 0.10
1 True 0.15 0.22
2 False 0.25 0.30
3 True 0.40 0.40
4 False 0.50 0.50
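If only the runs where the condition actually holds are needed, the result can simply be filtered afterwards:
print (df_2[df_2.B])
      B  first  last
1  True   0.15  0.22
3  True   0.40  0.40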
I have a pandas DataFrame (df) and a list (df_values).
I want to make another DataFrame that contains, for each data point in df, the percentage with which it belongs to each of the values in df_values.
DataFrame df is:
A
0 100
1 300
2 150
The list df_values (the set of reference values) is:
df_values = [[0,200,400,600]]
Desired DataFrame (the number 100 in df is 0.50 towards 0 and 0.50 towards 200 in df_values; similarly, 300 is 0.50 towards 200 and 0.50 towards 400, and so on):
0 200 400 600
0 0.50 0.50 0.0 0
1 0.00 0.50 0.5 0
2 0.25 0.75 0.0 0
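One way to sketch this is with np.searchsorted: find the interval of df_values that brackets each data point and split the weight linearly between the two endpoints (this assumes every point lies strictly inside the range of df_values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100, 300, 150]})
df_values = [[0, 200, 400, 600]]
vals = np.asarray(df_values[0])          # the inner list of reference values

hi = np.searchsorted(vals, df['A'])      # index of the right endpoint
lo = hi - 1
w_hi = (df['A'].to_numpy() - vals[lo]) / (vals[hi] - vals[lo])

out = np.zeros((len(df), len(vals)))
rows = np.arange(len(df))
out[rows, lo] = 1 - w_hi                 # weight towards the left endpoint
out[rows, hi] = w_hi                     # weight towards the right endpoint
print (pd.DataFrame(out, columns=vals))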
I'm working with dynamic .csvs, so I never know what the column names will be. Examples:
1)
ETC META A B C D E %
0 2.0 A 0.0 24.564 0.000 0.0 0.0 -0.00%
1 4.2 B 0.0 2.150 0.000 0.0 0.0 3.55%
2 5.0 C 0.0 0.000 15.226 0.0 0.0 6.14%
2)
META A C D E %
0 A 0.00 0.00 2.90 0.0 -0.00%
1 B 3.00 0.00 0.00 0.0 3.55%
2 C 0.00 21.56 0.00 0.0 6.14%
3)
FILL ETC META G F %
0 T 2.0 A 0.00 6.70 -0.00%
1 F 4.2 B 2.90 0.00 3.55%
2 T 5.0 C 0.00 34.53 6.14%
As I would like to create a new column with the SUM of all columns between META and %, I need to get the names of all those columns, so I can create something like this:
a = df['Total'] = df['A'] + df['B'] + df['C'] + df['D'] + df['E']
As the column names change, the code above will only work for example 1). So I need to: 1) identify all the relevant columns; and 2) sum them.
The solution has to work for the 3 examples above (1, 2 and 3).
Note that the only certainty is that the relevant columns are between META and %, but even those two are not in fixed positions.
Select all columns except the first and last with DataFrame.iloc and then sum:
df['Total'] = df.iloc[:, 1:-1].sum(axis=1)
Or remove the META and % columns with DataFrame.drop before summing:
df['Total'] = df.drop(['META','%'], axis=1).sum(axis=1)
print (df)
META A B C D E % Total
0 A 0.0 24.564 0.000 0.0 0.0 -0.00% 24.564
1 B 0.0 2.150 0.000 0.0 0.0 3.55% 2.150
2 C 0.0 0.000 15.226 0.0 0.0 6.14% 15.226
EDIT: You can select the columns between META and % by position:
#META and % are not numeric, so numeric_only=True skips them
df['Total'] = df.loc[:, 'META':'%'].sum(axis=1, numeric_only=True)
#META is included in the slice but skipped, % is excluded by the slice
df['Total'] = df.iloc[:, df.columns.get_loc('META'):df.columns.get_loc('%')].sum(axis=1, numeric_only=True)
#most general: take only the columns strictly between META and %
df['Total'] = df.iloc[:, df.columns.get_loc('META')+1:df.columns.get_loc('%')].sum(axis=1)
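A quick check of the most general variant against layout 3), where extra columns precede META (reconstructing that frame from the question):
import pandas as pd

df3 = pd.DataFrame({'FILL': ['T','F','T'], 'ETC': [2.0,4.2,5.0],
                    'META': ['A','B','C'], 'G': [0.00,2.90,0.00],
                    'F': [6.70,0.00,34.53], '%': ['-0.00%','3.55%','6.14%']})
# sum only the columns strictly between META and %
df3['Total'] = df3.iloc[:, df3.columns.get_loc('META')+1:df3.columns.get_loc('%')].sum(axis=1)
print (df3[['META','G','F','Total']])
  META     G      F  Total
0    A  0.00   6.70   6.70
1    B  2.90   0.00   2.90
2    C  0.00  34.53  34.53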