I am trying to efficiently add another condition to the code below, one that takes into account the value of another column in this df.
Below I have a filter for rows where the value column is >= 0, but I want to add a condition that the column called day equals 'Friday'. Thanks.
df[df['value'] >= 0]
Use this:
df[(df['value'] >= 0) & (df['day'] == 'friday')]
Chain another condition with & for bitwise AND or | for bitwise OR in boolean indexing; here the parentheses around each condition are necessary:
df1 = df[(df['value'] >= 0) & (df['day'] == 'friday')]
Or use the Series.ge and Series.eq functions for the comparison:
df1 = df[df['value'].ge(0) & df['day'].eq('friday')]
Or use DataFrame.query:
df1 = df.query("(value >= 0) & (day == 'friday')")
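As a quick sanity check, a minimal sketch with hypothetical sample data, run through the first variant:

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'value': [-1, 2, 3], 'day': ['friday', 'friday', 'monday']})

print(df[(df['value'] >= 0) & (df['day'] == 'friday')])
#    value     day
# 1      2  friday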
I have a Spark dataframe with date columns. The last entry in "date1col" is today, and the last entry in "date2col" is 10 days ago. I need to filter the dates for the last two weeks, up to yesterday.
I used df.filter(col('date1col').between(current_date()-1, current_date()-15)), and it worked fine. However, when I used the same syntax for the second date column, i.e. df.filter(col('date2col').between(current_date()-1, current_date()-15)), it returned an empty sdf. When I used df.filter(col('startdate') > current_date()-15), it worked. But my dataframe is dynamic, meaning it updates daily at 9am. How can I force the between function to return the same sdf as when I use the > logic?
Switch the order:
first - earlier date
second - later date
df.filter(col('date2col').between(current_date()-15, current_date()-1))
They are not the same, which can be shown using sameSemantics:
df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter(col('date2col').between(current_date()-1, current_date()-15))
df1.sameSemantics(df2)
# False
If you still need it, here is .between translated to >=/<= logic:
df.filter(col('date2col').between(current_date()-15, current_date()-1))
is equivalent to
df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))
df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))
print(df1.sameSemantics(df2)) # `True` when the logical query plans are equal, thus same results
# True
"<>" translated to .between
df.filter(col('date2col') > current_date()-15)
= df.filter(col('date2col').between(current_date()-14, '9999'))
sameSemantics result would apparently be False, but for any practical case, results would be same.
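A minimal sketch, assuming a local SparkSession, showing that reversed bounds return nothing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, date_sub

spark = SparkSession.builder.getOrCreate()

# One hypothetical row whose date2col is 5 days ago
df = spark.range(1).select(date_sub(current_date(), 5).alias('date2col'))

print(df.filter(col('date2col').between(current_date() - 15, current_date() - 1)).count())  # 1
print(df.filter(col('date2col').between(current_date() - 1, current_date() - 15)).count())  # 0, reversed bounds match nothing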
I want to solve this kind of problem in Python:
tran_df['bad_debt']=train_df.frame_apply(lambda x: 1 if (x['second_mortgage']!=0 and x['home_equity']!=0) else x['debt'])
I want to be able to create a new column and iterate over the rows of specific columns.
In Excel it's really easy; I did:
=IF(AND(col_name1<>0,col_name2<>0),1,col_name5)
Any help would be much appreciated.
To iterate over rows only for certain columns:
for rowIndex, row in df[['col1','col2']].iterrows():  # iterate over rows
    print(rowIndex, row['col1'], row['col2'])
To create a new column:
df['new'] = 0 # Initialise as 0
As a rule, iterating over rows in pandas is wrong. Use the np.where function from NumPy to select the right values for the rows:
import numpy as np

tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
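A quick check with hypothetical sample data:

import numpy as np
import pandas as pd

# Hypothetical sample data to illustrate the selection
tran_df = pd.DataFrame({
    'second_mortgage': [0, 150, 200],
    'home_equity': [50, 0, 300],
    'debt': [10, 20, 30],
})
tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
print(tran_df['bad_debt'].tolist())  # [10, 20, 1]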
First create a new column with an initial value, then use .loc to locate the rows that match a certain condition and assign the new value:
tran_df['bad_debt'] = tran_df['debt']
tran_df.loc[(tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0), 'bad_debt'] = 1
Or
tran_df['bad_debt'] = 1
tran_df.loc[(tran_df['second_mortgage'] == 0) | (tran_df['home_equity'] == 0), 'bad_debt'] = tran_df['debt']
Remember to put parentheses around each condition when combining them with the bitwise operators (& and |).
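Using the same hypothetical sample data as in the sketch above, the .loc route gives the same result:

import pandas as pd

# Hypothetical sample data
tran_df = pd.DataFrame({
    'second_mortgage': [0, 150, 200],
    'home_equity': [50, 0, 300],
    'debt': [10, 20, 30],
})
tran_df['bad_debt'] = tran_df['debt']
tran_df.loc[(tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0), 'bad_debt'] = 1
print(tran_df['bad_debt'].tolist())  # [10, 20, 1]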
I want to double the value in the distance column in the rows that have the value 'one-way' in the hike_type column. I am iterating through the df and finding all of the proper rows, but I am having trouble getting the multiplication to stick.
This finds the proper rows but does not put the change into effect:
for index, row in df.iterrows():
    if row['hike_type'] == 'one-way':
        row['distance'] * 2
This hasn't worked either:
for index, row in df.iterrows():
    if row['hike_type'] == 'one-way':
        row['distance'] = row['distance'] * 2
For some reason, when I do the following, it prints what I want:
for index, row in df.iterrows():
    if row['hike_type'] == 'one-way':
        print(row['distance'] * 2)
IIUC, what you want can be achieved with just one line, as below:
df['distance'] = np.where(df['hike_type'] == 'one-way', df['distance'].astype(int)*2, df['distance'])
OR you can use df.loc as below
df.update(df.loc[df['hike_type'] == 'one-way','distance'].astype(int)*2)
OR
df.update(df[df['hike_type'] == 'one-way']['distance'].astype(int)*2)
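For completeness, a plain .loc assignment (without update) is another common idiom here; a minimal sketch with hypothetical data:

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'hike_type': ['one-way', 'loop'], 'distance': [3.0, 5.0]})

# Multiply in place only where the condition holds
df.loc[df['hike_type'] == 'one-way', 'distance'] *= 2
print(df['distance'].tolist())  # [6.0, 5.0]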
I have a dataframe as represented here
A
0.001216
0.000453
0.00506
0.004556
0.005266
I want to create a new column B according to the formula presented in the code below.
column_key = 'B'
factor = 'A'
df[column_key] = np.nan
df[column_key][0] = (df[factor][0] + 1) * 100
for i in range(1, len(df)):
    df[column_key][i] = (df[factor][i] + 1) * df[column_key][i-1]
I have been trying to fill the current cell using the previous cell of one column and the adjacent cell of another column.
This is what I have tried, but I don't think it is going to be efficient.
Can anyone help me with the most efficient approach to this problem?
Using Series.cumprod(), it can be done in the following way:
df['B'] = df['A'] + 1
df.loc[df.index[0], 'B'] *= 100
df['B'] = df['B'].cumprod()
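A quick check with the sample values from the question; the first row becomes (A + 1) * 100, and every later row multiplies the running product by its own (A + 1), matching the loop:

import pandas as pd

df = pd.DataFrame({'A': [0.001216, 0.000453, 0.00506, 0.004556, 0.005266]})
df['B'] = df['A'] + 1
df.loc[df.index[0], 'B'] *= 100  # seed the first row with (A + 1) * 100
df['B'] = df['B'].cumprod()      # each row: previous B * (A + 1)
print(df['B'].iloc[0])  # 100.1216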
Basically I need to calculate a column that sums previous columns only if:
The number in column Q > 4,
Column S = 'N' (it's a 'Y' or 'N' flag in this column, and I only want to sum if there is an 'N'),
P != 0 (there is a number in column P).
This is the incorrect statement which I came up with:
=IF((Q2>4) AND (S2='N') AND (P2 != 0), V2)
(Column V contains the calculation I want to make, given that these conditions are met.)
Kinda lost here guys, would really appreciate any help.
All the best.
EDIT: fixed single quotes >> double quotes and != >> <>.
=IF(AND(Q2>4, S2="N", P2<>0), V2, "")
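For reference, a hypothetical pandas translation of the same condition, in the spirit of the earlier answers (the column names Q, S, P, V are assumptions):

import numpy as np
import pandas as pd

# Hypothetical frame with the columns the formula references
df = pd.DataFrame({'Q': [5, 3], 'S': ['N', 'N'], 'P': [1, 1], 'V': [100, 200]})

# V where all three conditions hold; NaN stands in for Excel's empty string
df['result'] = np.where((df['Q'] > 4) & (df['S'] == 'N') & (df['P'] != 0), df['V'], np.nan)
print(df['result'].tolist())  # [100.0, nan]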