Date filtering in PySpark using between - apache-spark

I have a Spark DataFrame with date columns. The last entry in "date1col" is today, while the last entry in "date2col" is 10 days ago. I need to filter the dates for the last two weeks, up to yesterday.
I used df.filter(col('date1col').between(current_date()-1, current_date()-15)) and it worked fine. However, when I used the same syntax on the second date column, i.e. df.filter(col('date2col').between(current_date()-1, current_date()-15)), it returned an empty DataFrame. When I used df.filter(col('date2col') > current_date()-15), it worked. But my DataFrame is dynamic, meaning it updates daily at 9 am. How can I make the between function return the same DataFrame as the > logic?

Switch the order of the bounds: between expects the earlier date first and the later date second:
df.filter(col('date2col').between(current_date()-15, current_date()-1))
The two orderings are not the same, which can be shown with sameSemantics:
df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter(col('date2col').between(current_date()-1, current_date()-15))
df1.sameSemantics(df2)
# False
If you still need .between translated into explicit comparison logic:
df.filter(col('date2col').between(current_date()-15, current_date()-1))
is equivalent to
df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))
df1 = df.filter(col('date2col').between(current_date()-15, current_date()-1))
df2 = df.filter((col('date2col') >= current_date()-15) & (col('date2col') <= current_date()-1))
print(df1.sameSemantics(df2)) # `True` when the logical query plans are equal, thus same results
# True
And comparison logic translated to .between:
df.filter(col('date2col') > current_date()-15)
= df.filter(col('date2col').between(current_date()-14, '9999'))
Here the sameSemantics result would be False, but for any practical case the results would be the same.
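For readers without a Spark session at hand, the same inclusive-bounds behavior, and the order sensitivity, can be sketched with pandas' Series.between, which follows the same lower-bound-first convention (the column of dates below is a hypothetical stand-in for date2col):

```python
import pandas as pd

# Ten daily dates ending yesterday (a stand-in for date2col).
today = pd.Timestamp.today().normalize()
dates = pd.Series(pd.date_range(end=today - pd.Timedelta(days=1), periods=10))

# Correct order: lower bound first, upper bound second (both inclusive).
ok = dates[dates.between(today - pd.Timedelta(days=15), today - pd.Timedelta(days=1))]

# Reversed bounds: lower > upper, so no value can satisfy the range.
empty = dates[dates.between(today - pd.Timedelta(days=1), today - pd.Timedelta(days=15))]

print(len(ok))     # 10
print(len(empty))  # 0
```

With the bounds reversed the range is empty by construction, which matches the empty DataFrame the question describes.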

Related

Incorrect result for nested condition in MEDIAN-IF excel

I have a following excel spreadsheet which consist of following fields:
Col A: Timestamp
Col B: Numerical result
Col C: Time duration taken for calculation of result
Now, I'm trying to find the median value of col C (Duration) for various month and year combinations.
e.g. For the month of march in 2019, what's the median value of duration?
I could have used MEDIANIFS, but sadly it doesn't exist. I'm trying the formula below, but it's not giving the correct result (G1 is a drop-down containing numerical years, i.e. 2019, 2020, and so on):
MEDIAN(IF(YEAR(A3:A100) = G1, IF(MONTH(A3:A100) = 3, C3:C100)))
I also tried ANDing the conditions, but that didn't work either:
MEDIAN(IF((YEAR(A3:A100) = G1) * (MONTH(A3:A100) = 3), C3:C100))
If I put a single condition inside MEDIAN(IF()), it works fine. But whenever I nest or combine conditions, it doesn't give the correct result.
Any help/pointers will be highly appreciated.
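One common cause of wrong results here: in pre-dynamic-array Excel, MEDIAN(IF(...)) formulas must be confirmed as array formulas with Ctrl+Shift+Enter, otherwise only the first cell of each range is evaluated. As a cross-check outside Excel, the intended conditional median can be sketched in Python/pandas (the column names below are hypothetical stand-ins for Col A and Col C):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-03-01", "2019-03-15", "2019-03-20",  # March 2019
        "2019-04-02", "2020-03-05",                # other periods
    ]),
    "duration": [10.0, 30.0, 20.0, 99.0, 7.0],
})

# Median duration for March 2019 only -- the MEDIAN(IF(...)) intent.
mask = (df["timestamp"].dt.year == 2019) & (df["timestamp"].dt.month == 3)
median_march_2019 = df.loc[mask, "duration"].median()
print(median_march_2019)  # 20.0
```

The boolean mask plays the same role as the nested IF conditions: only rows matching both the year and the month contribute to the median.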

Power BI first IF-Statement then the DAX-Formula

I am new to Power BI and have the following issue:
I tried to build a formula for a frequency counter. I got some examples from the web and was able to build this working formula. The basic idea is to categorize an item with the values: daily, weekly, or first time.
I tried to add an IF-Statement to the formula, that is checking a calculated column "Time frame", which shows the duration of an item in minutes.
Basically, it should run this formula only if the column "Time frame" is greater than or equal to 1.
Currently the formula assigns the value "first time" to items with a Time frame of 0, but they should be ignored or blanked.
Calculated column =
VAR freqcount =
    COUNTAX(
        FILTER(
            ALL('Count'),
            AND([Date] >= DATEADD('Count'[Date], -6, DAY) && [Date] <= EARLIER([Date]),
                [ID] = EARLIER('Count'[ID]))),
        [ID])
RETURN
    IF(freqcount >= 4, "Daily", IF(freqcount >= 2, "Weekly", IF(freqcount >= 1, "First time", "Inactive")))
I would be thankful if someone could support me with this issue.
Edit: an ID can occur multiple times in my table but with different dates. But only once with the same date. For example:
ID 1, Date 01.01.2020
ID 1, Date 02.01.2020
ID 1, Date 03.01.2020
It is easier to use CALCULATE:
Calculated column =
VAR rDate = yourTable[Date]
VAR rID = yourTable[ID]
VAR freqCount =
    CALCULATE(
        COUNTROWS(yourTable),
        FILTER(yourTable,
            rDate >= DATEADD(yourTable[Date], -6, DAY)
                && rID = yourTable[ID]
                && yourTable[Time frame] > 0))
RETURN
    IF(freqCount >= 4, "Daily", IF(freqCount >= 2, "Weekly", IF(freqCount >= 1, "First time", "Inactive")))
You can see how the Time frame condition is simply added to the filter expression. I also removed the use of EARLIER by capturing the current row's values in variables, which makes the formula more readable.
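The same rolling-window categorization can be sketched in Python/pandas to make the logic concrete (table and column names here are hypothetical; TimeFrame stands in for the "Time frame" column):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01"]),
    "TimeFrame": [5, 5, 0, 5],
})

def categorize(row):
    # Items with no Time frame are ignored, per the question.
    if row["TimeFrame"] < 1:
        return "Inactive"
    # Count rows for the same ID within the preceding 6 days (inclusive),
    # skipping rows whose TimeFrame is 0 -- mirroring the DAX filter.
    window = df[
        (df["ID"] == row["ID"])
        & (df["Date"] >= row["Date"] - pd.Timedelta(days=6))
        & (df["Date"] <= row["Date"])
        & (df["TimeFrame"] > 0)
    ]
    n = len(window)
    if n >= 4: return "Daily"
    if n >= 2: return "Weekly"
    if n >= 1: return "First time"
    return "Inactive"

df["Category"] = df.apply(categorize, axis=1)
print(df["Category"].tolist())  # ['First time', 'Weekly', 'Inactive', 'First time']
```

Each row is classified by how many qualifying occurrences of its ID fall in its trailing 7-day window, just as the DAX calculated column does per row.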

How to assign calculations done in a for loop by row to blank columns by index?

I have a dataframe that looks like this (with more rows):
   Buy    Sell   acct_stock   usd_account
0  True   False
1  False  True
I assigned two numerical values to two variables like so:
account = 1000
stock_amount = 0
For each row in the df that has the combination of True AND False, I want to update columns acct_stock and usd_account by using the numerical variables to do calculations:
I have come up with the following FOR loop:
for index, row in df.iterrows():
    if (row['Buy'] == True & row['Sell'] == False):
        row['acct_stock'] = (account * 0.02) / row[4]
        row['usd_account'] = account - (stock_amount * 0.02)
When I run this, the columns acct_stock and usd_account do not get updated at all, even for rows where the conditions are met.
What am I doing wrong that the calculations are not being assigned by index to the columns/rows in question, and that the variables 'account' and 'stock_amount' are not being updated continuously?
The reason it didn't work: iterrows yields a copy of each row, so assigning to row['acct_stock'] modifies that copy, not the underlying DataFrame. Write back with df.loc[index, 'acct_stock'] = ... (or better, use a vectorized assignment on a boolean mask).
However, keep in mind that you need to create those columns first before you can set a value in them.
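A minimal sketch of the vectorized alternative, using df.loc with a boolean mask instead of iterrows (the price column is a hypothetical stand-in for row[4] in the original loop):

```python
import pandas as pd

account = 1000
stock_amount = 0

df = pd.DataFrame({
    "Buy": [True, False],
    "Sell": [False, True],
    "price": [50.0, 60.0],
})

# Create the target columns first, then assign through .loc so the
# underlying DataFrame (not a per-row copy) is modified.
df["acct_stock"] = 0.0
df["usd_account"] = 0.0

mask = df["Buy"] & ~df["Sell"]
df.loc[mask, "acct_stock"] = (account * 0.02) / df.loc[mask, "price"]
df.loc[mask, "usd_account"] = account - (stock_amount * 0.02)

print(df.loc[0, "acct_stock"])   # 0.4
print(df.loc[0, "usd_account"])  # 1000.0
```

The mask replaces the if condition, and the two .loc assignments replace the per-row writes, updating only the rows where Buy is True and Sell is False.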

Conditional Statement based on value of a different column

I am trying to efficiently add another condition to the code below that takes into account the value of a different column in this df.
Below I filter rows where the value column is >= 0, but I also want to require that the column called day equals 'Friday'. Thanks.
df[df['value']] >= 0
Use this:
df[(df['value']>=0) & (df['day']=='friday') ]
Chain another condition with & for bitwise AND (or | for bitwise OR) in boolean indexing; the parentheses are necessary here:
df1 = df[(df['value'] >= 0) & (df['day'] == 'friday')]
Or use the Series.ge and Series.eq functions for the comparisons:
df1 = df[df['value'].ge(0) & df['day'].eq('friday')]
Or use DataFrame.query:
df1 = df.query("(value >= 0) & (day == 'friday')")
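A tiny runnable check that the mask-based forms select the same rows (using .ge to match the >= 0 condition, on a small hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({
    "value": [5, -1, 3, 0],
    "day": ["friday", "friday", "monday", "friday"],
})

a = df[(df["value"] >= 0) & (df["day"] == "friday")]
b = df[df["value"].ge(0) & df["day"].eq("friday")]
c = df.query("(value >= 0) & (day == 'friday')")

print(a.index.tolist())  # [0, 3]
```

All three build the same boolean mask, so they return identical subsets; query is just string syntax over the same comparison.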

Creating new column using for loop returns NaN value in Python/Pandas

I am using Python/Pandas to manipulate a data frame. I have a column 'month' (values from 1.0 to 12.0). Now I want to create another column 'quarter'. When I write -
for x in data['month']:
    print((x-1)//3 + 1)
I get the proper output, that is, the quarter number (1, 2, 3, 4, etc.).
But I am not able to assign the output to the new column.
for x in data['month']:
    data['quarter'] = ((x-1)//3 + 1)
This creates the quarter column with missing or 'NaN' values.
My question is: why am I getting missing values while creating the column?
Note: I am using Python 3.6 and Anaconda 1.7.0. 'data' is the data frame I am using. Initially I had only the date, which I converted to month and year using
data['month'] = pd.DatetimeIndex(data['first_approval']).month
Interestingly, this month column shows dtype: float64. I have read somewhere that "dtype('float64') is equivalent to None", but I didn't understand that statement clearly. Any suggestion or help will be highly appreciated.
The easiest way to get the quarter from the date would be
data['quarter'] = pd.DatetimeIndex(data['date']).quarter
the same way you obtained the month information.
The line below sets the entire column to the last value produced by the loop, because assigning a scalar to a column broadcasts it to every row. (The NaNs could come from values that are not in a proper date format):
data['quarter'] = ((x-1)//3 + 1)
Or try the below:
df['quarter'] = df['month'].apply(lambda x: ((x-1)//3 + 1))
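Since the formula is pure arithmetic, it can also be written as a single vectorized expression over the whole column, which avoids both the loop and the apply:

```python
import pandas as pd

data = pd.DataFrame({"month": [1.0, 4.0, 7.0, 12.0]})

# One expression over the whole column: each month maps to its quarter.
data["quarter"] = (data["month"] - 1) // 3 + 1

print(data["quarter"].tolist())  # [1.0, 2.0, 3.0, 4.0]
```

The key difference from the loop is that the right-hand side is a Series of per-row results rather than a scalar, so every row receives its own quarter value.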
