Calculate average difference between dates using PySpark - apache-spark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map), not SQL.
The dates for each ID need to be sorted before calculating the difference, as I need the difference between each pair of consecutive dates.
ID  Date
1   2020-09-03
1   2020-09-03
2   2020-09-02
1   2020-09-04
2   2020-09-06
2   2020-09-16
The desired outcome for this example would be:
ID  average difference
1   0.5
2   7
thanks for helping!

You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it takes the value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql import Window

# define the window: partition by ID, order each partition by Date
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second)
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
 .groupby('ID')  # aggregate over ID
 .agg(F.avg(F.col('diff')).alias('average difference'))
)
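Since the question asks for an RDD-style approach, here is a minimal sketch using map and groupByKey; it assumes Date is stored as a 'YYYY-MM-DD' string (drop the parsing step if it is already a date column) and that every ID has at least two dates:

import datetime

def avg_consecutive_diff(dates):
    # sort this ID's dates, then average the gaps between consecutive dates
    ds = sorted(dates)
    diffs = [(b - a).days for a, b in zip(ds, ds[1:])]
    return sum(diffs) / len(diffs)  # assumes at least two dates per ID

result = (df.rdd
          .map(lambda row: (row['ID'], datetime.date.fromisoformat(row['Date'])))
          .groupByKey()
          .mapValues(avg_consecutive_diff))
result.collect()  # [(1, 0.5), (2, 7.0)]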

Related

How to split a Series into time intervals? (python)

I have this dataframe:
I need to split the rows of the ''Time.s'' column into intervals, calculate the average of each interval, and finally the deviation of each average.
I can't manage to split the rows with Volt.mv > 0.95 into one group per second. I tried GroupBy, but it creates problems with the second table:
I used this code, calculating the average directly, but I certainly did something wrong:
ecg.groupby("Time.s").apply(lambda x: x["Volt.mv"].mean())
Can anyone help me?
Before doing the groupby, you need to map Time.s to an interval. Otherwise each group will have only a single row (most of the time).
Here is how to group into intervals of 0.1 seconds and compute the mean and standard deviation for each interval:
interval_length = 0.1
df_aggregated = (
    df
    .assign(interval=df["Time.s"].div(interval_length).astype("int").mul(interval_length))
    .groupby("interval")
    .agg(volt_mean=("Volt.mv", "mean"), volt_std=("Volt.mv", "std"))
)
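For example, on synthetic data (the column names come from the question; the values are made up):

import numpy as np
import pandas as pd

# synthetic stand-in for the ecg data: 50 samples, 10 ms apart
df = pd.DataFrame({
    "Time.s": np.arange(0, 0.5, 0.01),
    "Volt.mv": np.random.default_rng(0).normal(1.0, 0.05, 50),
})

interval_length = 0.1
df_aggregated = (
    df
    .assign(interval=df["Time.s"].div(interval_length).astype("int").mul(interval_length))
    .groupby("interval")
    .agg(volt_mean=("Volt.mv", "mean"), volt_std=("Volt.mv", "std"))
)
print(df_aggregated)  # one row per 0.1 s interval, with mean and std of Volt.mv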

How to build a simple moving average measure

I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
[schema screenshot]
The expected result is that for each date, this measure shows the average closing price over the X days up to and including that date.
For example, for the date 2013-02-28 and X = 5 days, the measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, and 2013-02-22: the closing values of those days are summed, then divided by 5.
The same would be done for each of the rows.
[example dashboard screenshot]
Maybe it could be achieved just with the function tt.agg.mean(), indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each date dynamically so that I can use them in a measure.
To compute a sliding average you can use the cumulative scope, as described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range, in your case ("-5D", None), you can calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt

# session setup
...
m, l = cube.measures, cube.levels

# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (plus an index based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
I've experimented with various iterations of a for loop:
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90-degree days print, including multiple from the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
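As a minimal illustration (made-up data; note that .first() replaces the index with Year, so if you want the dates in the output, reset the date index into a column first):

import pandas as pd

# made-up sample in the shape described in the question
climate = pd.DataFrame(
    {"Year": [1919, 1919, 1920, 1920], "Max": [88, 92, 95, 91]},
    index=pd.to_datetime(["1919-06-10", "1919-06-12", "1920-07-01", "1920-07-15"]),
)

first_hot_days = (climate[climate.Max >= 90]
                  .reset_index()  # keep the date index as a column
                  .groupby('Year')
                  .first())
print(first_hot_days)  # one row per year: 1919-06-12 and 1920-07-01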

Spotfire: Show and calculate difference of two values from selected dates in the plot

I am showing pressure data in a graph by date, which can be filtered (days, months, years).
I would like to calculate the difference between the two extreme values in the plot [last value - first value]; when the user changes a filter, I show the new calculation as the graph changes.
PropertyName   AverageReading  Date
LevelPressure  1               1/1/2018
LevelPressure  5               1/3/2018
LevelPressure  24              1/2/2018
LevelPressure  4               1/5/2018
LevelPressure  3               2/2/2018
LevelPressure  2               2/3/2018
LevelPressure  1               2/4/2018
LevelPressure  77              2/1/2018
LevelPressure  33              2/2/2018
Here is my custom expression, but it's not working properly (Date is the X-axis value, LevelPressure the Y-axis):
Abs(if([Property Name]="LevelPressure",[Average Reading]))
- sum(if([Property Name]="LevelPressure",[Average Reading]))
over (PreviousPeriod([Date]))
If you are inserting a calculated column, it will always take the entire data set into account; it will not respect filtering. You can instead create a calculated value and apply data limiting or filtering, or write an expression on the axis of a visualization. Based on the expression you gave, it seems you are inserting a calculated column, and this will not work.
Here is a solution that may or may not work for your use case. Your explanation did not specify what type of visualization you are working with, so I assumed a scatter plot, but this solution works with any visualization type.
Go to Properties > Lines & Curves > Add a Horizontal Line configured with the custom expression Abs(Max([Y]) - Min([Y])). This puts a line on the chart at the absolute value of the difference between the max and min average reading (average reading being your Y-axis value), and it updates with filtering.
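If you specifically need last value minus first value by date, rather than max minus min, an expression along the lines of ValueForMax([Date],[Average Reading]) - ValueForMin([Date],[Average Reading]) may be closer to your goal, since ValueForMax/ValueForMin return the [Average Reading] at the latest/earliest [Date]; check that these functions are available in your Spotfire version.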

Grab minimum time in year

As an athlete I want to keep track of my progression in Excel.
I need a formula that looks up the fastest time run in a given season (the lowest value in column E for a given year; for 2017, for example, this is 13.32, for 2018 it is 12, and so on).
Can you help me further?
Instead of a formula you can use a PivotTable.
Put Year in the Report Filter area and Time in the Values area, then in the value field settings choose Min under "Summarize value by".
Every time you change the year in the filter, the minimum value will show up.
=AGGREGATE(15,6,E3:E6/(B3:B6=2017),1)
15 tells AGGREGATE to use the SMALL function, i.e. to return the k-th smallest of the results.
6 tells AGGREGATE to ignore any errors, such as the #DIV/0! errors produced when dividing by 0.
E3:E6 is your time range.
B3:B6 is your Year column as integers.
B3:B6=2017 becomes 1 when TRUE and 0 when FALSE, provided it goes through a math operation like division.
1 is k, telling AGGREGATE to return the 1st (smallest) value in the results.
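If your Excel version has MINIFS (Excel 2019 / Microsoft 365), a simpler equivalent for the same ranges would be:
=MINIFS(E3:E6,B3:B6,2017)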
