How to split a Series into time intervals? (python) - pandas-groupby

I have this dataframe:
I need to split the rows of the "Time.s" column into intervals, calculate the average of each interval, and finally the deviation of each average.
I can't manage to split the rows that have Volt.mv > 0.95 into one group per second. I tried GroupBy, but it creates problems with the second table:
I used this code, calculating the average directly, but I must have done something wrong:
ecg.groupby("Time.s").apply(lambda x: x["Volt.mv"].mean())
Can anyone help me?

Before doing the groupby, you need to map Time.s to an interval. Otherwise each group will have only a single row (most of the time).
Here is how to group into intervals of 0.1 seconds and compute the mean and standard deviation for each interval:
interval_length = 0.1
df_aggregated = (
    df
    .assign(interval=df["Time.s"].div(interval_length).astype("int").mul(interval_length))
    .groupby("interval")
    .agg(volt_mean=("Volt.mv", "mean"), volt_std=("Volt.mv", "std"))
)
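To see what the interval mapping does, here is a small illustration on made-up Time.s values (the sample numbers are hypothetical):

import pandas as pd

interval_length = 0.1
times = pd.Series([0.03, 0.07, 0.12, 0.19, 0.25])  # hypothetical Time.s values
print(times.div(interval_length).astype("int").mul(interval_length).tolist())
# [0.0, 0.0, 0.1, 0.1, 0.2]  -> every value is floored to the start of its 0.1 s bin

Each group then contains all rows whose Time.s falls in the same 0.1 s bin, and the mean and std are computed per bin.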

Related

How would I write a program in Python that creates a formatted table of wind chill values?

The National Weather Service computes the wind chill index using the following formula:
35.74 + 0.6215T - 35.75(V^0.16) + 0.4275T(V^0.16)
where T is the temperature in degrees Fahrenheit, and V is the wind speed in miles per hour. Rows should represent wind speed from 0 to 50 in 5 mph increments, and the columns represent temperatures from -20 to +60 in 10 degree increments.
Create a function windChill(temperature, velocity) which will calculate and return the wind chill for a combination of temperature and wind speed. If the velocity (wind speed) is less than 3 miles per hour, the function should return the provided temperature. If the velocity is greater than or equal to 3 MPH use the formula to calculate and return the wind chill.
Your main() function should use nested loops to create the table, with repeated calls to your windChill() function to determine the value for each wind/temperature combination.
Evaluation:
Use of function to calculate wind chill
Use of nested loops (for or while) to print the rows
Format of table is important
numbers are integers and column aligned
header aligns with values
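Since only the assignment text is shown here, a minimal sketch of the described approach could look like the following (the column widths and the "MPH" corner label are arbitrary formatting choices, not part of the assignment):

def windChill(temperature, velocity):
    # Below 3 mph the wind chill is defined as the air temperature itself.
    if velocity < 3:
        return temperature
    return (35.74 + 0.6215 * temperature
            - 35.75 * velocity ** 0.16
            + 0.4275 * temperature * velocity ** 0.16)

def main():
    temperatures = range(-20, 61, 10)  # columns: -20 to +60 in 10 degree steps
    velocities = range(0, 51, 5)       # rows: 0 to 50 mph in 5 mph steps
    # Header row, right-aligned so it lines up with the value columns below
    print("MPH".rjust(5) + "".join(str(t).rjust(6) for t in temperatures))
    for v in velocities:
        row = str(v).rjust(5)
        for t in temperatures:
            row += str(round(windChill(t, v))).rjust(6)
        print(row)

main()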

calculate average difference between dates using pyspark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID need to be sorted before calculating the difference, as I need the difference between each pair of consecutive dates.
ID  Date
1   2020-09-03
1   2020-09-03
2   2020-09-02
1   2020-09-04
2   2020-09-06
2   2020-09-16
The needed outcome for this example would be:
ID  average difference
1   0.5
2   7
thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it takes the value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# define the window
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
   .groupby('ID')  # aggregate over ID
   .agg(F.avg(F.col('diff')).alias('average difference'))
)
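The question asks for RDD operations such as map and reduce rather than SQL functions, so here is a rough sketch in that style. It assumes the Date column already holds datetime.date values and that every ID has at least two rows; adjust the parsing otherwise.

# RDD-style sketch (assumption: Date is a datetime.date, each ID has >= 2 dates)
avg_diff = (
    df.rdd
      .map(lambda row: (row['ID'], [row['Date']]))
      .reduceByKey(lambda a, b: a + b)                   # gather all dates per ID
      .mapValues(sorted)                                 # order the dates before differencing
      .mapValues(lambda ds: [(d2 - d1).days for d1, d2 in zip(ds, ds[1:])])
      .mapValues(lambda diffs: sum(diffs) / len(diffs))  # average consecutive gap
)
avg_diff.collect()  # e.g. [(1, 0.5), (2, 7.0)]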

How to build a simple moving average measure

I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
Schema
The expected result is that for each date, this measure shows the closing price average of the X previous days to that date.
For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, 2013-02-22. The closing values of those days are summed, and then divided by 5.
The same would be done for each of the rows.
Example dashboard
Maybe it could be achieved just with the function tt.agg.mean(), indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average you can use the cumulative scope, as described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range, in your case ("-5D", None), you can calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt
# session setup
...
m, l = cube.measures, cube.levels
# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))
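To show it on a dashboard you would normally store the expression as a named measure; the measure name below is just an example, not something required by atoti:

# Hypothetical measure name -- choose whatever fits your cube
m["ClosingPrice SMA 5D"] = tt.agg.mean(
    m["ClosingPrice"],
    scope=tt.scope.cumulative(l["date"], window=("-5D", None)),
)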

Faster rolling apply std and slope on unevenly-spaced timeseries

I have a Pandas (1.0.*) dataframe which contains records of several physical variables (say Temperature, Pressure and Humidity, for example).
The time spacing between two records is roughly 1 s but varies between 0.8 s and 4 s.
I want to calculate the standard deviation and the slope (of the linear regression) in a 5-minutes rolling window.
Here is how I do it:
import numpy as np
import pandas as pd
import datetime
np.random.seed(1)
# Build the dummy dataset for testing
rows, cols = 1000, 3
datetimes_sec = pd.date_range('2020-01-01', periods=rows, freq='1s').astype(np.int64) / 1e9
shifts = np.random.rand(rows) - 0.5 # Create random shift between -0.5s and +0.5s
datetimes = [sum(x) * 1e9 for x in zip(datetimes_sec, shifts)]
df = pd.DataFrame(np.random.rand(rows, cols),
                  columns=['temperature', 'pressure', 'humidity'],
                  index=pd.to_datetime(datetimes))

# Custom function to calculate the slope
def get_slope(series):
    hours_since_epoch = series.index.astype(np.int64) / 3.6e12  # nanoseconds to hours; I want the slope in [variable's unit] per hour
    slope = np.polyfit(hours_since_epoch, series, 1)[0]
    return slope
# Get the result
df = df.rolling("5min").agg(["std", get_slope])
This works, but it is too slow: the last line takes more than 2 s for 1000 rows.
I can see that my custom get_slope function is responsible; if I replace it with a standard function (e.g. min()), it takes 0.007 s. But I can't find how to make it faster.
If it is not possible to get the same result faster, a workaround could be to skip some data lines: do not roll the window on every line (i.e. every 0.8 to 4 seconds) but do the calculation only every 30 s:
calculate sd and slope of all (~300) data between 00:00:00 and 00:05:00
calculate sd and slope of all (~300) data between 00:00:30 and 00:05:30
calculate sd and slope of all (~300) data between 00:01:00 and 00:06:00
etc.
Instead of:
calculate sd and slope of all (~300) data between 00:00:00 and 00:05:00
calculate sd and slope of all (~300) data between 00:00:01 and 00:05:01
calculate sd and slope of all (~300) data between 00:00:02 and 00:05:02
etc.
I don't know how to do that (in the proper pandas way) with unevenly-spaced data.
It would speed up the process by a factor of 30, in exchange for a loss of precision.
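A rough sketch of that 30-second workaround, applied to the original df (before the rolling aggregation) and assuming a sorted DatetimeIndex; window_stats is a hypothetical helper, not a pandas function:

import numpy as np
import pandas as pd

def window_stats(frame, start, length="5min"):
    # std and slope of every column over [start, start + length]
    chunk = frame.loc[start : start + pd.Timedelta(length)]
    hours = chunk.index.astype(np.int64) / 3.6e12  # nanoseconds to hours
    out = {}
    for col in chunk.columns:
        out[(col, "std")] = chunk[col].std()
        out[(col, "slope")] = np.polyfit(hours, chunk[col], 1)[0]
    return pd.Series(out)

# One window every 30 s instead of one window per row
anchors = pd.date_range(df.index[0], df.index[-1] - pd.Timedelta("5min"), freq="30s")
result = pd.DataFrame([window_stats(df, a) for a in anchors], index=anchors)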

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, and Precip Total, plus Day, Year, and Month values (and an index based on a date-time value). Right now, I want to set up a for loop to print the first maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break

1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
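If the frame might not already be in chronological order, sorting by the datetime index first keeps "first" meaning "earliest in the year"; a small variation on the snippet above:

first_hot_days = (
    climate.sort_index()               # ensure chronological order
           .loc[lambda d: d['Max'] >= 90]
           .groupby('Year')
           .first()
)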
