Resample a distribution conditional on another value - python-3.x

I would like to create a series of simulated values by resampling from empirical observations. The data I have are time series of 1-minute frequency. The simulations should cover an arbitrary number of days, with the same times each day. The twist is that I need to sample conditional on the time, i.e. when sampling for a time of 8:00, it should be more probable to sample a value from around 8:00 (but not limited to 8:00) in the original series.
I have made a small sketch to show how the draw distribution changes depending on which time a value is simulated for:
I.e. for T=0 it is more probable to draw a value from the actual distribution where the time of day is close to 0, and improbable to draw a value from the original distribution at a time of day of T=n/2 or later, where n is the number of unique timestamps in a day.
Here is a code snippet to generate sample data (I am aware that there is no need to sample conditionally on this test data; it is just to show the structure of the data):
import numpy as np
import pandas as pd
# Create a test data frame (only for illustration)
df = pd.DataFrame(index=pd.date_range(start='2020-01-01', end='2020-12-31', freq='T'))
df['MyValue'] = np.random.normal(0, scale=1, size=len(df))
print(df)
MyValue
2020-01-01 00:00:00 0.635688
2020-01-01 00:01:00 0.246370
2020-01-01 00:02:00 1.424229
2020-01-01 00:03:00 0.173026
2020-01-01 00:04:00 -1.122581
...
2020-12-30 23:56:00 -0.331882
2020-12-30 23:57:00 -2.463465
2020-12-30 23:58:00 -0.039647
2020-12-30 23:59:00 0.906604
2020-12-31 00:00:00 -0.912604
[525601 rows x 1 columns]
# Objective: Create a new time series, where each time the values are
# drawn conditional on the time of the day
I have not been able to find an answer on here that fits my requirements. All help is appreciated.

I am considering this sentence:
need to sample conditional on the time, i.e. when sampling for a time of 8:00, it should be more probable to sample a value around 8:00 (but not limited to 8:00) from the original series.
Then, assuming the standard deviation is one sixth of the day (given your drawing), you can draw the time position to sample from with:
# draws a time index centred on the current time of day; the value is then read off at that index
sampled_time = np.random.normal(loc=current_time_sample, scale=total_samples/6)
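Expanding on that idea, here is a minimal sketch of the full resampling loop. It is my own interpretation, not code from the question: the Gaussian weighting over minute-of-day distance (with wrap-around at midnight) and all names are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Original 1-minute series, shaped like the question's test data.
df = pd.DataFrame(index=pd.date_range('2020-01-01', '2020-01-31', freq='T'))
df['MyValue'] = rng.normal(0, 1, size=len(df))

n = 24 * 60                                    # unique timestamps per day
minute_of_day = (df.index.hour * 60 + df.index.minute).to_numpy()
values = df['MyValue'].to_numpy()
sigma = n / 6                                  # std dev of one sixth of a day

def sample_at(minute, size=1):
    """Draw `size` values, favouring observations whose time of day is near `minute`."""
    dist = np.abs(minute_of_day - minute)
    dist = np.minimum(dist, n - dist)          # wrap around midnight
    weights = np.exp(-0.5 * (dist / sigma) ** 2)
    return rng.choice(values, size=size, p=weights / weights.sum())

# Simulate two days at the same 1-minute times.
sim_index = pd.date_range('2021-01-01', periods=2 * n, freq='T')
sim = pd.Series([sample_at(t.hour * 60 + t.minute)[0] for t in sim_index],
                index=sim_index, name='Simulated')
print(sim.head())
Since there are only n distinct minutes in a day, you could precompute one weight vector per minute instead of recomputing it for every draw.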

Related

calculate average difference between dates using pyspark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between the dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID need to be sorted in order before calculating the differences, as I need the difference between each pair of consecutive dates.
ID  Date
1   2020-09-03
1   2020-09-03
2   2020-09-02
1   2020-09-04
2   2020-09-06
2   2020-09-16
The needed outcome for this example will be:
ID  average difference
1   0.5
2   7
thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it takes a value from the previous row within the window.
from pyspark.sql import Window, functions as F

# define the window: rows per ID, ordered by Date
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second)
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
   .groupby('ID')  # aggregate over ID
   .agg(F.avg(F.col('diff')).alias('average difference'))
)
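As a quick sanity check, here is a sketch that builds the sample frame from the question and runs the aggregation above (the session setup is my assumption):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, '2020-09-03'), (1, '2020-09-03'), (2, '2020-09-02'),
     (1, '2020-09-04'), (2, '2020-09-06'), (2, '2020-09-16')],
    ['ID', 'Date'],
).withColumn('Date', F.to_date('Date'))

w = Window.partitionBy('ID').orderBy('Date')
(df.withColumn('diff', F.datediff('Date', F.lag('Date').over(w)))
   .groupby('ID')
   .agg(F.avg('diff').alias('average difference'))
   .show())
# ID 1: (0 + 1) / 2 = 0.5; ID 2: (4 + 10) / 2 = 7.0 -- matching the expected outcome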

Function for finding duration where wave height <3m where time is between 5:00am and 6:00PM

I am trying to find the duration for which the wave height is under 3 m and the time is between 5:00 am and 6:00 pm, over a month of tidal data.
I have raw data for wave height and timestamps when it is high and low.
eg.
Timestamp Wave_Height
1/01/2022 3:16 0.68
1/01/2022 9:37 6.62
1/01/2022 16:14 1.07
1/01/2022 21:54 5.37
2/01/2022 4:06 0.59
etc…
So far I have used linear interpolation to find the points where wave height = 3. I am struggling to write a function that finds the durations within my time limits.
(Picture included to explain: a graph of the wave data over time.)
The timestamps span different days of the month, so the difference between times must account for the date changing in some cases (see the rev 2 errors, displayed as #######, which occur when the date changes).
The following should work. I have added some columns to avoid complicated formulas.
interpolate when the wave_height = 3 (column G)
add column H which is True when wave_height increases and False if it decreases (at the time in column G):
so cell H6 = F7<3 gives TRUE
add column E to limit the time window to 5:00-18:00.
E7 is =IF(D7<$G$2;$G$2;IF(D7>$H$2;$H$2;D7))
Added column I to calculate the time during which wave_height < 3. The sum of that column is what you need.
I8 is =H8*(G8-E7)+NOT(H8)*(D8-G8)
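For anyone who prefers to do this in pandas rather than a spreadsheet, here is a rough sketch of the same logic. The linear interpolation between consecutive extremes follows the question, and the column names come from the sample data; everything else is my assumption:
import pandas as pd

# Sample of the high/low extremes from the question.
df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2022-01-01 03:16', '2022-01-01 09:37',
                                 '2022-01-01 16:14', '2022-01-01 21:54',
                                 '2022-01-02 04:06']),
    'Wave_Height': [0.68, 6.62, 1.07, 5.37, 0.59],
})
LEVEL = 3.0

def crossing(t0, h0, t1, h1):
    """Linearly interpolated time at which the height crosses LEVEL."""
    return t0 + (LEVEL - h0) / (h1 - h0) * (t1 - t0)

# Collect the intervals during which the interpolated height is below LEVEL.
rows = list(df.itertuples(index=False))
intervals = []
start = rows[0].Timestamp if rows[0].Wave_Height < LEVEL else None
for a, b in zip(rows, rows[1:]):
    if (a.Wave_Height < LEVEL) != (b.Wave_Height < LEVEL):
        t = crossing(a.Timestamp, a.Wave_Height, b.Timestamp, b.Wave_Height)
        if b.Wave_Height < LEVEL:
            start = t                        # falling through 3 m
        else:
            intervals.append((start, t))     # rising through 3 m
            start = None
if start is not None:
    intervals.append((start, rows[-1].Timestamp))

# Clip every interval to the 05:00-18:00 window of each day it touches, then sum.
total = pd.Timedelta(0)
for lo, hi in intervals:
    for day in pd.date_range(lo.normalize(), hi.normalize(), freq='D'):
        overlap = min(hi, day + pd.Timedelta(hours=18)) - max(lo, day + pd.Timedelta(hours=5))
        if overlap > pd.Timedelta(0):
            total += overlap
print(total)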

How to plot pandas dataframe in 24 hour intervals? (multiple plots)

I have a pandas dataframe of about 3 years with the resolution of 6 seconds and I want to group the data into 24-hour bins and plot each day using matplotlib in a loop.
This is my dataframe's head:
timestamp consumption
0 2012-11-11 12:00:03 468
1 2012-11-11 12:00:09 476
2 2012-11-11 12:00:16 463
3 2012-11-11 12:00:22 449
4 2012-11-11 12:00:28 449
It includes the power consumption of a house from 2012 to 2015. After the pre-processing, the dataframe starts at about 12 pm on the first day. I need to plot the whole dataframe in 24-hour intervals, and each plot must represent a single day that starts at about 12 pm and ends at about 12 pm the next day.
So, I need about 1500 plots that show the power consumption of each day starting from 12 pm, for about 1500 days of my dataframe.
Thanks in advance.
Update: The reason I want to plot 1500 days separately, is I want to check each night's power consumption and label the occupant's sleep pattern. And I considered each day from 12 pm to 12 pm to have a complete sleep cycle in one plot. And after preparing the labels I'll be able to use them as train and test data for classification
Consider this not only an answer but also a suggestion. First, convert the column 'timestamp' into the index (DatetimeIndex)
df.set_index(df['timestamp'], inplace=True, drop=True)
Then, get all the unique days that occur in your DataFrame (sorted, since set() does not preserve order):
unique_days = sorted(set(df.index.to_period('D').strftime('%Y-%m-%d')))
We then squeeze the DataFrame into a Series
del df['timestamp']
df = df.squeeze()
Now, just plot the unique days of your series in separate subplots.
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=len(unique_days), ncols=1)
for row, day in enumerate(unique_days):
    # partial string indexing on the DatetimeIndex selects a single day
    df[day].plot(ax=axes[row], figsize=(50, 10))
plt.show()
Now, it's time for you to play around with the parameters of plots so that you can customize them to your needs.
This is kind of a strange request. If we knew what your end objective is, it might be easier to understand, but I'm going to assume you want to plot and then save figures for each of the days.
import matplotlib.pyplot as plt
import pandas as pd

# shifting by 12 hours makes each noon-to-noon window fall on a single calendar date
df['day'] = (df['timestamp'] + pd.Timedelta('12h')).dt.date
for day in df['day'].unique():
    mask = (df['day'] == day)
    # <the code for the plot that you want>
    plt.plot(df[mask]['timestamp'], df[mask]['consumption'])
    plt.savefig('filename' + str(day) + '.png')
    plt.close()

Apply a function on a pandas dataframe through a list of datetime windows

I'd like to calculate some aggregates over a pandas DataFrame. My DataFrame does not have a uniform sample time, but I want to calculate aggregates over uniform sample times.
For example: I want to calculate the mean of the last 7 days, each day.
So I want the resulting dataframe to have a datetime index daily.
I tried to implement it with a map-reduce structure, but it was 40 times slower than a pandas implementation (in some specific cases).
My question is, do you know if there is any way to do this using pandas built-in functions?
An input DataFrame could be:
a b c
2010-01-01 00:00:00 0.957828 0.784962 0.801670
2010-01-01 00:00:06 0.669214 0.484439 0.479857
2010-01-01 00:00:18 0.537689 0.222179 0.995624
2010-01-01 00:01:15 0.339822 0.787626 0.144389
2010-01-01 00:02:21 0.705167 0.163373 0.317012
... ... ... ...
2010-03-14 23:57:35 0.799490 0.692932 0.050606
2010-03-14 23:58:40 0.380406 0.825227 0.643480
2010-03-14 23:58:43 0.838390 0.701595 0.632378
2010-03-14 23:59:01 0.604610 0.965274 0.503141
2010-03-14 23:59:02 0.320855 0.937064 0.192669
And the function should output something like this:
a b c
2010-01-01 0.957828 0.784962 0.801670
2010-01-02 0.499331 0.499944 0.505271
2010-01-03 0.499731 0.498455 0.503352
2010-01-04 0.499632 0.499328 0.502895
2010-01-05 0.500119 0.500299 0.502169
... ... ... ...
2010-03-10 0.499813 0.499680 0.501154
2010-03-11 0.499965 0.500226 0.501582
2010-03-12 0.500644 0.500720 0.501243
2010-03-13 0.500439 0.500264 0.501203
2010-03-14 0.500776 0.500334 0.501048
Where each value corresponds to the mean of the last 7 days.
I know it's an old thread, but maybe this answer will be helpful to someone.
I think it's:
df.rolling(window=your_window_value).mean()  # simple moving average
that could be used.
But it will only work if you have the same number of rows for each day. In that case you could use a window value equal to 7 x the number of rows per day.
If each day has a different number of rows, then we would probably need a different solution; see the time-based sketch below. That scenario was not fully defined in the question...
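For the irregular case, pandas rolling also accepts a time offset as the window, which sidesteps the fixed-row-count limitation. A minimal sketch, where the random data generation is my own stand-in for the frame above:
import numpy as np
import pandas as pd

# Irregularly sampled frame resembling the one in the question.
rng = np.random.default_rng(0)
idx = pd.date_range('2010-01-01', '2010-03-15', freq='s')
idx = idx[rng.random(len(idx)) < 0.001]   # thin it out to an irregular sample
df = pd.DataFrame(rng.random((len(idx), 3)), index=idx, columns=list('abc'))

# '7D' makes each row's window cover the preceding 7 days of wall-clock time;
# taking the last row per calendar day then yields one 7-day mean per day.
daily = df.rolling('7D').mean().resample('D').last()
print(daily.head())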

Getting end of period change with different index

I've got a Panel of DataFrames containing Yahoo Finance stocks.
I want to re-index the stocks by Date inside the Panel.
To be more specific: what I mean, in simple words, is to get the difference between the end of Q1 2013 and the beginning of the quarter for e.g. stock quote 'AA'.
The expected result from:
AA['Close']['March 28, 2013'] - AA['Open']['January 2, 2013']
is 8.52 (March closing price) - 8.99 (January opening price) = -0.47 (difference).
I want to do this for all the quarters I have in the daily data, from Q1 2010 until Q3 2013, to get the differences. That is, to change the index from daily to quarterly.
What is the best way to do it?
Thanks everybody.
You should post what you expect your solution to look like so that I know I'm answering the correct question. I don't think the way I've interpreted your question makes sense. Edit your question to clarify and I'll edit my answer accordingly.
When you're working with Time Series data, there are some special resampling methods that are handy in cases like this.
Conceptually they're similar to groupby operations.
First of all, fetch the data (at the time this answer was written, DataReader lived in pandas.io.data; it has since moved to the separate pandas-datareader package):
from pandas.io.data import DataReader
pan = DataReader(['AAPL', 'GM'], data_source='yahoo')
For demonstration, focus on just AAPL.
df = pan.xs('AAPL', axis='minor')
In [24]: df.head()
Out[24]:
Open High Low Close Volume Adj Close
Date
2010-01-04 213.43 214.50 212.38 214.01 17633200 209.51
2010-01-05 214.60 215.59 213.25 214.38 21496600 209.87
2010-01-06 214.38 215.23 210.75 210.97 19720000 206.53
2010-01-07 211.75 212.00 209.05 210.58 17040400 206.15
2010-01-08 210.30 212.00 209.06 211.98 15986100 207.52
Now use the resample method to get to the frequency you're looking for. I'm going to demonstrate quarterly, but you can substitute the appropriate code. We'll use BQS for Business Quarter Start. To aggregate, we take the sum.
In [33]: df.resample('BQS', how='sum').head()
Out[33]:
Open High Low Close Volume Adj Close
Date
2010-01-01 12866.86 12989.62 12720.40 12862.16 1360687400 12591.73
2010-04-01 16083.11 16255.98 15791.50 16048.55 1682179900 15711.11
2010-07-01 16630.16 16801.60 16437.74 16633.93 1325312300 16284.18
2010-10-01 19929.19 20069.74 19775.96 19935.66 1025567800 19516.49
2011-01-03 21413.54 21584.60 21219.88 21432.36 1122998000 20981.76
Ok so now we want the Open for today minus the Close for yesterday for the gross change. Or (today / yesterday) - 1 for the percentage change. To do this, use the shift method, which shifts all the data down one row.
In [34]: df.resample('BQS', how='sum')['Open'] - df.resample('BQS', how='sum').shift()['Close']
Out[34]:
Date
2010-01-01 NaN
2010-04-01 3220.95
2010-07-01 581.61
2010-10-01 3295.26
2011-01-03 1477.88
2011-04-01 -119.37
2011-07-01 3058.69
2011-10-03 338.77
2012-01-02 6487.65
2012-04-02 5479.15
2012-07-02 3698.52
2012-10-01 -4367.70
2013-01-01 -7767.81
2013-04-01 -355.53
2013-07-01 -19491.64
Freq: BQS-JAN, dtype: float64
You could write a function and apply it to each DataFrame in the panel you get from Yahoo.
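Note that this answer predates a couple of API changes: Panel has been removed and resample no longer takes a how argument. As a rough sketch of the modern equivalent, assuming df is the per-ticker frame fetched above:
# modern pandas: aggregate after resampling instead of passing how=
q = df.resample('BQS').sum()
gross = q['Open'] - q['Close'].shift()     # this quarter's Open minus last quarter's Close
pct = q['Open'] / q['Close'].shift() - 1   # the percentage change mentioned above
print(gross.head())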
