I have a Pandas (1.0.*) dataframe that records several physical variables (say temperature, pressure and humidity).
The interval between two records is roughly 1 s but varies between 0.8 s and 4 s.
I want to calculate the standard deviation and the slope (of the linear regression) over a 5-minute rolling window.
Here is how I do it:
import numpy as np
import pandas as pd
import datetime
np.random.seed(1)
# Build the dummy dataset for testing
rows, cols = 1000, 3
datetimes_sec = pd.date_range('2020-01-01', periods=rows, freq='1s').astype(np.int64) / 1e9
shifts = np.random.rand(rows) - 0.5 # Create random shift between -0.5s and +0.5s
datetimes = [sum(x) * 1e9 for x in zip(datetimes_sec, shifts)]
df = pd.DataFrame(np.random.rand(rows, cols),
                  columns=['temperature', 'pressure', 'humidity'],
                  index=pd.to_datetime(datetimes))
# Custom function to calculate the slope
def get_slope(series):
    # nanoseconds to hours: I want the slope to be in [variable's unit] per hour
    hours_since_epoch = series.index.astype(np.int64) / 3.6e12
    slope = np.polyfit(hours_since_epoch, series, 1)[0]
    return slope
# Get the result
df = df.rolling("5min").agg(["std", get_slope])
This works, but it is too slow: the last line takes more than 2 s for 1000 rows.
I can see that my custom get_slope function is responsible: if I replace it with a built-in function (e.g. min()), it takes 0.007 s. But I can't find how to make it faster.
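One possible way to avoid calling np.polyfit once per window is the closed-form least-squares slope, cov(t, y) / var(t), built from time-based rolling means that pandas computes in compiled code. The following is only a sketch under some assumptions: the helper name rolling_slope is hypothetical, time is measured in hours from the first record, and the data contains no missing values. The first row comes out as NaN because a single point has no slope.

```python
import numpy as np
import pandas as pd

def rolling_slope(series, window="5min"):
    """Least-squares slope (per hour) of `series` over a time-based rolling window."""
    # Elapsed time in hours since the first record, aligned with the data
    hours = np.asarray((series.index - series.index[0]) / pd.Timedelta(hours=1), dtype=float)
    t = pd.Series(hours, index=series.index)
    y = series.astype(float)
    mean_t = t.rolling(window).mean()
    mean_y = y.rolling(window).mean()
    mean_ty = (t * y).rolling(window).mean()
    mean_tt = (t * t).rolling(window).mean()
    # slope = cov(t, y) / var(t); a one-point window gives 0/0, i.e. NaN
    return (mean_ty - mean_t * mean_y) / (mean_tt - mean_t ** 2)

# Small check on unevenly spaced data with a known slope of 3 units per hour
rng = np.random.default_rng(0)
seconds = np.cumsum(rng.uniform(0.8, 4.0, size=500))
index = pd.Timestamp("2020-01-01") + pd.to_timedelta(seconds, unit="s")
demo = pd.Series(3.0 * seconds / 3600 + 1.0, index=index)
slopes = rolling_slope(demo)
```

For a whole dataframe, something like df.apply(rolling_slope) would apply the helper column by column, and df.rolling("5min").std() covers the standard deviation.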
If it is not possible to get the same result faster, a workaround could be to skip some rows: instead of rolling the window on every row (i.e. every 0.8 to 4 seconds), do the calculation only every 30 s:
calculate sd and slope of all (~300) data between 00:00:00 and 00:05:00
calculate sd and slope of all (~300) data between 00:00:30 and 00:05:30
calculate sd and slope of all (~300) data between 00:01:00 and 00:06:00
etc.
Instead of:
calculate sd and slope of all (~300) data between 00:00:00 and 00:05:00
calculate sd and slope of all (~300) data between 00:00:01 and 00:05:01
calculate sd and slope of all (~300) data between 00:00:02 and 00:05:02
etc.
I don't know how to do that (in the proper pandas way) with unevenly spaced data.
It would speed up the process by a factor of about 30, in exchange for a loss of precision.
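The 30-second workaround described above could be sketched with an explicit loop over window end times. This is an illustration rather than an established pandas idiom: the function name windowed_std_and_slope and the output column suffixes are hypothetical, each window is taken as (end - 5 min, end], and every window is assumed to contain at least two records.

```python
import numpy as np
import pandas as pd

def windowed_std_and_slope(df, window="5min", step="30s"):
    """Std and per-hour regression slope of each column, evaluated every `step`."""
    window = pd.Timedelta(window)
    ends = pd.date_range(df.index[0] + window, df.index[-1], freq=step)
    # Elapsed time in hours since the first record
    hours = np.asarray((df.index - df.index[0]) / pd.Timedelta(hours=1), dtype=float)
    rows = []
    for end in ends:
        mask = (df.index > end - window) & (df.index <= end)
        chunk = df[mask]
        # np.polyfit accepts a 2-D y and fits every column in one call
        slopes = np.polyfit(hours[mask], chunk.to_numpy(), 1)[0]
        row = {f"{col}_std": value for col, value in zip(df.columns, chunk.std())}
        row.update({f"{col}_slope": value for col, value in zip(df.columns, slopes)})
        rows.append(row)
    return pd.DataFrame(rows, index=ends)

# Demo on data with a known slope of 3 units per hour
idx = pd.date_range("2020-01-01", periods=1000, freq="1s")
t_hours = np.arange(1000) / 3600
demo = pd.DataFrame({"temperature": 3.0 * t_hours + 1.0}, index=idx)
result = windowed_std_and_slope(demo)
```

The loop runs once per 30-second step instead of once per row, which is where the roughly 30-fold saving would come from.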
Related
The National Weather Service computes the wind chill index using the following formula:
35.74 + 0.6215T - 35.75(V^0.16) + 0.4275T(V^0.16)
where T is the temperature in degrees Fahrenheit, and V is the wind speed in miles per hour. Rows should represent wind speed from 0 to 50 in 5 mph increments, and the columns represent temperatures from -20 to +60 in 10 degree increments.
Create a function windChill(temperature, velocity) which will calculate and return the wind chill for a combination of temperature and wind speed. If the velocity (wind speed) is less than 3 miles per hour, the function should return the provided temperature. If the velocity is greater than or equal to 3 MPH use the formula to calculate and return the wind chill.
Your main() function should use nested loops to create the table, with repeated calls to your windChill() function to determine the value for each wind/temperature combination.
Evaluation:
Use of function to calculate wind chill
Use of nested loops (for or while) to print the rows
Format of table is important
numbers are integers and column aligned
header aligns with values
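A minimal sketch of the assignment above, assuming the formula raises V to the power 0.16 and that the wind chill values are rounded to integers for display. The column width of 5 characters and the "mph |" row label are arbitrary layout choices, not part of the assignment.

```python
def windChill(temperature, velocity):
    """Wind chill in degrees Fahrenheit; below 3 mph the temperature is returned unchanged."""
    if velocity < 3:
        return temperature
    v16 = velocity ** 0.16
    return 35.74 + 0.6215 * temperature - 35.75 * v16 + 0.4275 * temperature * v16

def main():
    temperatures = range(-20, 61, 10)
    # Header row: one right-aligned column per temperature
    header = "mph |" + "".join(f"{t:>5}" for t in temperatures)
    print(header)
    print("-" * len(header))
    # Outer loop over wind speeds (rows), inner loop over temperatures (columns)
    for velocity in range(0, 51, 5):
        row = f"{velocity:>3} |"
        for temperature in temperatures:
            row += f"{round(windChill(temperature, velocity)):>5}"
        print(row)

if __name__ == "__main__":
    main()
```

Using the same width in the header and in the body is what keeps the header aligned with the values.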
The requirement is to generate sine waves with frequencies ranging from 1 Hz to 10 Hz.
The conditions are:
1. The amplitude is one volt for all sine wave frequencies.
2. The 1 Hz signal must be sampled into 100 points.
3. Similarly, the 2 Hz signal must be sampled at 200 samples per second, the 3 Hz signal at 300 samples per second, the 4 Hz signal at 400 samples per second, and so on up to 10 Hz.
4. All these values must be written into a CSV file and also converted to an Excel file.
Thus the Excel file has 10 columns named 1 Hz to 10 Hz, and each column has the corresponding values:
the 1st column has 100 samples,
the 2nd column has 200 samples, and so on up to the 10th column.
Can anyone help me solve this? So far I have coded it for 1 Hz to 10 Hz with the same sampling rate, so all columns have the same number of rows.
import numpy as np
import pandas as pd
def get_values_for_frequency(freq):
    # sampling information
    Fs = 100  # sample rate: number of samples per second
    T = 1/Fs  # sampling period: seconds per sample
    t = 1  # seconds of sampling
    N = Fs*t  # total points in signal
    # signal information
    omega = 2*np.pi*freq  # angular frequency for sine waves
    t_vec = np.arange(N)*T  # time vector
    y = np.sin(omega*t_vec)  # sine wave generation
    return y
#columns are created for sine wave frequencies
df = pd.DataFrame(columns =['1Hz','2Hz', '3Hz', '4Hz', '5Hz', '6Hz', '7Hz'])
df['1Hz']=pd.Series(get_values_for_frequency(1))
df['2Hz']=pd.Series(get_values_for_frequency(2))
df['3Hz']=pd.Series(get_values_for_frequency(3))
df['4Hz']=pd.Series(get_values_for_frequency(4))
df['5Hz']=pd.Series(get_values_for_frequency(5))
df['6Hz']=pd.Series(get_values_for_frequency(6))
df['7Hz']=pd.Series(get_values_for_frequency(7))
df = df.round(decimals=3)  # round all values in the DataFrame to 3 decimal places
print("created table datatype\n",df.dtypes) #viewing the datatype of table
df.to_csv("float_csvfile.csv",index=False) #the raw values are written into the csv file.
I tried to give the function two inputs:
def get_values_for_frequency(freq, samplepoints):
and called it with
df = pd.DataFrame(columns =['1Hz','2Hz', '3Hz', '4Hz', '5Hz', '6Hz', '7Hz'])
df['1Hz']=pd.Series(get_values_for_frequency(1,100))
df['2Hz']=pd.Series(get_values_for_frequency(2,200))
df['3Hz']=pd.Series(get_values_for_frequency(3,300))
df['4Hz']=pd.Series(get_values_for_frequency(4,400))
df['5Hz']=pd.Series(get_values_for_frequency(5,500))
df['6Hz']=pd.Series(get_values_for_frequency(6,600))
df['7Hz']=pd.Series(get_values_for_frequency(7,700))
Yet every column ends up with the same length.
Can anyone help me get columns of different lengths? Thanks.
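A likely cause of the equal column lengths is index alignment: once the first column fixes the DataFrame index at 100 rows, assigning a longer Series keeps only the rows whose labels already exist. One way around it, sketched below, is to build all the Series first and combine them with pd.concat, which pads the shorter columns with NaN. The two-argument signature, the parameter names sample_rate and duration, and the output file names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def get_values_for_frequency(freq, sample_rate, duration=1.0):
    """Sine wave of `freq` Hz, amplitude 1, sampled at `sample_rate` samples per second."""
    n_points = int(sample_rate * duration)
    t_vec = np.arange(n_points) / sample_rate  # sample times in seconds
    return np.sin(2 * np.pi * freq * t_vec)

# One Series per frequency: 1 Hz gets 100 samples, 2 Hz gets 200, ..., 10 Hz gets 1000
columns = {f"{f}Hz": pd.Series(get_values_for_frequency(f, 100 * f)) for f in range(1, 11)}

# concat aligns on the row index and fills the missing tail of shorter columns with NaN
df = pd.concat(columns, axis=1).round(3)

df.to_csv("sine_waves.csv", index=False)  # NaN cells are written as empty fields
# df.to_excel("sine_waves.xlsx", index=False)  # requires an Excel writer such as openpyxl
```

In the CSV file the shorter columns simply end in empty cells, so each column keeps its own number of values.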
I have this dataframe:
I need to split the rows into intervals of the "Time.s" column, calculate the average of each interval, and finally the deviation of each average.
I can't manage to split the rows that have Volt.mv > 0.95 into one group per second. I tried with GroupBy, but it creates problems with the second table:
I used this code, calculating the average directly, but I certainly did something wrong:
ecg.groupby("Time.s").apply(lambda x: x["Volt.mv"].mean())
Can anyone help me?
Before doing the groupby, you need to map Time.s to an interval. Otherwise each group will have only a single row (most of the time).
Here is how to group into intervals of 0.1 seconds and compute the mean and standard deviation for each interval:
interval_length = 0.1
df_aggregated = (
    df
    .assign(interval=df["Time.s"].div(interval_length).astype("int").mul(interval_length))
    .groupby("interval")
    .agg(volt_mean=("Volt.mv", "mean"), volt_std=("Volt.mv", "std"))
)
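For the per-second grouping of rows with Volt.mv > 0.95 mentioned in the question, the same idea could be sketched as follows. The sample values are made up for illustration, and reading the question as "filter first, then group by whole second" is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample data using the column names from the question
ecg = pd.DataFrame({
    "Time.s": [0.10, 0.45, 0.90, 1.20, 1.80, 2.50],
    "Volt.mv": [0.96, 0.50, 0.98, 0.97, 0.99, 1.00],
})

# Keep only the rows above the threshold, then label each with its whole second
peaks = ecg[ecg["Volt.mv"] > 0.95]
second = np.floor(peaks["Time.s"]).astype(int).rename("second")

# Mean and standard deviation of the voltage within each one-second group
stats = peaks.groupby(second)["Volt.mv"].agg(["mean", "std"])
```

A group with a single row has no sample standard deviation, so its std is NaN.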
I have 3 different columns of timestamps in a pandas dataframe, two of which have fraction of seconds recorded while the third does not have fraction of seconds. I would like to calculate an average of these 3 columns.
I have already tried to compute the average using the mean function on the columns and consistently received nan as the result
import pandas as pd
data = [{'time1': '2018-07-22 04:34:10.8966', 'time2': '2017-07-22 04:34:10.8966', 'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data)
df['estimate'] = df[['time1', 'time2', 'time3']].mean(axis=1)
df
Expected : An average of the 3 timestamps
Actual : While there is no error, it also always evaluates to nan which is not what is desired.
As far as I know you can't do it directly on datetime values; you need to convert them to integers, average, and then convert back:
import numpy as np
import pandas as pd
data = [{'time1': '2018-07-22 04:34:10.8966', 'time2': '2017-07-22 04:34:10.8966', 'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data).apply(pd.to_datetime)
df['estimate'] = pd.to_datetime(df[['time1', 'time2', 'time3']].values.astype(np.int64).mean(axis=1))
Result:
                       time1                      time2               time3                      estimate
0 2018-07-22 04:34:10.896600 2017-07-22 04:34:10.896600 2018-07-27 00:10:04 2018-03-24 03:06:08.597733376
I have a pandas dataframe of about 3 years with the resolution of 6 seconds and I want to group the data into 24-hour bins and plot each day using matplotlib in a loop.
This is my dataframe's head:
timestamp consumption
0 2012-11-11 12:00:03 468
1 2012-11-11 12:00:09 476
2 2012-11-11 12:00:16 463
3 2012-11-11 12:00:22 449
4 2012-11-11 12:00:28 449
It includes the power consumption of a house from 2012 till 2015. After the pre-processing, the dataframe starts at about 12 pm on the first day. I need to plot the whole dataframe in 24-hour intervals, and each plot must represent a single day that starts at about 12 pm and ends at about 12 pm the next day.
So, I need about 1500 plots that show the power consumption of each day starting from 12 pm, for about 1500 days of my dataframe.
Thanks in advance.
Update: The reason I want to plot 1500 days separately, is I want to check each night's power consumption and label the occupant's sleep pattern. And I considered each day from 12 pm to 12 pm to have a complete sleep cycle in one plot. And after preparing the labels I'll be able to use them as train and test data for classification
Consider this not only an answer but also a suggestion. First, convert the column 'timestamp' into the index (DatetimeIndex)
df.set_index(df['timestamp'], inplace=True, drop=True)
Then, get all the unique days that happen in your DataFrame
unique_days = sorted(set(df.index.to_period('D').strftime('%Y-%m-%d')))
We then squeeze the DataFrame into a Series
del df['timestamp']
df = df.squeeze()
Now, just plot unique days in your series in separate subplots.
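import matplotlib.pyplot as plt
# sorted() keeps the subplots in date order (a plain set has no defined order)
unique_days = sorted(set(df.index.to_period('D').strftime('%Y-%m-%d')))
fig, axes = plt.subplots(nrows=len(unique_days), ncols=1)
for row, day in enumerate(unique_days):
    # select all samples of that calendar day by partial string indexing
    df.loc[day].plot(ax=axes[row], figsize=(50, 10))
plt.show()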
Now, it's time for you to play around with the parameters of plots so that you can customize them to your needs.
This is kind of a strange request. If we knew what your end objective is, it might be easier to understand, but I'm going to assume you want to plot and then save figures for each of the days.
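import pandas as pd
import matplotlib.pyplot as plt
# Shift by 12 hours so that each noon-to-noon period falls on a single calendar date (the date on which it ends)
df['day'] = (df['timestamp'] + pd.Timedelta('12h')).dt.date
for day in df['day'].unique():
    mask = (df['day'] == day)
    # <the code for the plot that you want>
    # plt.plot takes x and y as positional arguments
    plt.plot(df.loc[mask, 'timestamp'], df.loc[mask, 'consumption'])
    plt.savefig('filename' + str(day) + '.png')
    plt.close()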
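The noon-to-noon grouping can also be kept separate from the plotting, which makes it easy to check on a small sample before producing about 1500 files. This is a sketch under stated assumptions: the helper names add_noon_day and save_daily_plots and the file name prefix are hypothetical, and each period is labelled with the date on which it starts (hence the subtraction of 12 hours).

```python
import numpy as np
import pandas as pd

def add_noon_day(df):
    """Label each row with the date on which its noon-to-noon period starts."""
    out = df.copy()
    out["day"] = (out["timestamp"] - pd.Timedelta(hours=12)).dt.date
    return out

def save_daily_plots(df, prefix="consumption_"):
    """Save one PNG per noon-to-noon period (matplotlib is imported only when called)."""
    import matplotlib
    matplotlib.use("Agg")  # render to files without opening windows
    import matplotlib.pyplot as plt
    for day, chunk in add_noon_day(df).groupby("day"):
        fig, ax = plt.subplots(figsize=(12, 4))
        ax.plot(chunk["timestamp"], chunk["consumption"])
        ax.set_title(f"{day} 12:00 to 12:00 the next day")
        fig.savefig(f"{prefix}{day}.png")
        plt.close(fig)  # free memory before the next figure

# Small demo: six samples, 8 hours apart, starting just after noon
timestamps = pd.date_range("2012-11-11 12:00:03", periods=6, freq=pd.Timedelta(hours=8))
demo = pd.DataFrame({"timestamp": timestamps, "consumption": np.arange(6)})
labelled = add_noon_day(demo)
```

Calling save_daily_plots(df) on the full dataframe would then write one image per period; closing each figure matters when the loop runs over this many days.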