Resample (or loop) using log mean - python-3.x

Is there a way to resample using a log mean? I have read the resample documentation and cannot find any option for log-mean resampling.
I have a large DataFrame with a datetime index, with observations for every minute. I need to calculate the log mean over every 5 minutes for a range of variables (columns).
I have provided some code below showing some example data and the calculation I want to carry out. If there isn't a log-mean resampling function 'out of the box', I may need to code a loop to do this...?
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'db' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ]}, index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T00:14:00', freq='1T'))
df1 = df1.resample('5T').mean() # <------ is there a way to do log mean for this?
# The calculation I need to do is (pseudocode):
df2 = np.log10(10**(df1[observation minute 1]/10) + 10**(df1[observation minute 2]/10) + 10**(df1[observation minute 3]/10) + 10**(df1[observation minute 4]/10) + 10**(df1[observation minute 5]/10))
# Where 'observation minute 1, 2, 3, 4, 5' are the 5 minutes I want to resample over.
# The resulting DataFrame I need is:
df_result = pd.DataFrame({'log_mean' : [np.log10(10**((1/10)) + 10**((2/10)) + 10**((3/10)) + 10**((4/10)) + 10**((5/10))), np.log10(10**((6/10)) + 10**((7/10)) + 10**((8/10)) + 10**((9/10)) + 10**((10/10))), np.log10(10**((11/10)) + 10**((12/10)) + 10**((13/10)) + 10**((14/10)) + 10**((15/10)))]}, index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T00:14:00', freq='5T'))
Any guidance would be gratefully received.

Turns out you can resample using any function of your choosing via apply:
df1 = df1.resample('5T').apply(lambda spl: 10*np.log10(np.mean(np.power(10, spl/10))))
or you can define it separately:
def log_avg(spl_arraylike):
    return 10*np.log10(np.mean(np.power(10, spl_arraylike/10)))
df1 = df1.resample('5T').apply(log_avg)
This returns a DataFrame with the following values:
2019-05-02 00:00:00 3.227668
2019-05-02 00:05:00 8.227668
2019-05-02 00:10:00 13.227668
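As a quick sanity check (a sketch, not part of the original answer), the first bin's value can be reproduced by applying the same formula to the first five observations by hand:
import numpy as np
print(10 * np.log10(np.mean(np.power(10, np.arange(1, 6) / 10))))
# 3.227668  (matches the first row above)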

Related

Why Pandas stack operation fails depending on the column type?

I have an issue with an unstack / stack operation on pandas 1.1.5, with Python 3.8.10.
Let's say I have a pandas DataFrame looking like this:
import datetime as dt
import pandas as pd
# Data
category_1 = ["cat1"] * 18
category_2 = ["CAT1", "CAT2", "CAT3", "CAT4", "CAT5", "CAT6"] * 3
dates = [dt.datetime(2022, 11, 1)] * 6 + [dt.datetime(2022, 11, 2)] * 6 + [dt.datetime(2022, 11, 3)] * 6
numbers = [50] * 18
# Dict
df_dict = {
    "category_1": category_1,
    "category_2": category_2,
    "dates": dates,
    "numbers": numbers,
}
df = (pd.DataFrame(df_dict))
df = df.astype({"numbers": "Int64"}) # specific needs to handle NaN values with int column
df = df.set_index(["category_1", "category_2", "dates"])
df.head()
and I want to unstack on category 1 & 2 to manipulate the dates index (i.e. filling missing dates, other operations, whatever; not really relevant to the specific question).
df= df.unstack(["category_1", "category_2"])
df
If I then stack the same columns (even without any manipulation), I get the following error:
df.stack(["category_1", "category_2"])
IndexError: index 3 is out of bounds for axis 0 with size 3
I was able to solve this issue by removing the forced conversion to Int64:
df = df.astype({"numbers": "int64"}) # No longer an issue to unstack / stack
I was expecting the same bug but didn't get it.
Can someone help me solve this issue, or explain what could be the problem 'under the hood'?
Thanks.
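One possible workaround (a sketch, not from the original post): do the unstack/stack round trip with a plain int64 column and restore the nullable Int64 dtype afterwards:
df_plain = df.astype({"numbers": "int64"})  # plain NumPy dtype for the round trip
out = df_plain.unstack(["category_1", "category_2"]).stack(["category_1", "category_2"])
out = out.astype({"numbers": "Int64"})  # restore the nullable dtype at the end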

Matplotlib Annotate using values from DataFrame

I'm trying to annotate a chart to include the plotted values of the x-axis as well as additional information from the DataFrame. I am able to annotate the values from the x-axis but am not sure how to add additional information from the DataFrame. In my example below I am annotating the x-axis values, which come from the Completion column, but I also want to add the Completed and Participants values from the DataFrame.
For example, the Running Completion is 20%, but I want my annotation to show the Completed and Participants values in the format 20% (2/10). Below is sample code that reproduces my scenario as well as the current and desired results. Any help is appreciated.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydict = {
    'Event': ['Running', 'Swimming', 'Biking', 'Hiking'],
    'Completed': [2, 4, 3, 7],
    'Participants': [10, 20, 35, 10]}
df = pd.DataFrame(mydict).set_index('Event')
df = df.assign(Completion=(df.Completed/df.Participants) * 100)
print(df)
plt.subplots(figsize=(5, 3))
ax = sns.barplot(x=df.Completion, y=df.index, color="cyan", orient='h')
for i in ax.patches:
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%', fontsize=10)
plt.tight_layout()
plt.show()
DataFrame:
Completed Participants Completion
Event
Running 2 10 20.000000
Swimming 4 20 20.000000
Biking 3 35 8.571429
Hiking 7 10 70.000000
Current and desired outputs were shown as chart images in the original post.
Loop through the columns Completed and Participants as well when you annotate:
for (c, p), i in zip(df[["Completed", "Participants"]].values, ax.patches):
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%' + f" ({c}/{p})", fontsize=10)
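As a side note (an alternative sketch, not part of the original answer), matplotlib 3.4+ offers ax.bar_label, which can attach the same composite labels in one call, assuming the same df and ax as above:
labels = [f"{round(pct, 2)}% ({c}/{p})"
          for pct, c, p in zip(df.Completion, df.Completed, df.Participants)]
ax.bar_label(ax.containers[0], labels=labels, fontsize=10, padding=4)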

computing the mean for python datetime

I have a datetime attribute:
import datetime
import numpy as np
import pandas as pd

d = {
    'DOB': pd.Series([
        datetime.datetime(2014, 7, 9),
        datetime.datetime(2014, 7, 15),
        np.datetime64('NaT')
    ], index=['a', 'b', 'c'])
}
df_test = pd.DataFrame(d)
I would like to compute the mean for that attribute. Running mean() causes an error:
TypeError: reduction operation 'mean' not allowed for this dtype
I also tried the solution proposed elsewhere. It doesn't work as running the function proposed there causes
OverflowError: Python int too large to convert to C long
What would you propose? The result for the above dataframe should be equivalent to
datetime.datetime(2014, 7, 12).
You can take the mean of Timedeltas: find the minimum value, subtract it from the series to get a series of Timedeltas, take the mean of those, and add it back to the minimum.
dob = df_test.DOB
m = dob.min()
(m + (dob - m).mean()).to_pydatetime()
datetime.datetime(2014, 7, 12, 0, 0)
One-liner:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(d.min())).to_pydatetime()
Following @ALollz's point, use the epoch pd.Timestamp(0) instead of the minimum:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(pd.Timestamp(0))).to_pydatetime()
You can convert to epoch time using astype with np.int64, take the mean, and convert back to datetime with pd.to_datetime:
pd.to_datetime(df_test.DOB.dropna().astype(np.int64).mean())
Output:
Timestamp('2014-07-12 00:00:00')
You could work with unix time if you want. This is defined as the total number of seconds (for instance) since 1970-01-01. With that, all of your times are simply floats, so it's very easy to do simple math on the columns.
import pandas as pd
df_test['unix_time'] = (df_test.DOB - pd.to_datetime('1970-01-01')).dt.total_seconds()
df_test['unix_time'].mean()
#1405123200.0
# You want it in date, so just convert back
pd.to_datetime(df_test['unix_time'].mean(), origin='unix', unit='s')
#Timestamp('2014-07-12 00:00:00')
Datetime math supports some standard operations:
a = datetime.datetime(2014, 7, 9)
b = datetime.datetime(2014, 7, 15)
c = (b - a)/2
# here c will be datetime.timedelta(3)
a + c
Out[7]: datetime.datetime(2014, 7, 12, 0, 0)
So you can write a function that, given two datetimes, subtracts the lesser from the greater and adds half of the difference to the lesser. Apply this function to your dataframe, and shazam!
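A minimal sketch of such a function (the midpoint name is mine, not from the answer):
import datetime

def midpoint(a, b):
    # smaller datetime plus half of the gap to the larger one
    lo, hi = sorted((a, b))
    return lo + (hi - lo) / 2

print(midpoint(datetime.datetime(2014, 7, 9), datetime.datetime(2014, 7, 15)))
# 2014-07-12 00:00:00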
As of pandas 0.25, it is possible to compute the mean of a datetime series.
In [1]: import pandas as pd
...: import numpy as np
In [2]: s = pd.Series([
...: pd.datetime(2014, 7, 9),
...: pd.datetime(2014, 7, 15),
...: np.datetime64('NaT')])
In [3]: s.mean()
Out[3]: Timestamp('2014-07-12 00:00:00')
However, note that applying mean to a pandas DataFrame currently ignores datetime columns.

Why is my data not recognized as time series?

I have daily (day) data on calorie intake for one person (cal2), which I read from a Stata .dta file.
I run the code below:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import read_csv
from matplotlib.pylab import rcParams
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True,
                  index='day', convert_dates=True)
print(d.dtypes)
print(d.shape)
print(d.index)
print(d.head)
plt.plot(d)
This is what the data looks like:
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
The prints reveal the following:
day datetime64[ns]
cal2 float32
dtype: object
(251, 2)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
dtype='int64', length=251)
And here is the problem: the index should be recognized as dtype='datetime64[ns]'.
However, it clearly is not. Why not?
There is a discrepancy between the code provided, the data, and the types shown:
irrespective of the type of cal2, the index='day' argument in pd.read_stata()
should always make day the index, albeit not of the desired type.
With that said, the problem can be reproduced as follows.
First, create the dataset in Stata:
clear
input double day float cal2
15350 3668.433
15351 3652.25
15352 3647.866
15353 3646.684
15354 3661.9414
15355 3656.952
end
format %td day
save time_series_calories
describe
Contains data from time_series_calories.dta
obs: 6
vars: 2
size: 72
----------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------
day double %td
cal2 float %9.0g
----------------------------------------------------------------------------------------------------
Sorted by:
Second, load the data in Pandas:
import pandas as pd
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, convert_dates=True)
print(d.head)
day cal2
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
print(d.dtypes)
day datetime64[ns]
cal2 float32
dtype: object
print(d.shape)
(6, 2)
print(d.index)
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
In order to change the index as desired, you can use DataFrame.set_index():
d = d.set_index('day')
print(d.head)
cal2
day
2002-01-10 3668.433350
2002-01-11 3652.249756
2002-01-12 3647.866211
2002-01-13 3646.684326
2002-01-14 3661.941406
2002-01-15 3656.951660
print(d.index)
DatetimeIndex(['2002-01-10', '2002-01-11', '2002-01-12', '2002-01-13',
'2002-01-14', '2002-01-15'],
dtype='datetime64[ns]', name='day', freq=None)
If day is a string in the Stata dataset, then you can do the following:
d['day'] = pd.to_datetime(d.day)
d = d.set_index('day')
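A compact variant (a sketch assuming the same .dta file as above): set the index straight after loading, and the resulting DatetimeIndex then drives the plot's x-axis:
import pandas as pd
import matplotlib.pyplot as plt

d = pd.read_stata('time_series_calories.dta', convert_dates=True).set_index('day')
d['cal2'].plot()  # the x-axis is now the DatetimeIndex
plt.show()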

Iterating through values to find average equation of a line (Python3)

I am trying to find the equation of a line within a DF
Here is a fake data set to explain:
Clicks Sales
5 10
5 11
10 16
10 20
10 18
15 28
15 26
... ...
100 200
What I am trying to do:
Calculate the equation of the line between the points so that I am able to input a number of clicks and get a predicted number of sales at any level. The thing I am trying to wrap my brain around is that I have many different line functions (e.g. there are multiple Sales values for each number of Clicks). How can I iterate through my DF to calculate just one aggregate line function?
Here's what I have, but it only accepts ONE input at a time; I would like to create an average or aggregate...
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def slope(self, target):
        # rise over run between self and target
        return (target.y - self.y) / (target.x - self.x)

    def y_int(self, target):  # <= here's the magic
        return self.y - self.slope(target)*self.x

    def line_function(self, target):
        slope = self.slope(target)
        y_int = self.y_int(target)
        def fn(x):
            return slope*x + y_int
        return fn

a = Point(5, 10)   # I am stuck here since - what to input!?
b = Point(10, 16)  # I am stuck here since - what to input!?
line = a.line_function(b)
print(line(x=10))
Use the scipy function scipy.stats.linregress to fit your data.
Maybe also check https://en.wikipedia.org/wiki/Linear_regression to better understand linear regression.
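A minimal sketch of that suggestion (assuming the example data above; the predict helper name is mine):
import pandas as pd
from scipy import stats

df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
                   'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
fit = stats.linregress(df['Clicks'], df['Sales'])

def predict(clicks):
    # one aggregate line: sales ~= slope * clicks + intercept
    return fit.slope * clicks + fit.intercept

print(predict(10))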
You could group by Clicks and take the average of the Sales per group:
In [307]: sales = df.groupby('Clicks')['Sales'].mean(); sales
Out[307]:
Clicks
5 10.5
10 18.0
15 27.0
100 200.0
Name: Sales, dtype: float64
Then form the piecewise linear interpolating function based on
the groupwise-averaged data above using interpolate.interp1d:
from scipy import interpolate
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
For example,
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt
df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
sales = df.groupby('Clicks')['Sales'].mean()
Once you have the groupwise-averaged sales, you can compute the interpolated sales
a number of ways. One way is to use np.interp:
newx = [10]
print(np.interp(newx, sales.index, sales.values))
# [ 18.] <-- The interpolated sales when the number of clicks is 10 (newx)
The problem with np.interp is that you are passing sales.index and sales.values to np.interp every time you call it -- it has no memory of the interpolating function. It is re-computing the interpolating function every time you call it.
If you have scipy, then you could create the interpolating function once and then use it as many times as you like later:
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
print(fn(newx))
# [ 18.]
For example, you could evaluate the interpolating function at a whole bunch of points (and plot the result) like this:
newx = np.linspace(5, 100, 100)
plt.plot(newx, fn(newx))
plt.plot(df['Clicks'], df['Sales'], 'o')
plt.show()
Pandas Series (and DataFrames) have an interpolate method too. To use it, you reindex the Series to include the points where you wish to interpolate:
In [308]: sales.reindex(sales.index.union([14]))
Out[308]:
5 10.5
10 18.0
14 NaN
15 27.0
100 200.0
Name: Sales, dtype: float64
and then interpolate fills in the interpolated values where the Series is NaN:
In [295]: sales.reindex(sales.index.union([14])).interpolate('values')
Out[295]:
5 10.5
10 18.0
14 25.2 # <-- interpolated value
15 27.0
100 200.0
Name: Sales, dtype: float64
But I think it is perhaps not appropriate for your problem since it does not
return just the interpolated values you are looking for; it returns a whole
Series.
