computing the mean for python datetime - python-3.x

I have a datetime attribute:
import datetime
import numpy as np
import pandas as pd

d = {
    'DOB': pd.Series([
        datetime.datetime(2014, 7, 9),
        datetime.datetime(2014, 7, 15),
        np.datetime64('NaT')
    ], index=['a', 'b', 'c'])
}
df_test = pd.DataFrame(d)
I would like to compute the mean for that attribute. Running mean() causes an error:
TypeError: reduction operation 'mean' not allowed for this dtype
I also tried the solution proposed elsewhere. It doesn't work either; running the function proposed there raises
OverflowError: Python int too large to convert to C long
What would you propose? The result for the above dataframe should be equivalent to
datetime.datetime(2014, 7, 12).

You can take the mean of a Timedelta series. So find the minimum value and subtract it from the series to get a series of Timedelta, take the mean of that, and add the result back to the minimum.
dob = df_test.DOB
m = dob.min()
(m + (dob - m).mean()).to_pydatetime()
datetime.datetime(2014, 7, 12, 0, 0)
As a one-liner:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(d.min())).to_pydatetime()
To @ALollz's point, I use the epoch pd.Timestamp(0) instead of the minimum:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(pd.Timestamp(0))).to_pydatetime()

You can convert to epoch time using astype with np.int64, take the mean, and convert back to datetime with pd.to_datetime:
pd.to_datetime(df_test.DOB.dropna().astype(np.int64).mean())
Output:
Timestamp('2014-07-12 00:00:00')

You could work with unix time if you want. This is defined as the total number of seconds (for instance) since 1970-01-01. With that, all of your times are simply floats, so it's very easy to do simple math on the columns.
import pandas as pd
df_test['unix_time'] = (df_test.DOB - pd.to_datetime('1970-01-01')).dt.total_seconds()
df_test['unix_time'].mean()
#1405123200.0
# You want it in date, so just convert back
pd.to_datetime(df_test['unix_time'].mean(), origin='unix', unit='s')
#Timestamp('2014-07-12 00:00:00')

Datetime math supports some standard operations:
a = datetime.datetime(2014, 7, 9)
b = datetime.datetime(2014, 7, 15)
c = (b - a)/2
# here c will be datetime.timedelta(3)
a + c
Out[7]: datetime.datetime(2014, 7, 12, 0, 0)
So you can write a function that, given two datetimes, subtracts the lesser from the greater and adds half of the difference to the lesser. Apply this function to your dataframe, and shazam!
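For example, a minimal sketch of that idea for two datetimes (the midpoint helper name is just for illustration):
import datetime
def midpoint(a, b):
    # halfway point between two datetimes; timedelta / 2 returns a timedelta
    lesser, greater = min(a, b), max(a, b)
    return lesser + (greater - lesser) / 2
midpoint(datetime.datetime(2014, 7, 9), datetime.datetime(2014, 7, 15))
# datetime.datetime(2014, 7, 12, 0, 0)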

As of pandas 0.25, it is possible to compute the mean of a datetime series.
In [1]: import pandas as pd
...: import numpy as np
In [2]: s = pd.Series([
...: pd.datetime(2014, 7, 9),
...: pd.datetime(2014, 7, 15),
...: np.datetime64('NaT')])
In [3]: s.mean()
Out[3]: Timestamp('2014-07-12 00:00:00')
However, note that applying mean() to a pandas DataFrame currently ignores datetime columns.
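As a workaround (a small sketch restating the series call above, using the question's df_test), you can take the mean of the datetime column directly rather than through the DataFrame:
df_test['DOB'].mean()
# Timestamp('2014-07-12 00:00:00')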

Related

Spark 3 with Pandas Vectorised UDFs

I'm looking at using Pandas UDFs in PySpark (v3). For a number of reasons, I understand that iterating and UDFs in general are bad, and I understand that the simple examples I show here can be done in PySpark using SQL functions - all of that is beside the point!
I've been following this guide: https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
I have a simple example working from the docs:
import pandas as pd
from typing import Iterator, Tuple
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame(([1, 2, 3], [4, 5, 6], [8, 9, 0]), columns=["x", "y", "z"])
df = spark.createDataFrame(pdf)
@pandas_udf('long')
def test1(x: pd.Series, y: pd.Series) -> pd.Series:
    return x + y
df.select(test1(col("x"), col("y"))).show()
And this works well for performing basic arithmetic - if I want to add, multiply, etc., this is straightforward (but it is also straightforward in PySpark without pandas UDFs).
I want to do a comparison between the values for example:
@pandas_udf('long')
def test2(x: pd.Series, y: pd.Series) -> pd.Series:
    return x if x > y else y
df.select(test2(col("x"), col("y"))).show()
This errors with ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I understand that it is evaluating the whole series rather than the row value.
So there is an iterator example. Again this works fine for the basic arithmetic example they provide. But if I try to apply logic:
@pandas_udf("long")
def test3(batch_iter: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    for x, y in batch_iter:
        yield x if x > y else y
df.select(test3(col("x"), col("y"))).show()
I get the same ValueError as before.
So my question is: how should I perform row-by-row comparisons like this? Is it possible in a vectorised function? And if not, what are the use cases for them?
I figured this out. So simple after you write it down and publish the problem to the world.
All that needs to happen is to return an array and then convert to a Pandas Series:
@pandas_udf('long')
def test4(x: pd.Series, y: pd.Series) -> pd.Series:
    return pd.Series([a if a > b else b for a, b in zip(x, y)])
df.select(test4(col("x"), col("y"))).show()
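As an aside (not part of the original answer), the same row-wise maximum can also be written in a fully vectorised way, avoiding the Python loop inside the UDF. This sketch reuses df, col and pandas_udf from above, and the name test5 is just for illustration:
@pandas_udf('long')
def test5(x: pd.Series, y: pd.Series) -> pd.Series:
    # Series.where keeps x where the condition holds and falls back to y,
    # i.e. an element-wise maximum
    return x.where(x > y, y)
df.select(test5(col("x"), col("y"))).show()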
I've spent the last two days looking for this answer, thank you simon_dmorias!
I needed a slightly modified example here. I'm breaking out the single pandas_udf into multiple components for easier management. Here is an example of what I'm using for others to reference:
import pyspark.sql.functions as sf  # needed for sf.col below
xdf = pd.DataFrame(([1, 2, 3, 'Fixed'], [4, 5, 6, 'Variable'], [8, 9, 0, 'Adjustable']), columns=["x", "y", "z", "Description"])
df = spark.createDataFrame(xdf)
def fnRate(x):
    return pd.Series(['Fixed' if 'Fixed' in str(v) else 'Variable' if 'Variable' in str(v) else 'Other' for v in zip(x)])
@pandas_udf('string')
def fnRateRecommended(Description: pd.Series) -> pd.Series:
    varProduct = fnRate(Description)
    return varProduct
# call the function on the Description column
df.withColumn("Recommendation", fnRateRecommended(sf.col("Description"))).show()

Resample (or loop) using log mean

Is there a way to resample using log mean? I have read the resample documentation and cannot find any options for log-mean resampling.
I have a large dataframe with a datetime index, with observations for every minute. I need to calculate the log mean of every 5 minutes for a range of variables (columns).
I have provided some code below showing some example data and the calculation I want to carry out. It might be that, if there isn't a log-mean resampling function 'out of the box', I will need to code a loop to do this...?
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'db' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ]}, index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T00:14:00', freq='1T'))
df1 = df1.resample('5T').mean() # <------ is there a way to do log mean for this?
# The calculation I need to do is (pseudocode):
# df2 = np.log10(10**(df1[observation minute 1]/10) + 10**(df1[observation minute 2]/10)
#                + 10**(df1[observation minute 3]/10) + 10**(df1[observation minute 4]/10)
#                + 10**(df1[observation minute 5]/10))
# where 'observation minute 1,2,3,4,5' are the 5 minutes I want to resample over.
# The resulting df I need is:
df_result = pd.DataFrame(
    {'log_mean': [
        np.log10(10**(1/10) + 10**(2/10) + 10**(3/10) + 10**(4/10) + 10**(5/10)),
        np.log10(10**(6/10) + 10**(7/10) + 10**(8/10) + 10**(9/10) + 10**(10/10)),
        np.log10(10**(11/10) + 10**(12/10) + 10**(13/10) + 10**(14/10) + 10**(15/10))
    ]},
    index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T00:14:00', freq='5T'))
Any guidance would be gratefully received.
It turns out you can resample using any function of your choosing via apply:
df1 = df1.resample('5T').apply(lambda spl: 10*np.log10(np.mean(np.power(10, spl/10))))
Or you can define the function separately:
def log_avg(spl_arraylike):
    return 10*np.log10(np.mean(np.power(10, spl_arraylike/10)))
df1 = df1.resample('5T').apply(log_avg)
This returns a dataframe with the following values:
2019-05-02 00:00:00 3.227668
2019-05-02 00:05:00 8.227668
2019-05-02 00:10:00 13.227668
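As a quick sanity check (a sketch, not part of the original answer), the first 5-minute bin can be reproduced by hand from the values 1 through 5:
import numpy as np
vals = np.array([1, 2, 3, 4, 5])
10 * np.log10(np.mean(10 ** (vals / 10)))
# ~3.227668, matching the first resampled value above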

How to filter this type of data?

If I have some numpy arrays like
a = np.array([1,2,3,4,5])
b = np.array([4,5,7,8])
c = np.array([4,5])
I need to combine these arrays without repeating a number. My expected output is [1,2,3,4,5,7,8].
How do I combine them? Which function should I use?
One more approach you can try is using reduce from functools and union1d from numpy.
For example:
from functools import reduce
reduce(np.union1d, (a, b, c))
Output:
array([1, 2, 3, 4, 5, 7, 8])
You can use numpy.concatenate with numpy.unique:
d = np.unique(np.concatenate((a,b,c)))
print(d)
Output:
[1 2 3 4 5 7 8]
Python has a datatype called set:
A set is an unordered collection with no duplicate elements
The easiest way to create a set out of your arrays is to unpack them into a set literal:
>>> import numpy as np
>>> a=np.array([1,2,3,4,5])
>>> b=np.array([4,5,7,8])
>>> c=np.array([4,5])
>>> {*a, *b, *c}
{1, 2, 3, 4, 5, 7, 8}
Please note that a set is unordered, so this is not the right answer for you if the order of the elements in your arrays is important.
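If you do need an ordered numpy array back, a small sketch building on the set approach:
>>> np.array(sorted({*a, *b, *c}))
array([1, 2, 3, 4, 5, 7, 8])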

How to create the datetime object equivalent of numpy.nan without using Pandas?

Given a numpy array of numerical data, one can retrieve the indices of the array that match a particular condition. Because np.nan is of type <float>, these conditions can still be evaluated. I would like to use a similar method using datetime objects instead of numerical data.
As an example, consider the following:
import numpy as np
import datetime
# sample data
x = np.linspace(-5, 5, 11).astype(int)
y = x**2
z = y.copy().astype(float)
z[0] = np.nan
# check that nan condition is correctly satisfied
print("\nZ:\n{}\n".format(z))
print("\nZ > 0:\n{}\n".format(z > 0))
print("\nZ < 0:\n{}\n".format(z < 0))
print("\nZ = 0:\n{}\n".format(z == 0))
# get number of elapsed hours between two datetime objects
dt0 = datetime.datetime(2018, 1, 11, 23, 59, 59)
dt1 = datetime.datetime(2018, 1, 12, 23, 59, 59)
hr_delta = (dt1 - dt0).total_seconds() // 3600
print("\n{} - {} = {} hours".format(dt1, dt0, hr_delta))
Since np.nan - 5 = np.nan, it would be very convenient for me to use this idea to create an invalid datetime object. However, datetime objects require the inputs of type <int>, whereas np.nan is type <float>. The example below recreates the source of my problem:
# attempt to instantiate invalid datetime object
dt2 = datetime.datetime(np.nan, 1, 11, 23, 59, 59)
which throws TypeError: integer argument expected, got float. Is there a way to do this without using pandas, as suggested in this alternate post? Just to clarify, my goal is to create an invalid datetime object such that its subtraction or addition with a valid datetime object returns a nan-like value.
Anything pandas does is usually something Numpy does.
In this instance, you need to store the proper type, which is not datetime but np.datetime64. Once you have this type, you can fill missing values with np.datetime64('NaT') and then use np.isnat to test whether a value is a time or not.
datetime itself doesn't provide a concept of "not a time"; you have to use a type that does. You should be able to use np.datetime64 in much the same fashion as a datetime.
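A minimal sketch of that behaviour (assuming only NumPy is available):
import numpy as np
nat = np.datetime64('NaT')
dt = np.datetime64('2018-01-12T23:59:59')
dt - nat            # numpy.timedelta64('NaT'); NaT propagates through arithmetic
np.isnat(dt - nat)  # True
np.isnat(dt)        # False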

Using Theano.scan with multidimensional arrays

To speed up my code I am converting a multidimensional sumproduct function from Python to Theano. My Theano code reaches the same result, but it only calculates the result for one dimension at a time, so I have to use a Python for-loop to get the end result. I assume that would make the code slow, because Theano cannot optimize memory usage and transfer (for the GPU) between multiple function calls. Or is this a wrong assumption?
So how can I change the Theano code, so that the sumprod is calculated in one function call?
The original Python function:
def sumprod(a1, a2):
    """Sum the element-wise products of the `a1` and `a2`."""
    result = numpy.zeros_like(a1[0])
    for i, j in zip(a1, a2):
        result += i*j
    return result
For the following input
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
the output would be: [ 26. 40. 65.] that is 1*1 + 5*5, 2*2 + 6*6 and 4*4 + 7*7
The Theano version of the code:
import theano
import theano.tensor as T
import numpy
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
# wanted result: [ 26. 40. 65.]
# that is 1*1 + 5*5, 2*2 + 6*6 and 4*4 + 7*7
Tk = T.iscalar('Tk')
Ta1_shared = theano.shared(numpy.array(a1).T)
Ta2_shared = theano.shared(numpy.array(a2).T)
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(
    fn=lambda Ta1_shared, Ta2_shared, prior_value: prior_value + Ta1_shared * Ta2_shared,
    outputs_info=outputs_info,
    sequences=[Ta1_shared[Tk], Ta2_shared[Tk]])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function([Tk], outputs=Tsumprod_result)
result = numpy.zeros_like(a1[0])
for i in range(len(a1[0])):
result[i] = Tsumprod(i)
print(result)
First, more people will answer your questions on the Theano mailing list than on Stack Overflow. But I'm here :)
Second, your function isn't a good fit for the GPU. Even if everything were well optimized, the transfer of the input to the GPU just to do the multiply-and-sum would take more time than running the Python version.
Your Python code is slow; here is a version that should be faster:
def sumprod(a1, a2):
    """Sum the element-wise products of the `a1` and `a2`."""
    a1 = numpy.asarray(a1)
    a2 = numpy.asarray(a2)
    result = (a1 * a2).sum(axis=0)
    return result
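As a quick sanity check (a sketch reusing the inputs from the question), the vectorised version reproduces the expected result:
a1 = ([1, 2, 4], [5, 6, 7])
a2 = ([1, 2, 4], [5, 6, 7])
print(sumprod(a1, a2))  # [26 40 65]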
For the Theano code, here is the equivalent of this faster Python version (no need for scan):
m1 = theano.tensor.matrix()
m2 = theano.tensor.matrix()
f = theano.function([m1, m2], (m1 * m2).sum(axis=0))
The thing to remember from this is that you need to "vectorize" your code. "Vectorize" here is meant in the NumPy sense: use numpy.ndarray and functions that work on the full tensor at a time. This is always faster than doing it with a loop (a Python loop or theano.scan). Also, Theano optimizes some of those cases by moving the computation outside the scan, but it doesn't always do so.
