Why is my data not recognized as time series? - python-3.x

I have daily (day) data on calories intake for one person (cal2), which I get from a Stata dta file.
I run the code below:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import read_csv
from matplotlib.pylab import rcParams
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True,
index = 'day', convert_dates=True)
print(d.dtypes)
print(d.shape)
print(d.index)
print(d.head)
plt.plot(d)
This is how the data looks like:
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
The prints reveal the following:
day datetime64[ns]
cal2 float32
dtype: object
(251, 2)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
dtype='int64', length=251)
And here is the problem - the data should identify as dtype='datatime64[ns]'.
However, it clearly does not. Why not?

There is a discrepancy between the code provided, the data and the types shown.
This is because irrespective of the type of cal2, the index = 'day' argument
in pd.read_stata() should always render day the index, albeit not as the
desired type.
With that said, the problem can be reproduce as follows.
First, create the dataset in Stata:
clear
input double day float cal2
15350 3668.433
15351 3652.25
15352 3647.866
15353 3646.684
15354 3661.9414
15355 3656.952
end
format %td day
save time_series_calories
describe
Contains data from time_series_calories.dta
obs: 6
vars: 2
size: 72
----------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------
day double %td
cal2 float %9.0g
----------------------------------------------------------------------------------------------------
Sorted by:
Second, load the data in Pandas:
import pandas as pd
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, convert_dates=True)
print(d.head)
day cal2
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
print(d.dtypes)
day datetime64[ns]
cal2 float32
dtype: object
print(d.shape)
(6, 2)
print(d.index)
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
In order to change the index as desired, you can use pd.set_index():
d = d.set_index('day')
print(d.head)
cal2
day
2002-01-10 3668.433350
2002-01-11 3652.249756
2002-01-12 3647.866211
2002-01-13 3646.684326
2002-01-14 3661.941406
2002-01-15 3656.951660
print(d.index)
DatetimeIndex(['2002-01-10', '2002-01-11', '2002-01-12', '2002-01-13',
'2002-01-14', '2002-01-15'],
dtype='datetime64[ns]', name='day', freq=None)
If day is a string in the Stata dataset, then you can do the following:
d['day'] = pd.to_datetime(d.day)
d = d.set_index('day')

Related

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data using Pandas in Python3 to calculate spatial statistics.
When I am running a for loop over the range of latitude and longitude using .loc, gives an error KeyError: (slice(None, None, None), ) for the particular set of latitude and longitude for what no values are available in the input file. Instead of skipping those values, it gives an error and stops running the code. Following is my code.
import numpy as np
import pandas as pd
from scipy import stats
filename='input.txt'
df = pd.read_csv(filename,delim_whitespace=True, header=None, names = ['year','month','lat','lon','aod'], index_col = ['year','month','lat','lon'])
idx=pd.IndexSlice
for i in range (1, 13):
for lat0 in N.arange(0.,40.25,0.25,dtype=float):
for lon0 in N.arange(20.0,75.25,0.25,dtype=float):
tmp = df.loc[idx[:,i,lat0,lon0],:]
if (len(tmp) <= 0):
continue
tmp2 = tmp.index.tolist()
In the code above, if I run for tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and provides the following output, which I used for the further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But, a same code I run for tmp = df.loc[idx[:,1,0.0,32.75],:], for the respective latitude and longitude no values available in the input file. Instead of skipping those, it gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried to replace .loc with .iloc, but it came out with a too many indexers error. However, I tried solutions from internet using .to_numpy(), .values and .as_matrix(), but nothing work.
But, a same code I run for tmp = df.loc[idx[:,1,0.0,32.75],:], for the respective latitude and longitude no values available in the input file. Instead of skipping those, it gives me the following error:
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
# tmp is a dataframe where all rows have identical month, lat, and lon values
# ... do something with the tmp dataframe ...
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity.
Robustness. From a robustness perspective, if a dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then it will not create that group.
More information: User guide on grouping
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.

Is the are form to compact this statements, need a for loop?

I am new in the pandas dataFrames and I have this repetitive code, how can I improve that?
# Import the libraries
from random import random
import seaborn as sns
import pandas as pd
import numpy as np
# Make different arrays of random values
cells_1 = pd.DataFrame({"mRNA Transcripts": np.random.randint(0, 10, 12000)})
cells_2 = pd.DataFrame({"mRNA Transcripts": np.random.randint(30, 40, 12000)})
cells_3 = pd.DataFrame({"mRNA Transcripts": np.random.randint(11, 14, 24000)})
cells_4 = pd.DataFrame({"mRNA Transcripts": np.random.randint(26, 29, 24000)})
cells_5 = pd.DataFrame({"mRNA Transcripts": np.random.randint(15, 25, 168000)})
# Add in the previous DataFrames to make one only
cells_6 = cells_1.append(cells_2)
cells_7 = cells_6.append(cells_3)
cells_8 = cells_7.append(cells_4)
cells = cells_8.append(cells_5)
cells
I want to create a normal fit with only unique dataframe, here's the reason why I start with lesser values than in the mean of the future graphic.
Result:
mRNA Transcripts
0 4
1 9
2 0
3 4
4 4
... ...
167995 16
167996 22
167997 20
167998 17
167999 24
240000 rows × 1 columns
Since you seemingly have consecutive, non-overlapping integer ranges, you can define the bins in advance as a list and then concatenate the random numpy arrays for the dataframe:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng()
#define the bins 1-10, 11-14, 15-25, ....
#the right value is always excluded, so the last one must be +1
bins = [0, 11, 15, 26, 30, 41]
#define how many random elements you want to draw for each bin
#0-10: 2, 11-14: 4,....
bin_elem = [2, 4, 3, 4, 5]
cells = pd.DataFrame({"mRNA Transcripts":
np.concatenate([rng.integers(start, stop-1, num)
for start, stop, num in zip(bins[:-1], bins[1:], bin_elem)])})
print(cells)
This uses the new numpy random generator.

How can i print year,month as 0 in datetime less than 30day? or 1year?

i am trying to get the month and year of this and it raises error
from datetime import datetime
x = datetime(2020, 9, 8, 19, 42, 39, 264658) - datetime.fromtimestamp(1598192097.728026)
print(x.year) or (x.month) #AttributeError: 'datetime.timedelta' object has no attribute 'year'
how i can get month , year as 0 ? any ideas without use exception?
Besides that 'datetime.timedelta' object has no attribute 'year' - note that time spans greater than a week are ambiguous; e.g. the number of days in a months varies.
What you could use here is dateutil's relativdelta:
from datetime import datetime
from dateutil.relativedelta import relativedelta
timestamp = 1598192097.728026
dtobj = datetime(2020, 9, 8, 19, 42, 39, 264658)
x = relativedelta(dtobj, datetime.fromtimestamp(timestamp))
print(x.years, x.months)
# 0 0
your x variable does not have an attribute year or month here x stores the difference between the two times it can only return the difference. You have performed arithmetic operation the two and x is a class date type known as date.timedelta. And as for your needs you want it to return zero years and zero months. You can convert x to string and use if statement to check for years and months.
from datetime import datetime
x = datetime(2020, 9, 8, 19, 42, 39, 264658) -
datetime.fromtimestamp(1598192097.728026)
print(type(x))
print(x)
if 'month' in str(x):
if 'year' in str(x):
print(x)
else:
print(f'0 years {x}')
else:
if 'year' in str(x):
print(x)
else:
print(f' 0 years 0 months {x}')

computing the mean for python datetime

I have a datetime attribute:
d = {
'DOB': pd.Series([
datetime.datetime(2014, 7, 9),
datetime.datetime(2014, 7, 15),
np.datetime64('NaT')
], index=['a', 'b', 'c'])
}
df_test = pd.DataFrame(d)
I would like to compute the mean for that attribute. Running mean() causes an error:
TypeError: reduction operation 'mean' not allowed for this dtype
I also tried the solution proposed elsewhere. It doesn't work as running the function proposed there causes
OverflowError: Python int too large to convert to C long
What would you propose? The result for the above dataframe should be equivalent to
datetime.datetime(2014, 7, 12).
You can take the mean of Timedelta. So find the minimum value and subtract it from the series to get a series of Timedelta. Then take the mean and add it back to the minimum.
dob = df_test.DOB
m = dob.min()
(m + (dob - m).mean()).to_pydatetime()
datetime.datetime(2014, 7, 12, 0, 0)
One-line
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(d.min())).to_pydatetime()
To #ALollz point
I use the epoch pd.Timestamp(0) instead of min
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(pd.Timestamp(0))).to_pydatetime()
You can convert epoch time using astype with np.int64 and converting back to datetime with pd.to_datetime:
pd.to_datetime(df_test.DOB.dropna().astype(np.int64).mean())
Output:
Timestamp('2014-07-12 00:00:00')
You could work with unix time if you want. This is defined as the total number of seconds (for instance) since 1970-01-01. With that, all of your times are simply floats, so it's very easy to do simple math on the columns.
import pandas as pd
df_test['unix_time'] = (df_test.DOB - pd.to_datetime('1970-01-01')).dt.total_seconds()
df_test['unix_time'].mean()
#1405123200.0
# You want it in date, so just convert back
pd.to_datetime(df_test['unix_time'].mean(), origin='unix', unit='s')
#Timestamp('2014-07-12 00:00:00')
Datetime math supports some standard operations:
a = datetime.datetime(2014, 7, 9)
b = datetime.datetime(2014, 7, 15)
c = (b - a)/2
# here c will be datetime.timedelta(3)
a + c
Out[7]: datetime.datetime(2014, 7, 12, 0, 0)
So you can write a function that, given two datetimes, subtracts the lesser form the greater and adds half of the difference to the lesser. Apply this function to your dataframe, and shazam!
As of pandas=0.25, it is possible to compute the mean of a datetime series.
In [1]: import pandas as pd
...: import numpy as np
In [2]: s = pd.Series([
...: pd.datetime(2014, 7, 9),
...: pd.datetime(2014, 7, 15),
...: np.datetime64('NaT')])
In [3]: s.mean()
Out[3]: Timestamp('2014-07-12 00:00:00')
However, note that applying mean to a pandas dataframe currently ignores columns with a datetime series.

Iterating through values to find average equation of a line (Python3)

I am trying to find the equation of a line within a DF
Here is a fake data set to explain:
Clicks Sales
5 10
5 11
10 16
10 20
10 18
15 28
15 26
... ...
100 200
What I am trying to do:
Calculate the equation of the line between so that I am able to input a number of clicks and have an output of sales at any predicted level. The thing I am trying to wrap my brain around is that I have many different line functions (e.g. there are multiple sales for each amount of clicks). How can I iterate through my DF to just to calculate one aggregate line function?
Here's what I have but it only accept ONE input at a time, I would like to create an average or aggregate...
def slope(self, target):
return slope(target.x - self.x, target.y - self.y)
def y_int(self, target): # <= here's the magic
return self.y - self.slope(target)*self.x
def line_function(self, target):
slope = self.slope(target)
y_int = self.y_int(target)
def fn(x):
return slope*x + y_int
return fn
a = Point(5, 10) # I am stuck here since - what to input!?
b = Point(10, 16) # I am stuck here since - what to input!?
line = a.line_function(b)
print(line(x=10))
Use the scipy function scipy.stats.linregress to fit your data.
Maybe also check https://en.wikipedia.org/wiki/Linear_regression to better understand linear regression.
You could group by Clicks and take the average of the Sales per group:
In [307]: sales = df.groupby('Clicks')['Sales'].mean(); sales
Out[307]:
Clicks
5 10.5
10 18.0
15 27.0
100 200.0
Name: Sales, dtype: float64
Then form the piecewise linear interpolating function based on
the groupwise-averaged data above using interpolate.interp1d:
from scipy import interpolate
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
For example,
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt
df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
sales = df.groupby('Clicks')['Sales'].mean()
Once you have the groupwise-averaged sales, you can compute the interpolated sales
a number of ways. One way is to use np.interp:
newx = [10]
print(np.interp(newx, sales.index, sales.values))
# [ 18.] <-- The interpolated sales when the number of clicks is 10 (newx)
The problem with np.interp is that you are passing sales.index and sales.values to np.interp every time you call it -- it has no memory of the interpolating function. It is re-computing the interpolating function every time you call it.
If you have scipy, then you could create the interpolating function once and then use it as many times as you like later:
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
print(fn(newx))
# [ 18.]
For example, you could evaluate the interpolating function at a whole bunch of points (and plot the result) like this:
newx = np.linspace(5, 100, 100)
plt.plot(newx, fn(newx))
plt.plot(df['Clicks'], df['Sales'], 'o')
plt.show()
Pandas Series (and DataFrames) have an iterpolate method too. To use it, you reindex the Series to include the points where you wish to interpolate:
In [308]: sales.reindex(sales.index.union([14]))
Out[308]:
5 10.5
10 18.0
14 NaN
15 27.0
100 200.0
Name: Sales, dtype: float64
and then interpolate fills in the interpolated values where the Series is NaN:
In [295]: sales.reindex(sales.index.union([14])).interpolate('values')
Out[295]:
5 10.5
10 18.0
14 25.2 # <-- interpolated value
15 27.0
100 200.0
Name: Sales, dtype: float64
But I think it is perhaps not appropriate for your problem since it does not
return just the interpolated values you are looking for; it returns a whole
Series.

Resources