I want to create a heatmap with year across the x axis and month across the y axis. The cells of the heatmap will hold % returns. Here's roughly what I am after.
So I have some data and I turn them into pct_change() series.
import pandas_datareader.data as web
import pandas as pd
from datetime import datetime as dt
import numpy as np
import seaborn as sns
start = dt(year=2000, month=1, day=1)
df = web.DataReader('GDP', 'fred', start=start)
returns = df.pct_change()  # pct_change returns a new DataFrame; df itself is unchanged
df.tail()
So here's what we are working with. It is important to note that the index is a DatetimeIndex.
GDP
DATE
2016-10-01 18905.545
2017-01-01 19057.705
2017-04-01 19250.009
2017-07-01 19500.602
2017-10-01 19736.491
I want to do something like this, but I don't know how to implement it with the datetime index:
gdp = df.pivot(df.index.month, df.index.year, "GDP")
ax = sns.heatmap(gdp)
Which (expectedly) doesn't work...
KeyError: "Int64Index([ 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1,\n 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4,\n 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7,\n 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10, 1, 4, 7, 10,\n 1, 4, 7, 10],\n dtype='int64', name='DATE') not in index"
It's not working because you are extracting the month and year inside the pivot call, and pivot expects column names that already exist in the df you passed, not arrays of values.
You can specify them beforehand:
# DATE is the index here, so read year and month from df.index
df["Year"] = df.index.year
df["Month"] = df.index.strftime("%b")  # abbreviated names, to match the list below
pt = df.pivot_table(index="Month", columns="Year", values="GDP", aggfunc="sum").fillna(0)
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
pt = pt.reindex(months)  # reindex_axis has been removed from recent pandas versions
sns.heatmap(pt, annot=True)
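I'm reindexing the rows because pivot_table sorts the row and column labels in ascending order, which for month names means alphabetical order rather than calendar order. Months that never occur in the data (GDP is quarterly) become all-NaN rows after the reindex.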
Above gives me:
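As a possible alternative to adding helper columns, the month and year can be taken straight from the DatetimeIndex while grouping. The sketch below uses a made-up quarterly series in place of the FRED download (the values and the names `returns` and `table` are illustrative assumptions, not from the original post):

```python
import numpy as np
import pandas as pd

# Synthetic quarterly data standing in for the FRED GDP series (values are invented).
idx = pd.date_range("2015-01-01", periods=12, freq="QS")
df = pd.DataFrame({"GDP": np.linspace(100.0, 155.0, len(idx))}, index=idx)

# Percent change between consecutive quarters; the first value is NaN.
returns = df["GDP"].pct_change() * 100

# Group on attributes of the index, then move the year level into the columns.
table = returns.groupby([returns.index.month, returns.index.year]).mean().unstack()
table.index.name = "Month"
table.columns.name = "Year"

print(table.round(2))
```

The resulting `table` has months as rows and years as columns, so it should be usable directly as `sns.heatmap(table, annot=True)`. Because the rows are month numbers rather than names, they already sort in calendar order.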
I'm doing some beginner python exercises and one of them is to remove duplicates from a list. I've successfully done it, but the strange thing is that it is returning a dictionary instead of a list.
This is my code.
import random
a = []
b = []
for i in range(0,20):
    n = random.randint(0,10)
    a.append(n)
for i in range(0,20):
    n = random.randint(0,10)
    b.append(n)
print(sorted(a))
print(sorted(b))
c = set(list(a+b))
print(c)
and this is what it's spitting out
[0, 0, 1, 1, 1, 1, 2, 3, 4, 4, 6, 6, 7, 7, 7, 8, 9, 9, 10, 10]
[0, 1, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 10, 10, 10]
{0, 1, 2, 3, 4, 6, 7, 8, 9, 10}
thanks in advance!
{0, 1, 2, 3, 4, 6, 7, 8, 9, 10} is a set, not a dictionary. A dictionary would be printed as {key: value, key: value, ...}
Try print(type(c)) and you'll see it prints <class 'set'> rather than <class 'dict'>
Also try the following
s = {1,2,3}
print(type(s))
d = {'a':1,'b':2,'c':3}
print(type(d))
You'll see the type is different
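If the goal is to end up with a list again, the set can be converted back. A minimal sketch, using small hand-written lists instead of the random ones from the question (the variable names are only illustrative):

```python
a = [3, 1, 3, 2]
b = [2, 5, 1]

# set() drops duplicates; sorted() returns a new list in ascending order.
unique_sorted = sorted(set(a + b))

# dict.fromkeys keeps the first occurrence of each value, preserving order.
unique_ordered = list(dict.fromkeys(a + b))

# Note: empty braces create a dict, not a set; use set() for an empty set.
empty_braces = {}

print(unique_sorted)       # [1, 2, 3, 5]
print(unique_ordered)      # [3, 1, 2, 5]
print(type(empty_braces))  # <class 'dict'>
```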
I have a numpy array which contains hours from 4 days:
s = np.array([0.0, 1.0, 2.0, 3.0, 4.0 ....96.0])
I want to create a datetime object from that.
I know that the first element is at timestamp 2021-03-21 00:00,
so:
start_date = datetime.datetime.strptime('2021-03-21 00:00', '%Y-%m-%d %H:%M')
How can I create a new array of datetimes, where each element is the start date plus the number of hours given in the s array?
Use timedelta to build your new array:
>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> s = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 96.0])
>>> start_date = datetime.strptime('2021-03-21 00:00', '%Y-%m-%d %H:%M')
>>> [start_date + timedelta(hours=diff) for diff in s]
[datetime.datetime(2021, 3, 21, 0, 0), datetime.datetime(2021, 3, 21, 1, 0), datetime.datetime(2021, 3, 21, 2, 0), datetime.datetime(2021, 3, 21, 3, 0), datetime.datetime(2021, 3, 21, 4, 0), datetime.datetime(2021, 3, 25, 0, 0)]
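If a NumPy array is preferred over a Python list, the same idea can be sketched with datetime64 and timedelta64. This assumes the elided array in the question is simply every whole hour from 0 to 96, which the post does not state explicitly:

```python
import numpy as np

# Assumption: hourly steps 0.0, 1.0, ..., 96.0 (the question elides the middle values).
s = np.arange(0.0, 97.0)

start = np.datetime64("2021-03-21T00:00")

# Cast to whole hours first; fractional hours would be truncated by this cast.
offsets = s.astype("int64").astype("timedelta64[h]")
stamps = start + offsets

# tolist() converts the datetime64 values to datetime.datetime objects.
as_datetimes = stamps.tolist()

print(stamps[0], stamps[-1])
```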
I have a dataframe like this:
mid value label
ID
192 3 176.6 [9, 6, 8, 0, 8, 8, 7, 9, 2, 19...
192 4 73.6 [9, 6, 8, 0, 8, 8, 7, 9, 2, 19...
192 5 15.8 [9, 6, 8, 0, 8, 8, 7, 9, 2, 19...
194 3 9603.2 [0, 0, 0, 0, 0, 9, 6, 1, 8, ...
I want to apply MultiLabelBinarizer after removing the duplicate values in each list of the label column.
I have tried looping over the frame and removing duplicates. Also, the MultiLabelBinarizer doesn't work and throws an exception:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(y_train.data)
X_train includes the mid and value columns
y_train includes label values
id is the index
I expect a prediction from the above values after the duplicate values are removed from each list of label column
Let's assume your dataframe is named df:
df2 = pd.DataFrame(df.groupby(['ID','mid', 'value'])['label'].apply(lambda x: tuple(x.values)))
df2.reset_index(inplace=True)
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit(df2['label'])
mlb.transform(df2['label'])
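For the de-duplication step on its own, one possible sketch is to turn each list into a sorted list of its unique values before fitting. The small frame below is invented for illustration (the labels in the question are truncated), and as far as I know MultiLabelBinarizer already ignores repeated labels within a row, so this step mostly tidies the data:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical miniature of the frame in the question; label lists are made up.
df = pd.DataFrame(
    {"mid": [3, 4], "value": [176.6, 73.6], "label": [[9, 6, 8, 8], [0, 0, 9]]},
    index=pd.Index([192, 194], name="ID"),
)

# Remove duplicates inside each list; sorted() gives a stable, readable order.
df["label"] = df["label"].apply(lambda labels: sorted(set(labels)))

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df["label"])  # one column per distinct label

print(mlb.classes_)
print(encoded)
```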
I want to pick the indices of number 8 without knowing its position in the array.
a = np.arange(10)
You can use np.where like :
>>> import numpy as np
>>> a = np.array([1,4,8,2,6,7,9,8,7,8,8,9,1,0])
>>> a
array([1, 4, 8, 2, 6, 7, 9, 8, 7, 8, 8, 9, 1, 0])
>>> np.where(a==8)[0]
array([ 2, 7, 9, 10], dtype=int64)
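For a 1-D array, np.flatnonzero is a slightly shorter way to get the same indices, since it returns the index array directly instead of a tuple. A small sketch covering both the array from the question and the one used above:

```python
import numpy as np

a = np.arange(10)
b = np.array([1, 4, 8, 2, 6, 7, 9, 8, 7, 8, 8, 9, 1, 0])

# a == 8 builds a boolean mask; flatnonzero returns the positions that are True.
idx_a = np.flatnonzero(a == 8)
idx_b = np.flatnonzero(b == 8)

print(idx_a)  # [8]
print(idx_b)  # [ 2  7  9 10]
```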
I have a numpy array of milliseconds in integers, which I want to convert to an array of Python datetimes via a timedelta operation.
The following MWE works, but I'm convinced there is a more elegant approach, or one with better performance, than multiplication by 1 ms.
start = pd.Timestamp('2016-01-02 03:04:56.789101').to_pydatetime()
dt = np.array([ 19, 14980, 19620, 54964615, 54964655, 86433958])
time_arr = start + dt * timedelta(milliseconds=1)
So your approach produces:
In [56]: start = pd.Timestamp('2016-01-02 03:04:56.789101').to_pydatetime()
In [57]: start
Out[57]: datetime.datetime(2016, 1, 2, 3, 4, 56, 789101)
In [58]: dt = np.array([ 19, 14980, 19620, 54964615, 54964655, 86433958])
In [59]: time_arr = start + dt * timedelta(milliseconds=1)
In [60]: time_arr
Out[60]:
array([datetime.datetime(2016, 1, 2, 3, 4, 56, 808101),
datetime.datetime(2016, 1, 2, 3, 5, 11, 769101),
datetime.datetime(2016, 1, 2, 3, 5, 16, 409101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 404101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 444101),
datetime.datetime(2016, 1, 3, 3, 5, 30, 747101)], dtype=object)
The equivalent using np.datetime64 types:
In [61]: dt.astype('timedelta64[ms]')
Out[61]: array([ 19, 14980, 19620, 54964615, 54964655, 86433958], dtype='timedelta64[ms]')
In [62]: np.datetime64(start)
Out[62]: numpy.datetime64('2016-01-02T03:04:56.789101')
In [63]: np.datetime64(start) + dt.astype('timedelta64[ms]')
Out[63]:
array(['2016-01-02T03:04:56.808101', '2016-01-02T03:05:11.769101',
'2016-01-02T03:05:16.409101', '2016-01-02T18:21:01.404101',
'2016-01-02T18:21:01.444101', '2016-01-03T03:05:30.747101'], dtype='datetime64[us]')
I can produce the same array from your time_arr with np.array(time_arr, dtype='datetime64[us]').
tolist converts these datetime64 items to datetime objects:
In [97]: t1=np.datetime64(start) + dt.astype('timedelta64[ms]')
In [98]: t1.tolist()
Out[98]:
[datetime.datetime(2016, 1, 2, 3, 4, 56, 808101),
datetime.datetime(2016, 1, 2, 3, 5, 11, 769101),
datetime.datetime(2016, 1, 2, 3, 5, 16, 409101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 404101),
datetime.datetime(2016, 1, 2, 18, 21, 1, 444101),
datetime.datetime(2016, 1, 3, 3, 5, 30, 747101)]
or wrap it back in an array to get your time_arr:
In [99]: np.array(t1.tolist())
Out[99]:
array([datetime.datetime(2016, 1, 2, 3, 4, 56, 808101),
...
datetime.datetime(2016, 1, 3, 3, 5, 30, 747101)], dtype=object)
For the calculation alone, datetime64 is faster, but once the conversions are included it may not be the fastest overall.
https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
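Since the question already uses pandas for the start timestamp, another option worth trying is pd.to_timedelta, which does the arithmetic in pandas and converts back only at the end. This is a sketch rather than a benchmarked recommendation, and the name `stamps` is just illustrative:

```python
from datetime import datetime

import numpy as np
import pandas as pd

start = pd.Timestamp("2016-01-02 03:04:56.789101")
dt = np.array([19, 14980, 19620, 54964615, 54964655, 86433958])

# Interpret the integers as milliseconds and add them to the start timestamp.
stamps = start + pd.to_timedelta(dt, unit="ms")  # DatetimeIndex

# Convert to an object array of datetime.datetime, like the original time_arr.
time_arr = stamps.to_pydatetime()

print(time_arr[0])
```

Whether this beats the datetime64 route depends on the array size and on how often the conversion back to Python datetimes is needed, so it is worth timing on the real data.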