Convert datetime to date in Python → TypeError: unhashable type: 'numpy.ndarray' (python-3.x)

Pandas by default represents dates as datetime64[ns], so my columns show values like 2016-02-05 00:00:00, but I just want the date 2016-02-05. I applied this code to a few columns:
df3a['MA'] = pd.to_datetime(df3a['MA'])
df3a['BA'] = pd.to_datetime(df3a['BA'])
df3a['FF'] = pd.to_datetime(df3a['FF'])
df3a['JJ'] = pd.to_datetime(df3a['JJ'])
.....
but it gives me this error: TypeError: unhashable type: 'numpy.ndarray'
My question is: why do I get this error, and how do I convert datetime to date for multiple columns (around 50)?
I would be grateful for your help.

One way to achieve what you'd like is with a DatetimeIndex. I first created an example DataFrame with 'date' and 'values' columns and tried to reproduce the error you got from there.
import pandas as pd
import numpy as np
# Example DataFrame with a DatetimeIndex (dti)
dti = pd.date_range('2020-12-01', '2020-12-17') # daily dates from 1 to 17 December 2020
values = np.random.choice(range(1, 101), len(dti)) # random values between 1 and 100
df = pd.DataFrame({'date':dti,'values':values}, index=range(len(dti)))
print(df.head())
>>> date values
0 2020-12-01 85
1 2020-12-02 100
2 2020-12-03 96
3 2020-12-04 40
4 2020-12-05 27
In the example, the 'date' column already shows just the dates without the time, I guess because it is a DatetimeIndex.
What I haven't tested but might work for you is:
# Your dataframe
df3a['MA'] = pd.DatetimeIndex(df3a['MA'])
...
# automated transform for all columns (if all columns are datetimes!)
for label in df3a.columns:
    df3a[label] = pd.DatetimeIndex(df3a[label])

Use DataFrame.apply:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)
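If the goal is date-only values (no time part) in all ~50 columns, the same apply pattern can be combined with the .dt.date accessor. A minimal sketch, with made-up column names and data standing in for df3a:

```python
import pandas as pd

# Hypothetical frame standing in for df3a; names and values are made up.
df3a = pd.DataFrame({
    'MA': ['2016-02-05 00:00:00', '2016-03-01 00:00:00'],
    'BA': ['2016-02-06 00:00:00', '2016-03-02 00:00:00'],
})

cols = ['MA', 'BA']
# Parse each column to datetime, then strip the time with .dt.date,
# leaving plain datetime.date objects (object dtype).
df3a[cols] = df3a[cols].apply(lambda s: pd.to_datetime(s).dt.date)
print(df3a)
```

Note that the result columns are object dtype, so datetime-specific vectorized operations are no longer available on them.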

Related

Apply a custom rolling function with arguments on Pandas DataFrame

I have this df (here is the df.head()):
date colA
0 2018-01-05 0.6191
1 2018-01-20 0.5645
2 2018-01-25 0.5641
3 2018-01-27 0.5404
4 2018-01-30 0.4933
I would like to apply a function over a sliding window of 3 rows, meaning to rows 1,2,3, then rows 2,3,4, then rows 3,4,5, etc.
This is what I wrote:
def my_rolling_func(df, val):
    p1 = (df['date']-df['date'].min()).dt.days.tolist()[0], df[val].tolist()[0]
    p2 = (df['date']-df['date'].min()).dt.days.tolist()[1], df[val].tolist()[1]
    p3 = (df['date']-df['date'].min()).dt.days.tolist()[2], df[val].tolist()[2]
    return sum([i*j for i, j in [p1, p2, p3]])

df.rolling(3, center=False, axis=1).apply(my_rolling_func, args=('colA'))
But I get this error:
ValueError: Length of passed values is 1, index implies 494.
494 is the number of rows in my df.
I'm not sure why it says I passed a length of 1, I thought the rolling generate slices of df according to the window size I defined (3), and then it applied the function for that subset of df.
First, you specified the wrong axis: axis=1 means the window slides along the columns. You want the window to slide along the index, so you need axis=0. Secondly, you misunderstand a little how rolling works: it applies your function to each column independently, so you cannot operate on both the date and colA columns at the same time inside your function.
I rewrote your code to make it work:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date':pd.date_range('2018-01-05', '2018-01-30', freq='D'), 'A': np.random.random((26,))})
df = df.set_index('date')
def my_rolling_func(s):
    days = (s.index - s.index[0]).days
    return sum(s * days)

res = df.rolling(3, center=False, axis=0).apply(my_rolling_func)
print(res)
Out:
A
date
2018-01-05 NaN
2018-01-06 NaN
2018-01-07 1.123872
2018-01-08 1.121119
2018-01-09 1.782860
2018-01-10 0.900717
2018-01-11 0.999509
2018-01-12 1.755408
2018-01-13 2.344914
.....
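If you really do need both columns inside the window function (which rolling.apply cannot give you, since it works per column), one workaround is to slice the windows yourself. A sketch with made-up data; weighted_sum is a hypothetical helper mirroring the days-times-values idea above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2018-01-05', periods=6, freq='D'),
    'colA': [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
})

window = 3

def weighted_sum(win):
    # win is a 3-row slice, so both 'date' and 'colA' are visible here.
    days = (win['date'] - win['date'].min()).dt.days
    return (days * win['colA']).sum()

# Build each window explicitly with iloc slicing.
results = [weighted_sum(df.iloc[i:i + window]) for i in range(len(df) - window + 1)]
print(results)
```

This is slower than a true rolling operation for large frames, but it removes the one-column-at-a-time restriction.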

pandas split all list column and get first value

I am trying to get the first element of the list in every row and column into a single dataframe. All of the rows and columns are in list format, with 2 elements in each list. Here is what I tried. What syntax should I use to apply this to the entire dataframe in pandas?
import pandas as pd
import numpy as np
def my_function(x):
    return x.replace('\[','').replace('\]','').split(',')[0]
t = pd.DataFrame(data={'col1': ['[blah,blah]','[test,bing]',np.NaN], 'col2': ['[math,sci]',np.NaN,['number','4']]})
print(t)
not working:
t.apply(my_function) # AttributeError: 'Series' object has no attribute 'split'
t.apply(lambda x: str(x).replace('\[','').replace('\]','').split(',')[0]) # does not work
t.apply(lambda x: list(x)[0]) # gives first column and doesn't split
trying to get this:
col1 col2
blah math
test NaN
NaN number
Use replace:
>>> t.replace(r'\[([^,]*).*', r'\1', regex=True)
col1 col2
0 blah math
1 test NaN
2 NaN number
But I think there is an error in how you create your sample dataframe. I changed it to:
t = pd.DataFrame(data={'col1': ['[blah,blah]','[test,bing]',np.NaN],
'col2': ['[math,sci]',np.NaN,'[number,4]']})
# The problem is here ------------------------------^^^^^^^^^^^^
Link to regex101

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame, csv_df['datetime'], which in each cell has just one value, a date, and do the calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda around days_until_exp(), but you don't pass x to it; you pass the whole csv_df['datetime'] column every time, which is why the result is wrong.
Anyway, without your sample data, I guess that you want the deltas between csv_df['datetime'] and today(), or their sum. For this you don't need apply; just do a direct vectorized operation on the Series.
I make 2 columns dataframe for sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find sum of all deltas and assign the same sum value to csv_df['days_until_expiration']
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348
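For completeness, a self-contained version of the vectorized approach (the sample dates are made up):

```python
import pandas as pd
from datetime import datetime

csv_df = pd.DataFrame({'datetime': pd.to_datetime(['2019-09-01',
                                                   '2019-09-02',
                                                   '2019-09-03'])})
today = datetime.today()

# Subtracting a scalar from a datetime64 Series is vectorized and yields
# a timedelta64 Series; .dt.days extracts the whole-day component per row.
csv_df['days_until_expiration'] = (csv_df['datetime'] - today).dt.days
print(csv_df)
```

No apply, no Python-level loop: one subtraction over the whole column.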

How to define time column in in a pandas dataframe?

I have a pandas DataFrame, seied_log:
seied_log
Out[155]:
0
0 5.264761
1 5.719328
2 6.420809
3 6.129704
...
What I run is an ARIMA model:
model = ARIMA(seied_log, order=(2, 1, 0))
However, I receive the following error:
ValueError: Given a pandas object and the index does not contain dates
What I need is to define a "date" column. These are yearly observations. How can I define a column with dates starting from 1978?
If your index is 0 through n_obs-1, then simply
from datetime import datetime
seied_log["date"] = seied_log.index.map(lambda idx: datetime(year=1978+idx, month=1, day=1))
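Alternatively, since the ValueError complains about the index not containing dates, it may be enough to build a yearly DatetimeIndex directly. A sketch, with made-up values standing in for seied_log:

```python
import pandas as pd

# Stand-in for seied_log; the values are made up for illustration.
seied_log = pd.DataFrame({0: [5.264761, 5.719328, 6.420809, 6.129704]})

# One timestamp per row, 1978 onwards; pd.to_datetime('1978') parses
# a bare year string to 1978-01-01.
seied_log.index = pd.to_datetime([str(1978 + i) for i in range(len(seied_log))])
print(seied_log.index)
```

With a DatetimeIndex in place, the yearly series can be passed to ARIMA without the date check failing.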

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame which contains two string columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the line below -
df['date'] = pd.to_datetime((df['month'] + df['year']), format='%m%Y')
whereby for df['month'] = '08' and df['year'] = '1968'
we get df['date'] = 1968-08-01
This is exactly what I wanted.
Problem at hand: My DataFrame has more than 200,000 rows and I notice that sometimes, in addition, I also get Timestamp like the one below for a few rows and I want to avoid that -
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series, explicitly extracting only the date with the code below -
df['date'] = pd.to_datetime((df['month'] + df['year']), format='%m%Y') #Line 1
df['date'] = df['date'].dt.date #Line 2
The problem was solved, except that Line 2 took 5 times longer than Line 1.
Question: Is there any way where I could tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time and resource efficient manner?
AFAIK we don't have a date dtype in Pandas, we only have datetime, so there will always be a time part.
Even though Pandas displays 1968-08-01, the value has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
UPDATE: be aware that .dt.date will give you Python datetime.date objects in an object-dtype column, not datetime64:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
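If only a date-looking string is needed, Series.dt.strftime is another option, though the plain string concatenation shown above is typically faster. A minimal sketch with made-up month/year values:

```python
import pandas as pd

df = pd.DataFrame({'month': ['08', '03'], 'year': ['1968', '1972']})

dt = pd.to_datetime(df['month'] + df['year'], format='%m%Y')
# strftime renders only the date part; the result is an object (string) column.
df['date'] = dt.dt.strftime('%Y-%m-%d')
print(df['date'].tolist())
```

Like .dt.date, this leaves an object-dtype column, so do it as a final formatting step rather than before further datetime arithmetic.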
