Converting Pandas datetime attribute to integer - python-3.x

I have a CSV dataset with the 'date' attribute as follows:
2012-04-29
2012-04-29
2012-04-29
2012-05-05
2012-05-05
Name: date, dtype: datetime64[ns]
I want to convert the unique date values to integers, so that the first three rows with the date '2012-04-29' become 1, the next two rows with '2012-05-05' become 2, and so on.
How can I convert the 'date' attribute into a new integer column, say 'date_int'?
Thanks

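We can use a dense rank, which numbers the dates in sorted order (cast to int, because rank returns floats):
df['date_int'] = df['date'].rank(method='dense').astype(int)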

You are looking for factorize, which numbers the dates in order of first appearance:
df['date_int'] = df['date'].factorize()[0] + 1
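The two answers agree on the sample data only because it is already sorted. A minimal sketch of the difference, using a made-up unsorted frame (the column names by_rank and by_factorize are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical unsorted data: the later date appears first.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2012-05-05", "2012-04-29", "2012-04-29", "2012-05-05"])}
)

# Dense rank numbers the dates in chronological order.
df["by_rank"] = df["date"].rank(method="dense").astype(int)

# factorize numbers the dates in order of first appearance.
df["by_factorize"] = df["date"].factorize()[0] + 1

print(df)
```

If chronological numbering is wanted from factorize as well, passing sort=True should give the same codes as the dense rank.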

Related

Date stuck as unformattable in pandas dataframe

I am trying to plot time series data and my date column is stuck like this, and I cannot seem to figure out what datatype it is to change it, as adding verbose = True doesn't yield any explanation for the data.
Here is a screenshot of the output Date formatting
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why it's auto formatting it and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.
When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (there is one that supports time intervals, such as "5 minutes"). The Python standard library does, however: the datetime.time class.
Pandas provides the .dt accessor, which can extract a Series of datetime.time objects from a Series of datetime64 values:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object, the fallback dtype Pandas uses for anything that is not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html
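For the single combined column the question asks for, one option is to join the two text columns before parsing. This is only a sketch: it assumes Date and Time are read from the CSV as strings in the formats shown (the real file's formats are not visible in the question), and the Datetime column name is made up.

```python
import pandas as pd

# Hypothetical sample standing in for the CSV columns.
df = pd.DataFrame(
    {"Date": ["2016-01-04", "2016-01-04"], "Time": ["10:00:00", "10:30:00"]}
)

# Concatenate the strings first, then parse once, so no default date is filled in.
df["Datetime"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%Y-%m-%d %H:%M:%S"
)

print(df["Datetime"])
```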

Issue while converting string to datetime in pandas

I have a DataFrame like this:
Input
Date
2020-12-21
2019-09-30
2019-12-04
I want to convert it to this specific datetime format.
Expected Format
Date
2020-12-21T00:00:00Z
2019-09-30T00:00:00Z
2019-12-04T00:00:00Z
My current code
df.loc[:,'Date'] = pd.to_datetime(df.loc[:,'Date'])
It's not working correctly. How can this be fixed?
I'm not sure there's a shortcut for this ISO 8601 format. Here's a workaround:
pd.to_datetime(df['Date']).dt.strftime("%Y-%m-%dT%H:%M:%SZ")
Output:
0 2020-12-21T00:00:00Z
1 2019-09-30T00:00:00Z
2 2019-12-04T00:00:00Z
Name: Date, dtype: object
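Note that strftime returns text, not datetimes. A small sketch of the round trip, assuming the trailing Z is meant to denote UTC (the names formatted and parsed are illustrative):

```python
import pandas as pd

dates = pd.Series(["2020-12-21", "2019-09-30", "2019-12-04"])

# Format as ISO 8601 text with a literal trailing Z.
formatted = pd.to_datetime(dates).dt.strftime("%Y-%m-%dT%H:%M:%SZ")

# Parsing the text back with utc=True yields timezone-aware datetimes.
parsed = pd.to_datetime(formatted, utc=True)

print(formatted.tolist())
print(parsed.dtype)
```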

Why is call to sum() on a data frame generating wrong numbers?

I want to sum the numerical values in each row (Store A to Store D) for the month of June and place them in an appended column 'Sum'. But the resulting sums are huge and clearly wrong. How can I get the correct sums?
This code was run using Python 3.6:
import pandas as pd
import numpy as np
data = np.array([
['', 'week','storeA','storeB','storeC','storeD'],
[0,"2014-05-04",2643,8257,3893,6231],
[1,"2014-05-11",6444,5736,5634,7092],
[2,"2014-05-18",9646,2552,4253,5447],
[3,"2014-05-25",5960,10740,8264,6063],
[4,"2014-06-04",5960,10740,8264,6063],
[5,"2014-06-12",7412,7374,3208,3985]
])
df= pd.DataFrame(data=data[1:,1:],
index=data[1:,0],
columns=data[0,1:])
print(df)
# get rows of table which match Year,Month for last month
df2 = df[df['week'].str.contains("2014-06")].copy()
print(df2)
# generate col summing up each row
col_list = list(df2)
print(col_list)
col_list.remove('week')
print(col_list)
df2['Sum'] = df2[col_list].sum(axis=1)
print(df2)
Output of Sum column for rows 4 and 5:
Row4 - 5.960107e+16
Row5 - 7.412737e+15
Use astype to convert those strings to ints, and sum works properly:
df2['Sum'] = df2[col_list].astype(int).sum(axis=1)
Output:
week storeA storeB storeC storeD Sum
4 2014-06-04 5960 10740 8264 6063 31027
5 2014-06-12 7412 7374 3208 3985 21979
What was happening: you were summing (concatenating) strings.
Because your array mixes strings and integers, NumPy coerces everything to string. Take a look at this:
df.dtypes
week object
storeA object
storeB object
storeC object
storeD object
dtype: object
You have columns of strings, and sum on string dataframes results in concatenation.
The solution is to convert these to integers first -
df2[col_list] = df2[col_list].astype(int)
Your code then works.
df2[col_list].sum(axis=1)
4 31027
5 21979
dtype: int64
Alternatively, declare data as an object array -
data = np.array([[...], [...], ...], dtype=object)
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
Next, perform a soft conversion using infer_objects (new in v0.21):
df = df.infer_objects()
df.dtypes
week object
storeA int64
storeB int64
storeC int64
storeD int64
dtype: object
Works like a charm.
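The coercion and the string concatenation behind the huge numbers can be checked directly. A minimal sketch using only the two June rows from the question (the variable names mixed, stores, and totals are illustrative):

```python
import numpy as np
import pandas as pd

# Mixing ints and strings makes NumPy store every element as a string.
mixed = np.array([
    [4, "2014-06-04", 5960, 10740, 8264, 6063],
    [5, "2014-06-12", 7412, 7374, 3208, 3985],
])
kind = mixed.dtype.kind  # 'U' means a unicode string array

# Joining the store values as text reproduces the digits of the bogus sum.
concatenated = "".join(mixed[0, 2:])

# Converting to int first gives the real totals.
stores = pd.DataFrame(
    mixed[:, 2:], index=mixed[:, 0],
    columns=["storeA", "storeB", "storeC", "storeD"],
)
totals = stores.astype(int).sum(axis=1)

print(kind, concatenated)
print(totals)
```

The joined text 59601074082646063 is the 5.960107e+16 reported for row 4.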

Number to Date Conversion using Pandas in Python?

When I convert a number to a date in Pandas, I don't get the same result as in Excel.
I need the conversion to match Excel's output.
For example, Excel gives the following for this number:
Input - 42970.73819
Output- 8/23/2017 17:43
I tried the date conversion in Pandas, but the result does not match Excel's.
Thank you
Madan
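I think you need to convert the Excel serial date. Excel counts days from 1899-12-30, and 25569 is the serial number of 1970-01-01, so subtract it and convert the remaining days to seconds: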
df = pd.DataFrame({'date':[42970.73819,42970.73819]})
print (df)
date
0 42970.73819
1 42970.73819
df = pd.to_datetime((df['date'] - 25569) * 86400.0, unit='s')
print (df)
0 2017-08-23 17:42:59.616
1 2017-08-23 17:42:59.616
Name: date, dtype: datetime64[ns]
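An alternative sketch that avoids the magic constant is to pass the Excel epoch as the origin argument of pd.to_datetime. This is a different route from the answer above, not its method, and it assumes the workbook uses Excel's default 1900 date system; the names serials, converted, and rounded are illustrative.

```python
import pandas as pd

serials = pd.Series([42970.73819, 42970.73819])

# Interpret the numbers as days counted from Excel's day zero.
converted = pd.to_datetime(serials, unit="D", origin="1899-12-30")

# Excel displays the value rounded to the minute.
rounded = converted.dt.round("min")

print(rounded)
```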

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame with two string columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the code below:
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
where by for df['month'] = '08' and df['year']='1968'
we get df['date']=1968-08-01
This is exactly what I wanted.
Problem at hand: My DataFrame has more than 200,000 rows and I notice that sometimes, in addition, I also get Timestamp like the one below for a few rows and I want to avoid that -
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series, explicitly extracting only the date with the code below:
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y') #Line 1
df['date']=df['date'].dt.date #Line 2
The problem was solved, except that Line 2 took five times as long as Line 1.
Question: Is there any way where I could tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time and resource efficient manner?
AFAIK, Pandas has no date dtype, only datetime, so there will always be a time part.
Even though Pandas shows: 1968-08-01, it has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
UPDATE: be aware that .dt.date does not give you a native date dtype; it returns Python datetime.date objects stored in an object column:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
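To see what each option actually stores, here is a short sketch comparing .dt.date, a string column, and a monthly period column. The period variant is a suggestion of mine rather than part of the answer, and the variable names are illustrative.

```python
import datetime
import pandas as pd

dates = pd.Series(pd.to_datetime(["1968-08-01", "2017-08-01"]))

# .dt.date yields plain Python date objects (object dtype, slower to work with).
as_objects = dates.dt.date
first = as_objects.iloc[0]

# strftime yields text.
as_strings = dates.dt.strftime("%Y-%m-%d")

# A monthly period keeps a native dtype and carries no time-of-day part.
as_periods = dates.dt.to_period("M")

print(type(first), as_strings.iloc[0], as_periods.dtype)
```

Since the data only has month and year, a period[M] column may be the closest fit, at the cost of not being a datetime64 column.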
