Pandas Equivalent to SQL YEAR(GETDATE()) - python-3.x

I'm a Pandas newbie but decent at SQL. A function I often leverage in SQL is this:
YEAR(date_format_data) = (YEAR(GETDATE())-1)
This will get me all the data from last year. Can someone please help me understand how to do the equivalent in Pandas?
Here's some example data:
Date Number
01/01/15 1
01/02/15 2
01/01/15 3
01/01/16 2
01/01/16 1
And here's my best guess at the code:
df = df[YEAR('Date') == (YEAR(GETDATE()) -1)].agg(['sum'])
And this code would return a value of '3'.
Thank you in advance for your help, I'm having a really hard time figuring out what I'm sure is simple.

I think you can do it this way:
prev_year = pd.Timestamp.today().year - 1
df.loc[df['Date'].dt.year == prev_year]
PS: the .dt.year accessor works only if the Date column is of datetime dtype. If that's not the case, you may want to convert the column to datetime dtype first:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
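Putting it all together with the example data from the question (a minimal sketch; the '%m/%d/%y' format and the hard-coded previous year are assumptions for reproducibility -- in practice you would compute pd.Timestamp.today().year - 1):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['01/01/15', '01/02/15', '01/01/15', '01/01/16', '01/01/16'],
                   'Number': [1, 2, 3, 2, 1]})

# Convert the string column to datetime dtype first.
# '%m/%d/%y' is an assumed format -- adjust it to match your data.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

# Hard-coded here so the example is reproducible; normally this would be
# pd.Timestamp.today().year - 1.
prev_year = 2015

# Filter to last year's rows and sum the Number column, mirroring the SQL.
total = df.loc[df['Date'].dt.year == prev_year, 'Number'].sum()
print(total)  # 1 + 2 + 3 = 6
```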

For pandas, first convert your date column to datetime with pd.to_datetime:
df['Date2'] = pd.to_datetime(df['Date'])
(pd.to_datetime has a format parameter to specify your input date format.) Then you can get the year via the .dt accessor:
df['Date2'].dt.year
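For example, with the slash-separated dates from the question (the '%m/%d/%y' format here is an assumption about the input):

```python
import pandas as pd

s = pd.Series(['01/01/15', '01/02/15', '01/01/16'])
# The format string is an assumption -- swap in '%d/%m/%y' for day-first data.
parsed = pd.to_datetime(s, format='%m/%d/%y')
print(parsed.dt.year.tolist())  # [2015, 2015, 2016]
```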

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and pivot it, but can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Judging from the details you have provided, I think you are dealing with time-series data, with readings on different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which data was acquired:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here the last three zeros are trimmed from the timestamp because %f parses at most six fractional digits. Each string is converted to a datetime object, and then we take the unique dates.
Now we create a new, empty DataFrame in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
after this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements to append to the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the last position
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
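As a possible alternative to the explicit loop, the same reshaping can be done with the built-in pivot the question mentions (a sketch, assuming every date has exactly one value per time of day):

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': ['2022-04-29T02:00:00.000000000', '2022-04-29T03:00:00.000000000',
                 '2022-05-29T02:00:00.000000000', '2022-05-29T03:00:00.000000000'],
    'value': [5, 6, 5, 7],
})

# Split each timestamp into a date part (rows) and a time-of-day part (columns).
ts = pd.to_datetime(df['datetime'])
df['date'] = ts.dt.date
df['time'] = ts.dt.strftime('%H:%M')  # becomes the '02:00', '03:00', ... columns

# One row per date, one column per time of day.
wide = (df.pivot(index='date', columns='time', values='value')
          .reset_index()
          .rename_axis(columns=None))
print(wide)
```

With hourly data from 00:00 to 20:00 this generates all the time columns automatically, with no loop required.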

Date stuck as unformattable in pandas dataframe

I am trying to plot time-series data, and my date column is stuck like this. I cannot figure out what datatype it is in order to change it, and adding verbose=True doesn't yield any explanation for the data.
(The original question included a screenshot of the malformed Date output.)
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why it's auto formatting it and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.
When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (there is one that supports time intervals, such as "5 minutes"). The Python standard library does, however: the datetime.time class.
Pandas provides a convenience function called the .dt accessor for extracting a Series of datetime.time objects from a Series of datetime64 objects:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object which is the fallback dtype Pandas uses for anything which is not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html
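Since the goal here is one combined column like 2016-01-04 10:00:00, another option is to concatenate the two text columns before parsing, so neither part gets a default substituted (a sketch, assuming Date and Time arrive from the CSV as strings):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2016-01-04', '2016-01-05'],
                   'Time': ['10:00:00', '11:30:00']})

# Parse date and time together; nothing is filled in with today's date.
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
print(df['Datetime'].iloc[0])  # 2016-01-04 10:00:00
```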

Extract row from pandas dateframe

I have a data frame as in the image below. I want to extract the rows of the data frame whose year and month are '1395/01'. I used the code below, but I know it is not correct because a string slice cannot be applied directly to a Series of strings. Can anyone show me a way without using nested for loops?
df[df['Date'][:7] == '1395/01']
I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious.
You can use loc and startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05

PANDAS date summarisation

I have a pandas dataframe that looks like:
import pandas as pd
df= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22']})
df
I want to aggregate the dates so it shows the first four quarters and then the year end. For the dataset above, I would expect:
df1= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22'],'Mat Group':['1Q18','1Q18','2Q18','FY19','FY20']})
df1
right now I achieve this using a set of loc statements such as :
df.loc[(df['Maturity Date'] >'2018-01-01') & (df['Maturity Date'] <='2018-03-31'),'Mat Group']="1Q18"
df.loc[(df['Maturity Date'] >'2018-04-01') & (df['Maturity Date'] <='2018-06-30'),'Mat Group']="2Q18"
I was wondering if there is a more elegant way to achieve the same result? Perhaps have the buckets in a list and parse through the list so that the bucketing can be made more flexible ?
A bit specific. I would use:
- the strftime format %y to get the two-digit year
- the built-in pandas quarter attribute to get the quarter
- the Python format function to construct the strings
- a lambda to apply it to the column
Here is the result. Maybe there is a better answer, but this one is pretty concise.
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])  # the sample frame stores dates as strings
df['Mat Group'] = df['Maturity Date'].apply(
    lambda x: '{}Q{:%y}'.format(x.quarter, x) if x.year < 2019
    else 'FY{:%y}'.format(x))
df
# Amount Maturity Date type Mat Group
# 0 10 2018-01-22 Asset 1Q18
# 1 -10 2018-02-22 Liability 1Q18
# 2 20 2018-06-22 Asset 2Q18
# 3 -20 2019-06-22 Liability FY19
# 4 5 2020-01-22 Asset FY20
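The question also asks about keeping the buckets in a list; one way to sketch that is with pd.cut, where the bin edges and labels live in plain lists (the edges below are assumptions chosen to match the example data):

```python
import pandas as pd

df = pd.DataFrame({'Maturity Date': pd.to_datetime(
    ['2018-01-22', '2018-02-22', '2018-06-22', '2019-06-22', '2020-01-22'])})

# Changing the bucketing only means editing these two lists;
# len(labels) must equal len(edges) - 1.
edges = pd.to_datetime(['2018-01-01', '2018-03-31', '2018-06-30',
                        '2018-09-30', '2018-12-31', '2019-12-31', '2020-12-31'])
labels = ['1Q18', '2Q18', '3Q18', '4Q18', 'FY19', 'FY20']

df['Mat Group'] = pd.cut(df['Maturity Date'], bins=edges, labels=labels)
print(df['Mat Group'].tolist())  # ['1Q18', '1Q18', '2Q18', 'FY19', 'FY20']
```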

Efficient way of converting String column to Date in Pandas (in Python), but without Timestamp

I have a DataFrame which contains two string columns, df['month'] and df['year']. I want to create a new column df['date'] by combining the month and year columns. I have done that successfully using the structure below -
df['date']=pd.to_datetime((df['month']+df['year']),format='%m%Y')
where by for df['month'] = '08' and df['year']='1968'
we get df['date']=1968-08-01
This is exactly what I wanted.
Problem at hand: My DataFrame has more than 200,000 rows and I notice that sometimes, in addition, I also get Timestamp like the one below for a few rows and I want to avoid that -
1972-03-01 00:00:00
I solved this issue by using the .dt accessor, which can be used to manipulate the Series; with it I explicitly extracted only the date, using the code below -
df['date'] = pd.to_datetime((df['month'] + df['year']), format='%m%Y') #Line 1
df['date'] = df['date'].dt.date #Line 2
The problem was solved, except that Line 2 took five times longer than Line 1.
Question: Is there any way where I could tweak Line 1 into giving just the dates and not the Timestamp? I am sure this simple problem cannot have such an inefficient solution. Can I solve this issue in a more time and resource efficient manner?
AFAIK we don't have a date dtype in Pandas, we only have datetime, so there will always be a time part.
Even though Pandas displays 1968-08-01, it still has a time part: 00:00:00.
Demo:
In [32]: df = pd.DataFrame(pd.to_datetime(['1968-08-01', '2017-08-01']), columns=['Date'])
In [33]: df
Out[33]:
Date
0 1968-08-01
1 2017-08-01
In [34]: df['Date'].dt.time
Out[34]:
0 00:00:00
1 00:00:00
Name: Date, dtype: object
And if you want to have a string representation, there is a faster way:
df['date'] = df['year'].astype(str) + '-' + df['month'].astype(str) + '-01'
UPDATE: be aware that .dt.date gives you Python datetime.date objects stored in an object-dtype column, not a native Pandas dtype:
In [53]: df.dtypes
Out[53]:
Date datetime64[ns]
dtype: object
In [54]: df['new'] = df['Date'].dt.date
In [55]: df
Out[55]:
Date new
0 1968-08-01 1968-08-01
1 2017-08-01 2017-08-01
In [56]: df.dtypes
Out[56]:
Date datetime64[ns]
new object # <--- NOTE !!!
dtype: object
