Extract row from pandas dataframe - python-3.x

I have a data frame like the one in the image below. I want to extract the rows whose year and month are '1395/01'. I used the code below, but I know it is not correct because we cannot apply a string slice to a Series of strings this way. Can anyone show me a way to do this without using nested for loops?
df[df['Date'][:7] == '1395/01']

I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious.

You can use loc and startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05
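To illustrate why the attempt in the question does not work, here is a minimal sketch using the same sample data. Slicing the Series directly selects rows, while the .str accessor slices each string. The variable names row_slice, prefix and result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})

# Slicing the Series itself picks the first 7 ROWS (all 4 here), not 7 characters.
row_slice = df['Date'][:7]

# The .str accessor applies the slice to every string element instead.
prefix = df['Date'].str[:7]

# Comparing the prefixes gives a boolean mask that can filter the frame.
result = df[prefix == '1395/01']
print(result)
```

This is equivalent in effect to the startswith approach above; which one reads better is mostly a matter of taste.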

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and to pivot it, but can't get it the way I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point from 00:00 - 20:00 for each day. So I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
From the details you have provided, I assume you are dealing with time series data acquired on different dates at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from each timestamp because %f in strptime accepts at most six fractional digits, and these timestamps have nine. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty df in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # this creates a row of 3 elems, which will be inserted into empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # here we find data for 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # here we find data for 03:00 same day
    new_df.loc[len(new_df)] = new_row_data  # now we insert row to last pos
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
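As an alternative to the loop above (not the method used in this answer), a pivot on separate date and time columns should scale to any number of time slots per day. This is only a sketch: it assumes each (date, time) pair occurs once, and the helper column names "date" and "time" are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-04-29T02:00:00", "2022-04-29T03:00:00",
        "2022-05-29T02:00:00", "2022-05-29T03:00:00",
    ]),
    "value": [5, 6, 5, 7],
})

# Split the timestamp into a calendar date and an HH:MM label.
df["date"] = df["datetime"].dt.date
df["time"] = df["datetime"].dt.strftime("%H:%M")

# One row per date, one column per time label.
# pivot raises an error if a (date, time) pair is duplicated.
wide = df.pivot(index="date", columns="time", values="value").reset_index()
wide.columns.name = None  # drop the leftover "time" label on the column axis
print(wide)
```

If the real data can contain duplicate timestamps, pivot_table with an aggregation function would be the usual fallback.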

How to convert data frame for time series analysis in Python?

I have a dataset of around 13000 rows and 2 columns (text and date) for two year period. One of the column is date in yyyy-mm-dd format. I want to perform time series analysis where x axis would be date (each day) and y axis would be frequency of text on corresponding date.
I think if I create a new data frame with unique dates and number of text on corresponding date that would solve my problem.
Sample data
How can I create a new column with frequency of text each day? For example:
Thanks in Advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of rows on each day, regardless of the value of the text field.
Or count the number of occurrences of each unique value of the text field on each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First, set up the data:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one :
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
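For option two:
# start again from the original df (before the groupby of option one)
df = df.join(pd.get_dummies(df['Text'], dtype=int))  # join keeps each indicator column under its own label
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()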
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function will create one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in this row. We can then simply make a sum aggregation with a groupby by the Date field.
If you are not familiar with groupby and aggregation operations, I recommend that you read this guide first.
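For comparison, here is a minimal sketch of the same two counts using groupby(...).size() and pd.crosstab instead of get_dummies. It reuses the sample data from above; the names daily_counts and per_value are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={
    'Date': ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03'],
    'Text': ['A', 'B', 'C', 'A', 'A'],
})
df['Date'] = pd.to_datetime(df['Date'])

# Option one: number of rows per day, whatever the text value is.
daily_counts = df.groupby('Date').size()

# Option two: one column per distinct text value, counting occurrences per day.
per_value = pd.crosstab(df['Date'], df['Text'])

print(daily_counts)
print(per_value)
```

Note that days with no rows at all do not appear in either result, so for a plot over every calendar day you may still need to reindex onto a full date range.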

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, then the following should help you. It assumes, however, that you always have data from 12 months earlier. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Get the most recent date and add an offset.
start, end = df.index[-1] + pd.DateOffset(), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Create your DataFrame
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
Then just use:
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
Slightly different variation
# Create series with missing months added.
# Get the corresponding data 12 months prior.
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe
df = df.reindex(s)
# Find the dates to fill and replace with lagged data
df.iloc[-1 * fill.shape[0]:] = fill.values
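Here is a self-contained sketch of the same idea that can be run and checked. It swaps pd.Timestamp.now() for a fixed, hypothetical "today" (2020-04-15) so the result is reproducible, and it fills the missing rows by label rather than by position. The names today, full_index, missing and extended are my own.

```python
import pandas as pd

data = [[i, i + 1] for i in range(14)]
index = pd.date_range(start="2019-01-01", end="2020-02-01", freq="MS")
df = pd.DataFrame(data, index=index, columns=["col1", "col2"])

# Hypothetical fixed date standing in for pd.Timestamp.now().
today = pd.Timestamp("2020-04-15")

# Every month start from the first row up to "today".
full_index = pd.date_range(df.index[0], today, freq="MS")

# Month starts that the original frame does not contain yet.
missing = full_index.difference(df.index)

# Add empty rows, then copy in the values from exactly one year earlier.
# This raises a KeyError if a month from the previous year is absent.
extended = df.reindex(full_index)
extended.loc[missing] = df.loc[missing - pd.DateOffset(years=1)].values
print(extended.tail(4))
```

Because reindex introduces NaN before the fill, the columns end up as floats; cast back with astype(int) if that matters.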

PANDAS date summarisation

I have a pandas dataframe that looks like:
import pandas as pd
df= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22']})
df
I want to aggregate the dates so it shows the first four quarters and then the year end. For the dataset above, I would expect:
df1= pd.DataFrame({'type':['Asset','Liability','Asset','Liability','Asset'],'Amount':[10,-10,20,-20,5],'Maturity Date':['2018-01-22','2018-02-22','2018-06-22','2019-06-22','2020-01-22'],'Mat Group':['1Q18','1Q18','2Q18','FY19','FY20']})
df1
Right now I achieve this using a set of loc statements such as:
df.loc[(df['Maturity Date'] >'2018-01-01') & (df['Maturity Date'] <='2018-03-31'),'Mat Group']="1Q18"
df.loc[(df['Maturity Date'] >'2018-04-01') & (df['Maturity Date'] <='2018-06-30'),'Mat Group']="2Q18"
I was wondering if there is a more elegant way to achieve the same result? Perhaps have the buckets in a list and parse through the list so that the bucketing can be made more flexible ?
This is a bit specific, but I would use:
the strftime format %y to get the two-digit year,
the pandas built-in quarter attribute to get the quarter,
the Python format function to construct the strings,
a lambda to apply it to the column.
Here is the result. Maybe there is a better answer, but this one is pretty concise.
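df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])  # the lambda needs Timestamp values, not strings
df['Mat Group'] = df['Maturity Date'].apply(
    lambda x: '{}Q{:%y}'.format(x.quarter, x) if x.year < 2019
    else 'FY{:%y}'.format(x))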
df
# Amount Maturity Date type Mat Group
# 0 10 2018-01-22 Asset 1Q18
# 1 -10 2018-02-22 Liability 1Q18
# 2 20 2018-06-22 Asset 2Q18
# 3 -20 2019-06-22 Liability FY19
# 4 5 2020-01-22 Asset FY20
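A vectorised variant of the same bucketing, without apply, might look like the sketch below. It is not the method from this answer, just the same rule expressed with the .dt accessor and numpy.where. The name cutoff_year is a hypothetical parameter marking the first year that is reported as a full year.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['Asset', 'Liability', 'Asset', 'Liability', 'Asset'],
    'Amount': [10, -10, 20, -20, 5],
    'Maturity Date': ['2018-01-22', '2018-02-22', '2018-06-22', '2019-06-22', '2020-01-22'],
})

dates = pd.to_datetime(df['Maturity Date'])
yy = dates.dt.strftime('%y')  # two-digit year, e.g. '18'

# Labels for both schemes, computed for every row.
quarterly = dates.dt.quarter.astype(str) + 'Q' + yy
yearly = 'FY' + yy

# Hypothetical cutoff: years before it get quarter labels, later years get FY labels.
cutoff_year = 2019
df['Mat Group'] = np.where(dates.dt.year < cutoff_year, quarterly, yearly)
print(df)
```

Changing cutoff_year is then enough to move the boundary between quarterly and yearly buckets.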

Pandas Equivalent to SQL YEAR(GETDATE())

I'm a Pandas newbie but decent at SQL. A function I often leverage in SQL is this:
YEAR(date_format_data) = (YEAR(GETDATE())-1)
This will get me all the data from last year. Can someone please help me understand how to do the equivalent in Pandas?
Here's some example data:
Date Number
01/01/15 1
01/02/15 2
01/01/15 3
01/01/16 2
01/01/16 1
And here's my best guess at the code:
df = df[YEAR('Date') == (YEAR(GETDATE()) -1)].agg(['sum'])
And this code would return a value of '3'.
Thank you in advance for your help, I'm having a really hard time figuring out what I'm sure is simple.
I think you can do it this way:
prev_year = pd.Timestamp.today().year - 1  # pd.datetime was removed in pandas 2.0
df.loc[df['Date'].dt.year == prev_year]
PS: the .dt.year accessor works only if the Date column has a datetime dtype. If that is not the case, you may want to convert the column to a datetime dtype first:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
For pandas, first convert your date column to timestamp by pd.to_datetime
df['Date2'] = pd.to_datetime(df['Date'])
(pd.to_datetime has a format parameter to specify your input date format.) Then you can get the year through the .dt accessor:
df['Date2'].dt.year
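Putting the pieces together, a minimal end-to-end sketch on the example data could look like this. It assumes the dates are in month/day/two-digit-year order, and sum_for_previous_year is a hypothetical helper whose today argument exists only so the result can be reproduced for a fixed date.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/15', '01/02/15', '01/01/15', '01/01/16', '01/01/16'],
    'Number': [1, 2, 3, 2, 1],
})

# Assumption: month/day/two-digit-year order.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')


def sum_for_previous_year(frame, today=None):
    """Sum 'Number' over rows dated in the year before `today` (hypothetical helper)."""
    today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
    prev_year = today.year - 1
    return frame.loc[frame['Date'].dt.year == prev_year, 'Number'].sum()


# With a reference date in 2017, "last year" is 2016.
total = sum_for_previous_year(df, today='2017-06-01')
print(total)
```

Calling sum_for_previous_year(df) without the second argument uses the current date, which is the closest match to YEAR(GETDATE()) - 1.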
