Obtain First Entry for Each Month/Year Pair - python-3.x

I wish to obtain the first entry of each month/year pair. I was thinking of structuring a groupby method but am unsure of how that would play out given the order of precedence.
Date     Seconds
2020-05  2748.03
2020-05  2748.25
2020-05  2777.72
...          ...
1997-12   100.22
1997-12    66.66
1997-11    54.53
1997-11    92.11
1997-11    42.52
1997-10   155.22
1997-10   115.03
Thanks!

One option is groupby().head(1):
# change `date` to your year/month column name
df.groupby('date', sort=False).head(1)
or drop_duplicates, which keeps the first row for each date by default:
df.drop_duplicates('date')
Output (using the sample data above; middle rows elided):
      Date  Seconds
0  2020-05  2748.03
       ...      ...
   1997-10   155.22
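Both approaches keep whichever row happens to come first in the DataFrame, so make sure the rows are already ordered the way you want before taking the head. A minimal sketch, assuming a hypothetical full-resolution timestamp column named 'timestamp':
# sort by the full timestamp so "first" is well defined, then keep one row per month
df = df.sort_values('timestamp')
first_rows = df.groupby('date', sort=False).head(1)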

I will assume that this is a list of strings like so:
dates = [
    "2020-05 2748.03",
    ...
    "1997-10 115.03",
]
In order to group, you first need to split each string into a year/month part and a seconds part, like so:
dates = [single_date.split(" ") for single_date in dates]
The dates list is now:
[
    ["2020-05", "2748.03"],
    ...
    ["1997-10", "115.03"],
]
Now you can build the DataFrame. Note that passing dtype=float for the whole frame would fail on the "2020-05" strings, so convert only the seconds column:
import pandas as pd
df = pd.DataFrame(dates, columns=['year_month', 'seconds'])
df['seconds'] = df['seconds'].astype(float)
Now let's group by year_month and take the first entry in each group (note: .min() would give you the smallest value, not the first one):
first_entries_per_month_year = df.groupby("year_month", sort=False).first()
Hope that helped
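Putting it all together, a minimal runnable sketch of the approach above (the input list is the assumed list-of-strings form):
import pandas as pd

dates = [
    "2020-05 2748.03",
    "2020-05 2748.25",
    "1997-10 155.22",
    "1997-10 115.03",
]

# split each string into a (year_month, seconds) pair
pairs = [single_date.split(" ") for single_date in dates]
df = pd.DataFrame(pairs, columns=["year_month", "seconds"])
df["seconds"] = df["seconds"].astype(float)

# keep the first entry of each month/year pair, preserving input order
first_entries_per_month_year = df.groupby("year_month", sort=False).first()
print(first_entries_per_month_year)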

Related

How to convert data frame for time series analysis in Python?

I have a dataset of around 13,000 rows and 2 columns (text and date) covering a two-year period. One of the columns is a date in yyyy-mm-dd format. I want to perform time series analysis where the x axis would be the date (each day) and the y axis would be the frequency of text on the corresponding date.
I think if I create a new data frame with unique dates and the number of text entries on each corresponding date, that would solve my problem.
[Sample data was shown as an image in the original post.]
How can I create a new column with the frequency of text for each day?
Thanks in Advance!
Depending on the task you are trying to solve, I can see two options for this dataset.
Either, as you show in your example, count the number of occurrences of the text field each day, independently of the value of the text field.
Or, count the number of occurrences of each unique value of the text field each day. You will then have one column for each possible value of the text field, which may make more sense if the values are purely categorical.
First things to do:
import pandas as pd
df = pd.DataFrame(data={'Date':['2018-01-01','2018-01-01','2018-01-01', '2018-01-02', '2018-01-03'], 'Text':['A','B','C','A','A']})
df['Date'] = pd.to_datetime(df['Date']) #convert to datetime type if not already done
Date Text
0 2018-01-01 A
1 2018-01-01 B
2 2018-01-01 C
3 2018-01-02 A
4 2018-01-03 A
Then for option one:
df = df.groupby('Date').count()
Text
Date
2018-01-01 3
2018-01-02 1
2018-01-03 1
For option two:
df[df['Text'].unique()] = pd.get_dummies(df['Text'])
df = df.drop('Text', axis=1)
df = df.groupby('Date').sum()
A B C
Date
2018-01-01 1 1 1
2018-01-02 1 0 0
2018-01-03 1 0 0
The get_dummies function creates one column per possible value of the Text field. Each column is then a boolean indicator for each row of the dataframe, telling us which value of the Text field occurred in that row. We can then simply do a sum aggregation with a groupby on the Date field.
If you are not familiar with groupby and aggregation operations, I recommend reading the pandas groupby guide first.
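As an aside, option two can also be done in a single call with pd.crosstab, which tabulates how often each value of one column occurs for each value of another (a sketch using the original df from above, before the get_dummies step):
# count occurrences of each Text value per Date in one call
counts = pd.crosstab(df['Date'], df['Text'])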

Calculating index returns over a specific timeframe for each row, using (as one option) a for loop

I am not too experienced with programming, and I got stuck in a research project in the asset management field.
My Goal:
I have 2 dataframes: one containing, among other columns, "European short date", "SP150030after" and "SP1500365before" (screenshot), and a second containing the columns "Dates" and "S&P 1500_return" (screenshot). For each row in the first dataframe, I want to calculate the cumulative return of the S&P 1500 over the 365 days before the date in "European short date" and the cumulative return over the 30 days after that date, and put the results in the "SP1500365before" and "SP150030after" columns.
These returns are to be calculated using the second dataframe: for each date, the "S&P 1500_return" column holds the daily return of the S&P 1500 market index plus 1. So, for example, to get the cumulative return over the year before 31.12.2020 in the first dataframe, I would calculate the product of the values in "S&P 1500_return" from the second dataframe for each (trading) day present in dataframe 2 during the period 31.12.2019 - 30.12.2020.
What I have tried so far:
I turned "European short date" in dataframe 1 and "Date" in dataframe 2 into index fields and thought about approaching my goal with a "for" loop. I tried turning "European short date" into a list to iterate through dataframe 1, but I get the following warning: "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead".
Here is my code so far:
Main_set = pd.read_excel('...')
Main_set = pd.DataFrame(Main_set)
Main_set['European short date'] = pd.to_datetime(Main_set['European short date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Main_set = Main_set.set_index('European short date')
Main_set.head(5)
Indexes = pd.read_excel('...')
Indexes = pd.DataFrame(Indexes)
Indexes['Date'] = pd.to_datetime(Indexes['Date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Indexes = Indexes.set_index('Date')
SP1500DailyReturns = Indexes[['S&P 1500 SUPER COMPOSITE - PRICE INDEX']]
SP1500DailyReturns['S&P 1500_return'] = (SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'] / SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'].shift(1))
SP1500DailyReturns.to_csv('...')
import math
import numpy as np

Main_set['SP50030after'] = np.zeros(326)
dates = Main_set.index.to_list()  # 'European short date' was made the index above
print(dates[:5])  # a plain list has no .head(); preview it instead
for n in dates:
    Main_set['SP50030after'] = math.prod(arr)  # 'arr' is never defined here - this is where I got stuck
Many thanks in advance!
In case it is useful for someone, I solved the problem by using a for loop and dividing the problem into smaller steps:
for n in dates:
    Date = pd.Timestamp(n)
    DateB4 = Date - pd.Timedelta("365 days")
    DateAfter = Date + pd.Timedelta("30 days")
    ReturnsofPeriodsBackwards = SP1500DailyReturns.loc[str(DateB4) : str(Date), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP500365before'] = np.prod(ReturnsofPeriodsBackwards)
    ReturnsofPeriodsForward = SP1500DailyReturns.loc[str(Date) : str(DateAfter), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP50030after'] = np.prod(ReturnsofPeriodsForward)
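As a side note, the SettingWithCopyWarning quoted in the question most likely comes from assigning the 'S&P 1500_return' column to a slice of Indexes. Taking an explicit copy avoids it (a hedged suggestion, since the exact trigger depends on the surrounding code):
# take a real copy instead of a view of Indexes, so adding a column is safe
SP1500DailyReturns = Indexes[['S&P 1500 SUPER COMPOSITE - PRICE INDEX']].copy()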

Pandas, groupby/Grouper on month ignoring the year

I have the following data in a Pandas df:
index;Aircraft_Registration;issue;Leg_Number;Departure_Time;Departure_Date;Arrival_Time;Arrival_Date;Departure_Airport;Arrival_Airport
0;XXA;0;QQ464;01:07:00;2013-12-01;03:33:00;2013-12-01;JFK;AMS
1;XXA;0;QQQ445;06:08:00;2013-12-01;12:02:00;2013-12-01;AMS;CPT
2;XXA;0;QQQ446;13:04:00;2013-12-01;13:13:00;2013-12-01;JFK;SID
3;XXA;0;QQ446;14:17:00;2013-12-01;20:15:00;2013-12-01;SID;FRA
4;XXA;0;QQ453;02:02:00;2013-12-02;13:09:00;2013-12-02;JFK;BJL
5;XXA;0;QQ150;05:47:00;2018-12-03;12:37:00;2018-03-03;KAO;AMS
6;XXA;0;QQ457;15:09:00;2018-11-03;17:51:00;2018-03-03;AMS;AGP
7;XXA;0;QQ457;08:34:00;2018-12-03;22:47:00;2018-03-03;AGP;JFK
8;XXA;0;QQ458;03:34:00;2018-12-03;23:59:00;2018-03-03;ATL;BJL
9;XXA;0;QQ458;06:26:00;2018-10-04;07:01:00;2018-03-04;BJL;AMS
I want to group this data on the month, ignoring the year, so I would ideally end up with 12 new dataframes, each representing the events of that month regardless of year.
I tried the following:
sort = list(df.groupby(pd.Grouper(freq='M', key='Departure_Date')))
This results in a list containing a dataframe for each month and year combination, in this case yielding 60 groups, many of which are empty since there is no data for that month.
My expected result is a list containing 12 dataframes, one for each month (January, February, etc.).
I think you need dt.month for months as numbers 1-12, or dt.strftime for month names January-December:
sort = list(df.groupby(df['Departure_Date'].dt.month))
Or:
sort = list(df.groupby(df['Departure_Date'].dt.strftime('%B')))
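If you want the result keyed by month rather than as a plain list, a small sketch (assuming Departure_Date has already been parsed with pd.to_datetime):
# build one DataFrame per calendar month that actually occurs in the data
by_month = {month: group for month, group in df.groupby(df['Departure_Date'].dt.month)}
december = by_month[12]
Note that months with no rows simply will not appear as keys, so you get at most 12 entries.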

How can I count categorical columns by month in Pandas?

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store both the combined count of A and B in a column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
    # Strip the BOM character from the date strings and parse them as dates.
    df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
    # Set the date column as the index and drop the now-redundant column in-place.
    df = df.set_index(df.date)
    df.drop('date', inplace=True, axis=1)
    # Group all events by (year, month) and count the category values.
    counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
    counted_events.index.names = ["year", "month", "category"]
    return counted_events
which gives me the following:
year  month  category
2017  1      A    2
             B    1
      2      C    3
             A    1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better; here I combine the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
            a_or_b_count  c_count
Year Month
2017 1               3.0      0.0
     2               1.0      3.0
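The counts come out as floats because unstack introduces NaN for missing month/category combinations and fillna(0) keeps the float dtype. If you want integer counts, you can cast at the end:
# optional: convert the filled counts back to integers
new_df = new_df.astype(int)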

Iterate through CSV and match lines with a specific date

I am parsing a CSV file into a list. Each list item will have a column, list[3], which contains a date in the format mm/dd/yyyy.
I need to iterate through the file and extract only the rows whose dates fall within a specific range.
For example, I want to extract all rows for the month of 12/2015. I am having trouble determining how to match the date. Any nudging in the right direction would be helpful.
Thanks.
Method 1:
Split the column into month, day and year, convert month and year to integers, and then compare against 12/2015:
column3 = "12/31/2015"
month, day, year = column3.split("/")
if int(month) == 12 and int(year) == 2015:
    ...  # do your thing
Method 2:
Parse the date string into a time object and read its tm_year and tm_mon attributes, then compare them with the month and year you want:
>>> import time
>>> to = time.strptime("12/03/2015", "%m/%d/%Y")
>>> to.tm_mon
12
>>> to.tm_year
2015
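Putting Method 1 into the CSV loop, a minimal sketch (the filename and the row[3] position are assumptions based on the question):
import csv

matched = []
with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        month, day, year = row[3].split('/')
        if int(month) == 12 and int(year) == 2015:
            matched.append(row)  # keep only rows from December 2015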
