Iterate through CSV and match lines with a specific date - python-3.x

I am parsing a CSV file into a list. Each list item will have a column, list[3], which contains a date in the format mm/dd/yyyy.
I need to iterate through the file and extract only the rows that fall within a specific date range.
For example, I want to extract all rows for the month 12/2015. I am having trouble determining how to match the date. Any nudge in the right direction would be helpful.
Thanks.

Method 1:
Split your column into month, day and year, convert month and year to integers, and then compare against 12/2015:
column3 = "12/31/2015"
month, day, year = column3.split("/")
if int(month) == 12 and int(year) == 2015:
    pass  # do your thing
Method 2:
Parse the date string into a time.struct_time object, read its tm_mon and tm_year attributes, and compare them with the desired month and year:
>>> import time
>>> to = time.strptime("12/03/2015", "%m/%d/%Y")
>>> to.tm_mon
12
>>> to.tm_year
2015
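Putting either method together with the csv module, a minimal sketch of the full loop (the file name events.csv is a placeholder; the date is assumed to sit at index 3, as in the question):
import csv

matched = []
with open("events.csv", newline="") as f:  # placeholder file name
    for row in csv.reader(f):
        month, day, year = row[3].split("/")  # date column assumed at index 3
        if int(month) == 12 and int(year) == 2015:
            matched.append(row)
print(len(matched), "rows matched for 12/2015")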

Related

Obtain First Entry for Each Month/Year Pair

I wish to obtain the first entry of each month/year pair. I was thinking of structuring a groupby method but am unsure of how that would play out given the order of precedence.
   Date    Seconds
2020-05    2748.03
2020-05    2748.25
2020-05    2777.72
    ...        ...
1997-12     100.22
1997-12      66.66
1997-11      54.53
1997-11      92.11
1997-11      42.52
1997-10     155.22
1997-10     115.03
Thanks!
This is groupby().head:
# change `date` to your year/month column name
df.groupby('date', sort=False).head(1)
or drop_duplicates:
df.drop_duplicates('date')
Output:
      date  Seconds
0  2020-05  2748.03
...
   1997-12   100.22
   1997-11    54.53
   1997-10   155.22
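Note that both approaches keep the first row encountered in each group, so they rely on the first entry for each month/year pair appearing first in the frame; drop_duplicates makes this explicit through its keep parameter:
df.drop_duplicates('date', keep='first')  # 'first' is already the default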
I will assume that this is a list of strings like so:
dates = [
    "2020-05 2748.03",
    ...
    "1997-10 115.03"
]
In order to group by year and month, you first need to split each string into a year_month column and a seconds column, like so:
dates = [single_date.split(" ") for single_date in dates]
dates list is now:
[
    ["2020-05", "2748.03"],
    ...
    ["1997-10", "115.03"],
]
Now build the DataFrame (dtype=float can't be applied to the whole frame, since the year_month column holds strings):
df = pd.DataFrame(dates, columns=['year_month', 'seconds'])
df['seconds'] = df['seconds'].astype(float)  # only this column is numeric
Now let's group by year_month and take the first entry in each group (the question asks for the first entry, not the minimum):
first_entries_per_month_year = df.groupby("year_month", sort=False).first()
Hope that helps.

Python: How to create an array of datetime, and extract the corresponding year, month, day, hour for each index in array before binning

I can create a list of datetimes between 1994 and 2020 as follows:
from datetime import datetime, timedelta
# Create datetime for plotting
start_date = datetime(1994,1,1)
start_date_yr = start_date.year
time = [start_date + timedelta(days=x) for x in range(9553)]
'time' is a list, and is useful for plotting my 'y' data as a function of time.
My 'y' data is a pandas series with dimension (9553,) containing some NaNs.
However, I want to plot my 'y' data as a function of day of the year, or month, or year. In MATLAB, I would use the function 'datevec' to get the corresponding years, months, and days with the same dimension (9553,) from the variable 'time'.
I want to bin my 'y' data to get the annual cycle (either for each day of the year or each month), and the yearly averages (using all data corresponding to a given year).
How can I obtain a time array (datetime, year, month, day) with dimension (9553,), and how can I bin my 'y' data?
Make a list of tuples:
[(datetime1, year1, month1, day1), (datetime2, year2, month2, day2), (datetime3, year3, month3, day3)]
mydates = []
for t in time:
    mydates.append((t, t.strftime('%Y'), t.strftime('%m'), t.strftime('%d')))
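For the binning part of the question, pandas can do the grouping directly once the datetimes are used as the index. A minimal sketch, assuming y is the pandas Series mentioned in the question and time is the list built above:
import pandas as pd

# align the y data with a DatetimeIndex built from the list of datetimes
s = pd.Series(y.to_numpy(), index=pd.DatetimeIndex(time))

yearly_means = s.groupby(s.index.year).mean()      # yearly averages (NaNs are skipped)
monthly_cycle = s.groupby(s.index.month).mean()    # annual cycle, one value per month
daily_cycle = s.groupby(s.index.dayofyear).mean()  # annual cycle, one value per day of year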

Pandas, groupby/Grouper on month ignoring the year

I have the following data in a Pandas df:
index;Aircraft_Registration;issue;Leg_Number;Departure_Time;Departure_Date;Arrival_Time;Arrival_Date;Departure_Airport;Arrival_Airport
0;XXA;0;QQ464;01:07:00;2013-12-01;03:33:00;2013-12-01;JFK;AMS
1;XXA;0;QQQ445;06:08:00;2013-12-01;12:02:00;2013-12-01;AMS;CPT
2;XXA;0;QQQ446;13:04:00;2013-12-01;13:13:00;2013-12-01;JFK;SID
3;XXA;0;QQ446;14:17:00;2013-12-01;20:15:00;2013-12-01;SID;FRA
4;XXA;0;QQ453;02:02:00;2013-12-02;13:09:00;2013-12-02;JFK;BJL
5;XXA;0;QQ150;05:47:00;2018-12-03;12:37:00;2018-03-03;KAO;AMS
6;XXA;0;QQ457;15:09:00;2018-11-03;17:51:00;2018-03-03;AMS;AGP
7;XXA;0;QQ457;08:34:00;2018-12-03;22:47:00;2018-03-03;AGP;JFK
8;XXA;0;QQ458;03:34:00;2018-12-03;23:59:00;2018-03-03;ATL;BJL
9;XXA;0;QQ458;06:26:00;2018-10-04;07:01:00;2018-03-04;BJL;AMS
I want to group this data on the month, ignoring the year, so ideally I would end up with 12 new dataframes, each representing the events of that month across all years.
I tried the following:
sort = list(df.groupby(pd.Grouper(freq='M', key='Departure_Date')))
This results in a list containing a dataframe for each month and year combination, in this case yielding 60 groups, many of which are empty since there is no data for that month.
My expected result is a list containing 12 dataframes, one for each month (January, February, etc.).
I think you need dt.month for months 1-12, or dt.strftime for January-December:
sort = list(df.groupby(df['Departure_Date'].dt.month))
Or:
sort = list(df.groupby(df['Departure_Date'].dt.strftime('%B')))
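One caveat: the .dt accessor only works on a datetime column, so if Departure_Date came out of read_csv as plain strings it needs converting first:
# only needed if the column is still object (string) dtype
df['Departure_Date'] = pd.to_datetime(df['Departure_Date'])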

Creating a daily account log from a Pandas expense file in data frame format

I have an expense file that I am trying to read in and from this file create a daily log. A small subset of the file that extends over years is shown below, for a few days in January 2015.
Date,Checking_Debit,Checking_Addition,Savings_Debit,Savings_Addition
2015-01-07,342.1,0.0,0.0,0.0
2015-01-07,981.0,0.0,0.0,0.0
2015-01-07,3185.0,0.0,0.0,0.0
2015-01-05,55.0,0.0,0.0,0.0
2015-01-05,75.0,0.0,0.0,0.0
2015-01-03,287.0,0.0,0.0,0.0
2015-01-02,64.8,0.0,0.0,0.0
2015-01-02,75.0,0.0,0.0,75.0
2015-01-02,1280.0,0.0,0.0,0.0
2015-01-02,245.0,0.0,0.0,0.0
2015-01-01,45.0,0.0,0.0,0.0
In my code I start with the variables check_start and savings_start that contain the start values of the checking and savings accounts. I would like to give the code a start date and an end date and have it iterate through each day: if there was an expense on that day, subtract the checking and savings debits and add the checking and savings additions; if there were no expenses on that day, keep the accounts at the same value as the previous day. In addition, I am trying to constrain myself to Pandas data frames in the implementation. So far my code looks like this:
import pandas as pd
from datetime import date

check_start = 8500.0
savings_start = 4000.0
start_date = date(2017, 1, 1)
end_date = date(2017, 1, 8)
df = pd.read_csv('file_name.csv', dtype={'Date': str, 'Checking_Debit': float,
                                         'Checking_Addition': float,
                                         'Savings_Debit': float,
                                         'Savings_Addition': float})
In a Pythonic format with the Pandas module, how do I walk through from the start date to the end date, one day at a time, see if there is an expense or expenses on that date, and then subtract that from the checking and savings? At the end I should have an array with the value of the checking account on each date, and the same for the savings account.
The result should be arrays written into another .csv file with the following format.
Date,Checking,Savings
2017-01-07,1865.1,3925.0
2017-01-06,6373.2,3925.0
2017-01-05,6373.2,3925.0
2017-01-04,6503.2,3925.0
2017-01-03,6503.2,3925.0
2017-01-02,6790.2,3925.0
2017-01-01,8455.0,4000.0
Start by reading in the data you provided, parsing the date column as datetimes:
import pandas as pd
df = pd.read_csv(r"dat.csv", parse_dates=[0],dtype={'Checking_Debit': float,
'Checking_Addition': float,
'Savings_Debit': float,
'Savings_Addition': float})
Set Date as index for better data manipulation.
df = df.set_index("Date")
Initialize all the variables for the loop
check_start = 8500.0
savings_start = 4000.0
start_date = pd.to_datetime('2015/1/1')
end_date = pd.to_datetime('2015/1/8')
delta = pd.Timedelta('1 days') # time that needs to be added to start date
Now group the expense data by date, so that multiple expenses on the same day are summed:
grp_df = df.groupby('Date').sum()
Now run a while loop to create the expense report for each day:
expense_report = []
while start_date <= end_date:
    if start_date in grp_df.index:
        savings_start += (grp_df.loc[start_date, "Savings_Addition"] - grp_df.loc[start_date, "Savings_Debit"])
        check_start += (grp_df.loc[start_date, "Checking_Addition"] - grp_df.loc[start_date, "Checking_Debit"])
    expense_report.append([start_date, check_start, savings_start])  # balances carry over on days with no expenses
    start_date += delta
Convert the expense_report list to a pandas DataFrame:
df_exp_rpt = pd.DataFrame(expense_report, columns=["Date", "Checking", "Savings"])
print(df_exp_rpt)
Date Checking Savings
0 2015-01-01 8455.0 4000.0
1 2015-01-02 6790.2 4075.0
2 2015-01-03 6503.2 4075.0
3 2015-01-04 6503.2 4075.0
4 2015-01-05 6373.2 4075.0
5 2015-01-06 6373.2 4075.0
6 2015-01-07 1865.1 4075.0
7 2015-01-08 1865.1 4075.0
You can save it to CSV with:
df_exp_rpt.to_csv("filename.csv")
Note: the Savings values are 4075.0 instead of 3925.0 because the original data has a 75.0 entry in the Savings_Addition column.
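As a side note, the same report can be produced without an explicit Python loop. A sketch of a vectorized alternative, assuming the grp_df and starting balances from above (the date range is rebuilt here because the while loop mutated start_date):
# reindex onto the full date range so days with no expenses become 0
idx = pd.date_range('2015/1/1', '2015/1/8')
daily = grp_df.reindex(idx, fill_value=0.0)

report = pd.DataFrame({
    "Checking": 8500.0 + (daily["Checking_Addition"] - daily["Checking_Debit"]).cumsum(),
    "Savings": 4000.0 + (daily["Savings_Addition"] - daily["Savings_Debit"]).cumsum(),
})
report.index.name = "Date"
report.to_csv("filename.csv")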

Summing all entries in a csv that contain all or a portion of a string

Right now I'm trying to sum the number of entries that fall within a given date range in an arbitrary column of this (subarray of a) csv. There are 3 date columns in total, and I want to be able to look at any of them and their respective entries:
(labels: id, invoice number, appt date, completion date, invoice amount, last appointment date)
(label 1, label 2, label 3, label 4, label 5, label 6)
18565272, 3548587, 2015-12-30 16:30:00, 2017-01-18 4:01:00, 0, 11/30/2016
22909611, 2000404134, 2016-05-18 14:55:00, 2017-01-26 16:59:00, 0, NULL
21541501, 1166588, 2016-07-07 17:00:00, 2017-02-14 4:01:00, 84, 4/11/2016
1000141115,1429670, 2016-10-29 0:06:00, 2017-01-18 21:43:00, 49, 3/2/2016
I'd like to be able to define a column and then compute the number of times a date appears that lies within a range, say "January 1-30 2016". I'm not really experienced with methods related to this (most of my Python experience is on the numerical computation side). I have a few ideas at present (for instance, using pandas to remove rows that do not contain a given entry and then counting the remaining rows), but I'm looking for approaches that work a lot better.
Try using pandas.
import pandas as pd

df = pd.read_csv(your_file)  # read the data

def date_range_counter(column, start_date, end_date):
    dates_range = pd.date_range(start_date, end_date)  # all dates between start_date and end_date
    # parse the column and normalize to midnight so timestamps with a time part still match whole dates
    dates = pd.to_datetime(df[column]).dt.normalize()
    arr = df[dates.isin(dates_range)]  # keep only rows whose date falls in the range
    return len(arr)
For start_date and end_date you can use strings in the format 'YYYY/MM/DD', and the column input should be the string label of the column you want to count dates from, e.g. 'Label 3'.
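For example, counting the January 2016 entries in the appointment-date column might look like this ('Label 3' and the dates are illustrative):
print(date_range_counter('Label 3', '2016/01/01', '2016/01/30'))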
