Summing all entries in a csv that contain all or portion of a string - python-3.x

Right now I'm trying to sum the number of entries that fall within a given date range in an arbitrary column of this (sub-array of a) csv. There are 3 date columns in total, and I want to be able to look at any one of them and its respective entries:
(labels: id, invoice number, appt date, completion date, invoice amount, last appointment date)
(Label 1, Label 2, Label 3, Label 4, Label 5, Label 6)
18565272, 3548587, 2015-12-30 16:30:00, 2017-01-18 4:01:00, 0, 11/30/2016
22909611, 2000404134, 2016-05-18 14:55:00, 2017-01-26 16:59:00, 0, NULL
21541501, 1166588, 2016-07-07 17:00:00, 2017-02-14 4:01:00, 84, 4/11/2016
1000141115,1429670, 2016-10-29 0:06:00, 2017-01-18 21:43:00, 49, 3/2/2016
I'd like to be able to define a column and then compute the number of times a date appears that lies within a range, say "January 1-30, 2016". I'm not really experienced with methods related to this (most of my Python experience is on the numerical computation side). I have a few ideas at present (using pandas to remove rows that do not contain a given entry and then summing the row count, for instance), but I'm looking for approaches that would work better.

Try using pandas.
import pandas as pd

df = pd.read_csv(your_file)  # read the data

def date_range_counter(column, start_date, end_date):
    # list of daily dates between start_date and end_date
    dates_range = pd.date_range(start_date, end_date)
    # normalize() strips the time of day, so timestamps like
    # '2016-05-18 14:55:00' can match the (midnight-based) daily range
    arr = df[pd.to_datetime(df[column]).dt.normalize().isin(dates_range)]
    return len(arr)
For start_date and end_date you can use strings in the format 'YYYY/MM/DD', and the column input should be a string with the label of the column you want to count dates in, e.g. 'Label 3'.
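If the timestamps carry times of day, as in the sample rows, a boolean mask on a parsed datetime column does the same job without building a daily date list first. A minimal sketch on invented data shaped like the sample (the column names here are placeholders, not the real labels):

```python
import pandas as pd
from io import StringIO

# Invented rows mirroring the sample's shape; column names are made up
csv_data = """id,invoice,appt_date
18565272,3548587,2015-12-30 16:30:00
22909611,2000404134,2016-05-18 14:55:00
21541501,1166588,2016-07-07 17:00:00
"""
df = pd.read_csv(StringIO(csv_data), parse_dates=["appt_date"])

def count_in_range(frame, column, start, end):
    """Count rows whose timestamp in `column` falls within [start, end]."""
    mask = (frame[column] >= start) & (frame[column] <= end)
    return int(mask.sum())

print(count_in_range(df, "appt_date", "2016-01-01", "2016-06-30"))  # 1
```

Note that the end bound is compared as midnight, so a timestamp later in the day on the end date needs the bound pushed to the following day (or the column normalized first).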

Related

Calculation of index returns for specific timeframe for each row using (as an option) for loop

I am not too experienced with programming, and I got stuck in a research project in the asset management field.
My Goal:
I have 2 dataframes: one containing, among other columns, "European short date", "SP150030after" and "SP1500365before" (screenshot), and a second containing the columns "Dates" and "S&P 1500_return" (screenshot). For each row in the first dataframe, I want to calculate the cumulative return of the S&P 1500 over the 365 days before the date in "European short date" and over the 30 days after it, and put these results in the columns "SP1500365before" and "SP150030after".
These returns are to be calculated from the second dataframe, whose "S&P 1500_return" column holds, for each date, the daily return of the S&P 1500 market index plus 1. So, for example, to get the cumulative return over the year before 31.12.2020 in the first dataframe, I would calculate the product of the "S&P 1500_return" values in the second dataframe for every trading day present in dataframe 2 during the period 31.12.2019 to 30.12.2020.
What I have tried so far:
I set "European short date" in dataframe 1 and "Date" in dataframe 2 as index fields and thought about approaching my goal with a for loop. I tried to turn "European short date" into a list to iterate through dataframe 1, but I get the following error: "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead".
Here is my code so far:
import pandas as pd
import numpy as np

Main_set = pd.read_excel('...')
Main_set = pd.DataFrame(Main_set)
Main_set['European short date'] = pd.to_datetime(Main_set['European short date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Main_set = Main_set.set_index('European short date')
Main_set.head(5)

Indexes = pd.read_excel('...')
Indexes = pd.DataFrame(Indexes)
Indexes['Date'] = pd.to_datetime(Indexes['Date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Indexes = Indexes.set_index('Date')

SP1500DailyReturns = Indexes[['S&P 1500 SUPER COMPOSITE - PRICE INDEX']]
SP1500DailyReturns['S&P 1500_return'] = (SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'] / SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'].shift(1))
SP1500DailyReturns.to_csv('...')

Main_set['SP50030after'] = np.zeros(326)

import math
dates = Main_set['European short date'].to_list()
for n in dates:
    Main_set['SP50030after'] = math.prod(arr)
Many thanks in advance!
In case it's useful for someone, I solved the problem by using a for loop and splitting the problem into more steps:
for n in dates:
    Date = pd.Timestamp(n)
    DateB4 = Date - pd.Timedelta("365 days")
    DateAfter = Date + pd.Timedelta("30 days")
    ReturnsofPeriodsBackwards = SP1500DailyReturns.loc[str(DateB4) : str(Date), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP500365before'] = np.prod(ReturnsofPeriodsBackwards)
    ReturnsofPeriodsForward = SP1500DailyReturns.loc[str(Date) : str(DateAfter), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP50030after'] = np.prod(ReturnsofPeriodsForward)
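For what it's worth, the per-window products can also be read off a single cumulative product instead of slicing and re-multiplying inside the loop. A sketch with an invented constant-return series standing in for SP1500DailyReturns:

```python
import pandas as pd

# Invented daily "return + 1" series standing in for SP1500DailyReturns
idx = pd.date_range("2020-01-01", periods=10, freq="D")
returns = pd.Series(1.01, index=idx, name="S&P 1500_return")

# With a cumulative product, any window product becomes a ratio:
# prod(returns[a:b]) == cum[b] / cum[a - 1]
cum = returns.cumprod()

def window_product(start, end):
    """Product of the returns over the inclusive window [start, end]."""
    s = returns.index.searchsorted(pd.Timestamp(start))
    e = returns.index.searchsorted(pd.Timestamp(end), side="right") - 1
    prev = cum.iloc[s - 1] if s > 0 else 1.0
    return cum.iloc[e] / prev

# Five trading days at 1% each, so the product is 1.01 ** 5
print(window_product("2020-01-03", "2020-01-07"))
```

This does one pass over the series up front, so each of the 326 rows costs only two index lookups instead of a fresh product over up to a year of returns.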

Finding average age of incidents in a datetime series

I'm new to Stack Overflow and fairly fresh with Python (some 5 months, give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at" which contains datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that fall between two dates, but I believe I want the opposite: to find, for each date row in df2, which rows of df1 it falls between, i.e. where opened_at < date < resolved_at.
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0  # ensure the column exists

for row_num in range(df2.shape[0]):
    df2.at[row_num, "incs_open"] = sum(
        (df1["opened_at"] < df2.at[row_num, "date"]) &
        (df2.at[row_num, "date"] < df1["resolved_at"])
    )
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdness around rounding: if an incident was opened last night at 3am, what is its age "today"? 1 day, 0 days, 9 hours (if we count from noon), and so on. But I assume once you've got code that works you can adjust that to taste.
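Since the question didn't include sample data, here is the loop above run end to end on a few invented incidents (the column names match the question; the dates are made up):

```python
import pandas as pd

# Invented incident data in the shape the question describes
df1 = pd.DataFrame({
    "opened_at":   pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-08"]),
    "resolved_at": pd.to_datetime(["2021-01-10", "2021-01-07", "2021-01-20"]),
})
df2 = pd.DataFrame({"date": pd.date_range("2021-01-06", periods=3, freq="D")})

df2["incs_open"] = 0
df2["avg_age"] = pd.Timedelta(0)
for row_num in range(df2.shape[0]):
    day = df2.at[row_num, "date"]
    # an incident is outstanding if it opened before `day` and resolved after it
    open_incs = (df1["opened_at"] < day) & (day < df1["resolved_at"])
    df2.at[row_num, "incs_open"] = open_incs.sum()
    ages = day - df1.loc[open_incs, "opened_at"]
    df2.at[row_num, "avg_age"] = ages.mean()

print(df2)
```

On 2021-01-06, for instance, the first two incidents are open (ages 5 days and 1 day), so incs_open is 2 and avg_age is 3 days.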

How to use a function for each row of a dataframe - .apply? .map .mask?

I often need to apply 'by row' logic to a dataframe, and though I might have a function to do it, I'm not sure how to apply it to the dataframe rather than rely on a for loop.
The code below works, but it doesn't seem like the best way of doing it. I tried putting it into functions and using lambda, .apply and .map, but whatever I'm doing, I'm not putting it together correctly. I've tried reading around this, but can't get my head around whether I'm trying to apply to a dataframe, a series, element-wise etc. in an actual example like the one below.
For example look at this dummy data:
import pandas as pd
import numpy as np
import datetime
import random
example = pd.DataFrame({'Total Completions(2015)': [0, 0, 0, 5, 10, 0],
                        'Total Completions(2016)': [0, 1, 4, 0, 18, 0],
                        'Total Completions(2017)': [2, 1, 6, 5, 0, 15],
                        'Total Completions(2018)': [0, 0, 8, 5, 0, 1]})
What i want to do is to find the first year that has completions in it (>0). The first bit of code identifies which columns are the total columns and picks out the relevant year from between the brackets. It then finds which of these possible years is the earliest with any completions in it (>0) and uses this as the minimum year in which to do the next part - assign a random date in that financial year.
# initialise the columns I want to populate
example['min_yr'] = 0
example['Dummy Date'] = ''

# get the first year where completions begin
for row in example.index:
    min_yr = 0
    for header in example.columns:
        # if it's one of the total columns
        if header[:5] == 'Total':
            # get the year as an integer from the brackets in the header
            get_yr = int(header[18:-1])
            # the first year with completions it comes across becomes the minimum
            if min_yr == 0 and example[header][row] > 0:
                min_yr = get_yr
            # if there is already a min year but a new year with an entry is earlier,
            # replace it with the new minimum year
            elif min_yr > 0 and example[header][row] > 0 and get_yr < min_yr:
                min_yr = get_yr
    example['min_yr'][row] = min_yr
I then assign the random date based on the min year I just found:
for row in example.index:
    start_date = datetime.date(example['min_yr'][row] - 1, 4, 1)
    end_date = datetime.date(example['min_yr'][row], 3, 31)
    time_between_dates = end_date - start_date
    days_between_dates = time_between_dates.days
    random_number_of_days = random.randrange(days_between_dates)
    random_date = start_date + datetime.timedelta(days=random_number_of_days)
    example['Dummy Date'][row] = random_date
So in these examples, am I applying to a dataframe or a series, and how do I structure the argument?
Thanks
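One way to express the per-row "first year with completions" logic with .apply(axis=1); a sketch over the dummy data above (the random-date step works the same way once min_yr exists):

```python
import pandas as pd

example = pd.DataFrame({'Total Completions(2015)': [0, 0, 0, 5, 10, 0],
                        'Total Completions(2016)': [0, 1, 4, 0, 18, 0],
                        'Total Completions(2017)': [2, 1, 6, 5, 0, 15],
                        'Total Completions(2018)': [0, 0, 8, 5, 0, 1]})

total_cols = [c for c in example.columns if c.startswith('Total')]

def first_year(row):
    # years (taken from the header brackets) whose completion count is > 0
    years = [int(c[18:-1]) for c in total_cols if row[c] > 0]
    return min(years) if years else 0

# axis=1 passes each row to the function as a Series
example['min_yr'] = example.apply(first_year, axis=1)
print(example['min_yr'].tolist())  # [2017, 2016, 2016, 2015, 2015, 2017]
```

Note that .apply(axis=1) still iterates row by row under the hood; it mainly tidies the code rather than vectorizing it.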

What's the proper way to add to a python list based on a condition from another Pandas column?

Objective: Generate a list of units where the name column includes a number
Description: The current code is close, but gets problematic towards the end, where a list is needed that is filled with the output meeting the requested condition.
Code:
import pandas as pd

# initialise data of lists
info = {'Name': ['Tom', 'nick 11', 'krish', 'jack_14'], 'Units': [20, 21, 19, 18]}

# create DataFrame
df = pd.DataFrame(info)

# print the output
df

# identify rows with a number in the name
def num_there(string):
    return any(i.isdigit() for i in string)

# create a list of all units where the num_there function returns True
for row in df['Name']:
    if num_there(row):
        print(df['Units'], row)
Any help would be appreciated.
You can use str.contains to look for any digit (\d) in the Name column and then use the resulting mask to select the wanted rows of the Units column:
l_units = df.Units[df.Name.str.contains(r'\d')].tolist()
print(l_units)
[21, 18]
You can use pandas str.contains with a regex:
df[df["Name"].str.contains(r'\d', regex=True)]

      Name  Units
1  nick 11     21
3  jack_14     18

Then:
list(df[df["Name"].str.contains(r'\d', regex=True)].Units)
[21, 18]

Iterate through CSV and match lines with a specific date

I am parsing a CSV file into a list. Each list item has a column, list[3], which contains a date in the format mm/dd/yyyy.
I need to iterate through the file and extract only the rows which contain a specific date range.
For example, I want to extract all rows for the month 12/2015. I am having trouble determining how to match the date. Any nudge in the right direction would be helpful.
Thanks.
Method 1:
Split the column into month, day and year, convert month and year to integers, and then compare against 12/2015:
column3 = "12/31/2015"
month, day, year = column3.split("/")
if int(month) == 12 and int(year) == 2015:
    pass  # do your thing
Method 2:
Parse the date string to a time struct and read its tm_year and tm_mon attributes, then compare them with the month and year you want:
>>> import time
>>> to = time.strptime("12/03/2015", "%m/%d/%Y")
>>> to.tm_mon
12
>>> to.tm_year
2015
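Either check drops straight into the CSV-parsing loop. A self-contained sketch with invented rows (only the date column at index 3 matters here):

```python
import csv
from datetime import datetime
from io import StringIO

# Invented rows; the date sits at index 3 as in the question
raw = StringIO(
    "a,b,c,12/03/2015\n"
    "d,e,f,11/30/2015\n"
    "g,h,i,12/25/2015\n"
)

matched = []
for row in csv.reader(raw):
    d = datetime.strptime(row[3], "%m/%d/%Y")
    if d.year == 2015 and d.month == 12:
        matched.append(row)

print(len(matched))  # 2 rows fall in 12/2015
```

With a real file you would replace the StringIO with `open(filename, newline='')` and keep the loop unchanged.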
