Calculate months of coverage based on multiple parameters in Pandas - python-3.x
I need to mark months with 1s when the patient was covered by some product. One dose provides coverage for one month. I would also like to see the gaps in coverage.
Another detail is that quantity may affect months of coverage too. Let's say the quantity is 2; then the patient is covered for the next 2 months.
Right now I'm using df.loc, which works for the first dose, but I can't wrap my mind around how to calculate those gaps in coverage.
df = pd.DataFrame({'patient': ['1', '2', '3', '4', '5', '6', '7'],
                   'dose1': ['A', 'B', 'B', 'A', 'C', 'C', 'C'],
                   'qty1': [1, 2, 1, 4, 1, 3, 4],
                   'days_since_last_dose1': [0, 0, 0, 0, 0, 0, 0],
                   'dose2': ['B', 'A', 'B', 'A', 'C', 'B', 'C'],
                   'qty2': [1, 2, 1, 4, 1, 3, 4],
                   'days_since_last_dose2': [23, 56, 120, 43, 30, 15, 60],
                   'dose3': ['B', 'B', 'B', 'A', 'A', 'C', 'B'],
                   'qty3': [3, 1, 1, 2, 1, 3, 4],
                   'days_since_last_dose3': [44, 22, 67, 150, 76, 32, 21],
                   **{f'M{m}': [0, 0, 0, 0, 0, 0, 0] for m in range(1, 13)}})
prod_1 = ['A']
prod_2 = ['B']
prod_3 = ['C']
df.loc[(df['dose1'].isin(prod_1)) & (df['qty1'] == 1), 'M1'] = 1
For example, a patient got dose 1 (qty = 1), which covered him for 30 days, and comes back for dose 2 (qty = 2) 120 days later. That should be represented as:
M1 = 1, M2 = 0, M3 = 0, M4 = 0, M5 = 1 (the patient came back 120 days after the first dose, with double quantity), M6 = 1, M7 = 0, M8 = 0, and so on.
Welcome to Stack Overflow!

data = []  # one dict per patient
for i in range(len(df['patient'])):  # separate each patient into its own dict
    newdict = {}
    for k, v in df.items():
        newdict[str(k)] = v[i]
    data.append(newdict)

for log in data:  # add up the total quantities and set those months to 1
    for dose in range(log['qty1'] + log['qty2'] + log['qty3']):
        log[f'M{dose + 1}'] = 1

df = pd.DataFrame.from_dict(data)
df.to_csv('doses.csv')

I enjoyed figuring this out. Basically, I separated each patient into a dict, added up the quantities, and ran that total through a for loop that assigns 1 to the months within the range of the summed quantities. I hope that's what you were aiming for.
Edit:

data = []
for i in range(len(df['patient'])):
    newdict = {}
    for k, v in df.items():
        newdict[str(k)] = v[i]
    data.append(newdict)

for log in data:
    # dose 1 covers months 1..qty1
    for dose in range(1, log['qty1'] + 1):
        log[f'M{dose}'] = 1
    # dose 2 starts after the gap implied by days_since_last_dose2
    gap = 1 + round((log['days_since_last_dose2'] / 30) + 0.5)
    for dose in range(gap, gap + log['qty2']):
        log[f'M{dose}'] += 1
    # dose 3 is offset from dose 2's start
    gap1 = 1 + round((log['days_since_last_dose3'] / 30) + 0.5) + gap
    for dose in range(gap1, gap1 + log['qty3']):
        log[f'M{dose}'] += 1

OK, so I found an algorithm to calculate the coverage; a value of 2 indicates an overlap in coverage.
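As a quick check, the month-marking logic can be wrapped in a small helper and run against the worked example from the question. This sketch is illustrative rather than part of the answer above: the helper name and bare-dict input are mine, it uses math.ceil instead of the round(x + 0.5) trick, and it clamps at 12 months.

```python
import math

def coverage_months(log, n_months=12):
    """Mark covered months (M1..M12) for one patient dict; 2 means overlap."""
    months = {f'M{m}': 0 for m in range(1, n_months + 1)}
    # Dose 1 covers months 1..qty1
    for m in range(1, min(log['qty1'], n_months) + 1):
        months[f'M{m}'] = 1
    # Dose 2 starts after the gap implied by days_since_last_dose2
    gap = 1 + math.ceil(log['days_since_last_dose2'] / 30)
    for m in range(gap, min(gap + log['qty2'], n_months + 1)):
        months[f'M{m}'] += 1
    # Dose 3 is offset from dose 2's start month
    gap3 = gap + math.ceil(log['days_since_last_dose3'] / 30)
    for m in range(gap3, min(gap3 + log['qty3'], n_months + 1)):
        months[f'M{m}'] += 1
    return months

# The patient from the question: qty 1, then back 120 days later with qty 2
log = {'qty1': 1, 'days_since_last_dose2': 120, 'qty2': 2,
       'days_since_last_dose3': 0, 'qty3': 0}
print(coverage_months(log))
```

This reproduces the expected pattern from the question: M1 covered, a gap, then M5 and M6 covered.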
Related
Calculation of index returns for specific timeframe for each row using (as an option) for loop
I am not too experienced with programming, and I got stuck in a research project in the asset-management field.

My goal: I have 2 dataframes: one containing, among other columns, "European short date", "SP150030after" and "SP1500365before" (Screenshot), and a second containing the columns "Dates" and "S&P 1500_return" (Screenshot). For each row in the first dataframe, I want to calculate cumulative returns of the S&P 1500 for the 365 days before the date in column "European short date" and for the 30 days after that date, and put these results in the columns "SP150030after" and "SP1500365before". These returns are to be calculated using the second dataframe: its "S&P 1500_return" column represents, for each date, "daily return of S&P 1500 market index + 1". So, for example, to get cumulative returns over the year before 31.12.2020 in the first dataframe, I would calculate the product of the values in column "S&P 1500_return" from the second dataframe for each day present (trading day) in dataframe 2 during the period 31.12.2019 - 30.12.2020.

What I have tried so far: I made "European short date" in dataframe 1 and "Date" in dataframe 2 index fields and thought about approaching my goal with a for loop. I tried to turn "European short date" into a list to iterate through dataframe 1, but I get the following error:

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:18: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Here is my code so far:

Main_set = pd.read_excel('...')
Main_set = pd.DataFrame(Main_set)
Main_set['European short date'] = pd.to_datetime(Main_set['European short date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Main_set = Main_set.set_index('European short date')
Main_set.head(5)

Indexes = pd.read_excel('...')
Indexes = pd.DataFrame(Indexes)
Indexes['Date'] = pd.to_datetime(Indexes['Date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Indexes = Indexes.set_index('Date')

SP1500DailyReturns = Indexes[['S&P 1500 SUPER COMPOSITE - PRICE INDEX']]
SP1500DailyReturns['S&P 1500_return'] = (SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'] / SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'].shift(1))
SP1500DailyReturns.to_csv('...')

Main_set['SP50030after'] = np.zeros(326)

import math
dates = Main_set['European short date'].to_list()
dates.head()
for n in dates:
    Main_set['SP50030after'] = math.prod(arr)

Many thanks in advance!
In case it is useful for someone, I solved the problem by using a for loop and dividing the problem into more steps:

for n in dates:
    Date = pd.Timestamp(n)
    DateB4 = Date - pd.Timedelta("365 days")
    DateAfter = Date + pd.Timedelta("30 days")
    ReturnsofPeriodsBackwards = SP1500DailyReturns.loc[str(DateB4):str(Date), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP500365before'] = np.prod(ReturnsofPeriodsBackwards)
    ReturnsofPeriodsForward = SP1500DailyReturns.loc[str(Date):str(DateAfter), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP50030after'] = np.prod(ReturnsofPeriodsForward)
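For anyone adapting this, the core of the approach, slicing a date-indexed series with .loc and taking the product of the daily factors, can be tried on synthetic data. The frame and values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic "daily return + 1" series indexed by date
dates = pd.date_range('2020-01-01', periods=10, freq='D')
returns = pd.DataFrame({'ret': [1.01] * 10}, index=dates)

# Cumulative return over a window is the product of the daily factors;
# .loc slicing on a DatetimeIndex includes both endpoints
start, end = pd.Timestamp('2020-01-03'), pd.Timestamp('2020-01-06')
window = returns.loc[start:end, 'ret']
cum = np.prod(window)  # 1.01 ** 4 over the four days in the window
print(len(window), cum)
```

Note that both endpoints are included in the slice, which matters when deciding whether the selected date itself should count toward the "before" or "after" window.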
Sequentially comparing groupby values conditionally
Given a dataframe

data = [['Bob', '25'], ['Alice', '46'], ['Alice', '47'], ['Charlie', '19'],
        ['Charlie', '19'], ['Charlie', '19'], ['Doug', '23'], ['Doug', '35'], ['Doug', '35.5']]
df = pd.DataFrame(data, columns=['Customer', 'Sequence'])

calculate the following:

The first Sequence in each group is assigned a GroupID of 1.
Compare the first Sequence to subsequent Sequence values in each group. If the difference is greater than .5, increment the GroupID.
If the GroupID was incremented, compare subsequent values to the current Sequence instead of the first.

In the desired results table below:

Bob only has 1 record, so the GroupID is 1.
Alice has 2 records, and the difference between the two Sequence values (46 & 47) is greater than .5, so the GroupID is incremented.
Charlie's Sequence values are all the same, so all records get GroupID 1.
For Doug, the difference between the first two Sequence values (23 & 35) is greater than .5, so the GroupID for the second Sequence becomes 2. Now, since the GroupID was incremented, I want to compare the next value of 35.5 to 35, not 23, which means the last two rows share the same GroupID.
Desired results:

CustomerID  Sequence  GroupID
Bob         25        1
Alice       46        1
Alice       47        2
Charlie     19        1
Charlie     19        1
Charlie     19        1
Doug        23        1
Doug        35        2
Doug        35.5      2

My implementation:

# generate unique ID based on each customer's Sequence
df['EventID'] = df.groupby('Customer')['Sequence'].transform(lambda x: pd.factorize(x)[0]) + 1
# impute first Sequence for each customer for comparison
df['FirstSeq'] = np.where(df['EventID'] == 1, df['Sequence'], np.nan)
# group by and fill first Sequence forward
df['FirstSeq'] = df.groupby('Customer')['FirstSeq'].transform(lambda v: v.ffill())
# get difference of first Sequence and all others
df['FirstSeqDiff'] = abs(df['FirstSeq'] - df['Sequence'])
# create unique GroupID based on Sequence difference from first Sequence
df['GroupID'] = np.cumsum(df.FirstSeqDiff > 0.5) + 1

The above works for cases like Bob, Alice, and Charlie, but not Doug, because it always compares to the first Sequence. How can I modify the code to change the compared Sequence value when the GroupID is incremented?

EDIT: The dataframe will always be sorted by Customer and Sequence. Perhaps a better way to explain my goal: assign a unique ID to all Sequence values whose difference is .5 or less, grouping by Customer.
The code has errors; adding df = df.astype({'Customer': str, 'Sequence': np.float64}) would fix them. But you still cannot get what you want with this design. Try defining your own function myfunc, which solves your problem directly:

data = [['Bob', '25'], ['Alice', '46'], ['Alice', '47'], ['Charlie', '19'],
        ['Charlie', '19'], ['Charlie', '19'], ['Doug', '23'], ['Doug', '35'], ['Doug', '35.5']]
df = pd.DataFrame(data, columns=['Customer', 'Sequence'])
df = df.astype({'Customer': str, 'Sequence': np.float64})

def myfunc(series):
    ret = []
    series = series.sort_values().values
    for i, val in enumerate(series):
        if i == 0:
            ret.append(1)
        else:
            ret.append(ret[-1] + (series[i] - series[i - 1] > 0.5))
    return ret

df['EventID'] = df.groupby('Customer')['Sequence'].transform(lambda x: myfunc(x))
print(df)

Happy coding, my friend.
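Since the dataframe is always sorted by Customer and Sequence (per the edit), the same grouping can also be written without an explicit loop: within each customer, a new group starts whenever the jump from the previous row exceeds .5. This is a sketch of that alternative, not the answer's own code; it uses numeric Sequence values directly:

```python
import pandas as pd

data = [['Bob', 25], ['Alice', 46], ['Alice', 47], ['Charlie', 19],
        ['Charlie', 19], ['Charlie', 19], ['Doug', 23], ['Doug', 35], ['Doug', 35.5]]
df = pd.DataFrame(data, columns=['Customer', 'Sequence'])

# Within each customer, flag rows where the jump from the previous row
# exceeds 0.5, then cumulatively count the flags to number the groups.
df['GroupID'] = (df.groupby('Customer')['Sequence']
                   .transform(lambda s: s.diff().gt(0.5).cumsum() + 1))
print(df)
```

Comparing each row to its immediate predecessor matches the "reset the comparison point after an increment" rule, so Doug gets GroupIDs 1, 2, 2 as desired.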
Finding average age of incidents in a datetime series
I'm new to Stack Overflow and fairly fresh with Python (some 5 months, give or take), so apologies if I'm not explaining this too clearly! I want to build up a historic trend of the average age of outstanding incidents on a daily basis. I have two dataframes. df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at", which contain datetime values. df2 contains a column called date with the full date range from 2012-06-13 to now. The goal is for df2 to contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding. I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: to find, for each date row in df2, which df1 rows have that date between opened_at and resolved_at.
(It would be helpful to include some example code containing an anonymized/randomized short extract of your data to try solutions on.)

This is unlikely to be the most efficient solution, but I believe you could do:

df2["incs_open"] = 0  # Ensure the column exists

for row_num in range(df2.shape[0]):
    df2.at[row_num, "incs_open"] = sum(
        (df1["opened_at"] < df2.at[row_num, "date"])
        & (df2.at[row_num, "date"] < df1["resolved_at"])
    )

(This assumes you haven't set an index on the data frame other than the default one.)

For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:

open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()

You'll hit some weirdness around rounding: if an incident was opened last night at 3 am, what is its age "today"? 1 day, 0 days, 9 hours (if we count from noon), etc. But I assume that once you have code that works, you can adjust that to taste.
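For completeness, the loop body above can be packaged into a per-date function and applied to the date column. The toy frames below are invented for illustration; only the column names follow the question:

```python
import pandas as pd

# Toy data shaped like the question's df1/df2
df1 = pd.DataFrame({
    'opened_at':   pd.to_datetime(['2021-01-01', '2021-01-03']),
    'resolved_at': pd.to_datetime(['2021-01-05', '2021-01-04']),
})
df2 = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-01-06')})

def stats_for(day):
    # An incident is outstanding on `day` if it opened before and resolved after
    open_mask = (df1['opened_at'] < day) & (day < df1['resolved_at'])
    ages = day - df1.loc[open_mask, 'opened_at']
    return pd.Series({'incs_open': open_mask.sum(), 'avg_age': ages.mean()})

df2[['incs_open', 'avg_age']] = df2['date'].apply(stats_for)
print(df2)
```

With 8 years of incidents and a decade of dates, this pairwise check is O(dates x incidents); it should still be workable, but sorting-based approaches would scale better if it gets slow.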
How to use a function for each row of a dataframe - .apply? .map .mask?
I often need to apply "by row" logic to a dataframe, and though I might have a function to do it, I'm not sure how to apply it to the dataframe rather than relying on a for loop. The code below works, but it doesn't seem like the best way of doing it. I tried putting it into functions and using lambda, .apply, and .map, but whatever I'm doing, I'm not putting it together correctly. I've tried reading around this, but can't get my head around whether I'm trying to apply to a dataframe, a series, element-wise, etc., in an actual example like the one below. For example, look at this dummy data:

import pandas as pd
import numpy as np
import datetime
import random

example = pd.DataFrame({'Total Completions(2015)': [0, 0, 0, 5, 10, 0],
                        'Total Completions(2016)': [0, 1, 4, 0, 18, 0],
                        'Total Completions(2017)': [2, 1, 6, 5, 0, 15],
                        'Total Completions(2018)': [0, 0, 8, 5, 0, 1]})

What I want to do is find the first year that has completions in it (> 0). The first bit of code identifies which columns are the total columns and picks the relevant year out from between the brackets. It then finds which of these possible years is the earliest with any completions (> 0) and uses this as the minimum year for the next part: assigning a random date in that financial year.
# initiate the columns I want to populate
example['min_yr'] = 0
example['Dummy Date'] = ''

# get the first year where completions begin
for row in example.index:
    min_yr = 0
    for header in example.columns:
        # if it's one of the total columns
        if header[:5] == 'Total':
            # get the year as an integer from the brackets in the header
            get_yr = int(header[18:-1])
            # if it's the first year with completions it comes across, that year is the first minimum
            if min_yr == 0 and example[header][row] > 0:
                min_yr = get_yr
            # if there is already a min year but there is a new year with an entry that's earlier,
            # replace it with the new minimum year
            elif min_yr > 0 and example[header][row] > 0 and get_yr < min_yr:
                min_yr = get_yr
    example['min_yr'][row] = min_yr

I then assign the random date based on the min year I just found:

for row in example.index:
    start_date = datetime.date(example['min_yr'][row] - 1, 4, 1)
    end_date = datetime.date(example['min_yr'][row], 3, 31)
    time_between_dates = end_date - start_date
    days_between_dates = time_between_dates.days
    random_number_of_days = random.randrange(days_between_dates)
    random_date = start_date + datetime.timedelta(days=random_number_of_days)
    example['Dummy Date'][row] = random_date

So in these examples, am I applying to a dataframe or a series, and how do I structure the argument? Thanks.
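One way the routine above can be expressed with .apply and .map is to split it into two row/element-wise helpers. This is only a sketch of that idea; the helper names are mine, and the [18:-1] header slice follows the original code:

```python
import datetime
import random

import pandas as pd

example = pd.DataFrame({'Total Completions(2015)': [0, 0, 0, 5, 10, 0],
                        'Total Completions(2016)': [0, 1, 4, 0, 18, 0],
                        'Total Completions(2017)': [2, 1, 6, 5, 0, 15],
                        'Total Completions(2018)': [0, 0, 8, 5, 0, 1]})

total_cols = [c for c in example.columns if c.startswith('Total')]

def first_year(row):
    # Earliest year (parsed from the header brackets) with completions > 0
    return min(int(c[18:-1]) for c in total_cols if row[c] > 0)

def random_fy_date(year):
    # Random date in the financial year ending 31 March of `year`
    start = datetime.date(year - 1, 4, 1)
    end = datetime.date(year, 3, 31)
    offset = random.randrange((end - start).days)
    return start + datetime.timedelta(days=offset)

example['min_yr'] = example.apply(first_year, axis=1)       # row-wise: axis=1
example['Dummy Date'] = example['min_yr'].map(random_fy_date)  # element-wise
print(example)
```

The rule of thumb: .apply(axis=1) on a DataFrame hands your function each row as a Series (one argument, the row), while .map on a Series hands it each scalar value.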
DAX Calculate Year To Date - Year to Previous Month (year change)
Trying to figure out how to calculate the equivalent of YTD, but up to the previous month when the year changes. For example:

Date    | Value
2018-10 | 100
2018-11 | 100
2018-12 | 100
2019-01 | 100
2019-02 | 100
2019-03 | 100

Results expected:

When 2019-03 is selected:
YTD = 300 (accumulated from 2019-01 to 2019-03)
Previous month accumulated = 200 (accumulated from 2019-01 to 2019-02)

When 2019-01 is selected:
YTD = 100
Previous month accumulated = 300 (accumulated from 2018-10 to 2018-12)

I've tried:

Value_Accum_PreviousMonth := TOTALYTD([Value]; Calendar[Date]) - [Value]

but with the change of year, it doesn't work. Any ideas?
Disclaimer: I created this solution using Power BI. The formula should transfer over to Excel/Power Pivot, but it may require some minor tweaks.

While the time-intelligence functions of DAX are super useful, I seem to be drawn towards the manual approach of calculating YTD, prior year, etc. With that in mind, this is how I would solve your problem.

First, I would make a measure that simply sums your Value column. Having this measure is just nice down the road; it's not an absolute necessity.

Total Value = SUM(Data[Value])

Next, to calculate YTD manually based on the prior month, we need to know two things: 1) the target month (i.e. the prior month) and 2) the year of that month. If you are going to use those values anywhere else, I would put them into their own measures and use them here. If this is the only place they will be used, I like to use variables to calculate these kinds of values.

The first value we need is the selected date (or the max date in cases of no selection).

VAR SelectedDate = MAX(Data[Date])

With that date, we can calculate TargetYear and TargetMonth. Both are simple IF statements that catch the January/December crossover.

VAR TargetYear = IF(MONTH(SelectedDate) = 1, YEAR(SelectedDate) - 1, YEAR(SelectedDate))
VAR TargetMonth = IF(MONTH(SelectedDate) = 1, 12, MONTH(SelectedDate) - 1)

Having all of the values we need, we write a CALCULATE statement that filters the data to TargetYear and months less than or equal to TargetMonth.

CALCULATE([Total Value], FILTER(ALL(Data), YEAR([Date]) = TargetYear && MONTH([Date]) <= TargetMonth))

Putting it all together looks like this:

Prior Month YTD =
VAR SelectedDate = MAX(Data[Date])
VAR TargetYear = IF(MONTH(SelectedDate) = 1, YEAR(SelectedDate) - 1, YEAR(SelectedDate))
VAR TargetMonth = IF(MONTH(SelectedDate) = 1, 12, MONTH(SelectedDate) - 1)
RETURN
    CALCULATE(
        [Total Value],
        FILTER(ALL(Data), YEAR([Date]) = TargetYear && MONTH([Date]) <= TargetMonth)
    )