Parallel processing a dataframe after partitioned by a column value

Parallel processing a dataframe after partitioned by a column value - python-3.x

I need help optimizing my code to enhance python's processing time for a sequential task. For every hospital encounter record, I need to count the number of missed appointments in the past 3 months.
The hospital encounter data has patient ID ('PAT_ID'), CONTACT_DATE and missed_appoint binary flags to flag every appointment encounter as missed appointment or not.
So, I need to go over every record and count the records for that specific patient that is flagged as missed_appointment in the past 3 months.
I tried two ways to do it, and neither of them is efficient considering having more than 10 million records.
Calculating the date the mark the starting of the 3 prior months
encounter['CONTACT_DATE_3'] = encounter["CONTACT_DATE"]-pd.Timedelta(90,'D')
1-
encounter['missed_appoint_past3M'] = encounter.apply(lambda x: encounter.loc[ (encounter['CONTACT_DATE'] >= x['CONTACT_DATE_3']) &
(encounter['CONTACT_DATE'] < x['CONTACT_DATE']) &
(encounter['PAT_ID'] == x['PAT_ID']), ['missed_appoint']].sum(), axis=1)
2-
for index, x in encounter.iterrows():
encounter.loc[index, 'missed_appoint_past3M'] = encounter[
(encounter["CONTACT_DATE"] >= x["CONTACT_DATE_3"]) & (encounter["CONTACT_DATE"] < x["CONTACT_DATE"]) & (encounter["PAT_ID"] == x["PAT_ID"])]['missed_appoint'].sum()
Both are basically for loops. I appreciate it if anyone could help me optimize the code or maybe a safe way to parallelize this task considering that I need to ensure the each processing is done for all data for a patient together.
Thanks in advance!

Related

Optimize Past stock calculation using cumulative sum in python and pandas without iterating through dataframe (Performance warning)

I'm calculating the cumulative sum of one specific Clothes store stock over time (grouped by Family, Groups, Year and months). To be able to re-establish the stock levels of the past based on three values: Number of purchased items , Number of sold items and the current stock i have today.
I have already solved the calculation problem by: Merging the stock table with the movement table and calculating the mov_itens['novo_estoque'] with the formula bellow:
mov_itens['novo_estoque'] = mov_itens['vendas'] - mov_itens['compras'] + mov_itens['estoque']
Then i have transformed it on a multi-index dataframe. Where indexes are respecticelly: codfamily, codgroup, year and month. By doing:
gruposemindice = mov_itens.groupby(['codfamilia','codgrupo','ano','mes']).sum()
And calculated the CUMSUM() on the 'estoque' column. Where i could not use it after a map or something like that, because i wasn't able to return (to my new dataframe) the other columns that shouldn't receive the cumulative sum.
gruposemindice_ord = gruposemindice.sort_index(ascending=False)
for i in gruposemindice_ord.index:
if(f == i[0]):#codfamilia
if(g == i[1]):#codgrupo
gruposemindice_ord.loc[i[:-2]]['estoque'] = (gruposemindice_ord.loc[i[:-2]]['novo_estoque']).cumsum()
#calcular giro de estoque nessa linha
print(gruposemindice_ord.loc[i[:-2]])
else:
g = i[1]
else:
f = i[0]
The problem is that I'm doing it iteratively and the dataframe is sorted DESCENDING by index, which makes the query last like order(n) for each index that I access. In fact, it should be Order (1) [direct access], to be fast enough and do not cause a bottleneck and these errors that are appearing....
gruposemindice_ord.loc[i[:-2]]['estoque'] = (gruposemindice_ord.loc[i[:-2]]['novo_estoque']).cumsum()
C:\Users\Diego\AppData\Local\Temp/ipykernel_20264/1416332248.py:10: **PerformanceWarning: indexing past lexsort depth may impact performance.**
print(gruposemindice_ord.loc[i[:-2]])
C:\Users\Diego\AppData\Local\Temp/ipykernel_20264/1416332248.py:9: PerformanceWarning: indexing past lexsort depth may impact performance.
gruposemindice_ord.loc[i[:-2]]['estoque'] = (gruposemindice_ord.loc[i[:-2]]['novo_estoque']).cumsum()
C:\Users\Diego\AppData\Local\Temp/ipykernel_20264/1416332248.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
According to python,I can correct it just ASCENDING sorting the dataframe before doing the manipulation, but I need a DESCENDING sorting, because stock info is only available for the current month, which is the last month of the table, and then I calculate the previous months starting for it.... If I sort the other way the calculation will not work.
Some Observations
The dataframe Multi-indexes are the numbers of the Families, Groups and Brands of the store, so I can't re-index and loose these numbers,
Also i cannot do this by account on an ascending order, as my first stock is on the last month
I am already checking if the sort and transaction are correct (As pointed on another stack answer).
gruposemindice_ord = gruposemindice.sort_index(ascending=False)
gruposemindice_ord.index.is_lexsorted()
Hope someone can help me!
Best Regards, Diego Mello

Setting a value in a column within a group according to some condition

I'm new to groups in pandas, and relatively new to pandas, so I hope someone of you can help me with my problem.
Aim: flag outliers within a group by setting the relevant cell in the relevant column to 1.The condition is that the data point is outside a calculated group specific limit.
Data: This is a geopandas dataframe with multiple time series with some numeric variables. Each timeseries has its own id.
Some background:
I want to determine outliers for each timeseries by
first grouping the timeseries according to timeseries id
then calculate the lower and upper limit of the variables PER group
then 'flag' the values which are outside the limit by adding a 1 in a specific 'outlier'column
Here is the code which calculates the limits, however, when it comes to setting the flag I have a hard time to figure that out:
df_timeseries['outlier'] = np.zeros
for timeseries, group in df_timeseries.groupby('timeseries.id'):
Q1 = group['Variable.value'].quantile(0.25)
Q3 = group['Variable.value'].quantile(0.75)
IQR = Q3 - Q1
low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 *IQR
for value in group['Variable.value']:
if ((value < low_lim) or (value > up_lim)):
# here --> set '1' in the column 'outlier'
I tried it multiple ways, for example:
df_timeseries.loc[df_timeseries['Variable.value'] > up_lim, 'outlier']=1
I also tried 'apply()', so instead of iterating over the tracks I tried to first define a function and then apply it on the group. However nothing really worked, and I could not find out what I actually do wrong. If someone can help, I would be really glad, as I have already tried to figure this out about a couple of hours.
I would need something like:
group.loc[group['outlier']] = 1

How to define an array of numbers with a formula

I have a project where I need to break people into 3 buckets with task lists that rotate quarterly (Phase A = task list 1, B = task list 2, C = task list 3). The goal here is to sort people into the buckets based on a departure date, with the ideal being that they would depart when they're in the C phase. I have a formula already set up that will tell me the number of quarters between the project start date and the person's departure date, so now I'm trying to figure out how to get Excel to tell me if a person's departure date falls within their bucket's C Phase.
I have this formula in a column called DEROSQtr:=ROUNDDOWN(DAYS360("1-Oct-2020",[#DEROS],FALSE)/90,0)
Now the easy way to approach this would be to build a static array and just see if that formula results in a value in the right array, where the numbers in the array define which quarter from Oct 2020 that the bucket's C Phase is going to be in:
ArrayA = {1;4;7;10;13;16} ArrayB = {2;5;8;11;14;17} ArrayC = {0;3;6;9;12;15}
The formula that pulls this all together is then:
=IF([#EFP]="A",IF(IFNA(MATCH([#DEROSQtr],ArrayA,0),-1)<>-1,TRUE,FALSE),IF([#EFP]="B",IF(IFNA(MATCH([#DEROSQtr],ArrayB,0),-1)<>-1,TRUE,FALSE),IF([#EFP]="C",IF(IFNA(MATCH([#DEROSQtr],ArrayC,0),-1)<>-1,TRUE,FALSE),"-")))
Now while this will work for as long as I build out the static array, I'm trying to figure out how to define each of these buckets with a formula that Excel can work with, i.e. bucket A hits phase C in 3n + 1 quarters where n is the number of cycles through all 3 phases, so ArrayA = 3n+1, ArrayB = 3n+2 and ArrayC = 3n. What I'm hunting for here is the best way to define each of the arrays as a formula.

After some additional digging and looking back at how to define each array, I came across the MOD() function in Excel. I was then able to rewrite the formula that does the checking as =IF([#EFP]="A",IF(MOD([#DEROSQtr]-1,3)=0,TRUE,FALSE),IF([#EFP]="B",IF(MOD([#DEROSQtr]-2,3)=0,TRUE,FALSE),IF([#EFP]="C",IF(MOD([#DEROSQtr],3)=0,TRUE,FALSE),"-"))), replacing ArrayA(3n+1) with MOD([#DEROSQtr]-1,3), ArrayB(3n+2) with MOD([#DEROSQtr]-2,3), and ArrayC(3n) with MOD([#DEROSQtr],3).

Since I do not have the data you are calculating your quarter, its difficult to give you exact answer. However, as I understand your have a column which has the formula to calculate the quarter say "Formula_Col"
Solution will be to add a new column and flag it based on the values in "Formula_Col".
If you can give some sample data I can provide exact answer.

Loop to select rows until I reach a target number

I have two dataframes, which I join to see the active people. There are people who stop being active and I use one of the dataframes to fill the other one.
mass pnr freq
1 [40666303, 68229102, 35784905, 47603805] 4
54 [17182402] 1
234 [07694901, 35070201, 36765601] 3
The other table looks the same I just need to select enough people to reach my target of 7500-7600 people (40666303 - this is one person and 'freq' is the number of people in the list. It doesn't matter what is the 'mass', I just need when the sum of 'freq' is between 7500 and 7600 the process to stop. Now I need 400 people, but next month it is possible to need 20 people, it differs every month. Basically, my code now removes the non-active people and when it removes them i need to replace them with active. The first run of the process I used this code to select the initial 7500 people:
target = 7500
freq_sum = sum(mass_grouped3['freq'])
new_mass_not_in_whitelist1['records_to_select'] = [math.ceil(int((el * target ) / freq_sum )) for el in new_mass_not_in_whitelist1['freq']]
But now, with this code I am not getting the desired sum of people to fill the missing gap of 400 people. Also, it would be good to not select only the first rows, but maybe every other or some random condition. What can I change to work the way I explained?

How to skip the last iteration of a for loop in a dataframe

I want to skip the final iteration of my for loop that i am running on a dataframe as it is giving incorrect results. This might be a very trivial problem and need help on this as i am new to python.I have my data spanning into millions of rows and a few hundred thousand ids.
I have a dataframe(df) with 3 columns ID,Event,Time. I am trying to compute another column TimeDiff, which is the difference between events for a particular id, can also be referred to as time to live for a particular event.
ID|Event|Time|TimeDiff
1|x|hh|hh(y)-hh(x)
1|y|hh|hh(z)-hh(y)
1|z|hh|Nan
2|x|hh|hh(y)-hh(x)
2|y|hh|Nan
Above is the desired output but the approach i am trying gives me a value for Nan as well which is basically the time difference with the next event and next id which ideally should be Nan.
for i in df.Id.unique():
df['TimeDiff'] = (df['Time'].shift(-1) - df['Time']).astype('timedelta64[h]')
Expected Result:
ID|Event|Time|TimeDiff
1|x|hh|hh(y)-hh(x)
1|y|hh|hh(z)-hh(y)
1|z|hh|Nan
2|x|hh|hh(y)-hh(x)
2|y|hh|Nan
Actual Result:
ID|Event|Time|TimeDiff
1|x|hh|hh(y)-hh(x)
1|y|hh|hh(z)-hh(y)
1|z|hh|hh(x)-hh(z)
2|x|hh|hh(y)-hh(x)
2|y|hh|Nan
If i am able to skip the final for loop iteration for my id no 1, i will be able to get the desired solution

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Parallel processing a dataframe after partitioned by a column value - python-3.x

Related

Optimize Past stock calculation using cumulative sum in python and pandas without iterating through dataframe (Performance warning)

Setting a value in a column within a group according to some condition

How to define an array of numbers with a formula

Loop to select rows until I reach a target number

How to skip the last iteration of a for loop in a dataframe

Categories

Resources