Improving the speed of cross-referencing rows in the same DataFrame in pandas - python-3.x

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B; if so, calculate the time difference and add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min

First, make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, create a helper DataFrame with one column per EventType:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
Finally, take the difference between B and A and merge it back to df:
df = df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
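Put together (and with the difference converted to minutes to match the desired output, which is my addition), a runnable sketch of this approach could look like this:
import pandas as pd
from io import StringIO

# sample data from the question
csv = StringIO("""UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A""")
df = pd.read_csv(csv)

# make Time a timedelta measured from midnight
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)

# one column per EventType per (UID, UID2), then the B - A gap in minutes
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
diff = (df1.B - df1.A).dt.total_seconds().div(60).rename('TimeDiff').reset_index()
df = df.merge(diff)
print(df)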

Related

Using Pandas to get a contiguous segment of one dataframe and copy it into a new one?

Using Pandas, I'm attempting to 'slice' (Sorry if that's not the correct term) segments of a dataframe out of one DF and into a new one, where every segment is stacked one on top of the other.
Code:
import pandas as pd
df = pd.DataFrame({
    'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
    'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
    'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
    'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
})
print(df)
column_headers = df.columns.values.tolist()
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_step = 'STEP'
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
type_df = df[df[col_name_type].isin(types_to_check)]
for row in type_df:
    if 'CMD' in row:
        if 'START:' in row[col_name_step].value:
            idx_start = row.iloc[::-1].str.match('VERIFY').first_valid_index() #go backwards and find first VERIFY
            step_match = row[col_name_step].value[6:] #get the unique ID after Start:
            idx_end = df[df[col_name_step].str.endswith(step_match, na=False)].last_valid_index() #find last instance of matching unique id
            segments.append(df.loc[idx_start:idx_end, :])
            segments.append(df_blank)
df_segments = pd.concat(segments)
print(df)
print(df_segments)
Nothing gets populated in my segments array, so the concat function fails.
From my research I'm confident that this can be done using either .loc or .iloc, but I can't seem to get a working implementation.
Any help and/or guidance would be welcome.
Edit: To clarify, I'm trying to create a new DF comprised of every group of rows where the start is the "VERIFY" row that comes before a "CMD" row containing "Start:", and the end is the matching "CMD" row containing the corresponding "End:".
EDIT2: I think the following is something close to what I need, but I'm unsure how to get it to reliably work:
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
cmd_check = ['CMD']
verify_check = ['VERIFY']
cmd_df = df[(df[col_name_type].isin(cmd_check))]
cmd_start_df = cmd_df[(cmd_df[col_name_step].str.contains('START:'))]
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx, :]
    idx_start = temp_df[col_name_type].isin(verify_check).last_valid_index()
    idx_end = cmd_df[cmd_df[col_name_type].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)
df_segments = pd.concat(segments)
You can use str.contains:
segmented_df = df.loc[df['STEP'].str.contains('Start|End')]
print(segmented_df)
I created some code to accomplish the 'slicing' I wanted:
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx, :]
    temp_list = temp_df[col_name_type].values.tolist()
    if 'VERIFY' in temp_list:
        idx_start = temp_df[temp_df[col_name_type].str.match('VERIFY')].last_valid_index()
    else:
        idx_start = cmd_idx
    idx_end = cmd_df[cmd_df[col_name_step].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)
df_segments = pd.concat(segments)
I essentially create a new DF that is a subset of the old DF up to my first Start index, find the last_valid_index of a VERIFY row within it, use that as idx_start, slice the original DF from idx_start to idx_end, and eventually concat all those slices into one DF.
Maybe there's an easier way, but I couldn't find it.
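For what it's worth, a more compact variant of the same idea, written as a sketch that reuses df and df_blank from above (the helper names starts, step_id and verifies are mine), could look like this:
# Sketch: pair each CMD "Start: <id>" row with the preceding VERIFY row and
# with the last row whose STEP ends with the same <id>, then slice df.
segments = []
starts = df[(df['TYPE'] == 'CMD') & df['STEP'].str.startswith('Start:', na=False)]
for cmd_idx, row in starts.iterrows():
    step_id = row['STEP'][6:].strip()                        # unique ID after "Start:"
    verifies = df.loc[:cmd_idx].query("TYPE == 'VERIFY'").index
    idx_start = verifies[-1] if len(verifies) else cmd_idx   # last VERIFY before the Start
    idx_end = df[df['STEP'].str.endswith(step_id, na=False)].index[-1]
    segments.append(df.loc[idx_start:idx_end])
    segments.append(df_blank)
df_segments = pd.concat(segments, ignore_index=True)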

Optimize for loop Pandas

There is a DataFrame named df which contains repeating rows identified by DICE_SUMMARY_ID.
After I perform some calculations on different columns, I need to write the results back to the original DataFrame.
The issue is that df contains over 100k rows and the for loop is very time consuming; it currently takes about 3 hours.
How can I reduce the time?
#extract unique ids from dataframe
uniqueIDs = df['DICE_SUMMARY_ID'].unique()
#iterate over the unique ids and calculate
for i in range(len(uniqueIDs)):
    #get a slice of the dataframe at i'th unique id
    uniqueID_df = df.loc[df['DICE_SUMMARY_ID'] == uniqueIDs[i]]
    #calculate sum of all family types
    SINGLE_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10001).sum())
    EXTRA_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10003).sum())
    NO_OF_ADULTS = int(SINGLE_ADULTS + EXTRA_ADULTS)
    NO_OF_DEPENDENTS_U_16 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20001).sum())
    NO_OF_DEPENDENTS_16_TO_18 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20002).sum())
    #get the array of indexes of each unique uid
    #each unique uid has 10 - 20 rows in the original df,
    #given that there are over 100k records, this is becoming very time consuming
    indices = np.where(df["DICE_SUMMARY_ID"] == uniqueIDs[i])[0]
    for j in indices:
        #insert value in column at index for each repeating index
        df['NO_OF_ADULTS'].iloc[j] = NO_OF_ADULTS
        df['NO_OF_DEPENDENTS_U_16'].iloc[j] = NO_OF_DEPENDENTS_U_16
        df['NO_OF_DEPENDENTS_16_TO_18'].iloc[j] = NO_OF_DEPENDENTS_16_TO_18
A faster version, but I am still not satisfied:
df['NO_OF_ADULTS'].iloc[indices.min():indices.max()] = NO_OF_ADULTS
df['NO_OF_DEPENDENTS_U_16'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_U_16
df['NO_OF_DEPENDENTS_16_TO_18'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_16_TO_18
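One way to drop the Python-level loops entirely, sketched here under the assumption that the column names above are the real ones, is to build boolean flags once and let groupby().transform('sum') broadcast each per-ID count back onto every row:
import pandas as pd

# Sketch, not a drop-in answer from the thread: compute one boolean flag per
# target column, then transform('sum') writes each group's count onto every
# row of that DICE_SUMMARY_ID.
flags = pd.DataFrame({
    'NO_OF_ADULTS': df['FAMILY_TYPE_ID'].isin([10001, 10003]),
    'NO_OF_DEPENDENTS_U_16': df['FAMILY_TYPE_ID'].eq(20001),
    'NO_OF_DEPENDENTS_16_TO_18': df['FAMILY_TYPE_ID'].eq(20002),
})
df[flags.columns] = flags.groupby(df['DICE_SUMMARY_ID']).transform('sum')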

Data to explode between two columns

My current dataframe looks as below:
existing_data = {'STORE_ID': ['1234','5678','9876','3456','6789'],
                 'FULFILLMENT_TYPE': ['DELIVERY','DRIVE','DELIVERY','DRIVE','DELIVERY'],
                 'FORECAST_DATE': ['2020-08-01','2020-08-02','2020-08-03','2020-08-04','2020-08-05'],
                 'DAY_OF_WEEK': ['SATURDAY','SUNDAY','MONDAY','TUESDAY','WEDNESDAY'],
                 'START_HOUR': [8,8,6,7,9],
                 'END_HOUR': [19,19,18,19,17]}
existing = pd.DataFrame(data=existing_data)
I would need the data to be exploded between the start and end hour such that each hour is a different row like below:
needed_data = {'STORE_ID': ['1234','1234','1234','1234','1234'],
               'FULFILLMENT_TYPE': ['DELIVERY','DELIVERY','DELIVERY','DELIVERY','DELIVERY'],
               'FORECAST_DATE': ['2020-08-01','2020-08-01','2020-08-01','2020-08-01','2020-08-01'],
               'DAY_OF_WEEK': ['SATURDAY','SATURDAY','SATURDAY','SATURDAY','SATURDAY'],
               'HOUR': [8,9,10,11,12]}
required = pd.DataFrame(data=needed_data)
I'm not sure how to achieve this. I know it should be possible with explode(), but I haven't been able to get it working.
If the DataFrame is small or performance is not important, build a range from both columns and use DataFrame.explode:
existing['HOUR'] = existing.apply(lambda x: range(x['START_HOUR'], x['END_HOUR']+1), axis=1)
existing = (existing.explode('HOUR')
                    .reset_index(drop=True)
                    .drop(['START_HOUR','END_HOUR'], axis=1))
If performance is important, use Index.repeat with the difference of the two columns, then add a per-group counter from GroupBy.cumcount to START_HOUR:
s = existing["END_HOUR"].sub(existing["START_HOUR"]) + 1
df = existing.loc[existing.index.repeat(s)].copy()
add = df.groupby(level=0).cumcount()
df['HOUR'] = df["START_HOUR"].add(add)
df = df.reset_index(drop=True).drop(['START_HOUR','END_HOUR'], axis=1)

Spark window function and taking first and last values per column per partition (aggregation over window)

Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that I have some_dict with values that are either test1 or test2; based on the value, I take either the first or the last as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # doesn't work
The above clearly doesn't work. Any ideas?
The goal here is: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows down to 5.
Let's assume your DataFrame is named df and contains duplicates; you can use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10,False)
Change the orderBy condition if there are specific criteria, so that the desired record ends up at the top of each partition.
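If the goal is really first-per-column/last-per-column rather than just one row per id, a hedged sketch (not from the answer above; note that first()/last() are non-deterministic in an unordered aggregation unless the data is sorted beforehand) would build the aggregation expressions directly:
from pyspark.sql import functions as F

# Sketch: take first() for 'test1' columns and last() for 'test2' columns,
# aggregated per id. Ordering within each group is not guaranteed here.
agg_exprs = [
    (F.first(k, ignorenulls=True) if some_dict[k] == 'test1'
     else F.last(k, ignorenulls=True)).alias(k)
    for k in some_dict
    if some_dict[k] in ('test1', 'test2')
]
result = df.groupBy('id').agg(*agg_exprs)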

How to get specific attributes of a df that has been grouped

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, the decade, and its victim count. What I have right now prints out all the columns with the same frequencies. How do I change it so that I just have 3 columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by the state and decade and setting that equal to a variable called count.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints out all the columns in the file with the same frequencies, whereas I just want 3 columns: State, Decade, and Victim Count.
You should call reset_index on the result of the groupby, then select the columns you want from the new DataFrame.
Something like:
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.loc[:, ['State', 'Decade', 'Victim Count']])
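If each row in the sheet represents one victim, a slightly simpler alternative, sketched under that assumption, is groupby(...).size(), which returns the per-group row count directly as a tidy three-column frame:
# Sketch: size() counts rows per (State, Decade) group; reset_index names
# the resulting count column to match the question's "Victim Count".
counts = (df.groupby(['State', 'Decade'])
            .size()
            .reset_index(name='Victim Count'))
print(counts)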
