Optimize for loop Pandas - python-3.x

There is a dataframe named df which contains repeating rows identified by DICE_SUMMARY_ID.
After I perform some calculations on different columns, I need to write the results back to the original dataframe.
The issue is that df contains over 100k rows, so a for loop is very time consuming; currently it takes about 3 hours.
How can I reduce the time?
import numpy as np

#extract unique ids from the dataframe
uniqueIDs = df['DICE_SUMMARY_ID'].unique()

#iterate over the unique ids and calculate
for i in range(len(uniqueIDs)):
    #get a slice of the dataframe at the i'th unique id
    uniqueID_df = df.loc[df['DICE_SUMMARY_ID'] == uniqueIDs[i]]

    #calculate the sum of all family types
    SINGLE_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10001).sum())
    EXTRA_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10003).sum())
    NO_OF_ADULTS = int(SINGLE_ADULTS + EXTRA_ADULTS)
    NO_OF_DEPENDENTS_U_16 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20001).sum())
    NO_OF_DEPENDENTS_16_TO_18 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20002).sum())

    #get the array of indexes of each unique id
    #each unique id has 10 - 20 rows in the original df;
    #given that there are over 100k records, this becomes very time consuming
    indices = np.where(df["DICE_SUMMARY_ID"] == uniqueIDs[i])[0]
    for j in indices:
        #insert the value in the column at each repeating index
        df['NO_OF_ADULTS'].iloc[j] = NO_OF_ADULTS
        df['NO_OF_DEPENDENTS_U_16'].iloc[j] = NO_OF_DEPENDENTS_U_16
        df['NO_OF_DEPENDENTS_16_TO_18'].iloc[j] = NO_OF_DEPENDENTS_16_TO_18

A faster version, but I am still not satisfied:

    df['NO_OF_ADULTS'].iloc[indices.min():indices.max()] = NO_OF_ADULTS
    df['NO_OF_DEPENDENTS_U_16'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_U_16
    df['NO_OF_DEPENDENTS_16_TO_18'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_16_TO_18
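One way to avoid the Python-level loops entirely is to let pandas do the counting per group. A minimal sketch, assuming the column names above and that every row of a DICE_SUMMARY_ID group should receive that group's totals:

# Sketch: count each family type per DICE_SUMMARY_ID with a vectorized
# groupby().transform('sum'); the boolean masks sum to per-group counts and
# transform broadcasts each count back onto every row of its group.
by_id = df['DICE_SUMMARY_ID']
df['NO_OF_ADULTS'] = df['FAMILY_TYPE_ID'].isin([10001, 10003]).groupby(by_id).transform('sum')
df['NO_OF_DEPENDENTS_U_16'] = (df['FAMILY_TYPE_ID'] == 20001).groupby(by_id).transform('sum')
df['NO_OF_DEPENDENTS_16_TO_18'] = (df['FAMILY_TYPE_ID'] == 20002).groupby(by_id).transform('sum')

Because this is a single vectorized groupby instead of per-row iloc assignments, it should scale far better on 100k+ rows.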

Related

Spark window function and taking first and last values per column per partition (aggregation over window)

Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
import sys

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that I have some_dict whose values are either test1 or test2; based on the value, I take either the first or the last, as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # Doesn't work
The above clearly doesn't work. Any ideas?
The goal here is: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows down to 5.
Let's assume your dataframe is named df and contains duplicates; use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10, False)
Change the orderBy condition if there is a specific criterion, so that the desired record ends up at the top of each partition.
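For the first/last aggregation that was actually asked about, one possible sketch (assuming some_dict maps column names to either 'test1' or 'test2' as described) is to skip the window and build plain aggregate expressions, since window expressions cannot be passed to agg directly:

from pyspark.sql import functions as F

# Sketch: one aggregate expression per column, reducing to one row per id.
# Columns tagged 'test1' take the first non-null value, 'test2' take the last.
agg_exprs = [
    (F.first(c, ignorenulls=True) if tag == 'test1'
     else F.last(c, ignorenulls=True)).alias(c)
    for c, tag in some_dict.items()
]
result = df.groupBy('id').agg(*agg_exprs)

Note that first/last inside a groupBy aggregation depend on row order, which Spark does not guarantee without an explicit sort, so the result may differ from the window version if ordering matters.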

How to get specific attributes of a df that has been grouped

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, decade, and its victim count. What I have right now prints out all the columns with the same frequencies. How do I change it so that I just have 3 columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by the state and decade and setting that equal to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints out all the columns in the file with the same frequencies, whereas I just want 3 columns: State, Decade, and Victim Count.
You should call reset_index on the groupby result, and then select the columns from the new dataframe.
Something like:
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.loc[:, ['State', 'Decade', 'Victim Count']])
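Another option (a sketch, assuming the data has one row per victim) is to use size(), which yields exactly one count per group, so there are no extra columns to drop afterwards:

# Sketch: size() counts the rows in each group directly, and
# reset_index(name=...) turns the result into the three desired columns.
counts = (
    df.groupby(['State', 'Decade'])
      .size()
      .reset_index(name='Victim Count')
)
print(counts)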

pyspark rdd of csv to data frame with large number of columns dynamically

I have an existing rdd which consists of a single column of text with many (20k+) comma separated values.
How can I convert this to a data frame without specifying every column literally?
from pyspark.sql import Row

# split into columns
split_rdd = input_rdd.map(lambda l: l.split(","))

# convert to Row types
rows_rdd = split_rdd.map(lambda p: Row(
    field_1=p[0],
    field_2=p[1],
    field_3=float(p[2]),
    field_4=float(p[3])
))
df = spark.createDataFrame(rows_rdd)
How can I dynamically create the field_1=p[0], field_2=p[1], ... entries of the dict?
For example
row_dict = dict(
    field_1=p[0],
    field_2=p[1],
    field_3=float(p[2]),
    field_4=float(p[3])
)
is invalid syntax since the 'p[0]' needs to be quoted, but then it is a literal and doesn't get evaluated in the lambda function.
This is a large enough dataset that I need to avoid writing out the rdd and reading it back into a dataframe for performance.
You could try using a dictionary comprehension when creating the Row instances:
df = split_rdd\
    .map(lambda p: {'field_%s' % index: val
                    for (index, val) in enumerate(p)})\
    .map(lambda p: Row(**p))\
    .toDF()
This first maps each list of column values from split_rdd into a dictionary with dynamically generated field_N keys mapped to their respective values. These dictionaries are then used to create the Row instances.
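A possible alternative sketch, assuming every row has the same number of fields: generate the column names once and hand them to toDF(), which avoids building a dict and a Row per record.

# Sketch: derive the column names from the first row and let Spark
# build the DataFrame schema from the list of names directly.
num_cols = len(split_rdd.first())
col_names = ['field_%s' % i for i in range(num_cols)]
df = split_rdd.toDF(col_names)

Unlike the Row version above, this leaves every column as a string; numeric columns would still need an explicit cast afterwards.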

Convert huge number of lists to pandas dataframe

User defined function => my_fun(x): returns a list
XYZ = file with LOTS of lines

pandas_frame = pd.DataFrame()  # create an empty dataframe
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code takes a very long time to run, as in days. How do I speed it up?
I think you need to apply the function to each row with a list comprehension to build a list, and then call the DataFrame constructor only once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
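If XYZ can be iterated over directly (a sketch, assuming it behaves like a list of lines), the indexing can be dropped as well:

# Sketch: iterate over XYZ directly instead of indexing into it.
L = [my_fun(line) for line in XYZ]
df = pd.DataFrame(L)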

Improving the speed of cross-referencing rows in the same DataFrame in pandas

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B, then calculate the time difference and add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation, where I group the records by UID and UID2, then have only a small subset of rows to search to identify whether both EventTypes exist. I can't figure out a faster one, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, I create a helper dataframe:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
Finally, I take the diff and merge it back to df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
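If TimeDiff should come out as whole minutes, as in the expected output, one possible follow-up (a sketch) is to convert the timedelta before merging:

# Sketch: convert the timedelta difference to whole minutes and merge it
# back, leaving NaN where a group has no matching A/B pair.
timediff_min = (df1.B - df1.A).dt.total_seconds() // 60
df = df.merge(timediff_min.rename('TimeDiff').reset_index(), how='left')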
