There is a DataFrame named df which contains repeating rows identified by DICE_SUMMARY_ID.
After I perform some calculations on different columns, I need to write the results back to the original DataFrame.
The issue is that df contains over 100k rows, so a for loop is very time consuming; currently it takes about 3 hours.
How can I reduce the time?
import numpy as np
import pandas as pd

# extract unique ids from the dataframe
uniqueIDs = df['DICE_SUMMARY_ID'].unique()

# iterate over the unique ids and calculate
for i in range(len(uniqueIDs)):
    # get a slice of the dataframe at the i'th unique id
    uniqueID_df = df.loc[df['DICE_SUMMARY_ID'] == uniqueIDs[i]]

    # calculate the sum of all family types
    SINGLE_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10001).sum())
    EXTRA_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10003).sum())
    NO_OF_ADULTS = int(SINGLE_ADULTS + EXTRA_ADULTS)
    NO_OF_DEPENDENTS_U_16 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20001).sum())
    NO_OF_DEPENDENTS_16_TO_18 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20002).sum())

    # get the array of indexes of each unique id;
    # each unique id has 10 - 20 rows in the original df,
    # and given that there are over 100k records, this is becoming very time consuming
    indices = np.where(df["DICE_SUMMARY_ID"] == uniqueIDs[i])[0]
    for j in indices:
        # insert the value in each column at the index of each repeating row
        df['NO_OF_ADULTS'].iloc[j] = NO_OF_ADULTS
        df['NO_OF_DEPENDENTS_U_16'].iloc[j] = NO_OF_DEPENDENTS_U_16
        df['NO_OF_DEPENDENTS_16_TO_18'].iloc[j] = NO_OF_DEPENDENTS_16_TO_18
A faster version, but I am still not satisfied:
    df['NO_OF_ADULTS'].iloc[indices.min():indices.max() + 1] = NO_OF_ADULTS
    df['NO_OF_DEPENDENTS_U_16'].iloc[indices.min():indices.max() + 1] = NO_OF_DEPENDENTS_U_16
    df['NO_OF_DEPENDENTS_16_TO_18'].iloc[indices.min():indices.max() + 1] = NO_OF_DEPENDENTS_16_TO_18
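For reference, a fully vectorized sketch (assuming the same column names as above) that removes the Python loop entirely: a boolean mask per family type is summed once per DICE_SUMMARY_ID via groupby(...).transform('sum') and broadcast back to every row of that group.

ids = df['DICE_SUMMARY_ID']
family = df['FAMILY_TYPE_ID']
# boolean masks summed per group; transform('sum') broadcasts each
# group's count back onto every row of that group, so no loop is needed
df['NO_OF_ADULTS'] = family.isin([10001, 10003]).groupby(ids).transform('sum')
df['NO_OF_DEPENDENTS_U_16'] = (family == 20001).groupby(ids).transform('sum')
df['NO_OF_DEPENDENTS_16_TO_18'] = (family == 20002).groupby(ids).transform('sum')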
Related
Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n rows per id, and the goal is to reduce them to one.
Basically, aggregating to make id distinct.
import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that some_dict has values that are either 'test1' or 'test2', and based on the value I take either the first or the last, as shown above.
How do I actually call the aggregation and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # doesn't work
The above clearly doesn't work. Any ideas?
The goal here: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows down to 5.
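One possible sketch (assuming 'id' itself is not a key in some_dict): because first/last are evaluated over the full partition window, every row in a partition already carries the aggregated values, so selecting them and de-duplicating on id collapses the 25 rows to 5.

# every row already holds the window-aggregated values,
# so keeping one row per id is enough
cols = list({**test1, **test2}.values())
result = df.select('id', *cols).dropDuplicates(['id'])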
Let's assume your DataFrame is named df and contains duplicates; use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10, False)
Change the orderBy condition if there is a specific criterion, so that the desired record ends up at the top of each partition.
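For example (a sketch assuming a hypothetical updated_at timestamp column), to keep the most recent row per id rather than an arbitrary one:

# order descending by a timestamp so row_id = 1 is the latest record
window = Window.partitionBy(df['id']).orderBy(df['updated_at'].desc())
latest = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")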
I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, the decade, and its victim count. What I have right now prints all the columns with the same frequencies. How do I change it so that I have just 3 columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by state and decade, and setting that equal to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints all the columns in the file with the same frequencies, whereas I just want 3 columns: State, Decade, and Victim Count.
You should reset_index on the groupby result, and then select the columns from the new dataframe.
Something like:
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.reset_index().loc[:, ['State', 'Decade', 'Victim Count']])
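Alternatively (a sketch using another common idiom, assuming you want the row frequency per group), groupby(...).size() counts rows per group directly and avoids carrying every column through count():

# size() returns one row count per (State, Decade) group;
# reset_index(name=...) turns it back into a 3-column frame
counts = df.groupby(['State', 'Decade']).size().reset_index(name='Victim Count')
print(counts)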
I have an existing RDD which consists of a single column of text with many (20k+) comma-separated values.
How can I convert this to a DataFrame without specifying every column literally?
from pyspark.sql import Row

# split into columns
split_rdd = input_rdd.map(lambda l: l.split(","))

# convert to Row types
rows_rdd = split_rdd.map(lambda p: Row(
    field_1=p[0],
    field_2=p[1],
    field_3=float(p[2]),
    field_4=float(p[3])
))
df = spark.createDataFrame(rows_rdd)
How can I dynamically create the field_1=p[0] arguments as a dict?
For example,
row_dict = dict(
    field_1=p[0],
    field_2=p[1],
    field_3=float(p[2]),
    field_4=float(p[3])
)
is not viable, since the field names would need to be quoted to be generated dynamically, which is invalid syntax for keyword arguments, so they stay literals and never get evaluated inside the lambda function.
This is a large enough dataset that I need to avoid writing out the rdd and reading it back into a dataframe for performance.
You could try using a dictionary comprehension when creating the Row instances:
df = split_rdd\
    .map(lambda p: {'field_%s' % index: val
                    for (index, val) in enumerate(p)})\
    .map(lambda p: Row(**p))\
    .toDF()
This first maps each split-value list from split_rdd into a dictionary whose dynamically generated field_N keys map to the respective values. Those dictionaries are then unpacked to create the Row instances.
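A minimal self-contained check of the same idea (the sample line is made up, and an active SparkSession named spark is assumed):

from pyspark.sql import Row

sample_rdd = spark.sparkContext.parallelize(["a,b,1.5,2.5"])
df = (sample_rdd
      .map(lambda l: l.split(","))
      .map(lambda p: Row(**{'field_%s' % i: v for (i, v) in enumerate(p)}))
      .toDF())
df.show()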
User defined function => my_fun(x): returns a list
XYZ = a file with LOTS of lines
pandas_frame = pd.DataFrame()  # create an empty data frame
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code is taking a very long time to run, as in days. How do I speed it up?
I think you need to apply the function to each row, collecting the results in a list via a list comprehension, and then call the DataFrame constructor only once. Appending inside the loop copies the entire frame on every iteration, so its cost grows quadratically:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
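As a toy illustration of the same pattern (this my_fun is just a stand-in for the real function):

import pandas as pd

def my_fun(line):
    # stand-in: split a comma-separated line into a list
    return line.strip().split(',')

XYZ = ["a,1", "b,2", "c,3"]
df = pd.DataFrame([my_fun(XYZ[i]) for i in range(len(XYZ))])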
I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is for each combination of UID and UID2 check if there is both a row with EventType = A and EventType = B, and then calculate the time difference, and then add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation: I group the records by UID and UID2, so there is only a small subset of rows to search when identifying whether both EventTypes exist. I can't figure out a faster approach, and profiling in PyCharm hasn't helped uncover the bottleneck.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, create a helper dataframe:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, take the diff and merge it back onto df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
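If the result should be whole minutes, as in the expected output above, a follow-up sketch is to convert the timedelta column explicitly:

out = df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
# timedelta -> minutes; combinations with no B row stay NaN
out['TimeDiff'] = out['TimeDiff'].dt.total_seconds() / 60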