I'm trying to do a pairwise comparison of two sub-dataframes at a time, grouped by some key, but I'm having a hard time with pandas groupby in a double for loop since it is very slow. Is there any way I can optimize it so that I don't have to recompute the groups every time I run the outer loop?
I tried reusing the same groupby variable, but that doesn't seem to solve the recomputation problem.
mygroups = mydf.groupby('mykey')
for key1, subdf1 in mygroups:
    for key2, subdf2 in mygroups:
        if key2 <= key1:
            continue
        do_some_work(subdf1, subdf2)
The inner loop over subdf2 seems to start again from the first key rather than from the key after key1. In my use case I expected key2 to be the next key in the iteration after key1, and so on. How can I get that behavior without recomputing the groups?
Your observation is correct: the inner loop iterates over all the groups again, not just the ones after key1.
The easiest way for smaller DataFrames
I would create a list with the groups first and then iterate over this list.
This is what I would do:
mygroups_list = [(key, subdf) for (key, subdf) in mydf.groupby('mykey')]
while len(mygroups_list) > 0:
    key1, subdf1 = mygroups_list.pop(0)
    for key2, subdf2 in mygroups_list:
        do_some_work(subdf1, subdf2)
You just have to make sure the groups are really sorted, but AFAIK the .groupby method does this anyway (unless you pass sort=False). If you are not sure, you can add a mygroups_list.sort(key=lambda tup: tup[0]) before the loop.
If size does matter after all
For larger dataframes you can avoid materializing all the sub-dataframes at once and defer that until you actually need the data, like this:
# create the groupby object as usual
group_by = mydf.groupby('mykey')
# now fetch the row indices from the groupby object;
# because this is actually a dictionary,
# extract the keys from it and sort them
mygroups_dict = group_by.indices
mygroups_labels = list(mygroups_dict)
mygroups_labels.sort()
# now use a similar approach as above
while len(mygroups_labels) > 0:
    key1 = mygroups_labels.pop(0)
    # but instead of creating the sub-dataframes
    # before you enter the loop, just do it
    # within the loop, using the row indices
    # the groupby object evaluated
    subdf1 = mydf.iloc[mygroups_dict[key1]]
    for key2 in mygroups_labels:
        subdf2 = mydf.iloc[mygroups_dict[key2]]
        do_some_work(subdf1, subdf2)
That should be much less memory intensive, because you only need to store the row indices instead of the whole rows throughout the whole processing time.
For the following example setup:
import numpy as np
import pandas as pd

def do_some_work(subdf1, subdf2):
    print('{} --> {} (len={}/{})'.format(
        subdf1['mykey'].iat[0], subdf2['mykey'].iat[0], len(subdf1), len(subdf2)))

mydf = pd.DataFrame(dict(mykey=np.random.randint(5, size=100), col=range(1, 101)))
running the while-loop version above outputs something like the following (of course the len info will differ from run to run because of the randint). Note the group labels to the left and right of the arrow: on the right side you have key2, which is always > key1:
0 --> 1 (len=21/16)
0 --> 2 (len=21/21)
0 --> 3 (len=21/20)
0 --> 4 (len=21/22)
1 --> 2 (len=16/21)
1 --> 3 (len=16/20)
1 --> 4 (len=16/22)
2 --> 3 (len=21/20)
2 --> 4 (len=21/22)
3 --> 4 (len=20/22)
Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the information linked in the error message, as well as myriad posts on this forum, but none of them cover continually looping through a changing dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried `df.copy()` to no avail - the warning was still thrown.
Example code:
import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified. What is below
    # this is already modified and locked into the geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind+len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model mixing below the boundary layer (key error)
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
    # Append the original dataframe with new data:
    a.iloc[ind:(ind+loop), :] = b
    # Try df.copy(), but still throws warning!
    # a.iloc[ind:(ind+loop), :] = b.copy()
print(a)
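In case it helps frame the options above: one way to avoid the warning while keeping this row-by-row architecture is to skip the intermediate slice b entirely and write each scaled value straight back into a with scalar positional .iloc assignment, so no assignment ever happens on a copy. This is only a minimal sketch of that idea (not a vetted answer), and it assumes the default RangeIndex used in the example so that the iterrows index matches the positional index:

# Sketch: same advective-mixing logic, writing directly into `a`
# with scalar .iloc assignments instead of modifying the slice `b`.
col_a = a.columns.get_loc('a')          # positional index of column 'a'
for ind in range(len(a)):
    loop = min(len(z), len(a) - ind)    # stay inside the dataframe
    for fraction in range(loop):
        a.iloc[ind + fraction, col_a] = a.iloc[ind + fraction, col_a] * z[fraction]
print(a)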
So, I'm facing what seems to be a classic problem: extracting time-framed toppers from an unbounded stream,
using Apache Beam (Flink as the engine):
Assuming sites+hits tuples input:
{"aaa.com", 1001}, {"bbb.com", 21}, {"aaa.com", 1002}, {"ccc.com", 3001}, {"bbb.com", 22} ....
(Expected rate: 100K+ entries per hour)
Goal is to output sites which are >1% of total hits, in each 1 hour timeframe.
i.e. for a 1-hour fixed window, pick the sites whose summed hits are >1% of the total hits.
So first, sum by key:
{"aaa.com", 2003}, {"bbb.com", 43}, {"ccc.com", 3001} ....
And finally output the >1%:
{"aaa.com"}, {"ccc.com"}
Alternatives:
1) Group + ParDo:
Fixed time windowing of 1 hour, group all elements, followed by an iterable ParDo over all the window's elements that calculates the sum and outputs the >1% sites.
The cons seem to be that the whole aggregation is done in a single thread, and it also seems to require two iterations: one to get the sum and one to get the >1% sites.
2) GroupByKey + Combine:
Fixed time windowing of 1 hour, GroupByKey using key=site, then Combine with a custom accumulator to sum the hits per key.
Although the Combine option (#2) seems more suitable, I'm missing the part about getting the total sum per 1-hour window, which is needed to identify the >1% elements:
Can the same window be used for two combines, one per key and one for the total hits in that window?
And what is the best approach to combine them both to make the >1% call per element?
Thanks!
You can do this via side inputs. For instance, you'd do something like this (code in Python, but answer for Java is similar):
import apache_beam as beam

input_data = ....  # ("aaa.com", 1001), ("bbb.com", 21), ("aaa.com", 1002), ("ccc.com", 3001), ("bbb.com", 22) ....

total_per_key = input_data | beam.CombinePerKey(sum)

global_sum_per_window = beam.pvalue.AsSingleton(
    input_data
    | beam.Values()
    | beam.CombineGlobally(sum).without_defaults())

def find_more_than_1pct(elem, global_sum):
    key, value = elem
    if value > global_sum * 0.01:
        yield elem

# The windowed global sum is passed to the FlatMap as a side input.
over_1_pct_keys = total_per_key | beam.FlatMap(find_more_than_1pct, global_sum_per_window)
In this case, the global_sum_per_window PCollection will have one value for each window, and the total_per_key will have one value per-key-per-window.
Let me know if that works!
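One detail the snippet leaves implicit is the windowing itself. As a minimal sketch (not part of the answer above), the question's 1-hour fixed windows would be applied upstream with the standard beam.WindowInto transform, so that both combines, and therefore the side input, are evaluated once per window; raw_input here is a hypothetical unbounded PCollection of (site, hits) tuples:

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

# Assign each (site, hits) element to a 1-hour fixed window before combining,
# so the per-key sums and the global sum are produced per window.
input_data = (
    raw_input  # hypothetical unbounded PCollection of (site, hits) tuples
    | beam.WindowInto(FixedWindows(60 * 60))
)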
I'm attempting to merge two large dataframes (one with 50k+ rows, and another with 650k+ rows, pared down from 7M+). The merging/matching is being done via fuzzywuzzy, to find which string in the first dataframe most closely matches which string in the other.
At the moment, it takes about 3 minutes to test 100 rows. Consequently, I'm attempting to use Dask to help with the processing speed. In doing so, Dask returns the following error: 'NotImplementedError: Series getitem in only supported for other series objects with matching partition structure'
Presumably, the error is due to my dataframes not being of equal size. When trying to set a chunksize while converting my pandas dataframes to dask dataframes, I receive an error (TypeError: 'float' object cannot be interpreted as an integer), even though I previously forced all the datatypes in each dataframe to 'object'. Consequently, I was forced to use the npartitions parameter in the dataframe conversion, which then leads to the 'NotImplementedError' above.
I've tried to standardize the chunksize with the partitions using a mathematical index, and also tried using the npartitions parameter, to no effect; both result in the same NotImplementedError.
As mentioned, my efforts to do this without Dask have been successful, but far too slow to be useful.
I've also taken a look at these questions/responses:
- Different error
- No solution presented
- Seems promising, but results are still slow
prices_filtered_ddf = dd.from_pandas(prices_filtered, chunksize=25e6)  # prices_filtered: 404.2MB
all_data_ddf = dd.from_pandas(all_data, chunksize=25e6)  # all_data: 88.7MB

# import dask
client = Client()
dask.config.set(scheduler='processes')

# Define matching function
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the other
    for name2 in list_names:
        # Finding fuzzy match score
        score = fuzz.token_set_ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)

# List for dicts for easy dataframe creation
dict_list = []
# iterating over our players without salaries found above
for name in prices_filtered_ddf['ndc_description'][:100]:
    # Use our method to find best match, we can set a threshold here
    match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description_agg': name})
    dict_.update({'ndc_description': match[0]})
    dict_.update({'score': match[1]})
    dict_list.append(dict_)

merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
Here's the full error:
NotImplementedError Traceback (most recent call last)
<ipython-input-219-e8d4dcb63d89> in <module>
3 dict_list = []
4 # iterating over our players without salaries found above
----> 5 for name in prices_filtered_ddf['ndc_description'][:100]:
6 # Use our method to find best match, we can set a threshold here
7 match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
C:\Anaconda\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
2671 return Series(graph, name, self._meta, self.divisions)
2672 raise NotImplementedError(
-> 2673 "Series getitem in only supported for other series objects "
2674 "with matching partition structure"
2675 )
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
I expect that the merge_table will return, in a relatively short time, a dataframe with data for each of the update columns. At the moment, it's extremely slow.
I'm afraid there are a number of problems with this question, so after pointing these out, I can only provide some general guidance.
The traceback shown is clearly not produced by the code above
The indentation and syntax are broken
A distributed client is made, then config set not to use it ("processes" is not the distributed scheduler)
The client object is called, client(...), but it is not callable; this shouldn't work at all
The main processing function, match_name, is called directly; how do you expect Dask to intervene?
You don't ever call compute(), so in the code given, I'm not sure Dask is invoked at all.
What you actually want to do:
Load your smaller, reference dataframe using pandas, and call client.scatter to make sure all the workers have it
Load your main data with dd.read_csv
Call df.map_partitions(..) to process your data, where the function you pass should take two pandas dataframes, and work row-by-row.
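Here is a minimal sketch of those three steps, under some assumptions: the CSV paths are hypothetical, the column names are the ones from the question, and the reference names are passed directly to map_partitions (client.scatter with broadcast=True is an option for larger reference data):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from fuzzywuzzy import fuzz

client = Client()

# 1) Small reference data: load with pandas (hypothetical path).
all_data = pd.read_csv('all_data.csv')
ref_names = all_data['ndc_description_agg'].tolist()

# 2) Large data: load lazily with dask (hypothetical path).
prices_ddf = dd.read_csv('prices_filtered.csv')

# 3) Process each partition as a plain pandas DataFrame, row by row.
def best_matches(part, names):
    rows = []
    for name in part['ndc_description']:
        score, match = max((fuzz.token_set_ratio(name, n), n) for n in names)
        rows.append({'ndc_description': name, 'match': match, 'score': score})
    return pd.DataFrame(rows, index=part.index)

result = prices_ddf.map_partitions(
    best_matches, ref_names,
    meta={'ndc_description': 'object', 'match': 'object', 'score': 'int64'},
).compute()
print(result.head())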
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all of the unique [code, date] entries like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to a new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()

for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):      # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
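For readers who hit the same wall in pandas, here is a minimal sketch of a groupby-based aggregation (not taken from this thread, and not the Window-function approach mentioned above). It assumes 'Spclty' holds lists of [code, date] pairs as in the question's example; its main gain over the original code is iterating over the groupby once instead of filtering the whole frame per user:

import pandas as pd

# Hypothetical input resembling the structure described in the question.
df_tmp = pd.DataFrame({
    'UserNbr': [1, 1, 2],
    'Spclty': [
        [['104', '2010-01-31'], ['215', '2014-11-21']],
        [['352', '2016-07-13'], ['104', '2010-01-31']],
        [['104', '2010-01-31']],
    ],
})

rows = []
for user, spclty_series in df_tmp.groupby('UserNbr')['Spclty']:
    combined = {}
    for entry in spclty_series:
        combined.update(dict(entry))  # drops pairs repeated across rows
    rows.append({'UserNbr': user, 'Spclty': combined})

df_out = pd.DataFrame(rows)
print(df_out)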
I have been set a task: using an original dataset of 25 A's and 25 C's (50 data points in total), randomly select 1 of the 50 data points, add it to a new nested list, and repeat this 50 times (the length of the dataset). This random selection then needs to be done again on each new dataset, 25 times in total.
import random

SetSize = 50
StatData = [[] for b in range(1,50)]
for i in range(0, int(SetSize/2)):
    StatData[0].append('A')
for i in range(int(SetSize/2)):
    StatData[0].append('C')
for a in range(1,25):
    for b in range(0, int(SetSize)):
        StatData(a+1).append(random.choice(StatData[a]))
This is my current piece of code. I have created the empty nested list and the initial dataset (StatData[0]); however, the next section is not working, and it is returning
"TypeError: 'list' object is not callable"
Any help would be much appreciated!
The error says it all. You cannot call a list.
Typo
On the last line of the code you provided, change
StatData(a+1).appen...
to
StatData[a+1].appen...
and I guess it should fix the error.
Index problem (edit)
You are populating the list at index 0 with A's and C's (StatData[0]), but in your last loop a goes from 1 to 24, and random.choice reads from StatData[a]. So your first iteration will choose a random character from StatData[1] (which is empty), thus throwing another error.
Starting your loop from index 0 should make this go away!
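Putting both fixes together, a sketch of a corrected version (assuming you want the original dataset plus 25 resampled generations) might look like this:

import random

SetSize = 50

# One inner list for the original dataset plus one per resampled generation.
StatData = [[] for _ in range(26)]

# Build the initial dataset: 25 'A's followed by 25 'C's.
for i in range(SetSize // 2):
    StatData[0].append('A')
for i in range(SetSize // 2):
    StatData[0].append('C')

# Each generation draws 50 random picks from the previous generation.
for a in range(25):
    for b in range(SetSize):
        StatData[a + 1].append(random.choice(StatData[a]))

print(StatData[1])   # first resampled generation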