Slow loop aggregating rows and columns - python-3.x

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options, instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol] # capture 1st row
        for row in range(1, df_user.shape[0]): # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
                      columns = ['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).

I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
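For readers who want to stay in pandas, here is a minimal sketch of one way to do the aggregation in a single pass. It assumes each 'Spclty' cell is a list of [code, date] pairs as shown in the question; the sample data and the name `merged` are made up for illustration. Note that when the same code appears with different dates, the later row wins.

```python
import pandas as pd

# Made-up sample data in the shape described in the question.
df_tmp = pd.DataFrame({
    "UserNbr": [1, 1, 2],
    "Spclty": [
        [["104", "2010-01-31"], ["215", "2014-11-21"]],
        [["215", "2014-11-21"], ["352", "2016-07-13"]],
        [],
    ],
})

# One pass over the rows: accumulate a dict of code -> date per user.
# This avoids filtering the whole frame once per user.
merged = {}
for user, pairs in zip(df_tmp["UserNbr"], df_tmp["Spclty"]):
    merged.setdefault(user, {}).update(dict(pairs))

# Build the output frame once, at the end.
df_out = pd.DataFrame({
    "UserNbr": list(merged.keys()),
    "Spclty": list(merged.values()),
})
print(df_out)
```

The cost here grows with the number of rows rather than with rows times users, which is likely where the original loop spends most of its time.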

Related

I'm trying to remove certain words from a column on each row in my dataframe

I'm still trying to understand how pandas works, so please bear with me. In this exercise, I'm trying to access a particular column, ['Without Stop Words'], which holds a list of words on each row. I wish to remove certain words from each row of that column. The words to be removed have been specified in a dictionary called stop_words_dict. Here's my code, but the dataframe seems to be unchanged after running it.
def stop_words_remover(df):
    # your code here
    df['Without Stop Words']= df['Tweets'].str.lower().str.split()
    for i, r in df.iterrows():
        for word in df['Without Stop Words']:
            if word in stop_words_dict.items():
                df['Without Stop Words'][i] = df['Without Stop Words'].str.remove(word)
    return df
This is what the input looks like:
INPUT
EXPECTED OUTPUT
In Pandas, it's generally a bad idea to loop over your dataframe row by row to try to change rows. Instead, try using methods like .apply().
An example for stopwords, together with list comprehension:
test['Tweets'].apply(lambda x: [item for item in x if item not in stop_words_dict.items()])
See https://stackoverflow.com/a/29523440/12904151 for more context.
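One detail worth checking in both snippets above: `stop_words_dict.items()` yields (key, value) tuples, so a single word is never "in" it. The following is a minimal sketch, assuming the dictionary maps a single key to a list of stop words; that structure, the key name 'stopwords', and the sample tweets are assumptions, since the question does not show them.

```python
import pandas as pd

# Assumed structure: one key holding the list of stop words.
stop_words_dict = {"stopwords": ["the", "is", "a"]}

# Made-up sample tweets.
df = pd.DataFrame({"Tweets": ["The sky is blue", "A dog barks"]})

# A set makes each membership test cheap.
stop_words = set(stop_words_dict["stopwords"])

# Lowercase and split each tweet, then filter each word list with apply().
df["Without Stop Words"] = (
    df["Tweets"]
    .str.lower()
    .str.split()
    .apply(lambda words: [w for w in words if w not in stop_words])
)
print(df)
```

Assigning the result back to the column is what makes the change stick; the filtering itself does not modify the dataframe in place.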

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none get into the continual looping through a changed dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail: the warning was still thrown.
Example code:
import pandas as pd
a = pd.DataFrame(
    {
        'a':[x/2 for x in range(1,11)],
        'b':['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c':['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy','dairy', 'vegan', 'meat']
    }
)
print(a)
z = [x/(x+2) for x in range(1,5)]
print(z)
#Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    #Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified. What is below
    #this is already modified and locked into the geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind+len(z)), :]
    #Define the size of the secondary loop. Taking the minimum avoids the model mixing below the boundary layer (key error)
    loop = min([len(z), len(b)])
    #Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
    #Append the original dataframe with new data:
    a.iloc[ind:(ind+loop), :] = b
    #Try df.copy(), but still throws warning!
    #a.iloc[ind:(ind+loop), :] = b.copy()
print(a)
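The warning comes from the chained assignment `b['a'].iloc[fraction] = ...` on a slice of `a`. One possible rearrangement, sketched here under the assumption that only column 'a' needs to change, is to write straight into the original dataframe with a single `.iloc` assignment per step. Only column 'a' from the example is kept, and `z` is turned into a NumPy array so the inner loop becomes one vectorized multiplication.

```python
import numpy as np
import pandas as pd

# Same numeric column and mixing fractions as in the example above.
a = pd.DataFrame({'a': [x / 2 for x in range(1, 11)]})
z = np.array([x / (x + 2) for x in range(1, 5)])

col = a.columns.get_loc('a')
for ind in range(len(a)):
    # Number of layers affected at this step; shrinks near the bottom.
    loop = min(len(z), len(a) - ind)
    # Read the current values, scale them, and write them back in one
    # assignment on the original frame (no intermediate slice object).
    current = a.iloc[ind:ind + loop, col].to_numpy()
    a.iloc[ind:ind + loop, col] = current * z[:loop]

print(a)
```

Because each step still reads the values updated by the previous step, the sequential dependence of the model is preserved.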

Dynamically generating an object's name in a panda column using a for loop (fuzzywuzzy)

Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then append all these sub-dataframes in a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframe, the resulting value gets overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formula names and expressions from a dict? This approach runs into the same problem as above.
Here is my formulas dict:
# ratios dict: all ratios names and functions
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio
          }
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])
    # new base df containing ratio data and calculations for current loop turn
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
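A variation on the same idea is to collect one small dataframe per formula in a list and concatenate once at the end. The sketch below is self-contained, so it swaps in two stand-in scoring functions built on the standard library's difflib instead of fuzzywuzzy; the functions `ratio` and `lower_ratio`, the column names 'base' and 'target', and the sample strings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


# Stand-in scorers (assumption): replace with fuzz.ratio etc. in real code.
def ratio(s1, s2):
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def lower_ratio(s1, s2):
    return ratio(s1.lower(), s2.lower())


ratios = {"ratio": ratio, "lowercase ratio": lower_ratio}

# Made-up pairs of strings to compare.
df_out = pd.DataFrame({"base": ["apple", "Banana"],
                       "target": ["apple", "banana"]})

frames = []
for name, func in ratios.items():
    part = df_out[["base", "target"]].copy()
    part["mesure"] = name  # label every row with the formula name
    part["valeur"] = [func(s, t) for s, t in zip(part["base"], part["target"])]
    frames.append(part)    # keep this turn's result instead of overwriting it

# One concat at the end gives a long-format table: one row per pair and formula.
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

No dynamically named variables are needed: the formula name lives in the 'mesure' column, and a filter such as `final_df[final_df['mesure'] == 'ratio']` recovers any single sub-table.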

How to extract several dataframes from dictionary

I am currently trying to extract several dataframes from a dictionary. The problem is that the number of dataframes will vary: sometimes I'll have two dataframes in there and sometimes 30.
At the beginning I create a dictionary (dict_of_exceptions) from a dataframe (exceptions_df). In this dictionary I'll have several dataframes depending on how many different 'Source Wells' I have. With the current code I can extract the first dataframe from the dictionary which is j:
dict_of_exceptions = {k: v for k, v in exceptions_df.groupby('Source Well') }
print (dict_of_exceptions)
for k in dict_of_exceptions.keys():
    j = dict_of_exceptions[k]
Could someone help me modify the last line to go through the dictionary and extract each dataframe (and name them like the corresponding key)?
I think I get your intention, although I could not really read it from your code. Currently, as @cyrilb38 stated in the comments, your loop overwrites j on every iteration, so you only ever see the result of the last iteration. Rather than transforming the data, work with the dataframe directly; I think (I may be wrong) that what you call a dataframe is really a group of rows. Replacing a groupby object with a dict is probably not what you want, and it only prolongs the process.
If you want to see the info of Well X only for example, try this
exceptions_df[exceptions_df['Source Well'] == 'Well X']
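If the dictionary of per-well dataframes is still wanted, a minimal sketch of working with it follows. The dictionary keys already act as names, so separately named variables are not needed. The sample data and the 'Volume' column are made up for illustration.

```python
import pandas as pd

# Made-up sample data; only 'Source Well' comes from the question.
exceptions_df = pd.DataFrame({
    "Source Well": ["A1", "A1", "B2", "C3"],
    "Volume": [10, 20, 30, 40],
})

# One dataframe per well, keyed by the well name.
dict_of_exceptions = {k: v for k, v in exceptions_df.groupby("Source Well")}

# Visit every well, however many there are.
for well, well_df in dict_of_exceptions.items():
    print(well, len(well_df))

# Fetch a single well's dataframe by its key when needed.
a1_df = dict_of_exceptions["A1"]
```

This works the same way whether the dictionary holds two dataframes or thirty.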

Pandas - iterating to fill values of a dataframe

I'm trying to build a data-frame of time series data. I have to retrieve the data from an API and every (i,j) entry in the data-frame (where "i" is the row and "j" is the column) has to be iterated through and filled individually.
Here's an idea of the type of thing I'm trying to do (note that the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time
def retrievedata(string):
    # take string
    # do some stuff with api
    # return float
    ...
label_list = ['label1','label1','label1']  # etc...
discrete_points = 720
df = pd.DataFrame(index=np.arange(0, discrete_points), columns=(i for i in label_list))
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now, I want to iterate over it and assign values to every (i,j) entry in the data frame based on a function I wrote to pull data. Note that the function I wrote has to be specific to a certain column (as it takes the column label as input). On top of that, each row will have different values because it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-pythonic, non-numpy, non-pandas way of doing things. So I'd like to find a nicer and more efficient, less computationally intensive way of doing this.
I'm assuming it's gonna have to be some combination of: df.iterrows(); df.itertuples(); df.loc[]; df.at[]
I'm stumped though.
Any ideas?
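One common alternative, sketched here, is to skip the pre-allocated frame, collect each time step as a dict of label to value, and build the dataframe once at the end. The `retrievedata` body below is a hypothetical stand-in that returns a counter instead of calling an API, and the distinct labels and small `discrete_points` are assumptions to keep the example short.

```python
import itertools

import pandas as pd

_counter = itertools.count()


def retrievedata(label):
    # Hypothetical stand-in for the real API call: returns 0.0, 1.0, 2.0, ...
    return float(next(_counter))


label_list = ["label1", "label2", "label3"]  # assumed distinct labels
discrete_points = 5                          # 720 in the question

# One dict per time step; every entry still needs its own call.
rows = [
    {label: retrievedata(label) for label in label_list}
    for _ in range(discrete_points)
]

# Build the frame in one go instead of writing cell by cell.
df = pd.DataFrame(rows, columns=label_list)
print(df)
```

Since every cell needs its own API call, the calls themselves probably dominate the runtime; this mainly removes the per-cell dataframe writes and gives the columns a numeric dtype from the start.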
