I've 2 dataframe, both with a column date:
I need to set in first dataframe the value of specific column found in the second dataframe,
So in first of all I find the correct row of first dataframe with:
id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0]) #example: 910
and then I want to update the value of column ['search_volume'] at this row: 910
I will do this with:
dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
but I get back this error:
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
my full code is, but not working and nothing is updated.
for x in range(len(searchs)):
id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0])
dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
It work fine if I test manually the update with:
dataset['search_volume'][910] = searchs[kw_list[0]][47]
What's append?!
Use .loc:
dataset.loc[910, 'search_volume'] = searchs.loc[47, kw_list[0]]
For more info about the error message, see this
Also, there are way more efficient methods for doing this. As a rule of thumb, if you are looping over a dataframe, you are generally doing something wrong. Some potential solutions: pd.DataFrame.join, pd.merge, masking, pd.DataFrame.where, etc.
Related
I have a pandas object df and I would like to save that to .csv:
df.to_csv('output.csv', index = False)
Even if the data frame is displayed right in the terminal after printing, in the *.csv some lines are shifted several columns forward. I do not know how to demonstrate that in the minimal working code. I tried that with the one problematic column, but the result of one column was correct in the *.csv. What should I check, please? The whole column contains strings.
After advice:
selected['SpType'] = selected['SpType'].str.replace('\t', '')
I obtained an error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
selected['SpType'] = selected['SpType'].str.replace('\t', '')
If the tabs are the problem, you could just replace all tabs.
If the tabs occur in column column_name you could do something like:
df['column_name'] = df['column_name'].str.replace('\t', '')
If the problem is in several columns, you could loop over all columns. eg.:
for col in df.columns:
df[col] = df[col].str.replace('\t', '')
df.to_csv('output.csv', index = False)
Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then append all these sub-dataframes in a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframe, the resulting value gets overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formulas names and expression as a dict? The trouble here gives me the same problem as above.
Here is my formulas dict:
# ratios dict: all ratios names and functions
ratios = {"ratio": fuzz.ratio,
"partial ratio": fuzz.partial_ratio,
"token sort ratio": fuzz.token_sort_ratio,
"partial token sort ratio": fuzz.partial_token_sort_ratio,
"token set ratio": fuzz.token_set_ratio,
"partial token set ratio": fuzz.partial_token_set_ratio
}
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
# fuzzing function definition
def do_the_fuzz(row):
return rn(row[base_column], row[target_column])
# new base df containing ratio data and calculations for current loop turn
df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
df_out1['mesure'] = r
df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
...
df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
df_out1['mesure'] = r
df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
final_df = pd.concat([final_dfl, df_out1], axis=0)
I hope this can help you.
I have 2 data frames(final_combine_df & acs_df) that have a shared column ('CBG'). Dataframe acs_df has 2 additional columns that I want to add to the combined dataframe (acs_total_persons & cs_total_building_units) . For the 'CBG' column values in acs_df that match those in final_combine_df, I want to add the acs_total_persons & acs_total_housing_units values to that row.
acs_df.head()
CBG acs_total_persons acs_total_housing_units
10010211001 1925.0 1013.0 1
10030114011 2668.0 1303.0 2
10070100043 930.0 532.0 3
10139534001 1570.0 763.0 4
10150021023 1059.0 379.0
I tried combine_acs_merge = pd.concat([final_combine,acs_df], sort=True) but it did not seem to match them up. I also tried combine_acs_merge = final_combine.merge(acs_df, on='CBG') and got
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
What do I need to do here?
Note: Column acs_df['CBG'] is type numpy.float64, not a string but it should still return. Oddly, when I run the following: print(acs_df.loc[acs_df['CBG'] == '01030114011']) it returns an empty dataframe. I created the acs_df from a csv file (see below). Is that creating a problem?
acs_df = pd.read_csv(acs_data)
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list, instead to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.column.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
if df_user.shape[0] > 0:
open_combined = df_user_open.iloc[0, spcltycol] # capture 1st row
for row in range(1, df_user.shape[0]): # union with any subsequent rows
open_combined = open_combined.union(df_user.iloc[row, spcltycol])
new_row = df_user.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
new_row.append(open_combined)
out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
columns = ['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows, In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
I am getting the following warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Here is my code that is getting the warning:
col_names = ['Column1', 'Column2']
features = X_train[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
X_train[col_names] = features
I realize this is happening because I'm copying the dataframe. But what I am doing here is not like any of the answers I found googling, so I can't figure out how to apply their answers to my particular situation. It looks like the normal scenario where you get this warning is if you do something like this:
d2 = data[data['name'] == 'fred']
So .loc doesn't work. And .assign doesn't either because I have a list of columns instead of just a column I can assign. I'm just not quite sure how to handle this the way it wants me too.
It works fine the way it is, other than the warning. So the way I have it is correct.
I think the warning is saying for you to do something like
X_train.loc[:, col_names] = features