Referencing local variable within the same function - python-3.x

I want to merge 2 DataFrames using a function.
The function creates DataFrame df1 when called with variable 'x=1', and then another, df2, when called with 'x != 1', based on an if-statement within the function - code snippet below for further clarity.
Upon reaching the "df3 = pd.concat" line, I get the error "UnboundLocalError: local variable 'df1' referenced before assignment".
I would like to understand how to achieve the result of concatenating df1 and df2 into df3.
def Concat(url, x):
    if x == 1:
        df1 = pd.read_json(url)
    else:
        df2 = pd.read_json(url)
    df3 = pd.concat([df1, df2], ignore_index=True)
def main():
    Concat('*url*', 1)
    Concat('*url*', 2)

You should tweak it a bit, to be:
def Concat(url, x):
    for i in x:
        if i == 1:
            df1 = pd.read_json(url)
        else:
            df2 = pd.read_json(url)
    df3 = pd.concat([df1, df2], ignore_index=True)
def main():
    Concat('*url*', [1, 2])
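The original error arises because every call to Concat starts with fresh local variables: the call with x=1 only assigns df1 and the call with x=2 only assigns df2, so neither call has both names bound when pd.concat runs. A minimal sketch of an alternative, assuming the goal is simply to stack the frames read from several sources: read each source into a list and concatenate once. The name concat_sources and the in-memory JSON strings (standing in for real URLs) are illustrative assumptions, not part of the original code.

```python
from io import StringIO

import pandas as pd


def concat_sources(sources):
    """Read each JSON source into a DataFrame and stack them row-wise."""
    frames = [pd.read_json(src) for src in sources]
    return pd.concat(frames, ignore_index=True)


# In-memory JSON stands in for the two URLs (hypothetical data).
source_a = StringIO('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
source_b = StringIO('[{"id": 3, "name": "c"}]')

df3 = concat_sources([source_a, source_b])
print(df3)
```

Returning the combined frame also lets the caller use the result, which the versions above do not do.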

Related

How to alter a dataframe in the cycle inside a function

I'm trying to write a function that takes a list of columns, cols, and applies get_dummies to each one.
If I run the loop on its own, it works fine. But if I wrap it in a function, the dataframe remains unchanged.
The function:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
The usage:
cols = ['column1', 'column2']
cols_to_dummies(df, cols)
Concatenating the two frames creates a new df variable inside the function, so it loses its reference to the outside df. A simple fix is to return the frame from the function and assign the result.
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
    return df
df = cols_to_dummies(df, ['num', 'user'])
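As a possible shortcut, pd.get_dummies accepts a columns argument and returns a new frame with those columns replaced by their dummies, so the loop can be dropped. The sketch below uses made-up data with the column names from the question. One difference from the loop version is that the dummy column names are prefixed with the original column name.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    "column1": ["a", "b", "a"],
    "column2": ["x", "x", "y"],
    "value": [1, 2, 3],
})


def cols_to_dummies(df, cols):
    """Return a new frame with each column in cols one-hot encoded."""
    return pd.get_dummies(df, columns=cols)


encoded = cols_to_dummies(df, ["column1", "column2"])
print(encoded.columns.tolist())
```

The caller's df is left untouched here, so the result still has to be assigned, as in the answer above.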

Create columns with .apply() Pandas with strings

I have a Dataframe df.
One of the columns is named Adress and contains a string.
I have created a function processing(string) which takes a string as its argument and returns a part of that string.
I managed to apply the function to df and create a new column in df with:
df.loc[:, 'new_col_name'] = df.loc[:, 'Adress'].apply(processing)
I then modified processing(string) so that it returns two strings. I would like the second returned string to be stored in another new column.
To do so I tried to follow the steps given in: Create multiple pandas DataFrame columns from applying a function with multiple returns
Here is an example of my function processing(string):
def processing(string):
    # some processing
    return [A_string, B_string]
I also tried to return the two strings in a tuple.
Here are the different ways I tried to apply the function to my df :
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].astype(str).apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing, axis=1)
>>> TypeError: processing() got an unexpected keyword argument 'axis'
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']), axis=1)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress'].astype(str)), axis=1)
>>> AttributeError: 'str' object has no attribute 'astype'
#This is the only Error I could understand
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']))
>>> KeyError: 'Adress'
I think I am close, but I have no idea how to get there.
Try:
df["Adress"].apply(process)
Also, it's better to return a pd.Series from the function passed to apply, so that the result expands into columns.
Here is one example:
# build example dataframe
df = pd.DataFrame(data={'Adress' : ['Word_1_1 Word_1_2','Word_2_1 Word_2_2','Word_3_1 Word_3_2','Word_4_1 Word_4_2']})
print(df)
#               Adress
# 0  Word_1_1 Word_1_2
# 1  Word_2_1 Word_2_2
# 2  Word_3_1 Word_3_2
# 3  Word_4_1 Word_4_2
# Define your own function : here return two elements
def process(my_str):
    l = my_str.split(" ")
    return pd.Series(l)
# Apply the function and store the output in two new columns
df[["new_col_1", "new_col_2"]] = df["Adress"].apply(process)
print(df)
#               Adress new_col_1 new_col_2
# 0  Word_1_1 Word_1_2  Word_1_1  Word_1_2
# 1  Word_2_1 Word_2_2  Word_2_1  Word_2_2
# 2  Word_3_1 Word_3_2  Word_3_1  Word_3_2
# 3  Word_4_1 Word_4_2  Word_4_1  Word_4_2
You can try this (note that the column in the question is spelled 'Adress'):
df['new_column'] = df.apply(lambda row: processing(row['Adress']), axis=1)
or this:
df['new_column'] = df['Adress'].apply(lambda value: processing(value))
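The "too many values to unpack" error in the question comes from unpacking the result of apply directly: that result is a Series with one pair per row, so Python tries to split it by rows rather than by the two returned strings. If processing keeps returning a tuple or list, one way to get two columns is to transpose the pairs with zip. This is a minimal sketch. The body of processing below is a placeholder assumption, since the real processing logic is not shown.

```python
import pandas as pd


def processing(string):
    # Placeholder logic (an assumption): split on the first space.
    first, _, rest = string.partition(" ")
    return first, rest


df = pd.DataFrame(
    {"Adress": ["Word_1_1 Word_1_2", "Word_2_1 Word_2_2", "Word_3_1 Word_3_2"]}
)

# apply gives one (first, rest) pair per row; zip(*...) regroups them
# into one sequence of firsts and one sequence of rests.
pairs = df["Adress"].apply(processing)
df["1st_new_col"], df["2nd_new_col"] = zip(*pairs)
print(df)
```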

Error: 'BlockManager' object has no attribute 'T' issue while using df.at function in a loop

When I use the df.at function without a loop, it works fine and changes the data for a particular column, but it raises an error when used inside a loop.
The code is below.
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Height': [5.1, 6.2, 5.1, 5.2]}
df1 = pd.DataFrame(data1)
data2 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Height': [4.1, 3.4, 7.1, 9.2]}
df2 = pd.DataFrame(data2)
df3 = pd.concat([df1, df2], axis=1)
for i in range(int(len(df1))):
    for j in range(int(len(df2))):
        if df1['Name'][i] != df2['Name'][j]:
            continue
        else:
            out = (df1['Height'][i] - df2['Height'][j])
            df3.at[i, 'Height_Comparison'] = out
            break
print(df3)
The issue was caused by duplicate column names ('Name', 'Height') in df3 produced by the concat operation: concatenating along axis=1 keeps both copies of each column under the same name, which creates this problem.
Once I changed the column names to Name1, Height1 in df1 and Name2, Height2 in df2, the issue was resolved.
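As a sketch of an alternative that avoids both the duplicate names and the nested loop, the two frames could be joined on Name with merge, letting its suffixes argument produce distinct Height columns. This assumes each name appears once per frame, as in the sample data. The suffixes "1" and "2" are chosen here to match the renamed columns mentioned above.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Jai", "Princi", "Gaurav", "Anuj"],
                    "Height": [5.1, 6.2, 5.1, 5.2]})
df2 = pd.DataFrame({"Name": ["Jai", "Princi", "Gaurav", "Anuj"],
                    "Height": [4.1, 3.4, 7.1, 9.2]})

# Joining on Name pairs up matching rows; the suffixes keep the two
# Height columns distinct (Height1 from df1, Height2 from df2).
df3 = df1.merge(df2, on="Name", suffixes=("1", "2"))
df3["Height_Comparison"] = df3["Height1"] - df3["Height2"]
print(df3)
```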

pandas SettingWithCopyWarning only inside function

With a dataframe like
import pandas as pd
df = pd.DataFrame(
    ["2017-01-01 04:45:00", "2017-01-01 04:45:00removeMe"], columns=["col"]
)
why do I get a SettingWithCopyWarning here
def test_fun(df):
    df = df[~df["col"].str.endswith("removeMe")]
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
df = test_fun(df)
but not if I run it without the function?
df = df[~df["col"].str.endswith("removeMe")]
df.loc[:, "col"] = pd.to_datetime(df["col"])
And what should my function look like?
In the function, indexing df with your boolean array gives a new frame that pandas still treats as derived from the outside-scope df, which the caller keeps alive. Assigning into that derived frame with .loc is what appears to trigger the warning. Without the function, df = ... rebinds the only name that referred to the original frame, so there is no surviving parent for pandas to warn about.
Either way, I would write the function body like this instead:
df["col"] = pd.to_datetime(df["col"], errors='coerce')
return df[~pd.isna(df["col"])]
Found the trick:
def test_fun(df):
    df.loc[:] = df[~df["col"].str.endswith("removeMe")]  # I added the `.loc[:]`
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
Don't do df = ... in the function.
Instead do df.loc[:] = ... !
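One caveat worth checking before relying on the trick above: as far as I can tell, df.loc[:] = ... assigns by index alignment, so the filtered-out rows stay in df with missing values rather than being dropped. A commonly used alternative is to take an explicit copy of the filtered frame and work on that. The sketch below assumes that approach. The function name drop_and_parse is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    ["2017-01-01 04:45:00", "2017-01-01 04:45:00removeMe"], columns=["col"]
)


def drop_and_parse(df):
    # .copy() makes the filtered frame independent of the caller's df,
    # so the assignment below cannot be mistaken for a write to a slice.
    filtered = df[~df["col"].str.endswith("removeMe")].copy()
    filtered["col"] = pd.to_datetime(filtered["col"])
    return filtered


result = drop_and_parse(df)
print(result)
```

The caller's df keeps both rows here. Only the returned frame is filtered and converted.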

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column in and try to loop it, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet, for code quality, have the select statement use the text_col variable, so you only have to change one line of code if you need to do this with more columns or if your column names change. E.g.,
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....
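A side note on the pattern itself, separate from the error: inside a character class, #-: reads as a range from "#" to ":", which also covers the digits 0-9. The sketch below illustrates this with Python's re module and a made-up sample string. Spark's regexp_replace uses Java regular expressions, where character-class ranges should behave the same way, but that is worth verifying on real data.

```python
import re

# Hypothetical sample text.
text = "pump #12 failed; error-code 404?"

# As written in the question: "#-:" is a range (from "#" to ":"),
# which also covers the digits 0-9.
range_pattern = r"[.,#-:;/?!']"

# Hyphen moved to the end of the class so it is matched literally.
literal_pattern = r"[.,#:;/?!'-]"

with_range = re.sub(range_pattern, " ", text)
with_literal = re.sub(literal_pattern, " ", text)
print(repr(with_range))
print(repr(with_literal))
```

If digits should survive the cleanup, the second pattern is the safer starting point.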
