How to alter a dataframe in a loop inside a function - python-3.x

I'm trying to write a function that takes a list of columns, cols, and applies get_dummies to each one.
If I run the loop on its own, it works fine. But when I wrap it in a function, the dataframe remains unchanged.
The function:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
The usage:
cols = ['column1', 'column2']
cols_to_dummies(df, cols)

Concatenating the two frames creates a new df variable inside the function, so it loses its reference to the outside df. A simple fix is to return the frame and assign the result back:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
    return df
df = cols_to_dummies(df, ['num', 'user'])
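The rebinding described above can be seen in a small, self-contained sketch. The sample frame and its val column are made up for illustration, and the function is a slightly simplified variant of the one above, not the only way to write it. It shows that the caller's frame keeps its columns while the returned frame carries the dummies.

```python
import pandas as pd


def cols_to_dummies(df, cols):
    """Return a new frame with each column in `cols` replaced by dummy columns."""
    for c in cols:
        dummies = pd.get_dummies(df[c])
        # pd.concat builds a new object, so `df` now names a local frame only
        df = pd.concat([df.drop(columns=c), dummies], axis=1)
    return df


# Hypothetical sample data, not taken from the question
original = pd.DataFrame({"num": [1, 2], "user": ["a", "b"], "val": [10, 20]})
result = cols_to_dummies(original, ["num", "user"])

print(list(original.columns))  # the caller's frame is untouched
print(list(result.columns))    # the returned frame has the dummy columns
```

If the dummy columns should carry the source column name as a prefix, pd.get_dummies(df, columns=cols) may be a shorter alternative.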

Related

Referencing local variable within the same function

I want to merge 2 DataFrames using a function.
The function creates DataFrame df1 when called with variable 'x=1', and then another, df2, when called with 'x != 1', based on an if-statement within the function - code snippet below for further clarity.
Upon reaching the "df3 = pd.concat" line, I get the error "UnboundLocalError: local variable 'df1' referenced before assignment".
I would like to understand how to achieve the result of concatenating df1 and df2 into df3.
def Concat(url, x):
    if x == 1:
        df1 = pd.read_json(url)
    else:
        df2 = pd.read_json(url)
    df3 = pd.concat([df1, df2], ignore_index=True)
def main():
    Concat('*url*', 1)
    Concat('*url*', 2)
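Each call to Concat gets its own local variables, so the df1 created in the first call no longer exists in the second. Tweak the function so that both frames are loaded within a single call:
def Concat(url, x):
    for i in x:
        if i == 1:
            df1 = pd.read_json(url)
        else:
            df2 = pd.read_json(url)
    # concatenate only after the loop, once both frames exist
    df3 = pd.concat([df1, df2], ignore_index=True)
    return df3
def main():
    Concat('*url*', [1, 2])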
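Another way to avoid the UnboundLocalError is to let one helper load a single frame and do the concatenation over a list of results. The sketch below is only an illustration: load_frame is a hypothetical stand-in for pd.read_json(url), and the in-memory records replace the real URLs so the example runs on its own.

```python
import pandas as pd


def load_frame(records):
    # Hypothetical stand-in for pd.read_json(url): builds a frame from records
    return pd.DataFrame(records)


def concat_all(sources):
    # Every frame is created and used inside the same call,
    # so no local name is referenced before assignment
    frames = [load_frame(source) for source in sources]
    return pd.concat(frames, ignore_index=True)


# Made-up sample data in place of two JSON endpoints
df3 = concat_all([[{"a": 1}], [{"a": 2}]])
print(df3)
```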

Create columns with .apply() Pandas with strings

I have a Dataframe df.
One of the columns is named Adress and contains a string.
I have created a function processing(string) which takes a string as its argument and returns a part of this string.
I managed to apply the function to df and create a new column in df with:
df.loc[:, 'new_col_name'] = df.loc[:, 'Adress'].apply(processing)
I modified my function processing(string) so that it returns two strings. I would like the second string returned to be stored in another new column.
To do so I tried to follow the steps given in: Create multiple pandas DataFrame columns from applying a function with multiple returns
Here is an example of my function processing(string):
def processing(string):
    # some processing
    return [A_string, B_string]
I also tried to return the two strings in a tuple.
Here are the different ways I tried to apply the function to my df:
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].astype(str).apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing, axis=1)
>>> TypeError: processing() got an unexpected keyword argument 'axis'
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']), axis=1)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress'].astype(str)), axis=1)
>>> AttributeError: 'str' object has no attribute 'astype'
#This is the only error I could understand
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']))
>>> KeyError: 'Adress'
I think I am close, but I don't know how to get there.
Try:
df["Adress"].apply(process)
Also, it's better to return a pd.Series from the applied function, so that apply expands the result into one column per element.
Here is an example:
# build example dataframe
df = pd.DataFrame(data={'Adress': ['Word_1_1 Word_1_2', 'Word_2_1 Word_2_2', 'Word_3_1 Word_3_2', 'Word_4_1 Word_4_2']})
print(df)
#               Adress
# 0  Word_1_1 Word_1_2
# 1  Word_2_1 Word_2_2
# 2  Word_3_1 Word_3_2
# 3  Word_4_1 Word_4_2
# Define your own function: here it returns two elements
def process(my_str):
    l = my_str.split(" ")
    return pd.Series(l)
# Apply the function and store the output in two new columns
df[["new_col_1", "new_col_2"]] = df["Adress"].apply(process)
print(df)
#               Adress new_col_1 new_col_2
# 0  Word_1_1 Word_1_2  Word_1_1  Word_1_2
# 1  Word_2_1 Word_2_2  Word_2_1  Word_2_2
# 2  Word_3_1 Word_3_2  Word_3_1  Word_3_2
# 3  Word_4_1 Word_4_2  Word_4_1  Word_4_2
You can try this:
df['new_column'] = df.apply(lambda row: processing(row['Adress']), axis=1)
or this:
df['new_column'] = df['Adress'].apply(lambda value: processing(value))
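The "too many values to unpack" errors in the question arise because Series.apply returns one Series with a tuple (or list) per row, and Python then tries to unpack that whole Series into two targets. If the function has to keep returning a plain tuple, one possible approach is to expand the tuples into a two-column frame first. This is a minimal sketch: the body of processing is a made-up stand-in that splits on the first space, and the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data using the column name from the question
df = pd.DataFrame({"Adress": ["Word_1_1 Word_1_2", "Word_2_1 Word_2_2"]})


def processing(string):
    # Stand-in for the real processing: split once on the first space
    first, second = string.split(" ", 1)
    return first, second


# Series.apply yields a Series of tuples; turn it into a two-column frame
parts = pd.DataFrame(
    df["Adress"].apply(processing).tolist(),
    index=df.index,
    columns=["1st_new_col", "2nd_new_col"],
)
df = df.join(parts)
print(df)
```

Passing index=df.index keeps the new columns aligned with the original rows even when the index is not a simple range.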

pandas SettingWithCopyWarning only inside function

With a dataframe like
import pandas as pd
df = pd.DataFrame(
    ["2017-01-01 04:45:00", "2017-01-01 04:45:00removeMe"], columns=["col"]
)
why do I get a SettingWithCopyWarning here
def test_fun(df):
    df = df[~df["col"].str.endswith("removeMe")]
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
df = test_fun(df)
but not if I run it without the function?
df = df[~df["col"].str.endswith("removeMe")]
df.loc[:, "col"] = pd.to_datetime(df["col"])
And what should my function look like?
In the function, indexing df with your boolean array gives a new frame that pandas still treats as derived from the outside-scope df, and that original frame is still alive because the caller holds a reference to it. Assigning into the derived frame with .loc is what brings up the warning. Without the function, df = ... rebinds the only name that referred to the original frame, so as far as I can tell there is no longer an original for the assignment to be confused with.
Either way, I would write the function body like this instead:
    df["col"] = pd.to_datetime(df["col"], errors='coerce')
    return df[~pd.isna(df["col"])]
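Found the trick:
def test_fun(df):
    df.loc[:] = df[~df["col"].str.endswith("removeMe")]  # I added the .loc[:]
    df.loc[:, "col"] = pd.to_datetime(df["col"])
    return df
Don't do df = ... in the function.
Instead do df.loc[:] = ... !
Note that df.loc[:] = ... writes into the caller's frame and keeps its shape, so the filtered-out rows stay in place as missing values instead of being dropped.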

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column in and try to loop it, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet, for code quality purposes, have the select statement use the text_col variable, so you only have to change one line of code if you need to do this with more columns or if your column names change. For example:
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....

Reducing dimensionality of multiindex pandas dataframe using apply

I have the following dataframe:
df = pd.DataFrame({('psl', 't1'): {'fiat': 36.389809173765507,
                                   'mazda': 18.139242981049016,
                                   'opel': 0.97626485600703961,
                                   'toyota': 74.464422292108878},
                   ('psl', 't2'): {'fiat': 35.423004380643462,
                                   'mazda': 24.269803148695079,
                                   'opel': 1.0170540474994665,
                                   'toyota': 60.389948228586832},
                   ('psv', 't1'): {'fiat': 35.836800462163097,
                                   'mazda': 15.893295606055901,
                                   'opel': 0.78744853046848606,
                                   'toyota': 74.054850828062271},
                   ('psv', 't2'): {'fiat': 34.379812557124815,
                                   'mazda': 23.202587247335682,
                                   'opel': 0.80191294532382451,
                                   'toyota': 58.735083244244322}})
It looks like this:
I wish to reduce it from a multiindex to a normal index. I wish to do this by applying a function using t1 and t2 values and returning only a single value which will result in there being two columns: psl and psv.
I have succeeded in grouping it as such and applying a function:
df.groupby(level=0, axis=1).agg(np.mean)
which is very close to what I want except that I don't want to apply np.mean, but rather a custom function. In particular, a percent change function.
My end goal is to be able to do something like this:
df.groupby(level=0, axis=1).apply(lambda t1, t2: (t2-t1)/t1)
Which returns this error:
TypeError: <lambda>() missing 1 required positional argument: 't2'
I have also tried this:
df.apply(lambda x: x[x.name].apply(lambda x: x['t1']/x['t2']))
which in turn returns:
KeyError: (('psl', 't1'), 'occurred at index (psl, t1)')
Could you please include a thorough explanation of each part of your answer to the best of your abilities so I can better understand how pandas works.
This is not easy. Use a custom function that selects each second-level label of the column MultiIndex with xs and turns the resulting one-column frame into a Series with squeeze:
def f(x):
    t2 = x.xs('t2', axis=1, level=1)
    t1 = x.xs('t1', axis=1, level=1)
    a = (t2-t1)/t1
    #print (a)
    return a.squeeze()
df1 = df.groupby(level=0, axis=1).agg(f)
print(df1)
             psl       psv
fiat   -0.026568 -0.040656
mazda   0.337972  0.459898
opel    0.041781  0.018369
toyota -0.189009 -0.206871
Using a lambda function is possible, but really awful because of the repeated code:
df1 = (df.groupby(level=0, axis=1)
         .agg(lambda x: ((x.xs('t2', axis=1, level=1)-x.xs('t1', axis=1, level=1))/
                         x.xs('t1', axis=1, level=1)).squeeze()))
Using iloc can also solve the problem, assuming t1 is the first column and t2 the second within each group:
df.groupby(level=0, axis=1).agg(lambda x: (x.iloc[:,1]-x.iloc[:,0])/x.iloc[:,0])
Outputs:
             psl       psv
fiat   -0.026568 -0.040656
mazda   0.337972  0.459898
opel    0.041781  0.018369
toyota -0.189009 -0.206871
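The same percent change can also be computed without groupby by taking one cross-section per second-level label and letting pandas align the two frames on their shared psl and psv columns. This is a sketch rather than the method from the answers above. It uses two rows of the question's data to stay short, and it may be the safer route on recent pandas releases, where groupby with axis=1 is deprecated.

```python
import pandas as pd

# Two rows of the question's data; tuple keys become a column MultiIndex
df = pd.DataFrame({
    ("psl", "t1"): {"fiat": 36.389809173765507, "mazda": 18.139242981049016},
    ("psl", "t2"): {"fiat": 35.423004380643462, "mazda": 24.269803148695079},
    ("psv", "t1"): {"fiat": 35.836800462163097, "mazda": 15.893295606055901},
    ("psv", "t2"): {"fiat": 34.379812557124815, "mazda": 23.202587247335682},
})

# xs drops the selected level, leaving plain columns psl and psv
t1 = df.xs("t1", axis=1, level=1)
t2 = df.xs("t2", axis=1, level=1)

# Element-wise percent change; rows and columns are aligned by label
pct_change = (t2 - t1) / t1
print(pct_change)
```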

Resources