Create columns with .apply() in pandas with strings - python-3.x

I have a DataFrame df.
One of the columns is named Adress and contains a string.
I have created a function processing(string) which takes a string as argument and returns a part of this string.
I succeeded in applying the function to df and creating a new column in df with:
df.loc[:, 'new_col_name'] = df.loc[:, 'Adress'].apply(processing)
I modified my function processing(string) so that it returns two strings. I would like the second returned string to be stored in another new column.
To do so I tried to follow the steps given in: Create multiple pandas DataFrame columns from applying a function with multiple returns
Here is an example of my function processing(string):
def processing(string):
    # some processing
    return [A_string, B_string]
I also tried to return the two strings in a tuple.
Here are the different ways I tried to apply the function to my df:
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].astype(str).apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing, axis=1)
>>> TypeError: processing() got an unexpected keyword argument 'axis'
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']), axis=1)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress'].astype(str)), axis=1)
>>> AttributeError: 'str' object has no attribute 'astype'
# This is the only error message I could understand
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']))
>>> KeyError: 'Adress'
I think I am close, but I have no idea how to get there.

Try:
df["Adress"].apply(process)
Also, it's better to return a pd.Series from the function you pass to apply.
Here is one example:
import pandas as pd

# build example dataframe
df = pd.DataFrame(data={'Adress': ['Word_1_1 Word_1_2', 'Word_2_1 Word_2_2',
                                   'Word_3_1 Word_3_2', 'Word_4_1 Word_4_2']})
print(df)
#               Adress
# 0  Word_1_1 Word_1_2
# 1  Word_2_1 Word_2_2
# 2  Word_3_1 Word_3_2
# 3  Word_4_1 Word_4_2

# Define your own function: here it returns two elements
def process(my_str):
    l = my_str.split(" ")
    return pd.Series(l)

# Apply the function and store the output in two new columns
df[["new_col_1", "new_col_2"]] = df["Adress"].apply(process)
print(df)
#               Adress new_col_1 new_col_2
# 0  Word_1_1 Word_1_2  Word_1_1  Word_1_2
# 1  Word_2_1 Word_2_2  Word_2_1  Word_2_2
# 2  Word_3_1 Word_3_2  Word_3_1  Word_3_2
# 3  Word_4_1 Word_4_2  Word_4_1  Word_4_2
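As a side note, for a plain whitespace split like this, pandas' built-in string accessor produces the same two columns without a Python-level apply (a minimal alternative, assuming every Adress value splits into exactly two parts):
df[["new_col_1", "new_col_2"]] = df["Adress"].str.split(" ", expand=True)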

You can try this (note the column in the question is spelled 'Adress'):
df['new_column'] = df.apply(lambda row: processing(row['Adress']), axis=1)
or this:
df['new_column'] = df['Adress'].apply(lambda value: processing(value))
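If processing has to stay a regular function returning a pair, another common idiom is to unpack the pairs with zip (a sketch reusing the question's names):
df['1st_new_col'], df['2nd_new_col'] = zip(*df['Adress'].apply(processing))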

Related

More elegant and efficient way to get the same output

I have a df:
info
{"any_name":{"value":["5"], "ref":"any text"}, "another_name":{"value":["2"], "ref":"any text"}
{"any_name":{"value":["1"], "ref":"any text"}, "another_name":{"value":["12"], "ref":"any text"}
the dtype of this column is:
df['info'].apply(type) => <class 'str'>
I want to make a dataframe to get this output:
any_name another_name
5 2
1 12
My solution is:
A = list(df['info'])
J = []
for i in range(0, len(A)):
    D = eval(A[i])
    foo = {k: v['value'] for k, v in D.items() if k in list_to_filter_columns}
    J.append(foo)
out = pd.DataFrame(J)
# cast the values to numeric, as each is a one-element list
out = out.apply(lambda x: x.str[0])
out = out.apply(pd.to_numeric)
out.head(2)
The above solution is working just fine.
I want to know if there's a more elegant way to get the same result. I think code above is very inefficient and not elegant. Is there a better way to do it?
No need for a loop, you could use pandas.json_normalize:
import ast

# the stored strings are missing their final "}", so append it before parsing
df["info"] = df["info"].apply(lambda x: ast.literal_eval(x + "}"))

out = (
    pd.json_normalize(df["info"])
      .filter(regex="value$")
      .astype(str)
      .apply(lambda x: x.str.strip("['']"))
)

out.columns = out.columns.str.replace(r"\.value", "", regex=True)

# Output
print(out)
any_name another_name
0 5 2
1 1 12
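If you also need numeric dtypes, as in the expected output, you can finish with a cast (a small addition to the answer above):
out = out.apply(pd.to_numeric)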

How to alter a dataframe in a loop inside a function

I'm trying to make a function that takes the column list cols and performs get_dummies for each.
The thing is that the loop works fine on its own. But if I wrap it in a function, the dataframe remains unchanged.
The function:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
The usage:
cols = ['column1', 'column2']
cols_to_dummies(df, cols)
Concatenating the two frames rebinds the df variable inside the function, so it loses the reference to the outside df. A simple way to fix your code is to return the frame:
def cols_to_dummies(df, cols: list):
    for c in cols:
        temp_dum = pd.get_dummies(df[str(c)])
        df = pd.concat([df, temp_dum], axis=1)
        df.drop(str(c), axis=1, inplace=True)
        del temp_dum
    return df

df = cols_to_dummies(df, ['num', 'user'])
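It may be worth noting that pd.get_dummies can also do the concat-and-drop in one call when handed the whole frame and a columns list, which makes the helper unnecessary (assuming the default dummy column naming is acceptable):
df = pd.get_dummies(df, columns=['num', 'user'])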

Referencing local variable within the same function

I want to merge 2 DataFrames using a function.
The function creates DataFrame df1 when called with variable 'x=1', and then another, df2, when called with 'x != 1', based on an if-statement within the function - code snippet below for further clarity.
Upon reaching the "df3 = pd.concat" line, I get the error "UnboundLocalError: local variable 'df1' referenced before assignment".
I would like to understand how to achieve the result of concatenating df1 and df2 into df3.
def Concat(url, x):
    if x == 1:
        df1 = pd.read_json(url)
    else:
        df2 = pd.read_json(url)
    df3 = pd.concat([df1, df2], ignore_index=True)

def main():
    Concat('*url*', 1)
    Concat('*url*', 2)
Each call defines only one of df1/df2, so by the time pd.concat runs, the other name has never been assigned in that call's local scope. You should tweak it a bit so a single call builds both frames:
def Concat(url, x):
    for i in x:
        if i == 1:
            df1 = pd.read_json(url)
        else:
            df2 = pd.read_json(url)
    df3 = pd.concat([df1, df2], ignore_index=True)

def main():
    Concat('*url*', [1, 2])
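A more direct rewrite (a sketch, not part of the answer above) sidesteps the if/else entirely by reading every source into a list and concatenating once:

def concat_urls(urls):
    # read each source, then combine the frames in one go
    frames = [pd.read_json(u) for u in urls]
    return pd.concat(frames, ignore_index=True)

df3 = concat_urls(['*url*', '*url*'])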

Iteration over a list in a Pandas DataFrame column

I have a dataframe df like this one:
                                                 my_list
Index
0                                         [81310, 81800]
1                                                [82160]
2      [75001, 75002, 75003, 75004, 75005, 75006, 750...
3                                                [95190]
4                                         [38170, 38180]
5                                                [95240]
6                                                [71150]
7                                                [62520]
I have a list named code with at least one element.
code = ['75008', '75015']
I want to create another column in my DataFrame named my_min, containing the minimum absolute difference between each element of the list code and the list from df.my_list.
Here are the commands I tried:
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list'].str[:]])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list']])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list'].tolist()])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list'].str[:]])
>>> UnboundLocalError: local variable 'z' referenced before assignment
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list']])
>>> UnboundLocalError: local variable 'z' referenced before assignment
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list'].tolist()])
>>> UnboundLocalError: local variable 'z' referenced before assignment
You could do this with a list comprehension:
import pandas as pd
import numpy as np
df = pd.DataFrame({'my_list':[[81310, 81800],[82160]]})
code = ['75008', '75015']
pd.DataFrame({'my_min': [min([abs(int(i) - j) for i in code for j in x])
                         for x in df.my_list]})
returns
my_min
0 6295
1 7145
You could also use pd.Series.apply instead of the outer list comprehension, for example:
df.my_list.apply(lambda x: min([abs(int(i) - j) for i in code for j in x]))
Write a helper: def find_min(lst): -- it is clear you know how to do that. The helper will consult a global named code.
Then apply it:
df['my_min'] = df.my_list.apply(find_min)
The advantage of breaking out a helper is that you can write separate unit tests for it.
If you prefer to avoid globals, you will find functools.partial quite helpful; see the sketch below.
https://docs.python.org/3/library/functools.html#functools.partial
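A minimal sketch of that suggestion; find_min's body is an assumption here, written to match the minimum absolute difference described in the question:

from functools import partial

def find_min(lst, code):
    # smallest absolute difference between any code and any element of lst
    return min(abs(int(c) - int(v)) for c in code for v in lst)

# partial pins the code argument, so apply only has to supply each list
df['my_min'] = df.my_list.apply(partial(find_min, code=code))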
If you have pandas 0.25+ you can use explode and combine with np.min:
# sample data
df = pd.DataFrame({'my_list':
                   [[81310, 81800], [82160], [75001, 75002]]})
code = ['75008', '75015']

# concatenate the lists into one series
s = df.my_list.explode()

# convert `code` into np.array
code = np.array(code, dtype=int)

# this is the output series
pd.Series(np.min(np.abs(s.values[:, None] - code), axis=1),
          index=s.index).min(level=0)
Output:
0 6295
1 7145
2 6
dtype: int64
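One caveat: Series.min(level=0) was deprecated in pandas 1.3 and removed in 2.0, so on current versions the last step becomes an explicit groupby:

pd.Series(np.min(np.abs(s.values[:, None] - code), axis=1),
          index=s.index).groupby(level=0).min()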

Reducing dimensionality of multiindex pandas dataframe using apply

I have the following dataframe:
df = pd.DataFrame({('psl', 't1'): {'fiat': 36.389809173765507,
                                   'mazda': 18.139242981049016,
                                   'opel': 0.97626485600703961,
                                   'toyota': 74.464422292108878},
                   ('psl', 't2'): {'fiat': 35.423004380643462,
                                   'mazda': 24.269803148695079,
                                   'opel': 1.0170540474994665,
                                   'toyota': 60.389948228586832},
                   ('psv', 't1'): {'fiat': 35.836800462163097,
                                   'mazda': 15.893295606055901,
                                   'opel': 0.78744853046848606,
                                   'toyota': 74.054850828062271},
                   ('psv', 't2'): {'fiat': 34.379812557124815,
                                   'mazda': 23.202587247335682,
                                   'opel': 0.80191294532382451,
                                   'toyota': 58.735083244244322}})
It looks like this when printed (the original post embedded a screenshot):
              psl                    psv
               t1         t2          t1         t2
fiat    36.389809  35.423004   35.836800  34.379813
mazda   18.139243  24.269803   15.893296  23.202587
opel     0.976265   1.017054    0.787449   0.801913
toyota  74.464422  60.389948   74.054851  58.735083
I wish to reduce it from a multiindex to a normal index. I wish to do this by applying a function using t1 and t2 values and returning only a single value which will result in there being two columns: psl and psv.
I have succeeded in grouping it as such and applying a function:
df.groupby(level=0, axis=1).agg(np.mean)
which is very close to what I want except that I don't want to apply np.mean, but rather a custom function. In particular, a percent change function.
My end goal is to be able to do something like this:
df.groupby(level=0, axis=1).apply(lambda t1, t2: (t2-t1)/t1)
Which returns this error:
TypeError: <lambda>() missing 1 required positional argument: 't2'
I have also tried this:
df.apply(lambda x: x[x.name].apply(lambda x: x['t1']/x['t2']))
which in turn returns:
KeyError: (('psl', 't1'), 'occurred at index (psl, t1)')
Could you please include a thorough explanation of each part of your answer to the best of your abilities so I can better understand how pandas works.
Not easy. Use a custom function with squeeze to collapse the result to a Series and xs to select a level from the MultiIndex columns:
def f(x):
    t2 = x.xs('t2', axis=1, level=1)
    t1 = x.xs('t1', axis=1, level=1)
    a = (t2 - t1) / t1
    #print (a)
    return a.squeeze()

df1 = df.groupby(level=0, axis=1).agg(f)
print(df1)
psl psv
fiat -0.026568 -0.040656
mazda 0.337972 0.459898
opel 0.041781 0.018369
toyota -0.189009 -0.206871
Using a lambda function is possible, but it is really awful because of the repeated code:
df1 = df.groupby(level=0, axis=1) \
        .agg(lambda x: ((x.xs('t2', axis=1, level=1) - x.xs('t1', axis=1, level=1)) /
                        x.xs('t1', axis=1, level=1)).squeeze())
Using iloc can solve the problem. Note that x.iloc[:, 0] is t1 and x.iloc[:, 1] is t2, so this computes (t1 - t2) / t1, which is why the signs below are flipped relative to the previous answer:
df.groupby(level=0, axis=1).agg(lambda x: (x.iloc[:, 0] - x.iloc[:, 1]) / x.iloc[:, 0])
Outputs:
psl psv
fiat 0.026568 0.040656
mazda -0.337972 -0.459898
opel -0.041781 -0.018369
toyota 0.189009 0.206871
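As a footnote, the groupby isn't strictly necessary here; a sketch of the same percent change computed directly on the frame, relying on the fact that xs leaves both slices with matching psl/psv columns:

t1 = df.xs('t1', axis=1, level=1)  # columns: psl, psv
t2 = df.xs('t2', axis=1, level=1)
out = (t2 - t1) / t1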
