Splitting a column of a Pandas dataframe using multiple conditions - python-3.x

Let's say I have this column in Pandas:
df['GPS'][0]:
0 '39.21,38.6;39.23,38.68;39.26,38.68'
I would like to split the column into:
Xcoord1 Ycoord1 Xcoord2 Ycoord2 Xcoord3 Ycoord3
39.21 38.6 39.23 38.68 39.26 38.68
My approach is to first split the column using:
df['GPS_list']=df['GPS'].apply(lambda x: x.split(';'))
df['GPS_list']:
['39.21,38.6','39.23,38.68','39.26,38.68']
Now I would need to split on ',' to separate the x and y values, which I am not sure how to do for each row. Would the apply function work here?
Next, I would need to convert each of those values from string to float (the coordinates are decimals).
Is there an elegant way to do this in a step or two?
I am new to Python and Pandas so any nudge in the right direction is helpful.

If you always have the same number of coordinates, a simple str.split will work:
out = (df['GPS'].str.split('[,;]', expand=True)
.set_axis(['Xcoord1', 'Ycoord1', 'Xcoord2', 'Ycoord2', 'Xcoord3', 'Ycoord3'], axis=1)
)
If you have an arbitrary number of pairs, you can use:
out = (df['GPS'].str.split(';', expand=True).stack()
.str.split(',', expand=True)
.set_axis(['Xcoord', 'Ycoord'], axis=1).unstack()
.sort_index(level=1, axis=1)
)
out.columns = out.columns.map(lambda x: f'{x[0]}{x[1]+1}')
Output:
Xcoord1 Ycoord1 Xcoord2 Ycoord2 Xcoord3 Ycoord3
0 39.21 38.6 39.23 38.68 39.26 38.68
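Since str.split produces strings, a final numeric cast also answers the conversion part of the question; note the coordinates are floats rather than ints, so a one-line follow-up to either approach would be:
out = out.astype(float)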

Example
df = pd.DataFrame(['39.21,38.6;39.23,38.68;39.26,38.68'], columns=['GPS'])
df
GPS
0 39.21,38.6;39.23,38.68;39.26,38.68
Code
col1 = ['Xcoord1', 'Ycoord1', 'Xcoord2', 'Ycoord2', 'Xcoord3', 'Ycoord3']
df['GPS'].str.split(r'[,;]', expand=True).set_axis(col1, axis=1)
Result:
Xcoord1 Ycoord1 Xcoord2 Ycoord2 Xcoord3 Ycoord3
0 39.21 38.6 39.23 38.68 39.26 38.68

Related

More elegant and efficient way to get the same output

I have a df:
info
{"any_name":{"value":["5"], "ref":"any text"}, "another_name":{"value":["2"], "ref":"any text"}
{"any_name":{"value":["1"], "ref":"any text"}, "another_name":{"value":["12"], "ref":"any text"}
the dtype of this column is:
df['info'].apply(type) => <class 'str'>
I want to make a dataframe to get this output:
any_name another_name
5 2
1 12
My solution is:
A = list(df['info'])
J = []
for i in range(0, len(A)):
    D = eval(A[i])
    foo = {k: v['value'] for k, v in D.items() if k in list_to_filter_columns}
    J.append(foo)
out = pd.DataFrame(J)
# cast the values to numeric; they are one-element lists
out = out.apply(lambda x: x.str[0])
out = out.apply(pd.to_numeric)
out.head(2)
The above solution is working just fine.
I want to know if there's a more elegant way to get the same result. I think code above is very inefficient and not elegant. Is there a better way to do it?
No need for a loop, you could use pandas.json_normalize:
import ast
df["info"] = df["info"].apply(lambda x: ast.literal_eval(x + "}"))
out = (
    pd.json_normalize(df["info"])
    .filter(regex="value$")
    .astype(str)
    .apply(lambda x: x.str.strip("['']"))
)
out.columns = out.columns.str.replace(r"\.value", "", regex=True)
# Output
print(out)
any_name another_name
0 5 2
1 1 12
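If numeric values are wanted at the end, as in the asker's own solution, the one-element lists can also be unpacked directly instead of string-stripping; a sketch of the same pipeline under that assumption:
out = pd.json_normalize(df["info"]).filter(regex="value$")
out = out.apply(lambda x: x.str[0]).apply(pd.to_numeric)  # unwrap the one-element lists, then cast
out.columns = out.columns.str.replace(r"\.value", "", regex=True)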

Set decimal values to 2 points in list under list pandas

I am trying to limit the values of a nested list to 2 decimal places in the displayed result. I have already tried setting the display precision and a few other things, but cannot find a way.
r_ij_matrix = variables[1]
print(type(r_ij_matrix))
print(type(r_ij_matrix[0]))
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.precision", 2)
data = pd.DataFrame(r_ij_matrix, columns=Attributes, index=Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Combined Decision Matrix')
You can solve your problem with the apply() method of the DataFrame. You can do something like this:
df.apply(lambda x: [[round(elt, 2) for elt in list_] for list_ in x])
Solved it by copying the list into another one with the desired number of decimal places. Thanks everyone.
rij_matrix = variables[1]
rij_nparray = np.empty([8, 6, 3])
for i in range(8):
    for j in range(6):
        for k in range(3):
            rij_nparray[i][j][k] = round(rij_matrix[i][j][k], 2)
rij_list = rij_nparray.tolist()
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(rij_list, columns=Attributes, index=Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Normalized Fuzzy Decision Matrix (r_ij)')
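Since rij_matrix is a plain nested list of floats, the triple loop can also be replaced by a single vectorized round; a minimal sketch assuming the same 8x6x3 structure:
import numpy as np
rij_list = np.round(np.asarray(rij_matrix), 2).tolist()  # round every element to 2 decimal places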
applymap seems to be a good fit here:
df.applymap(lambda lst: list(map("{:.2f}".format, lst)))
But there is a BUT: be aware that it is probably not the best idea to store lists as values of a DataFrame; you give up most of the functionality of pandas. Also, after formatting them like this, the numbers are stored as strings. This (if really wanted) should only be for presentation.
Output:
A B
0 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
1 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
2 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
Used Input:
df = pd.DataFrame({
'A': [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463]],
'B': [[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414]]})
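Note: DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on recent versions the same idea reads:
df.map(lambda lst: list(map("{:.2f}".format, lst)))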

Return pieces of strings from separate pandas dataframes based on multi-conditional logic

I'm new to Python, and trying to do some work with dataframes in pandas.
On the left side is a piece of the primary dataframe (df1), and on the right is a second one (df2). The goal is to fill in the df1['vd_type'] column with strings based on several pieces of conditional logic. I can make this work with nested np.where() functions, but as this gets deeper into the hierarchy, it gets too long to run at all, so I'm looking for a more elegant solution.
The English version of the logic is this:
For df1['vd_type']: if df1['shape'] equals the first two characters of df2['vd_combo'] AND df1['vd_pct'] <= df2['combo_value'], then return the last 3 characters of df2['vd_combo'] from the row where both of these conditions are true. If there is no row in df2 where both conditions are true, then return "vd4".
Thanks in advance!
EDIT #2: I want to implement a 3rd condition based on another variable, with everything else the same, except that df1 has another column 'log_vsc' with existing values, and the goal is to fill an empty df1 column 'vsc_type' with one of 4 strings in the same scheme. The extra condition is that the 'vd_type' we just defined must match the 'vd' column arising from the split 'vsc_combo'.
df3 = pd.DataFrame()
df3['vsc_combo'] = ['A1_vd1_vsc1','A1_vd1_vsc2','A1_vd1_vsc3','A1_vd2_vsc1','A1_vd2_vsc2' etc etc etc
df3['combo_value'] = [(number), (number), (number), (number), (number), etc etc
df3[['shape','vd','vsc']] = df3['vsc_combo'].str.split('_', expand = True)
def vsc_condition(row, df3):
    df_select = df3[(df3['shape'] == row['shape']) & (df3['vd'] == row['vd_type']) & (row['log_vsc'] <= df3['combo_value'])]
    if df_select.empty:
        return 'vsc4'
    else:
        return df_select['vsc'].iloc[0]
## apply vsc_type
df1['vsc_type'] = df1.apply(vsc_condition, args=(df3,), axis=1)
And this works!! Thanks again!
So your inputs are like:
import pandas as pd
df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
                    'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]})
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2', 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
                    'combo_value': [0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]})
If you are not against creating columns in df2 (you can delete them at the end if it's a problem), you can generate two columns, shape and vd, by splitting the column vd_combo:
df2[['shape','vd']] = df2['vd_combo'].str.split('_',expand=True)
Then you can create a function condition that you will use in apply such as:
def condition(row, df2):
    # row is a row of df1 when used with apply
    # select only the rows of df2 meeting the conditions on shape and value
    df_select = df2[(df2['shape'] == row['shape']) & (row['vd_pct'] <= df2['combo_value'])]
    # if empty (conditions not met), return 'vd4'
    if df_select.empty:
        return 'vd4'
    # otherwise return the 'vd' of the first matching row (the smallest combo_value)
    else:
        return df_select['vd'].iloc[0]
Now you can create your column vd_type in df1 with:
df1['vd_type'] = df1.apply(condition, args=(df2,), axis=1)
df1 is like:
shape vd_pct vd_type
0 A2 0.78 vd4
1 A1 0.33 vd1
2 B1 0.48 vd2
3 B1 0.38 vd1
4 A2 0.59 vd3
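For larger frames the row-wise apply can be slow. A vectorized alternative (a sketch, not part of the original answer) is pd.merge_asof, which picks for each row the first df2 row with combo_value >= vd_pct within the same shape; both sides must be sorted on the join keys:
merged = pd.merge_asof(
    df1.reset_index().sort_values('vd_pct'),  # keep the original row order in the 'index' column
    df2.sort_values('combo_value'),           # uses the shape/vd columns created above
    left_on='vd_pct', right_on='combo_value',
    by='shape', direction='forward')          # first combo_value >= vd_pct per shape
df1['vd_type'] = merged.set_index('index')['vd'].fillna('vd4')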

Is there any way to replace all occurrences in Pandas DataFrame? [duplicate]

I have looked up this issue and most questions are for more complex replacements. However in my case I have a very simple dataframe as a test dummy.
The aim is to replace a string anywhere in the dataframe with an nan, however this does not seem to work (i.e. does not replace; no errors whatsoever). I've tried replacing with another string and it does not work either. E.g.
d = {'color' : pd.Series(['white', 'blue', 'orange']),
'second_color': pd.Series(['white', 'black', 'blue']),
'value' : pd.Series([1., 2., 3.])}
df = pd.DataFrame(d)
df.replace('white', np.nan)
The output is still:
color second_color value
0 white white 1
1 blue black 2
2 orange blue 3
This problem is often addressed using inplace=True, but there are caveats to that. Please also see Understanding inplace=True in pandas.
Given that this is the top Google result when searching for "Pandas replace is not working" I'd like to also mention that:
replace does full-value replacement searches unless you turn on the regex switch. Use regex=True, and it will perform partial (substring) replacements as well.
This took me 30 minutes to find out, so hopefully I've saved the next person 30 minutes.
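For example, with the df from the question (a quick illustrative sketch):
df.replace('whit', np.nan)                # no effect: without regex, only full cell values match
df.replace('whit.*', np.nan, regex=True)  # works: 'white' matches the pattern and the cell becomes NaN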
You need to assign back
df = df.replace('white', np.nan)
or pass param inplace=True:
In [50]:
d = {'color' : pd.Series(['white', 'blue', 'orange']),
'second_color': pd.Series(['white', 'black', 'blue']),
'value' : pd.Series([1., 2., 3.])}
df = pd.DataFrame(d)
df.replace('white', np.nan, inplace=True)
df
Out[50]:
color second_color value
0 NaN NaN 1.0
1 blue black 2.0
2 orange blue 3.0
Most pandas ops return a copy and most have param inplace which is usually defaulted to False
Neither inplace=True nor regex=True worked in my case.
So I found a solution with using Series.str.replace instead. It can be useful if you need to replace a substring.
In [4]: df['color'] = df.color.str.replace('e', 'E!')
In [5]: df
Out[5]:
color second_color value
0 whitE! white 1.0
1 bluE! black 2.0
2 orangE! blue 3.0
or even with a slicing.
In [10]: df.loc[df.color=='blue', 'color'] = df.color.str.replace('e', 'E!')
In [11]: df
Out[11]:
color second_color value
0 white white 1.0
1 bluE! black 2.0
2 orange blue 3.0
You might need to check the data type of the column before using the replace function directly. It could be that you are calling replace on an object-dtype column; in that case, you need to apply replace through the string accessor.
Wrong:
df["column-name"] = df["column-name"].replace('abc', 'def')
Correct:
df["column-name"] = df["column-name"].str.replace('abc', 'def')
When you use df.replace() it creates a new temporary object, but doesn't modify yours. You can use one of the two following lines to modify df:
df = df.replace('white', np.nan)
df.replace('white', np.nan, inplace = True)
What worked for me was using this dict notation.
{old_value:new_value}
df.replace({10: 100}, inplace=True)
check the documentation for more info.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.replace.html
df.replace({'white': np.nan}, inplace=True, regex=True)
Python 3.10, pandas 1.4.2: inplace=True did not work for the example below (column dtype int32), but reassigning did.
df["col"].replace([0, 130], [12555555, 12555555], inplace=True)  # did NOT work
df["col"] = df["col"].replace([0, 130], [12555555, 12555555])    # worked
... and in another situation involving NaNs in text columns, the column needed a cast in a pre-step (not just .str, as above):
df["col"].replace(["man", "woman", np.nan], [1, 2, -1], inplace=True)  # did NOT work
df["col"] = df["col"].str.replace(["man", "woman", np.nan], [1, 2, -1])  # did NOT work
df["col"] = df["col"].astype(str)  # needed
df["col"] = df["col"].replace(["man", "woman", np.nan], [1, 2, -1])  # worked
One other reason .replace may appear not to work, which I ran into and fixed:
if a column contains a string like "word1 word2" that was read from Excel, the space between "word1" and "word2" can be a non-breaking space (nbsp) rather than a regular one. Replacing it with a normal space makes everything work fine. My column name is "Name".
nonBreakSpace = u'\xa0'
df['Name'] = df['Name'].replace(nonBreakSpace, ' ', regex=True)
df['Name'] = df['Name'].str.replace("replace with", "replace to", regex=True)

Reducing dimensionality of multiindex pandas dataframe using apply

I have the following dataframe:
df = pd.DataFrame({('psl', 't1'): {'fiat': 36.389809173765507,
'mazda': 18.139242981049016,
'opel': 0.97626485600703961,
'toyota': 74.464422292108878},
('psl', 't2'): {'fiat': 35.423004380643462,
'mazda': 24.269803148695079,
'opel': 1.0170540474994665,
'toyota': 60.389948228586832},
('psv', 't1'): {'fiat': 35.836800462163097,
'mazda': 15.893295606055901,
'opel': 0.78744853046848606,
'toyota': 74.054850828062271},
('psv', 't2'): {'fiat': 34.379812557124815,
'mazda': 23.202587247335682,
'opel': 0.80191294532382451,
'toyota': 58.735083244244322}})
The columns form a two-level MultiIndex: ('psl', 'psv') on the outer level and ('t1', 't2') on the inner.
I wish to reduce it from a MultiIndex to a normal index by applying a function that uses the t1 and t2 values and returns a single value, which will result in two columns: psl and psv.
I have succeeded in grouping it as such and applying a function:
df.groupby(level=0, axis=1).agg(np.mean)
which is very close to what I want except that I don't want to apply np.mean, but rather a custom function. In particular, a percent change function.
My end goal is to be able to do something like this:
df.groupby(level=0, axis=1).apply(lambda t1, t2: (t2-t1)/t1)
Which returns this error:
TypeError: <lambda>() missing 1 required positional argument: 't2'
I have also tried this:
df.apply(lambda x: x[x.name].apply(lambda x: x['t1']/x['t2']))
which in turn returns:
KeyError: (('psl', 't1'), 'occurred at index (psl, t1)')
Could you please include a thorough explanation of each part of your answer to the best of your abilities so I can better understand how pandas works.
Not easy. Use a custom function with squeeze to reduce the result to a Series and xs to select a level of the MultiIndex columns:
def f(x):
    t2 = x.xs('t2', axis=1, level=1)
    t1 = x.xs('t1', axis=1, level=1)
    a = (t2 - t1) / t1
    #print (a)
    return a.squeeze()
df1 = df.groupby(level=0, axis=1).agg(f)
print (df1)
psl psv
fiat -0.026568 -0.040656
mazda 0.337972 0.459898
opel 0.041781 0.018369
toyota -0.189009 -0.206871
Using a lambda is possible, but it is really awful because of the repeated code:
df1 = (df.groupby(level=0, axis=1)
         .agg(lambda x: ((x.xs('t2', axis=1, level=1) - x.xs('t1', axis=1, level=1)) /
                         x.xs('t1', axis=1, level=1)).squeeze()))
Using iloc can also solve the problem. Note that x.iloc[:, 0] is the t1 column here, so this computes (t1 - t2)/t1 and the signs come out flipped relative to the answer above:
df.groupby(level=0, axis=1).agg(lambda x: (x.iloc[:, 0] - x.iloc[:, 1]) / x.iloc[:, 0])
Outputs:
psl psv
fiat 0.026568 0.040656
mazda -0.337972 -0.459898
opel -0.041781 -0.018369
toyota 0.189009 0.206871
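As an aside, the groupby can be avoided entirely: df.xs selects one inner level as a plain-indexed frame, and the arithmetic aligns on the remaining psl/psv columns. A compact sketch of the same percent change (which also sidesteps groupby(..., axis=1), deprecated in recent pandas):
t1 = df.xs('t1', axis=1, level=1)  # columns: psl, psv
t2 = df.xs('t2', axis=1, level=1)
out = (t2 - t1) / t1               # aligns on the outer column level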
