Iteration over a list in a Pandas DataFrame column - python-3.x

I have a DataFrame df like this one:
       my_list
Index
0      [81310, 81800]
1      [82160]
2      [75001, 75002, 75003, 75004, 75005, 75006, 750...
3      [95190]
4      [38170, 38180]
5      [95240]
6      [71150]
7      [62520]
I have a list named code with at least one element.
code = ['75008', '75015']
I want to create another column in my DataFrame named my_min, containing, for each row, the minimum absolute difference between any element of the list code and any element of that row's list in df.my_list.
Here are the commands I tried:
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list'].str[:]])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list']])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in df.loc[:, 'my_list'].tolist()])
>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list'].str[:]])
>>> UnboundLocalError: local variable 'z' referenced before assignment
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list']])
>>> UnboundLocalError: local variable 'z' referenced before assignment
#or
df.loc[:, 'my_list'] = min([abs(int(x)-int(y)) for x in code for y in z for z in df.loc[:, 'my_list'].tolist()])
>>> UnboundLocalError: local variable 'z' referenced before assignment

You could do this with a list comprehension:
import pandas as pd
import numpy as np
df = pd.DataFrame({'my_list':[[81310, 81800],[82160]]})
code = ['75008', '75015']
pd.DataFrame({'my_min': [min([abs(int(i) - j) for i in code for j in x])
                         for x in df.my_list]})
returns
my_min
0 6295
1 7145
You could also use pd.Series.apply instead of the outer list, for example:
df.my_list.apply(lambda x: min([abs(int(i) - j) for i in code for j in x]) )
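The UnboundLocalError in the question comes from the order of the for clauses. In a comprehension they are read left to right, like nested for loops, so the clause that defines z has to appear before the clause that iterates over z. A minimal sketch, using a plain list of lists (rows, a hypothetical stand-in for df.my_list):

```python
code = ["75008", "75015"]
rows = [[81310, 81800], [82160]]  # hypothetical stand-in for df.my_list

# Clauses run left to right: z must be bound before "for y in z" uses it.
# This gives one global minimum over every row.
flat_min = min(abs(int(x) - y) for z in rows for x in code for y in z)

# One minimum per row: the outer comprehension walks the rows,
# the inner generator walks the (code, value) pairs of that row.
per_row = [min(abs(int(x) - y) for x in code for y in z) for z in rows]

print(flat_min)
print(per_row)
```

The per-row form is the one that matches the my_min column asked for.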

Write a helper: def find_min(lst): -- it is clear you know how to do that. The helper will consult a global named code.
Then apply it:
df['my_min'] = df.my_list.apply(find_min)
The advantage of breaking out a helper is you can write separate unit tests for it.
If you prefer to avoid globals, you will find partial quite helpful.
https://docs.python.org/3/library/functools.html#functools.partial
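As a rough sketch of the helper plus partial idea, assuming find_min takes the code list as a second parameter (that signature is an assumption, not something given above):

```python
from functools import partial

import pandas as pd


def find_min(lst, code):
    """Smallest absolute difference between any value in lst and any value in code."""
    return min(abs(int(c) - int(v)) for c in code for v in lst)


df = pd.DataFrame({"my_list": [[81310, 81800], [82160]]})
code = ["75008", "75015"]

# partial fixes the code argument, so apply only has to supply each row's list
df["my_min"] = df["my_list"].apply(partial(find_min, code=code))
print(df)
```

Because find_min no longer reads a global, it can be unit tested with plain lists, for example find_min([75001], ["75008"]).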

If you have pandas 0.25+ you can use explode and combine with np.min:
# sample data
df = pd.DataFrame({'my_list':
[[81310, 81800], [82160], [75001,75002]]})
code = ['75008', '75015']
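# concatenate the lists into one series (explode yields object dtype, so cast to int)
s = df.my_list.explode().astype(int)
# convert `code` into np.array
code = np.array(code, dtype=int)
# this is the output series; Series.min(level=0) was removed in pandas 2.0,
# so group on the index level instead
pd.Series(np.min(np.abs(s.values[:,None] - code),axis=1),
          index=s.index).groupby(level=0).min()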
Output:
0 6295
1 7145
2 6
dtype: int64

Related

Create columns with .apply() Pandas with strings

I have a Dataframe df.
One of the columns is named Adress and contains a string.
I have created a function processing(string) that takes a string as its argument and returns part of that string.
I managed to apply the function to df and create a new column in df with:
df.loc[:, 'new_col_name'] = df.loc[:, 'Adress'].apply(processing)
I then modified processing(string) so that it returns two strings. I would like the second returned string to be stored in another new column.
To do so, I tried to follow the steps given in: Create multiple pandas DataFrame columns from applying a function with multiple returns
Here is an example of my function processing(string):
def processing(string):
    # some processing
    return [A_string, B_string]
I also tried to return the two strings in a tuple.
Here are the different ways I tried to apply the function to my df:
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, '1st_new_col'], df.loc[:, '2nd_new_col'] = df.loc[:, 'Adress'].astype(str).apply(processing)
>>> ValueError: too many values to unpack (expected 2)
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.loc[:, 'Adress'].apply(processing, axis=1)
>>> TypeError: processing() got an unexpected keyword argument 'axis'
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']), axis=1)
>>> KeyError: "None of [Index(['1st_new_col', '2nd_new_col'], dtype='object')] are in the [columns]"
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress'].astype(str)), axis=1)
>>> AttributeError: 'str' object has no attribute 'astype'
#This is the only Error I could understand
#or
df.loc[:, ['1st_new_col', '2nd_new_col']] = df.apply(lambda x: processing(x['Adress']))
>>> KeyError: 'Adress'
I think I am close, but I have no idea how to get there.
Try:
df["Adress"].apply(process)
Also, it is better to return a pd.Series from the applied function, so that each returned element becomes its own column.
Here is an example:
import pandas as pd
# build example dataframe
df = pd.DataFrame(data={'Adress' : ['Word_1_1 Word_1_2','Word_2_1 Word_2_2','Word_3_1 Word_3_2','Word_4_1 Word_4_2']})
print(df)
# Adress
# 0 Word_1_1 Word_1_2
# 1 Word_2_1 Word_2_2
# 2 Word_3_1 Word_3_2
# 3 Word_4_1 Word_4_2
# Define your own function : here return two elements
def process(my_str):
    l = my_str.split(" ")
    return pd.Series(l)
# Apply the function and store the output in two new columns
df[["new_col_1", "new_col_2"]] = df["Adress"].apply(process)
print(df)
#               Adress new_col_1 new_col_2
# 0  Word_1_1 Word_1_2  Word_1_1  Word_1_2
# 1  Word_2_1 Word_2_2  Word_2_1  Word_2_2
# 2  Word_3_1 Word_3_2  Word_3_1  Word_3_2
# 3  Word_4_1 Word_4_2  Word_4_1  Word_4_2
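The "too many values to unpack" error in the question arises because unpacking a Series iterates over its rows, not over the two returned strings. If processing has to keep returning a list or tuple, one possible sketch is to expand the per-row lists into a DataFrame afterwards. The body of processing below is a hypothetical stand-in, since the real function is not shown:

```python
import pandas as pd


def processing(string):
    # Hypothetical stand-in for the asker's function: split at the first space
    first, _, rest = string.partition(" ")
    return [first, rest]


df = pd.DataFrame({"Adress": ["Word_1_1 Word_1_2", "Word_2_1 Word_2_2"]})

# A Series holding one two-element list per row
pairs = df["Adress"].apply(processing)

# Expand the lists into two named columns, aligned on the original index
new_cols = pd.DataFrame(pairs.tolist(), index=df.index,
                        columns=["1st_new_col", "2nd_new_col"])
df = df.join(new_cols)
print(df)
```

Building the DataFrame once from pairs.tolist() tends to be faster than returning a pd.Series from every call, though for small frames either approach should be fine.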
You can try this:
df['new_column'] = df.apply(lambda row: processing(row['Adress']), axis=1)
or this:
df['new_column'] = df['Adress'].apply(lambda value: processing(value))

Every time I run this code it says that numpy.ndarray has no attribute 'index'

When I run this code it reports that the numpy.ndarray object has no attribute 'index'. I'm trying to write a function that, if the given number is in the array, returns the position of that number in the array.
a = np.c_[np.array([1, 2, 3, 4, 5])]
x = int(input('Type a number'))
def findelement(x, a):
    if x in a:
        print (a.index(x))
    else:
        print (-1)
print(findelement(x, a))
Please use np.where instead of list.index.
import numpy as np
a = np.c_[np.array([1, 2, 3, 4, 5])]
x = int(input('Type a number: '))
def findelement(x, a):
    if x in a:
        print(np.where(a == x)[0][0])
    else:
        print(-1)
print(findelement(x, a))
Result:
Type a number: 3
2
None
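Note that np.where returns the indices of the elements of the input array where the given condition is satisfied. The trailing None in the result is printed because findelement prints its answer and returns nothing, so the outer print(...) shows its None return value.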
You should check out np.where and np.argwhere.
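A small variant, offered as a sketch only: returning the index instead of printing it avoids the stray None. The name find_element and the choice to flatten the array first are assumptions made for this illustration.

```python
import numpy as np


def find_element(x, a):
    """Return the position of the first occurrence of x in a, or -1 if absent."""
    # Flatten so that a column vector such as np.c_[...] is handled like a 1-D array
    matches = np.flatnonzero(np.asarray(a).ravel() == x)
    return int(matches[0]) if matches.size else -1


a = np.c_[np.array([1, 2, 3, 4, 5])]
print(find_element(3, a))  # 2
print(find_element(9, a))  # -1
```

np.flatnonzero(cond) is equivalent to np.where(cond)[0] on a 1-D condition, so this stays close to the np.where approach above.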

How to iterate over dfs and append data with combine names

I have this problem to solve. It is a continuation of a previous question, How to iterate over pandas df with a def function variable function, and the given answer worked perfectly, but now I have to append all the data into a two-column DataFrame (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound". Both are numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) that depend on the type of adduct, and Adduct_mass is a number (positive or negative) that depends on the adduct.
My data: two data frames. One holds the adduct names, M, Charge and Adduct_mass. The other holds the Compound_name and Exact_mass of the compounds I want to iterate over (I only show a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1, 1.007276],
        ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038],
         [4, "C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949],
         [5, "C20H28O3", 316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
#Defining general function
def Adduct(x,i):
    return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function for i = 0..4 (the five adducts).
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
Output
  Name  exact_mass        M+3H       M+3Na         M+H         2M+H        M-3H
0    a  596.465179  199.829002  221.810726  597.472455  1193.937634  197.814450
1    b  514.293038  172.438289  194.420013  515.300314  1029.593352  170.423737
2    c  712.397498  238.473109  260.454833  713.404774  1425.802272  236.458557
3    d  448.191949  150.404592  172.386316  449.199225   897.391174  148.390040
4    e  316.203834  106.408554  128.390278  317.211110   633.414944   104.39400
Those are the right calculations, but now I need a file where:
- there are only 2 columns (Name and Mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
Also, I need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice, and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound; in this example the RT is 1 for a, 3 for b, 2 for c, etc.
Is it possible to keep this column from the data set df (which I have updated below)? As you can see, df has more columns, such as "Formula" and "RT", which disappear after the calculations.
import pandas as pd
data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3],
         ["c", "C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4],
         ["e", "C20H28O3", 316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (sorry and thank you)
This is a trial I did on a small data set (df) using the code below, with the same df_al as above.
Code:
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID= df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df=df.drop(columns=["RT"])
#Defining general function
def Adduct(x,i):
    return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
df
output
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
output
Why NaN?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep=r"\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x: x['Name'] + "_" + x['variable'], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output:
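On the "Why NaN?" part of the question: the lookup RT[x[0]] uses only the first character of each name, so one likely cause (an assumption, since the real data set is not shown) is that the actual compound names are longer than one character and never match a key of RT. A sketch that sidesteps the lookup by carrying RT through melt as an id column, using made-up names and only two adduct columns:

```python
import pandas as pd

# Hypothetical compounds with multi-character names; masses taken from the table above
df = pd.DataFrame({
    "Name": ["cmpd_a", "cmpd_b"],
    "RT": [1.0, 3.0],
    "M+H": [597.472455, 515.300314],
    "M+3H": [199.829002, 172.438289],
})

# Listing RT in id_vars repeats it on every melted row, so no dictionary is needed
tidy = df.melt(id_vars=["Name", "RT"], var_name="Adduct", value_name="Mass")

# Combine compound name and ion form, then keep the three requested columns
tidy["Name"] = tidy["Name"] + "_" + tidy["Adduct"]
tidy = tidy[["Name", "Mass", "RT"]]
print(tidy)
```

If a dictionary lookup is preferred, mapping the full Name column (before it is combined with the adduct) with Series.map would avoid the first-character issue as well.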

Filling nulls with a list in Pandas using fillna

Given a pd.Series, I would like to replace null values with a list. That is, given:
import numpy as np
import pandas as pd
ser = pd.Series([0,1,np.nan])
I want a function that would return
0 0
1 1
2 [nan]
But if I try the natural function for this, namely fillna:
result = ser.fillna([np.nan])
I get the error
TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
Any suggestions for a simple way to achieve this?
Use apply, because fillna works with scalars only:
print (ser.apply(lambda x: [np.nan] if pd.isnull(x) else x))
0 0
1 1
2 [nan]
dtype: object
You can change the dtype to object
ser=ser.astype('object')
Then assign the list [np.nan]
ser.loc[ser.isnull()]=[[np.nan]]
I ended up using
ser.loc[ser.isnull()] = ser.loc[ser.isnull()].apply(lambda x: [np.nan])
because pd.isnull(x) would give me an ambiguous truth value error (I have other lists in my series too). This is a combination of YOBEN_S' and jezrael's answers.
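The ambiguous truth value error mentioned here arises because pd.isnull applied to a list returns an array rather than a single boolean. Another way around it, sketched under the assumption that the series mixes scalars and Python lists, is to test only non-list values (fill_with_list is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, [1, 2], np.nan], dtype=object)


def fill_with_list(value):
    # pd.isna on a list returns an array, so only scalars are tested for null
    if not isinstance(value, list) and pd.isna(value):
        return [np.nan]
    return value


result = ser.apply(fill_with_list)
print(result)
```

Existing lists and non-null scalars pass through unchanged; only the null scalar is replaced by [nan].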
fillna can take a Series, and a list can be converted to a Series. Wrapping your list in pd.Series() worked for me:
result = ser.fillna(pd.Series([np.nan]))
result
0 0.0
1 1.0
2 NaN
dtype: float64

Why a range_iterator when a range is reversed?

I can subscript a range object:
>>> r = range(4)
>>> r
range(0, 4)
>>> r[3]
3
>>> for i in r:
...     print(i)
...
0
1
2
3
>>> list(r)
[0, 1, 2, 3]
But, if I call reversed on the same range object:
>>> r = reversed(range(4))
>>> r
<range_iterator object at memaddr>
>>> for i in r:
...     print(i)
...
3
2
1
0
>>> r[3]
TypeError: 'range_iterator' object is not subscriptable # ?
>>> range(r)
TypeError: 'range_iterator' cannot be interpreted as an integer # ?
>>> list(r)
[] # ? uhmm
Hmm... Acting kinda like a generator but less useful.
Is there a reason a reversed range object isn't like a normal generator / iterator in how it quacks?
The reversed function returns an iterator, not a sequence. That's just how it's designed. The range_iterator you're seeing is essentially iter called on the reversed range you seem to want.
To get the reversed sequence rather than a reverse iterator, use the "alien smiley" slice: r[::-1] (where r is the value you got from range). This works both in Python 2 (where range returns a list) and in Python 3 (where range returns a sequence-like range object).
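A short Python 3 sketch of the difference. It also shows why list(r) came back empty in the question: the for loop had already consumed the iterator.

```python
r = range(4)

# Slicing a range gives another range: a reusable, subscriptable sequence
rev = r[::-1]
print(rev)        # range(3, -1, -1)
print(rev[0])     # 3
print(list(rev))  # [3, 2, 1, 0]
print(list(rev))  # [3, 2, 1, 0] again, nothing was consumed

# reversed() gives a one-shot iterator instead
it = reversed(r)
first_pass = list(it)   # [3, 2, 1, 0]
second_pass = list(it)  # [] because the iterator is already exhausted
print(first_pass, second_pass)
```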
You need to convert r back to a list. For example (output from Python 2):
reversed([1,2]) #prints <listreverseiterator object at 0x10a0039d0>
list(reversed([1,2])) #prints [2,1]
Edit
To clarify what you are asking, here is some sample I/O:
>>> r = range(5)
>>> x = reversed(r)
>>> print x
<listreverseiterator object at 0x10b6cea90>
>>> x[2]
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    x[2]
TypeError: 'listreverseiterator' object has no attribute '__getitem__'
>>> x = list(x)
>>> x[2] #it works here
2
