Replace specific column values with pd.NA - python-3.x

I am working on a data set that contains longitude and latitude values.
I converted those values to clusters using DBSCAN.
Then I plotted the clusters just as a sanity check. The plot shows a point at (0, 0), which is obviously an issue.
So I ran this code to capture which rows are the problem.
a = df3.loc[(df3['latitude'] < 0.01) & (df3['longitude'] < 0.01)].index
print(a) # 1812 rows with 0.0 longitude and -2e-08 latitude
I have 1812 rows with missing data all represented as 0.0 longitude and -2e-08 latitude in the source file.
I am debating some imputation strategies, but first I want to replace the 0.0 and -2e-08 values
with pd.NA or np.nan so that I can then use fillna() with whatever I ultimately decide to do.
I have tried both:
df3.replace((df3['longitude'] == 0.0), pd.NA, inplace=True)
df3.replace((df3['latitude'] == -2e-08), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
and
df3.replace((df3['longitude'] < 0.01), pd.NA, inplace=True)
df3.replace((df3['latitude'] < 0.01), pd.NA, inplace=True)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))
In both cases the existing values remain in place, i.e., the desired substitution with pd.NA
is not occurring.
What would be the correct procedure to replace the unwanted 1812 values in both the latitude and longitude columns with pd.NA or np.nan? I simply plan to impute something to replace the null values.

Try this one out:
df3['longitude'] = df3['longitude'].apply(lambda x: np.nan if x == 0.0 else x)
df3['latitude'] = df3['latitude'].apply(lambda x: np.nan if x == -2e-08 else x)
print(df3['longitude'].value_counts(dropna=False), '\n')
print(df3['latitude'].value_counts(dropna=False))

With an example:
import numpy as np
import pandas as pd
a = [1, 2, 0.0, -2e-08]
b = [1, 2, 0.0, -2e-08]
df = pd.DataFrame(zip(a, b))
df.columns = ['lat', 'long']
df.long = df.long.apply(lambda x: np.nan if x == 0.0 else x)
df.lat = df.lat.apply(lambda x: np.nan if x == -2e-08 else x)
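As an aside, df.replace() matches values (scalars, lists, or dicts), not boolean masks, which is likely why the attempts in the question left the data unchanged. A mask-based sketch of the same substitution:
import numpy as np

# Boolean masks go in .loc, not in replace():
df3.loc[df3['longitude'] == 0.0, 'longitude'] = np.nan
df3.loc[df3['latitude'] == -2e-08, 'latitude'] = np.nan

# Equivalent with Series.mask, which fills masked positions with NaN by default:
df3['longitude'] = df3['longitude'].mask(df3['longitude'] == 0.0)
df3['latitude'] = df3['latitude'].mask(df3['latitude'] == -2e-08)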

Related

How do I eliminate elements at the same position in two lists by filtering just one of them?

I have two lists, i.e.:
a = [0.0 , 30.1, 0.0, 10.1]
b = [1000, 9830, 100, 1023]
I want to remove from list "a" the elements equal to 0.0, and remove from list "b" the elements at the same positions as the 0.0 elements in list "a".
I know I can do this by saving the indices of the 0.0 elements in a list and then deleting them from list "b". Is there something more efficient? I want to apply the method to very large datasets.
Thanks
Using NumPy is one of the most efficient ways: boolean indexing avoids explicit loops and is much faster. We need to convert the lists to arrays (and can reconvert to lists with .tolist() if needed):
import numpy as np

a = np.array(a)
b = np.array(b)
a_ = a[a != 0]
# [30.1 10.1]
b_ = b[a != 0]
# [9830 1023]
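And if plain lists are needed afterwards:
a_list = a_.tolist()  # [30.1, 10.1]
b_list = b_.tolist()  # [9830, 1023]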
This should do the trick:
a = [0.0, 30.1, 0.0, 10.1]
b = [1000, 9830, 100, 1023]
#####################################################
assert len(a) == len(b)
# Mark positions where a is 0.0 with "", then filter the markers out.
b = [b[index] if value != 0 else "" for index, value in enumerate(a)]
b = list(filter(lambda x: x != "", b))
#####################################################
print(b)  # prints [9830, 1023]
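For comparison, a single pass with zip avoids the placeholder-and-filter step (a sketch; it also keeps the filtered version of "a" if that is needed):
a = [0.0, 30.1, 0.0, 10.1]
b = [1000, 9830, 100, 1023]

# Keep only the pairs where the element of `a` is non-zero.
pairs = [(x, y) for x, y in zip(a, b) if x != 0.0]
a_filtered = [x for x, _ in pairs]  # [30.1, 10.1]
b_filtered = [y for _, y in pairs]  # [9830, 1023]
print(b_filtered)  # [9830, 1023]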

Set decimal values to 2 decimal places in nested lists in pandas

I am trying to limit the values in the result of a nested list to at most 2 decimal places. I have already tried setting the display precision and other things but cannot find a way.
r_ij_matrix = variables[1]
print(type(r_ij_matrix))
print(type(r_ij_matrix[0]))
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.precision", 2)
data = pd.DataFrame(r_ij_matrix, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Combined Decision Matrix')
You can solve your problem with the apply() method of the dataframe. You can do something like this:
df.apply(lambda x: [[round(elt, 2) for elt in list_] for list_ in x])
Solved it by copying the list into another one with the desired decimal precision. Thanks everyone.
rij_matrix = variables[1]
rij_nparray = np.empty([8, 6, 3])
for i in range(8):
    for j in range(6):
        for k in range(3):
            rij_nparray[i][j][k] = round(rij_matrix[i][j][k], 2)
rij_list = rij_nparray.tolist()
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(rij_list, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Normalized Fuzzy Decision Matrix (r_ij)')
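For reference, since the nested list here is a regular 8×6×3 structure, the triple loop can be collapsed into one vectorized call (a sketch):
import numpy as np

# Round every element of the regular 8x6x3 nested list in one call.
rij_list = np.round(np.array(rij_matrix), 2).tolist()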
applymap seems to be good here.
But there is a BUT: be aware that it is probably not the best idea to store lists as the values of a df; you give up much of the functionality of pandas. Also, after formatting them like this, they are stored as strings. This (if really wanted) should only be for presentation:
df.applymap(lambda lst: list(map("{:.2f}".format, lst)))
Output:
A B
0 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
1 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
2 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
Used Input:
df = pd.DataFrame({
    'A': [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
          [2.04939015319192, 2.280350850198276, 2.4899799195977463],
          [2.04939015319192, 2.280350850198276, 2.4899799195977463]],
    'B': [[3.1144823004794873, 3.271085446759225, 3.420526275297414],
          [3.1144823004794873, 3.271085446759225, 3.420526275297414],
          [3.1144823004794873, 3.271085446759225, 3.420526275297414]]})
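Alternatively, since this should be presentation-only anyway, a Styler-based sketch (assuming two decimals are wanted only in the rendered table) leaves the stored lists untouched:
# Display-only formatting: the underlying lists keep full precision,
# only the rendered text changes.
styled = df.style.format(lambda lst: "[" + ", ".join(f"{v:.2f}" for v in lst) + "]")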

Starting with sklearn's NearestNeighbors output, how do I remove results where the record is its own nearest neighbor?

I want to use sklearn's NearestNeighbors model to do some data analysis.
In my use case, I want to grab the N nearest neighbors and put them back into a pandas dataframe to evaluate the similarity of different records.
However, the results include the original record. In my case, that isn't useful. I want the nearest different records.
Example:
xtest = np.array([[1,1,1], [1,1,1], [1,.8,1], [.8,1,1]])
nn = NearestNeighbors(n_neighbors=2)
nn.fit(xtest)
distances, indices = nn.kneighbors(xtest)
returns:
(array([[0. , 0. ],
        [0. , 0. ],
        [0. , 0.2],
        [0. , 0.2]]),
 array([[0, 1],
        [0, 1],
        [2, 1],
        [3, 1]], dtype=int64))
In the above arrays the cells at indices (0,0), (1,1), (2, 0) and (3,0) are unimportant.
My goal is to manipulate this output so that I can create the following columns in pandas:
"NearestNeighbor1" - the index of the nearest record other than itself
"NearestNeighbor1_dist" -the distance of the nearest record other than itself even if the distance is zero.
"NearestNeighbor2" - the index of the next nearest record other than itself
"NearestNeighbor2_dist" -the distance of the nearest record other than itself even if the distance is zero.
In the event of a tie, I don't care which record comes first (as long as it isn't itself).
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
xtest = np.array([[1,1,1], [1,1,1], [1,.8,1], [.8,1,1]])
nn = NearestNeighbors(n_neighbors=3)
nn.fit(xtest)
distances, indices = nn.kneighbors(xtest)
df_ind = pd.DataFrame(data=indices)
# Drop each row's own index from its list of neighbors.
df_ind = df_ind.apply(func=lambda x: [y for y in x if y != x.name], axis=1, result_type='expand')
df = pd.DataFrame({'NearestNeighbor1': df_ind.iloc[:, 0],
                   'NearestNeighbor1_dist': distances[:, 1],
                   'NearestNeighbor2': df_ind.iloc[:, 1],
                   'NearestNeighbor2_dist': distances[:, 2]})
print(df)
Output:
NearestNeighbor1 NearestNeighbor1_dist NearestNeighbor2 NearestNeighbor2_dist
0 1 0.0 2 0.2
1 0 0.0 2 0.2
2 1 0.2 0 0.2
3 1 0.2 0 0.2
This solution works for an arbitrary number of neighbors, although I wonder if there is a more elegant solution using numpy.
N_NBRS = 4
nbrs = NearestNeighbors(n_neighbors=N_NBRS + 1, algorithm='brute')
nbrs.fit(X)
dist_n, ix_n = nbrs.kneighbors(X)
replacement = []
for row_idx, row in enumerate(ix_n):
    new_row = [val for val in row if val != row_idx]
    new_row = new_row[:N_NBRS]  # truncate in the event of many exact matches
    replacement.append(new_row)
ix_n2 = np.array(replacement)
dist_n2 = dist_n[:, 1:]
results = X.copy()
for col_idx in range(N_NBRS):
    results[f'Neighbor{col_idx + 1}'] = ix_n2[:, col_idx]
    results[f'Neighbor{col_idx + 1}_dist'] = dist_n2[:, col_idx]
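On the numpy question: a boolean-mask sketch can replace the Python loop, under the assumption that every point appears in its own neighbor list (true unless a point has more than N_NBRS exact duplicates). It also keeps the distances aligned with the kept indices, unlike slicing off the first column:
import numpy as np

n = len(ix_n)
# True wherever the neighbor index differs from the row's own index.
self_mask = ix_n != np.arange(n)[:, None]
# Drop the single self entry per row; the reshape assumes exactly one
# self entry exists in every row (the hedged assumption above).
ix_n2 = ix_n[self_mask].reshape(n, -1)[:, :N_NBRS]
dist_n2 = dist_n[self_mask].reshape(n, -1)[:, :N_NBRS]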

How to iterate over dfs and append data with combined names

I have this problem to solve; it is a continuation of a previous question, How to iterate over pandas df with a def function variable function, and the given answer worked perfectly, but now I have to append all the data into a 2-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound"; both represent numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct = Exact_mass * M / Charge + Adduct_mass
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) according to each type of adduct, and Adduct_mass is a number (positive or negative) according to each adduct.
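For example, for M+3H (M = 1, Charge = 3, Adduct_mass = 1.007276) and Exact_mass = 596.465179, this gives 596.465179 × 1 / 3 + 1.007276 = 199.829002, which matches the first value in the output table below.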
My data: 2 data frames. One with the adduct names, M, Charge, and Adduct_mass. The other one corresponds to the Compound_name and Exact_mass of the compounds I want to iterate over (I just show a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4, "C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3", 316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]

#Defining general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

#Applying general function for the 5 adducts (0 to 4).
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Now those are the right calculations, but I now need a file where:
- only 2 columns exist (Name and Mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.829002
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
Also I need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.829002 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound; in this example RT for a = 1, b = 3, c = 2, etc.
Is it possible to incorporate (keep) this column from the data set df (which I update here below)? As you can see, df has more columns, like "Formula" and "RT", which disappear after the calculations.
import pandas as pd
data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3], ["c", "C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4], ["e", "C20H28O3", 316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (sorry and thank you)
This is a trial I did on a small data set (the df from the update above) using the code below, with the same df_al as above.
Code:
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID = df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df = df.drop(columns=["RT"])
#Defining general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
df
#Melting
df = pd.melt(df, id_vars=['Name'], var_name="Adduct", value_name="Exact_mass",
             value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x: x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Why do I get NaN in the RT column?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep=r"\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x: x[0] + "_" + x[1], axis=1)
del df['Name']
del df['variable']
RT = {'a': 1, 'b': 2, 'c': 3, 'd': 5, 'e': 1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
The output then has one row per Name/adduct combination, with columns value (the mass), name (e.g. a_M+3H), and RT.
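On the update about keeping RT: one option (a sketch, assuming the wide frame still has its RT column when you melt) is to carry RT through as an id_var instead of re-deriving it from a dict afterwards:
# Keep 'RT' alongside 'Name' through the reshape.
value_vars = [c for c in df.columns if c not in ('Name', 'exact_mass', 'RT')]
long_df = pd.melt(df, id_vars=['Name', 'RT'], value_vars=value_vars,
                  var_name='Adduct', value_name='Mass')
long_df['Name'] = long_df['Name'] + '_' + long_df['Adduct']
long_df = long_df[['Name', 'Mass', 'RT']]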

Return pieces of strings from separate pandas dataframes based on multi-conditional logic

I'm new to Python, and trying to do some work with dataframes in pandas.
I have a primary dataframe (df1) and a second one (df2). The goal is to fill in the df1['vd_type'] column with strings based on several pieces of conditional logic. I can make this work with nested np.where() functions, but as this gets deeper into the hierarchy it gets too long to run at all, so I'm looking for a more elegant solution.
The english version of the logic is this:
For df1['vd_type']: If df1['shape'] == the first two characters in df2['vd_combo'] AND df1['vd_pct'] <= df2['combo_value'], then return the last 3 characters in df2['vd_combo'] on the line where both of these conditions are true. If it can't find a line in df2 where both conditions are true, then return "vd4".
Thanks in advance!
EDIT #2: So I want to implement a 3rd condition based on another variable, with everything else the same, except in df1 there is another column 'log_vsc' with existing values, and the goal is to fill in an empty df1 column 'vsc_type' with one of 4 strings in the same scheme. The extra condition would be just that the 'vd_type' that we just defined would match the 'vd' column arising from the split 'vsc_combo'.
df3 = pd.DataFrame()
df3['vsc_combo'] = ['A1_vd1_vsc1', 'A1_vd1_vsc2', 'A1_vd1_vsc3', 'A1_vd2_vsc1', 'A1_vd2_vsc2']  # etc etc etc
df3['combo_value'] = [(number), (number), (number), (number), (number)]  # etc etc
df3[['shape', 'vd', 'vsc']] = df3['vsc_combo'].str.split('_', expand=True)

def vsc_condition(row, df3):
    df_select = df3[(df3['shape'] == row['shape']) & (df3['vd'] == row['vd_type']) & (row['log_vsc'] <= df3['combo_value'])]
    if df_select.empty:
        return 'vsc4'
    else:
        return df_select['vsc'].iloc[0]

## apply vsc_type
df1['vsc_type'] = df1.apply(vsc_condition, args=([df3]), axis=1)
And this works!! Thanks again!
so your inputs are like:
import pandas as pd
df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
                    'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]})
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2', 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
                    'combo_value': [0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]})
If you are not against creating columns in df2 (you can delete them at the end if it's a problem), you can generate the two columns shape and vd by splitting the column vd_combo:
df2[['shape', 'vd']] = df2['vd_combo'].str.split('_', expand=True)
Then you can create a function, condition, to use in apply, such as:
def condition(row, df2):
    # row will be a row of df1 in apply
    # select only the rows of df2 meeting your conditions on shape and value
    df_select = df2[(df2['shape'] == row['shape']) & (row['vd_pct'] <= df2['combo_value'])]
    # if empty (condition not met), return vd4
    if df_select.empty:
        return 'vd4'
    # if the condition is met, return the 'vd' with the smallest combo_value
    else:
        return df_select['vd'].iloc[0]
Now you can create your column vd_type in df1 with:
df1['vd_type'] = df1.apply(condition, args=([df2]), axis=1)
df1 is like:
shape vd_pct vd_type
0 A2 0.78 vd4
1 A1 0.33 vd1
2 B1 0.48 vd2
3 B1 0.38 vd1
4 A2 0.59 vd3
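If apply turns out to be slow on large frames, a merge-based sketch of the same lookup (it returns the matching 'vd' with the smallest combo_value, which matches the example's ordering):
# Join each df1 row to all df2 rows with the same shape, keep rows meeting
# the threshold, take the smallest-combo_value match per df1 row, default to 'vd4'.
merged = df1.reset_index().merge(df2[['shape', 'vd', 'combo_value']], on='shape')
matches = merged[merged['vd_pct'] <= merged['combo_value']]
first = matches.sort_values('combo_value').groupby('index')['vd'].first()
df1['vd_type'] = df1.index.to_series().map(first).fillna('vd4')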
