New column based on a row with conditions in Pandas - python-3.x

I'm trying to do an operation with Dataframes but i'm not sure how I can solve the problem using the built-in Pandas Operations (Actualy my code is based on a for so I'm trying to build a more elegant solution).
Given the following Dataframes, defined with the columns described below
original_df = [o1, o2, o3, o4]
weights_df = [w1, w2, w3, w4]
conditions_df = [c1, c2, c3, c4]
I need to built a new column on original_df based on the division of o1/w1 but depending on the value of c1, with takes the values ["+" or "-" I need to do the -o1/w1 operation.
As long as I did was:
orignal_df['newcolumn'] = original_df / weights_df
Where of course I divided the two terms but without applying the condition, I'm trying to do with map and apply functions but I'm not sure how I can add the third column into the function.

original_df = [100, 200, 300, 400]
weights_df = [10, 20, 30, 40]
conditions_df = [1, 2, 3, 4]
df = pd.DataFrame({'x':original_df, 'y':weights_df, 'z':conditions_df})
def div(x, y, z):
if z > 2:
return float(x/y)
else:
return float(-1*x/y)
df['new_feature'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
This is one way of solving. If your conditions_df contains '+'/'-' then you can change the condition in def div(x, y, z) accordingly.

You can use numpy.where for mask by condition:
#data from lisa answer
#df = pd.DataFrame({'x':original_df, 'y':weights_df, 'z':conditions_df})
df['new_feature'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
print (df)
x y z new_feature
0 100 10 1 -10.0
1 200 20 2 -10.0
2 300 30 3 10.0
3 400 40 4 10.0
Timings:
#4k rows
df = pd.concat([df]*1000).reset_index(drop=True)
#lisa answer
In [95]: %timeit df['new_feature1'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
10 loops, best of 3: 123 ms per loop
In [96]: %timeit df['new_feature2'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
1000 loops, best of 3: 595 µs per loop

Related

The output of my code comes out too slowly.. How can i speed up my process

Thanks to the help from some users of this sites.
My code seems to work fine, but it's taking too long..
I'm trying to compare two data frames.(df1 has 1,291,250 rows / df2 has 1,286,692 rows)
if df1.iloc[0,0] == df2.iloc[0,0] and df1.iloc[0,1] == df2.iloc[0,1], then compare df1.iloc[0,2], df2.iloc[0,2].
If the first(df1.iloc[0,2]) is larger, I want to put the first index into the list, and if the second(df2.iloc[0,2]) is larger, I want to put the second index into the list.
Example DataFrame
In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[3]:
A1 B1 C1
0 0 1 98
1 1 1 198
2 2 2 228
In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[6]:
A2 B2 C2
0 0 1 228
1 1 2 110
2 2 2 130
In [7]: def find_high(df1, df2) # def function code is below
Out[8]: ([2], [0]) # The result what i want
This is just simple example. my data is bigger than this
my code is:
for i in range(60):
setattr(mod, f'df_1_{i}', np.array_split(df1, 60)[i])
getattr(mod, f'df_1_{i}').to_pickle(f'df_1_{i}')
import glob
files = glob.glob('df_1_*')
def find_high_pre(df1 = files, df2):
subtract_df2 = []
subtract_df1 = []
same_data = []
for df1_index, line in enumerate(df1.to_numpy()):
for df2_idx, row in enumerate(df2.to_numpy()):
if (line[0:2] == row[0:2]).all():
if line[2] < row[2]:
subtract_df2.append(df2_idx)
break
elif line[2] > row[2]:
subtract_df1.append(df1_idx)
break
else:
continue
break
return df1.iloc[subtract_df1].index.tolist(), df2.iloc[subtract_df2].index.tolist(), df1.iloc[same_data].index.to_list();
data_1 = []
for i in files:
e_data = pd.read_pickle(i)
num_cores = 30
df_split = np.array_split(e_data, num_cores)
data_1 += parmap.map(find_high_pre, df_split, pm_pbar=True, pm_processes =num_cores)
My code seems to work fine, but it's taking too long..
Chances are that replacing your nested for loops with a DataFrame.merge operation will take less time:
keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
df2.reset_index().set_index(keys), on=keys)
# now we have a merged dataframe like this:
# index_x C1 index_y C2
# A B
# 0 1 0 98 0 228
# 2 2 2 228 2 130
# therefrom we can easily extract the wanted indexes
data = [df.loc[df['C1'] > df['C2'], 'index_x'].values,
df.loc[df['C1'] < df['C2'], 'index_y'].values]

Randomly select elements from string in a dataframe

I have dataframe with 7 string columns:
bul; age; gender; hh; pn; freq_pn; rcrds_to_select
1; 2; 5; 1; ['35784905', '40666303', '47603805', '68229102'];4;3
2; 3; 3; 3; ['06299501', '07694901', '35070201'];3;2
In the last column I have the number of id's from "pn" column that I need to select randomly. Example: in the first row I have 4 id's ['35784905', '40666303', '47603805', '68229102'] and I need to select 3 random id's and remove the not selected one. There can be rows with only one id. I came to the conclusion that I need to turn the values in tuples and store them in another column ('pnTuple'). I don't know if this is the right way.
mass_grouped3['pnTuple'] = [tuple(x) for x in mass_grouped3['pn'].values]
I think random.shuffle will do the job, but have no idea how to implement it in my script. I was thinking something like this, but is not working:
for row in mass_grouped3['pnTuple']:
list = list(mass_grouped3['pnTuple'])
whitelist = random.shuffle(list)
Any ideas how to do this selection are appreciated.
You want to randomly select 1 from every row and make the rest 0. Here's one approach. Sample the indices and based on indices assign 1. i.e
idx = pd.DataFrame(np.stack(np.where(df==1))).T.groupby(0).apply(lambda x: x.sample(1)).values
# array([[0, 2],
# [1, 1],
# [2, 0],
# [3, 3]])
ndf = pd.DataFrame(np.zeros(df.shape),columns=df.columns)
ndf.values[idx[:,0],idx[:,1]] = 1
W1 W2 W3 W4
0 0 0 1 0
1 1 0 0 0
2 1 0 0 0
3 0 1 0 0
Welcome to StackOverflow! Hope this helps
Lets go step by step
First lets construct our random function that can select 3
>>> import random
>>> random.choices(['35784905', '40666303', '47603805', '68229102'], k=3)
['68229102', '40666303', '35784905']
I have a sample data frame, df with columns with same data as yours
>>> df
a b
0 12 [35784905, 40666303, 47603805, 68229102]
1 12 [06299501, 07694901, 35070201]
>>> df['b']
0 [35784905, 40666303, 47603805, 68229102]
1 [06299501, 07694901, 35070201]
Name: b, dtype: object
>>> df['b'].map(lambda alist: random.choices(alist, k=3) if len(alist) > 3 else alist)
0 [35784905, 68229102, 35784905]
1 [06299501, 07694901, 35070201]
Name: b, dtype: object
>>> df['b'] = df['b'].map(lambda alist: random.choices(alist, k=3) if len(alist) > 3 else alist)
Using pandas map operation to apply this data transformation to whole columns
Note: We are using a lambda function lambda alist: random.choices(alist, k=3) if len(alist) > 3 else alist to ensure that each list has more than 3 items, and only then apply this operation.
It might be a little new, but this a standard way of writing code in python. Learn more about Python, lambda function and pandas for some time.

Using non-zero values from columns in function - pandas

I am having the below dataframe and would like to calculate the difference between columns 'animal1' and 'animal2' over their sum within a function while only taking into consideration the values that are bigger than 0 in each of the columns 'animal1' and 'animal2.
How could I do this?
import pandas as pd
animal1 = pd.Series({'Cat': 4, 'Dog': 0,'Mouse': 2, 'Cow': 0,'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3,'Mouse': 0, 'Cow': 1,'Chicken': 2})
data = pd.DataFrame({'animal1':animal1, 'animal2':animal2})
def animals():
data['anim_diff']=(data['animal1']-data['animal2'])/(data['animal1']+ ['animal2'])
return data['anim_diff'].abs().idxmax()
print(data)
I believe you need check all rows are greater by 0 with DataFrame.gt with test DataFrame.all and filter by boolean indexing:
def animals(data):
data['anim_diff']=(data['animal1']-data['animal2'])/(data['animal1']+ data['animal2'])
return data['anim_diff'].abs().idxmax()
df = data[data.gt(0).all(axis=1)].copy()
#alternative for not equal 0
#df = data[data.ne(0).all(axis=1)].copy()
print (df)
animal1 animal2
Cat 4 2
Chicken 3 2
print(animals(df))
Cat

How to iterate over dfs and append data with combine names

i have this problem to solve, this is a continuation of a previus question How to iterate over pandas df with a def function variable function and the given answer worked perfectly, but now i have to append all the data in a 2 columns dataframe (Adduct_name and mass).
This is from the previous question:
My goal: i have to calculate the "adducts" for a given "Compound", both represents numbes, but for eah "Compound" there are 46 different "Adducts".
Each adduct is calculated as follow:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where exact_mass = number, M and Charge = number (1, 2, 3, etc) according to each type of adduct, Adduct_mass = number (positive or negative) according to each adduct.
My data: 2 data frames. One with the Adducts names, M, Charge, Adduct_mass. The other one correspond to the Compound_name and Exact_mass of the Compounds i want to iterate over (i just put a small data set)
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4,
"C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3",
316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
#Defining general function
def Adduct(x,i):
return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 5.
for i in range(5):
df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Now that is the rigth calculations but i need now a file where:
-only exists 2 columns (Name and mass)
-All the different adducts are appended one after another
desired out put
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
Also i need to combine the name of the respective compound with the ion form (M+3H, M+H, etc).
At this point i have no code for that.
I would apprecitate any advice and a better approach since the begining.
This part is an update of the question above:
Is posible to obtain and ouput like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound, in this example is RT for a =1, b = 3, c =2, etc.
Is posible to incorporate (Keep this column) from the data set df (which i update here below)?. As you can see that df has more columns like "Formula" and "RT" which desapear after calculations.
import pandas as pd
data1 = [[a, "C3H64O7", 596.465179, 1], [b, "C30H42O7", 514.293038, 3], [c,
"C44H56O8", 712.397498, 2], [d, "C24H32O6S", 448.191949, 4], [e, "C20H28O3",
316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (sorry and thank you)
this is a trial i did on a small data set (df) using the code below, with the same df_al of above.
df=
Code
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID= df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df=df.drop(columns=["RT"])
#Defining general function
def Adduct(x,i):
return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 46.
for i in range(47):
df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
df
output
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
output
Why NaN?
Here is how I will go about it, pandas.melt comes to rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep="\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output:

grouping coordinates within a distance to each other

I have written this code which works but takes a very long time (~8hrs) to finish execution.
Wondering if it can be optimized to execute quicker.
The aim is to group a lots of items (x,y,z) coordinates based on their distance to one another. For example;
I would like to group them for a distance of +-0.5 in x, +-0.5 in y and +-0.5 in z, then the output from the data below would be [(0,3),(1),(2,4)...].
x y z
0 1000.1 20.2 93.1
1 647.7 91.7 87.7
2 941.2 44.3 50.6
3 1000.3 20.3 92.9
4 941.6 44.1 50.6
...
What I have done (and which works) is described below.
It compares the first row of the data_frame with the 2nd, 3rd, 4th .. until the end, and for each row, if the distance from x to x < +-0.5 and y to y < +-0.5 and z to z < +- 0.5 then the index is added to a list, group. If it doesn't then it compares the next row until reaching the end of the loop.
After each loop is complete the indexes which matched (stored in group), are added to another list, groups, as a set and then removed from the original list, a, and then next a[0] is compared and so on.
groups = []
group = []
data = [(x,y,z),(x,y,z),(etc)] # > 50,000 entries
data_frame = pd.DataFrame(data, columns=['x','y','z'])
a = list(i for i in range(len(data_frame)))
threshold = 0.5
for j in range(len(a) - 1) :
if len(a) > 0:
group.append(a[0])
for ii in range(a[0], len(data_frame) - 1):
if ((data_frame.loc[a[0],'x'] - data_frame.loc[ii,'x']) < threshold) and ((data_frame.loc[a[0],'y'] - data_frame.loc[ii,'y']) < threshold) and ((data_frame.loc[a[0],'z'] - data_frame.loc[ii,'z']) < threshold):
group.append(ii)
else:
continue
groups.append(set(group))
for iii in group:
if iii in a:
a.remove(iii)
else:
continue
group = []
else:
break
which returns something like this, for example;
groups = [{0}, {1, 69}, {2, 70}, {3, 67}, {4}, {5}, {6}, {7, 9}, {8}, {10}, {11}, {12}, 13}, {14, 73}, {15}, {16}, {17, 21, 74}, {18, 20}, {19}, {22, 23}]
Have made many edits to this question as it was not very clear. Hopefully makes sense now.
Below is an attempt using better logic 'O(NlogN)' which is much faster but doesn't return the correct answer. Have used the same +-0.5 for x,y,z.
Edit:
test_list = [(i,x,y,z), ... , (i,x,y,z)]
df3 = sorted(test_list,key=lambda x: x[1])
result = []
while df3:
if len(df3) > 1: ####added this because was crashing at the end of the loop
a = df3.pop(0)
alist=[a[0]]
while ((abs(a[1] - df3[0][1]) < 0.5) and (abs(a[2] - df3[0][2]) < 0.5) and (abs(a[3] - df3[0][3]) < 0.5)):
alist.append(df3.pop(0)[0])
if df3:
continue
else:
break
result.append(alist)
else:
result.append(a[0])
break
Since you are comparing each data point with every other one, your implementation has a worst time complexity of O(N!). A better way is to do a sorting first.
import random
df = [i for i in range(100)]
random.shuffle(df)
df2 = [(i,x) for i,x in enumerate(df)]
df3 = sorted(df2,key=lambda x: x[1])
df3
[(31, 0), (24, 1), (83, 2)......
Assuming now you want to group number that are +5/-5 into one list. You can then slice number into list based on a condition.
result = []
while df3:
a = df3.pop(0)
alist=[a[0]]
while a[1] + 5 >= df3[0][1]:
alist.append(df3.pop(0)[0])
if df3:
continue
else:
break
result.append(alist)
result
[[31, 24, 83, 58, 82, 35], [0, 65, 77, 41, 67, 56].......
Sorting takes O(NlogN) and a grouping basically takes linear time. So this would be much faster than N!

Resources