compare index and column in data frame with dictionary - python-3.x

I have a dictionary:
d = {'A-A': 1, 'A-B':2, 'A-C':3, 'B-A':5, 'B-B':1, 'B-C':5, 'C-A':3,
'C-B':4, 'C-C': 9}
and a list:
L = ['A', 'B', 'C']
I have a DataFrame:
df =pd.DataFrame(columns = L, index=L)
I would like to fill each row of df with the dictionary values, matched by key. For example:
   A  B  C
A  1  2  3
B  5  1  5
C  3  4  9
I tried doing that by:
df.loc[L[0]]=[1,2,3]
df.loc[L[1]]=[5,1,5]
df.loc[L[2]] =[3,4,9]
Is there another way to do this, especially when the data is large?
Thank you for your help.

Here is another way that I can think of:
import numpy as np
import pandas as pd
# given
d = {'A-A': 1, 'A-B':2, 'A-C':3, 'B-A':5, 'B-B':1, 'B-C':5, 'C-A':3,
'C-B':4, 'C-C': 9}
L = ['A', 'B', 'C']
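# copy the dictionary values into a numpy array
# (this relies on the dict's insertion order matching the row-by-row order of L)
z = np.asarray(list(d.values()))
# reshape the array to match the shape of your DataFrame
z_new = np.reshape(z, (len(L), len(L)))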
# copy it into your DataFrame
df = pd.DataFrame(z_new, columns = L, index=L)

This should do the trick, though it's probably not the best way:
for index in L:
    prefix = index + "-"
    df.loc[index] = [d.get(prefix + column, 0) for column in L]
Calculating the prefix separately beforehand is probably slower for a small list and probably faster for a large list.
Explanation
for index in L:
This iterates through all of the row names.
prefix = index + "-"
All of the keys for each row start with index + "-", e.g. "A-", "B-", etc.
df.loc[index] =
Set the contents of the entire row.
[ for column in L]
The same as your comma thing ([1, 2, 3]) just for an arbitrary number of items. This is called a "list comprehension".
d.get( , 0)
This is the same as d[ ] but returns 0 if it can't find anything.
prefix + column
Sticks the column on the end, e.g. "A-" gives "A-A", "A-B"…
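If the row-by-row loop becomes slow on large inputs, the whole table can also be built in one step. This is a minimal sketch, assuming every key has the form "row-column" and that missing pairs should default to 0 (as in the d.get call above). Unlike reshaping d.values(), it does not depend on the order in which the keys were inserted.

```python
import pandas as pd

d = {'A-A': 1, 'A-B': 2, 'A-C': 3, 'B-A': 5, 'B-B': 1, 'B-C': 5,
     'C-A': 3, 'C-B': 4, 'C-C': 9}
L = ['A', 'B', 'C']

# Build every row with a nested list comprehension, then construct the
# DataFrame once instead of assigning to it row by row.
rows = [[d.get(f"{row}-{col}", 0) for col in L] for row in L]
df = pd.DataFrame(rows, index=L, columns=L)
print(df)
```

Constructing the DataFrame once avoids repeated df.loc assignments, which tends to matter more as L grows.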

Related

column comprehension robust to missing values

I have only been able to create a two column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
columns=['id', 'value'])
What I would like to be able to do is using this basic format also initiate the dataframe with three columns: 'id', 'id2' and 'value'. I have a separate defined dict that contains the necessary look up info, called id_lookup.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
columns=['id', 'id2','value'])
I think I'm doing it right, but I get KeyErrors. I will only know in hindsight whether id_lookup covers every key I encounter. For my purposes, simply putting it all together and placing 'N/A' or something similar where a lookup fails would be acceptable.
Would the above be an appropriate way to compute a new column from a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'id': [1, 2, 3, 4],
'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
id value id2
0 1 10 A
1 2 20 B
2 3 30 C
3 4 40 N/A
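A shorter alternative, sketched on the same example data: Series.map returns NaN for ids that are missing from the lookup dict, and fillna then substitutes a placeholder. For the list comprehension in the question, dict.get with a default plays the same role. The `output` dict below is a made-up stand-in for the asker's defaultdict.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

# Vectorized lookup: ids not found in id_lookup become NaN, then 'N/A'.
df['id2'] = df['id'].map(id_lookup).fillna('N/A')

# Same idea inside the original list comprehension.
# `output` is a hypothetical stand-in for the defaultdict in the question.
output = {1: 10, 2: 20, 4: 40}
df_mydata = pd.DataFrame(
    [(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
    columns=['id', 'id2', 'value'])
```

Both variants avoid the explicit try/except and the row-by-row iterrows loop.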

Set decimal values to 2 points in list under list pandas

I am trying to limit the values in a nested list to at most 2 decimal places. I have already tried setting the display precision, among other things, but cannot find a way.
r_ij_matrix = variables[1]
print(type(r_ij_matrix))
print(type(r_ij_matrix[0]))
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.precision", 2)
data = pd.DataFrame(r_ij_matrix, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Combined Decision Matrix')
You can solve your problem with the DataFrame's apply() method, for example:
df.apply(lambda x: [[round(elt, 2) for elt in list_] for list_ in x])
Solved it by copying the list to another with the desired decimal points. Thanks everyone.
rij_matrix = variables[1]
rij_nparray = np.empty([8, 6, 3])
for i in range(8):
    for j in range(6):
        for k in range(3):
            rij_nparray[i][j][k] = round(rij_matrix[i][j][k], 2)
rij_list = rij_nparray.tolist()
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(rij_list, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Normalized Fuzzy Decision Matrix (r_ij)')
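The three nested loops can probably be replaced by a single NumPy call. A minimal sketch, using a small made-up 2x2x3 list in place of variables[1] (the real data in the question is 8x6x3) and assuming every innermost list has the same length:

```python
import numpy as np

# Hypothetical stand-in for variables[1]: rows x attributes x 3 values.
rij_matrix = [
    [[2.04939015, 2.28035085, 2.48997992], [3.11448230, 3.27108545, 3.42052628]],
    [[2.04939015, 2.28035085, 2.48997992], [3.11448230, 3.27108545, 3.42052628]],
]

# Round the whole array at once, then convert back to nested Python lists.
rij_list = np.round(np.asarray(rij_matrix, dtype=float), 2).tolist()
print(rij_list[0][0])
```

The values stay numeric, so they can still be used in calculations after rounding.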
applymap seems to be a good fit here, with one caveat:
Storing lists as DataFrame values is probably not the best idea, because you give up most of pandas' functionality. Also, after formatting them like this, the values are stored as strings, so this should only be used for presentation (if it is really wanted).
df.applymap(lambda lst: list(map("{:.2f}".format, lst)))
Output:
A B
0 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
1 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
2 [2.05, 2.28, 2.49] [3.11, 3.27, 3.42]
Used Input:
df = pd.DataFrame({
'A': [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463],
[2.04939015319192, 2.280350850198276, 2.4899799195977463]],
'B': [[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414],
[3.1144823004794873, 3.271085446759225, 3.420526275297414]]})

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried this:
from itertools import groupby
arr = ['a', 'a', 'b', 'a', 'a', 'c','c']
print([x[0] for x in groupby(arr)])
How do I remove the consecutive duplicate entries in each cell of the dataframe?
a,a,b,c should be a,b,c
From what I understand, you want to drop values that repeat consecutively. You can try this custom function:
def myfunc(x):
    s=pd.Series(x.split(','))
    res=s[s.ne(s.shift())]
    return ','.join(res.values)
print(df.applymap(myfunc))
col1 col2
0 a,b a,b
1 a,c,b a,b,a
Another function can be created with itertools.groupby, such as:
from itertools import groupby
def myfunc(x):
    l=[x[0] for x in groupby(x.split(','))]
    return ','.join(l)
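Put together as a runnable sketch on the data from the question. The function name collapse_runs is made up for illustration, and Series.map is applied column by column as one way to run a function on every cell:

```python
from itertools import groupby

import pandas as pd


def collapse_runs(cell):
    # groupby yields one (key, group) pair per run of equal neighbours,
    # so keeping only the keys drops consecutive repeats.
    return ','.join(key for key, _ in groupby(cell.split(',')))


df = pd.DataFrame({'col1': ['a,a,b', 'a,c,c,b'],
                   'col2': ['a,a,b', 'a,b,b,a']})

# Apply the function to every cell, one column at a time.
result = df.apply(lambda col: col.map(collapse_runs))
print(result)
```

Because only neighbouring repeats are collapsed, 'a,b,b,a' becomes 'a,b,a' rather than 'a,b'.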
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',') # split string into a list
    uniques = set(split) # remove all duplicate elements (a set does not preserve order)
    return ','.join(uniques) # rejoin the elements into a string
result = df.applymap(remove_dups)
This returns:
col1 col2
0 a,b a,b
1 a,c,b a,b
Edit: This looks slightly different to your expected output, why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)

How to iterate over dfs and append data with combine names

I have this problem to solve. It is a continuation of a previous question, "How to iterate over pandas df with a def function variable function", and the answer given there worked perfectly, but now I have to append all the data into a two-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound". Both are numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) that depend on the type of adduct, and Adduct_mass is a number (positive or negative) that depends on the adduct.
My data: two data frames. One holds the adduct names, M, Charge and Adduct_mass. The other holds the Compound_name and Exact_mass of the compounds I want to iterate over (I only show a small data set).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4,
"C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3",
316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
#Defining general function
def Adduct(x,i):
    return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function for indices 0 to 4.
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Those are the right calculations, but I now need a file where:
- only two columns exist (Name and Mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
I also need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice, and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
RT is the same value for all forms of a compound; in this example RT is 1 for a, 3 for b, 2 for c, etc.
Is it possible to keep this column from the data set df (updated below)? As you can see, df has more columns, such as "Formula" and "RT", which disappear after the calculations.
import pandas as pd
data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3], ["c",
"C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4], ["e", "C20H28O3",
316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (sorry and thank you)
This is a trial I did on a small data set (df) using the code below, with the same df_al as above.
df=
Code
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID= df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df=df.drop(columns=["RT"])
#Defining general function
def Adduct(x,i):
    return x*df_M[i]/df_div[i] + df_mass[i]
#Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x,i))
df
output
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
output
Why NaN?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep=r"\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x: x['Name'] + "_" + x['variable'], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output:
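On the "Why NaN?" part of the question: one likely cause, assuming the real compound names are longer than one character, is that x[0] takes only the first character of the name, which is then not a key of RT. A minimal sketch that sidesteps the lookup by building the long table directly with a cross join (available in pandas 1.2 and later), so RT simply travels along with each compound. The data is a shortened version of the tables in the question, and the variable names long and result are made up for illustration:

```python
import pandas as pd

df_al = pd.DataFrame(
    [["M+3H", 3, 1, 1.007276], ["M+H", 1, 1, 1.007276], ["2M+H", 1, 2, 1.007276]],
    columns=["Ion_name", "Charge", "M", "Adduct_mass"])
df = pd.DataFrame(
    [["a", 596.465179, 1], ["b", 514.293038, 3]],
    columns=["Name", "exact_mass", "RT"])

# Pair every compound with every adduct (compounds stay in their original order).
long = df.merge(df_al, how="cross")

# Adduct = exact_mass * M / Charge + Adduct_mass, computed for all pairs at once.
long["Mass"] = long["exact_mass"] * long["M"] / long["Charge"] + long["Adduct_mass"]

# Combine compound name and ion form, e.g. "a_M+3H".
long["Name"] = long["Name"] + "_" + long["Ion_name"]

result = long[["Name", "Mass", "RT"]]
print(result)
```

Any other per-compound column, such as Formula, could be kept the same way by adding it to the final column selection.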

Python Pandas Create Multiple dataframes by slicing data at certain locations

I am new to Python and to data analysis with code. I have a long CSV file and I would like to create DataFrames dynamically and plot them later on. Here is an example DataFrame similar to the data in my CSV file:
df = pd.DataFrame(
{"a" : [4 ,5, 6, 'a', 1, 2, 'a', 4, 5, 'a'],
"b" : [7, 8, 9, 'b', 0.1, 0.2, 'b', 0.3, 0.4, 'b'],
"c" : [10, 11, 12, 'c', 10, 20, 'c', 30, 40, 'c']})
As seen, there are elements that repeat in each column. So I first need to find the indices of the repetitions and then use them to make subsets. Here is how I did this:
find_Repeat = df.groupby(['a'], group_keys=False).apply(lambda df: df if
df.shape[0] > 1 else None)
repeat_idxs = find_Repeat.index[find_Repeat['a'] == 'a'].tolist()
If I print repeat_idxs, I would get
[3, 6, 9]
And this is the example of what I want to achieve in the end
dfa_1 = df['a'][Index_Identifier[0]:Index_Identifier[1]]
dfa_2 = df['a'][Index_Identifier[1]:Index_Identifier[2]]
dfb_1 = df['b'][Index_Identifier[0]:Index_Identifier[1]]
dfb_2 = df['b'][Index_Identifier[1]:Index_Identifier[2]]
But this is neither efficient nor convenient, as I need to create many DataFrames like these for plotting later on. So I tried the following method:
dfNames = ['dfa_' + str(i) for i in range(len(repeat_idxs))]
dfs = dict()
for i, row in enumerate(repeat_idxs):
    dfName = dfNames[i]
    slices = df['a'].loc[row:row+1]
    dfs[dfName] = slices
If I print dfs, this is exactly what I want.
{'dfa_0': 3    a
4    1
Name: a, dtype: object, 'dfa_1': 6    a
7    4
Name: a, dtype: object, 'dfa_2': 9    a
Name: a, dtype: object}
However, when I read my CSV file and apply the above, I do not get the desired result. I can find the repeated indices from the CSV file, but I am not able to slice the data properly. I presume I am not reading the CSV file correctly. I attached the CSV file for further clarification.
Two options:
Loop over and slice
Detect the repeat row indices and then loop over to slice contiguous chunks of the dataframe, ignoring the repeat rows:
# detect rows for which all values are equal to the column names
repeat_idxs = df.index[(df == df.columns.values).all(axis=1)]
slices = []
start = 0
for i in repeat_idxs:
    slices.append(df.loc[start:i - 1])
    start = i + 1
The result is a list of dataframes slices, which are the slices of your data in order.
Use pandas groupby
You could also do this in one line using pandas groupby if you prefer:
grouped = df[~(df == df.columns.values).all(axis=1)].groupby((df == df.columns.values).all(axis=1).cumsum())
And you can now iterate over the groups like so:
for i, group_df in grouped:
    # do something with group_df
    print(group_df)
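A self-contained sketch of the groupby idea on the example data from the question, collecting the chunks in a dict so they can be looked up by name for plotting later. The names is_header, block_id, chunks and the "chunk_N" keys are made up for illustration, and the sketch assumes the repeated rows are exact copies of the column names:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [4, 5, 6, 'a', 1, 2, 'a', 4, 5, 'a'],
    "b": [7, 8, 9, 'b', 0.1, 0.2, 'b', 0.3, 0.4, 'b'],
    "c": [10, 11, 12, 'c', 10, 20, 'c', 30, 40, 'c']})

# True for rows whose values all equal the column names (the repeated header rows).
is_header = (df == df.columns.values).all(axis=1)

# Running count of header rows seen so far: constant within each data block.
block_id = is_header.cumsum()

# Drop the header rows, group the rest by block, and convert each chunk to numbers.
chunks = {
    f"chunk_{key}": part.reset_index(drop=True).apply(pd.to_numeric)
    for key, part in df[~is_header].groupby(block_id[~is_header])
}

for name, chunk in chunks.items():
    print(name, chunk.shape)
```

Converting with pd.to_numeric matters for plotting, because the mixed columns are otherwise stored with object dtype.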

Resources