Does ParameterGrid produce duplicates? - python-3.x

Does the ParameterGrid function from scikit-learn 0.22 (Python 3.7.5) produce duplicates, or am I using it incorrectly? Have a look at the following example.
from sklearn.model_selection import ParameterGrid
import pandas as pd
hyper_params_dict = {
"SQM_FOLDER_SUFFIX": ["_SQM_MM"],
"HYPER_RATIO_SCORED_POSES": [0.8],
"HYPER_OUTLIER_MAD_THRESHOLD": [2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0],
"HYPER_KEEP_MAX_DeltaG_POSES": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0],
"HYPER_KEEP_POSE_COLUMN": ["r_i_docking_score"],
"HYPER_SELECT_BEST_BASEMOLNAME_SCORE_BY": ["Eint"],
"HYPER_SELECT_BEST_BASEMOLNAME_POSE_BY": ["Eint"],
"HYPER_SELECT_BEST_STRUCTVAR_POSE_BY": ["complexE"],
"CROSSVAL_PROTEINS_STRING": ['MARK4', 'ACHE', 'JNK2', 'AR', 'EPHB4', 'PPARG', 'MDM2', 'PARP-1', 'TP', 'TPA',
'SIRT2', 'SARS-HCoV', 'PPARG'],
"XTEST_PROTEINS_STRING": [""],
"HYPER_2D_DESCRIPTORS": [""],
"HYPER_3D_DESCRIPTORS": [""],
"HYPER_GLIDE_DESCRIPTORS": [""]
}
df = pd.concat([pd.DataFrame({k: [v] for k, v in p.items()}) for p in ParameterGrid(hyper_params_dict)], ignore_index=True)
df.duplicated().value_counts()

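ParameterGrid builds every combination of the values you give it and does not introduce duplicates on its own, but it also does not remove repeated values from the input lists.
You get duplicated parameter combinations because CROSSVAL_PROTEINS_STRING contains the value PPARG twice, so every combination that includes PPARG is generated twice.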
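A minimal sketch of the fix, assuming you simply want each value to appear once: remove repeated values from every list before building the grid. To stay free of dependencies, the sketch uses itertools.product as a stand-in for the Cartesian product that ParameterGrid performs; the grid contents and the helper count_combinations are hypothetical and only there for illustration.

```python
from itertools import product

# Hypothetical, shortened grid: "PPARG" is listed twice on purpose
grid = {
    "threshold": [2.0, 2.5],
    "protein": ["MARK4", "PPARG", "ACHE", "PPARG"],
}

def count_combinations(param_grid):
    """Count the Cartesian product of all value lists (stand-in for a grid search)."""
    return sum(1 for _ in product(*param_grid.values()))

# dict.fromkeys drops repeated values while keeping the original order
deduplicated = {key: list(dict.fromkeys(values)) for key, values in grid.items()}

before = count_combinations(grid)          # 2 * 4 combinations
after = count_combinations(deduplicated)   # 2 * 3 combinations
print(before, after)
```

The same dictionary comprehension can be applied to hyper_params_dict before passing it to ParameterGrid.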

Related

I want to combine my simple lists into a list with a list comprehension in Python

I have two lists of 24 values each, and I would like to create a list that can be seen as a 24x2 matrix, where the first column holds the values of my first list and the second column holds the values of my second list.
Here are my two lists:
q = [6.0, 5.75, 5.5, 5.25, 5.0, 4.75, 4.5, 4.25, 4.0, 3.75, 3.5, 3.25, 3.0, 2.75, 2.5, 2.25, 2.0, 1.75, 1.5, 1.25, 1.0, 0.75, 0.5, 0.25]
t = [0.38, 0.51, 0.71, 1.09, 2.0, 5.68, 0.31, 0.32, 0.34, 0.35, 0.36, 0.38, 0.4, 0.42, 0.44, 0.48, 0.51, 0.56, 0.63, 0.74, 1.41, 2.17, 3.97, 11.36]
You can use the zip() function like this:
q = [6.0, 5.75, 5.5, 5.25, 5.0, 4.75, 4.5, 4.25, 4.0, 3.75, 3.5, 3.25, 3.0, 2.75, 2.5, 2.25, 2.0, 1.75, 1.5, 1.25, 1.0, 0.75, 0.5, 0.25]
t = [0.38, 0.51, 0.71, 1.09, 2.0, 5.68, 0.31, 0.32, 0.34, 0.35, 0.36, 0.38, 0.4, 0.42, 0.44, 0.48, 0.51, 0.56, 0.63, 0.74, 1.41, 2.17, 3.97, 11.36]
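# Pair the values position by position: [(6.0, 0.38), (5.75, 0.51), ...]
L1 = list(zip(q, t))
# Flatten the pairs into one interleaved list: [6.0, 0.38, 5.75, 0.51, ...]
res = []
for i, j in L1:
    res.append(i)
    res.append(j)
print(res)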
It seems that you just need to zip your two lists:
myList = [0,1,2,3,4,5]
myOtherList = ["a","b","c","d","e","f"]
# Iterator of tuples
zip(myList, myOtherList)
# List of tuples
list(zip(myList, myOtherList))
You will get this result: [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')].
If you need another structure, you could use a comprehension:
length = min(len(myList), len(myOtherList))
# List of list
[[myList[i], myOtherList[i]] for i in range(length)]
# Dict
{myList[i]: myOtherList[i] for i in range(length)}
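For the 24x2 layout asked about in the question, a short sketch that builds the rows directly from zip() without indexing. The lists here are shortened to three values each for readability, and the name rows is only illustrative.

```python
# Shortened versions of the two lists from the question
q = [6.0, 5.75, 5.5]
t = [0.38, 0.51, 0.71]

# One inner list per position: first column from q, second column from t
rows = [[a, b] for a, b in zip(q, t)]
print(rows)

# A whole column can be read back with another comprehension
first_column = [row[0] for row in rows]
```

Note that zip() stops at the shorter input, so lists of unequal length are silently truncated.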

How to use fit_transform with an array?

Example of array content:
[
[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
[4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
[4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
...
]
model = TSNE(learning_rate=100)
transformed = model.fit_transform(data)
I'm trying to apply tSNE to a float array, but I get an error. What should I change?
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (149,) + inhomogeneous part.
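The error means that the rows of data do not all have the same length, so NumPy cannot build a 2D array from them. Make sure every row has the same number of values, then try this example:
from sklearn.manifold import TSNE
import numpy as np
# Every row has the same length, so np.array builds a regular 2D float array
X = np.array([[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
              [4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
              [4.6, 3.1, 1.5, 0.2, 0.0, 2.0]])
# perplexity must be smaller than the number of samples (3 here);
# recent scikit-learn versions raise an error otherwise
model = TSNE(learning_rate=100, perplexity=2)
transformed = model.fit_transform(X)
print(transformed)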
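To locate the offending rows before calling fit_transform, one option is to compare row lengths in plain Python. This is only a sketch: the data below is hypothetical (the second row is deliberately missing a value), and it assumes the most common row length is the intended one.

```python
from collections import Counter

# Hypothetical data: the second row is missing one value
data = [
    [4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
    [4.7, 3.2, 1.3, 0.2, 0.0],
    [4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
]

# Count how many rows have each length
length_counts = Counter(len(row) for row in data)

# Assume the most common length is the intended one
expected_length = length_counts.most_common(1)[0][0]

# Indices of rows whose length differs from the expected one
bad_rows = [i for i, row in enumerate(data) if len(row) != expected_length]
print(length_counts, bad_rows)
```

Once the rows listed in bad_rows are fixed or dropped, np.array(data) yields a regular 2D array.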

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)

I am trying to select the columns of a dataframe whose correlation with another column is above a certain threshold in absolute value, like below.
df.loc[:, (df.corr()['col'] <= -0.05) | (df.corr()['col'] >= 0.05)]
But I am getting the error below:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
Also, if I try to select the columns with variance > 1, I get the same error:
df.loc[:, df.var() > 1]
Why am I getting an indexing error? I want to filter out the columns of the dataframe whose correlation with another column is between -0.05 and 0.05.
Can someone help me resolve this issue? I am not sure where I am going wrong.
I think I found your problem.
First I tried to build my own test set, but everything worked fine:
df = pd.DataFrame({
"col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
"A": [1.1, 1.0, 1.0, 1.0, 1.0, 1.1],
"B": [1.0, 2.1, 3.0, 3.9, 5.0, 6.0]
})
df.loc[:, (df.corr()['col'] <= -0.05) | (df.corr()['col'] >= 0.05)]
I got:
   col    B
0  1.0  1.0
1  2.0  2.1
2  3.0  3.0
3  4.0  3.9
4  5.0  5.0
5  6.0  6.0
But then, after reading your error again, I thought there might be columns in your data that the corr() method simply ignores, such as columns with an object dtype. The boolean Series then has fewer entries than the dataframe has columns, so the two cannot be aligned.
If I build a new test set with a text column, I get the same error as you:
df = pd.DataFrame({
"col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
"A": [1.1, 1.0, 1.0, 1.0, 1.0, 1.1],
"B": [1.0, 2.1, 3.0, 3.9, 5.0, 6.0],
"C": ["A", "B", "C", "D", "E", "F"]
})
df.corr()['col'] >= 0.05
df.loc[:, (df.corr()['col'] <= -0.05) | (df.corr()['col'] >= 0.05)]
Then I got:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
One way of fixing this is to drop the weakly correlated columns by name instead of using a boolean mask:
df = df.drop(columns=df.corr().query("-0.05 < col < 0.05").index)
Note: Remember that you'll get quicker and more relevant answers if you provide a full sample of the non-working code so that your error can be reproduced easily ;)
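If you would rather keep the boolean-mask approach, a sketch of an alternative is to reindex the mask so that it covers every column of the dataframe. It assumes pandas 1.5 or later, where corr() accepts numeric_only; in pandas 2.0 and later, corr() no longer skips non-numeric columns by default, so passing numeric_only=True is needed when text columns are present. The names corr, mask and selected are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "A": [1.1, 1.0, 1.0, 1.0, 1.0, 1.1],
    "B": [1.0, 2.1, 3.0, 3.9, 5.0, 6.0],
    "C": ["A", "B", "C", "D", "E", "F"],
})

# Correlation with "col", computed on the numeric columns only
corr = df.corr(numeric_only=True)["col"]

# True for columns whose correlation lies outside (-0.05, 0.05)
mask = (corr <= -0.05) | (corr >= 0.05)

# Extend the mask to all columns; columns missing from corr (here "C") become False
mask = mask.reindex(df.columns, fill_value=False)

selected = df.loc[:, mask]
print(selected.columns.tolist())
```

Here the text column is excluded from the result; use fill_value=True instead if non-numeric columns should be kept.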

Replace columns in a 2D numpy array by columns from another 2D array

I have two 2D arrays. I want to create arrays that are copies of the first one, with some columns replaced by the corresponding columns of the second one.
M1 = np.array([[1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0, 4.0, 5.0, 6.0]])
M2 = np.array([[1.1, 2.1, 3.1, 1.2, 2.2, 3.2],
               [4.1, 5.1, 6.1, 4.2, 5.2, 6.2]])
I want to do a loop that can give the following arrays:
M3 = np.array([[1.1, 2.0, 3.0, 1.2, 2.0, 3.0],
[4.1, 5.0, 6.0, 4.2, 5.0, 6.0]])
M4 = np.array([[1.0, 2.1, 3.0, 1.0, 2.2, 3.0],
[4.0, 5.1, 6.0, 4.0, 5.2, 6.0]])
M5 = np.array([[1.0, 2.0, 3.1, 1.0, 2.0, 3.2],
[4.0, 5.0, 6.1, 4.0, 5.0, 6.2]])
You can use np.where:
selector = [1,0,0,1,0,0]
np.where(selector,M2,M1)
# array([[1.1, 2. , 3. , 1.2, 2. , 3. ],
# [4.1, 5. , 6. , 4.2, 5. , 6. ]])
selector = [0,1,0,0,1,0]
np.where(selector,M2,M1)
# array([[1. , 2.1, 3. , 1. , 2.2, 3. ],
# [4. , 5.1, 6. , 4. , 5.2, 6. ]])
etc.
Or in a loop:
M3,M4,M5 = (np.where(s,M2,M1) for s in np.tile(np.identity(3,bool), (1,2)))
M3
# array([[1.1, 2. , 3. , 1.2, 2. , 3. ],
# [4.1, 5. , 6. , 4.2, 5. , 6. ]])
M4
# array([[1. , 2.1, 3. , 1. , 2.2, 3. ],
# [4. , 5.1, 6. , 4. , 5.2, 6. ]])
M5
# array([[1. , 2. , 3.1, 1. , 2. , 3.2],
# [4. , 5. , 6.1, 4. , 5. , 6.2]])
Alternatively, you can copy M1 and then slice in M2. This is more verbose but should be faster:
n = 3
Mj = []
for j in range(n):
    Mp = M1.copy()
    # Overwrite every n-th column, starting at column j, with the values from M2
    Mp[:, j::n] = M2[:, j::n]
    Mj.append(Mp)
M3, M4, M5 = Mj

generate list of means from lists of floats python

I'm trying to write simple code that will take floats in two lists, find the mean between the two numbers in the same position in each list, and generate a new list with the updated means. For example, with list_1 and list_2,
list_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
list_2 = [6.0, 7.0, 8.0, 9.0, 10.0]
list_3 = []
for i in list_1:
    for x in list_2:
        list_3.append((x+i)/2)
print(list_3)
Find the mean between floats in two lists and create a new list such that:
list_3 = [3.5, 4.5, 5.5, 6.5, 7.5]
I tried creating a for loop, but (for obvious reasons) it doesn't iterate the way that I want it to. The output is:
[3.5, 4.0, 4.5, 5.0, 5.5, 4.0, 4.5, 5.0, 5.5, 6.0, 4.5, 5.0, 5.5, 6.0, 6.5, 5.0, 5.5, 6.0, 6.5, 7.0, 5.5, 6.0, 6.5, 7.0, 7.5]
Any help would be greatly appreciated!
You can do that with a list comprehension like:
Code:
[sum(x)/len(x) for x in zip(list_1, list_2)]
How:
The function zip() allows easy iteration through multiple lists at the same time. From there these values can be fed into sum() and len() as shown.
Test Code:
list_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
list_2 = [6.0, 7.0, 8.0, 9.0, 10.0]
list_3 = [sum(x)/len(x) for x in zip(list_1, list_2)]
print(list_3)
Results:
[3.5, 4.5, 5.5, 6.5, 7.5]
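As a variation on the same idea, here is a sketch that unpacks each pair explicitly, plus a version using statistics.fmean (available since Python 3.8) that also works if more than two lists are zipped together. The names means_pairwise and means_general are only illustrative.

```python
from statistics import fmean

list_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
list_2 = [6.0, 7.0, 8.0, 9.0, 10.0]

# Unpack each pair and average it directly
means_pairwise = [(a + b) / 2 for a, b in zip(list_1, list_2)]

# fmean averages any number of values, so more lists can be added to zip()
means_general = [fmean(values) for values in zip(list_1, list_2)]

print(means_pairwise)
print(means_general)
```

Unlike the nested loops in the question, zip() visits each position exactly once, which is why the result has five values rather than twenty-five.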
