Measure classifier by using cross validation with ROC metrics - python-3.x

I am trying to run cross-validation with the ROC metric to evaluate a classifier, and I came across the following code from scikit-learn:
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
I have trouble understanding the X, y = X[y != 2], y[y != 2] line. What is the purpose of this line?
Also, can someone help me understand the use of the underscore names
n_samples, n_features?
Thanks!

The Iris dataset has three classes, labeled 0, 1 and 2.
When you see
X, y = X[y != 2], y[y != 2]
it just means that the new values of X and y will not contain records for the class with label 2.
Here is how it works.
y != 2 returns a boolean vector of the same length as y that contains True where y is 0 or 1 and False where y is 2, e.g. [True, True, ..., False, False]. Such a vector is often called a mask.
y[y != 2] is boolean indexing: it returns a new array consisting of the elements of y that are not 2, so the resulting array contains no 2s.
Finally, X[y != 2] returns a new array with the rows of X that correspond to the True values of the mask.
Since X and y have the same length (one row of X per label in y), applying the same mask to both keeps them aligned, and effectively all records with class label 2 are removed.
As for why an entire class is removed from the dataset: that is something to look for in the tutorial you were reading. A likely reason is that the basic ROC curve is defined for binary classification, so the example reduces the problem to two classes.
X.shape returns a tuple with the number of rows and the number of columns of the array X. These are what data scientists call samples and features, so n_samples, n_features = X.shape unpacks the tuple into two variables. The underscore is simply part of each variable name.
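To make the masking steps concrete, here is a minimal sketch on a toy array rather than the real Iris data. The six-row X and its labels are made up for illustration; only the indexing pattern is taken from the question.

```python
import numpy as np

# Toy stand-in for the Iris arrays: 6 samples, 2 features, labels 0/1/2
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 2, 2])

mask = y != 2                       # boolean vector, True where the label is 0 or 1
X2, y2 = X[mask], y[mask]           # the same mask keeps rows and labels aligned

n_samples, n_features = X2.shape    # tuple unpacking: (rows, columns)
print(mask)                         # [ True  True  True  True False False]
print(n_samples, n_features)        # 4 2
```

The two rows labeled 2 are dropped from both arrays, and the shape tuple is unpacked into two ordinary variables.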

Related

Plotting a Line of Best Fit on the Same Plot for Multiple Datasets

I am trying to approximate a line of best fit between multiple datasets, and display everything on one plot. This question addresses a similar notion, but the contents are in MATLAB and, hence, not the same.
I have data from 4 different experiments, each composed of 146 values. The Y values represent changes in distance over time, and the X values are integer timesteps (1,2,3,...). The shape of my Y data is (4,146), as I've decided to keep all of it in a nested list, and the shape of my X data is (146,). I have the following set-up for my subplots:
x = [i for i in range(len(temp[0]))]
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(x,Y[0],c="blue", marker='.',linewidth=1)
ax1.scatter(x,Y[1],c="orange", marker='.',linewidth=1)
ax1.scatter(x,Y[2],c="green", marker='.',linewidth=1)
ax1.scatter(x,Y[3],c="purple", marker='.',linewidth=1)
z = np.polyfit(x,Y,3) # Throws an error because x,Y are not the same length
p = np.poly1d(z)
plt.plot(x, p(x))
I do not know how to fit a line of best fit between the scatter plots. numpy.polyfit documentation suggests that "Several data sets of sample points sharing the same x-coordinates can be fitted at once", but I have been unsuccessful thus far, and can only fit the line to one dataset. Is there a way that I can fit the line to all of the data sets? Should I use a different library entirely, like Seaborn?
Try casting x and Y to NumPy arrays (I assume they are lists). You can do this with x = np.asarray(x) and Y = np.asarray(Y). To fit all of the data collectively, flatten the Y array using Y.flatten(), which changes its shape from (n, N) to (n*N,). Then tile the x array n times with np.tile(x, n): this repeats the array n times in a new array, which also has shape (n*N,). This way each value of Y is matched with its corresponding value of x.
import numpy as np
N = 10 # no. of datapoints
n = 4 # no. of experiments
# creating some dummy data
x = np.linspace(0, 1, N) # shape (N,)
Y = np.random.normal(0, 1, (n, N)) # shape (n, N)
np.polyfit(np.tile(x, n), Y.flatten(), deg=3) # one cubic fitted to all the data
The polyfit function expects the Y array to have shape (146, 4) in your case, rather than (4, 146), so you should pass it the transpose of Y, e.g.,
z = np.polyfit(x, Y.T, 3)
The poly1d function can only handle one polynomial at a time, so you have to loop over the results from polyfit. The coefficients for the k-th dataset are in column k of z, so iterate over z.T, e.g.:
for res in z.T:
    p = np.poly1d(res)
    plt.plot(x, p(x))
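The two answers fit different things: one cubic per experiment, or a single cubic through all the points. The sketch below shows both side by side on made-up data with the question's shapes. The random data and the variable names are assumptions for illustration, and plotting is left out so the block stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 146                                  # experiments, timesteps (shapes from the question)
x = np.arange(N)                               # shape (146,)
Y = rng.normal(size=(n, N)).cumsum(axis=1)     # dummy data, shape (4, 146)

# One cubic per experiment: polyfit wants y with shape (N, n)
coeffs = np.polyfit(x, Y.T, 3)                 # shape (4, n): column k holds dataset k
per_dataset = [np.poly1d(c) for c in coeffs.T]

# A single cubic through all points: repeat x once per experiment
pooled = np.poly1d(np.polyfit(np.tile(x, n), Y.ravel(), 3))

print(coeffs.shape)                            # (4, 4)
print(pooled(x).shape)                         # (146,)
```

Each entry of per_dataset, as well as pooled, can be evaluated on x and passed to plt.plot(x, p(x)) on the same axes as the scatter plots.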

Multiply a [3, 2, 3] by a [3, 2] tensor in pytorch (dot product along dimension)

Given the following tensors x and y with shapes [3,2,3] and [3,2], I want to multiply them along the 2nd dimension. This is expected to be a kind of dot product and scaling along that axis, returning a [3,2,3] tensor.
import torch
a = [[[0.2,0.3,0.5],[-0.5,0.02,1.0]],[[0.01,0.13,0.06],[0.35,0.12,0.0]], [[1.0,-0.3,1.0],[1.0,0.02, 0.03]] ]
b = [[1,2],[1,3],[0,2]]
x = torch.FloatTensor(a) # shape [3,2,3]
y = torch.FloatTensor(b) # shape [3,2]
The expected output :
Expected output shape should be [3,2,3]
#output = [[[0.2,0.3,0.5],[-1.0,0.04,2.0]],[[0.01,0.13,0.06],[1.05,0.36,0.0]], [[0.0,0.0,0.0],[2.0,0.04, 0.06]] ]
I have tried the two calls below, but neither gives the desired output or output shape.
torch.matmul(x,y)
torch.matmul(x,y.unsqueeze(1).shape)
What is the best way to fix this?
This is just a broadcast multiply: insert a singleton (size-1) dimension at the end of y to make it a [3,2,1] tensor, then multiply by x. There are multiple ways to insert a singleton dimension.
# all equivalent
x * y.unsqueeze(2)
x * y[..., None]
x * y[:, :, None]
x * y.reshape(3, 2, 1)
You could also use torch.einsum.
torch.einsum('abc,ab->abc', x, y)
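As a quick check, the sketch below runs the broadcast multiply and the einsum variant on the tensors from the question and compares the result with the expected output given there. The names expected, matches_expected and variants_agree are introduced here for illustration only.

```python
import torch

x = torch.tensor([[[0.2, 0.3, 0.5], [-0.5, 0.02, 1.0]],
                  [[0.01, 0.13, 0.06], [0.35, 0.12, 0.0]],
                  [[1.0, -0.3, 1.0], [1.0, 0.02, 0.03]]])    # shape [3, 2, 3]
y = torch.tensor([[1.0, 2.0], [1.0, 3.0], [0.0, 2.0]])        # shape [3, 2]

# Expected output as written in the question
expected = torch.tensor([[[0.2, 0.3, 0.5], [-1.0, 0.04, 2.0]],
                         [[0.01, 0.13, 0.06], [1.05, 0.36, 0.0]],
                         [[0.0, 0.0, 0.0], [2.0, 0.04, 0.06]]])

out = x * y.unsqueeze(2)                          # [3, 2, 1] broadcasts against [3, 2, 3]
out_einsum = torch.einsum('abc,ab->abc', x, y)    # same scaling written as an einsum

matches_expected = torch.allclose(out, expected)
variants_agree = torch.allclose(out, out_einsum)
print(tuple(out.shape), matches_expected, variants_agree)
```

Each row x[a, b, :] is scaled by the single number y[a, b], which is why the result keeps the shape of x.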

Could someone please help me with sklearn.metrics.roc_curve's use and what does the function expect?

I am trying to construct two NumPy ndarrays from a networkx Graph's data structures, which look like a list of tuples and a simple list. I would like to make a ROC curve where
the validation set is the above-mentioned list of tuples of the edges of a graph G, which I tried to construct like this:
x = []
for i in G_orig.nodes():
    for j in G_orig.nodes():
        if j > i and (i, j) not in G.edges():
            if (i, j) in G_orig.edges():
                x.append((i, j, 1))
            else:
                x.append((i, j, 0))
y_validation = np.array(x)
It looks something like this: [(1, 344, 1), (2, 23, 0), (3, 5, 0), ...... (333, 334, 1)].
The first 2 numbers mean 2 nodes, the 3rd one means whether there is an edge between them. 1 means edge, 0 means no edge.
Then roc_curve expects something called y_score in the documentation. I have a list for that, made with a method called preferential attachment, so I named it pref_att_types. I converted it to a NumPy array in case roc_curve only accepts arrays.
positive_class_predicted_probabilities = np.array(pref_att_types)
Then I just did what we used in class.
FPRs, TPRs, thresholds = roc_curve(y_validation,
                                   positive_class_predicted_probabilities,
                                   pos_label=1)
It is literally just Ctrl+C and Ctrl+V, but it raises a ValueError: 'multiclass-multioutput format is not supported'. Please note that I am not a programmer, just someone studying to be a mathematics analyst.
The first argument, y_true, needs to be just the true labels, in this case the 0/1 values without the pair of nodes. Also make sure that the entries of y_validation and pref_att_types are in the same order, so that each label lines up with its score.
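A minimal sketch of that fix, assuming y_validation has the (node, node, label) layout shown in the question: slice out the third column so that only a 1-D vector of 0/1 labels remains. The triples and scores below are made up for illustration.

```python
import numpy as np

# Hypothetical (node_i, node_j, label) triples in the question's layout
triples = [(1, 344, 1), (2, 23, 0), (3, 5, 0), (333, 334, 1)]
y_validation = np.array(triples)       # shape (4, 3): too many columns for y_true

y_true = y_validation[:, 2]            # keep only the 0/1 label column, shape (4,)

# Hypothetical scores, one per node pair, in the same order as `triples`
scores = np.array([0.9, 0.2, 0.4, 0.7])

same_length = y_true.shape == scores.shape
print(y_true, same_length)             # [1 0 0 1] True
```

With the labels isolated like this, the call from the question would become roc_curve(y_true, scores, pos_label=1).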
The code below draws the ROC curves for two random forest models (rf1 and rf2 are assumed to be fitted classifiers, and X_test, y_test the held-out data):
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
# create arrays of predicted class probabilities
y_test_predict1_probaRF = rf1.predict_proba(X_test)
y_test_predict2_probaRF = rf2.predict_proba(X_test)
# column 1 holds the probability of the positive class
RFfpr1, RFtpr1, thresholds = roc_curve(y_test, y_test_predict1_probaRF[:,1])
RFfpr2, RFtpr2, thresholds = roc_curve(y_test, y_test_predict2_probaRF[:,1])
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1], [0,1], "k--")  # diagonal of a random classifier
    plt.axis([0,1,0,1])
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
plot_roc_curve(RFfpr1, RFtpr1, "RF1")
plot_roc_curve(RFfpr2, RFtpr2, "RF2")
plt.legend()
plt.show()

Converting labels to one-hot encoding

So I was learning one-hot encoding using the iris dataset:
iris = load_iris()
X = iris['data'] # the complete data -2D
Y = iris['target'] # 1-D only the 150 rows
names = iris['target_names'] #['setosa','versicolor','virginica']
feature_names = iris['feature_names']# [sl,sw,pl,pw]
isamples = np.random.randint(len(Y), size = 5)
Ny = len(np.unique(Y))
Y = keras.utils.to_categorical(Y[:], num_classes = Ny)
print('X:', X[isamples,:])
print('Y:', Y[isamples])
I am confused by this part:
Y = keras.utils.to_categorical(Y[:], num_classes = Ny)
What does Y[:] mean, and what is the use of : in print(X[isamples,:])?
The iris data set consists of 50 samples from each of three species of Iris flower (Iris setosa, Iris versicolor, and Iris virginica), 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
In your code, X represents the set of features to train your model on, which you get from iris.data, and Y represents the target label for each row of X, which you get from iris.target. The labels are represented by numerical values (0 for setosa, 1 for versicolor, and 2 for virginica); you can get the name of each class from iris.target_names.
The colon you see between the brackets is Python's slice notation, which lets you take a subset of the elements of a sequence. For example, if you have a list l = [1,2,3,4] and you want just the second and third elements, you can use l[1:3]. Using the colon without numbers, as in l[:], selects the whole sequence. For a list this gives a copy; for a NumPy array such as Y it gives a view of the whole array, so Y[:] here simply passes all of Y to to_categorical.
In print(X[isamples,:]), isamples is an array of 5 randomly generated indices between 0 and 149 (len(Y) is 150). X[isamples,:] therefore takes 5 random samples from the feature array, and the colon in the second position selects all four features for each sample.
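The slicing behaviour described above can be checked on a small example. This is a sketch on made-up arrays rather than the real Iris data, and np.eye(Ny)[Y] is used only as a plain NumPy stand-in for one-hot encoding, not as a statement of how keras.utils.to_categorical is implemented.

```python
import numpy as np

Y = np.array([0, 0, 1, 2, 1, 2])       # hypothetical labels for 6 samples
X = np.arange(24).reshape(6, 4)        # hypothetical features: 6 samples, 4 features

lst = [1, 2, 3, 4]
middle = lst[1:3]                      # [2, 3]: second and third elements
lst_copy = lst[:]                      # whole list, as a new list object

view = Y[:]                            # whole array; for NumPy this is a view, not a copy
shares = np.shares_memory(Y, view)     # True

Ny = len(np.unique(Y))                 # 3 classes
one_hot = np.eye(Ny)[Y]                # row i is the one-hot vector of label Y[i]

isamples = np.array([4, 0, 3])         # fixed indices standing in for the random ones
rows = X[isamples, :]                  # rows 4, 0, 3 with all four columns
print(one_hot.shape, rows.shape)       # (6, 3) (3, 4)
```

The first index of X[isamples, :] picks the rows and the bare colon keeps every column, which is why all four features are printed for each sampled row.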

'numpy.ndarray' object has no attribute 'iterrows' while predicting value using lstm in python

I have a dataset with four inputs and I am trying to predict the next value of X1 from a combination of the previous input values.
My four inputs are X1, X2, X3, X4.
So here I am trying to predict the next future value of X1. The next X1 is affected by this combination of the four inputs:
X1 + X2 - X3 -X4
I wrote this code inside the class. Then I wrote the code to run the LSTM, and after that the code to predict the value. Then it gave me this error. Can anyone help me solve this problem?
My code:
def model_predict(data):
    pred=[]
    for index, row in data.iterrows():
        val = row['X1']
        if np.isnan(val):
            data.iloc[index]['X1'] = pred[-1]
            row['X1'] = pred[-1]
        f = row['X1','X2','X3','X4']
        s = row['X1'] - row['X2'] + row['X3'] -row['X4']
        val = model.predict(s)
        pred.append(val)
    return np.array(pred)
After the LSTM code I wrote the code to predict the value:
pred = model_predict(x_test_n)
It gave me this error:
---> 5 pred = model_predict(x_test_n)
      def model_predict(data):
          pred=[]
-->       for index, row in data.iterrows():
              val = row['X1']
              if np.isnan(val):
AttributeError: 'numpy.ndarray' object has no attribute 'iterrows'
Apparently, the data argument of your function is a NumPy array, not a DataFrame.
As an np.ndarray, data also has no named columns.
One possible solution, keeping the argument as an np.ndarray, is to:
iterate over the rows of this array using np.apply_along_axis(),
refer to columns by index (instead of by name).
Another solution is to create a DataFrame from data, setting proper
column names, and iterate over its rows.
Here is one possible way to write the code without a DataFrame.
Assume that data is a NumPy array with 4 columns,
containing X1, X2, X3 and X4 respectively:
[[ 1 2 3 4]
[10 8 1 3]
[20 6 2 5]
[31 3 3 1]]
Then your function can be:
def model_predict(data):
    s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3],
                            axis=1, arr=data)
    return model.predict(s)
Note that:
s, the set of all input values to your model, can be computed in a single
instruction by calling apply_along_axis, which applies the function to each row (axis=1),
the predictions can also be computed all at once by passing a NumPy
vector, namely s.
For demonstration purposes, you can compute s and print it.
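A minimal sketch of that demonstration, using the sample table from this answer. The model itself is left out, and the column-wise expression s_vectorized is an extra illustration that is not part of the original answer.

```python
import numpy as np

# Sample table from the answer: columns are X1, X2, X3, X4
data = np.array([[ 1, 2, 3, 4],
                 [10, 8, 1, 3],
                 [20, 6, 2, 5],
                 [31, 3, 3, 1]])

# Row-wise combination X1 + X2 - X3 - X4, as in model_predict
s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3],
                        axis=1, arr=data)
print(s)   # [-4 14 19 30]

# The same values computed column-wise, without a Python-level lambda
s_vectorized = data[:, 0] + data[:, 1] - data[:, 2] - data[:, 3]
```

Both forms give one value per row, so either can be passed to the model as a single vector.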
