Converting labels to one-hot encoding - python-3.x

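So I was learning one-hot encoding using the iris dataset:
import numpy as np
import keras  # assumption: standalone Keras; with TensorFlow use `from tensorflow import keras`
from sklearn.datasets import load_iris
iris = load_iris()
X = iris['data'] # the complete data, 2-D: (150, 4)
Y = iris['target'] # 1-D, one label for each of the 150 rows
names = iris['target_names'] # ['setosa', 'versicolor', 'virginica']
feature_names = iris['feature_names'] # sepal length, sepal width, petal length, petal width
isamples = np.random.randint(len(Y), size = 5)
Ny = len(np.unique(Y))
Y = keras.utils.to_categorical(Y[:], num_classes = Ny)
print('X:', X[isamples,:])
print('Y:', Y[isamples])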
I am confused by this part:
Y = keras.utils.to_categorical(Y[:], num_classes = Ny)
What does Y[:] mean, and what is the purpose of the : in print(X[isamples,:])?

The iris data set consists of 150 samples in total, 50 from each of three species of iris flower (Iris setosa, Iris versicolor, and Iris virginica). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
In your code, X holds the features to train your model on (iris.data, shape (150, 4)), and Y holds the target label for each row of X (iris.target, shape (150,)). The labels are integers: 0 for setosa, 1 for versicolor, and 2 for virginica. You can get the name of each class from iris.target_names.
The colon between brackets is Python's slice syntax, which lets you take a subset of the elements of a sequence. For example, given a list l = [1,2,3,4], l[1:3] returns just the second and third elements, [2, 3]. A colon without numbers, l[:], selects everything; for a list this gives a (shallow) copy of the whole list. Y is a NumPy array, though, and for an array Y[:] returns a view of the whole array rather than a copy. Either way to_categorical receives all 150 labels, so Y[:] here behaves the same as passing Y.
In print(X[isamples,:]), isamples is an array of 5 random integer indices between 0 and 149 (np.random.randint(len(Y), size=5) excludes the upper bound and may repeat an index). X[isamples,:] therefore means: take those 5 rows of X and, because of the colon in the second position, all four feature columns of each.
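The slicing and indexing behavior described above can be checked on a small toy example. This is only a sketch: the arrays below are made-up stand-ins for the iris data, and the np.eye line is a plain-NumPy way to one-hot encode integer labels that should match what to_categorical produces up to dtype, not a replacement for it.

```python
import numpy as np

# Toy stand-ins for the iris arrays (hypothetical data, not the real dataset).
X = np.arange(24, dtype=float).reshape(6, 4)   # 6 samples, 4 features
Y = np.array([0, 1, 2, 0, 1, 2])               # integer class labels

# Slicing a list creates a new list; slicing a NumPy array returns a view.
l = [1, 2, 3, 4]
middle = l[1:3]        # second and third elements
list_copy = l[:]       # new list with the same elements
view = Y[:]            # view onto the same memory as Y

# Integer-array index for the rows, full slice for the columns.
isamples = np.array([4, 0, 2])
rows = X[isamples, :]  # shape (3, 4); equivalent to X[isamples]

# One-hot encoding: row k of the identity matrix encodes class k.
Ny = len(np.unique(Y))
Y_onehot = np.eye(Ny)[Y]
```

Because view shares memory with Y, writing to view would also change Y; call Y.copy() when an independent copy is needed.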

Related

Plotting a Line of Best Fit on the Same Plot for Multiple Datasets

I am trying to approximate a line of best fit between multiple datasets, and display everything on one plot. This question addresses a similar notion, but the contents are in MATLAB and, hence, not the same.
I have data from 4 different experiments, each composed of 146 values. The Y values represent changes in distance over time, and the X values are integer timesteps (1,2,3,...). The shape of my Y data is (4,146), as I've decided to keep all of it in a nested list, and the shape of my X data is (146,). I have the following set-up for my subplots:
x = [i for i in range(len(temp[0]))]
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(x,Y[0],c="blue", marker='.',linewidth=1)
ax1.scatter(x,Y[1],c="orange", marker='.',linewidth=1)
ax1.scatter(x,Y[2],c="green", marker='.',linewidth=1)
ax1.scatter(x,Y[3],c="purple", marker='.',linewidth=1)
z = np.polyfit(x,Y,3) # Throws an error because x,Y are not the same length
p = np.poly1d(z)
plt.plot(x, p(x))
I do not know how to fit a line of best fit between the scatter plots. numpy.polyfit documentation suggests that "Several data sets of sample points sharing the same x-coordinates can be fitted at once", but I have been unsuccessful thus far, and can only fit the line to one dataset. Is there a way that I can fit the line to all of the data sets? Should I use a different library entirely, like Seaborn?
Try casting x and Y to NumPy arrays (I assume they are lists); you can do this with x = np.asarray(x). To fit all the data collectively, flatten the Y array using Y.flatten(), which changes its shape from (n, N) to (n*N,). Then tile the x array n times; this just copies the array n times into a new array, which also has shape (n*N,). That way each value from Y is matched with its corresponding value of x.
N = 10 # no. datapoints
n = 4 # no. experiments
# creating some dummy data
x = np.linspace(0,1, N) # shape (N,)
Y = np.random.normal(0,1,(n, N))
np.polyfit(np.tile(x, n), Y.flatten(), deg=3)
The polyfit function expects the Y array to be, in your case, (146, 4) rather than (4, 146), so you should pass it the transpose of Y, e.g.,
z = np.polyfit(x, Y.T, 3)
polyfit then returns an array of shape (deg + 1, 4), with one column of coefficients per dataset. The poly1d function can only handle one polynomial at a time, so you have to loop over those columns, e.g.:
for res in z.T:
    p = np.poly1d(res)
    plt.plot(x, p(x))
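The two suggestions above (one pooled fit versus one fit per experiment) can be compared side by side. This is a minimal sketch on made-up data with the shapes from the question; the variable names pooled, per_exp, and curves are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 146                                  # experiments, timesteps
x = np.arange(N)
Y = 0.01 * x + rng.normal(0, 0.1, (n, N))      # dummy data, shape (n, N)

# One cubic through all experiments pooled together.
pooled = np.polyfit(np.tile(x, n), Y.ravel(), deg=3)       # shape (4,)

# One cubic per experiment: polyfit expects y with shape (N, n).
per_exp = np.polyfit(x, Y.T, deg=3)                        # shape (deg + 1, n)

# Column k of per_exp holds the coefficients for experiment k.
curves = [np.poly1d(coeffs)(x) for coeffs in per_exp.T]
```

Each entry of curves can be passed to plt.plot(x, curve) to draw one line per experiment, while np.poly1d(pooled)(x) gives the single shared line.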

How to use the smoothed line with patsy cr in production?

I smooth a set of features using patsy's cr (natural cubic splines), but I am confused by something that looks very basic. Here are some sample raw data points and the corresponding points smoothed by patsy.
x = df[feature]
y = np.log(df['varTarget'])
x_val = 100
#y_val = np.log(df_val['varTarget'])
x_basis = cr(x, df=10, constraints="center", lower_bound=x.min(), upper_bound=x.max())
x_basis_val = cr(x_val, df=10, constraints="center", lower_bound=x.min(), upper_bound=x.max())
# Fit model to the data
# this model uses an input x_basis with 10 columns created through cr
model = LinearRegression().fit(x_basis, y)
# Get estimates
y_hat = model.predict(x_basis)
y_hat_val = model.predict(x_basis_val)
plt.figure(figsize=(17,7))
plt.scatter(x, y, s=4, color="tab:blue")
plt.scatter(x, y_hat, s=8, color="tab:red")
The plot (not reproduced here) shows the raw points in blue and the smoothed points in red.
So the linear regression model based on the smoothed points expects an input with 10 columns. This is created by cr. So suppose in production I have a new x = 100. Then how can I get a smoothed value for the new x relying on the smoothed line already created?
When trying with one value I get the following:
Unable to compute n_inner_knots(=4) + 2 distinct knots: 1 data
value(s) found between lower_bound(=30.023212890625) and
upper_bound(=998.42234375).

Measure classifier by using cross validation with ROC metrics

I am trying to do cross-validation with the ROC metric to evaluate a classifier, and I came across the following code from scikit-learn:
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
I have trouble understanding the X, y = X[y != 2], y[y != 2] line. What is the purpose of this line?
Also, can someone help me clarify the line n_samples, n_features = X.shape?
Thanks!
The iris dataset has three classes, labeled 0, 1, and 2.
When you see
X, y = X[y != 2], y[y != 2]
it just means that the new X and y will not contain the records of the class labeled 2.
Here is how it works.
y != 2 returns a boolean array of the same length as y that contains True where y is 0 or 1 and False where y is 2, e.g. [True, True, ..., False, False]. It is sometimes called a mask.
y[y != 2] is boolean indexing: it returns a new array made of the elements of y that are not 2, i.e. the result contains no 2s.
Finally, X[y != 2] returns a new array with the rows of X that correspond to the True values of the mask.
Since X and y have the same length, applying the same mask to both works perfectly, and in this case it effectively removes all records with class label 2.
As for why an entire class is removed from the dataset, that is something to look for in the tutorial you were reading.
X.shape returns a tuple with the number of rows and the number of columns of the array X, which data scientists call samples and features. The line n_samples, n_features = X.shape unpacks that tuple into two variables.
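The masking steps above can be tried on a tiny array. The values below are made up for illustration and are not taken from the iris data.

```python
import numpy as np

# Four toy samples with two features each, and their class labels.
X = np.array([[5.1, 3.5], [7.0, 3.2], [6.3, 3.3], [4.9, 3.0]])
y = np.array([0, 1, 2, 0])

mask = y != 2                # boolean array, True where the label is not 2
X, y = X[mask], y[mask]      # keep only the rows whose label is 0 or 1

# Tuple unpacking of X.shape into two named variables.
n_samples, n_features = X.shape
```

Here the third row is dropped, so three samples with two features each remain.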

Multivariate binary sequence prediction with CRF

This question is an extension of this one, which focuses on LSTM as opposed to CRF. Unfortunately, I do not have any experience with CRFs, which is why I'm asking these questions.
Problem:
I would like to predict a sequence of binary signals for multiple, non-independent groups. My dataset is moderately small (~1000 records per group), so I would like to try a CRF model here.
Available data:
I have a dataset with the following variables:
Timestamps
Group
Binary signal representing activity
Using this dataset I would like to forecast group_a_activity and group_b_activity which are both 0 or 1.
Note that the groups are believed to be cross-correlated and additional features can be extracted from timestamps -- for simplicity we can assume that there is only 1 feature we extract from the timestamps.
What I have so far:
Here is the data setup that you can reproduce on your own machine.
# libraries
import re
import numpy as np
import pandas as pd
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame # create a sample dataframe
.from_records(np.random.randint(2, size=[data_length, 3]))
.rename(columns={0:'a', 1:'b', 2:'extra'}))
df.head() # check it out
# shift (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last' # 'next' variables is what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            # shift(-s) pulls future rows up for 'next' (s > 0) and past rows down for 'last' (s < 0)
            df[new_var] = df[c].shift(-s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
# [5 rows x 15 columns]
Before we get to the CRF part, I suspect that I cannot approach this problem from a multi-task learning point of view (predicting patterns for both A and B via one model) and therefore I'm going to have to predict each of them individually.
Now the CRF part. I've found some relevant example (here is one) but they all tend to predict a single class value based on a prior sequence.
Here is my attempt at using a CRF here:
import pycrfsuite
crf_features = [] # a container for features
crf_labels = [] # a container for response
# lets focus on group A only for this one
current_response = [c for c in df.columns if c.startswith('a_next')]
# predictors are going to have to be nested otherwise I'll run into problems with dimensions
current_predictors = [c for c in df.columns if 'next' not in c]
current_predictors = set([re.sub(r'_\d+$','',v) for v in current_predictors])
for index, row in df.iterrows():
    # not sure if its an effective way to iterate over a DF...
    iter_features = []
    for p in current_predictors:
        pred_feature = []
        # note that 0/1 values have to be converted into booleans
        for k in range(shift_length):
            iter_pred_feature = p + '_{0:02d}'.format(k+1)
            pred_feature.append(p + "=" + str(bool(row[iter_pred_feature])))
        iter_features.append(pred_feature)
    iter_response = [row[current_response].apply(lambda z: str(bool(z))).tolist()]
    crf_labels.extend(iter_response)
    crf_features.append(iter_features)
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(crf_features, crf_labels):
    trainer.append(xseq, yseq)
trainer.set_params({
'c1': 0.0, # coefficient for L1 penalty
'c2': 0.0, # coefficient for L2 penalty
'max_iterations': 10, # stop earlier
# include transitions that are possible, but not observed
'feature.possible_transitions': True
})
trainer.train('testcrf.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('testcrf.crfsuite')
tagger.tag(xseq)
# ['False', 'True', 'False']
It seems that I did manage to get it working, but I'm not sure if I've approached it correctly. I'll formulate my questions in the Questions section, but first, here is an alternative approach using keras_contrib package:
from keras import Sequential
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
# we are gonna have to revisit data prep stage again
# separate predictors and response
response_df_dict = {}
for g in ['a','b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub(r'_\d+$','',c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reorder axes into samples (1), time stamps (2) and channels/variables (0)
# np.transpose permutes the axes; np.reshape would keep the memory order and scramble the values
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.transpose(response_array, (1, 2, 0))
predictor_array = np.array(response_array_list)
predictor_array = np.transpose(predictor_array, (1, 2, 0))
model = Sequential()
model.add(CRF(2, input_shape=(predictor_array.shape[1],predictor_array.shape[2])))
model.summary()
model.compile(loss=crf_loss, optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict(predictor_array) # not gonna worry about train/test split here
Questions:
My main question is whether or not I've constructed both of my CRF models correctly. What worries me is that (1) there is not a lot of documentation out there on CRF models, (2) CRFs are mainly used for predicting a single label given a sequence, (3) the input features are nested and (4) when used in a multi-tasked fashion, I'm not sure if it is valid.
I have a few extra questions as well:
Is a CRF appropriate for this problem?
How do the 2 approaches (one based on pycrfsuite and one based on keras_contrib) differ, and what are their advantages/disadvantages?
In a more general sense, what is the advantage of combining CRF and LSTM models into one (like the one discussed here)?
Many thanks!

How to set up the number of inputs neurons in sklearn MLPClassifier?

Given a dataset of n samples and m features, and using sklearn.neural_network.MLPClassifier, how can I set hidden_layer_sizes to start with m inputs? For instance, I understand that hidden_layer_sizes=(10,10) means there are 2 hidden layers of 10 neurons (i.e., units) each, but I don't know if this also implies 10 inputs.
Thank you
This classifier/regressor, as implemented, does this automatically when you call fit.
This can be seen in its code here.
Excerpt:
n_samples, n_features = X.shape
# Ensure y is 2D
if y.ndim == 1:
    y = y.reshape((-1, 1))
self.n_outputs_ = y.shape[1]
layer_units = ([n_features] + hidden_layer_sizes +
               [self.n_outputs_])
As you can see, the hidden_layer_sizes you pass is surrounded by layer sizes derived from your data inside .fit(): n_features in front and n_outputs_ at the end. That is why the documented length of hidden_layer_sizes is n_layers - 2:
Parameters
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
The ith element represents the number of neurons in the ith hidden layer.
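To make the bookkeeping in the excerpt concrete, here is a minimal sketch of the same list concatenation outside scikit-learn. The helper name layer_units and the numbers (4 features, 3 outputs) are assumptions for illustration, not part of the library's API.

```python
def layer_units(n_features, hidden_layer_sizes, n_outputs):
    """Hypothetical helper mirroring the list concatenation in the excerpt."""
    return [n_features] + list(hidden_layer_sizes) + [n_outputs]


# Illustrative sizes: 4 input features, two hidden layers of 10 units, 3 outputs.
units = layer_units(n_features=4, hidden_layer_sizes=(10, 10), n_outputs=3)

# Each consecutive pair of layer sizes gives the shape of one weight matrix.
weight_shapes = list(zip(units[:-1], units[1:]))
```

With these numbers the network has four layers in total, two of which are hidden, which matches the "n_layers - 2" wording in the docstring.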
