sklearn PCA number of components_

Using sklearn's PCA:
m = np.random.randn(10, 5)
mod = PCA()
mod.fit_transform(m)
mod.components_ will have 5 components, which makes sense to me since there are 5 features in the data.
However, if m = np.random.randn(10, 20),
mod.components_ will contain only 10 components.
Assuming the rows in mod.components_ correspond to the number of features, shouldn't there be 20 components in the second example? Shouldn't there be as many components as features in the data?

From the scikit-learn PCA documentation:
n_components : int, None or string
Number of components to keep. if n_components is not set all components are kept:
n_components == min(n_samples, n_features)
So in the first case min(10, 5) = 5 and the output shape is (5, 5); in the second case min(10, 20) = 10 and the output shape is (10, 20):
from sklearn.decomposition import PCA
import numpy as np
m = np.random.randn(10, 5)
mod = PCA()
mod.fit_transform(m)
print(mod.components_.shape) # (5, 5)
m = np.random.randn(10, 20)
mod = PCA()
mod.fit_transform(m)
print(mod.components_.shape) # (10, 20)
Features vs. components:
Suppose you have a dataset with 3 columns (Age, Sex, Risk_Factor) and 500 rows. The number of features is 3, not 500; the 500 rows are instances/observations. It is not the case that every row is a unique feature; rather, Age, Sex, and Risk_Factor are each a unique feature.
Hope everything is clear.
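To make the layout of components_ concrete, here is a minimal sketch (the Age/Sex/Risk_Factor data is randomly generated, purely for illustration): each row of components_ is one principal component, and each column corresponds to one of the original features.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# made-up data: 500 observations of 3 features
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(500, 3), columns=["Age", "Sex", "Risk_Factor"])

pca = PCA().fit(df)

# rows = principal components, columns = original features
loadings = pd.DataFrame(pca.components_, columns=df.columns)
print(loadings.shape)  # (3, 3): min(500, 3) = 3 components, 3 features
print(loadings)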

Related

How to use random_split with percentage split (sum of input lengths does not equal the length of the input dataset)

I tried to use torch.utils.data.random_split as follows:
import torch
from torch.utils.data import DataLoader, random_split
list_dataset = [1,2,3,4,5,6,7,8,9,10]
dataset = DataLoader(list_dataset, batch_size=1, shuffle=False)
random_split(dataset, [0.8, 0.1, 0.1], generator=torch.Generator().manual_seed(123))
However, when I tried this, I got the error: ValueError: Sum of input lengths does not equal the length of the input dataset!
I looked at the docs and it seems like I should be able to pass in decimals that sum to 1, but clearly it's not working.
I also Googled this error and the closest thing that comes up is this issue.
What am I doing wrong?
You're likely using an older version of PyTorch, such as PyTorch 1.10, which does not have this functionality.
To replicate this functionality in the older version, you can copy the source code of the newer version:
import math
import warnings
from typing import List

from torch import default_generator, randperm
from torch._utils import _accumulate
from torch.utils.data.dataset import Subset

def random_split(dataset, lengths, generator=default_generator):
    r"""
    Randomly split a dataset into non-overlapping new datasets of given lengths.

    If a list of fractions that sum up to 1 is given,
    the lengths will be computed automatically as
    floor(frac * len(dataset)) for each fraction provided.

    After computing the lengths, if there are any remainders, 1 count will be
    distributed in round-robin fashion to the lengths
    until there are no remainders left.

    Optionally fix the generator for reproducible results, e.g.:

    >>> random_split(range(10), [3, 7], generator=torch.Generator().manual_seed(42))
    >>> random_split(range(30), [0.3, 0.3, 0.4], generator=torch.Generator(
    ...   ).manual_seed(42))

    Args:
        dataset (Dataset): Dataset to be split
        lengths (sequence): lengths or fractions of splits to be produced
        generator (Generator): Generator used for the random permutation.
    """
    if math.isclose(sum(lengths), 1) and sum(lengths) <= 1:
        subset_lengths: List[int] = []
        for i, frac in enumerate(lengths):
            if frac < 0 or frac > 1:
                raise ValueError(f"Fraction at index {i} is not between 0 and 1")
            n_items_in_split = int(
                math.floor(len(dataset) * frac)  # type: ignore[arg-type]
            )
            subset_lengths.append(n_items_in_split)
        remainder = len(dataset) - sum(subset_lengths)  # type: ignore[arg-type]
        # add 1 to all the lengths in round-robin fashion until the remainder is 0
        for i in range(remainder):
            idx_to_add_at = i % len(subset_lengths)
            subset_lengths[idx_to_add_at] += 1
        lengths = subset_lengths
        for i, length in enumerate(lengths):
            if length == 0:
                warnings.warn(f"Length of split at index {i} is 0. "
                              f"This might result in an empty dataset.")

    # Cannot verify that dataset is Sized
    if sum(lengths) != len(dataset):  # type: ignore[arg-type]
        raise ValueError("Sum of input lengths does not equal the length of the input dataset!")

    indices = randperm(sum(lengths), generator=generator).tolist()  # type: ignore[call-overload]
    return [Subset(dataset, indices[offset - length : offset])
            for offset, length in zip(_accumulate(lengths), lengths)]
If you know the length of your dataset (i.e. it implements the len method), you can also compute the lengths yourself:
proportions = [.75, .10, .15]
lengths = [int(p * len(dataset)) for p in proportions]
lengths[-1] = len(dataset) - sum(lengths[:-1])
tr_dataset, vl_dataset, ts_dataset = random_split(dataset, lengths)
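As a quick usage sketch on the 10-element list from the question (assuming either the backported function above or a PyTorch version recent enough to accept fractions; note that the list dataset itself is split, not a DataLoader):
import torch

list_dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
splits = random_split(list_dataset, [0.8, 0.1, 0.1],
                      generator=torch.Generator().manual_seed(123))
print([len(s) for s in splits])  # [8, 1, 1]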

How do I get the important features and eliminate the features that are not selected after performing PCA?

Here I have tried to perform PCA on my dataset, but I don't have any idea how to get the important features and eliminate the features that are not selected.
I have added a condition that PCA is performed only if the data contains more than 10 features; otherwise it is skipped.
from sklearn.decomposition import PCA
from sklearn import preprocessing

columns = x.columns

def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x

x = Perform_PCA(x)
x
x
I will first review your function:
from sklearn.decomposition import PCA
from sklearn import preprocessing

columns = x.columns

def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return selected_var
    else:
        print("Less than 10 columns found no PCA performed")
        return x
You are performing PCA only if there are at least ten columns, but your function returns selected_var, which does not exist.
Also, PCA does not "select features"; it transforms the input data by computing a lower-dimensional representation. If you want fewer columns, use the transformed data returned by pca.fit_transform(x) (or pca.transform(x) on new data).
Here is your code modified (it would be possible to optimise it further, but I tried to change it as little as possible):
from sklearn.decomposition import PCA
from sklearn import preprocessing

columns = x.columns

def Perform_PCA(x):
    no_of_col = len(x.columns)
    percent = 90
    my_num = int((percent/100)*no_of_col)
    if no_of_col >= 10:
        pca = PCA(n_components = my_num)
        x_new = pca.fit_transform(x)
        print("More than 10 columns found Performing PCA")
        return x_new
    else:
        print("Less than 10 columns found no PCA performed")
        return x
Hope this will help you.
In your current code you create my_num components, but only if you have more than 10 columns.
If you want to have a look and select the features yourself, you could modify your code:
import pandas as pd

pca = PCA()
x_new = pca.fit_transform(x)
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
print(pd.DataFrame(pca.components_, columns=x.columns))
This will give you the explained variance ratio of every principal component. From there you can decide how many components should be kept.
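If the goal is to keep just enough components to explain a target share of the variance, here is a minimal sketch (assuming x is a numeric DataFrame; the 90% threshold is only an example):
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(x)

# cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components reaching the 90% threshold
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_keep, cumulative[:n_keep])

# refit with that many components and transform
# (passing a float, e.g. PCA(n_components=0.90), achieves the same selection)
x_reduced = PCA(n_components=n_keep).fit_transform(x)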

Converting labels to one-hot encoding

So I was learning one-hot encoding using the iris dataset:
import numpy as np
from sklearn.datasets import load_iris
from tensorflow import keras  # or: import keras

iris = load_iris()
X = iris['data']                       # the complete data - 2D
Y = iris['target']                     # 1-D, only the 150 rows
names = iris['target_names']           # ['setosa', 'versicolor', 'virginica']
feature_names = iris['feature_names']  # [sl, sw, pl, pw]
isamples = np.random.randint(len(Y), size=5)
Ny = len(np.unique(Y))
Y = keras.utils.to_categorical(Y[:], num_classes=Ny)
print('X:', X[isamples, :])
print('Y:', Y[isamples])
I am confused in this part:
Y = keras.utils.to_categorical(Y[:], num_classes = Ny)
What does Y[:] mean, and what is the use of : in print(X[isamples, :])?
The iris dataset consists of 150 samples, 50 from each of three species of Iris flower (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured for each sample: the length and the width of the sepals and petals, in centimeters.
In your code, X is the set of features to train your model on, which you get from iris.data, and Y is the target label for each row of X, which you get from iris.target. The labels are represented by numerical values (0 for the setosa class, 1 for the versicolor class, and 2 for the virginica class); you can get the name of each class from iris.target_names.
The colon you see between the brackets is the slice operator in Python, which lets you take a subset of the elements of a list. For example, if you have a list l = [1, 2, 3, 4] and you want just the second and third elements, you can use l[1:3]. Using the colon without any numbers, as in l[:], gives you a copy of the whole list, so Y[:] means "give me a copy of the Y array".
For print(X[isamples, :]), isamples is a list of 5 randomly generated indices between 0 and 149 (one for each of the 150 rows), so X[isamples, :] takes those 5 random samples from X and prints all four features for each of them.
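A small standalone sketch of what to_categorical and the slicing do (using made-up arrays rather than the iris data):
import numpy as np
from tensorflow import keras  # or: import keras

Y = np.array([0, 1, 2, 1])               # integer class labels
Y_onehot = keras.utils.to_categorical(Y, num_classes=3)
print(Y_onehot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]

X = np.arange(12).reshape(4, 3)          # 4 samples, 3 features
rows = [0, 2]
print(X[rows, :])                        # the selected rows, all columns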

Multivariate binary sequence prediction with CRF

This question is an extension of this one, which focuses on LSTMs as opposed to CRFs. Unfortunately, I do not have any experience with CRFs, which is why I'm asking these questions.
Problem:
I would like to predict a sequence of binary signals for multiple, non-independent groups. My dataset is moderately small (~1000 records per group), so I would like to try a CRF model here.
Available data:
I have a dataset with the following variables:
Timestamps
Group
Binary signal representing activity
Using this dataset I would like to forecast group_a_activity and group_b_activity which are both 0 or 1.
Note that the groups are believed to be cross-correlated and additional features can be extracted from timestamps -- for simplicity we can assume that there is only 1 feature we extract from the timestamps.
What I have so far:
Here is the data setup that you can reproduce on your own machine.
# libraries
import re
import numpy as np
import pandas as pd
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame  # create a sample dataframe
      .from_records(np.random.randint(2, size=[data_length, 3]))
      .rename(columns={0: 'a', 1: 'b', 2: 'extra'}))
df.head() # check it out
# shift (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last'  # 'next' variables are what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
#    a_last_03  a_last_02  ...  extra_last_02  extra_last_01
# 3          0          1  ...              0              1
# 4          1          0  ...              0              0
# 5          0          1  ...              1              0
# 6          0          0  ...              0              1
# 7          0          0  ...              1              0
#
# [5 rows x 15 columns]
Before we get to the CRF part, I suspect that I cannot approach this problem from a multi-task learning point of view (predicting patterns for both A and B via one model) and therefore I'm going to have to predict each of them individually.
Now the CRF part. I've found some relevant examples (here is one), but they all tend to predict a single class value based on a prior sequence.
Here is my attempt at using a CRF here:
import pycrfsuite
crf_features = [] # a container for features
crf_labels = [] # a container for response
# lets focus on group A only for this one
current_response = [c for c in df.columns if c.startswith('a_next')]
# predictors are going to have to be nested otherwise I'll run into problems with dimensions
current_predictors = [c for c in df.columns if not 'next' in c]
current_predictors = set([re.sub(r'_\d+$', '', v) for v in current_predictors])
for index, row in df.iterrows():
    # not sure if its an effective way to iterate over a DF...
    iter_features = []
    for p in current_predictors:
        pred_feature = []
        # note that 0/1 values have to be converted into booleans
        for k in range(shift_length):
            iter_pred_feature = p + '_{0:02d}'.format(k+1)
            pred_feature.append(p + "=" + str(bool(row[iter_pred_feature])))
        iter_features.append(pred_feature)
    iter_response = [row[current_response].apply(lambda z: str(bool(z))).tolist()]
    crf_labels.extend(iter_response)
    crf_features.append(iter_features)
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(crf_features, crf_labels):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 0.0,               # coefficient for L1 penalty
    'c2': 0.0,               # coefficient for L2 penalty
    'max_iterations': 10,    # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
trainer.train('testcrf.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('testcrf.crfsuite')
tagger.tag(xseq)
# ['False', 'True', 'False']
It seems that I did manage to get it working, but I'm not sure if I've approached it correctly. I'll formulate my questions in the Questions section, but first, here is an alternative approach using keras_contrib package:
from keras import Sequential
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
# we are gonna have to revisit data prep stage again
# separate predictors and response
response_df_dict = {}
for g in ['a', 'b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub(r'_\d+$', '', c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
model = Sequential()
model.add(CRF(2, input_shape=(predictor_array.shape[1],predictor_array.shape[2])))
model.summary()
model.compile(loss=crf_loss, optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict(predictor_array) # not gonna worry about train/test split here
Questions:
My main question is whether or not I've constructed both of my CRF models correctly. What worries me is that (1) there is not a lot of documentation out there on CRF models, (2) CRFs are mainly used for predicting a single label given a sequence, (3) the input features are nested, and (4) I'm not sure whether using them in a multi-task fashion is valid.
I have a few extra questions as well:
Is a CRF appropriate for this problem?
How are the two approaches (one based on pycrfsuite and one based on keras_contrib) different, and what are their advantages/disadvantages?
In a more general sense, what is the advantage of combining CRF and LSTM models into one (like the one discussed here)?
Many thanks!

How to set up the number of inputs neurons in sklearn MLPClassifier?

Given a dataset of n samples and m features, and using sklearn.neural_network.MLPClassifier, how can I set hidden_layer_sizes to start with m inputs? For instance, I understand that if hidden_layer_sizes = (10, 10) it means there are 2 hidden layers, each of 10 neurons (i.e., units), but I don't know if this also implies 10 inputs as well.
Thank you
This classifier/regressor, as implemented, does this automatically when fit is called.
This can be seen in its code here.
Excerpt:
n_samples, n_features = X.shape

# Ensure y is 2D
if y.ndim == 1:
    y = y.reshape((-1, 1))

self.n_outputs_ = y.shape[1]

layer_units = ([n_features] + hidden_layer_sizes +
               [self.n_outputs_])
As you can see, the hidden_layer_sizes you provide is wrapped between layer dimensions derived from your data inside .fit(): the input layer gets n_features units and the output layer gets n_outputs_ units. This is also why the parameter's docstring describes the tuple length with a subtraction of 2:
Parameters
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
The ith element represents the number of neurons in the ith hidden layer.
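A minimal sketch to confirm this (made-up toy data; coefs_ is scikit-learn's documented list of fitted weight matrices): the first weight matrix has one row per input feature, inferred from X.
import numpy as np
from sklearn.neural_network import MLPClassifier

n, m = 200, 7                       # n samples, m features (toy data)
rng = np.random.RandomState(0)
X = rng.randn(n, m)
y = rng.randint(0, 2, size=n)

clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=50).fit(X, y)

# input -> first hidden layer: first dimension equals the number of features
print(clf.coefs_[0].shape)  # (7, 10)
print(clf.coefs_[1].shape)  # (10, 10)
print(clf.coefs_[2].shape)  # (10, 1): a single output unit for binary classification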
