Does resampling two matrices of same size with same random state result in rows of same indices? - python-3.x

I have data points in a CSR sparse matrix (scipy.sparse) and labels in a pandas Series.
I want to downsample the dataset.
I tried resampling the data points (matrix) and the labels (pandas Series) separately, using the same random state:
X4_train_undersampled = resample(X4_train,replace=False, n_samples=41615, random_state=123)
y_train_undersampled = resample(y_train, replace=False , n_samples=41615, random_state=123)
I want to know whether this is the right way to do it.
If yes, how can I test that the same rows are sampled from the data points and the labels?
If no, please suggest another way to downsample.
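
One way to avoid depending on two separate calls lining up is to pass both containers to a single `resample` call, which draws one set of indices and applies it to every array it is given. The sketch below uses small toy data (an assumption, not the asker's `X4_train` and `y_train`) built so that each label can be checked against its row:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.utils import resample

# Toy data (assumption): row i of X starts with 2*i, and y[i] is also 2*i,
# so a matching row/label pair can be verified after sampling.
X = csr_matrix(np.arange(20).reshape(10, 2))
y = pd.Series(np.arange(10) * 2)

# One call, one set of sampled indices, applied to both X and y.
X_sub, y_sub = resample(X, y, replace=False, n_samples=4, random_state=123)

# Check that every sampled label still belongs to its sampled row.
rows_match = np.array_equal(X_sub[:, 0].toarray().ravel(), y_sub.to_numpy())
print(rows_match)
```

The same kind of check (comparing a known column of the sampled matrix against the sampled labels, or comparing `y_sub.index` with the expected row positions) can be used to test the two-call version from the question.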

Related

Should I standardize the second dataset with the same scaling as in the first dataset?

I am very confused.
I have two datasets. One is considered the source domain (Dataset A) and the other the target domain (Dataset B).
First, I standardized each column of Dataset A using the mean and standard deviation of that column. Dataset A has 600 points. Then I split the dataset into training, validation, and test sets. I trained a CNN model and evaluated it on the test set. It gives fairly accurate predictions.
I calculated the mean and standard deviation of each column in Dataset A as follows:
thicknessMean = np.mean(thick_SD)
MaxForceMean = np.mean(maxF_SD)
MeanForceMean = np.mean(meanF_SD)
thicknessstd = np.std(thick_SD)
MaxForcestd = np.std(maxF_SD)
MeanForcestd = np.std(meanF_SD)
thick_SD_scaled = (thick_SD - thicknessMean)/thicknessstd
maxF_SD_scaled = (maxF_SD - MaxForceMean)/MaxForcestd
meanF_SD_scaled = (meanF_SD - MeanForceMean)/MeanForcestd
Now I want to make predictions with the model on Dataset B. I saved the trained model (as a .pth file) and then standardized Dataset B, this time using the mean and standard deviation of Dataset A. When I evaluate the trained model on Dataset B, the predictions are much worse.
thick_TD_scaled = (thick_TD - thicknessMean)/thicknessstd
maxF_TD_scaled = (maxF_TD - MaxForceMean)/MaxForcestd
meanF_TD_scaled = (meanF_TD - MeanForceMean)/MeanForcestd
As you can see, to scale Dataset B I used the mean (e.g. thicknessMean) and the standard deviation (e.g. thicknessstd) of Dataset A.
My questions are:
(1) What am I doing wrong? What should I do to make the predictions more accurate?
(2) When I check prediction accuracy on two different datasets, should I standardize the second dataset with the same scaling as the first?
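
The manual scaling described above can also be written with scikit-learn's `StandardScaler`, fitted once on Dataset A and reused for Dataset B. This is only a sketch: the arrays `A` and `B` are random stand-ins (an assumption), not the thickness and force columns from the question. The column means of the scaled `B` give a rough indication of how far the target domain sits from the source domain.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in data (assumption): 3 feature columns per dataset.
A = rng.normal(loc=5.0, scale=2.0, size=(600, 3))   # source domain
B = rng.normal(loc=6.0, scale=3.0, size=(200, 3))   # target domain

# Fit on the source data only, then reuse the same statistics for B.
scaler = StandardScaler().fit(A)
A_scaled = scaler.transform(A)
B_scaled = scaler.transform(B)

# Equivalent to the manual formula (x - mean_A) / std_A from the question.
manual_B = (B - A.mean(axis=0)) / A.std(axis=0)
print(np.allclose(B_scaled, manual_B))

# Means far from 0 (or stds far from 1) suggest a shift between the domains.
print(B_scaled.mean(axis=0), B_scaled.std(axis=0))
```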

Multi-step time series forecast using Holt-Winters algorithm in python

This concerns a time-series forecasting problem. The dataset has almost no seasonality and a trend that follows the input data. The data is stationary (the p-value is below 5%).
I am trying to convert the single-step forecast into a multi-step forecast by feeding the predictions back as inputs to the Holt-Winters algorithm, to get predictions for multiple days.
Below is a small snippet of the logic.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
data = pd.read_csv('test_data.csv')
# After time series decomposition and a stationarity check using the augmented Dickey-Fuller test
model = ExponentialSmoothing(data).fit()
number_of_days = 5
for i in range(number_of_days):
    # one-step prediction from the model fitted above (the model is not refitted inside the loop)
    yhat = model.predict(len(data), len(data))
    data = pd.DataFrame(data)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data = pd.concat([data, pd.DataFrame(yhat)], ignore_index=True)
    data_length = data.size
The forecast (output) for all the days is the same value.
Can anyone please help me understand how to tune the algorithm (and / or the logic above) for a better forecast?

In PyTorch, how can I shuffle a DataLoader?

I have a dataset with 10000 samples, where the classes are stored in order. First I loaded the data into an ImageFolder, then into a DataLoader, and I want to split this dataset into train, val, and test sets. I know the DataLoader class has a shuffle parameter, but that's not good for me, because it only shuffles the data when the loader is iterated over. I know about RandomSampler, but with it I can only take n samples randomly from the dataset, and I have no control over what is taken, so one sample might be present in the train, test, and val sets at the same time.
Is there a way to shuffle the data in a DataLoader? The only thing I need is the shuffle; after that I can subset the data.
The Subset dataset class takes indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably exploit that to get this functionality, as below. Essentially, you shuffle the indices and then pick subsets of the dataset.
import numpy
from torch.utils.data import Subset
# suppose dataset is the variable pointing to the whole dataset
N = len(dataset)
# generate and shuffle the indices
indices = numpy.arange(N)
indices = numpy.random.permutation(indices)
# there are many ways to do the two operations above (for example, numpy.random.choice can be used here too)
# select train/val/test; for the demo I am using a 70/15/15 split
train_indices = indices[:int(0.7*N)]
val_indices = indices[int(0.7*N):int(0.85*N)]
test_indices = indices[int(0.85*N):]
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)
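
As an alternative sketch, `torch.utils.data.random_split` does the shuffle-then-split in one call and returns non-overlapping `Subset` objects. The `TensorDataset` below is a stand-in (an assumption) for the ImageFolder dataset, and the seed is arbitrary:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset (assumption): 100 samples with classes stored in order.
features = torch.arange(100).float().unsqueeze(1)
labels = torch.arange(100) // 10
dataset = TensorDataset(features, labels)

# A seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(dataset, [70, 15, 15], generator=generator)

# Each returned Subset keeps its own shuffled indices; the three sets are disjoint.
all_indices = set(train_set.indices) | set(val_set.indices) | set(test_set.indices)
print(len(train_set), len(val_set), len(test_set), len(all_indices))
```

Each subset can then be wrapped in its own DataLoader, with `shuffle=True` for the training loader if per-epoch shuffling is wanted as well.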

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) on this data while using as many regressors as possible.
Because of the size of the data, I am using incremental training: following the sklearn API, .fit(X, y), I cannot fit the entire matrix X into memory, so I train the model on a few rows at a time. The problem is that the model expects the same number of columns in X in every batch.
This is where it gets tricky. Because some variables are categorical, one-hot encoding one batch may produce some shape (e.g. 20 columns), while the next batch produces another (e.g. 26 columns), simply because not every unique level of the categorical feature was present in the previous batch. Sklearn can account for this, and a custom function can also be used, to keep the same number of columns in matrix X:
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder
def one_hot_known(dataf, list_levels, col):
    """Creates a dummy-coded matrix with as many columns as unique levels"""
    return np.array(
        [np.eye(len(list_levels))[list_levels.index(i)] for i in dataf[col]])
# Load some dataset with a categorical variable
df_orig = sns.load_dataset('tips')
# List of unique levels, known a priori
day_level = list(df_orig['day'].unique())
# Imagine we have a batch of data (a subset of the original data) in which one categorical level (DAY) is not present
df = df_orig.loc[lambda d: d['day'] != 'Sun']
# The missing category is filled with 0; in the next batch, if present, its column will contain 1.
# .toarray() densifies the result (the sparse=False argument was renamed sparse_output in newer scikit-learn versions)
OneHotEncoder(categories=[day_level]).fit_transform(np.array(df['day']).reshape(-1, 1)).toarray()
# Custom function, can be used incrementally (on batches/chunks of data)
one_hot_known(df, day_level, 'day')
What I would like to do now is use the TargetEncoding approach, so that matrix X does not end up with a huge number of columns. However, it still needs to be done incrementally, just like the one-hot encoding above.
I am writing this as a post because I know this is useful to many people, and I would like to know how to apply the same strategy to TargetEncoding.
I am aware that deep learning allows for embedding layers, which represent categorical features in a continuous space, but I would like to apply TargetEncoding.
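
One possible way to make target encoding incremental is to keep a running sum and count of the target per category and update them with each batch. The class below is a hypothetical helper written for illustration (its name, the `smoothing` parameter, and the fallback to the global mean for unseen levels are all assumptions, not a scikit-learn API):

```python
import numpy as np
import pandas as pd

class IncrementalTargetEncoder:
    """Hypothetical helper: per-category target mean, updated batch by batch."""

    def __init__(self, smoothing=0.0):
        self.smoothing = smoothing              # pulls rare categories toward the global mean
        self.sums = pd.Series(dtype=float)      # running sum of the target per category
        self.counts = pd.Series(dtype=float)    # running row count per category

    def partial_fit(self, categories, target):
        batch = pd.DataFrame({"cat": np.asarray(categories),
                              "y": np.asarray(target, dtype=float)})
        stats = batch.groupby("cat")["y"].agg(["sum", "count"])
        # Align on category; levels missing on either side contribute 0.
        self.sums = self.sums.add(stats["sum"], fill_value=0.0)
        self.counts = self.counts.add(stats["count"], fill_value=0.0)
        return self

    def transform(self, categories):
        global_mean = self.sums.sum() / self.counts.sum()
        encoded = (self.sums + self.smoothing * global_mean) / (self.counts + self.smoothing)
        # Unseen categories fall back to the global mean.
        return pd.Series(np.asarray(categories)).map(encoded).fillna(global_mean).to_numpy()

# Usage on two toy batches (assumption): level "c" only appears in the second batch.
encoder = IncrementalTargetEncoder()
encoder.partial_fit(["a", "b", "a"], [1.0, 2.0, 3.0])
encoder.partial_fit(["b", "c"], [4.0, 10.0])
encoded_values = encoder.transform(["a", "b", "c", "z"])
print(encoded_values)
```

The output is always a single column per feature, so the shape of X stays fixed across batches. Note that encoding a training row with statistics that already include its own target can leak information, so in practice the statistics are often accumulated in a first pass or on earlier batches only.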

How to predict the cluster label of a new observation using hierarchical clustering?

I want to study a population of 47532 individuals with 16230 features. Thus I created a matrix with 16230 rows and 47532 columns:
>>> import numpy as np
>>> import scipy.cluster.hierarchy as hcluster
>>> from scipy.spatial import distance
>>> from sklearn.cluster import AgglomerativeClustering
>>> matrix.shape
(16230, 47532)
# remove all duplicate vectors in order to not waste computation time
>>> uniq_vectors, row_index = np.unique(matrix, return_index=True, axis=0)
>>> uniq_vectors.shape
(22957, 16230)
# compute distance between each observations
>>> distance_matrix = distance.pdist(uniq_vectors, metric='jaccard')
>>> distance_matrix_2d = distance.squareform(distance_matrix, force='tomatrix')
>>> distance_matrix_2d.shape
(22957, 22957)
# Perform linkage
>>> linkage = hcluster.linkage(distance_matrix, method='complete')
So now I can use scikit-learn to perform a clustering
>>> model = AgglomerativeClustering(n_clusters=40, affinity='precomputed', linkage='complete')
>>> cluster_label = model.fit_predict(distance_matrix_2d)
How can I predict the cluster of future observations using this model?
AgglomerativeClustering does not have a predict method, and it would take too long to recompute the distances for 16230 x (47532 + 1).
Is it possible to compute the distance between new observations and all the precomputed clusters?
pdist from scipy computes the n x n distances; in my case I would like to compute the distances from one observation o to the n samples (o x n).
Thanks for your insight.
The answer is simple: you cannot. Hierarchical clustering is not designed to predict cluster labels for new observations, because it only links data points according to their distances and does not define "regions" for each cluster.
I believe there are two solutions at this stage:
1. For new data points, find the nearest observation in your data set (using the same distance function as during training) and assign the same cluster label. This requires a bit more coding, and it is obviously a bit of a hack. Keep in mind that the results might not make a lot of sense, as you will be extrapolating cluster labels with a different methodology than the training procedure.
2. Use another clustering algorithm! It seems like you are using hierarchical clustering when your use case does not match the model. KMeans could be a good choice, as it can explicitly assign new data points to the closest cluster.
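
The first option can be sketched with `scipy.spatial.distance.cdist`, which computes only the o x n distances between the new observations and the training vectors instead of the full n x n matrix. The function name `assign_nearest_label` and the toy boolean data are assumptions for illustration; in the question's setting the training vectors and labels would be `uniq_vectors` and `cluster_label`.

```python
import numpy as np
from scipy.spatial import distance

def assign_nearest_label(new_obs, train_vectors, train_labels, metric="jaccard"):
    """Hypothetical helper: give each new row the label of its nearest training row."""
    # Shape (o, n): distances from each new observation to every training vector.
    dists = distance.cdist(np.atleast_2d(new_obs), train_vectors, metric=metric)
    return np.asarray(train_labels)[dists.argmin(axis=1)]

# Toy boolean data (assumption): two training vectors in two different clusters.
train_vectors = np.array([[1, 1, 0, 0],
                          [0, 0, 1, 1]], dtype=bool)
train_labels = np.array([0, 1])
new_obs = np.array([[1, 1, 1, 0],
                    [0, 0, 0, 1]], dtype=bool)

predicted = assign_nearest_label(new_obs, train_vectors, train_labels)
print(predicted)
```

This is the nearest-neighbour hack described above, so the same caveat applies: the assigned labels follow a different rule than the complete-linkage procedure that produced the clusters.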
