Evaluate an idea of generating noise based on standard deviation - statistics

I generate a synthetic dataset using this method:
import numpy as np
import random

def generate_dataset(size, dim):
    # Draw `size` random integers from [0, 2^dim]
    dataset = [random.randint(0, 2 ** dim) for _ in range(size)]
    # Remove duplicates
    dataset = list(set(dataset))
    return dataset
As you can see, the data points are generated randomly from [0, 2^dim]. For any dataset generated by this method, I want to add noise to it. I am thinking of a simple way to do so, but I am not sure it is logically correct, so here it is:
1. Find the standard deviation of the data points in the generated dataset.
2. Generate new data points that are NOT within this standard deviation (see the sketch below).
3. Add them to the original dataset, and shuffle.
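Concretely, here is a minimal sketch of what I have in mind (the mean ± one-standard-deviation cutoff is my reading of step 2):
import random
import numpy as np

def add_outlier_noise(dataset, num_noise, dim):
    data = np.array(dataset)
    mean, std = data.mean(), data.std()
    noise = []
    while len(noise) < num_noise:
        candidate = random.randint(0, 2 ** dim)
        # Keep only candidates OUTSIDE mean +/- one standard deviation
        if abs(candidate - mean) > std:
            noise.append(candidate)
    noisy = dataset + noise
    random.shuffle(noisy)
    return noisy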
Is this way of generating noise sound?
Thank you.

It seems like you are creating outliers. Noise, to me, is more like adding a small number (positive or negative) to the data points. For example: how many steps did you walk today? It could be 100, but a tracking device might read 95 or 110. That difference is noise.
Not sure if this helps.
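To illustrate, a minimal sketch of that kind of additive noise, assuming the generate_dataset function from the question (the 1% scale is an arbitrary illustrative choice):
import numpy as np

data = np.array(generate_dataset(1000, 16), dtype=float)
# Perturb each point by a small random amount instead of
# injecting new far-away points; the scale is arbitrary here.
noise = np.random.normal(loc=0.0, scale=0.01 * data.std(), size=data.shape)
noisy_data = data + noise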


PCA with sklearn discrepancies

I am trying to apply a PCA in a very specific context and ran into a behavior that I cannot explain.
As a test I am running the following code with the file data that you can retrieve here: https://www.dropbox.com/s/vdnvxhmvbnssr34/test.npy?dl=0 (numpy array format).
from sklearn.decomposition import PCA
import numpy as np
test = np.load('test.npy')
pca = PCA()
X_proj = pca.fit_transform(test) ### Project in the basis of eigenvectors
proj = pca.inverse_transform(X_proj) ### Reconstruct vector
My issue is the following: because I do not specify any number of components, I should be reconstructing with all the computed components. I therefore expect my output proj to be the same as my input test. But a quick plot proves this not to be the case:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(test[0] - proj[0])
plt.show()
The plot here will show some large discrepancies between projection and the input matrix.
Does anyone have an idea or explanation to help me understand why proj is different from test in my case?
I checked your test data and found the following:
mean = test.mean() # 1.9545972004854737e+24
std = test.std() # 9.610595443778275e+26
I interpret the standard deviation to represent, in some sense, the least count or the uncertainty in the values that are reported. By that I mean that if a numerical algorithm reports the answer to be a, then the real answer should lie in the interval [a - std, a + std]. This is because numerical algorithms are imprecise by their very nature: they depend on floating-point operations, which obviously cannot represent real numbers in all their glory.
So if I plot:
plt.plot((test[0]-proj[0])/std)
plt.show()
I get a plot that looks much more reasonable.
You may be interested in plotting relative errors as well. Alternatively, you can normalize your data to have zero mean and unit variance; the PCA results should then be more accurate.
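A minimal sketch of that normalization route, assuming the same test.npy file:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

test = np.load('test.npy')

# Standardize to zero mean and unit variance before the PCA
scaler = StandardScaler()
test_scaled = scaler.fit_transform(test)

pca = PCA()
X_proj = pca.fit_transform(test_scaled)
proj_scaled = pca.inverse_transform(X_proj)

# Undo the scaling to compare against the original data
proj = scaler.inverse_transform(proj_scaled)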

How to handle shared data between samples and batches in Keras

I'm using Keras for time-series prediction and I want to create a model based on the self-attention mechanism that will not use any RNNs. For each sample we look at the last x timesteps of samples to predict the next sample.
In other words I want to feed the network (num_batches, num_samples, timesteps, features) and get (num_batches, predictions).
There is one problem with this: there is a lot of unnecessary duplication of data, since sample n has essentially the same timesteps and features as sample n+1, only shifted one step to the left.
How would you handle this, assuming your dataset is very large?
I am not very familiar with this, but if your issue is "I have too much replicated data", I think you can solve it by devising a generator for your data and passing the generator to the Keras/TensorFlow fit function (the TensorFlow API documentation states that fit supports generators as input). A minimal sketch is below.
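Something along these lines (series, window, and batch_size are illustrative names; I assume a 2-D array of shape (timesteps, features)):
import numpy as np

def window_generator(series, window, batch_size):
    # Build each batch of sliding windows on the fly, so the
    # overlapping copies are never materialized all at once.
    while True:
        starts = np.random.randint(0, len(series) - window, size=batch_size)
        X = np.stack([series[s:s + window] for s in starts])
        y = np.stack([series[s + window] for s in starts])
        yield X, y

# model.fit(window_generator(series, window=32, batch_size=64),
#           steps_per_epoch=100, epochs=10)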
If your question is related to the logic behind the model, I do not see the issue. It is as if you have a sliding window: for each window you predict one value, and then you move the window by a certain amount (in your case, one). Could you say a little more about your concern?

Classification of unknown dataset into known categories

I have a number of datasets where I have an array of x, y, z coordinates of the endpoints of segments. The first and second points represent a segment, as do the third and fourth, and so on.
The data above is just a part of a dataset; the entire dataset is a lot bigger.
I am required to train my model with several datasets like this, so that it can predict the category of any unknown dataset. The test datasets will have the same form as the above.
I need help with the approach. Which algorithm or approach can I use here to classify any unknown dataset into these known categories?
It's an unsupervised learning problem. If you know roughly how many classes your data should be split into, use K-Means (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
Otherwise, a combination of t-SNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) and K-Means usually works well: transform the data using t-SNE and run K-Means on the transformed data.
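A minimal sketch of that combination (random data stands in for your segment coordinates, and n_clusters=3 is an assumed number of categories):
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import numpy as np

# One row per segment: (x1, y1, z1, x2, y2, z2); random here for illustration
X = np.random.rand(500, 6)

# Embed with t-SNE, then cluster the embedding
X_embedded = TSNE(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3).fit_predict(X_embedded)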

How to correctly scale new data points sklearn

Imagine a simple regression problem where you are using gradient descent. For a correct implementation you need to scale values using the mean of the entire training dataset. Now imagine your model is already trained and you feed it another example you wish to predict: how do you scale it correctly with respect to the previous dataset? Do you include the new example in the training set and then scale with the mean of this training dataset plus the new data points? What is the right way to do this?
By new data points I mean something the model hasn't seen before, neither in training nor in testing. How do you handle scaling for anything you pass to regr.predict() if the scaling of the training set is done with respect to the whole set and not a single observation?
Imagine you have an ndarray of features:
to_predict = [10, 12, 1, 330, 1311, 225]
The dataset used for training and testing already oscillates around 0 for every feature. Taking into account the below answer (this is pseudocode, which is why I am asking how to do it right):
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
# Append the new example to the training set, then rescale everything?
new_Xs = np.vstack([X_train, to_predict])
X_train_std_with_new = scaler.fit_transform(new_Xs)
scaled_to_predict = X_train_std_with_new[-1]
regr.predict([scaled_to_predict])  # ??
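For what it's worth, the usual sklearn pattern avoids refitting altogether: fit the scaler on the training set once, then reuse the same fitted scaler for every new observation. A minimal sketch, assuming the X_train, y_train, and regr objects from the question:
from sklearn.preprocessing import StandardScaler
import numpy as np

# Fit the scaler on the training data ONLY, once
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
regr.fit(X_train_std, y_train)

# For new observations, transform (do NOT refit) with the same scaler
to_predict = np.array([[10, 12, 1, 330, 1311, 225]])  # note the 2-D shape
scaled_to_predict = scaler.transform(to_predict)
regr.predict(scaled_to_predict)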

How to take the average of n random forest iterations?

Is there a parameter in sklearn that can be tweaked to run a random forest (or other estimator) multiple times to smooth out variation between runs? What's the simplest way to do this?
You can't simply smooth out the variations between runs manually. What you can do is perform hyperparameter tuning using GridSearchCV, or look at other similar methods at this link. You can also look at cross-validating your dataset for better estimator performance; have a look at the cross-validation methods in sklearn. A minimal sketch of the GridSearchCV route is below.
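Something like this (the parameter grid and cv=5 are illustrative choices; X and y stand for your training data):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300, 500],
              'max_depth': [None, 10, 20]}

# Cross-validated search over the grid; the best estimator is refit on all of X, y
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)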
Also, please provide more information about your problem, like the type of problem you are solving, the dataset, etc., so that we can help you better.
VotingClassifier with soft voting may be what you are looking for. In general, given two or more sets of predictions, you can take their geometric mean to smooth them out:
from scipy.stats.mstats import gmean
import pandas as pd

df = pd.DataFrame()
# Predictions renamed to 1.csv, 2.csv, ... for convenience
for i in range(1, 4):
    data = pd.read_csv('{}.csv'.format(i), index_col='id')
    data = data.rename(columns={'proba': i})
    df = pd.concat([df, data], axis=1)
# Geometric mean across the three prediction columns
df['proba'] = gmean(df.iloc[:, 0:3], axis=1)
output = pd.DataFrame(data={'id': df.index, 'proba': df.proba})
output.to_csv('submissions.csv', index=False)
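For completeness, a minimal sketch of the VotingClassifier route mentioned above (the member estimators are arbitrary examples; X, y stand for your training data):
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages the members' predicted class probabilities
clf = VotingClassifier(
    estimators=[('rf1', RandomForestClassifier(random_state=0)),
                ('rf2', RandomForestClassifier(random_state=1)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft')
clf.fit(X, y)
probs = clf.predict_proba(X)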
