I want to use numpy to create data that follows the same distribution and trend as some sample data.
For example, say I have an array x whose trend is increasing and whose distribution is, let's suppose, log-normal. Can I create another random array that follows the same distribution and trend using numpy?
Well, numpy can't fit distributions to your data. You can either do it manually using the method you prefer (maximum likelihood estimation or the method of moments), or you can use scipy, which can fit a distribution to your data as shown below:
import scipy.stats as st
# Inferred parameters of the distribution
s, loc, scale = st.lognorm.fit(x)
# Distribution object
dist = st.lognorm(s, loc, scale)
# generate 1000 random samples
samples = dist.rvs(size=1000)
Scipy uses MLE by default.
You will have to explore your data and look into the distributions that fit the best. Numpy or scipy can't do that for you.
Documentation of fit method: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html
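As a rough sketch of what "exploring which distribution fits best" could look like, you can fit a few candidates and compare their Kolmogorov-Smirnov statistics. The sample x, the list of candidates, and the choice of the KS statistic as the criterion are all assumptions made for illustration; you still have to pick sensible candidates yourself, and this only covers the distribution, not the trend.

```python
import numpy as np
import scipy.stats as st

# Hypothetical sample standing in for your own data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# Candidate distributions chosen by hand (an assumption, not an exhaustive list)
candidates = {"lognorm": st.lognorm, "gamma": st.gamma, "norm": st.norm}

results = {}
for name, dist in candidates.items():
    params = dist.fit(x)  # MLE fit of shape, loc and scale
    ks_stat, _ = st.kstest(x, name, args=params)  # smaller means a closer fit
    results[name] = ks_stat

best = min(results, key=results.get)
print(results, best)
```

Keep in mind that the KS p-value is optimistic when the parameters were estimated from the same data, so treat the statistic as a ranking aid rather than a formal test.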
I don't understand why I have three different behaviors depending on the classifier I use, even though they should go hand in hand.
Here is the code that illustrates the question:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np
#load data
wine = datasets.load_wine()
X = wine.data
y = wine.target
# some helper functions
def repeat_feature(X, which=1, times=1):
    return np.hstack([X, np.hstack([X[:, :which]] * times)])

def do_the_job(X, y, clf):
    return np.mean(cross_validate(clf, X, y, cv=5)['test_score'])
# define the classifiers
clf1=DecisionTreeClassifier(max_depth=25,random_state=42)
clf2=RandomForestClassifier(n_estimators=5,random_state=42)
clf3=LGBMClassifier(n_estimators=5,random_state=42)
# repeat up to 50 times the same feature and test the classifiers
clf1_result=[]
clf2_result=[]
clf3_result=[]
for i in range(1, 50):
    my_x = repeat_feature(X, times=i)
    clf1_result.append(do_the_job(my_x, y, clf1))
    clf2_result.append(do_the_job(my_x, y, clf2))
    clf3_result.append(do_the_job(my_x, y, clf3))
# plot the mean of the cv-scores for each classifier
plt.figure(figsize=(12,7))
plt.plot(clf1_result,label='tree')
plt.plot(clf2_result,label='forest')
plt.plot(clf3_result,label='boost')
plt.legend()
The result of the previous script is the following graph:
What I want to verify is that by adding the same information (like a repeated feature) I would get a decrease in the score (which happens as expected for random forest).
The question is why does this not happen with the other two classifiers instead?
Why do their scores remain stable?
Am I missing something from the theoretical point of view?
Ty all
When fitting a single decision tree (sklearn.tree.DecisionTreeClassifier) or a LightGBM model using its default behavior (lightgbm.LGBMClassifier), the training algorithm considers all features as candidates for every split, and always chooses the split with the best "gain" (reduction in the training loss).
Because of this, adding multiple identical copies of the same feature will not change the fit to the training data.
For random forest, on the other hand, the training algorithm randomly selects a subset of features to consider at each split. The random forest learns how to explain the training data by ensembling together multiple slightly-different models, and this can be effective because the different models explain different characteristics of the target. If you hold the number of trees + the number of leaves per tree constant, then adding copies of a feature reduces the diversity of the trees in the forest, which reduces the forest's fit to the training data.
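One way to check this explanation is to switch off the feature subsampling in the random forest with max_features=None and see whether duplicated columns still matter. This is only a sketch: the number of duplicates (30) and the helper name mean_cv are arbitrary choices made for illustration, and the exact scores depend on your sklearn version and the random seed.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
# Append 30 copies of the first feature (arbitrary number, for illustration)
X_dup = np.hstack([X] + [X[:, :1]] * 30)

def mean_cv(clf, X, y):
    # Hypothetical helper: mean 5-fold cross-validation accuracy
    return cross_val_score(clf, X, y, cv=5).mean()

scores = {}
for label, max_features in [("sqrt", "sqrt"), ("all", None)]:
    clf = RandomForestClassifier(
        n_estimators=5, max_features=max_features, random_state=42
    )
    # (score on the original data, score with the duplicated columns)
    scores[label] = (mean_cv(clf, X, y), mean_cv(clf, X_dup, y))

print(scores)
```

With max_features=None every split sees every column, as in the single tree, so only bootstrap sampling of the rows is left as a source of randomness.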
I am trying to apply a PCA in a very specific context and ran into a behavior that I can not explain.
As a test I am running the following code with the file data that you can retrieve here: https://www.dropbox.com/s/vdnvxhmvbnssr34/test.npy?dl=0 (numpy array format).
from sklearn.decomposition import PCA
import numpy as np
test = np.load('test.npy')
pca = PCA()
X_proj = pca.fit_transform(test) ### Project in the basis of eigenvectors
proj = pca.inverse_transform(X_proj) ### Reconstruct vector
My issue is the following: because I do not specify the number of components, I should be reconstructing with all the computed components. I therefore expect my output proj to be the same as my input test. But a quick plot shows that this is not the case:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(test[0]-proj[0])
plt.show()
The plot here will show some large discrepancies between projection and the input matrix.
Does anyone have an idea or explanation to help me understand why proj is different from test in my case?
I checked your test data and found the following:
mean = test.mean() # 1.9545972004854737e+24
std = test.std() # 9.610595443778275e+26
I interpret the standard deviation as representing, in some sense, the least count or the uncertainty in the values that are reported. By that I mean that if a numerical algorithm reports the answer to be a, then the real answer should be in the interval [a - std, a + std]. This is because numerical algorithms are imprecise by their very nature. They depend on floating point operations, which obviously can't represent real numbers in all their glory.
So if I plot:
plt.plot((test[0]-proj[0])/std)
plt.show()
I get the following plot which seems more reasonable.
You may be interested in plotting relative errors as well. Alternately, you can normalize your data to have 0 mean and unit variance and then the PCA results should be more accurate.
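Here is a small sketch of the relative-error idea on made-up data with a similarly huge scale. The random array below is an assumption standing in for test.npy, which I am not reproducing here, so the numbers will differ from yours.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data: values on the order of 1e26
rng = np.random.default_rng(0)
test = rng.normal(size=(50, 20)) * 1e26

pca = PCA()
proj = pca.inverse_transform(pca.fit_transform(test))

# The absolute error looks large only because the values themselves are huge
abs_err = np.abs(test - proj).max()
# Relative to the magnitude of the data, the error is near machine precision
rel_err = abs_err / np.abs(test).max()

print(abs_err, rel_err)
```

If the relative error on your own data is also tiny, the reconstruction is as good as float64 arithmetic allows.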
I want to calculate in python the correlation of all my features (all of float type) and the class label (Binary, 0 or 1). In addition, I would like to plot the data to visualize their distribution by class.
This is needed so I can find features coupled to a single label and find out their real importance. Note that I don't want the pairwise feature correlation and that my classifier is binary.
I have tried the following (from a similar post in stackoverflow) but it is not exactly what I am looking for.
df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
Please see in the attached picture what the distribution looks like for one of the features (from Weka).
Class distribution for one of the features
Any feedback is really appreciated.
Correlation is not supposed to be used for categorical variables. For more explanation see here
You can understand the relationship between your independent variables and target variables with the following approach.
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(return_X_y=False)
import pandas as pd
df=pd.DataFrame(data.data[:,:5])
df.columns = data.feature_names[:5]
df['target'] = data.target.astype(str)
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.pairplot(df, hue='target', diag_kind='hist',
                 vars=df.columns[:-1],
                 plot_kws=dict(alpha=0.5),
                 diag_kws=dict(alpha=0.5))
plt.show()
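If you also want numbers to go with the plot, a simple option (my own addition, not something the plot needs) is to summarize each feature per class with a pandas groupby. This sketch uses the same first five breast cancer features as above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
df["target"] = data.target.astype(str)

# Mean and standard deviation of every feature, split by class label
summary = df.groupby("target").agg(["mean", "std"])
print(summary)
```

Features whose per-class means differ by a lot relative to their standard deviations are the ones that separate the two classes most clearly in the histograms.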
I generate a synthetic dataset using this method:
import numpy as np
import random
def generate_dataset(size, dim):
    dataset = [random.randint(0, 2 ** dim) for _ in range(size)]
    # Removes duplicates
    dataset = list(set(dataset))
    return dataset
As you can see, the data points are generated randomly from [0 - 2^dim]. For any dataset generated by this method, I want to add noise to it. Now, I am thinking of a simple way to do so but I am not sure if it is logically correct, so here it is:
Find the standard deviation of data points from the generated dataset.
Generate new data points that are NOT within this standard deviation.
Add them to your original dataset, and shuffle.
Is this way of generating noise sound?
Thank you.
It seems like you are creating outliers. Noise, to me, is more like adding a small number (positive or negative) to the data points. For example, how many steps did you walk today? It could be 100, but some tracking device might read 95 or 110. That difference is noise.
not sure if this helps.
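A minimal sketch of that idea, assuming zero-mean Gaussian noise with a standard deviation of 5 (both the distribution and the scale are my assumptions; pick whatever matches your measurement error):

```python
import numpy as np

rng = np.random.default_rng(0)

# True values: ten days of exactly 100 steps (made-up example data)
steps = np.full(10, 100.0)

# Small zero-mean perturbation added to every data point
noise = rng.normal(loc=0.0, scale=5.0, size=steps.shape)
noisy_steps = steps + noise

print(noisy_steps)
```

The noisy values stay close to the originals, unlike the points generated outside one standard deviation, which behave like outliers.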
Sklearn provides different data generation functions, such as make_blobs and make_regression, in sklearn.datasets.
However, I am not aware of any functions that can generate sequential data. Are there any existing libraries that can generate artificial sequential data?
It really depends on what kind of series you want. Check out this repository for generating different kinds of simulated series. It's called TimeSynth
But if you just want something you can easily modify yourself, try writing a function similar to this:
import numpy as np

def SynthSeries(start, end, stepSize, coefficients):
    # Sample points on the time axis
    samples = np.arange(start, end, stepSize)
    array = np.zeros(np.shape(samples))
    # Sum one sine wave per coefficient (angular frequency)
    for coeff in coefficients:
        array = np.add(array, np.sin(coeff * samples))
    return array, samples
This is sort of the reverse of a Fourier transform: if you know the base frequencies of the series you want to create, you can pass them into this function to recreate the signal.
You can use it like this:
import matplotlib.pyplot as plt
(SeqData,samples) = SynthSeries(0,20,0.1,[12,3,1,22])
plt.plot(samples, SeqData)
plt.show()