I have a dataset with X and Y values, and I don't know the function that describes their relationship. I want to determine a function that provides a good fit in Python. I have shared a scatter plot of the original data.
I have tried polynomial regression, but from what I have read, a high-degree polynomial may over-fit the data.
[scatter plot of the original data]
This looks like a logarithmic function to me. Can someone help me figure out how to find the best fit for such a graph?
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

# Expand x into polynomial features up to degree 4
poly = PolynomialFeatures(degree=4)
poly_data = poly.fit_transform(file[["x"]])

# Fit a linear model on the polynomial features
model = LinearRegression()
model.fit(poly_data, file["y"])

plt.scatter(file["x"], file["y"])
plt.plot(file["x"], model.predict(poly.fit_transform(file[["x"]])))
plt.show()
[plot of the degree-4 polynomial fit over the data]
Please let me know if you want to see the dataset.
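For reference, a minimal sketch of what fitting a logarithmic model could look like with scipy.optimize.curve_fit, assuming the data lives in a DataFrame named file with columns x and y as in the snippet above (the function name log_model is introduced here for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Hypothetical logarithmic model: y = a * ln(x) + b
def log_model(x, a, b):
    return a * np.log(x) + b

x = file["x"].to_numpy()
y = file["y"].to_numpy()

# Estimate a and b from the data (assumes all x values are > 0)
params, _ = curve_fit(log_model, x, y)

xs = np.linspace(x.min(), x.max(), 200)
plt.scatter(x, y, label="data")
plt.plot(xs, log_model(xs, *params), color="red", label="log fit")
plt.legend()
plt.show()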
I have some data points on the (x, y) plane that are supposed to be categorised into two classes, denoted here by X and O, like this:
import numpy as np
import matplotlib.pyplot as plt
x1 = np.array([0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7])
x2 = np.array([0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6])
c=np.array([ 1,1,1,1,1,0,0,0,0,0 ])
plt.plot(x1[c==0], x2[c==0], 'bo')  # class 0 as blue circles
plt.plot(x1[c==1], x2[c==1], 'rx')  # class 1 as red crosses
plt.show()
Now I want to find the "best fitting curve" that separates the two classes, like this:
First I thought I might try a nearest-neighbour method, but I have been told it does not apply here and that there is a much simpler way to do it, perhaps with an ANN using PyTorch, but I can't understand how.
Any ideas on what to use, how to do it, and what I should check to estimate my accuracy? There are very few points, so what is the right approach here?
Thank you everyone in advance!
This seems to be a basic case, and neural networks are probably overkill for the task. Also, if you later want to produce a visualization like the one you posted, you should probably go for a more fundamental approach.
You can try linear classifiers (support vector machines, logistic regression, etc.).
Check the links below
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
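As a rough illustration (not from the original answer), here is a minimal sketch of fitting a linear SVM to the points from the question and drawing its decision boundary; the variable names follow the question's snippet:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

x1 = np.array([0.1,0.3,0.1,0.6,0.4,0.6,0.5,0.9,0.4,0.7])
x2 = np.array([0.1,0.4,0.5,0.9,0.2,0.3,0.6,0.2,0.4,0.6])
c  = np.array([1,1,1,1,1,0,0,0,0,0])

X = np.column_stack([x1, x2])
clf = SVC(kernel="linear").fit(X, c)

# Decision boundary: w0*x1 + w1*x2 + b = 0  ->  x2 = -(w0*x1 + b) / w1
w = clf.coef_[0]
b = clf.intercept_[0]
xs = np.linspace(0, 1, 100)
plt.plot(xs, -(w[0] * xs + b) / w[1], 'k--', label="decision boundary")

plt.plot(x1[c == 0], x2[c == 0], 'bo')
plt.plot(x1[c == 1], x2[c == 1], 'rx')
plt.legend()
plt.show()

# With so few points, leave-one-out cross-validation is one way to gauge accuracy.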
I am trying to apply PCA in a very specific context and ran into a behavior that I cannot explain.
As a test I am running the following code with the file data that you can retrieve here: https://www.dropbox.com/s/vdnvxhmvbnssr34/test.npy?dl=0 (numpy array format).
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

test = np.load('test.npy')
pca = PCA()
X_proj = pca.fit_transform(test) ### Project in the basis of eigenvectors
proj = pca.inverse_transform(X_proj) ### Reconstruct vector
My issue is the following: because I do not specify any number of components, I should be reconstructing with all of the computed components. I therefore expect my output proj to be the same as my input test. But a quick plot proves this is not the case:
plt.figure()
plt.plot(test[0]-proj[0])
plt.show()
The plot here will show some large discrepancies between projection and the input matrix.
Does anyone have an idea or explanation to help me understand why proj is different from test in my case?
I checked your test data and found the following:
mean = test.mean() # 1.9545972004854737e+24
std = test.std() # 9.610595443778275e+26
I interpret the standard deviation as representing, in some sense, the least count or the uncertainty in the values that are reported. By that I mean that if a numerical algorithm reports the answer to be a, then the real answer should be in the interval [a - std, a + std]. This is because numerical algorithms are imprecise by their very nature: they depend on floating-point operations, which obviously can't represent real numbers in all their glory.
So if I plot:
plt.plot((test[0]-proj[0])/std)
plt.show()
I get the following plot which seems more reasonable.
You may be interested in plotting relative errors as well. Alternatively, you can normalize your data to have zero mean and unit variance, and then the PCA reconstruction should be more accurate.
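A minimal sketch of that suggestion, assuming test.npy is available as in the question (StandardScaler is one way to do the normalization; the original answer does not prescribe a specific method):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

test = np.load('test.npy')

# Standardize to zero mean and unit variance before the PCA round trip
scaler = StandardScaler()
scaled = scaler.fit_transform(test)

pca = PCA()
X_proj = pca.fit_transform(scaled)
recon = scaler.inverse_transform(pca.inverse_transform(X_proj))

# The reconstruction error relative to the data's spread should now be small
rel_err = np.abs(test - recon) / test.std()
print(rel_err.max())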
I built various ML models using sklearn for a binary classification problem. The dataset was provided to me by my professor for this comparative study.
My Jupyter notebook and dataset can be found here.
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the built-in breast cancer dataset in sklearn, which is similar to my dataset in that both are binary classification problems. There I get a mean accuracy of 95%. So I now think the problem might be my dataset. Can I get some help on how to pre-process my data, or any other steps I might look into to improve accuracy?
Encode labels
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group, etc. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical (text) data into numbers, which our predictive models can better understand.
# Encode the categorical target labels (Y) as integers
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, which means transforming your data so that it fits within a specific range, like 0-100 or 0-1. We will use the StandardScaler class from the scikit-learn library.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing the right model
You might also want to choose the appropriate model. You can't just use neural networks (or any single model) for all problems; that's the no-free-lunch theorem. To compare candidate models you could use k-fold cross-validation, AIC, and BIC, as sketched below.
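A minimal sketch of comparing models with k-fold cross-validation, assuming X and Y are the preprocessed features and labels from the steps above (the particular models chosen here are illustrative, not from the original answer):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = {
    "decision tree": DecisionTreeClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": SVC(kernel="linear"),
}

# 5-fold cross-validated accuracy for each candidate model
for name, model in models.items():
    scores = cross_val_score(model, X, Y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")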
I have created a binary classifier in TensorFlow that outputs a generator object containing predictions. I extract the predictions (e.g. [0.98, 0.02]) from the object into a list, later converting this into a numpy array. I have the corresponding array of labels for these predictions. Using these two arrays, I believe I should be able to plot a ROC curve via:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(labels, predictions[:,1])
plt.plot(fpr, tpr)
plt.show()
print(fpr)
print(tpr)
print(thr)
Here predictions[:,1] gives the positive prediction score. However, running this code leads to only a flat line, and only three values for each of fpr, tpr, and thr:
[Flat-line ROC plot; fpr, tpr, and thr each contain only three values]
The only theory I have as to why this is happening is that my classifier is too sure of its predictions. Many, if not all, of the prediction scores are 1.0 or incredibly close to zero:
[[9.9999976e-01 2.8635742e-07]
[3.3693312e-11 1.0000000e+00]
[1.0000000e+00 9.8642090e-09]
...
[1.0106111e-15 1.0000000e+00]
[1.0000000e+00 1.0030269e-09]
[8.6156778e-15 1.0000000e+00]]
According to a few sources including this stackoverflow thread and this stackoverflow thread, the very polar values of my predictions could be creating an issue for roc_curve().
Is my intuition correct? If so, is there anything I can do about it to plot my ROC curve?
I've tried to include all the information I think would be relevant to this issue but if you would like any more information about my program please ask.
ROC is generated by changing the threshold on your predictions and finding the sensitivity and specificity for each threshold. This generally means that as you increase the threshold, your sensitivity decreases but your specificity increases and it draws a picture of the overall quality of your predicted probabilities. In your case, since everything is either 0 or 1 (or very close to it) there are no meaningful thresholds to use. That's why the thr value is basically [ 1, 1, 1 ].
You can try to arbitrarily pull the values closer to 0.5 or alternatively implement your own ROC curve calculation with more tolerance for small differences.
On the other hand, you might want to review your network, because such result values often mean there is a problem there; maybe the labels leaked into the network somehow, and that is why it produces perfect results.
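As a small illustration (not from the original answer), perfectly polarized scores leave roc_curve with essentially one useful threshold, which is why the returned arrays contain only a few entries:

import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 1, 0, 1, 1, 0])
# Perfectly confident (polarized) scores for the positive class
scores = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])

fpr, tpr, thr = roc_curve(labels, scores)
print(fpr)  # e.g. [0. 0. 1.]
print(tpr)  # e.g. [0. 1. 1.]
print(thr)  # only two distinct score values, plus a sentinel above the maximum score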
So, I can get sklearn.linear_model.LinearRegression to process my data, or at least to run the script without raising any exceptions or warnings. The only issue is that I am not trying to plot the results with matplotlib; instead, I want to see the estimators and diagnostic statistics for the model.
How can I get a model summary, such as the slope and intercept (B0, B1), adjusted R-squared, etc., to display in the console or populate a variable instead of plotting it?
This is a generic copy of the script I ran:
import numpy as np
import pandas as pd
from sklearn import linear_model

z = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'b': [9, 8, 7, 6, 5, 4, 3, 2, 1]
})

a2 = z['a'].values.reshape(9, 1)
b2 = z['b'].values.reshape(9, 1)

reg = linear_model.LinearRegression(fit_intercept=True)
reg.fit(a2, b2)

# print(reg.get_params(deep=True))  # I tried this and it didn't print the information I wanted
# print(reg)                        # I tried this too
This ran without errors, but no output other than this appeared in the console:
{'n_jobs': 1, 'fit_intercept': True, 'copy_X': True, 'normalize': False}
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Thanks for any info on how to get this to print a summary of the model.
sklearn's API is designed around fitting training data and then generating predictions on test data, without exposing much, if any, information about how the model is fit. While you can sometimes find the estimated parameters of the model by accessing the fitted model object's coef_ attribute, you won't find much in the way of parameter-description functionality. This is because there may be no way to provide this information in a uniform way; the API is designed to let you treat a linear regression the same way as a random forest.
Since you are interested in a linear model, you can get the information you're looking for, including confidence intervals, goodness-of-fit statistics, and the like from the statsmodels library. See their OLS example: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html for details.
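For instance, a minimal sketch using the toy data from the question: the coefficients can be read directly from the fitted sklearn model, and statsmodels provides a full summary (note that statsmodels' OLS does not add an intercept unless you call add_constant):

import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model

z = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                  'b': [9, 8, 7, 6, 5, 4, 3, 2, 1]})
a2 = z['a'].values.reshape(9, 1)
b2 = z['b'].values.reshape(9, 1)

# sklearn: slope, intercept, and R-squared are available as attributes/methods
reg = linear_model.LinearRegression(fit_intercept=True).fit(a2, b2)
print(reg.coef_, reg.intercept_, reg.score(a2, b2))

# statsmodels: full regression summary (coefficients, p-values, adjusted R-squared, ...)
ols = sm.OLS(b2, sm.add_constant(a2)).fit()
print(ols.summary())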