Sklearn TruncatedSVD() ValueError: n_components must be < n_features - scikit-learn

Hi, I'm trying to run a script for a Kaggle competition.
You can see the whole script here.
But when I run this script I get a ValueError:
ValueError: n_components must be < n_features; got 1 >= 1
Can somebody please tell me how to find out how many features there are at this point?
I don't think it will be useful to set n_components to 0.
I also read the documentation, but I couldn't solve the issue from it.
Greetz,
Alex

It is highly likely that the shape of your data matrix is wrong: it seems to have only one column. That needs to be fixed. Use a debugger to figure out what goes into the fit method of the TruncatedSVD, or unravel the pipeline and do the steps by hand.
As for the error message: if it is due to a matrix with one column, it makes sense. You can have at most as many components as features. Since you are using TruncatedSVD, it additionally assumes that you don't want the full feature space, hence the strict inequality.
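As a minimal sketch of how to check this (the arrays below are synthetic stand-ins for whatever your pipeline actually produces), print the shape of the matrix just before the SVD step and confirm that n_components is strictly smaller than the number of columns:

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(100, 1)       # a matrix with a single column, as suspected here
print(X.shape)                   # (100, 1) -> only 1 feature

svd = TruncatedSVD(n_components=1)
# svd.fit(X)                     # would raise: n_components must be < n_features; got 1 >= 1

X_wide = np.random.rand(100, 5)  # with 5 features, n_components=1 is fine
svd.fit(X_wide)
print(svd.explained_variance_ratio_)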

Related

Time Series Forecasting in Tensorflow 2.0 - How to predict using the last of the Validation Dataset?

I'm (desperately) trying to figure out Tensorflow 2.0 without much luck so far, but I think I'm close with what I need right now.
I've followed the doc here to make a simple network to forecast stock data (not weather data), and what I'd like to do now is forecast the future using the latest/most recent section of the validation dataset. I'm hoping someone else has read through it already and can help me here.
The code to predict the future using the validation dataset looks like this:
for x, y in val_data_multi.take(3):
    multi_step_plot(x[0], y[0], multi_step_model.predict(x)[0])
...where, to the best of my knowledge, it takes a random chunk (3 separate times), which in my case is a 20-row x 9-column section from val_data_multi (a RepeatDataset), and then uses the multi_step_plot function to produce a plot with the predicted values based on that random section of the validation dataset. But what if I don't want to just take a random validation section, and instead want to use the bottom of my actual dataset? If I have recent stock data at the bottom of my validation dataset and I want to forecast the future that hasn't happened yet, how can I take a 20x9 section from the bottom of that set, rather than having it "take" a random section to predict with?
As a pseudo-code attempt to explain what I'm trying to do, I tried something like:
for x, y in val_data_multi[-20:].take(1):  # .take(3)
...to try and make it take one section covering the 20 rows at the bottom and all columns. But of course this didn't work: TypeError: 'RepeatDataset' object is not subscriptable.
I hope that makes sense. If it helps for me to post my code, I can do that, but I'm just using what's already shown on that page, with some modifications to use a stock dataset.
I was able to find a much better guide in this GitHub repo:
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
...which goes into better detail about what I'm looking to do, and made it very easy to understand. Thanks!
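For anyone landing here with the same question: a tf.data dataset such as a RepeatDataset can't be sliced with [-20:], but if (as in the tutorial) the dataset was built from a NumPy array, you can slice the array itself and add a batch dimension before calling predict. A minimal sketch, assuming the validation features live in a NumPy array called val_array with shape (rows, 9) (the variable names here are placeholders, not from the tutorial):

import numpy as np

# Take the most recent 20 rows (the bottom of the dataset), all 9 columns.
last_window = val_array[-20:, :]                   # shape (20, 9)

# The model expects a batch dimension: (batch, history, features).
last_window = np.expand_dims(last_window, axis=0)  # shape (1, 20, 9)

prediction = multi_step_model.predict(last_window)
print(prediction[0])  # forecast for the steps after the last row in the dataset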

How can I get the initial values from my dataset for a combined Lorentzian and Gaussian fit?

I am trying to fit data using standard defined functions (Lorentzian & Gaussian) from the lmfit package. The program works quite well for some data sets, but for another one it is not able to fit because the initial values don't seem right. Is there an algorithm which can extract the initial values from the data set and do some iterations in order to find the best fit?
I tried some common methods like a brute-force algorithm, but the results are not satisfactory and it costs a lot of time.
It is always recommended to provide a small, complete example script that shows the problem you are having. How could we know why it works in some cases and not in others?
lmfit.GaussianModel and lmfit.LorentzianModel both have guess methods. These should work reasonably well for data with an isolated peak, something like:
import lmfit

model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
for p in params.values():
    print(p)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
If the data doesn't have a clear isolated peak, that might not work so well.
If finding the peak(s) is the actual problem, try scipy.signal.find_peaks (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html) or peakutils (https://peakutils.readthedocs.io/en/latest/). Either of these should give you a good estimate of the center parameter, which is probably the most likely to cause bad fits if a poor initial value is given.
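A minimal sketch of seeding the center parameter from scipy.signal.find_peaks (the data here is synthetic, just to keep the example self-contained):

import numpy as np
from scipy.signal import find_peaks
import lmfit

# Synthetic data: a single Gaussian peak on a noisy baseline.
xdata = np.linspace(0, 10, 500)
ydata = 3.0 * np.exp(-(xdata - 6.2)**2 / 0.5) + np.random.normal(0, 0.05, xdata.size)

# Find peak indices; 'prominence' helps skip small noise spikes.
peaks, _ = find_peaks(ydata, prominence=0.5)
center_guess = xdata[peaks[0]]

model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
params['center'].set(value=center_guess)  # override the guessed center

result = model.fit(ydata, params, x=xdata)
print(result.fit_report())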

Can kernels other than periodic be used in SGPR in gpflow

I am pretty new to GPR. I would appreciate it if you could provide some suggestions regarding the following questions:
Can we use the Matern52 kernel in a sparse Gaussian process?
What is the best way to select pseudo-inputs (Z)? Is random sampling reasonable?
I would like to mention that when I am using the Matern52 kernel, the following error stops the optimization process. My code:
k1 = gpflow.kernels.Matern52(input_dim=X_train.shape[1], ARD=True)
m = gpflow.models.SGPR(X_train, Y_train, kern=k1, Z=X_train[:50, :].copy())
InvalidArgumentError (see above for traceback): Input matrix is not invertible.
[[Node: gradients_25/SGPR-31ceaea6-412/Cholesky_grad/MatrixTriangularSolve = MatrixTriangularSolve[T=DT_DOUBLE, adjoint=false, lower=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](SGPR-31ceaea6-412/Cholesky, SGPR-31ceaea6-412/eye_1/MatrixDiag)]
Any help will be appreciated, thank you.
Have you tried it out on a small test set of data that you could perhaps post here? There is no reason Matern52 shouldn't work. Randomly sampling inducing points should be a reasonable initialisation, especially in higher dimensions. However, you may run into issues if you end up with some inducing points very close to each other (this can make the K_{zz} = cov(f(Z), f(Z)) matrix badly conditioned, which would explain why the Cholesky fails). If your X_train isn't already shuffled, you may want to use Z=X_train[np.random.permutation(len(X_train))[:50]] to get shuffled indices. It may also help to add a white noise kernel, kern=k1 + gpflow.kernels.White(input_dim=X_train.shape[1]) ...
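Putting those two suggestions together (shuffled inducing points plus a white noise kernel), a sketch in the GPflow 1.x style used in the question (X_train and Y_train are the arrays from the question):

import numpy as np
import gpflow

# Matern52 with ARD, plus a white noise kernel to improve conditioning.
k1 = gpflow.kernels.Matern52(input_dim=X_train.shape[1], ARD=True)
k = k1 + gpflow.kernels.White(input_dim=X_train.shape[1])

# Pick 50 inducing points at random instead of the first 50 rows.
idx = np.random.permutation(len(X_train))[:50]
Z = X_train[idx].copy()

m = gpflow.models.SGPR(X_train, Y_train, kern=k, Z=Z)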

Automatic model selection

I am writing a machine learning master algorithm from scratch where the user just inputs the training and testing data. I was wondering: is there a way to automatically decide which type of algorithm should be used, regression vs. classification?
For example (assuming the last column is always the output and is always a number), we could search through the last column and decide which type of model applies by checking whether the values are discrete class labels or continuous values.
How would one go about this?
And if not this method, is there a better one?
It is to be in Python 3.
Thank you.
Type of target using scikit-learn
We can use the type_of_target() function to get the type of the target (continuous, or a label type) and hence figure out what type of problem it is based on that.
This answer provided by #Vivekkumar works as I needed it to.
Thank you!
It may not always work; it still needs to be better, since we can fool it. But in the best case we could run an ML algorithm for this and use it as a model to train it better.
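A minimal sketch of what type_of_target() reports for different kinds of last columns (it lives in sklearn.utils.multiclass):

from sklearn.utils.multiclass import type_of_target

print(type_of_target([0.3, 1.7, 2.5]))   # 'continuous' -> regression
print(type_of_target([0, 1, 1, 0]))      # 'binary'     -> classification
print(type_of_target([0, 1, 2, 2, 1]))   # 'multiclass' -> classification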
The following bit of code should help determine whether a value in the last column can be converted to a floating-point number:
s_1 = 'text'
s_2 = '1.23'

for s in [s_1, s_2]:
    try:
        f = float(s)
        print(s, 'conversion to float was OK, value:', f)
    except ValueError:
        print(s, 'could not be converted to a number')

Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection

Summing up my understanding of the topic: 'dummy coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using K values would cause redundancy and would have a negative impact, e.g., on logistic regression, as far as I learned it. So far, everything's clear to me.
Yet, two issues are unclear to me:
1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why would that be the case?
2) An issue arises as soon as I consider attribute selection. While the left-out attribute value is implicitly included as the case where all dummies are zero when all dummies are actually used in the model, it is no longer clearly included if one dummy is missing (because it was not selected during attribute selection). The issue is much easier to understand with the sketch I uploaded. How can that issue be treated?
Images
WEKA Output: The Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11,A12,A13,A14. All of them are included in the logistic regression model. http://abload.de/img/bildschirmfoto2013-089out9.png
Decision Tree Example: Sketch showing the issue when it comes to running decision trees on datasets with dummy-coded instances after attribute selection. http://abload.de/img/sketchziu5s.jpg
The output is generally easier to read, interpret, and use when you use K dummies instead of K-1 dummies. I figure that is why everybody seems to actually use K dummies.
But yes, since the K values sum up to 1, there is a correlation that may cause problems. But correlations in data sets are common, and you will never completely get rid of them!
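To make the K vs. K-1 distinction concrete, here is a minimal sketch with pandas (the column name is made up for illustration; the values echo the German Credit attribute from the screenshot):

import pandas as pd

df = pd.DataFrame({'checking_status': ['A11', 'A12', 'A13', 'A14', 'A11']})

# K dummies: one binary column per category (what WEKA's Logistic appears to do).
print(pd.get_dummies(df['checking_status']))

# K-1 dummies: drop one category; the dropped one is encoded as "all zeros".
print(pd.get_dummies(df['checking_status'], drop_first=True))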
I believe feature selection and dummy coding just don't fit together; it amounts to dropping some values from the attribute. Why do you insist on doing feature selection?
You really should be using weighting, or consider more advanced algorithms that can handle such data. In fact, the dummy variables can cause just as much trouble, because they are binary, and many algorithms (e.g. k-means) don't make much sense on binary variables.
As for the decision tree: don't perform feature selection on your output attribute...
Plus, since a decision tree already selects features, it does not make sense to do all this anyway; leave it to the decision tree to decide which attribute to use for splitting. This way, it can learn dependencies, too.
