Automatic model selection - python-3.x

I am writing a machine learning "master algorithm" from scratch where the user just supplies the training and testing data. Is there a way to decide automatically which kind of algorithm to use: regression or classification?
For example (assuming the last column is always the output and is always a number), we could scan the last column and decide which kind of model is needed by checking whether it holds discrete class labels or continuous values.
How would one go about this?
And if not this method, is there a better one?
It has to be in Python 3.
Thank you.

Type of target using scikit-learn
We can use the type_of_target() function from sklearn.utils.multiclass to get the type of the target (for example continuous, binary, or multiclass) and work out from that what type of problem it is.
This answer, provided by @Vivekkumar, works as I needed it to.
Thank you!
It may not always work, since it can be fooled, so it still needs improvement. Ideally, one could train an ML model to make this decision.
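A minimal sketch of this approach, assuming scikit-learn is installed. The helper name guess_problem_type and the mapping onto "regression" and "classification" are my own illustration, not part of scikit-learn. The last call shows one way the check can be fooled: floats that are all whole numbers are reported as multiclass.

```python
from sklearn.utils.multiclass import type_of_target


def guess_problem_type(y):
    """Map scikit-learn's target type onto 'regression' or 'classification'.

    Hypothetical helper: the mapping below is an assumption, not sklearn API.
    """
    kind = type_of_target(y)
    if kind.startswith("continuous"):
        return "regression"
    if kind in ("binary", "multiclass"):
        return "classification"
    return "unknown: " + kind


print(guess_problem_type([0.5, 1.7, 2.3]))  # regression
print(guess_problem_type([0, 1, 1, 0]))     # classification
# Caveat: whole-number floats count as 'multiclass', so a regression
# target recorded in whole units is misread as classification.
print(guess_problem_type([100.0, 250.0, 300.0]))  # classification
```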

The following bit of code should help determine if the value in the last column can be converted to a floating point number:
s_1 = 'text'
s_2 = '1.23'
for s in [s_1, s_2]:
    try:
        f = float(s)
        print(s, 'conversion to float was OK, value:', f)
    except ValueError:
        # float() raises ValueError for strings that are not numbers
        print(s, 'could not be converted to a number')
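Building on the float check, here is a rough sketch of how the whole last column could be inspected using only the standard library. The function name guess_task and the max_classes cut-off are hypothetical choices for illustration; the threshold in particular is arbitrary and would need tuning for real data.

```python
def guess_task(column, max_classes=10):
    """Guess 'classification' or 'regression' from raw target values.

    max_classes is an arbitrary, hypothetical cut-off on distinct values.
    """
    numbers = []
    for value in column:
        try:
            numbers.append(float(value))
        except (TypeError, ValueError):
            # Any non-numeric entry means the column holds labels.
            return "classification"
    # Whole numbers with few distinct values look like class labels.
    all_whole = all(x.is_integer() for x in numbers)
    if all_whole and len(set(numbers)) <= max_classes:
        return "classification"
    return "regression"


print(guess_task(["cat", "dog", "cat"]))
print(guess_task(["0", "1", "1", "0"]))
print(guess_task(["1.5", "2.25", "3.0"]))
```

Like any heuristic of this kind it can be fooled, for example by a regression target that happens to take only a few whole-number values.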

Related

Assessing features to labelencode or get_dummies() on dataset in Python

I'm working on the heart attack analysis on Kaggle in python.
I am a beginner and I'm trying to figure out whether it's still necessary to one-hot-encode or label-encode these features. I see so many people encoding the values for this project, but I'm confused because everything already looks scaled (apart from age, thalach, oldpeak and slope).
age: age in years
sex: (1 = male; 0 = female)
cp: ordinal values 1-4
thalach: maximum heart rate achieved
exang: (1 = yes; 0 = no)
oldpeak: depression induced by exercise
slope: the slope of the peak exercise
ca: values (0-3)
thal: ordinal values 0-3
target: 0= less chance, 1= more chance
Would you say it's still necessary to one-hot-encode, or should I just use a StandardScaler straight away?
I've seen many people encode the whole dataset for this project, but it makes no sense to me to do so. Please confirm if only using StandardScaler would be enough?
When you apply StandardScaler, the columns end up on a comparable scale (zero mean, unit variance). That helps keep the model weights bounded and stops gradient descent from overshooting, so the model converges faster.
Independently of scaling, to decide between ordinal values and one-hot encoding, consider whether the distance between a column's values is meaningful, i.e. whether values that are close together are more similar than values that are far apart. If so, choose ordinal values. If you know the hierarchy of the categories, you can assign the ordinal values manually; otherwise, you can use LabelEncoder. It seems the heart attack data already comes with ordinal values assigned manually. For example, higher chest pain = 4.
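As a rough sketch of how the two steps can be combined, assuming scikit-learn is available: the rows below are made up, and treating cp as nominal (one-hot) is only an assumption for the sake of the example. If the distances between cp values are meaningful, it can simply be left as it is.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; columns are age, thalach, cp (illustrative values only).
X = np.array([
    [63, 150, 1],
    [45, 172, 2],
    [58, 131, 3],
    [51, 160, 1],
], dtype=float)

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), [0, 1]),  # continuous columns
        ("onehot", OneHotEncoder(), [2]),     # cp treated as nominal here
    ],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)  # 2 scaled columns + 3 one-hot columns
```

Binary columns such as sex and exang are already 0/1, so they need no further encoding.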
Also, it is important to refer to notebooks that perform better. Take a look at the one below for reference.
95% Accuracy - https://www.kaggle.com/code/abhinavgargacb/heart-attack-eda-predictor-95-accuracy-score

Time Series Forecasting in Tensorflow 2.0 - How to predict using the last of the Validation Dataset?

I'm (desperately) trying to figure out Tensorflow 2.0 without much luck so far, but I think I'm close with what I need right now.
I've followed the doc here to make a simple network to forecast stock data (not weather data), and what I'd like to do now is, forecast the future using the latest/most recent section of the validation dataset. I'm hoping someone else has read through it already and can help me here.
The code to predict the future using the validation dataset looks like this:
for x, y in val_data_multi.take(3):
    multi_step_plot(x[0], y[0], multi_step_model.predict(x)[0])
...where to the best of my knowledge, it takes a random chunk (3 separate times), and in my case is a 20 row x 9 column section, from the val_data_multi "Repeat dataset" type, and then uses the model's multi_step_plot function to spit out a plot that has the predicted values based on that random section of the validation dataset. But what if I don't want to just take a random validation section, I want to use the bottom of my actual dataset? So that if I have recent stock data at the bottom of my validation dataset, and I want to forecast for the future that hasn't happened yet, how can I take a 20x9 section from the bottom of that set, and not just have it "take" a random section to predict with?
As a pseudo code attempt to explain what I'm trying to do, I was trying something like:
for x, y in val_data_multi[-20:].take(1): #.take(3)
...to try and make it take one section 20 rows up from the bottom, and all columns. But of course this didn't work as TypeError: 'RepeatDataset' object is not subscriptable.
I hope that makes sense, and if it'll help for me to post my code, I can do that, but I'm just using what's already shown in that page, just made some modifications to use a stock dataset, that's all.
I was able to find a much better guide from this Github repo:
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
...which basically gets into better detail what I'm looking to do and made it very easy to understand. Thanks!
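For anyone who wants to stay with the original setup, one possible sketch is to skip the tf.data pipeline for this step and slice the most recent window straight from the underlying array. This assumes you still have the normalized 2-D array that the windows were built from; the names dataset and past_history are assumptions here, and the array below is only a stand-in.

```python
import numpy as np

past_history = 20  # rows per input window, as in the question
n_features = 9     # columns per row, as in the question

# Stand-in for the normalized array the windows were built from.
dataset = np.arange(100 * n_features, dtype=np.float32).reshape(100, n_features)

# Last 20 rows, with a leading batch axis added: shape (1, 20, 9).
latest_window = dataset[-past_history:][np.newaxis, ...]
print(latest_window.shape)

# With the trained model from the question, the forecast would then be:
# forecast = multi_step_model.predict(latest_window)[0]
```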

Sklearn TruncatedSVD() ValueError: n_components must be < n_features

Hi, I'm trying to run a script for a Kaggle competition.
You can see the whole script here
But when I run this script I get a ValueError:
ValueError: n_components must be < n_features; got 1 >= 1
Can somebody please tell me how to find out how many features there are at this point?
I don't think it would be useful to set n_components to 0.
I also read the documentation, but I can't solve the issue.
Greetz
Alex
It is highly likely that the shape of your data matrix is wrong: It seems to have only one column. That needs to be fixed. Use a debugger to figure out what goes into the fit method of the TruncatedSVD, or unravel the pipeline and do the steps by hand.
As for the error message, if it is due to a matrix with one column, this makes sense: you can have at most as many components as features. Since you are using TruncatedSVD, it additionally assumes that you don't want the full feature space, hence the strict inequality.
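To see how many features there are at that point, you can inspect the shape of whatever is about to be passed to fit. A small sketch, where max_svd_components is a hypothetical helper and the bound n_features - 1 is taken from the strict inequality in the error message above:

```python
import numpy as np


def max_svd_components(X):
    """Largest n_components that satisfies n_components < n_features.

    Only reads X.shape, so it works for NumPy arrays and SciPy sparse
    matrices alike. Raises ValueError if X is not 2-D.
    """
    n_samples, n_features = X.shape
    print(f"{n_samples} samples, {n_features} features")
    return n_features - 1


single_column = np.ones((5, 1))
print(max_svd_components(single_column))  # 0: nothing left to reduce

wide = np.ones((5, 8))
print(max_svd_components(wide))  # 7
```

If this reports a single feature, the problem is upstream of the SVD step, as described above.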

How to find a regression line for a closed set of data with 4 parameters in matlab or excel?

I have a set of data I have acquired from simulations. There are 3 parameters that go into my simulations and I get one result out.
I can graph the data from the small subset I have and see the trends for each input, but I need to be able to extrapolate this and get some form of regression equation, since the simulation takes a long time.
In MATLAB or Excel, is it possible to list the inputs and outputs to obtain a 4-parameter regression line for a given set of information?
Before this gets flagged as a duplicate: I understand polyfit will give me an equation of best fit and will be as accurate as I want it, but I need the equation to correspond to the inputs, not just a regression line.
In other words, if I have 20 simulations with inputs a, b, c and output y, is there a way to obtain a "best fit":
y=B0+B1*a+B2*b+B3*c
using the data?
My usual recommendation for higher-dimensional curve fitting is to pose the problem as a minimization problem (that may be unneeded here with the nice linear model you've proposed, but I'm a hammer-nail guy sometimes).
It starts by creating a correlation function (the functional form you think maps your inputs to the output) given a vector of fit parameters p and input data xData:
correl = @(p,xData) p(1) + p(2)*xData(:,1) + p(3)*xData(:,2) + p(4)*xData(:,3)
Then you need to define a function to minimize given the parameter vector, which I call the objective; this is typically your correlation minus your output data.
The details of this function are determined by the solver you'll use (see below).
All of the methods need a starting vector pGuess, which depends on the trends you see.
For a nonlinear correlation function, finding a good pGuess can be a trial, but it is necessary for a good solution.
fminsearch
To use fminsearch, the data must be collapsed to a scalar value using some norm (2 here):
x = [a,b,c]; % your input data as columns of x
objective = @(p) norm(correl(p,x) - y,2);
p = fminsearch(objective,pGuess); % you need to define a good pGuess
lsqnonlin
To use lsqnonlin (which solves the same problem as above in different ways), the norm-ing of the objective is not needed:
objective = @(p) correl(p,x) - y ;
p = lsqnonlin(objective,pGuess); % you need to define a good pGuess
(You can also specify lower and upper bounds on the parameter solution, which is nice.)
lsqcurvefit
To use lsqcurvefit (which is simply a wrapper for lsqnonlin), only the correlation function is needed along with the data:
p = lsqcurvefit(correl,pGuess,x,y); % you need to define a good pGuess
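Since the proposed model is linear in B0..B3, it can also be solved directly as an ordinary least-squares problem without a starting guess. The sketch below shows the idea in Python/NumPy rather than MATLAB, purely as an illustration; the data and the "true" coefficients are synthetic and made up for the demonstration. In MATLAB the analogous solve is the backslash operator applied to the same design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 20))      # 20 simulated input triples
y = 1.0 + 2.0 * a - 3.0 * b + 0.5 * c   # synthetic, noise-free output

# Design matrix: a column of ones for B0, then one column per input.
A = np.column_stack([np.ones_like(a), a, b, c])

# Least-squares solution of A @ coeffs = y.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # approximately [1, 2, -3, 0.5]
```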

Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection

To sum up my understanding of the topic: 'dummy coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using K dummies would cause redundancy and would have a negative impact, e.g. on logistic regression, as far as I have learned. So far, everything is clear to me.
Yet, two issues are unclear to me:
1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why would that be the case?
2) An issue arises as soon as I consider attribute selection. As long as all K-1 dummies are actually used in the model, the left-out attribute value is implicitly included as the case where all dummies are zero. If one dummy is missing (because it was not selected in attribute selection), that case is no longer clearly represented. The issue is much easier to understand with the sketch I uploaded. How can this issue be treated?
Images
WEKA Output: The Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11,A12,A13,A14. All of them are included in the logistic regression model. http://abload.de/img/bildschirmfoto2013-089out9.png
Decision Tree Example: Sketch showing the issue when it comes to running decision trees on datasets with dummy-coded instances after attribute selection. http://abload.de/img/sketchziu5s.jpg
The output is generally easier to read, interpret and use when you use k dummies instead of k-1 dummies. I figure that is why everybody seems to actually use k dummies.
But yes, as the k values sum up to 1, there exists a correlation that may cause problems. But correlations in data sets are common, you will never completely get rid of them!
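To make the redundancy concrete, here is a small NumPy sketch using the attribute values from the question (the six rows are made up). With an intercept plus all K dummies, the design matrix has more columns than its rank, because the dummies sum to the intercept column; dropping one dummy removes that dependency.

```python
import numpy as np

values = np.array(["A11", "A12", "A13", "A14", "A11", "A13"])
categories, codes = np.unique(values, return_inverse=True)
K = len(categories)

one_hot = np.eye(K)[codes]              # K dummies, one per value
intercept = np.ones((len(values), 1))

full = np.hstack([intercept, one_hot])            # intercept + K dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + K-1 dummies

print(full.shape[1], np.linalg.matrix_rank(full))        # 5 columns, rank 4
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 4 columns, rank 4
```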
I believe feature selection and dummy coding just don't fit together. Selecting among the dummies amounts to dropping some values from the attribute. Why do you insist on doing feature selection?
You really should be using weighting, or consider more advanced algorithms that can handle such data. In fact the dummy variables can cause just as much trouble, because they are binary, and oh so many algorithms (e.g. k-means) don't make much sense on binary variables.
As for the decision tree: don't perform feature selection on your output attribute...
Plus, as a decision tree already selects features, it does not make sense to do all this anyway... leave it to the decision tree to decide upon which attribute to use for splitting. This way, it can learn dependencies, too.

Resources