Dealing with no data values - scikit-learn

During training, none of my features had '0' values, so I successfully built my SVM model.
However, when I use that model for prediction, my features have '0' values at some sample locations. These '0' values are no-data values. How can I deal with no-data values during prediction? I could impute during training, but if I drop the no-data samples during prediction I will have missing prediction results at those sample locations.
At those sample points, not all features are void, but some are.
Any suggestions are appreciated.

If some data values are NaN, you need an imputer to fill in those missing values. A common strategy is to replace them with the 'mean' or 'median' of the feature:
# sklearn.preprocessing.Imputer was replaced by SimpleImputer (in sklearn.impute) in scikit-learn 0.20
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_data = imputer.fit_transform(X_data_with_missing_values)
You can then fit the SVM on this imputed X_data.
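To handle the prediction-time samples without dropping them, a minimal sketch (variable names here are hypothetical) is to mark the '0' no-data values as NaN and reuse the imputer that was fitted on the training data:
import numpy as np

# X_predict uses 0 as a no-data marker (the question says no training feature is ever 0)
X_predict = X_predict.astype(float)
X_predict[X_predict == 0] = np.nan               # mark no-data values as missing

imputer.fit(X_train)                             # learn feature means from training data
X_predict_filled = imputer.transform(X_predict)  # fill NaNs; every sample row is kept
predictions = svm_model.predict(X_predict_filled)
This way every sample location gets a prediction, at the cost of the imputed features carrying no real information.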

Related

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) on this data while using as many regressors as possible.
Because of the size of the data, I am using incremental training - following the sklearn API, .fit(X, y) - since I am not able to fit the entire matrix X into memory, I am training the model a few rows at a time. The problem is that in every batch, the model expects the same number of columns in X.
This is where it gets tricky: because some variables are categorical, one-hot encoding one batch of data may produce some shape (e.g. 20 columns), while the next batch produces 26 columns, simply because not every unique level of the categorical feature was present in the previous batch. Sklearn allows for accounting for this, and a custom function can also be used to keep the number of columns in matrix X fixed.
import seaborn as sns
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def one_hot_known(dataf, list_levels, col):
    """Creates a dummy-coded matrix with as many columns as unique levels."""
    return np.array(
        [np.eye(len(list_levels))[list_levels.index(i)] for i in dataf[col]])

# Load a dataset with a categorical variable
df_orig = sns.load_dataset('tips')

# List of unique levels - known a priori
day_level = list(df_orig['day'].unique())

# Imagine we have a batch of data (a subset of the original data) in which one
# categorical level (day == 'Sun') is not present
df = df_orig.loc[lambda d: d['day'] != 'Sun']

# The missing category's column is filled with 0; in the next batch, if the
# level is present, its column will have 1s.
# (in scikit-learn >= 1.2 use sparse_output=False instead of sparse=False)
OneHotEncoder(categories=[day_level], sparse=False).fit_transform(
    np.array(df['day']).reshape(-1, 1))

# Custom function, usable in incremental (batch/chunk) fashion
one_hot_known(df, day_level, 'day')
What I would like to do now is to utilize the TargetEncoding approach, so that matrix X does not end up with a huge number of columns. However, it still needs to be done in an incremental fashion, just like the one-hot encoding above.
I am writing this as a post because I know this would be very useful to many people, and I would like to know how to utilize the same strategy for TargetEncoding.
I am aware that deep learning allows for embedding layers, which represent categorical features in continuous space, but I would like to apply TargetEncoding.
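For the target-encoding part, here is a minimal sketch of one possible approach (not a library API; the class name and the smoothing scheme are my own assumptions). Target encoding only needs a running sum and count of the target per category level, and both can be updated batch by batch:
from collections import defaultdict
import numpy as np

class IncrementalTargetEncoder:
    """Keeps a running sum/count of the target per category level."""
    def __init__(self, smoothing=1.0):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)
        self.global_sum = 0.0
        self.global_count = 0
        self.smoothing = smoothing

    def partial_fit(self, categories, y):
        # Update per-level and global statistics with one batch
        for c, target in zip(categories, y):
            self.sums[c] += target
            self.counts[c] += 1
        self.global_sum += float(np.sum(y))
        self.global_count += len(y)

    def transform(self, categories):
        # Smoothed per-level target mean; unseen levels fall back to the prior
        prior = self.global_sum / max(self.global_count, 1)
        return np.array([
            (self.sums[c] + self.smoothing * prior) /
            (self.counts[c] + self.smoothing)
            for c in categories])
Per batch, you would typically transform first with the statistics accumulated from earlier batches and then call partial_fit, which also reduces target leakage within the current batch.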

Stock Prediction on basis of Symbol, Date, AveragePrice

I am trying to predict stock prices for the next 7 days based on the data available for the last 5 years. The data looks like this:
I am trying to apply support vector regression to this data set. I have already converted the Date column to pandas datetime using data.Date = pd.to_datetime(data.Date), but I still get this error:
float() argument must be a string or a number, not 'Timestamp'.
My code is as follows
from sklearn.svm import SVR
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

adaniPorts = data[data.Symbol == 'ADANIPORTS']
X = adaniPorts[['Symbol', 'Date']]
Y = adaniPorts['Average Price']
x_train, x_test, y_train, y_test = train_test_split(X, Y)
classifier = SVR().fit(x_train, y_train)
Is there any way to resolve this datetime problem?
When you train the SVR you can only use numerical features. One way to include the datetime information is pd.to_timedelta(df.date).dt.total_seconds(), so that you feed the regressor a numerical feature representing the date. Another way is to include the different fields of the datetime object (year, month, day) as predictors.
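A minimal sketch of the first option (an assumption on my part: measuring seconds from the earliest date, which avoids feeding huge raw timestamps to the regressor):
import pandas as pd

# Turn the Date column into elapsed seconds since the first observation
adaniPorts['date_numeric'] = (
    adaniPorts['Date'] - adaniPorts['Date'].min()
).dt.total_seconds()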
However, using an SVR for time series forecasting only makes sense if the features provide enough information to overcome the temporal component, which is doubtful here.
Furthermore, you are using train_test_split, which generates random train and test subsets from the original data.
This cannot be applied directly to time series data, as it assumes there is no relationship between the observations. When dealing with time series, the data must be split respecting the temporal order in which the values were observed.
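For example, a sketch of a time-ordered split (the 80/20 ratio is an arbitrary choice):
# Sort by date, then hold out the most recent 20% as the test set
adaniPorts = adaniPorts.sort_values('Date')
split_idx = int(len(adaniPorts) * 0.8)
train, test = adaniPorts.iloc[:split_idx], adaniPorts.iloc[split_idx:]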
I suggest you also take a look at recurrent neural networks or ARIMA models.
As Alexandre's answer said, only numerical features are supported, so string features must be transformed into numerical ones. You have several options. The first one is, as he does, to transform each date into numerical seconds, but I think it is better to one-hot encode each part of the date.
data['day'] = data.Date.dt.day
data['month'] = data.Date.dt.month
data['year'] = data.Date.dt.year
With this you have day, month and year separated. Now you can one-hot encode them: build a vector of 0s for each element and put a 1 at the position of the value you are encoding. For example, the 3rd day of a month will be:
[0,0,1,0,....,0] -> 1x31
To do that with pandas you can use something like this.
data = pd.concat([data, pd.get_dummies(data.year, prefix='year')], axis=1, sort=False)
data = pd.concat([data, pd.get_dummies(data.month, prefix='month')], axis=1, sort=False)
data = pd.concat([data, pd.get_dummies(data.day, prefix='day')], axis=1, sort=False)
It can also be interesting to add the weekday, because on weekends the world stops.
data['week_day'] = data.Date.dt.dayofweek
And before passing the data to the SVR, drop the Date column: data.drop(['Date'], axis=1, inplace=True)
I hope this works.
PS: I would recommend an LSTM (neural network) or ARIMA (statistical model) for this task.

Hardcode a spark logistic regression model

I've trained a model using PySpark and would like to compare its performance to that of an existing heuristic.
I just want to hardcode an LR model with the coefficients 0.1, 0.5, and 0.7, call .transform on the test data to get the predictions, and compute the accuracies.
How do I hardcode a model?
Unfortunately it's not possible to just set the coefficients of a pyspark LR model. The pyspark LR model is actually a wrapper around a java ml model (see class JavaEstimator).
So when the LR model is fit, it transfers the params from the paramMap to a new java estimator, which is fit to the data. All the LogisticRegressionModel methods/attributes are just calls to the java model using the _call_java method.
Since the coefficients aren't params (you can see a comprehensive list using explainParams on a LR instance), you can't pass them to the java LR model that's created, and there is no setter method.
For example, for a logistic regression model lmr, you can see that the only setters are for the params you can set when you instantiate a pyspark LR instance: lowerBoundsOnCoefficients and upperBoundsOnCoefficients.
print([c for c in lmr._java_obj.__dir__() if "coefficient" in c.lower()])
# >>> ['coefficientMatrix', 'lowerBoundsOnCoefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$lowerBoundsOnCoefficients_$eq',
# 'getLowerBoundsOnCoefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$upperBoundsOnCoefficients_$eq',
# 'getUpperBoundsOnCoefficients', 'upperBoundsOnCoefficients', 'coefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionModel$$_coefficients']
Trying to set the "coefficients" attribute yields this:
print(lmr.coefficients)
# >>> DenseVector([18.9303, -18.9303])
lmr.coefficients = [10, -10]
# >>> AttributeError: can't set attribute
So you'd have to roll your own pyspark transformer if you want to be able to provide coefficients. It would probably be easier just to calculate results using the standard logistic function as per #pault's comment.
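A minimal sketch of that suggestion (plain NumPy, not a Spark API; the intercept value is an assumption you would fill in yourself):
import numpy as np

def logistic_predict(features, coefficients, intercept=0.0):
    """Standard logistic function applied to a linear combination."""
    z = np.dot(features, coefficients) + intercept
    return 1.0 / (1.0 + np.exp(-z))

# Hardcoded coefficients from the question, applied to one hypothetical feature row
print(logistic_predict(np.array([1.0, 2.0, 3.0]), np.array([0.1, 0.5, 0.7])))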
You can set lower and upper bounds on the coefficients of a LR model.
In your case, since you know exactly what you want, you can set the lower and upper bounds to the same numbers, and the fitted model will end up with exactly those coefficients.
You can set the coefficients as a dense matrix like this:
from pyspark.ml.linalg import Vectors, Matrices

a = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
b = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
and incorporate them into the model as hyperparameters:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10,
                        lowerBoundsOnCoefficients=a,
                        upperBoundsOnCoefficients=b,
                        threshold=0.5)
And voilà! You have your model.
You can then call fit and transform on your model:
best_mod=lr.fit(train)
predict_train=best_mod.transform(train) # train data
predict_test=best_mod.transform(test) # test data
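As a quick sanity check (a sketch; it assumes the fit converged within the bounds, and note the intercept is still fitted freely unless you also bound it, e.g. via lowerBoundsOnIntercepts/upperBoundsOnIntercepts):
print(best_mod.coefficients)  # expected: [0.1, 0.5, 0.7]
print(best_mod.intercept)     # fitted freely, not pinned by the coefficient bounds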

Spark/Pyspark: SVM - How to get Area-under-curve?

I have been dealing with random forests and naive Bayes lately. Now I want to use a support vector machine.
After fitting the model I wanted to use the output columns "probability" and "label" to compute the AUC value. But now I have seen that there is no "probability" column for SVM?!
Here you can see how I have done so far:
from pyspark.ml.classification import LinearSVC
from pyspark.mllib.evaluation import BinaryClassificationMetrics

svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
scores = model.transform(train)
results = scores.select('probability', 'label')

# Create score-label set for BinaryClassificationMetrics
results_collect = results.collect()
results_list = [(float(i[0][0]), 1.0 - float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)

metrics = BinaryClassificationMetrics(scoreAndLabels)
print("AUC-value: " + str(round(metrics.areaUnderROC, 4)))
That is how I did this in the past for random forest and naive Bayes, and I thought I could do it with SVM too... but it does not work, because there is no "probability" output column.
Does anyone know why the "probability" column does not exist, and how I can compute the AUC value now?
Using the most recent Spark/PySpark as of the time of this answer:
If you use the pyspark.ml module (unlike mllib), you can work with the DataFrame interface:
svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
test_prediction = model.transform(test)
Create the evaluator (see its source code for settings):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
Apply the evaluator to the data (again, the source code shows more options):
evaluation = evaluator.evaluate(test_prediction)
The result of evaluate is, by default, the area under the ROC curve:
print("evaluation (area under ROC): %f" % evaluation)
The SVM algorithm doesn't provide probability estimates, only scores.
There is an algorithm proposed by Platt to compute probabilities from SVM scores, but it's criticized by some and apparently not implemented in Spark.
By the way, there was a similar question: What does the score of the Spark MLLib SVM output mean?

How can I correctly use Pipeline with MinMaxScaler + NMF to predict data?

This is a very small sklearn snippet:
from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

logistic = linear_model.LogisticRegression()
pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('pca', decomposition.NMF(6)),
    ('logistic', logistic),
])

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
pipe.fit(Xtrain, ytrain)
ypred = pipe.predict(Xtest)
I will get this error:
raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to NMF (input X)
According to this question:
Scaling test data to 0 and 1 using MinMaxScaler
I know this is because
This is due to the fact that the lowest value in my test data was
lower than in the train data, on which the min-max scaler was fit
But I am wondering, is this a bug?
It seems MinMaxScaler (and all scalers) should be applied before I do the prediction; it should not depend on previously fitted training data, am I right?
Or how could I correctly use preprocessing scalers with Pipeline?
Thanks.
This is not a bug. The main reason you add the scaler to the pipeline is to prevent leaking information from your test set into your model. When you fit the pipeline to your training data, the MinMaxScaler keeps the min and max of your training data, and it will use these values to scale any other data it sees at prediction time. As you also highlighted, this min and max are not necessarily the min and max of your test data set! Therefore, you may end up with some negative values in your scaled test set when the min of your test set is smaller than the min of the training set.
You need a scaler that does not give you negative values. For instance, you may use sklearn.preprocessing.StandardScaler. Make sure that you set the parameter with_mean=False. This way, it will not center the data before scaling, but it will scale your data to unit variance.
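A sketch of that suggestion, reusing the pipeline from the question with only the scaler swapped (note this is a different transform, not an equivalent one):
from sklearn import linear_model, decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    # with_mean=False: no centering, so non-negative input stays non-negative
    ('scaler_2', StandardScaler(with_mean=False)),
    ('pca', decomposition.NMF(6)),
    ('logistic', linear_model.LogisticRegression()),
])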
If your data is stationary and sampling is done properly, you can assume that your test set resembles your train set to a large extent.
Therefore, you can expect the min/max over the test set to be close to the min/max over the train set, with the exception of a few "outliers".
To decrease the chance of producing negative values with MinMaxScaler on the test set, simply scale your data not to the (0, 1) range, but leave some "safety space" for your transformer, like this:
MinMaxScaler(feature_range=(1,2))
