Accelerating the prediction - scikit-learn

Prediction with the SVM model created with 5 features and 3000 samples using default parameters is taking unexpectedely longer time (more than hour) with 5 features and 100000 samples. Is there way of accelerating the prediction?

A few issues to consider here:
Have you standardized your input matrix X? SVM is not scale-invariant, so it could be difficult for the algo to do classification if they takes a large number of raw inputs without proper scaling.
The choice of parameter C: Higher C allows a more complicated non-smooth decision boundary and it takes much more time to fit under this complexity. So decreasing the value C from default 1 to a lower value could accelerate the process.
It's also recommended to choose a proper value of gamma. This could be done via Grid-Search-Cross-Validation.
Here is the code to do grid-search cross validation. I ignore the test set here for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
# generate some artificial data
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9])
# make a pipeline for convenience
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
# set up parameter space, we want to tune SVC params C and gamma
# the range below is 10^(-5) to 1 for C and 0.01 to 100 for gamma
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# choose your customized scoring function, popular choices are f1_score, accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
# construct grid search
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
# what's the best estimator
gscv.best_params_
Out[20]: {'svc__C': 1.0, 'svc__gamma': 0.21544346900318834}
# what's the best score, in our case, roc_auc_score
gscv.best_score_
Out[22]: 0.86819366014152421
Note: the SVC is still not running very fast. It takes more than 40s to compute 50 possible combinations of params.
%time gscv.fit(X, y)
CPU times: user 42.6 s, sys: 959 ms, total: 43.6 s
Wall time: 43.6 s

Because the number of features is relatively low, I would start with decreasing the penalty parameter. It controls the penalty for mislabeled samples in the train data, and as your data contains 5 features, I guess it is not exactly linearly separable.
Generally, this parameter (C) allows the classifier to have larger margin on the account of higher accuracy (see this for more information)
By default, C=1.0. Start with svm = SVC(C=0.1) and see how it goes.

One reason might be that the parameter gamma is not the same.
By default sklearn.svm.SVC uses RBF kernel and gamma is 0.0, in which case 1/n_features will be used instead. So gamma is different given different number of features.
In terms of suggestions, I agree with Jianxun's answer.

Related

Sklearn TruncatedSVD not showing explained variance ration in descending order, or first number means something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ration)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
is there a bug in the TruncatedSVD implementation? or why is the first explained variance (0.02...) behaving like this? or what is the meaning
Summary:
That is because TruncatedSVD and PCA use different SVD functions!.
Note: Your case is due to Reason 2 below, yet I included another reason for future readers.
Details:
Reason 1: The solver set by user in each algorithm, is different:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
Screenshot from the above-mentioned scipy.linalg.svd link:
On the other hand, TruncatedSVD uses scipy.sparse.linalg.svds which relies on the ARPACK solver for decomposition.
Screenshot from the above-mentioned scipy.sparse.linalg.svds link:
Reason 2: The TruncatedSVD operates differently compared to PCA:
In your case you chose randomized as a solver (which is set by default) in both algorithms, yet you obtained different results with regards to the order of the variance.
That is because in PCA, the variance is obtained from the actual singular values (called Sigma or S in Scikit-Learn implementation), which are already sorted:
On the other hand, the variance in TruncatedSVD is obtained from X_transformed which results from multiplying the data matrix by the components. The latter does not necessarily preserve order because data are not centered, nor is it the purpose of TruncatedSVD which it is used in first place for sparse matrices:
Now if you center your data, you will get them sorted (note that you did not center data properly, because centering requires dividing by standard deviation):
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
Important: Further read.

What is the difference between decision function and score_samples in isolation_forest in SKLearn

I have read the documentation of the decision function and score_samples here, but could not figure out what is the difference between these two methods and which one should I use for an outlier detection algorithm.
Any help would be appreciated.
See the documentation for the attribute offset_:
Offset used to define the decision function from the raw scores. We have the relation: decision_function = score_samples - offset_. offset_ is defined as follows. When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1. When a contamination parameter different than “auto” is provided, the offset is defined in such a way we obtain the expected number of outliers (samples with decision function < 0) in training.
The User Guide references the paper Isolation forest written by Fei Tony, Kai Ming and Zhi-Hua.
I did not read the paper, but I think you can use either output to detect outliers. The documentation says score_samples is the opposite of decision_function, so I thought they would be inversely related, but both outputs seem to have the exact same relationship with the target. The only difference is that they are on different ranges. In fact, they even have the same variance.
To see this, I fit the model to the breast cancer dataset available in sklearn and visualized the average of the target variable grouped by the deciles of each output. As you can see, they both have the exact same relationship.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
# Load data
X = load_breast_cancer()['data']
y = load_breast_cancer()['target']
# Fit model
clf = IsolationForest()
clf.fit(X, y)
# Split the outputs into deciles to see their relationship with target
t = pd.DataFrame({'target':y,
'decision_function':clf.decision_function(X),
'score_samples':clf.score_samples(X)})
t['bins_decision_function'] = pd.qcut(t['decision_function'], 10)
t['bins_score_samples'] = pd.qcut(t['score_samples'], 10)
# Visualize relationship
plt.plot(t.groupby('bins_decision_function')['target'].mean().values, lw=3, label='Decision Function')
plt.plot(t.groupby('bins_score_samples')['target'].mean().values, ls='--', label='Score Samples')
plt.legend()
plt.show()
Like I said, they even have the same variance:
t[['decision_function','score_samples']].var()
> decision_function 0.003039
> score_samples 0.003039
> dtype: float64
In conclusion, you can use them interchangeably as they both share the same relationship with the target.
As was previously stated in #Ben Reiniger's answer,
decision_function = score_samples - offset_. For further clarification...
If contamination = 'auto', then offset_ is fixed to 0.5
If contamination is set to something other than 'auto', then
offset is no longer fixed.
This can be seen under the fit function in the source code:
def fit(self, X, y=None, sample_weight=None):
...
if self.contamination == "auto":
# 0.5 plays a special role as described in the original paper.
# we take the opposite as we consider the opposite of their score.
self.offset_ = -0.5
return self
# else, define offset_ wrt contamination parameter
self.offset_ = np.percentile(self.score_samples(X),
100. * self.contamination)
Thus, it's important to take note of what contamination is set to, as well as which anomaly scores you are using. score_samples returns what can be thought of as the "raw" scores, as it is unaffected by offset_, whereas decision_function is dependent on offset_

Cross-validation gives unexpected results on Boston Housing without shuffling

I am getting surprising results for Boston Housing. The following code produces very different results when I apply cross-validation to the original Boston Housing dataset and to its randomly shuffled version:
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
boston = load_boston()
knn = KNeighborsRegressor(n_neighbors=1)
print(cross_val_score(knn, boston.data, boston.target))
X, y = shuffle(boston.data, boston.target, random_state=0)
print(cross_val_score(knn, X, y))
The output is:
[-1.07454938 -0.50761407 0.00351173]
[0.30715435 0.36369852 0.51817514]
Even if the order of the original dataset is not random, why are the 1 Nearest Neighbor predictions so poor for it? Thank you.
The order of the original dataset is not random at all. Since I do not normalize the dataset, two features essentially determine the Euclidean distances: feature 9 (with mean around 408) and feature 11 (mean around 357); I think these are PTRATIO (pupil-teacher ratio by town) and LSTAT (% lower status of the population). The graph of PTRATIO is
.
Roughly the last third of all samples have very high values of this feature; so the third fold (and scikit-learn uses 3 folds by default) is highly anomalous. The graph for LSTAT looks less striking, but nevertheless the last third is still highly anomalous. So it's not surprising that the 1NN results are so poor without shuffling.

Python Simple Regression Plot

I'm learning python and I want to perform a simple linear regression on a .csv dataset. I've successfully imported the data file. If I have data for 8 five year periods and I want to do simple linear regression how would I do this? The data is by county/state. So my headers are county, state, 1980,1985 etc.. Appreciate any help.
Please specify what target label you have in mind.
Anyways, use the sklearn library and pandas.
val= pd.DataFrame(your.data);
val.columns = your.headers;
Assuming you have a target header called "Price".
from sklearn.linear_model import LinearRegression
X = val.drop('Price',axis=1)
X contains the data on which LR will be performed.
Now create a Linear Regression object.
lm = LinearRegression()
Start the fitting:
lm.fit()
Predict your targets:
lm.predict(x)
That's it.
Almost all real world problems that you are going to encounter will have more than two variables, so let's skip the basic linear regression example. Linear regressions involving multiple variables is called "multiple linear regression". The steps to perform multiple linear regression are almost similar to that of simple linear regression. The difference lies in the evaluation. You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
In this section we will use multiple linear regression to predict the gas consumptions (in millions of gallons) in 48 US states based upon gas taxes (in cents), per capita income (dollars), paved highways (in miles) and the proportion of population that has a driver’s license.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
dataset = pd.read_csv('C:/your_Path_here/petrol_consumption.csv')
dataset.head()
dataset.describe()
Result:
Index ... Consumption_In_Millions
count 48.00 ... 48.000000
mean 24.50 ... 576.770833
std 14.00 ... 111.885816
min 1.00 ... 344.000000
25% 12.75 ... 509.500000
50% 24.50 ... 568.500000
75% 36.25 ... 632.750000
max 48.00 ... 968.000000
Preparing the Data
The next step is to divide the data into attributes and labels as we did previously. However, unlike last time, this time around we are going to use column names for creating an attribute set and label. Execute the following script:
X = dataset[['Petrol_Tax', 'Average_Income', 'Paved_Highways', 'ProportionWithLicenses']]
y = dataset['Consumption_In_Millions']
Execute the following code to divide our data into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training the Algorithm
And finally, to train the algorithm we execute the same code as before, using the fit() method of the LinearRegression class:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
As said earlier, in case of multivariable linear regression, the regression model has to find the most optimal coefficients for all the attributes. To see what coefficients our regression model has chosen, execute the following script:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
Result:
Coefficient
Petrol_Tax -40.016660
Average_Income -0.065413
Paved_Highways -0.004741
ProportionWithLicenses 1341.862121
This means that for a unit increase in "petrol_tax", there is a decrease of 24.19 million gallons in gas consumption. Similarly, a unit increase in proportion of population with a driver’s license results in an increase of 1.324 billion gallons of gas consumption. We can see that "Average_income" and "Paved_Highways" have a very little effect on the gas consumption.
Making Predictions
To make pre-dictions on the test data, execute the following script:
y_pred = regressor.predict(X_test)
To compare the actual output values for X_test with the predicted values, execute the following script:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
Result:
Actual Predicted
29 534 469.391989
4 410 545.645464
26 577 589.668394
30 571 569.730413
32 577 649.774809
37 704 646.631164
34 487 511.608148
40 587 672.475177
7 467 502.074782
10 580 501.270734
Evaluating the Algorithm
The final step is to evaluate the performance of algorithm. We'll do this by finding the values for MAE, MSE and RMSE. Execute the following script:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Results:
Mean Absolute Error: 56.822247479
Mean Squared Error: 4666.34478759
Root Mean Squared Error: 68.3106491522
You can see that the value of root mean squared error is 60.07, which is slightly greater than 10% of the mean value of the gas consumption in all states. This means that our algorithm was not very accurate but can still make reasonably good predictions.
There are many factors that may have contributed to this inaccuracy, a few of which are listed here:
1. Need more data: Only one year worth of data isn't that much, whereas having multiple years worth could have helped us improve the accuracy quite a bit.
2. Bad assumptions: We made the assumption that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.
3. Poor features: The features we used may not have had a high enough correlation to the values we were trying to predict.
Note: the data set used in this example is available from here.
http://people.sc.fsu.edu/~jburkardt/datasets/regression/x16.txt
Finally, see the two links below for more info on this topic.
https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html

Random forest regression severely overfits single variable data

I am trying to use sklearn's random forest regression for a toy example. I generated 500 uniform random numbers between 1 and 100 as the predictor variables, and then took their logs and added Gaussian noise to form the response variables.
I've heard that random forests typically work well out of the box, so I was expecting a reasonable looking curve, but this is what I got:
I don't understand why the random forest seems to hit each data point. Because of bagging, each tree is missing some fraction of the data, so when all of the trees are averaged it seems like the curve should be more smoothed out, and not hit the outliers.
I'd appreciate any help in understanding why this model overfits so much.
Here's the code I used to generate the plot:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt
def create_design_matrix(x_array):
return x_array.reshape((x_array.shape[0],1))
N = 1000
x_array = np.random.uniform(1, 100, N)
y_array = np.log(x_array) + np.random.normal(0, 0.5, N)
model = RandomForestClassifier(n_estimators=100)
model = model.fit(create_design_matrix(x_array), y_array)
test_x = np.linspace(1.0, 100.0, num=10000)
test_y = model.predict(create_design_matrix(test_x))
plt.plot(x_array, y_array, 'ro', linewidth=5.0)
plt.plot(test_x, test_y)
plt.show()
Thank you!
First of all, this is a regression problem, not classification.
If you have as many classes as samples, a decision tree will fit it,
for me this is not an overfit.
If you think this is an overfit, you may reduce the depth of the decision tree.

Resources