Multi-output spatial statistics with gaussian processes - scikit-learn

I've been investigating Gaussian processes lately. The perspective of probabilistic multi-output is promising in my field. In particular, spatial statistics. But I encountered three problems:
multi-ouput
overfitting and
anisotropy.
Let me run a simple case study with the meuse data set (from the R package sp).
UPDATE: The Jupyter notebook used for this question, and updated according to Grr's answer, is here.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
meuse = pd.read_csv(filepath_or_buffer='https://gist.githubusercontent.com/essicolo/91a2666f7c5972a91bca763daecdc5ff/raw/056bda04114d55b793469b2ab0097ec01a6d66c6/meuse.csv', sep=',')
For the example, we will focus on copper and lead.
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(121, aspect=1)
ax1.set_title('Lead')
ax1.scatter(x=meuse.x, y=meuse.y, s=meuse.lead, alpha=0.5, color='grey')
ax2 = fig.add_subplot(122, aspect=1)
ax2.set_title('Copper')
ax2.scatter(x=meuse.x, y=meuse.y, s=meuse.copper, alpha=0.5, color='orange')
In fact, concentrations of copper and lead are correlated.
plt.plot(meuse['lead'], meuse['copper'], '.')
plt.xlabel('Lead')
plt.ylabel('Copper')
This is thus a multi-output problem.
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
reg = GPR(kernel=RBF())
reg.fit(X=meuse[['x', 'y']], y=meuse[['lead', 'copper']])
predicted = reg.predict(meuse[['x', 'y']])
First question: Is the kernel built for correlated multi-output when y has more than one dimension? If not, how may I specify the kernel?
I keep on with the analysis to show the second problem, overfitting:
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax1.set_title('Lead')
ax1.set_xlabel('Measured')
ax1.set_ylabel('Predicted')
ax1.plot(meuse.lead, predicted[:,0], '.')
ax2 = fig.add_subplot(122)
ax2.set_title('Copper')
ax2.set_xlabel('Measured')
ax2.set_ylabel('Predicted')
ax2.plot(meuse.copper, predicted[:,1], '.')
I created a grid of x and y coordinates and all concentrations on that grid were predicted as zeros.
Finally, a last concern which particularly arises in 3D for soils: how may I specify anisotropy in such models?

First off you need to split your data. Training a model and then predicting on that same training data will look like overfitting as you have observed, but you did not test your model on any hold out data, so you have no idea how it performs in the wild. Try splitting your data with sklearn.model_selection.train_test_split like so:
X_train, X_test, y_train, y_test = train_test_split(meuse[['x', 'y']], meuse[['lead', 'copper']])
Then you can train your model. However, you also have an issue there. When you train the model the way you do you end up with a kernel with a length_scale=1e-05. Essentially there is no noise in your model. The predictions made with this setup will be centered so tightly around your input points (X_train) that you won't be able to make any predictions about the sites around them. You need to change the alpha parameter of the GaussianProcessRegressor to fix this. This is something that you will likely need to do a grid search on as the default value is 1e-10. For an example I used alpha=0.1.
reg = GPR(RBF(), alpha=0.1)
reg.fit(X_train, y_train)
predicted = reg.predict(X_test)
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax1.set_title('Lead')
ax1.set_xlabel('Measured')
ax1.set_ylabel('Predicted')
ax1.plot(y_test.lead, predicted[:,0], '.')
ax2 = fig.add_subplot(122)
ax2.set_title('Copper')
ax2.set_xlabel('Measured')
ax2.set_ylabel('Predicted')
ax2.plot(y_test.copper, predicted[:,1], '.')
That results in this graph:
As you can see there is no overfitting issue here, in fact this may be underfit. Like I said you will need to do some GridSearchCV on this model to come up with the optimal setup given your data.
So to answer your questions:
The model handles multi output quite well as is.
The overfitting can be addressed by properly splitting the data or testing on a different hold out set.
Take a look at the Radial Basis Function RBF Kernel section of the Gaussian Processes guide for some insight on applying the anisotropic kernel as opposed to the isotropic kernel we applied above.
Update for Question in Comments
When you write "the model handles multi output quite well as is", are you saying that the model "as is" is built for correlated targets, or that the model handles them quite well as a collection of independent models?
Good question. From what I understand about the GaussianProcessRegressor I do not believe that it is capable of storing multiple models internally. So this is a single model. That being said what is interesting about your question is the statement "built for correlated targets". In this situation our two targets do seem to be fairly correlated (Pearson Correlation Coefficient = 0.818, p=1.25e-38) so I really see two questions here:
For correlated data, if we built models for both targets as well as the individual targets how would the results compare?
For non-correlated data does the above hold true?
Unfortunately we cannot test the second question without creating a new "fake" dataset, which is somewhat beyond the scope of what we are doing here. We can however, answer the first question quite easily. Using our same train/test split we can train two new models with the same hyperparameters for predicting lead and copper individually. Then we can train a MultiOutputRegressor using both classes. And finally compare them all to the original model. Like so:
reg = GPR(RBF(), alpha=1)
reg.fit(X_train, y_train)
preds = reg.predict(X_test)
reg_lead = GPR(RBF(), alpha=1)
reg_lead.fit(X_train, y_train.lead)
lead_preds = reg_lead.predict(X_test)
reg_cop = GPR(RBF(), alpha=1)
reg_cop.fit(X_train, y_train.copper)
cop_preds = reg_cop.predict(X_test)
multi_reg = MultiOutputRegressor(GPR(RBF(), alpha=1))
multi_reg.fit(X_train, y_train)
multi_preds = multi_reg.predict(X_test)
Now we have several models we can compare. Let's plot the predictions and see what we get.
Interestingly there is no visible difference in the lead predictions, but there is some in the copper predictions. And these only exist between the origin GPR model and our other models. Moving on to more quantitative measures of error we can see that for explained variance the original model performs ever so slightly better than our MultiOutputRegressor. What is interesting is that the explained variance for the copper model is significantly lower than for the lead model (this in fact corresponds to the behavior of the individual components of the two other models as well). This is all very interesting and would lead us down a number of different development routes in coming to our final model.
I think the important take away here is that all of the model iterations appear to be in the same ballpark and there is no clear cut winner in this scenario. This is a case where you are going to need to do some significant grid searching and perhaps implementing the anisotropic kernel and any other domain specific knowledge would help, but as it is our example is a far cry from a useful model.

Related

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
my jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the inbuilt data-set in sklearn (breast cancer data-set) which is very similar to my data-set as both are binary classifications. Here I get an mean accuracy of 95 %. So I think right now that the problem might be my data-set. Can I get some help on how do I pre-process my data or any other steps that I might look into to improve accuracy.
Encode labels
Categorical data are variables that contain label values rather than numeric values.The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group etc. We will use Label Encoder to label the categorical data. Label Encoder is the part of SciKit Learn library in Python and used to convert categorical data, or text data, into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use StandardScaler method from SciKit-Learn library.
#Feature Scalingfrom sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You kight also want to vhoose the appropriate model. You can't just use neural nets or so for all problems it's the no free luch theorem. For this you could use K-fold cross validation, AIC and BIC

Which classifier is good for a scheduling problem with sklearn?

With sklearn, I am trying to model a pickup and dropoff vehicle routing problem. If you can recommend one of classifiers, it will be appreciated. For a simplicity, there is one vehicle and there are 5 customers. The training data has 20 features and 10 outputs.
Features include the x-y cords of 5 customers. Each customer has pickup and dropoff locations.
c1p_x, c1p_y,c2p_x, c2p_y,c3p_x, c3p_y,c4p_x, c4p_y,c5p_x, c5p_y,
c1d_x, c1d_y,c2d_x, c2d_y,c3d_x, c3d_y,c4d_x, c4d_y,c5d_x, c5d_y,
c1p_x, c1p_y: customer 1 pickup x-y cord.
c1d_x: c1d_y: customer 1 dropoff x-y cord.
For example,
123,106,332,418,106,477,178,363,381,349,54,214,297,34,5,122,3,441,455,322
Outputs include the optimal sequence of visit.For example, 5,10,2,7,1,6,4,9,3,8
Customer 5 (pkup) => 10 (drop) => 2 (pkup) => 7 (drop) ... => 8 (drop)
Note each pickup will be immediately followed by dropoff.
Here are codes I tried.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
train = pd.read_csv('ML_DARP_train.txt',header=None,sep=',')
print (train.head())
x = train[range(0,19)]
y = train[range(20,30)]
classifier = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False,
epsilon=1e-08, hidden_layer_sizes=(15,),
learning_rate='constant', learning_rate_init=0.001,
max_iter=200, momentum=0.9, n_iter_no_change=10,
nesterovs_momentum=True, power_t=0.5, random_state=1,
shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
classifier.fit(x, y)
print(classifier.score(x, y))
test = pd.read_csv('ML_DARP_test.txt',header=None,sep=',')
test = test[range(0,19)]
print (classifier.predict(test))
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
import pandas as pd
train = pd.read_csv('ML_DARP_train.txt',header=None,sep=',')
print (train.head())
x = train[range(0,19)]
y = train[range(20,30)]
print (y)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
classifier = MultiOutputClassifier(forest, n_jobs=-1)
classifier.fit(x, y)
print(classifier.score(x, y))
test = pd.read_csv('ML_DARP_test.txt',header=None,sep=',')
test = test[range(0,19)]
print (classifier.predict(test))
Here are training data.
123,106,332,418,106,477,178,363,381,349,54,214,297,34,5,122,3,441,455,322,5,10,2,7,1,6,4,9,3,8
154,129,466,95,135,191,243,13,289,227,300,40,171,286,219,403,232,113,378,428,5,10,2,7,1,6,4,9,3,8
215,182,163,321,259,500,434,304,355,276,77,414,93,83,42,292,101,459,488,237,5,10,4,9,3,8,2,7,1,6
277,220,313,29,304,229,500,454,263,154,339,255,484,351,287,87,330,147,411,343,1,6,3,8,2,7,4,9,5,10
308,258,464,223,349,460,64,120,188,62,100,96,374,118,16,368,73,352,365,480,2,7,1,6,5,10,3,8,4,9
369,296,97,385,363,174,161,317,128,472,346,423,217,338,246,163,349,87,335,132,2,7,4,9,1,6,5,10,3,8
400,318,263,94,471,467,321,45,146,475,107,264,139,136,53,36,155,370,382,380,3,8,2,7,4,9,5,10,1,6
477,387,461,350,62,244,417,242,102,399,401,137,76,451,330,364,431,90,368,47,3,8,1,6,4,9,2,7,5,10
38,441,95,12,45,412,452,361,496,276,162,479,420,155,12,112,128,263,290,138,4,9,1,6,3,8,2,7,5,10
69,447,245,205,106,157,79,89,467,216,393,289,311,422,273,440,435,30,291,323,2,7,4,9,3,8,1,6,5,10
115,0,427,430,214,451,207,302,439,172,185,178,232,220,64,282,210,266,292,22,2,7,5,10,1,6,3,8,4,9
192,53,92,123,259,180,273,468,363,81,447,19,122,488,310,77,454,471,246,159,3,8,1,6,5,10,2,7,4,9
223,91,227,317,304,411,385,180,319,5,208,361,498,239,54,389,245,222,231,328,5,10,2,7,4,9,1,6,3,8
269,113,424,57,396,188,12,378,322,493,470,218,435,52,331,231,20,474,263,59,4,9,5,10,1,6,2,7,3,8
315,151,42,204,410,387,78,43,215,355,215,28,278,273,44,11,264,178,170,149,5,10,2,7,3,8,4,9,1,6
393,236,239,444,487,148,191,240,202,326,23,417,200,71,321,338,39,414,203,365,4,9,2,7,3,8,1,6,5,10
454,274,390,153,62,410,303,453,173,266,286,259,106,354,96,165,331,165,203,48,4,9,2,7,3,8,5,10,1,6
15,327,86,378,154,187,447,181,160,237,62,131,27,152,389,23,137,448,220,264,3,8,4,9,2,7,1,6,5,10
61,365,237,71,184,417,43,379,131,178,324,474,403,388,133,334,413,167,205,417,5,10,3,8,1,6,4,9,2,7
123,418,403,280,261,178,124,59,56,86,101,331,309,170,394,145,172,404,175,70,3,8,2,7,5,10,4,9,1,6
169,441,36,458,275,378,190,194,466,465,332,141,167,406,108,426,385,76,97,176,3,8,2,7,1,6,5,10,4,9
215,494,249,213,398,186,365,470,500,483,124,29,120,236,431,315,238,391,161,439,3,8,5,10,4,9,1,6,2,7
246,0,337,345,396,370,399,88,377,329,355,324,449,441,113,48,420,32,68,28,2,7,3,8,1,6,5,10,4,9
339,100,49,84,489,131,496,286,317,253,163,213,370,238,390,375,195,268,37,181,5,10,2,7,3,8,4,9,1,6
385,122,215,294,80,424,139,29,320,240,425,70,292,36,181,233,17,50,70,413,2,7,4,9,1,6,5,10,3,8
416,144,366,2,125,154,236,211,291,180,170,396,182,304,427,44,277,286,86,112,5,10,3,8,2,7,4,9,1,6
477,198,15,180,170,384,348,424,231,105,448,253,41,39,171,356,68,22,56,265,1,6,4,9,2,7,5,10,3,8
38,251,181,390,231,130,414,58,171,13,209,95,448,322,417,151,281,211,10,387,2,7,5,10,1,6,3,8,4,9
69,258,347,98,276,360,495,239,111,439,455,437,354,89,161,447,40,447,480,55,3,8,1,6,4,9,5,10,2,7
131,327,28,323,384,153,169,500,130,426,248,309,275,388,469,321,379,229,27,286,1,6,5,10,2,7,4,9,3,8
161,334,116,454,351,305,172,102,492,272,463,88,87,76,120,37,60,387,419,361,1,6,5,10,3,8,4,9,2,7
238,403,313,194,428,82,300,331,479,227,255,462,9,375,412,396,367,154,420,45,2,7,1,6,4,9,5,10,3,8
285,456,10,419,35,375,460,74,498,230,47,350,447,189,219,254,189,437,484,308,2,7,5,10,4,9,1,6,3,8
346,494,144,112,95,120,71,287,469,170,294,176,322,441,480,81,480,188,469,476,3,8,2,7,4,9,5,10,1,6
393,47,310,322,156,367,168,453,378,63,71,33,227,223,240,393,208,377,423,113,5,10,1,6,3,8,4,9,2,7
470,100,22,77,264,159,312,197,380,34,364,406,180,52,47,266,30,159,440,329,4,9,2,7,5,10,1,6,3,8
15,138,157,239,294,358,362,332,289,429,109,248,23,273,246,14,243,348,393,466,5,10,4,9,3,8,2,7,1,6
61,176,323,449,355,119,490,59,261,384,371,89,430,39,22,357,49,99,394,134,2,7,4,9,1,6,3,8,5,10
92,198,427,110,353,303,39,210,201,293,117,400,274,260,220,121,278,304,348,272,1,6,4,9,5,10,3,8,2,7
185,267,154,367,476,111,183,439,172,249,410,288,226,89,27,496,84,70,365,488,5,10,4,9,1,6,3,8,2,7
231,321,305,59,36,357,264,119,112,157,187,145,117,357,288,291,345,291,319,124,4,9,1,6,5,10,2,7,3,8
262,327,440,238,66,71,345,270,36,66,417,456,477,92,17,86,72,480,273,262,1,6,2,7,4,9,3,8,5,10
323,381,89,431,96,287,411,436,477,476,179,298,367,359,231,367,317,200,242,415,1,6,3,8,2,7,5,10,4,9
369,418,286,171,219,94,85,195,10,494,456,170,304,173,54,256,154,499,321,192,1,6,2,7,3,8,4,9,5,10
416,456,437,365,249,325,166,377,452,403,217,12,179,425,299,51,415,218,275,330,3,8,2,7,4,9,5,10,1,6
477,9,86,42,294,55,248,58,360,296,479,354,53,176,28,347,158,423,214,452,3,8,2,7,4,9,1,6,5,10
7,31,237,251,355,301,360,255,332,236,240,179,460,459,289,158,434,158,214,151,2,7,3,8,4,9,5,10,1,6
84,84,418,476,447,78,473,468,319,207,17,52,381,241,65,0,225,426,231,335,5,10,4,9,2,7,3,8,1,6
146,154,83,169,477,277,22,102,180,37,295,410,256,493,279,265,438,83,107,395,4,9,5,10,2,7,1,6,3,8
177,176,234,363,21,7,134,299,183,24,40,236,131,244,24,76,213,334,155,125,4,9,2,7,1,6,3,8,5,10
238,214,416,87,144,332,309,74,201,27,318,108,68,58,347,466,66,148,202,372,1,6,2,7,4,9,5,10,3,8
316,298,128,344,236,108,438,287,173,468,126,498,5,372,154,340,357,400,188,40,4,9,1,6,5,10,3,8,2,7
346,305,215,475,219,276,441,391,50,330,341,292,334,76,337,57,38,41,126,162,3,8,1,6,4,9,5,10,2,7
408,358,397,199,296,37,68,103,37,285,118,133,239,359,97,384,330,309,127,346,3,8,5,10,1,6,4,9,2,7
439,381,47,377,357,284,180,316,494,225,380,491,114,110,358,211,120,44,112,30,1,6,3,8,5,10,2,7,4,9
15,450,228,117,449,60,309,28,465,165,156,348,51,425,150,69,412,311,97,199,3,8,4,9,5,10,2,7,1,6
77,2,363,295,463,260,358,163,358,43,434,205,411,160,348,318,124,469,35,305,5,10,1,6,3,8,2,7,4,9
123,40,59,19,23,21,487,392,345,500,211,62,317,428,124,160,416,236,36,4,5,10,3,8,1,6,2,7,4,9
169,62,194,197,83,267,83,72,317,455,441,389,207,194,385,472,175,472,37,189,2,7,1,6,5,10,4,9,3,8
231,131,376,438,160,28,164,254,241,348,234,261,129,493,145,298,435,176,492,326,3,8,5,10,2,7,4,9,1,6
277,154,25,115,221,274,276,452,213,304,480,87,3,244,406,109,210,428,493,10,3,8,2,7,4,9,5,10,1,6
323,191,176,309,266,489,358,132,153,213,241,429,394,11,135,405,470,147,463,179,5,10,1,6,3,8,2,7,4,9
354,214,311,487,296,203,423,283,46,90,488,255,253,247,349,185,198,336,401,300,3,8,1,6,4,9,5,10,2,7
447,298,7,211,372,481,66,11,64,93,296,143,175,45,141,43,4,103,449,31,3,8,4,9,2,7,5,10,1,6
493,336,157,420,449,242,178,224,35,33,41,470,65,312,417,370,296,370,434,200,5,10,3,8,1,6,2,7,4,9
7,343,323,129,40,19,291,421,477,458,287,311,488,110,193,197,71,90,404,369,2,7,5,10,4,9,3,8,1,6
69,396,474,307,54,218,372,102,432,382,64,168,346,346,407,477,331,326,389,21,1,6,3,8,5,10,4,9,2,7
146,465,155,31,115,465,469,284,357,291,342,25,252,113,167,304,90,30,343,158,3,8,1,6,5,10,4,9,2,7
192,2,305,240,191,226,96,11,375,278,103,368,158,396,444,146,397,313,375,390,3,8,1,6,4,9,5,10,2,7
238,24,471,450,268,488,209,209,315,202,365,209,64,178,220,474,156,33,360,58,3,8,2,7,4,9,1,6,5,10
285,78,105,111,282,202,274,359,224,80,126,50,424,415,434,238,385,222,298,179,5,10,3,8,1,6,2,7,4,9
346,116,286,336,358,464,387,71,195,35,388,393,330,197,210,80,176,474,299,364,1,6,3,8,5,10,2,7,4,9
424,185,484,92,466,256,46,316,229,38,196,281,283,26,17,454,499,272,347,110,3,8,4,9,2,7,5,10,1,6
470,223,133,270,11,471,111,482,138,432,442,122,157,278,231,218,242,476,301,248,3,8,2,7,4,9,5,10,1,6
15,261,284,464,56,201,208,163,94,357,203,465,32,29,477,29,1,196,271,401,4,9,1,6,5,10,2,7,3,8
46,267,418,141,70,400,258,313,2,250,434,275,392,265,190,310,230,401,209,6,4,9,3,8,5,10,1,6,2,7
107,336,130,381,193,224,417,72,5,237,242,178,345,79,12,183,68,183,257,253,2,7,4,9,1,6,3,8,5,10
154,358,234,43,176,392,452,176,415,114,473,474,188,284,195,417,250,341,179,359,5,10,3,8,1,6,4,9,2,7
200,412,400,252,252,153,79,405,370,54,250,331,94,66,472,259,56,92,165,27,1,6,4,9,5,10,3,8,2,7
277,465,81,462,298,383,129,38,295,448,26,188,485,334,201,54,269,281,134,196,3,8,4,9,1,6,2,7,5,10
339,18,262,186,405,176,304,314,313,451,304,60,406,131,8,429,122,94,182,428,4,9,1,6,2,7,3,8,5,10
370,56,429,411,498,453,432,26,269,375,65,403,328,430,300,271,398,331,152,80,1,6,5,10,4,9,2,7,3,8
431,94,78,88,11,152,466,161,162,252,327,244,187,166,499,19,110,3,74,186,5,10,3,8,1,6,4,9,2,7
462,116,228,282,71,414,94,390,180,239,72,70,93,449,274,362,417,286,122,433,2,7,4,9,1,6,5,10,3,8
38,185,410,6,164,190,237,102,136,179,366,459,500,231,66,220,208,22,107,101,5,10,4,9,2,7,3,8,1,6
100,238,75,215,225,421,319,284,92,104,142,300,390,499,311,15,468,258,77,238,3,8,5,10,1,6,2,7,4,9
131,245,179,362,192,73,337,387,454,451,358,95,233,203,478,249,149,400,485,329,1,6,3,8,5,10,4,9,2,7
177,298,392,118,331,397,27,178,3,469,150,484,186,32,317,154,18,229,47,91,3,8,2,7,1,6,5,10,4,9
254,352,57,311,392,143,108,344,460,409,428,341,76,299,61,450,262,450,48,275,3,8,4,9,1,6,5,10,2,7
285,390,191,489,406,358,174,9,369,302,173,167,436,35,291,229,6,138,2,413,5,10,2,7,3,8,1,6,4,9
362,443,373,229,482,119,318,253,387,289,451,23,357,334,67,71,329,437,34,143,4,9,1,6,3,8,2,7,5,10
408,481,54,439,105,428,462,466,358,245,227,397,279,131,375,446,119,188,35,328,5,10,3,8,1,6,4,9,2,7
470,34,204,131,134,142,42,147,283,138,489,238,154,383,104,241,364,393,475,450,5,10,1,6,2,7,3,8,4,9
15,71,355,325,164,357,92,282,176,15,250,64,28,134,318,5,76,65,413,71,3,8,5,10,2,7,4,9,1,6
61,94,4,18,225,102,204,495,178,488,12,406,435,417,78,317,368,333,429,286,1,6,5,10,3,8,2,7,4,9
123,163,202,259,348,411,364,254,181,475,289,279,357,215,402,206,205,115,462,487,2,7,4,9,1,6,5,10,3,8
185,201,336,437,362,125,429,405,90,352,50,120,231,452,115,487,434,320,400,107,1,6,5,10,3,8,2,7,4,9
231,238,17,145,407,356,41,117,77,323,312,463,121,218,360,298,225,70,432,323,1,6,4,9,2,7,5,10,3,8
262,276,167,355,484,101,138,298,1,216,73,320,27,0,120,108,485,291,370,461,1,6,4,9,3,8,2,7,5,10
292,283,302,32,44,347,203,449,442,141,304,130,403,252,366,420,213,480,356,129,4,9,3,8,1,6,2,7,5,10
Sklearn is built for generic algorithms, TSP/VRP are too specific for it. Are you open to trying more specific libraries then Sklearn?
Recent advance in Reinforcement Learning seems to address TSP and VRP problems in a way that challenges the traditional Combinatorial Optimization approach.
To start with, you can look at this tutorial.
A recent paper shows a method for VRP. They also shared their code on Github.
A more recent paper claims to have a shorter training period.
Generally speaking, the architecture proposed in these papers looks on the VRP job as a whole and is better than a greedy approach by:
The training phase which goes back and forth to include future
rewards
The solution architecture includes (at least) two NN.
Encoder and Decoder. The Encoder goes thru the entire input BEFORE the Decoder starts producing the output
To summarize, if you want a quick and robust solution you can use existing open libraries such as Jsprit. If you have time for research, the resources for training a NN and can take the risk of failing, go after Reinforcement Learning.
Based on your comments, using ML purely to generate a starting point for a traditional MIP/constraint/heuristic solver is a better idea than using ML to solve the whole thing, but I believe it to still be a bad idea. In my opinion, you will find it very hard to get a useful initial solution using ML. In a few lines of code you could probably put together a heuristic to greedily grow routes for a search starting point; getting ML to do something of even roughly equivalent quality would be a lot more work, and maybe not even possible.
If you really wanted to try this (and I emphasize again that it's a bad idea), the choice of features is likely much more important than the choice of classifier. For example at the moment you're asking the classifier to learn both (a) pythagoras and then (b) what's a good route. It has to learn pythagoras because you're passing in the coords directly. ML works best when the features are engineered to make the learning task easier. Passing in a normalised distance matrix instead of the raw coords might be more succesful, because then the classifier doesn't have to learn pythagoras. However, then you have n^2 scaling in features, which would likely cause overfitting and the problems associated with that...
Alternative you could grow the route from empty using ML to decide the next stop to add each time. So ML classifier chooses the first stop, then you classify again to choose the second stop, then classify again to choose the third and so-on. This would be simpler too, though the ML will primarily just learn 'what the closest stop is to the last one'. I have known some companies to use this kind of 'ML chooses the next stop or job' approach when they're scheduling/dispatching jobs one at a time for food takeaway/on-demand deliveries - i.e. problems similar to Uber Eats delivering hot food from restaurants. This is a bit different from your case, as it's dynamic/realtime route optimisation problem, but still some companies are actually using ML in vehicle route optimisation for real. In my option it's still a bad approach though - e.g. we did a study in this video https://www.youtube.com/watch?v=EMhnXAH5dvM where we look at the effect of this kind of one-at-a-time scheduling/dispatching (which you can use ML for) vs proper route optimisation, and one-at-a-time scheduling/dispatching comes off significantly worse.

How to correctly scale new data points sklearn

Imagine a simple regression problem, where you are using Gradient Descent. For correct implementation you will need to scale values using mean of entire training dataset. Imagine your model is already trained and you feed it another example you wish to predict. How do you scale it correctly with respect to previous dataset? Do you include new example in training set and then scale it with mean of this training dataset + new data points? How this should be done the right way?
By referring to new data points i mean something the model hasn't seen before, neither in training nor testing. How do you handle scaling for anything you pass to regr.predict() if scaling of training set is done with respect to the whole set and not a single observation.
Imagine you have ndarray of features:
to_predict = [10, 12, 1, 330, 1311, 225].
The dataset used for train and test is already oscillating around 0 for every feature. Taking into account the below answer (pseudo code, that's why I'm asking how to do it right):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
new_Xs = X_train.append(to_predict)
X_train_std_with_new = scalar.fit_transform(new_Xs)
scaled_to_predit = X_train_std_with_new[-1]
regr.predict(scaled_to_predict) ??

How do I go about predicting Closing Price of a Financial Symbol (EURUSD) using Machine Learning?

I did a simple experiment using EURUSD OHLC 1-Day data.My features were Open Price, Low Price, High Price, and I was trying to predict the future Closing price.
The code worked, as expected, but the results were very misleading.
I got a 99% Accuracy score, which as we all know is impossible.
1) So what I am I doing wrong?
2) How can I correct my mistakes?
The official system I am building would have BoP, PPI, Interest Rate, GDP, and a lot of Momentum indicators, etc. as Features, over some 60 features.
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#import pickle
# 1. Read the EURUSD csv data.
# 2. Process the DataFrame, using only the Open, High, Low, Close columns.
df = pd.read_csv( 'EURUSD1440.csv', index_col= 'Date' )
df = df[['Open','High','Low','Close']]
array = df.values
# Features consist of Open, High, Low column, and stored in x.
# Label is the Close column stored in y.
x = array[:,0:3]
y = array[:,3]
# Split Data into Test and Train.
# 60% Train and 40% Test.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.4 )
# 1. Train the Model using .fit method.
# 2. Predict the future Closing prices using the .predict method.
# 3. Know how Accurate the Model is using the .score method.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
model = LinearRegression()
model.fit( x_train, y_train )
forecast = model.predict( x_test )
accuracy = model.score( x_test, y_test )
print( forecast, accuracy )
user3666197 discussion of the flaws of the concept is right on spot.
following extensive research, I would attest that the only option for utilizing the basic model of machine learning, that is load > transform> fit > predict using sklearn or keras or even tbot to automate model parameter optimization would be to incorporate some future-predicted/calculated "data of some relation"
to point you in the right direction, experiment with the following :
Astrology data, provided by NASA horizon system
Solar wind and Geomagnatism data provided by NASA.
furthermore, its more practical to focus your work on Features engineering and selection rather than model selection.
best of luck.
Prologue:Having been several decades in quantitative modelling and operating a set of 4th Gen distributed system with M/L predictors, I can guarantee even your 60-features' to be overly optimistic. One might assume about an order of magnitude higher dimensionality space, containing both technical and fundamental factors, to reasonably train a model with, if the ambition is to go beyond just an academic paper. Why? The Market Rules.
Your experiment exhibits two types of principal errors:
The first - a conceptual miss:the Machine Learning task, striving to predict a continuous value is Regression, ( no "classification" Labels, but Regression target values ) for which a metric for "a prediction success" is not a score, but some sort of absolute, PriceDOMAIN distance measures. Yes, distance, not a percent, as it is translated into a monetary reward by a trade execution.
Any attempt to use a percentage does not provide means to compare any two Regression models one against another and is incoherent with highly non-linear professional risk-management.
This post's footprint does not provide space enough to discuss additional dependencies for defining + assessment of a successful Trading TruStrategy, operating in at least 5-dimensions of policies -{ Select, Detect, Act, Allocate, Terminate }-Policy. Without a full TruStrategy SDAAT-model parameters definition, there is no chance to compute any performance expectations of a Market ride of any trading model under review.
Next:
Your model exhibits peeking into the future. You have allowed the model to learn from values, the reality will never give you at hand at the time of prediction, so except some clairvoyance, the model is principally skewed from the training DataSET and will never provide a fair service in real circumstances.
Epilogue:
One need not be shy to make this mistake, as Google has published their own Machine Learning "success" doing the very same error. ( If interested in details, search for Michal Illich + Google Machine Learning blogs on their experience ).
Ex post:
Do not give up. If your project is well-funded, has a reasonable technical infrastructure in place & has a reasonable grounding in the business domain, one can hire a mix of professional knowledge to have a FOREX market predictions engine working within a reasonable time and budget.
Reinventing a wheel could not be more expensive in the FOREX costs of failure realms.

Sklearn overfitting

I have a data set containing 1000 points each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purpose. I am training it using sklearn support vector regressor. I have got 100% accuracy with training set but results obtained with test set are not good. I think it may be because of overfitting. Please can you suggest me something to solve the problem.
You may be right: if your model scores very high on the training data, but it does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model under a different situation. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR and create several models and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated using several input parameters, each of which could be set to a number of different values. For the simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
for kernel in ('rbf', 'linear'):
for c in (0.1, 1, 10):
svr = SVR(kernel=kernel, C=c, degree=4)
svr.fit(train_features, train_target)
score = svr.score(test_features, test_target)
print kernel, c, score
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
parameters = {'kernel':('linear', 'rbf'), 'C':(0.1, 1, 10)}
clf = GridSearchCV(SVC(degree=4), parameters)
clf.fit(train_features, train_target)
print clf.best_score_
print clf.best_params_
model = clf.best_estimator_ # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data using training data split Sklearn K-Fold cross-validation, this provides you a fair split of data and better model , though at a cost of performance , which should really matter for small dataset and where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note from #shahins answer (I am not allowed to add comments), both implementations are not equivalent. GridSearchCV is better since it performs cross-validation in the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data

Resources