Normalizing the MFCC - audio

do i need to get the normalized mel spectogram from librosa library or I dont need it to be normalized and should be ready for the CNN model ?
I tried it without normalizin the values of mel spectogram and it works just fine. However, the training time is slow. I am wondering if i should normalize it first before proceeding with the modelling.
This is my function for getting the spectogram and inputting it into the CNN model:
def get_spectogram(path, mfcc):
x, sr = librosa.load(path, res_type='kaiser_fast')
S = librosa.feature.melspectrogram(y=aug, sr=sr, n_mels=mfcc)
spec = librosa.power_to_db(S, ref=np.max)
return spec

Yes, in majority of cases you should normalise MFCC, and the most popular procedure is Cepstral mean and variance normalization (CMVN). You will find it implemented in Python in e.g. SpeechPy.
Why your code "works just fine" despite seemingly lack of normalization? As you explained in the comments, you are actually doing normalization - the "batch normalization" step. It will do, but you might squeeze some extra points by doing the normalization yourself.

Related

Is there a mean-variance normalization layer in PyTorch?

I am new to PyTorch and I would like to add a mean-variance normalization layer to my network that will normalize features to zero mean and unit standard deviation. I got a bit confused reading the documentation, could anyone give me some leads?
As #Ivan commented, the normalization can be done on many levels. However, as You say
normalize features to zero mean and unit standard deviation
I suppose You just want to input unbiased data to the network. If that's the case, You should treat it as data preprocessing step rather than a layer of Your model and basically do:
X = (X - torch.mean(X, dim=0))/torch.std(X, dim=0)
As an alternative, You can use torchvision.transforms:
preprocess = torchvision.transforms.Normalize(mean=torch.mean(X, dim=0), std=torch.std(X, dim=0))
X = preprocess(X)
as in this ResNet native example. Note how it is reasonably assumed that the future data would always have roughly the same mean and std_dev as the set that is used for their initial calculation (supposedly the training set). For this reason, we should preserve the initially calculated values and use them for preprocessing in any future inference scenario.

Which classifier is good for a scheduling problem with sklearn?

With sklearn, I am trying to model a pickup and dropoff vehicle routing problem. If you can recommend one of classifiers, it will be appreciated. For a simplicity, there is one vehicle and there are 5 customers. The training data has 20 features and 10 outputs.
Features include the x-y cords of 5 customers. Each customer has pickup and dropoff locations.
c1p_x, c1p_y,c2p_x, c2p_y,c3p_x, c3p_y,c4p_x, c4p_y,c5p_x, c5p_y,
c1d_x, c1d_y,c2d_x, c2d_y,c3d_x, c3d_y,c4d_x, c4d_y,c5d_x, c5d_y,
c1p_x, c1p_y: customer 1 pickup x-y cord.
c1d_x: c1d_y: customer 1 dropoff x-y cord.
For example,
123,106,332,418,106,477,178,363,381,349,54,214,297,34,5,122,3,441,455,322
Outputs include the optimal sequence of visit.For example, 5,10,2,7,1,6,4,9,3,8
Customer 5 (pkup) => 10 (drop) => 2 (pkup) => 7 (drop) ... => 8 (drop)
Note each pickup will be immediately followed by dropoff.
Here are codes I tried.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
train = pd.read_csv('ML_DARP_train.txt',header=None,sep=',')
print (train.head())
x = train[range(0,19)]
y = train[range(20,30)]
classifier = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False,
epsilon=1e-08, hidden_layer_sizes=(15,),
learning_rate='constant', learning_rate_init=0.001,
max_iter=200, momentum=0.9, n_iter_no_change=10,
nesterovs_momentum=True, power_t=0.5, random_state=1,
shuffle=True, solver='lbfgs', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
classifier.fit(x, y)
print(classifier.score(x, y))
test = pd.read_csv('ML_DARP_test.txt',header=None,sep=',')
test = test[range(0,19)]
print (classifier.predict(test))
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
import pandas as pd
train = pd.read_csv('ML_DARP_train.txt',header=None,sep=',')
print (train.head())
x = train[range(0,19)]
y = train[range(20,30)]
print (y)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
classifier = MultiOutputClassifier(forest, n_jobs=-1)
classifier.fit(x, y)
print(classifier.score(x, y))
test = pd.read_csv('ML_DARP_test.txt',header=None,sep=',')
test = test[range(0,19)]
print (classifier.predict(test))
Here are training data.
123,106,332,418,106,477,178,363,381,349,54,214,297,34,5,122,3,441,455,322,5,10,2,7,1,6,4,9,3,8
154,129,466,95,135,191,243,13,289,227,300,40,171,286,219,403,232,113,378,428,5,10,2,7,1,6,4,9,3,8
215,182,163,321,259,500,434,304,355,276,77,414,93,83,42,292,101,459,488,237,5,10,4,9,3,8,2,7,1,6
277,220,313,29,304,229,500,454,263,154,339,255,484,351,287,87,330,147,411,343,1,6,3,8,2,7,4,9,5,10
308,258,464,223,349,460,64,120,188,62,100,96,374,118,16,368,73,352,365,480,2,7,1,6,5,10,3,8,4,9
369,296,97,385,363,174,161,317,128,472,346,423,217,338,246,163,349,87,335,132,2,7,4,9,1,6,5,10,3,8
400,318,263,94,471,467,321,45,146,475,107,264,139,136,53,36,155,370,382,380,3,8,2,7,4,9,5,10,1,6
477,387,461,350,62,244,417,242,102,399,401,137,76,451,330,364,431,90,368,47,3,8,1,6,4,9,2,7,5,10
38,441,95,12,45,412,452,361,496,276,162,479,420,155,12,112,128,263,290,138,4,9,1,6,3,8,2,7,5,10
69,447,245,205,106,157,79,89,467,216,393,289,311,422,273,440,435,30,291,323,2,7,4,9,3,8,1,6,5,10
115,0,427,430,214,451,207,302,439,172,185,178,232,220,64,282,210,266,292,22,2,7,5,10,1,6,3,8,4,9
192,53,92,123,259,180,273,468,363,81,447,19,122,488,310,77,454,471,246,159,3,8,1,6,5,10,2,7,4,9
223,91,227,317,304,411,385,180,319,5,208,361,498,239,54,389,245,222,231,328,5,10,2,7,4,9,1,6,3,8
269,113,424,57,396,188,12,378,322,493,470,218,435,52,331,231,20,474,263,59,4,9,5,10,1,6,2,7,3,8
315,151,42,204,410,387,78,43,215,355,215,28,278,273,44,11,264,178,170,149,5,10,2,7,3,8,4,9,1,6
393,236,239,444,487,148,191,240,202,326,23,417,200,71,321,338,39,414,203,365,4,9,2,7,3,8,1,6,5,10
454,274,390,153,62,410,303,453,173,266,286,259,106,354,96,165,331,165,203,48,4,9,2,7,3,8,5,10,1,6
15,327,86,378,154,187,447,181,160,237,62,131,27,152,389,23,137,448,220,264,3,8,4,9,2,7,1,6,5,10
61,365,237,71,184,417,43,379,131,178,324,474,403,388,133,334,413,167,205,417,5,10,3,8,1,6,4,9,2,7
123,418,403,280,261,178,124,59,56,86,101,331,309,170,394,145,172,404,175,70,3,8,2,7,5,10,4,9,1,6
169,441,36,458,275,378,190,194,466,465,332,141,167,406,108,426,385,76,97,176,3,8,2,7,1,6,5,10,4,9
215,494,249,213,398,186,365,470,500,483,124,29,120,236,431,315,238,391,161,439,3,8,5,10,4,9,1,6,2,7
246,0,337,345,396,370,399,88,377,329,355,324,449,441,113,48,420,32,68,28,2,7,3,8,1,6,5,10,4,9
339,100,49,84,489,131,496,286,317,253,163,213,370,238,390,375,195,268,37,181,5,10,2,7,3,8,4,9,1,6
385,122,215,294,80,424,139,29,320,240,425,70,292,36,181,233,17,50,70,413,2,7,4,9,1,6,5,10,3,8
416,144,366,2,125,154,236,211,291,180,170,396,182,304,427,44,277,286,86,112,5,10,3,8,2,7,4,9,1,6
477,198,15,180,170,384,348,424,231,105,448,253,41,39,171,356,68,22,56,265,1,6,4,9,2,7,5,10,3,8
38,251,181,390,231,130,414,58,171,13,209,95,448,322,417,151,281,211,10,387,2,7,5,10,1,6,3,8,4,9
69,258,347,98,276,360,495,239,111,439,455,437,354,89,161,447,40,447,480,55,3,8,1,6,4,9,5,10,2,7
131,327,28,323,384,153,169,500,130,426,248,309,275,388,469,321,379,229,27,286,1,6,5,10,2,7,4,9,3,8
161,334,116,454,351,305,172,102,492,272,463,88,87,76,120,37,60,387,419,361,1,6,5,10,3,8,4,9,2,7
238,403,313,194,428,82,300,331,479,227,255,462,9,375,412,396,367,154,420,45,2,7,1,6,4,9,5,10,3,8
285,456,10,419,35,375,460,74,498,230,47,350,447,189,219,254,189,437,484,308,2,7,5,10,4,9,1,6,3,8
346,494,144,112,95,120,71,287,469,170,294,176,322,441,480,81,480,188,469,476,3,8,2,7,4,9,5,10,1,6
393,47,310,322,156,367,168,453,378,63,71,33,227,223,240,393,208,377,423,113,5,10,1,6,3,8,4,9,2,7
470,100,22,77,264,159,312,197,380,34,364,406,180,52,47,266,30,159,440,329,4,9,2,7,5,10,1,6,3,8
15,138,157,239,294,358,362,332,289,429,109,248,23,273,246,14,243,348,393,466,5,10,4,9,3,8,2,7,1,6
61,176,323,449,355,119,490,59,261,384,371,89,430,39,22,357,49,99,394,134,2,7,4,9,1,6,3,8,5,10
92,198,427,110,353,303,39,210,201,293,117,400,274,260,220,121,278,304,348,272,1,6,4,9,5,10,3,8,2,7
185,267,154,367,476,111,183,439,172,249,410,288,226,89,27,496,84,70,365,488,5,10,4,9,1,6,3,8,2,7
231,321,305,59,36,357,264,119,112,157,187,145,117,357,288,291,345,291,319,124,4,9,1,6,5,10,2,7,3,8
262,327,440,238,66,71,345,270,36,66,417,456,477,92,17,86,72,480,273,262,1,6,2,7,4,9,3,8,5,10
323,381,89,431,96,287,411,436,477,476,179,298,367,359,231,367,317,200,242,415,1,6,3,8,2,7,5,10,4,9
369,418,286,171,219,94,85,195,10,494,456,170,304,173,54,256,154,499,321,192,1,6,2,7,3,8,4,9,5,10
416,456,437,365,249,325,166,377,452,403,217,12,179,425,299,51,415,218,275,330,3,8,2,7,4,9,5,10,1,6
477,9,86,42,294,55,248,58,360,296,479,354,53,176,28,347,158,423,214,452,3,8,2,7,4,9,1,6,5,10
7,31,237,251,355,301,360,255,332,236,240,179,460,459,289,158,434,158,214,151,2,7,3,8,4,9,5,10,1,6
84,84,418,476,447,78,473,468,319,207,17,52,381,241,65,0,225,426,231,335,5,10,4,9,2,7,3,8,1,6
146,154,83,169,477,277,22,102,180,37,295,410,256,493,279,265,438,83,107,395,4,9,5,10,2,7,1,6,3,8
177,176,234,363,21,7,134,299,183,24,40,236,131,244,24,76,213,334,155,125,4,9,2,7,1,6,3,8,5,10
238,214,416,87,144,332,309,74,201,27,318,108,68,58,347,466,66,148,202,372,1,6,2,7,4,9,5,10,3,8
316,298,128,344,236,108,438,287,173,468,126,498,5,372,154,340,357,400,188,40,4,9,1,6,5,10,3,8,2,7
346,305,215,475,219,276,441,391,50,330,341,292,334,76,337,57,38,41,126,162,3,8,1,6,4,9,5,10,2,7
408,358,397,199,296,37,68,103,37,285,118,133,239,359,97,384,330,309,127,346,3,8,5,10,1,6,4,9,2,7
439,381,47,377,357,284,180,316,494,225,380,491,114,110,358,211,120,44,112,30,1,6,3,8,5,10,2,7,4,9
15,450,228,117,449,60,309,28,465,165,156,348,51,425,150,69,412,311,97,199,3,8,4,9,5,10,2,7,1,6
77,2,363,295,463,260,358,163,358,43,434,205,411,160,348,318,124,469,35,305,5,10,1,6,3,8,2,7,4,9
123,40,59,19,23,21,487,392,345,500,211,62,317,428,124,160,416,236,36,4,5,10,3,8,1,6,2,7,4,9
169,62,194,197,83,267,83,72,317,455,441,389,207,194,385,472,175,472,37,189,2,7,1,6,5,10,4,9,3,8
231,131,376,438,160,28,164,254,241,348,234,261,129,493,145,298,435,176,492,326,3,8,5,10,2,7,4,9,1,6
277,154,25,115,221,274,276,452,213,304,480,87,3,244,406,109,210,428,493,10,3,8,2,7,4,9,5,10,1,6
323,191,176,309,266,489,358,132,153,213,241,429,394,11,135,405,470,147,463,179,5,10,1,6,3,8,2,7,4,9
354,214,311,487,296,203,423,283,46,90,488,255,253,247,349,185,198,336,401,300,3,8,1,6,4,9,5,10,2,7
447,298,7,211,372,481,66,11,64,93,296,143,175,45,141,43,4,103,449,31,3,8,4,9,2,7,5,10,1,6
493,336,157,420,449,242,178,224,35,33,41,470,65,312,417,370,296,370,434,200,5,10,3,8,1,6,2,7,4,9
7,343,323,129,40,19,291,421,477,458,287,311,488,110,193,197,71,90,404,369,2,7,5,10,4,9,3,8,1,6
69,396,474,307,54,218,372,102,432,382,64,168,346,346,407,477,331,326,389,21,1,6,3,8,5,10,4,9,2,7
146,465,155,31,115,465,469,284,357,291,342,25,252,113,167,304,90,30,343,158,3,8,1,6,5,10,4,9,2,7
192,2,305,240,191,226,96,11,375,278,103,368,158,396,444,146,397,313,375,390,3,8,1,6,4,9,5,10,2,7
238,24,471,450,268,488,209,209,315,202,365,209,64,178,220,474,156,33,360,58,3,8,2,7,4,9,1,6,5,10
285,78,105,111,282,202,274,359,224,80,126,50,424,415,434,238,385,222,298,179,5,10,3,8,1,6,2,7,4,9
346,116,286,336,358,464,387,71,195,35,388,393,330,197,210,80,176,474,299,364,1,6,3,8,5,10,2,7,4,9
424,185,484,92,466,256,46,316,229,38,196,281,283,26,17,454,499,272,347,110,3,8,4,9,2,7,5,10,1,6
470,223,133,270,11,471,111,482,138,432,442,122,157,278,231,218,242,476,301,248,3,8,2,7,4,9,5,10,1,6
15,261,284,464,56,201,208,163,94,357,203,465,32,29,477,29,1,196,271,401,4,9,1,6,5,10,2,7,3,8
46,267,418,141,70,400,258,313,2,250,434,275,392,265,190,310,230,401,209,6,4,9,3,8,5,10,1,6,2,7
107,336,130,381,193,224,417,72,5,237,242,178,345,79,12,183,68,183,257,253,2,7,4,9,1,6,3,8,5,10
154,358,234,43,176,392,452,176,415,114,473,474,188,284,195,417,250,341,179,359,5,10,3,8,1,6,4,9,2,7
200,412,400,252,252,153,79,405,370,54,250,331,94,66,472,259,56,92,165,27,1,6,4,9,5,10,3,8,2,7
277,465,81,462,298,383,129,38,295,448,26,188,485,334,201,54,269,281,134,196,3,8,4,9,1,6,2,7,5,10
339,18,262,186,405,176,304,314,313,451,304,60,406,131,8,429,122,94,182,428,4,9,1,6,2,7,3,8,5,10
370,56,429,411,498,453,432,26,269,375,65,403,328,430,300,271,398,331,152,80,1,6,5,10,4,9,2,7,3,8
431,94,78,88,11,152,466,161,162,252,327,244,187,166,499,19,110,3,74,186,5,10,3,8,1,6,4,9,2,7
462,116,228,282,71,414,94,390,180,239,72,70,93,449,274,362,417,286,122,433,2,7,4,9,1,6,5,10,3,8
38,185,410,6,164,190,237,102,136,179,366,459,500,231,66,220,208,22,107,101,5,10,4,9,2,7,3,8,1,6
100,238,75,215,225,421,319,284,92,104,142,300,390,499,311,15,468,258,77,238,3,8,5,10,1,6,2,7,4,9
131,245,179,362,192,73,337,387,454,451,358,95,233,203,478,249,149,400,485,329,1,6,3,8,5,10,4,9,2,7
177,298,392,118,331,397,27,178,3,469,150,484,186,32,317,154,18,229,47,91,3,8,2,7,1,6,5,10,4,9
254,352,57,311,392,143,108,344,460,409,428,341,76,299,61,450,262,450,48,275,3,8,4,9,1,6,5,10,2,7
285,390,191,489,406,358,174,9,369,302,173,167,436,35,291,229,6,138,2,413,5,10,2,7,3,8,1,6,4,9
362,443,373,229,482,119,318,253,387,289,451,23,357,334,67,71,329,437,34,143,4,9,1,6,3,8,2,7,5,10
408,481,54,439,105,428,462,466,358,245,227,397,279,131,375,446,119,188,35,328,5,10,3,8,1,6,4,9,2,7
470,34,204,131,134,142,42,147,283,138,489,238,154,383,104,241,364,393,475,450,5,10,1,6,2,7,3,8,4,9
15,71,355,325,164,357,92,282,176,15,250,64,28,134,318,5,76,65,413,71,3,8,5,10,2,7,4,9,1,6
61,94,4,18,225,102,204,495,178,488,12,406,435,417,78,317,368,333,429,286,1,6,5,10,3,8,2,7,4,9
123,163,202,259,348,411,364,254,181,475,289,279,357,215,402,206,205,115,462,487,2,7,4,9,1,6,5,10,3,8
185,201,336,437,362,125,429,405,90,352,50,120,231,452,115,487,434,320,400,107,1,6,5,10,3,8,2,7,4,9
231,238,17,145,407,356,41,117,77,323,312,463,121,218,360,298,225,70,432,323,1,6,4,9,2,7,5,10,3,8
262,276,167,355,484,101,138,298,1,216,73,320,27,0,120,108,485,291,370,461,1,6,4,9,3,8,2,7,5,10
292,283,302,32,44,347,203,449,442,141,304,130,403,252,366,420,213,480,356,129,4,9,3,8,1,6,2,7,5,10
Sklearn is built for generic algorithms, TSP/VRP are too specific for it. Are you open to trying more specific libraries then Sklearn?
Recent advance in Reinforcement Learning seems to address TSP and VRP problems in a way that challenges the traditional Combinatorial Optimization approach.
To start with, you can look at this tutorial.
A recent paper shows a method for VRP. They also shared their code on Github.
A more recent paper claims to have a shorter training period.
Generally speaking, the architecture proposed in these papers looks on the VRP job as a whole and is better than a greedy approach by:
The training phase which goes back and forth to include future
rewards
The solution architecture includes (at least) two NN.
Encoder and Decoder. The Encoder goes thru the entire input BEFORE the Decoder starts producing the output
To summarize, if you want a quick and robust solution you can use existing open libraries such as Jsprit. If you have time for research, the resources for training a NN and can take the risk of failing, go after Reinforcement Learning.
Based on your comments, using ML purely to generate a starting point for a traditional MIP/constraint/heuristic solver is a better idea than using ML to solve the whole thing, but I believe it to still be a bad idea. In my opinion, you will find it very hard to get a useful initial solution using ML. In a few lines of code you could probably put together a heuristic to greedily grow routes for a search starting point; getting ML to do something of even roughly equivalent quality would be a lot more work, and maybe not even possible.
If you really wanted to try this (and I emphasize again that it's a bad idea), the choice of features is likely much more important than the choice of classifier. For example at the moment you're asking the classifier to learn both (a) pythagoras and then (b) what's a good route. It has to learn pythagoras because you're passing in the coords directly. ML works best when the features are engineered to make the learning task easier. Passing in a normalised distance matrix instead of the raw coords might be more succesful, because then the classifier doesn't have to learn pythagoras. However, then you have n^2 scaling in features, which would likely cause overfitting and the problems associated with that...
Alternative you could grow the route from empty using ML to decide the next stop to add each time. So ML classifier chooses the first stop, then you classify again to choose the second stop, then classify again to choose the third and so-on. This would be simpler too, though the ML will primarily just learn 'what the closest stop is to the last one'. I have known some companies to use this kind of 'ML chooses the next stop or job' approach when they're scheduling/dispatching jobs one at a time for food takeaway/on-demand deliveries - i.e. problems similar to Uber Eats delivering hot food from restaurants. This is a bit different from your case, as it's dynamic/realtime route optimisation problem, but still some companies are actually using ML in vehicle route optimisation for real. In my option it's still a bad approach though - e.g. we did a study in this video https://www.youtube.com/watch?v=EMhnXAH5dvM where we look at the effect of this kind of one-at-a-time scheduling/dispatching (which you can use ML for) vs proper route optimisation, and one-at-a-time scheduling/dispatching comes off significantly worse.

How to get started with Tensorflow

I am pretty new to Tensorflow, and I am currently learning it through given website https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:"
q1."we don't know how good it is yet.", I didn't understand this
quote as the simple model created is a simple slope equation and on
what it should train for?, as the model is a simple slope. Is it
require an perfect slope or what? why am I training that model and
for what?
q2.what is a loss function? Is loss function is used to determine the
accuracy of the model? Why is it required?
q3. I didn't understand " 'sums the squares of the deltas' between
the current model and the provided data."
q4.I didn't understood this part of code,"squared_deltas =
tf.square(linear_model - y)
this is the code:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
this may be simple questions, but I am a beginner to Tensorflow and having a hard time understanding it.
1) So you're kind of right about "Why should we train for a simple problem" but this is just an introduction piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are just trying to train to find the coefficients for the line of best fit.
2) A loss function in any machine learning context represents your error with your model. This usually means a function of your "distance" of your calculated value to the ground truth value. Think of it as an internal evaluation score. You want to minimise your loss so the gradients and parameter changes are based on your loss.
3/4) Your question here is more to do with least square regression. It's a statistical method to create lines of best fit between points. The deltas represent the differences between your calculated values and the truth values. The aim is to minimise the area of the squares and hence minise the error and have a better line of best fit.
What you are doing in this Tensorflow example is creating a machine learning model that will learn the coefficients for the line of best fit automatically using a least squares based system.
Pretty much all of your question have to-do with the loss function.
The loss function is a function that determines how far apart your output are from the expected (correct) output.
It has two usages:
Help the algorithm determine if the tweaking of the weight is helping going in the good or bad direction
Determinate the accuracy (~the number of time your system guesses the correct answer)
The loss function is the sum of the deltas witch is: the addition of the diff (delta) between the expected output and the actual output.
I think It's squared to magnifies the error the algorithm makes.

Sklearn overfitting

I have a data set containing 1000 points each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purpose. I am training it using sklearn support vector regressor. I have got 100% accuracy with training set but results obtained with test set are not good. I think it may be because of overfitting. Please can you suggest me something to solve the problem.
You may be right: if your model scores very high on the training data, but it does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model under a different situation. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR and create several models and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated using several input parameters, each of which could be set to a number of different values. For the simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
for kernel in ('rbf', 'linear'):
for c in (0.1, 1, 10):
svr = SVR(kernel=kernel, C=c, degree=4)
svr.fit(train_features, train_target)
score = svr.score(test_features, test_target)
print kernel, c, score
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
parameters = {'kernel':('linear', 'rbf'), 'C':(0.1, 1, 10)}
clf = GridSearchCV(SVC(degree=4), parameters)
clf.fit(train_features, train_target)
print clf.best_score_
print clf.best_params_
model = clf.best_estimator_ # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data using training data split Sklearn K-Fold cross-validation, this provides you a fair split of data and better model , though at a cost of performance , which should really matter for small dataset and where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note from #shahins answer (I am not allowed to add comments), both implementations are not equivalent. GridSearchCV is better since it performs cross-validation in the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data

Scikit SVM gives very poor accuracy for STL-10 dataset

I am using Scikit-learn SVM for training my model for STL-10 dataset which contains 5000 training images (10 pre-defined folds). So I have 5000*96*96*3 size dataset for training and test purposes. I used following code to train it and measure the accuracy for the test set. (80% 20%). Final result was 0.323 accuracy. How can I increase the accuracy for SVM.
This is STL10 dataset
def train_and_evaluate(clf, train_x, train_y):
clf.fit(train_x, train_y)
#make 2D array as we can apply only 2d to fit() function
nsamples, nx, ny, nz = images.shape
reshaped_train_dataset = images.reshape((nsamples, nx * ny * nz))
X_train, X_test, Y_train, Y_test = train_test_split(reshaped_train_dataset, read_labels(LABEL_PATH), test_size=0.20, random_state=33)
train_and_evaluate(my_svc, X_train, Y_train)
print(metrics.accuracy_score(Y_test, clf2.predict(X_test)))
So it seems you are using raw SVM directly on the images. That is usually not a good idea (it is rather bad actually).
I will describe the classic image-classification pipeline popular in the last decades! Keep in mind, that the highest performing approaches right now might use Deep Neural Networks to combine some of these steps (a very different approach; a lot of research in the last years!)
First step:
Preprocessing is needed!
Normalize mean and variance (i would not expect your dataset to be already normalized)
Optional: histogram-equalization
Second step:
Feature-extraction -> you should learn some features from these images. There are a lot of approaches including
(Kernel-)PCA
(Kernel-)LDA
Dictionary-learning
Matrix-factorization
Local binary patterns
... (just test with LDA initially)
Third:
SVM for classification
again there might be a Normalization-step needed before this and as mentioned in the comments by #David Batista: there might be some parameter-tuning needed (especially for Kernel-SVM)
It is also not clear, if using color-information is wise here. For more simple approaches i expect black-and-white images to be superior (you are losing information but tuning your pipeline is more robust; high-performance approaches will of course use color-information).
See here for some random tutorial describing a similar problem. While i don't know if it's good work, you could immediatly recognize the processing-pipeline mentioned above (preprocessing, feature-extraction, classifier-learning)!
Edit:
Why preprocessing?: some algorithms assume centered samples with unit-variance, therefore normalization is needed. This is (at least) very important for PCA, LDA and SVM's.

Resources