creating baseline regression model with average and min values in python - python-3.x

I want to compare the results of my regression analysis (with encoded categorical variables) against two baseline models, where the baseline predictions are the average or the min value of each group. I've chosen R-squared and MAE for the comparison. Below is a simplified example of my code for illustration. It works in that it gives me an output which I think achieves my goal. Is this the correct and/or best way to do this?
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
df = pd.DataFrame([['a1','c1',10],
['a1','c2',15],
['a1','c3',20],
['a1','c1',15],
['a2','c2',20],
['a2','c3',15],
['a2','c1',20],
['a2','c2',15],
['a3','c3',20],
['a3','c3',15],
['a3','c3',15],
['a3','c3',20]], columns=['aid','cid','T'])
df_dummies = pd.get_dummies(df, columns=['aid','cid'],prefix_sep='',prefix='')
df_dummies
X = df_dummies
y = df_dummies['T']
# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
# Baseline model with group average as prediction
y_pred = df.groupby('aid').agg({'T': ['mean']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
# Baseline model with group min as prediction
y_pred = df.groupby('aid').agg({'T': ['min']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))

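First of all, I would give each set of predictions its own name instead of reusing y_pred every time, to avoid confusion.
In general:
y_pred = df.groupby('aid').agg({'T': ['mean']})
will give you the mean of 'T' for each 'aid' group.
And y_pred = df.groupby('aid').agg({'T': ['min']}) will give you the minimum of 'T' for each group.
Note that both return one row per group (three here), not one prediction per test row; they only pass through the metrics because y_test also happens to have three rows.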
There is an interesting class for you: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html
It implements simple baseline regressors and supports several strategies (mean, median, quantile, constant). Note that it predicts one value for all rows rather than one value per group.
In your case it should work like this:
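from sklearn.dummy import DummyRegressor
df_dummies = pd.get_dummies(df, columns=['aid','cid'],prefix_sep='',prefix='')
X = df_dummies.drop(columns='T')  # keep the target out of the features
y = df['T']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# constant baseline: always predict the smallest target seen in training
min_value = y_train.min()
dummy_min = DummyRegressor(strategy='constant', constant=min_value)
dummy_min.fit(X_train, y_train)
y_pred_min = dummy_min.predict(X_test)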
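As a rough sketch of per-group baselines (one possible approach, not necessarily the best one): the group statistic is computed on the training rows only and then mapped onto each test row through its aid, so the predictions line up one-to-one with the test targets. The helper name group_baseline, the variables train_df and test_df, and the fallback to the global training statistic for groups missing from the training rows are assumptions made for illustration.

```python
import pandas as pd
from sklearn import metrics
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

df = pd.DataFrame(
    [['a1', 'c1', 10], ['a1', 'c2', 15], ['a1', 'c3', 20], ['a1', 'c1', 15],
     ['a2', 'c2', 20], ['a2', 'c3', 15], ['a2', 'c1', 20], ['a2', 'c2', 15],
     ['a3', 'c3', 20], ['a3', 'c3', 15], ['a3', 'c3', 15], ['a3', 'c3', 20]],
    columns=['aid', 'cid', 'T'])

train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)


def group_baseline(train, test, how):
    """Predict each test row with a statistic ('mean' or 'min') of its aid group."""
    group_stat = train.groupby('aid')['T'].agg(how)
    # Assumption: a group absent from the training rows gets the global value.
    fallback = train['T'].agg(how)
    return test['aid'].map(group_stat).fillna(fallback)


for how in ('mean', 'min'):
    pred = group_baseline(train_df, test_df, how)
    print(how, 'R-squared:', metrics.r2_score(test_df['T'], pred))
    print(how, 'MAE:', metrics.mean_absolute_error(test_df['T'], pred))

# Global (ungrouped) mean baseline with DummyRegressor, for comparison.
X_all = pd.get_dummies(df[['aid', 'cid']])
dummy_mean = DummyRegressor(strategy='mean')
dummy_mean.fit(X_all.loc[train_df.index], train_df['T'])
dummy_pred = dummy_mean.predict(X_all.loc[test_df.index])
print('global mean MAE:', metrics.mean_absolute_error(test_df['T'], dummy_pred))
```

With only three test rows the scores are very noisy, so on real data it is probably worth comparing the model and the baselines under cross-validation instead of a single split.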

Related

How to test unseen test data with cross validation and predict labels?

1. The CSV that contains the data (i.e. text descriptions) along with categorized labels
df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']
2. This CSV contains the unseen data (i.e. text descriptions) for which labels need to be predicted
df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']
Cross-validation function that operates on the training data (item #1) above:
def cross_val():
    cv = KFold(n_splits=20)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(X)
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    scores = cross_val_score(clf, X_train, y, cv=cv)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
cross_val()
I need to know how to pass the unseen data (item #2) to the cross-validation function and how to predict its labels.
With scores = cross_val_score(clf, X_train, y, cv=cv) you only get the cross-validated scores of the model. cross_val_score internally splits the data into training and testing folds based on the cv parameter.
So the values that you get are the cross-validated accuracies of the SVC.
To get the score on the unseen data, you can first fit the model, e.g.
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y) # the model is trained now
and then call clf.score(X_unseen, y_unseen), where X_unseen has been transformed with the same fitted vectorizer (vectorizer.transform, not fit_transform) and y_unseen holds the true labels of the unseen data.
This returns the accuracy of the model on the unseen data. If the unseen data has no labels, clf.predict(X_unseen) returns the predicted labels instead.
EDIT: The best way to do what you want is the following using a GridSearch to first find the best model using the training data and then evaluate the best model using the unseen (test) data:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# load some data
iris = datasets.load_iris()
X, y = iris.data, iris.target
#split data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# hyperparameter tuning of the SVC model
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
# fit the GridSearch using the TRAINING data
grid_searcher = GridSearchCV(svc, parameters)
grid_searcher.fit(X_train, y_train)
#recover the best estimator (best parameters for the SVC, based on the GridSearch)
best_SVC_model = grid_searcher.best_estimator_
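# Now, check how this best model behaves on the test set.
# best_estimator_ is already refit on the whole training set (refit=True by default),
# so score it directly; cross_val_score would refit a clone on the test data instead.
test_accuracy = best_SVC_model.score(X_test, y_test)
print(test_accuracy)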
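For the text data in the question, a minimal sketch of the same idea could look like the following. The tiny corpus, the label names and the variable names are hypothetical stand-ins for df['description_plus'], df['category_id'] and df_2['description_plus']; the vectorizer is placed inside the pipeline so that it is refit on each training fold, and the StandardScaler from the question is left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-ins for the labelled CSV and the unseen CSV.
labelled_texts = ["red apple fruit", "green apple juice", "sweet apple pie",
                  "fast sports car", "old car engine", "electric car battery"]
labels = ["food", "food", "food", "vehicle", "vehicle", "vehicle"]
unseen_texts = ["apple pie recipe", "car engine repair"]

# Vectorizer and classifier in one estimator: fit/predict accept raw strings.
text_clf = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                         SVC(kernel='linear', C=1))

# Step 1: estimate performance with cross-validation on the labelled data.
cv = StratifiedKFold(n_splits=3)
scores = cross_val_score(text_clf, labelled_texts, labels, cv=cv)
print("CV accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Step 2: fit on all labelled data, then predict labels for the unseen texts.
text_clf.fit(labelled_texts, labels)
predicted_labels = text_clf.predict(unseen_texts)
print(list(predicted_labels))
```

Cross-validation only estimates how well the model generalises; the predictions for the unseen file come from the separate fit/predict step.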

Using python 3 how to get co-variance/variance

I have a simple linear regression model and I need to compute the variance and the covariance. How can I calculate variance and covariance using linear regression?
Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([2,3,4,5]).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = np.array([4,3,2,9])
#train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train the model using the training sets
model = LinearRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
Try this for the variance of the predictions and their covariance with the true values:
y_variance = np.mean((y_predict - np.mean(y_predict))**2)
y_covariance = np.mean((y_predict - np.mean(y_predict)) * (y_test - np.mean(y_test)))
Note: The covariance here measures how the predictions vary together with their true values.
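If what is needed is the plain statistical variance and covariance of the data, NumPy already provides both. The sketch below (variable names are illustrative) also checks the textbook relation for simple linear regression, slope = cov(x, y) / var(x), against the coefficient fitted by LinearRegression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([2.0, 3.0, 4.0, 5.0])
y = np.array([4.0, 3.0, 2.0, 9.0])

var_x = np.var(x)                    # population variance (ddof=0)
cov_xy = np.cov(x, y, ddof=0)[0, 1]  # population covariance of x and y

# For simple linear regression the fitted slope equals cov(x, y) / var(x).
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope_from_moments = cov_xy / var_x

print("var(x):", var_x)
print("cov(x, y):", cov_xy)
print("fitted slope:", model.coef_[0])
print("cov / var:", slope_from_moments)
```

Pass ddof=1 to np.var and np.cov for the sample (unbiased) versions; the ratio, and therefore the slope, stays the same as long as both use the same ddof.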

Error encountered: Classification metrics can't handle a mix of multiclass-multioutput and binary targets

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
file = './BBC.csv'
df = read_csv(file)
array = df.values
X = array[:, 0:11]
Y = array[:, 11]
test_size = 0.30
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = RandomForestClassifier()
model.fit(X_train, Y_train)
result = model.score(X_test, X_test)
print("Accuracy: %.3f%%" % (result*100.0))
dataset: https://www.dropbox.com/s/ar1c9yuv5x774cv/BBC.csv?dl=0
I have encountered this error:
Classification metrics can't handle a mix of multiclass-multioutput and binary targets
If I'm not wrong, RandomForest should be able to handle both classes (classification) and means (regression). Am I wrong?
Edit:
I checked your dataset. For the classification task, the problem lies in your code:
result = model.score(X_test, X_test)
The second argument should be the true labels, i.e. model.score(X_test, Y_test).
-----kind of off-topic-----
If you want to use a random forest for regression, you should use RandomForestRegressor instead.
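A short sketch of the corrected call, using synthetic data from make_classification as a stand-in for BBC.csv (an assumption, since the real file is not reproduced here). The regressor is fitted on the same 0/1 labels purely to show that its score is R^2 rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 11 feature columns and one binary label column.
X, Y = make_classification(n_samples=200, n_features=11, random_state=45)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=45)

clf = RandomForestClassifier(random_state=45)
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)  # features first, true labels second
print("Accuracy: %.3f%%" % (accuracy * 100.0))

reg = RandomForestRegressor(random_state=45)
reg.fit(X_train, Y_train)
r2 = reg.score(X_test, Y_test)  # for regressors, score() returns R^2
print("R^2: %.3f" % r2)
```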

How to compute accuracy and the confusion matrix using K-fold cross-validation?

I tried to do K-fold cross-validation with K=30 folds, with one confusion matrix for each fold. How can I compute the accuracy and the confusion matrix of the model, with a confidence interval?
Could someone help me?
My code is:
import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression
UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')
previsores = UNSW.iloc[:,UNSW.columns.isin(('sload','dload',
    'spkts','dpkts','swin','dwin','smean','dmean',
    'sjit','djit','sinpkt','dinpkt','tcprtt','synack','ackdat','ct_srv_src','ct_srv_dst','ct_dst_ltm',
    'ct_src_ltm','ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm')) ].values
classe= UNSW.iloc[:, -1].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    previsores, classe, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
#((90, 4), (90,))
print(X_test.shape, y_test.shape)
#((60, 4), (60,))
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
print(previsores.shape)
########K FOLD
print('########K FOLD########K FOLD########K FOLD########K FOLD')
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
kf = KFold(n_splits=30, random_state=None, shuffle=False)
kf.get_n_splits(previsores)
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]
    logmodel.fit(X_train, y_train)
    print (confusion_matrix(y_test, logmodel.predict(X_test)))
    print(10* '#')
For accuracy, I would use the function cross_val_score, which does exactly what you are looking for. It returns an array of 30 validation accuracies (one per fold); you can then compute their mean and standard deviation and build an approximate confidence interval (mean ± 2*std).
Since a confusion matrix cannot be treated as a single performance metric (it is a matrix, not a number), I would recommend creating a list and appending the validation confusion matrix of each fold to it (currently you just print it). At the end, you can use this list to extract a lot of interesting information.
UPDATE:
...
...
cm_holder = []
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]
    logmodel.fit(X_train, y_train)
    cm_holder.append(confusion_matrix(y_test, logmodel.predict(X_test)))
Note that len(cm_holder) == 30 and each element is an array of shape (n_classes, n_classes), provided every class appears in each fold.
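Both suggestions can be sketched together as follows. The iris data, the 10 stratified folds and the variable names are assumptions chosen so that the example runs on its own; the same pattern should carry over to previsores and classe with n_splits=30.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)  # stand-in data set
class_labels = np.unique(y)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Accuracy per fold, summarised as mean +/- 2 standard deviations.
scores = cross_val_score(model, X, y, cv=cv)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), 2 * scores.std()))

# One confusion matrix per fold, collected in a list.
cm_holder = []
for train_index, test_index in cv.split(X, y):
    model.fit(X[train_index], y[train_index])
    fold_pred = model.predict(X[test_index])
    # labels=... keeps every matrix the same shape even if a fold misses a class
    cm_holder.append(confusion_matrix(y[test_index], fold_pred,
                                      labels=class_labels))

# Summing the per-fold matrices gives one matrix covering every sample once.
total_cm = np.sum(cm_holder, axis=0)
print(total_cm)
```

The mean ± 2*std range is only a rough interval, since fold scores are not independent of each other.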

cross validation and text classification for imbalanced data

I am new to NLP and I am trying to build a text classifier, but my data is currently imbalanced. The largest category has as many as 280 entries, while the smallest has only about 30.
I am trying to use cross-validation on the current data, but after days of looking I am still unable to implement it, even though it looks pretty straightforward. Here is my code:
y = resample.Subsystem
X = resample['new description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
#SVM
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
# n_iter was replaced by max_iter in newer scikit-learn versions
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42))])
text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
print('The best accuracy is : ',np.mean(predicted_svm == y_test))
I have also tried grid search and a stemmer, but right now I would like to get cross-validation working on this code. I have cleaned the data pretty well, but I am still getting an accuracy of only 60%.
Any help would be appreciated.
Try oversampling or undersampling. Because the data is highly imbalanced, the classifier is biased toward the classes with more data points. After over- or undersampling this bias should be reduced, and accuracy on the smaller classes may improve.
Alternatively, you can try an MLP instead of the SVM; it may give good results even with imbalanced data.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
from sklearn.model_selection import RepeatedKFold
kf = RepeatedKFold(n_splits=20, n_repeats=10, random_state=None)
for train_index, test_index in kf.split(X):
    #print("Train:", train_index, "Validation:",test_index)
    # X and y are pandas Series here, so select rows by position with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
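One way to combine these ideas is sketched below: stratified folds keep the class proportions in every split, class_weight='balanced' reweights the classes inside the classifier (used here in place of explicit over- or undersampling), and macro F1 is reported because plain accuracy can look good on imbalanced data while the small classes are ignored. The toy corpus and all names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical imbalanced corpus: 12 "engine" texts versus 4 "brake" texts.
texts = np.array(["engine oil leak %d" % i for i in range(12)]
                 + ["brake pad worn %d" % i for i in range(4)])
labels = np.array(["engine"] * 12 + ["brake"] * 4)

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no information leaks from the validation fold.
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', class_weight='balanced',
                          random_state=42)),
])

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(text_clf, texts, labels, cv=skf, scoring='f1_macro')
print("Macro F1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Each validation fold keeps the 3:1 class ratio of the full data set.
for train_index, test_index in skf.split(texts, labels):
    print(np.unique(labels[test_index], return_counts=True))
```

With the smallest class at about 30 entries, the number of folds should stay well below 30 so that every fold still contains several examples of that class.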
