SVM classification - bad input shape error

I'm getting a "bad input shape" error. I tried searching, but I can't understand it yet since I'm new to SVM.
train.csv
testing.csv
# importing required libraries
import numpy as np
import pandas as pd

# import support vector classifier
from sklearn.svm import SVC

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

X = pd.read_csv("train.csv")
y = pd.read_csv("testing.csv")
clf = SVC()
clf.fit(X, y)
clf.decision_function(X)
print(clf.predict(X))
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1, 6)

The problem here is that you are passing your entire training table (data points plus labels) as X, and the entire testing table (data points plus labels) as y, and then trying to predict on that. It does not work that way.
What you need to do is train the SVM on your training data, split into feature columns (X_train) and a label column (y_train), and then test it against your testing data, split the same way (X_test and y_test).
Your code should look like this instead:
import pandas as pd
from sklearn.svm import SVC

# Load the training and testing datasets from the .csv files
training_dataset = pd.read_csv("train.csv")
testing_dataset = pd.read_csv("testing.csv")

# Training data points with all relevant feature columns
# (replace the column names below with the ones in your files)
X_train = training_dataset[['feature1', 'feature2', 'feature3', 'feature4']]
# Training labels
y_train = training_dataset['label']

# Testing data points with the same feature columns
X_test = testing_dataset[['feature1', 'feature2', 'feature3', 'feature4']]
# Testing labels
y_test = testing_dataset['label']

clf = SVC()
# Train the SVC on the training data (data points and labels)
clf.fit(X_train, y_train)
# Evaluate the decision function on the test samples
clf.decision_function(X_test)
# Predict the test samples
print(clf.predict(X_test))
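If you also want a quick measure of how well the trained model generalizes, you could score the predictions against the held-out labels (a small addition to the code above; accuracy_score is just one of several possible metrics):
from sklearn.metrics import accuracy_score

# compare predictions with the true test labels
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))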
I hope that helps and that this code runs for you. Let me know if I misunderstood something or if you have more questions. :)

Related

Data Normalization in Tensorflow Model

I have a TensorFlow regression model that I have been working with. The model is tuned well and gets good results while training; however, when I evaluate it, the results are horrible. I did some research and found that I am not normalizing my test features and labels, so I suspect that is where the problem is. My thought is to normalize the whole dataset before splitting it into train and test sets, but I am getting an attribute error that has me stumped.
Here is the code sample. Please help :)
# concatenate the surface data and single_downhole_col into a single dataframe
training_Data = pd.concat([surface_Data, single_downhole_col], axis=1)
#print('training data shape:', training_Data.shape)
#print(training_Data.head())

# normalize the data using keras
model_normalizer_layer = tf.keras.layers.Normalization(axis=-1)
model_normalizer_layer.adapt(training_Data)
normalized_training_Data = model_normalizer_layer(training_Data)

# convert the data frame to array
dataset = normalized_training_Data.copy()
dataset.tail()

# create a training and test set
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# check the data
train_dataset.describe().transpose()

# split features from labels
train_features = train_dataset.copy()
test_features = test_dataset.copy()
And if there is any interest in knowing how the normalizer layer is used in the model, please see below:
def build_and_compile_model(data):
    model = keras.Sequential([
        model_normalizer_layer,
        layers.Dense(260, input_dim=401, activation='relu'),
        layers.Dense(80, activation='relu'),
        #layers.Dense(40, activation='relu'),
        layers.Dense(1)
    ])
I found that quasimodo's suggestion of normalizing the dataset before processing it in my model was the ideal solution. It scaled the data 0 to 1 for all columns as expected and allowed me to display the data prior to training to validate that it was correct.
For whatever reason, keras.layers.Normalization was not working in my case.
from sklearn import preprocessing

# scale every column to the 0-1 range before it reaches the model
x = training_Data.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
training_Data = pd.DataFrame(x_scaled)

# normalize the data using keras
model_normalizer_layer = tf.keras.layers.Normalization(axis=-1)
model_normalizer_layer.adapt(training_Data)
normalized_training_Data = model_normalizer_layer(training_Data)
The only part that I have yet to figure out is how to scale the predicted data from the model back to the original ranges of the column. I'm sure it's simple, but I'm stumped.
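One way to do this (a sketch, not from the original thread; the train_labels and model names below are assumptions standing in for the poster's own variables) is to fit a separate MinMaxScaler on just the label column, so that its inverse_transform can map predictions back to the original units:
from sklearn import preprocessing
import numpy as np

# hypothetical: scale the label column with its own dedicated scaler
label_scaler = preprocessing.MinMaxScaler()
y_scaled = label_scaler.fit_transform(train_labels.values.reshape(-1, 1))

# ... train the model on the scaled labels, then undo the scaling afterwards
predictions_scaled = model.predict(test_features)
predictions = label_scaler.inverse_transform(
    np.asarray(predictions_scaled).reshape(-1, 1)
)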

Viewing model coefficients for a single prediction

I have a logistic regression model housed in a scikit-learn pipeline using the following:
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        solver='lbfgs',
        cv=10,
        scoring='roc_auc',
        class_weight='balanced'
    )
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
I can view the model's coefficients for predictions as a whole with this code ...
# Look at model's coefficients to see what features are most important
plt.rcParams['figure.dpi'] = 50
model = pipeline.named_steps['logisticregressioncv']
coefficients = pd.Series(model.coef_[0], X_train.columns)
plt.figure(figsize=(10,12))
coefficients.sort_values().plot.barh(color='grey');
Which returns a bar plot of the features and their coefficients.
What I'm trying to do is be able to see how different input values for a single observation impact its prediction. The idea is to be able to run predictions on a sample population and examine the group with "low" predictions ... for example if I run predictions for 10 observations, I'd like to see how different input values impacted each of those 10 predictions, individually.
I recalled that I can achieve this via SHAP values, using something along the lines of the following (but with LinearExplainer instead of TreeExplainer):
# Instantiate model and encoder outside of pipeline for use with shap
model = RandomForestClassifier(random_state=25)

# Fit on train, score on val
model.fit(X_train_encoded, y_train2)
y_pred_shap = model.predict(X_val_encoded)

# Get an individual observation to explain.
row = X_test_encoded.iloc[[-3]]

# Why did the model predict this?
# Look at a Shapley Values Force Plot
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value[1],
    shap_values=shap_values[1],
    features=row
)
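For the pipeline above, a minimal sketch of that LinearExplainer route might look like this (an illustration only, assuming a shap version whose LinearExplainer accepts background data directly; the features must first be scaled with the pipeline's StandardScaler, since the coefficients were learned in scaled space):
import pandas as pd
import shap

# pull the fitted steps back out of the pipeline
scaler = pipeline.named_steps['standardscaler']
model = pipeline.named_steps['logisticregressioncv']

# shap needs the features in the same (scaled) space the model was fit in
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

explainer = shap.LinearExplainer(model, X_test_scaled)

# explain a single observation, e.g. the third-from-last row
row = X_test_scaled.iloc[[-3]]
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=explainer.shap_values(row),
    features=row
)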

Error when classifying new Linear SVM dataframe

I created a multi-class classification model with Linear SVM, but I am not able to classify a newly loaded dataframe (the data that must be classified); I get the error below.
What should I do to convert my new text (df.reason_text) to TF-IDF and classify it (call model.predict(?)) with my model?
Training Model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stopwords)
features = tfidf.fit_transform(training.Description).toarray()
labels = training.category_id

model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, training.index, test_size=0.33, random_state=0
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Now I'm not able to convert my new dataframe so that it can be classified.
Load New DataFrame for Classification
from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir='s3://athenaxxxxxxxx/result/',
               region_name='us-east-2')
df = pd.read_sql("select * from data.classification_text_reason", conn)

features2 = tfidf.fit_transform(df.reason_text).toarray()
features2.shape
After I convert the new dataframe's text with TF-IDF and try to classify it, I get the following error:
y_pred1 = model.predict(features2)
ValueError: X has 1272 features per sample; expecting 5319
When you load a new DF for classification, you are calling fit_transform() again, but you should be calling only transform().
fit_transform() description: Learn vocabulary and idf, return term-document matrix.
transform() description: Transform documents to document-term matrix.
You need to use the vectorizer that was fitted when training the algorithm, so the code would be:
features2 = tfidf.transform(df.reason_text).toarray()
If you still have the feature-shape error, there may be a problem with the shapes of the arrays. Solve the transform part first, and if the error still occurs, post an example of the train and test data in array format and I will keep helping.
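Note that this only works while the fitted tfidf object is still in memory. If the new dataframe is classified in a separate script or session, one option (an addition to the answer above, with assumed file names) is to persist the fitted vectorizer and model with joblib and reload them later:
import joblib

# after training: save the fitted vectorizer and model
joblib.dump(tfidf, 'tfidf.joblib')
joblib.dump(model, 'linear_svc.joblib')

# in the classification script: reload and reuse them
tfidf = joblib.load('tfidf.joblib')
model = joblib.load('linear_svc.joblib')

features2 = tfidf.transform(df.reason_text).toarray()
y_pred1 = model.predict(features2)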

how to transform test data to pick same features as training dataset

Simply put, I am trying to apply the same feature selection to my test data as I did to the training set; however, the test set doesn't have the exact same shape.
def get_important_features(X_train, Y_train, X_test):
    '''
    :param X_train: features of training set of type scipy.sparse.csr_matrix
    :param Y_train: labels of training set of type scipy.sparse.csr_matrix
    :param X_test: features of test set of type scipy.sparse.csr_matrix
    :return:
    '''
    select_percentile = SelectPercentile(chi2, percentile=75)
    print(X_train.shape)
    print(X_test.shape)
    X_new_train = select_percentile.fit_transform(X_train, Y_train)
    #print(select_percentile.get_support(indices=True))
    X_new_test = select_percentile.transform(X_test)
    return X_new_train, X_new_test
The training set has shape (836, 3188) and the test set has shape (633, 3187); as you can see, the testing set does not have the same shape as the training set. However, all I care about is picking only the features that exist in the training set after applying chi2. Also, as you might know, X_new_test = select_percentile.transform(X_test) throws ValueError: X has a different shape than during fitting, for the reason mentioned above. Is there any way I can extract these features from X_test without using transform(X_test)?
Note: the input is a csr matrix, not a dataframe; I get these values from libsvm-format documents.
from sklearn.datasets import load_svmlight_file

train = load_svmlight_file(train_file_name)
X_train = train[0]
Y_train = train[1]

test = load_svmlight_file(test_file_name)
X_test = test[0]
Y_test = test[1]
I tried your function and it works. Make sure you are passing the data in the correct way. Below is a minimal example for your reference:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

# dummy data
train = pd.DataFrame(np.random.randint(1000, size=(50, 10)), columns=['A'+str(x) for x in range(10)])
test = pd.DataFrame(np.random.randint(1000, size=(30, 9)), columns=['A'+str(x) for x in range(9)])

# assuming the last column is the target variable
X_new_train, X_new_test = get_important_features(train.iloc[:,:-1], train.iloc[:,-1], test)
print(X_new_train.shape, X_new_test.shape)
(50, 6) (30, 6)
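Note that the dummy example works because the train features and the test set happen to have the same number of columns. The mismatch in the question comes from the libsvm files themselves: if the test file happens not to contain the highest-numbered feature, load_svmlight_file infers a narrower matrix. One possible fix (a sketch, not part of the original answer) is to pin n_features when loading the test file:
from sklearn.datasets import load_svmlight_file

X_train, Y_train = load_svmlight_file(train_file_name)
# force the test matrix to have the same number of columns as the training one
X_test, Y_test = load_svmlight_file(test_file_name, n_features=X_train.shape[1])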

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I've run into a problem with scaling future unseen (much higher) stock close values.
Let's say I use random forest regression to predict the stock price. I break the data into a train set and a test set.
For the train set, I fit and transform with StandardScaler, and then fit the regressor.
For the test set, I transform with the same StandardScaler, then predict with the regressor and compare against the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out or hit a ceiling. The problem is that the StandardScaler was fit on the train set; the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these extreme data points.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def test(X, y):
    # split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)

    # preprocessing fit transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train label
    model.fit(X_train, y_train)

    # transform on test data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))

    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted vs. actual close price over time:
The problem mainly comes from the model you are using. RandomForestRegressor is built from decision trees, which learn to map inputs to outputs for the examples in the training set. Trees cannot extrapolate: a prediction is an average of target values seen during training, so it can never exceed the range of the training targets. Consequently, a random forest regressor will work for middle values, but for extreme values it hasn't seen during training it will behave exactly as your picture shows.
What you want is to learn a function that can extrapolate, using linear/polynomial regression directly or more advanced time-series algorithms like ARIMA.
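To see the ceiling effect in isolation, here is a small self-contained sketch with synthetic data (illustrative only, not from the original answer): a forest trained on x in [0, 10] caps its predictions near the largest training target, while a linear model keeps extrapolating.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# train on x in [0, 10] with target y = 2x
X_train = np.linspace(0, 10, 100).reshape(-1, 1)
y_train = 2 * X_train.ravel()

# ask both models about x values far outside the training range
X_future = np.array([[12.0], [15.0], [20.0]])

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

print(forest.predict(X_future))  # stays near 20, the largest training target
print(linear.predict(X_future))  # roughly [24, 30, 40], keeps extrapolating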
