I am trying to recreate a tutorial made by Nick Becker. It is located at https://beckernick.github.io/oversampling-modeling/
The code he has posted works when you copy and paste it into a Jupyter notebook.
I am trying to recreate this with a different data set that is also highly imbalanced. It is an Airbnb data set provided by Inside Airbnb, which I have manipulated and re-uploaded here: https://drive.google.com/file/d/0B4EEyCnbIf1fLTd2UU5SWVNxV29oNHVkc3ZyY2JId3UyRWtv/view?usp=drivesdk
I have created a notebook in which I have dropped rows with null values, averaged the review score, and mapped scores of 1, 2, and 3 to 1 (Negative) and scores of 4 and 5 to 0 (Positive).
I then followed the exact steps provided in Nick Becker's model, and when I get to the "Creating the Training and Test Sets" portion I get an error.
**** I have added an additional question toward the end because the error was solved in the comments****
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-1c632a59b870> in <module>
1 training_features, test_features, \
----> 2 training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1)
KeyError: "['average_review_score'] not found in axis"
The above is a shortened version of the full error message.
I did notice that in Nick's code, even though he includes "bad_loans" in his model_variables (which he then creates dummies for), when you actually look at the "price_relevant_encoded" dataframe there are no dummy columns created for "bad_loans". My equivalent of "bad_loans" is "average_review_score", and dummies are created for it. I believe that is my problem. The bad part for me is that I do not know how to get around it. My end goal is to be able to get a more realistic prediction model for ratings depending on property type, room type, and neighborhood.
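The workaround I have in mind is to keep the target out of the dummy encoding entirely and leave it numeric, roughly like this (a sketch, not the code I currently run):
# Sketch (assumption): encode only the categorical features, keep the target numeric
feature_cols = ['neighbourhood_cleansed', 'property_type', 'room_type']
features_encoded = pd.get_dummies(dfclean[feature_cols])
target = dfclean['average_review_score'].astype(int)   # 0/1 as integers, not strings
training_features, test_features, training_target, test_target = train_test_split(
    features_encoded, target, test_size=0.15, random_state=12)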
This is the code I have so far:
%matplotlib inline
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import warnings
import tensorflow as tf
import tensorflow_hub as hub
import bert
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from scipy import stats
plt.style.use('seaborn')
warnings.filterwarnings(action='ignore')
output_dir = 'modelOutput'
airbnbdata = pd.read_excel('Z:\\Business\\AA Project\\listings_cleaned_v1.xlsm')
dfclean = airbnbdata
dfclean.iloc[0]
#drop rows with nulls in the columns used downstream
required_cols = ['id', 'listing_url', 'name', 'summary', 'space', 'description',
                 'host_id', 'host_name', 'host_listings_count', 'neighbourhood_cleansed',
                 'city', 'state', 'zipcode', 'country', 'latitude', 'longitude',
                 'property_type', 'room_type', 'price', 'number_of_reviews',
                 'review_scores_rating', 'average_review_score', 'reviews_per_month']
dfclean = dfclean.dropna(subset=required_cols)
#round score rating
dfclean['average_review_score'] = dfclean['average_review_score']/2
dfclean.average_review_score = dfclean.average_review_score.round()
dfclean.neighbourhood_cleansed=dfclean.neighbourhood_cleansed.replace(' ', '_', regex=True)
#pd.Series(' '.join(dfclean.neighbourhood_cleansed).split()).value_counts()[:20]
# map 1-3 to '1' (Negative) and 4-5 to '0' (Positive); note the values are assigned as strings
dfclean.average_review_score[dfclean['average_review_score'] == 1] = '1'
dfclean.average_review_score[dfclean['average_review_score'] == 2] = '1'
dfclean.average_review_score[dfclean['average_review_score'] == 3] = '1'
dfclean.average_review_score[dfclean['average_review_score'] == 4] = '0'
dfclean.average_review_score[dfclean['average_review_score'] == 5] = '0'
dfclean['average_review_score'].value_counts()/dfclean['average_review_score'].count()
dfclean.average_review_score.value_counts()
model_variables = ['neighbourhood_cleansed', 'property_type','room_type','average_review_score']
price_data_relevent = dfclean[model_variables]
price_relevant_enconded = pd.get_dummies(price_data_relevent)
training_features, test_features, \
training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1),
price_relevant_enconded['average_review_score'],
test_size = .15,
random_state=12)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-1c632a59b870> in <module>
1 training_features, test_features, \
----> 2 training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1),
3 price_relevant_enconded['average_review_score'],
4 test_size = .15,
5 random_state=12)
~\Anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4115 level=level,
4116 inplace=inplace,
-> 4117 errors=errors,
4118 )
4119
~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
3912 for axis, labels in axes.items():
3913 if labels is not None:
-> 3914 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
3915
3916 if inplace:
~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
3944 new_axis = axis.drop(labels, level=level, errors=errors)
3945 else:
-> 3946 new_axis = axis.drop(labels, errors=errors)
3947 result = self.reindex(**{axis_name: new_axis})
3948
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
5338 if mask.any():
5339 if errors != "ignore":
-> 5340 raise KeyError("{} not found in axis".format(labels[mask]))
5341 indexer = indexer[~mask]
5342 return self.delete(indexer)
KeyError: "['average_review_score'] not found in axis"
The output for
for col in price_relevant_enconded.columns:
    print(col)
neighbourhood_cleansed_Acton
neighbourhood_cleansed_Adams-Normandie
neighbourhood_cleansed_Agoura_Hills
neighbourhood_cleansed_Agua_Dulce
neighbourhood_cleansed_Alhambra
neighbourhood_cleansed_Alondra_Park
neighbourhood_cleansed_Altadena
neighbourhood_cleansed_Angeles_Crest
neighbourhood_cleansed_Arcadia
neighbourhood_cleansed_Arleta
neighbourhood_cleansed_Arlington_Heights
neighbourhood_cleansed_Artesia
neighbourhood_cleansed_Athens
neighbourhood_cleansed_Atwater_Village
neighbourhood_cleansed_Avalon
neighbourhood_cleansed_Avocado_Heights
neighbourhood_cleansed_Azusa
neighbourhood_cleansed_Baldwin_Hills/Crenshaw
neighbourhood_cleansed_Baldwin_Park
neighbourhood_cleansed_Bel-Air
neighbourhood_cleansed_Bell
neighbourhood_cleansed_Bell_Gardens
neighbourhood_cleansed_Bellflower
neighbourhood_cleansed_Beverly_Crest
neighbourhood_cleansed_Beverly_Grove
neighbourhood_cleansed_Beverly_Hills
neighbourhood_cleansed_Beverlywood
neighbourhood_cleansed_Boyle_Heights
neighbourhood_cleansed_Bradbury
neighbourhood_cleansed_Brentwood
neighbourhood_cleansed_Broadway-Manchester
neighbourhood_cleansed_Burbank
neighbourhood_cleansed_Calabasas
neighbourhood_cleansed_Canoga_Park
neighbourhood_cleansed_Carson
neighbourhood_cleansed_Carthay
neighbourhood_cleansed_Castaic
neighbourhood_cleansed_Castaic_Canyons
neighbourhood_cleansed_Central-Alameda
neighbourhood_cleansed_Century_City
neighbourhood_cleansed_Cerritos
neighbourhood_cleansed_Charter_Oak
neighbourhood_cleansed_Chatsworth
neighbourhood_cleansed_Chesterfield_Square
neighbourhood_cleansed_Cheviot_Hills
neighbourhood_cleansed_Chinatown
neighbourhood_cleansed_Citrus
neighbourhood_cleansed_Claremont
neighbourhood_cleansed_Commerce
neighbourhood_cleansed_Compton
neighbourhood_cleansed_Covina
neighbourhood_cleansed_Culver_City
neighbourhood_cleansed_Cypress_Park
neighbourhood_cleansed_Del_Aire
neighbourhood_cleansed_Del_Rey
neighbourhood_cleansed_Desert_View_Highlands
neighbourhood_cleansed_Diamond_Bar
neighbourhood_cleansed_Downey
neighbourhood_cleansed_Downtown
neighbourhood_cleansed_Duarte
neighbourhood_cleansed_Eagle_Rock
neighbourhood_cleansed_East_Hollywood
neighbourhood_cleansed_East_La_Mirada
neighbourhood_cleansed_East_Los_Angeles
neighbourhood_cleansed_East_Pasadena
neighbourhood_cleansed_East_San_Gabriel
neighbourhood_cleansed_Echo_Park
neighbourhood_cleansed_El_Monte
neighbourhood_cleansed_El_Segundo
neighbourhood_cleansed_El_Sereno
neighbourhood_cleansed_Elysian_Park
neighbourhood_cleansed_Elysian_Valley
neighbourhood_cleansed_Encino
neighbourhood_cleansed_Exposition_Park
neighbourhood_cleansed_Fairfax
neighbourhood_cleansed_Florence
neighbourhood_cleansed_Florence-Firestone
neighbourhood_cleansed_Gardena
neighbourhood_cleansed_Glassell_Park
neighbourhood_cleansed_Glendale
neighbourhood_cleansed_Glendora
neighbourhood_cleansed_Gramercy_Park
neighbourhood_cleansed_Granada_Hills
neighbourhood_cleansed_Green_Meadows
neighbourhood_cleansed_Green_Valley
neighbourhood_cleansed_Griffith_Park
neighbourhood_cleansed_Hacienda_Heights
neighbourhood_cleansed_Hancock_Park
neighbourhood_cleansed_Harbor_City
neighbourhood_cleansed_Harbor_Gateway
neighbourhood_cleansed_Harvard_Heights
neighbourhood_cleansed_Harvard_Park
neighbourhood_cleansed_Hasley_Canyon
neighbourhood_cleansed_Hawaiian_Gardens
neighbourhood_cleansed_Hawthorne
neighbourhood_cleansed_Hermosa_Beach
neighbourhood_cleansed_Highland_Park
neighbourhood_cleansed_Historic_South-Central
neighbourhood_cleansed_Hollywood
neighbourhood_cleansed_Hollywood_Hills
neighbourhood_cleansed_Hollywood_Hills_West
neighbourhood_cleansed_Huntington_Park
neighbourhood_cleansed_Hyde_Park
neighbourhood_cleansed_Industry
neighbourhood_cleansed_Inglewood
neighbourhood_cleansed_Irwindale
neighbourhood_cleansed_Jefferson_Park
neighbourhood_cleansed_Koreatown
neighbourhood_cleansed_La_Cañada_Flintridge
neighbourhood_cleansed_La_Crescenta-Montrose
neighbourhood_cleansed_La_Habra_Heights
neighbourhood_cleansed_La_Mirada
neighbourhood_cleansed_La_Puente
neighbourhood_cleansed_La_Verne
neighbourhood_cleansed_Ladera_Heights
neighbourhood_cleansed_Lake_Balboa
neighbourhood_cleansed_Lake_Hughes
neighbourhood_cleansed_Lake_Los_Angeles
neighbourhood_cleansed_Lake_View_Terrace
neighbourhood_cleansed_Lakewood
neighbourhood_cleansed_Lancaster
neighbourhood_cleansed_Larchmont
neighbourhood_cleansed_Lawndale
neighbourhood_cleansed_Leimert_Park
neighbourhood_cleansed_Lennox
neighbourhood_cleansed_Leona_Valley
neighbourhood_cleansed_Lincoln_Heights
neighbourhood_cleansed_Lomita
neighbourhood_cleansed_Long_Beach
neighbourhood_cleansed_Lopez/Kagel_Canyons
neighbourhood_cleansed_Los_Feliz
neighbourhood_cleansed_Lynwood
neighbourhood_cleansed_Malibu
neighbourhood_cleansed_Manchester_Square
neighbourhood_cleansed_Manhattan_Beach
neighbourhood_cleansed_Mar_Vista
neighbourhood_cleansed_Marina_del_Rey
neighbourhood_cleansed_Mayflower_Village
neighbourhood_cleansed_Maywood
neighbourhood_cleansed_Mid-City
neighbourhood_cleansed_Mid-Wilshire
neighbourhood_cleansed_Mission_Hills
neighbourhood_cleansed_Monrovia
neighbourhood_cleansed_Montebello
neighbourhood_cleansed_Montecito_Heights
neighbourhood_cleansed_Monterey_Park
neighbourhood_cleansed_Mount_Washington
neighbourhood_cleansed_North_El_Monte
neighbourhood_cleansed_North_Hills
neighbourhood_cleansed_North_Hollywood
neighbourhood_cleansed_North_Whittier
neighbourhood_cleansed_Northeast_Antelope_Valley
neighbourhood_cleansed_Northridge
neighbourhood_cleansed_Northwest_Antelope_Valley
neighbourhood_cleansed_Northwest_Palmdale
neighbourhood_cleansed_Norwalk
neighbourhood_cleansed_Pacific_Palisades
neighbourhood_cleansed_Pacoima
neighbourhood_cleansed_Palmdale
neighbourhood_cleansed_Palms
neighbourhood_cleansed_Palos_Verdes_Estates
neighbourhood_cleansed_Panorama_City
neighbourhood_cleansed_Paramount
neighbourhood_cleansed_Pasadena
neighbourhood_cleansed_Pico-Robertson
neighbourhood_cleansed_Pico-Union
neighbourhood_cleansed_Pico_Rivera
neighbourhood_cleansed_Playa_Vista
neighbourhood_cleansed_Playa_del_Rey
neighbourhood_cleansed_Pomona
neighbourhood_cleansed_Porter_Ranch
neighbourhood_cleansed_Quartz_Hill
neighbourhood_cleansed_Ramona
neighbourhood_cleansed_Rancho_Dominguez
neighbourhood_cleansed_Rancho_Palos_Verdes
neighbourhood_cleansed_Rancho_Park
neighbourhood_cleansed_Redondo_Beach
neighbourhood_cleansed_Reseda
neighbourhood_cleansed_Ridge_Route
neighbourhood_cleansed_Rolling_Hills
neighbourhood_cleansed_Rolling_Hills_Estates
neighbourhood_cleansed_Rosemead
neighbourhood_cleansed_Rowland_Heights
neighbourhood_cleansed_San_Dimas
neighbourhood_cleansed_San_Fernando
neighbourhood_cleansed_San_Gabriel
neighbourhood_cleansed_San_Marino
neighbourhood_cleansed_San_Pasqual
neighbourhood_cleansed_San_Pedro
neighbourhood_cleansed_Santa_Clarita
neighbourhood_cleansed_Santa_Fe_Springs
neighbourhood_cleansed_Santa_Monica
neighbourhood_cleansed_Sawtelle
neighbourhood_cleansed_Sepulveda_Basin
neighbourhood_cleansed_Shadow_Hills
neighbourhood_cleansed_Sherman_Oaks
neighbourhood_cleansed_Sierra_Madre
neighbourhood_cleansed_Signal_Hill
neighbourhood_cleansed_Silver_Lake
neighbourhood_cleansed_South_El_Monte
neighbourhood_cleansed_South_Gate
neighbourhood_cleansed_South_Park
neighbourhood_cleansed_South_Pasadena
neighbourhood_cleansed_South_San_Gabriel
neighbourhood_cleansed_South_San_Jose_Hills
neighbourhood_cleansed_South_Whittier
neighbourhood_cleansed_Southeast_Antelope_Valley
neighbourhood_cleansed_Stevenson_Ranch
neighbourhood_cleansed_Studio_City
neighbourhood_cleansed_Sun_Valley
neighbourhood_cleansed_Sun_Village
neighbourhood_cleansed_Sunland
neighbourhood_cleansed_Sylmar
neighbourhood_cleansed_Tarzana
neighbourhood_cleansed_Temple_City
neighbourhood_cleansed_Toluca_Lake
neighbourhood_cleansed_Topanga
neighbourhood_cleansed_Torrance
neighbourhood_cleansed_Tujunga
neighbourhood_cleansed_Tujunga_Canyons
neighbourhood_cleansed_Unincorporated_Catalina_Island
neighbourhood_cleansed_Unincorporated_Santa_Monica_Mountains
neighbourhood_cleansed_Unincorporated_Santa_Susana_Mountains
neighbourhood_cleansed_Universal_City
neighbourhood_cleansed_University_Park
neighbourhood_cleansed_Val_Verde
neighbourhood_cleansed_Valinda
neighbourhood_cleansed_Valley_Glen
neighbourhood_cleansed_Valley_Village
neighbourhood_cleansed_Van_Nuys
neighbourhood_cleansed_Venice
neighbourhood_cleansed_Vermont-Slauson
neighbourhood_cleansed_Vermont_Knolls
neighbourhood_cleansed_Vermont_Square
neighbourhood_cleansed_Vermont_Vista
neighbourhood_cleansed_Vernon
neighbourhood_cleansed_Veterans_Administration
neighbourhood_cleansed_View_Park-Windsor_Hills
neighbourhood_cleansed_Vincent
neighbourhood_cleansed_Walnut
neighbourhood_cleansed_Watts
neighbourhood_cleansed_West_Adams
neighbourhood_cleansed_West_Carson
neighbourhood_cleansed_West_Covina
neighbourhood_cleansed_West_Hills
neighbourhood_cleansed_West_Hollywood
neighbourhood_cleansed_West_Los_Angeles
neighbourhood_cleansed_West_Puente_Valley
neighbourhood_cleansed_West_Whittier-Los_Nietos
neighbourhood_cleansed_Westchester
neighbourhood_cleansed_Westlake
neighbourhood_cleansed_Westlake_Village
neighbourhood_cleansed_Westmont
neighbourhood_cleansed_Westwood
neighbourhood_cleansed_Whittier
neighbourhood_cleansed_Willowbrook
neighbourhood_cleansed_Wilmington
neighbourhood_cleansed_Windsor_Square
neighbourhood_cleansed_Winnetka
neighbourhood_cleansed_Woodland_Hills
property_type_Aparthotel
property_type_Apartment
property_type_Barn
property_type_Bed and breakfast
property_type_Boat
property_type_Boutique hotel
property_type_Bungalow
property_type_Bus
property_type_Cabin
property_type_Camper/RV
property_type_Campsite
property_type_Casa particular (Cuba)
property_type_Castle
property_type_Chalet
property_type_Condominium
property_type_Cottage
property_type_Dome house
property_type_Dorm
property_type_Earth house
property_type_Farm stay
property_type_Guest suite
property_type_Guesthouse
property_type_Hostel
property_type_Hotel
property_type_House
property_type_Houseboat
property_type_Hut
property_type_Island
property_type_Loft
property_type_Other
property_type_Resort
property_type_Serviced apartment
property_type_Tent
property_type_Tiny house
property_type_Tipi
property_type_Townhouse
property_type_Train
property_type_Treehouse
property_type_Villa
property_type_Yurt
room_type_Entire home/apt
room_type_Hotel room
room_type_Private room
room_type_Shared room
average_review_score_0
average_review_score_1
The output for
price_relevant_enconded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27557 entries, 1 to 35953
Columns: 306 entries, neighbourhood_cleansed_Acton to average_review_score_1
dtypes: uint8(306)
memory usage: 8.3 MB
I continued with the code as follows:
#Create Training and Test Sets
training_features, test_features, \
training_target, test_target, = train_test_split(price_relevant_enconded.drop(['average_review_score'], axis=1),
price_relevant_enconded['average_review_score'],
test_size = .15,
random_state=12)
#Oversample minority class on training data.
x_train, x_val, y_train, y_val = train_test_split(training_features, training_target,
test_size = .1,
random_state=12)
# note: newer imbalanced-learn releases use sampling_strategy=1.0 and fit_resample() instead of ratio and fit_sample()
sm = SMOTE(random_state=12, ratio = 1.0)
x_train_res, y_train_res = sm.fit_sample(x_train, y_train)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(x_train_res, y_train_res)
print('Validation Results')
print('Mean Accuracy:',clf_rf.score(x_val, y_val))
print('Recall:',recall_score(y_val, clf_rf.predict(x_val)))
print('\nTest Results')
print('Mean Accuracy:',clf_rf.score(test_features, test_target))
print('Recall:',recall_score(test_target, clf_rf.predict(test_features)))
Validation Results
Mean Accuracy: 0.9709773794280837
Recall: 0.0625
Test Results
Mean Accuracy: 0.9775036284470247
Recall: 0.03225806451612903
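For a fuller per-class picture, the confusion matrix and classification report can be checked as well; a quick sketch (reusing clf_rf, test_features, and test_target from above):
from sklearn.metrics import confusion_matrix, classification_report
# per-class breakdown on the held-out test set
test_predictions = clf_rf.predict(test_features)
print(confusion_matrix(test_target, test_predictions))
print(classification_report(test_target, test_predictions))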
Does anyone have any ideas on how I can better optimize my model or make changes to get more accurate predictions from this data?
Related
I am trying to find the best c parameter, following the instructions for a task that asks me to 'Define a function, fit_generative_model, that takes as input a training set (train_data, train_labels) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, where
pi[j]: the frequency of that label
mu[j]: the 784-dimensional mean vector
sigma[j]: the 784x784 covariance matrix
It is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. c is now a parameter, and by setting it appropriately, we can improve the performance of the model.
%matplotlib inline
import sys
import matplotlib.pyplot as plt
import gzip, os
import numpy as np
from scipy.stats import multivariate_normal
if sys.version_info[0] == 2:
    from urllib import urlretrieve
else:
    from urllib.request import urlretrieve

# Downloads the dataset
def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
    print("Downloading %s" % filename)
    urlretrieve(source + filename, filename)

# Invokes download() if necessary, then reads in images
def load_mnist_images(filename):
    if not os.path.exists(filename):
        download(filename)
    with gzip.open(filename, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=16)
    data = data.reshape(-1,784)
    return data

def load_mnist_labels(filename):
    if not os.path.exists(filename):
        download(filename)
    with gzip.open(filename, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=8)
    return data

## Load the training set
train_data = load_mnist_images('train-images-idx3-ubyte.gz')
train_labels = load_mnist_labels('train-labels-idx1-ubyte.gz')
## Load the testing set
test_data = load_mnist_images('t10k-images-idx3-ubyte.gz')
test_labels = load_mnist_labels('t10k-labels-idx1-ubyte.gz')
train_data.shape, train_labels.shape
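My understanding of what the fitted parameters should look like is roughly this minimal sketch for a single c value (c here is just an assumed regularization constant; prediction is a separate step):
import numpy as np

def fit_generative_model_sketch(x, y, c):
    k = 10                        # labels 0,...,9
    d = x.shape[1]                # 784 features
    mu = np.zeros((k, d))
    sigma = np.zeros((k, d, d))
    pi = np.zeros(k)
    for label in range(k):
        indices = (y == label)
        pi[label] = float(np.sum(indices)) / len(y)    # frequency of the label
        mu[label] = np.mean(x[indices, :], axis=0)     # 784-dimensional mean vector
        sigma[label] = np.cov(x[indices, :], rowvar=0, bias=1) + c * np.identity(d)  # regularized covariance
    return mu, sigma, pi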
So I have written this code for three different c-values, but they each give me the same number of errors:
def fit_generative_model(x,y):
    lst=[]
    for c in [20, 200, 4000]:
        k = 10 # labels 0,1,...,k-1
        d = (x.shape)[1] # number of features
        mu = np.zeros((k,d))
        sigma = np.zeros((k,d,d))
        pi = np.zeros(k)
        for label in range(0,k):
            indices = (y == label)
            mu[label] = np.mean(x[indices,:], axis=0)
            sigma[label] = np.cov(x[indices,:], rowvar=0, bias=1) + c*np.identity(784) # I define the identity matrix
        predictions = np.argmax(score, axis=1)
        errors = np.sum(predictions != y)
        lst.append(errors)
        print(c,"Model makes " + str(errors) + " errors out of 10000", lst)
Then I fit it to the training data and get these same errors:
mu, sigma, pi = fit_generative_model(train_data, train_labels)
20 Model makes 1 errors out of 10000 [1]
200 Model makes 1 errors out of 10000 [1, 1]
4000 Model makes 1 errors out of 10000 [1, 1, 1]
and to the test data:
mu, sigma, pi = fit_generative_model(test_data, test_labels)
20 Model makes 9020 errors out of 10000 [9020]
200 Model makes 9020 errors out of 10000 [9020, 9020]
4000 Model makes 9020 errors out of 10000 [9020, 9020, 9020]
What am I doing wrong? The correct answer is c=4000, which yields an error rate of ~4.3%.
I am implementing an SVM project with this data.
Here is how I extract the features:
import itertools
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix
df = pd.read_csv('loan_train.csv')
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df['dayofweek'] = df['effective_date'].dt.dayofweek
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
X = Feature
y = df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[0,1],inplace=False)
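The split that produces X_train_svm and the rest is a plain train_test_split call along these lines (the exact test_size and random_state here are placeholders):
# placeholder split parameters, not necessarily the ones used above
X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(
    X, y, test_size=0.2, random_state=4)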
Creating the model and prediction:
clf = svm.SVC(kernel='rbf')
clf.fit(X_train_svm, y_train_svm)
yhat_svm = clf.predict(X_test_svm)
Evaluation phase:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
cnf_matrix = confusion_matrix(y_test_svm, yhat_svm, labels=[2,4])
np.set_printoptions(precision=2)
print (classification_report(y_test_svm, yhat_svm))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')
Here is the error:
Traceback (most recent call last):
  File "E:/python/classification_project/classification.py", line 229, in <module>
    cnf_matrix = confusion_matrix(y_test_svm, yhat_svm, labels=[2,4])
  File "C:\Program Files (x86)\Python38-32\lib\site-packages\sklearn\metrics\_classification.py", line 277, in confusion_matrix
    raise ValueError("At least one label specified must be in y_true")
ValueError: At least one label specified must be in y_true
I checked this question, which was like mine, and I changed y from categorical to numerical, but the error is still there!
The values in y are 0 and 1, but in the confusion_matrix call:
cnf_matrix = confusion_matrix(y_test_svm, yhat_svm, labels=[2,4])
the labels were 2 and 4.
The labels in confusion_matrix should match the tokens in the y vector, i.e.:
cnf_matrix = confusion_matrix(y_test_svm, yhat_svm, labels=[0,1])
In the matrix-computation step, instead of using labels=[2,4], I defined the labels with the class names, labels=['PAIDOFF','COLLECTION'].
So here is the computation code:
cnf_matrix = confusion_matrix(y_test, yhat, labels=['PAIDOFF','COLLECTION'])
np.set_printoptions(precision=2)
print (classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['PAIDOFF','COLLECTION'],normalize= False, title='Confusion matrix')
Update:
So I have been looking into the issue; the problem is with the scikit-multiflow DataStream. In the last quarter of the code, stream_clf.partial_fit(X, y, classes=stream.target_values) is called; the class values from stream.target_values should be numbers or strings, but the method is returning a dtype. When I print or loop over stream.target_values I get this:
I have tried to do conversions etc., but to no avail. Can someone please help here?
Initial Problem
I am running some code (took inspiration from here). It works perfectly fine in a vanilla Python environment.
But if I run this code, after certain modifications, in Apache Spark using PySpark, I get the following error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'type'
I have tried every possible way to trace the issue, but everything looks alright. The error arises from the last line of the code, where the Hoeffding tree is called for prediction. It expects an ndarray, and the type of the X variable is also ndarray. I am not sure what is triggering the issue. Can someone please help or point me to the right trace?
complete stack of error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-1310132c88db> in <module>
30 D3_win.addInstance(X,y)
31 xx = np.array(X,dtype='float64')
---> 32 y_hat = stream_clf.predict(xx)
33
34
~/conceptDrift/projectTest/lib/python3.5/site-packages/skmultiflow/trees/hoeffding_tree.py in predict(self, X)
1068 r, _ = get_dimensions(X)
1069 predictions = []
-> 1070 y_proba = self.predict_proba(X)
1071 for i in range(r):
1072 index = np.argmax(y_proba[i])
~/conceptDrift/projectTest/lib/python3.5/site-packages/skmultiflow/trees/hoeffding_tree.py in predict_proba(self, X)
1099 votes = normalize_values_in_dict(votes, inplace=False)
1100 if self.classes is not None:
-> 1101 y_proba = np.zeros(int(max(self.classes)) + 1)
1102 else:
1103 y_proba = np.zeros(int(max(votes.keys())) + 1)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'type'
Code
import findspark
findspark.init()
import pyspark as ps
import warnings
from pyspark.sql import functions as fn
import sys
from pyspark import SparkContext,SparkConf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score as AUC
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from skmultiflow.trees.hoeffding_tree import HoeffdingTree
from skmultiflow.data.data_stream import DataStream
import time
def drift_detector(S, T, threshold=0.75):
    T = pd.DataFrame(T)
    #print(T)
    S = pd.DataFrame(S)
    # Give slack variable in_target which is 1 for old and 0 for new
    T['in_target'] = 0  # in target set
    S['in_target'] = 1  # in source set
    # Combine source and target with new slack variable
    ST = pd.concat([T, S], ignore_index=True, axis=0)
    labels = ST['in_target'].values
    ST = ST.drop('in_target', axis=1).values
    # You can use any classifier for this step. We advise it to be a simple one as we want to see whether source
    # and target differ, not to classify them.
    clf = LogisticRegression(solver='liblinear')
    predictions = np.zeros(labels.shape)
    # Divide ST into two equal chunks
    # Train LR on a chunk and classify the other chunk
    # Calculate AUC for original labels (in_target) and predicted ones
    skf = StratifiedKFold(n_splits=2, shuffle=True)
    for train_idx, test_idx in skf.split(ST, labels):
        X_train, X_test = ST[train_idx], ST[test_idx]
        y_train, y_test = labels[train_idx], labels[test_idx]
        clf.fit(X_train, y_train)
        probs = clf.predict_proba(X_test)[:, 1]
        predictions[test_idx] = probs
    auc_score = AUC(labels, predictions)
    print(auc_score)
    # Signal drift if AUC is larger than the threshold
    if auc_score > threshold:
        return True
    else:
        return False
class D3():
    def __init__(self, w, rho, dim, auc):
        self.size = int(w*(1+rho))
        self.win_data = np.zeros((self.size,dim))
        self.win_label = np.zeros(self.size)
        self.w = w
        self.rho = rho
        self.dim = dim
        self.auc = auc
        self.drift_count = 0
        self.window_index = 0
    def addInstance(self,X,y):
        if(self.isEmpty()):
            self.win_data[self.window_index] = X
            self.win_label[self.window_index] = y
            self.window_index = self.window_index + 1
        else:
            print("Error: Buffer is full!")
    def isEmpty(self):
        return self.window_index < self.size
    def driftCheck(self):
        if drift_detector(self.win_data[:self.w], self.win_data[self.w:self.size], auc): #returns true if drift is detected
            self.window_index = int(self.w * self.rho)
            self.win_data = np.roll(self.win_data, -1*self.w, axis=0)
            self.win_label = np.roll(self.win_label, -1*self.w, axis=0)
            self.drift_count = self.drift_count + 1
            return True
        else:
            self.window_index = self.w
            self.win_data = np.roll(self.win_data, -1*(int(self.w*self.rho)), axis=0)
            self.win_label = np.roll(self.win_label, -1*(int(self.w*self.rho)), axis=0)
            return False
    def getCurrentData(self):
        return self.win_data[:self.window_index]
    def getCurrentLabels(self):
        return self.win_label[:self.window_index]
def select_data(x):
    x = "/user/hadoop1/tellus/sea_1.csv"
    peopleDF = spark.read.csv(x, header=True)
    df = peopleDF.toPandas()
    scaler = MinMaxScaler()
    df.iloc[:,0:df.shape[1]-1] = scaler.fit_transform(df.iloc[:,0:df.shape[1]-1])
    return df

def check_true(y, y_hat):
    if(y == y_hat):
        return 1
    else:
        return 0
df = select_data("/user/hadoop1/tellus/sea_1.csv")
stream = DataStream(df)
stream.prepare_for_use()
stream_clf = HoeffdingTree()
w = int(2000)
rho = float(0.4)
auc = float(0.60)
# In[ ]:
D3_win = D3(w,rho,stream.n_features,auc)
stream_acc = []
stream_record = []
stream_true= 0
i=0
start = time.time()
X,y = stream.next_sample(int(w*rho))
stream_clf.partial_fit(X,y, classes=stream.target_values)
while(stream.has_more_samples()):
    X,y = stream.next_sample()
    if D3_win.isEmpty():
        D3_win.addInstance(X,y)
        y_hat = stream_clf.predict(X)
The problem was with the select_data() function: the data types of the variables were being changed during execution. This issue is fixed now.
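The kind of change involved is along these lines, inside select_data (a sketch; Spark's read.csv hands back string columns, and the exact column handling depends on the CSV):
# sketch: force numeric dtypes after converting the Spark frame to pandas,
# before the frame is handed to skmultiflow's DataStream
df = peopleDF.toPandas()
df = df.apply(pd.to_numeric)   # features and the label column become numeric dtypes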
I wrote this linear regression code and now it is giving me an error in the iterate_weights function:
IndexError: index 200 is out of bounds for axis 0 with size 200
I don't know what is wrong. Also, when I print my weights they come out the same as the random values I chose above. I am using a Jupyter notebook.
Are there any mistakes?
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
#importing dataset
data = pd.read_csv('F:\WOC\linearreg.csv')
print(data.shape)
data.head()
data_arr = np.genfromtxt("F:\WOC\linearreg.csv", delimiter=",", skip_header=1)
print(data_arr)
# In[3]:
#collecting x and y
x_train = data_arr[:,1:4]
y_train = data_arr[:,4:5]
print(x_train)
print(y_train)
# In[4]:
weights_shape = y_train.shape
print(weights_shape)
r,c = x_train.shape
print(r,c)
w = np.random.randn(c,1)
w_num = len(w)
print(w)
# In[5]:
h = np.dot(x_train,w)
def cost_function():
    print(h)
    j = (1/2*r)*((h-y_train)**2)
    print('j',j)
cost_function()
# In[6]:
def iterate_weights():
    L=0.01
    iterations = 1000
    for iterations_proceed in range(1,1001):
        for i in range(w_num):
            for m in range(1,201):
                w[i,0] = w[i,0]-L*((1/r)*(sum(h-y_train)*(x_train[m,i])))
    print(w)
iterate_weights()
# In[7]:
h = np.dot(x_train,w)
def cost_function1():
    j = np.sum((1/2*r)*((h-y_train)**2))
    print(j)
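For reference, the vectorized update I am aiming for looks roughly like this (a sketch; the learning rate and iteration count are placeholder values, and it reuses x_train, y_train, r, and c from above):
# Sketch: batch gradient descent for linear regression with placeholder hyperparameters
lr = 0.01
n_iters = 1000
w_sketch = np.random.randn(c, 1)
for _ in range(n_iters):
    h_pred = np.dot(x_train, w_sketch)                        # predictions, shape (r, 1)
    gradient = (1/r) * np.dot(x_train.T, h_pred - y_train)    # gradient, shape (c, 1)
    w_sketch = w_sketch - lr * gradient
cost = np.sum((h_pred - y_train)**2) / (2*r)                  # mean squared error / 2
print(w_sketch, cost)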
I am implementing the KMeans algorithm for clustering, and I get this problem; it is not working in Jupyter. I am applying the elbow method to find the optimal number of clusters.
#Now find the optimal number of clusters using elbow method
from sklearn.cluster import KMeans
wcss = []
for i in range[1,11]:
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-ebfededa579e> in <module>()
2 from sklearn.cluster import KMeans
3 wcss = []
----> 4 for i in range[1,11]:
5 kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
6 kmeans.fit(X)
TypeError: 'type' object is not subscriptable
The error says (or tries to say) that range is something you call, not something you index. Therefore you need to call it like this: range(1, 11) instead of range[1, 11].
If you change this in the 4th line, it should work (at least this part).
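For clarity, the corrected loop looks like this:
# call range(...) instead of indexing it
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)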