Always getting accuracy of 1 how to fix it? - python-3.x

I'm trying to apply logistic regression to my dataset, but it's giving an accuracy of 1.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv", header=0)
df = df[["PassengerId", "Survived", "Sex", "Age", "Embarked"]]
df.dropna(inplace=True)
X = df[["Sex", "Age"]]
X_train = np.array(X)
Y = df["Survived"]
Y_train = np.array(Y)
clf = LogisticRegression()
clf.fit(X_train, Y_train)
df1 = pd.read_csv("test.csv", header=0)
df1 = df1[["PassengerId", "Survived", "Sex", "Age", "Embarked"]]
df1.dropna(inplace=True)
X = df1[["Sex", "Age"]]
X_test = np.array(X)
Y = df1["Survived"]
Y_test = np.array(Y)
# Convert string data to float
X_test = X_test.astype(float)
Y_test = Y_test.astype(float)
accuracy = clf.score(X_test, Y_test)
print("Accuracy = ", accuracy)
I expect an accuracy somewhere between 0 and 1, but I always get exactly 1.0.

Related

ValueError: could not convert string to float for X_t

I want to use X_train and Y_train from the DataFrame df,
but I get ValueError: could not convert string to float.
Thank you in advance.
df = pd.DataFrame()
df['images'] = X_train.tolist()
df['label'] = Y_train.tolist()
df = df.sample(frac=1).reset_index(drop=True)
df.head()
X_train = df['images'].astype('str')
Y_train = df['label'].astype('str')
X_train = np.asarray(X_train).astype(np.float32)
Y_train = np.asarray(Y_train).astype(np.float32)
X_test = np.asarray(X_test).astype(np.float32)
Y_test = np.asarray(Y_test).astype(np.float32)
print (Y_train)
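One possible cause, stated as an assumption because the question does not show what X_train contains: if each cell of df['images'] holds a list of numbers, .astype('str') turns it into text such as '[0.1, 0.2, 0.3]', which cannot be parsed as a single float. A minimal sketch with made-up data (the names images, as_text and X_fixed are illustrative only):

```python
import numpy as np

# Hypothetical stand-in for two rows of image data (values are made up).
images = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Casting each row to str yields text like "[0.1, 0.2, 0.3]" ...
as_text = [str(row) for row in images]

# ... and that text is not a valid float literal, hence the ValueError.
try:
    np.asarray(as_text).astype(np.float32)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Converting the nested lists directly keeps the numeric values.
X_fixed = np.asarray(images, dtype=np.float32)
print(conversion_failed, X_fixed.shape)
```

If this matches the situation, skipping the .astype('str') step and converting the lists directly may be enough.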

error when exporting predictions of 4 machine learning models

I am training and testing my data with 10-fold cross-validation on 4 different models. For each model, I would like to export the predictions and the correct classes for each split.
This is my code and the result:
for train_index, test_index in kf.split(X, labels):
    print('TRAIN:', train_index,
          'TEST:', test_index)
    X_train, X_val = X[train_index], X[test_index]
    y_train, y_val = labels[train_index], labels[test_index]
    model1 = LinearSVC()
    model2 = MultinomialNB()
    model3 = LogisticRegression()
    model4 = RandomForestClassifier()
    model1.fit(X_train, y_train)
    model2.fit(X_train, y_train)
    model3.fit(X_train, y_train)
    model4.fit(X_train, y_train)
    result1 = model1.predict(X_val)
    result2 = model2.predict(X_val)
    result3 = model3.predict(X_val)
    result4 = model4.predict(X_val)
    df = pd.DataFrame(data = {"id": X_val, "Prediction": y_val})
    df.to_excel('result.xlsx')
So far I have this, but the output only contains the first rows (1-198) and I do not understand the export. Could you help me?
I have approximately 2000 sentences.
When you set K to 10 in KFold, the .split() method splits your dataset into 10 portions. In the i-th iteration, test_index holds the indices of the i-th portion and train_index holds the indices of the remaining 9 portions.
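The split described above can be sketched in plain Python. This is only an imitation of an unshuffled K-fold split, not scikit-learn's implementation; kfold_indices is a hypothetical helper, and 2000 is the approximate sentence count from the question.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_index, test_index) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test_index = list(range(start, start + size))
        train_index = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_index, test_index
        start += size


folds = list(kfold_indices(n_samples=2000, n_splits=10))
for i, (train_index, test_index) in enumerate(folds):
    # Each fold holds out a different tenth of the rows.
    print(i, len(train_index), len(test_index))
```

With roughly 2000 rows and 10 folds, each test portion has about 200 rows, which would be consistent with an export that contains only about 198 rows when a single fold is written.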
In your original code, df holds the test set (X_val, y_val) instead of the predictions for each iteration.
I am not sure what you intend to do, but if you would like to see the predictions of each model, the following code will do:
df = pd.DataFrame(data={
    "id": [],
    "ground_true": [],
    "original_sentence": [],
    "pred_model1": [],
    "pred_model2": [],
    "pred_model3": [],
    "pred_model4": []})
for train_index, test_index in kf.split(X, labels):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train, X_val = X[train_index], X[test_index]
    y_train, y_val = labels[train_index], labels[test_index]
    model1 = LinearSVC()
    model2 = MultinomialNB()
    model3 = LogisticRegression()
    model4 = RandomForestClassifier()
    model1.fit(X_train, y_train)
    model2.fit(X_train, y_train)
    model3.fit(X_train, y_train)
    model4.fit(X_train, y_train)
    # Predict on the held-out fold
    result1 = model1.predict(X_val)
    result2 = model2.predict(X_val)
    result3 = model3.predict(X_val)
    result4 = model4.predict(X_val)
    # One row per held-out sample; the sample's row index serves as its id
    temp_df = pd.DataFrame(data={
        "id": test_index,
        "ground_true": y_val,
        "original_sentence": verbatim_train_remove_stop_words[test_index],
        "pred_model1": result1,
        "pred_model2": result2,
        "pred_model3": result3,
        "pred_model4": result4})
    # Append this fold's rows to the accumulated results
    df = pd.concat([df, temp_df])
# Export all folds at once, after the loop has finished
df.to_excel('result.xlsx')

How to fix Found input variables with inconsistent numbers of samples: [1080, 428] error

I am working on the Indian Spontaneous Expression dataset, which has 428 images, each of shape (1080, 1920, 3). There are 4 classes, and the label array has shape (428, 4). While splitting into training, validation and testing data using train_test_split:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
I get the error mentioned in the title.
I tried reshaping the data, but I couldn't get it to work.
import cv2 as cv
data=pd.read_excel('/content/drive/My Drive/ISED/details1.xlsx')
count=0
path = data['img_path']
for path in data['img_path']:
    count=count+1
    temp1 = path.replace("'", "")
    imgpath = "/content/drive/My Drive/ISED/" + temp1
    imgFile = cv.imread(imgpath)
    X = np.asarray(imgFile)
print(X.shape)
print(count)
y = pd.get_dummies(data['emotion']).as_matrix()
# # #storing them using numpy
np.save('fdataXISED', X)
np.save('flabelsISED', y)
# #
print("Preprocessing Done")
print("Number of Features: "+str(len(X[0])))
print("Number of Labels: "+ str(len(y[0])))
print("Number of examples in dataset:"+str(len(X)))
print("X,y stored in fdataXISED.npy and flabelsISED.npy respectively")
num_features = 1920
num_labels = 4
batch_size = 64
epochs = 100
width, height = 1080, 1920
x = np.load('./fdataXISED.npy')
y = np.load('./flabelsISED.npy')
print(x.dtype)
x = x.astype(float)
x -= np.mean(x, axis=0)
x /= np.std(x, axis=0)
print(x.shape," ", y.shape)
#splitting into training, validation and testing data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1,
random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
test_size=0.1, random_state=
I expect proper data split for training.
The problem is the line X = np.asarray(imgFile) inside the loop for path in data['img_path']: it overwrites X on every iteration, so X ends up holding only the last image. Change it like this:
X=[]
for path in data['img_path']:
    count=count+1
    temp1 = path.replace("'", "")
    imgpath = "/content/drive/My Drive/ISED/" + temp1
    imgFile = cv.imread(imgpath)
    imgFile = np.asarray(imgFile)
    X.append(imgFile)
X = np.asarray(X)
print(X.shape)
print(count)
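At the end, X will have shape (428, 1080, 1920, 3) and y must have shape (428, 4).
The error occurs because X and y have different numbers of samples: train_test_split compares their first dimensions, and a single image of shape (1080, 1920, 3) has length 1080 while y has length 428.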
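The difference between the two loops can be checked on small stand-in arrays. This is only a sketch: the shapes below are made up and much smaller than the real (1080, 1920, 3) images, and the names images, X_last and X_all are illustrative.

```python
import numpy as np

# Small stand-ins for the real data: 5 "images" of shape (4, 6, 3)
# and a one-hot label array with 5 rows and 4 classes.
images = [np.zeros((4, 6, 3), dtype=np.uint8) for _ in range(5)]
y = np.zeros((5, 4))

# Overwriting X inside the loop keeps only the last image.
for img in images:
    X_last = np.asarray(img)

# Appending to a list and converting once keeps every image.
collected = []
for img in images:
    collected.append(img)
X_all = np.asarray(collected)

print(len(X_last), len(y))  # first dimensions differ
print(len(X_all), len(y))   # first dimensions agree
```

Only the second pair has matching first dimensions, which is what a sample-wise split needs.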

Tensorflow Neural Network: My model is giving an accuracy of 1.0 every time

This is a beginner's problem, but I cannot solve it on my own.
I was trying to build a neural network for the churn modelling dataset (bank data).
Every time I run this network I get an accuracy of 1.0, so I think something is wrong and it is not working.
Can anyone help me figure out what is wrong?
Also, please explain how I can avoid problems like this in the future.
The code is:
import pandas as pd
import numpy as np
data = pd.read_csv(r'D:\Churn_Modelling.csv')
X = data.iloc[:, 3:13].values
Y = data.iloc[:, 13].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x_1 = LabelEncoder()
X[:, 1] = label_encoder_x_1.fit_transform(X[:, 1])
label_encoder_x_2 = LabelEncoder()
X[:, 2] = label_encoder_x_2.fit_transform(X[:, 2])
one_hot_encoder = OneHotEncoder(categorical_features = [1])
X = one_hot_encoder.fit_transform(X).toarray()
X = X[:, 1:]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)
import tensorflow as tf
epochs = 20
batch_size = 50
learning_rate = 0.003
n_output = 1
n_input = X_train.shape[1]
X_placeholder = tf.placeholder("float32", [None, n_input], name = "X")
Y_placeholder = tf.placeholder("float32", [None, 1], name = "y")
n_neurons_1 = 64
n_neurons_2 = 32
n_neurons_3 = 16
layer_1 = {'weights': tf.Variable(tf.random_normal([n_input, n_neurons_1])),
           'biases': tf.Variable(tf.random_normal([n_neurons_1]))}
layer_2 = {'weights': tf.Variable(tf.random_normal([n_neurons_1, n_neurons_2])),
           'biases': tf.Variable(tf.random_normal([n_neurons_2]))}
layer_3 = {'weights': tf.Variable(tf.random_normal([n_neurons_2, n_neurons_3])),
           'biases': tf.Variable(tf.random_normal([n_neurons_3]))}
output_layer = {'weights': tf.Variable(tf.random_normal([n_neurons_3, n_output])),
                'biases': tf.Variable(tf.random_normal([n_output]))}
l1 = tf.add(tf.matmul(X_placeholder, layer_1['weights']), layer_1['biases'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1, layer_2['weights']), layer_2['biases'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2, layer_3['weights']), layer_3['biases'])
l3 = tf.nn.relu(l3)
output_layer = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
output_layer = tf.nn.sigmoid(output_layer)
cost = tf.reduce_mean(tf.reduce_sum(
    tf.square(Y_placeholder - output_layer), reduction_indices = [1]))
optimizer = tf.train.AdamOptimizer().minimize(cost)
correct_prediction = tf.equal(tf.argmax(Y_placeholder, 1), tf.argmax(output_layer, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
def next_batch(size, x, y):
    idx = np.arange(0, len(x))
    np.random.shuffle(idx)
    idx = idx[:size]
    x_shuffle = [x[i] for i in idx]
    y_shuffle = [y[i] for i in idx]
    return np.asarray(x_shuffle), np.asarray(y_shuffle)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    total_batches = int(len(X_train) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        print('epoch: ', epoch)
        for batch in range(total_batches):
            x_batch_data, y_batch_data = next_batch(batch_size, X_train, Y_train)
            y_batch_data = y_batch_data.reshape((50, 1))
            _, c = sess.run([optimizer, cost],
                            feed_dict = {X_placeholder: x_batch_data,
                                         Y_placeholder: y_batch_data})
            avg_cost += c / total_batches
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost))
    Y_test_temp = Y_test.reshape((2000, 1))
    print('accuracy: ', sess.run(accuracy,
          feed_dict = {X_placeholder: X_test, Y_placeholder: Y_test_temp}))
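A possible explanation, offered as a sketch rather than a confirmed diagnosis: both Y_placeholder and output_layer have a single column, so an argmax along axis 1 returns index 0 for every row and the equality check is always true. NumPy's argmax behaves the same way, which the made-up arrays below illustrate, together with thresholding the sigmoid output at 0.5 as one common alternative.

```python
import numpy as np

# Made-up labels and sigmoid outputs, each with a single column.
y_true = np.array([[1.0], [0.0], [1.0], [0.0]])
y_prob = np.array([[0.2], [0.9], [0.7], [0.1]])

# argmax over a single column is always 0, so every row "matches".
argmax_accuracy = np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1))

# Thresholding the probability at 0.5 compares actual class decisions.
y_pred = (y_prob > 0.5).astype(np.float64)
threshold_accuracy = np.mean(y_pred == y_true)

print(argmax_accuracy, threshold_accuracy)
```

In the graph above, a comparable change might be to compare tf.round(output_layer) with Y_placeholder instead of taking the argmax of both; treat that as a suggestion to verify, not a tested fix.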

For loop and Linear regression

Good evening,
I would like to repeat both a subsetting step and a linear regression for each article code in the same data frame.
#I get the unique codes of the articles
codes = np.unique(data["cod_id"])
#Split
X = data['price']
y = data["quantity"]
accuracy = []
for i in np.nditer(codes):
    data = data.loc[df["cod_id"] == i]
    #Arrange an if statement to avoid 0-element arrays, while splitting (80% train, 20% test)
    if int(len(data)) <= 2:
        X_train = X
        y_train = y
        # Test dataset
        X_test = X
        y_test = y
    else:
        t = 0.8
        t = int(t*len(data))
        #Split
        t = int(t*len(data))
        # Train dataset
        X_train = X[:t]
        y_train = y[:t]
        # Test dataset
        X_test = X[t:]
        y_test = y[t:]
    #Run the Algorithm
    lr = linear_model.LinearRegression()
    lr.fit(X_train, y_train)
    predicted_test_tr = lr.predict(X_test)
    pred_cost = (X_test["price"] * predicted_test_tr).sum()
    real_cost = (X_test["price"] * y_test).sum()
    delta = (pred_cost - owner_cost)/owner_cost
    accuracy.append(delta)
But it returns a list "accuracy" that is as long as "codes", with the same value at every position:
print(accuracy)
5.43234
5.43234
5.43234
...
How can I fix this issue?
Thank you
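One possible reason for the repeated value, assuming the code runs as posted: X and y are taken from the full data frame before the loop and are never re-subset, so every iteration fits and scores the same rows. The sketch below selects the rows of each code inside the loop. It uses made-up NumPy arrays in place of the data frame and np.polyfit in place of LinearRegression, so it illustrates the loop structure only.

```python
import numpy as np

# Made-up data: two article codes with different price/quantity relations.
cod_id = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
price = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0])
quantity = np.array([10.0, 8.0, 6.0, 4.0, 2.0, 3.0, 6.0, 9.0, 12.0, 15.0])

slopes = {}
for code in np.unique(cod_id):
    # Select this code's rows inside the loop, leaving the full arrays untouched.
    mask = cod_id == code
    X_code, y_code = price[mask], quantity[mask]

    # 80% train / 20% test split computed on the subset only.
    t = int(0.8 * len(X_code))
    X_train, y_train = X_code[:t], y_code[:t]

    # Degree-1 least-squares fit: quantity ~ slope * price + intercept.
    slope, intercept = np.polyfit(X_train, y_train, 1)
    slopes[int(code)] = round(float(slope), 6)

print(slopes)
```

Because each iteration works on its own subset, the fitted slope differs per code instead of repeating.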
