I have two inputs, X1 and X2, and a corresponding label Y. I want to split the data into training and validation sets using scikit-learn's train_test_split. X1 has shape (1920, 12) and X2 has shape (1920, 51, 5). The code I use is:
import numpy as np
from sklearn.model_selection import train_test_split
X1 = np.load('x_train.npy')
X2 = np.load('oneHot.npy')
y_train = np.load('y_train.npy')
X = np.array(list(zip(X1, X2))) ### To zip the two inputs.
X_train, X_valid, y_train, y_valid = train_test_split(X, y_train,test_size=0.2)
X1_train, oneHot_train = X_train[:, 0], X_train[:, 1]
However, when I check the shapes of X1_train and oneHot_train, both are (1536,), whereas X1_train should be (1536, 12) and oneHot_train should be (1536, 51, 5). What am I doing wrong here? Any insights will be appreciated.
train_test_split accepts any number of arrays (or other indexables) with the same number of rows and splits them consistently. Hence you can pass x1 and x2 directly, like below:
import numpy as np
from sklearn.model_selection import train_test_split
x1 = np.random.rand(1920,12)
x2 = np.random.rand(1920,51,5)
y = np.random.choice([0,1], 1920)
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(
    x1, x2, y, test_size=0.2)
x1_train.shape, x1_test.shape
# ((1536, 12), (384, 12))
x2_train.shape, x2_test.shape
# ((1536, 51, 5), (384, 51, 5))
y_train.shape, y_test.shape
# ((1536,), (384,))
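As for why the original attempt gave shape (1536,): zipping rows of different shapes cannot produce a regular array, so NumPy can at best build an array of Python objects (newer NumPy versions may refuse to build it at all), and slicing that array does not bring back the per-row dimensions. If you would rather split once and reuse the split, a minimal sketch is to split the row indices and index every array with them. The random arrays below are stand-ins for the real data, and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the shapes from the question.
rng = np.random.default_rng(0)
x1 = rng.random((1920, 12))
x2 = rng.random((1920, 51, 5))
y = rng.integers(0, 2, size=1920)

# Split the row indices once, then use them to index every array.
idx = np.arange(len(y))
train_idx, valid_idx = train_test_split(idx, test_size=0.2, random_state=0)

x1_train, x1_valid = x1[train_idx], x1[valid_idx]
x2_train, x2_valid = x2[train_idx], x2[valid_idx]
y_train, y_valid = y[train_idx], y[valid_idx]

print(x1_train.shape, x2_train.shape, y_train.shape)
```

Keeping `train_idx` and `valid_idx` around also makes it easy to apply the same split to further arrays later.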
This question is about TensorFlow (and TensorBoard) version 2.2rc3, but I have experienced the same issue with 2.1.
Consider the following weird code:
from datetime import datetime
import tensorflow as tf
from tensorflow import keras
inputs = keras.layers.Input(shape=(784, ))
x1 = keras.layers.Dense(32, activation='relu', name='Model/Block1/relu')(inputs)
x1 = keras.layers.Dropout(0.2, name='Model/Block1/dropout')(x1)
x1 = keras.layers.Dense(10, activation='softmax', name='Model/Block1/softmax')(x1)
x2 = keras.layers.Dense(32, activation='relu', name='Model/Block2/relu')(inputs)
x2 = keras.layers.Dropout(0.2, name='Model/Block2/dropout')(x2)
x2 = keras.layers.Dense(10, activation='softmax', name='Model/Block2/softmax')(x2)
x3 = keras.layers.Dense(32, activation='relu', name='Model/Block3/relu')(inputs)
x3 = keras.layers.Dropout(0.2, name='Model/Block3/dropout')(x3)
x3 = keras.layers.Dense(10, activation='softmax', name='Model/Block3/softmax')(x3)
x4 = keras.layers.Dense(32, activation='relu', name='Model/Block4/relu')(inputs)
x4 = keras.layers.Dropout(0.2, name='Model/Block4/dropout')(x4)
x4 = keras.layers.Dense(10, activation='softmax', name='Model/Block4/softmax')(x4)
outputs = x1 + x2 + x3 + x4
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer=keras.optimizers.RMSprop(),
metrics=['accuracy'])
logdir = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
model.fit(x_train, y_train,
batch_size=64,
epochs=5,
validation_split=0.2,
callbacks=[tensorboard_callback])
When you run it and look at the graph created in TensorBoard, you will see the following.
As can be seen, the addition operations are really ugly.
When replacing the line
outputs = x1 + x2 + x3 + x4
With the lines:
outputs = keras.layers.add([x1, x2], name='Model/add/add1')
outputs = keras.layers.add([outputs, x3], name='Model/add/add2')
outputs = keras.layers.add([outputs, x4], name='Model/add/add3')
a much nicer graph is created by TensorBoard (in this second screenshot, the Model and one of the inner blocks are shown in detail).
The difference between the two representations of the model is that in the second one, we could name the addition operations and group them.
I could not find any way to name these operations other than by using keras.layers.add(). In this model the problem is not critical, since the model is simple and it is easy to replace + with keras.layers.add(). In more complex models, however, it can become a real pain. For example, operations such as t[:, start:end] have to be translated into complex calls to tf.strided_slice(). As a result, my model representations are quite messy, with plenty of cryptic gather, stride and concat operations.
I wonder if there is a way to wrap / group such operations to allow nicer graphs in TensorBoard.
You can add all four tensors with a single Add layer:
outputs = keras.layers.Add()([x1, x2, x3, x4])
Following the hint from Marco Cerliani, a Lambda layer is indeed very useful here. The following code groups the + operations nicely:
outputs = keras.layers.Lambda(lambda x: x[0] + x[1], name='Model/add/add1')([x1, x2])
outputs = keras.layers.Lambda(lambda x: x[0] + x[1], name='Model/add/add2')([outputs, x3])
outputs = keras.layers.Lambda(lambda x: x[0] + x[1], name='Model/add/add3')([outputs, x4])
Or, if you need to wrap slices, the following code groups the t[] operations nicely:
x1 = keras.layers.Lambda(lambda x: x[:, 0:5], name='Model/stride_concat/stride1')(x1) # instead of x1 = x1[:, 0:5]
x2 = keras.layers.Lambda(lambda x: x[:, 5:10], name='Model/stride_concat/stride2')(x2) # instead of x2 = x2[:, 5:10]
outputs = keras.layers.concatenate([x1, x2], name='Model/stride_concat/concat')
This answers the question asked. But actually, there is still an open issue that is described in another question: 'TensorFlowOpLayer messes up the TensorBoard graphs'
While running the code below from the Machine Learning A-Z course, I get a warning.
# Logistic Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
X = X.astype(float)
y = y.astype(float)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Logistic Regression to the Training set
#lbfgs = Limited-memory BFGS It is a popular algorithm for parameter estimation in machine learning.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, solver='lbfgs')
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Full warning:
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
The issue is in the code below:
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
However, I am not able to fix it. Any help?
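You're just seeing a warning, which should not be a problem. The following code runs without any error in matplotlib 3.2.1; it also raises the level of matplotlib's axes logger to ERROR so that the warning is not printed.
Check your matplotlib version: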
import matplotlib
print(matplotlib.__version__)
# Logistic Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
X = X.astype(float)
y = y.astype(float)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Logistic Regression to the Training set
#lbfgs = Limited-memory BFGS It is a popular algorithm for parameter estimation in machine learning.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, solver='lbfgs')
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Just don't use ListedColormap for the scatter colors. From version 3 onward, you're supposed to pass the color as a string for each scatter call.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
X = X.astype(float)
y = y.astype(float)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Logistic Regression to the Training set
#lbfgs = Limited-memory BFGS It is a popular algorithm for parameter estimation in machine learning.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0, solver='lbfgs')
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ['red', 'green'][i], label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
This shouldn't give any warnings.
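If you would rather keep the colormap, the warning text itself suggests another route: give scatter one unambiguous color, either through the `color` keyword or as a 2-D array with a single row. The sketch below assumes random stand-in data instead of Social_Network_Ads.csv and uses the non-interactive Agg backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Random stand-in for the scaled training set and its binary labels.
rng = np.random.default_rng(0)
X_set = rng.normal(size=(40, 2))
y_set = rng.integers(0, 2, size=40)

cmap = ListedColormap(("red", "green"))
fig, ax = plt.subplots()
for i, j in enumerate(np.unique(y_set)):
    rgba = cmap(i)  # a single RGBA tuple for class j
    # `color=` always means "one color for all points", so it is unambiguous.
    # An equivalent form is c=np.array([rgba]), a 2-D array with one row.
    ax.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], color=rgba, label=j)
ax.legend()
```

Either form keeps the colors tied to the colormap used for the contour plot, so the points and the decision regions stay consistent.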
In my case the warning came from the lag_plot() function.
I solved it like this:
lag_plot(np.log(data['numeric_column']), c = ['blue'][0])
I have a university task to perform. It concerns the classification of several buildings (with 6 parameters) by damage class (1-5). I wrote the code following an SVM tutorial, but I am not sure about the output accuracy. Can you please advise how I can improve my result and what other algorithms I could try?
'''
# Support Vector Machine (SVM)
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Ehsan Duzce.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 7].values
# Taking care of missing data
from sklearn.impute import SimpleImputer
# creating object for SimpleImputer class as "imputer"
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[:, 1:7]) # upper bound is excluded, lower bound is included
X[:, 1:7] = imputer.transform(X[:, 1:7])
# Avoiding the dummy Variable Trap
X = X[:, 1:] #To remove the first column from the dataset
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'poly', degree = 3)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() +
1, step = 0.01), np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1,
step = 0.01))
Xpred = np.array([X1.ravel(), X2.ravel()] + [np.repeat(0, X1.ravel().size) for _ in
range(4)]).T
# Xpred now has a grid for x1 and x2 and average value (0) for x3 through x6
pred = classifier.predict(Xpred).reshape(X1.shape) # is a matrix of 0's and 1's !
plt.contourf(X1, X2, pred, alpha = 1.0, cmap = ListedColormap(('green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Damage Scale')
plt.ylabel('Building Database')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() +
1, step = 0.01), np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1,
step = 0.01))
Xpred = np.array([X1.ravel(), X2.ravel()] + [np.repeat(0, X1.ravel().size) for _ in
range(4)]).T
# Xpred now has a grid for x1 and x2 and average value (0) for x3 through x6
pred = classifier.predict(Xpred).reshape(X1.shape) # is a matrix of 0's and 1's !
plt.contourf(X1, X2, pred, alpha = 1.0, cmap = ListedColormap(('green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Damage Scale')
plt.ylabel('Building Database')
plt.legend()
plt.show()
'''
First and foremost, you should get acquainted with your training data. From what I understand, you simply feed the data to the model without any kind of preprocessing; you shouldn't do that.
I see you are imputing missing data with the mean. Maybe try removing those data points instead and compare the results, and remove outliers that may "confuse" your model.
Also, your plots are not very friendly: you tell us the data is classified 1-5, but the plot axes run over [-2, 2].
But since your question is algorithm-specific, try hyperparameter tuning.
You can do it like this:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=2)
grid.fit(X_train,y_train)
print(grid.best_estimator_)
I recommend reading this article to understand SVMs and how to tune their parameters:
https://towardsdatascience.com/svm-hyper-parameter-tuning-using-gridsearchcv-49c0bc55ce29
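Here is a slightly fuller sketch of the grid search above. The data is synthetic (`make_classification` with 6 features and 5 classes), not the building dataset, and the grid values are arbitrary starting points. Putting the scaler inside a pipeline means it is re-fitted on each training fold during cross-validation, so the validation folds do not influence the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 6 features and 5 classes, like the task described.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=5, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Scaler and classifier are tuned and cross-validated as one unit.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
    "svc__kernel": ["rbf", "poly"],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("cross-validated accuracy:", grid.best_score_)
print("held-out accuracy:", grid.score(X_test, y_test))
```

Comparing the cross-validated score with the held-out score gives a rough idea of whether the tuned model generalizes or has been over-fitted to the training folds.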
I am trying to visualise SVM classification results using Matplotlib and scikit-learn. How do I handle a MemoryError?
For my example, I have a small dataset: a table X of 100 examples and 10 integer features. I performed classification using scikit-learn's SVM and now want to visualize the results. Since I have 10 features I can't visualize them directly, so I used PCA after classification to reduce the dimensionality of my data. This worked on the Iris dataset, but for my data it crashes with a "MemoryError".
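#SVM classification
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC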
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.20)
svclassifier = SVC(kernel='linear',gamma='auto',max_iter=1000, decision_function_shape='ovo')
models=svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
#Plot functions
def make_meshgrid(x, y, h=.02):
    x_min, x_max = x.min() - 1, x.max()+1
    y_min, y_max = y.min() - 1, y.max()+1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min,
                         y_max, h))
    return xx, yy
def plot_contours(ax, clf, xx, yy, **params):
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
#PCA for dimensionality reduction
pca = PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
X=X_pca
#Plotting results
fig, sub = plt.subplots()
plt.subplots_adjust(wspace=0.4, hspace=0.4)
X0, X1 = X[:, 0].flatten(), X[:, 1].flatten()
xx, yy = make_meshgrid(X0, X1)
plot_contours(sub, models, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
sub.scatter(X0, X1, c=Y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
sub.set_xlim(xx.min(), xx.max())
sub.set_ylim(yy.min(), yy.max())
sub.set_xlabel('Sepal length')
sub.set_ylabel('Sepal width')
sub.set_xticks(())
sub.set_yticks(())
sub.set_title("TITLE")
plt.show()
original shape: (100, 10)
transformed shape: (100, 2)
MySQL connection is closed
Traceback (most recent call last):
File "new_data.py", line 123, in <module>
xx, yy = make_meshgrid(X0, X1)
File "new_data.py", line 81, in make_meshgrid
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
File "/home/.local/lib/python3.5/site-packages/numpy/lib/function_base.py", line 4211, in meshgrid
output = [x.copy() for x in output]
File "/home/.local/lib/python3.5/site-packages/numpy/lib/function_base.py", line 4211, in <listcomp>
output = [x.copy() for x in output]
MemoryError
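One likely cause, assuming the PCA-projected integer features span a range of hundreds or thousands: with a fixed step of h=.02, the number of grid points grows with the square of the data range, so np.meshgrid has to allocate an enormous array. A minimal sketch of a workaround is to fix the number of points per axis with np.linspace instead of fixing the step. The `n_points` parameter, the `grid_cells` helper and the random stand-in data are made up for illustration.

```python
import numpy as np

def make_meshgrid(x, y, n_points=200):
    """Build a grid with a fixed number of points per axis, whatever the data range."""
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, n_points),
                         np.linspace(y_min, y_max, n_points))
    return xx, yy

def grid_cells(x, y, h=0.02):
    """Number of grid points that a fixed step h would produce (nothing is allocated)."""
    nx = int(np.ceil((x.max() - x.min() + 2) / h))
    ny = int(np.ceil((y.max() - y.min() + 2) / h))
    return nx * ny

# Stand-in for two PCA components with a wide range.
rng = np.random.default_rng(0)
X0 = rng.uniform(-5000, 5000, size=100)
X1 = rng.uniform(-5000, 5000, size=100)

print(grid_cells(X0, X1))   # far too many points to hold in memory with h=0.02
xx, yy = make_meshgrid(X0, X1)
print(xx.shape)             # (200, 200)
```

Note also that the classifier in the question was fitted on the 10 original features, so calling predict on 2-D grid points would fail even once the grid fits in memory. One option for the plot is to fit a separate classifier on the two PCA components.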
After fitting the kNN-classifier with the scaled features (age and salary), I would like to plot the resulting diagram with the unscaled feature values.
kNN-plot
I think one way to do this is to change the xticks and yticks of the plot and leave everything else as it is. Hopefully someone has a better idea.
Moreover, it would be great if the diagram could show the correct (age / salary) values in the bottom left corner when I move the cursor over the diagram.
Unfortunately, I have no idea how to do that, so I am asking for help.
The dataset:
https://www.dropbox.com/sh/2mfr2kajrm7y2qq/AADFmZzYWLEjqYSLPjaQcLwka?dl=0
The code so far:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
X = dataset.iloc[:, [2, 3]].values.astype(float)
y = dataset.iloc[:,-1].values
# splitting into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.75, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# no fit, because it is test
X_test = sc_X.transform(X_test)
# fitting kNN classification to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
# Predict the Test set result
y_pred = classifier.predict(X_test)
# Visualising the Test set results
f = plt.figure()
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
f.show()
Alright, I noticed that the answer to my question is pretty simple.
I was overthinking it.
Here is the solution.
We just have to add the following lines:
wished_xticks = np.array([18, 22, 35])
temp_x = np.c_[ wished_xticks, [0]*len(wished_xticks) ]
transformed_x = sc_X.transform(temp_x)[:,0]
plt.xticks(transformed_x, wished_xticks)
wished_yticks = np.array([17000, 25000, 100000, 150000])
temp_y = np.c_[ [0]*len(wished_yticks), wished_yticks ]
transformed_y = sc_X.transform(temp_y)[:,1]
plt.yticks(transformed_y, wished_yticks)
So we get the desired result:
Diagram
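The zero-padding trick above works because StandardScaler scales each column independently. Assuming that, the same tick positions can be computed directly from the scaler's fitted `mean_` and `scale_` attributes, and the inverse mapping gives back original units. The helper names `to_scaled` and `to_original` and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy (age, salary) data standing in for the real training set.
X_train = np.array([[18, 17000], [22, 25000], [35, 100000], [50, 150000]], dtype=float)
sc_X = StandardScaler().fit(X_train)

def to_scaled(values, column):
    """Map original-unit values of one feature onto the scaled axis."""
    return (np.asarray(values, dtype=float) - sc_X.mean_[column]) / sc_X.scale_[column]

def to_original(values, column):
    """Inverse mapping: scaled axis positions back to original units."""
    return np.asarray(values, dtype=float) * sc_X.scale_[column] + sc_X.mean_[column]

wished_xticks = np.array([18, 22, 35])
tick_positions = to_scaled(wished_xticks, column=0)
# plt.xticks(tick_positions, wished_xticks) would then label the scaled axis in years.
print(tick_positions)
print(to_original(tick_positions, column=0))
```

For the cursor readout, Matplotlib lets you assign a custom function to an Axes' `format_coord`; a function built on `to_original` for both columns should make the status bar show unscaled age and salary values.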