I have a multilabel dataset that I would like to classify with a wide-and-deep neural network.
This is a very small example just to test:
import numpy as np
import pandas as pd
import tensorflow as tf
tf.enable_eager_execution()
training_df: pd.DataFrame = pd.DataFrame(
    data={
        'feature1': np.random.rand(10),
        'feature2': np.random.rand(10),
        'feature3': np.random.rand(10),
        'feature4': np.random.randint(0, 3, 10),
        'feature5': np.random.randint(0, 3, 10),
        'feature6': np.random.randint(0, 3, 10),
        'target1': np.random.randint(0, 2, 10),
        'target2': np.random.randint(0, 2, 10),
        'target3': np.random.randint(0, 2, 10)
    }
)
features = ['feature1', 'feature2', 'feature3','feature4', 'feature5', 'feature6']
targets = ['target1', 'target2', 'target3']
Categorical_Cols = ['feature4', 'feature5', 'feature6']
Numerical_Cols = ['feature1', 'feature2', 'feature3']
wide_columns = [tf.feature_column.categorical_column_with_vocabulary_list(key=x, vocabulary_list=[0, 1, -1])
for x in Categorical_Cols]
deep_columns = [tf.feature_column.numeric_column(x) for x in Numerical_Cols]
def wrap_dataset(df, features, labels):
    dataset = (
        tf.data.Dataset.from_tensor_slices(
            (
                tf.cast(df[features].values, tf.float32),
                tf.cast(df[labels].values, tf.int32),
            )
        )
    )
    return dataset
input_fn_train = wrap_dataset(training_df, features, targets)
m = tf.contrib.estimator.DNNLinearCombinedEstimator(
head=tf.contrib.estimator.multi_label_head(n_classes=2),
# wide settings
linear_feature_columns=wide_columns,
# linear_optimizer=tf.train.FtrlOptimizer(...),
# deep settings
dnn_feature_columns=deep_columns,
# dnn_optimizer=tf.train.ProximalAdagradOptimizer(...),
dnn_hidden_units=[10, 30, 10])
m.train(input_fn=input_fn_train)
In this example, we have six features:
3 numerical features: feature1, feature2, and feature3
3 categorical features: feature4, feature5, and feature6
Each sample has three labels, and each label takes a binary value: 0 or 1.
The error is about the input function, and I cannot figure out how to define it correctly.
Any help correcting the code is appreciated.
UPDATE: The error is:
TypeError: <TensorSliceDataset shapes: ((6,), (3,)), types: (tf.float32, tf.int32)> is not a callable object
Since the error says the input is not a callable object, you can simply wrap it in a lambda and it should work:
input_fn_train = lambda: wrap_dataset(training_df, features, targets)
Also, I think you need to sort out how you pass your data to the Estimator. Since you are using feature columns, it probably expects a dictionary of tensors keyed by feature name; right now you are passing plain tensors, not a dictionary of tensors. Check out this useful post.
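For illustration, here is a minimal sketch of what such a dictionary-based input function could look like. This is my own assumption rather than the original poster's code: the name input_fn_train_dict and the batch size of 10 are made up, and the head/estimator configuration is left untouched.

def input_fn_train_dict():
    # Map each feature-column name to its values; feature columns look up
    # their inputs by key in this dictionary.
    feature_dict = {name: training_df[name].values for name in features}
    label_array = training_df[targets].values.astype('int32')
    dataset = tf.data.Dataset.from_tensor_slices((feature_dict, label_array))
    # Estimators expect batched input, so batch the (tiny) dataset.
    return dataset.batch(10)

m.train(input_fn=input_fn_train_dict)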
Finally, I figured out how to make the code work. I am posting it here to help people who would like to do multi-label classification using the built-in DNNLinearCombinedEstimator from the tensorflow package, version 1.13.
import numpy as np
import pandas as pd
import tensorflow as tf
# from tensorflow import contrib
tf.enable_eager_execution()
training_df: pd.DataFrame = pd.DataFrame(
    data={
        'feature1': np.random.rand(10),
        'feature2': np.random.rand(10),
        'feature3': np.random.rand(10),
        'feature4': np.random.randint(0, 3, 10),
        'feature5': np.random.randint(0, 3, 10),
        'feature6': np.random.randint(0, 3, 10),
        'target1': np.random.randint(0, 2, 10),
        'target2': np.random.randint(0, 2, 10),
        'target3': np.random.randint(0, 2, 10)
    }
)
features = ['feature1', 'feature2', 'feature3','feature4', 'feature5', 'feature6']
targets = ['target1', 'target2', 'target3']
Categorical_Cols = ['feature4', 'feature5', 'feature6']
Numerical_Cols = ['feature1', 'feature2', 'feature3']
wide_columns = [tf.feature_column.categorical_column_with_vocabulary_list(key=x, vocabulary_list=[0, 1, -1])
for x in Categorical_Cols]
deep_columns = [tf.feature_column.numeric_column(x) for x in Numerical_Cols]
def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in Numerical_Cols}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].values,
        dense_shape=[df[k].size, 1])
        for k in Categorical_Cols}
    # Merges the two dictionaries into one.
    feature_cols = continuous_cols.copy()
    feature_cols.update(categorical_cols)
    # Uses the passed-in df (rather than the global training_df) and .values,
    # since DataFrame.as_matrix is deprecated.
    labels = tf.convert_to_tensor(df[targets].values, dtype=tf.int32)
    return feature_cols, labels
def train_input_fn():
    return input_fn(training_df)

def eval_input_fn():
    return input_fn(training_df)
m = tf.contrib.learn.DNNLinearCombinedEstimator(
head=tf.contrib.learn.multi_label_head(n_classes=3),
# wide settings
linear_feature_columns=wide_columns,
# linear_optimizer=tf.train.FtrlOptimizer(...),
# deep settings
dnn_feature_columns=deep_columns,
# dnn_optimizer=tf.train.ProximalAdagradOptimizer(...),
dnn_hidden_units=[10, 10])
m.train(input_fn=train_input_fn, steps=20)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
print("#########################################################")
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
I am trying to solve this clustering problem that involves the K-means algorithm.
Question:
Considering the data inside the file - link below - execute the K-means algorithm where the initial centroids are positioned at:
[1,1,1,1],[-1,-1,-1,-1] and [1,-1,1,-1]. What is the position of each centroid after 10 iterations?
My solution, which I am not sure about:
Basic Code:
kmeans = KMeans(n_clusters = 3 , max_iter= 10, init = np.array([[1, 1, 1, 1],[-1, -1, -1, -1],[1, -1, 1, -1]], np.float64) , random_state = 42)
...
kmeans.cluster_centers_
Answer:
array([[ 1.02575735, -0.00207592, -0.02395886, 0.63623732],
[ 0.10361404, 0.00370027, 0.00669603, -0.03432606],
[ 0.99690983, 0.48052607, 0.94034839, -0.00726928]])
Data: https://drive.google.com/file/d/1DXlFR3Jc5cFiblMxD6Bl7f4p7u_qsX2S/view?usp=sharing
Google Colaboratory full code: https://colab.research.google.com/drive/1somvP3p7KES0NtBwnLYT6vpqSr3WQfgU?usp=sharing
I used my own code to check your answer and it was right.
import pandas as pd
import numpy as np
df = pd.read_csv('agrupamento_Q1.csv')
data = df.to_numpy()
centeroids = np.array([[1.0,1.0,1.0,1.0],[-1.0,-1.0,-1.0,-1.0],[1.0,-1.0,1.0,-1.0]])
iterations = 10
for itr in range(iterations):
    assign = np.zeros([data.shape[0], ], dtype=int)
    for i in range(data.shape[0]):
        for c in range(1, 3):
            if np.linalg.norm(data[i] - centeroids[c]) < np.linalg.norm(data[i] - centeroids[assign[i]]):
                assign[i] = c
    new_cent = np.zeros_like(centeroids)
    cent_pop = np.zeros([centeroids.shape[0], ])
    for i in range(data.shape[0]):
        new_cent[assign[i]] += data[i]
        cent_pop[assign[i]] += 1
    for i in range(centeroids.shape[0]):
        centeroids[i] = new_cent[i] / cent_pop[i]
print(centeroids)
# [[ 1.02575735 -0.00207592 -0.02395886 0.63623732]
# [ 0.10361404 0.00370027 0.00669603 -0.03432606]
# [ 0.99690983 0.48052607 0.94034839 -0.00726928]]
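For completeness, roughly the same check can be sketched with scikit-learn itself. This is an addition of my own, not part of the original answer: it assumes the same agrupamento_Q1.csv file, passes n_init=1 because explicit initial centroids are supplied, and the result can differ very slightly from the manual loop because KMeans may stop before 10 iterations if it converges.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('agrupamento_Q1.csv')
init_centroids = np.array([[1, 1, 1, 1],
                           [-1, -1, -1, -1],
                           [1, -1, 1, -1]], dtype=np.float64)

# max_iter=10 mirrors the 10 iterations asked for in the question;
# n_init=1 because the initial centroids are given explicitly.
kmeans = KMeans(n_clusters=3, init=init_centroids, n_init=1, max_iter=10)
kmeans.fit(df.to_numpy())
print(kmeans.cluster_centers_)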
I'm trying to use a list as a value in a pandas.DataFrame, but I'm getting an exception when trying to use the adapt function of the Normalization layer with the NumPy array.
This is the error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
and this is the code:
import pandas as pd
import numpy as np
# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)
import tensorflow as tf
from tensorflow.keras import layers
data = [[45.975, 45.81, 45.715, 45.52, 45.62, 45.65, 4],
[55.67, 55.975, 55.97, 56.27, 56.23, 56.275, 5],
[86.87, 86.925, 86.85, 85.78, 86.165, 86.165, 3],
[64.3, 64.27, 64.285, 64.29, 64.325, 64.245, 6],
[35.655, 35.735, 35.66, 35.69, 35.665, 35.63, 5]
]
lables = [0, 1, 0, 1, 1]
def do():
    d_1 = None
    for l, d in zip(lables, data):
        if d_1 is None:
            d_1 = pd.DataFrame({'lable': l, 'close_price': [d]})
        else:
            d_1 = d_1.append({'lable': l, 'close_price': d}, ignore_index=True)
    dataset = d_1.copy()
    print(dataset.isna().sum())
    dataset = dataset.dropna()
    print(dataset.keys())
    train_dataset = dataset.sample(frac=0.8, random_state=0)
    test_dataset = dataset.drop(train_dataset.index)
    print(train_dataset.describe().transpose())
    train_features = train_dataset.copy()
    test_features = test_dataset.copy()
    train_labels = train_features.pop('lable')
    test_labels = test_features.pop('lable')
    print(train_dataset.describe().transpose()[['mean', 'std']])
    normalizer = tf.keras.layers.Normalization(axis=-1)
    ar = np.array(train_features)
    normalizer.adapt(ar)
    print(normalizer.mean.numpy())
    first = np.array(train_features[:1])
    with np.printoptions(precision=2, suppress=True):
        print('First example:', first)
        print()
        print('Normalized:', normalizer(first).numpy())
    diraction = np.array(train_features)
    diraction_normalizer = layers.Normalization(input_shape=[1, ], axis=None)
    diraction_normalizer.adapt(diraction)
    diraction_model = tf.keras.Sequential([
        diraction_normalizer,
        layers.Dense(units=1)
    ])
    print(diraction_model.summary())
    print(diraction_model.predict(diraction[:10]))
    diraction_model.compile(
        optimizer=tf.optimizers.Adam(learning_rate=0.1),
        loss='mean_absolute_error')
    print(train_features['close_price'])
    history = diraction_model.fit(
        train_features['close_price'],
        train_labels,
        epochs=100,
        # Suppress logging.
        verbose=0,
        # Calculate validation results on 20% of the training data.
        validation_split=0.2)
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    print(hist.tail())
    test_results = {}
    test_results['diraction_model'] = diraction_model.evaluate(
        test_features,
        test_labels, verbose=0)
    x = tf.linspace(0.0, 250, 251)
    y = diraction_model.predict(x)
    print("end")

def main():
    do()

if __name__ == "__main__":
    main()
I think it is not the usual practice to shrink your features into one column.
A quick fix is to put the following line
train_features = np.array(train_features['close_price'].to_list())
before
normalizer = tf.keras.layers.Normalization(axis=-1)
to get rid of the error. But because train_features then changes from a DataFrame into an np.array, your subsequent code may break, so you need to take care of that too.
If I were you, however, I would construct the DataFrame this way:
df = pd.DataFrame(data)
df['label'] = lables
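A minimal sketch of how the rest of the preprocessing could then look. This is an assumption on my side, reusing data and lables exactly as defined in the question: with one numeric column per price value, adapt receives a plain 2-D float array instead of a column of Python lists.

import numpy as np
import pandas as pd
import tensorflow as tf

# data and lables as defined in the question above
df = pd.DataFrame(data)      # one column per close-price value
df['label'] = lables

train_features = df.drop(columns=['label'])
train_labels = df['label']

normalizer = tf.keras.layers.Normalization(axis=-1)
# No object-dtype lists involved, so adapt gets a clean float array.
normalizer.adapt(np.array(train_features, dtype='float32'))
print(normalizer.mean.numpy())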
Please consider the following code.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train,Y_train)
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, Y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, knn.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
error:
File "C:\Users\shaar\.spyder-py3\MLPractice\KNN.py", line 55, in <module>
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1])
IndexError: too many indices for array: array is 2-dimensional, but 3 were indexed
This usually occurs when you index another dimension of a NumPy array that only has one dimension. To be clearer, if you have an array like
a = [1,2,3,4]
and you later access its values with an index like (1, 2), NumPy treats that as asking for the 1st row and 2nd column of a 2-D array, which does not exist here.
So avoid using a comma when indexing a 1-D array. Hope I'm clear; if not, consider checking https://www.w3schools.com/python/numpy/numpy_creating_arrays.asp
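A minimal self-contained illustration of the difference (just NumPy, nothing from the question's code):

import numpy as np

a = np.array([1, 2, 3, 4])   # 1-D array: only one index is allowed
print(a[1])                  # 2

# a[1, 2] would raise:
# IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

b = a.reshape(2, 2)          # 2-D array: a row and a column index are valid
print(b[1, 0])               # 3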
It's a little confusing how NumPy arrays behave when they have two dimensions but one of them has length one.
In your code snippet we cannot see what you fill into y_set in
X_set, y_set = X_test, Y_test
but I think if you look at the dimensions of y_set with y_set.shape you will get
(150, 1)
(I assume there are 150 samples). NumPy expects one index for each dimension. To select the wanted dimension you can index the unwanted dimension with 0:
y_set_one_dimension = y_set[:,0]
print(y_set_one_dimension.shape)
just like how it is described in How to access the ith column of a NumPy multidimensional array?
The output will be:
(150,)
Now the scatter plot will get the wanted two indices for its two dimensions and will work.
Note:
If y_set is a DataFrame, you have to convert it to a NumPy array first with:
yArray = numpy.array(y_set)
X_set, y_set = X_test, Y_test.ravel()
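A small sketch of what that shape change does (Y_test here is a dummy stand-in, since the real data is not shown in the question):

import numpy as np

Y_test = np.zeros((150, 1))   # stand-in for the real labels, shape (150, 1)
print(Y_test.shape)           # (150, 1) -- two indices are expected

y_set = Y_test.ravel()        # equivalent to Y_test[:, 0]
print(y_set.shape)            # (150,) -- one index, so X_set[y_set == j, 0] works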
I'm trying to use dask.array.map_blocks to process a dask array, using a second dask array with different shape as an argument. The use case is firstly running some peak finding on a 2-D stack of images (4-dimensions), which is returned as a 2-D dask array of np.objects. Ergo, the two first dimensions of the two dask arrays are the same. The peaks are then used to extract intensities from the 4-dimensional dataset. In the code below, I've omitted the peak finding part. Dask version 1.0.0.
import numpy as np
import dask.array as da
def test_processing(data_chunk, position_chunk):
    output_array = np.empty(data_chunk.shape[:-2], dtype='object')
    for index in np.ndindex(data_chunk.shape[:-2]):
        islice = np.s_[index]
        intensity_list = []
        data = data_chunk[islice]
        positions = position_chunk[islice]
        for x, y in positions:
            intensity_list.append(data[x, y])
        output_array[islice] = np.array(intensity_list)
    return output_array
data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
positions = np.empty(data.shape[:-2], dtype='object')
for index in np.ndindex(positions.shape):
    positions[index] = np.arange(10).reshape(5, 2)
data_output = da.map_blocks(test_processing, data, positions, dtype=np.object,
                            chunks=(2, 2), drop_axis=(2, 3))
data_output.compute()
This gives the error ValueError: Can't drop an axis with more than 1 block. Please use atop instead. I'm guessing this is because positions has 3 dimensions, while data has 4 dimensions.
The same function, but without the positions dask array, works fine:
import numpy as np
import dask.array as da
def test_processing(data_chunk):
    output_array = np.empty(data_chunk.shape[:-2], dtype='object')
    for index in np.ndindex(data_chunk.shape[:-2]):
        islice = np.s_[index]
        intensity_list = []
        data = data_chunk[islice]
        positions = [[5, 2], [1, 3]]
        for x, y in positions:
            intensity_list.append(data[x, y])
        output_array[islice] = np.array(intensity_list)
    return output_array
data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
data_output = da.map_blocks(test_processing, data, dtype=np.object,
                            chunks=(2, 2), drop_axis=(2, 3))
data_computed = data_output.compute()
This has been fixed in more recent versions of dask: running the same code on version 2.3.0 of dask works fine.
I am trying to plot the results of PCA of the dataset pima-indians-diabetes.csv. My code shows a problem only in the plotting piece:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1],
                color=color, alpha=.8, lw=lw, label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
Kindly, how to fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
yields
(768, 1)
This leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
because the boolean mask y == i is itself two-dimensional, so the error indicates that you used too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
principalComponents[(y == i).reshape(-1), 1],
color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of shape (R, 1) and (R,), this question on Stack Overflow provides a nice starting point.
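Equivalently, you could flatten the target once right after extracting it from the DataFrame, which keeps the scatter call itself unchanged. A small sketch, reusing df, principalComponents, colors and lw from the question's script (the use of ravel and the shortened class list [0, 1] are my own additions):

y = df.loc[:, ['9']].values.ravel()   # shape (768,) instead of (768, 1)

for color, i, target_name in zip(colors, [0, 1], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1],
                color=color, alpha=.8, lw=lw, label=target_name)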