Incompatible row dimensions when using passthrough in GridSearch over sklearn Pipeline with FeatureUnion - scikit-learn

I am trying to do a grid search over an sklearn pipeline that uses a custom transformer inside a FeatureUnion. It works fine when the pipeline uses the custom transformer class in the FeatureUnion; however, it fails when the custom step is disabled by setting "passthrough" in the grid search parameters.
The full pipeline is defined as follows:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion
ngram_vectorizer = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer())
])

pipe_full = Pipeline([
    ("features", FeatureUnion([
        ("ngrams", ngram_vectorizer),
        ("lengths", TextLengthExtractor())
    ])),
    ("classifier", MultinomialNB())
])
The custom transformer class TextLengthExtractor simply computes the number of characters in each input string:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        string_lengths = np.array([len(doc) for doc in X])
        return string_lengths.reshape(-1, 1)
The tuning parameters for grid search are defined through a dictionary params. Importantly, the parameters for the custom TextLengthExtractor contain the "passthrough" option, intended to ignore the entire features__lengths step of the pipeline (see also sklearn's documentation on pipelines):
params = {
    "features__lengths": [TextLengthExtractor(), "passthrough"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
When the pipeline is fit on the following dummy data
X_train_dummy = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa", "c", "bca"]
y_train_dummy = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
pipe_full.fit(X_train_dummy, y_train_dummy)
it can be seen that the lengths step of the FeatureUnion pipeline works as expected:
pipe_full["features"].get_params()["lengths"].transform(X_train_dummy)
# gives the following output of shape (10,1)
# array([[1], [2], [4], [5], [9], [2], [4], [8], [1], [3]])
However - and now comes the problem - when grid search is performed as follows:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe_full, params, cv=5, n_jobs=-1, verbose=10)
grid_search.fit(X_train_dummy, y_train_dummy)
all fits that ignore the lengths step (as defined by the "passthrough" option from params["features__lengths"]) throw the following error:
5 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 378, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\dev\NameClassification\venv\lib\site-packages\joblib\memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1162, in fit_transform
return self._hstack(Xs)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1216, in _hstack
Xs = sparse.hstack(Xs).tocsr()
File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 532, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 665, in bmat
raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 8.
I do understand that the FeatureUnion requires identical row dimensions for both ngrams and lengths: the number of rows in each extracted feature matrix must equal the number of samples in the respective split. However, I have no idea how to control the shape of the matrices when ignoring the lengths part of the FeatureUnion via the passthrough option in the grid search params.
I have not found any solution to the problem on SE or any other sklearn-related resource. Does anyone have an idea how to solve the issue?

I think I found the solution to the problem: to ignore an individual step in a FeatureUnion, the string "drop" rather than "passthrough" must be used. According to sklearn's documentation of FeatureUnion:
Parameters of the transformers may be set using its name and the parameter name separated by a '__'. A transformer may be replaced entirely by setting the parameter with its name to another transformer, removed by setting to 'drop' or disabled by setting to 'passthrough' (features are passed without transformation).
An example of dropping an entire transformer in FeatureUnion is also shown in sklearn's user guide on pipelines.
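As a quick sanity check outside of grid search, the step can also be dropped directly on the pipeline via set_params. A minimal sketch using the dummy data from above:

pipe_full.set_params(features__lengths="drop")
pipe_full.fit(X_train_dummy, y_train_dummy)  # only the ngrams features remain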
In conclusion, to solve my problem, I had to replace "passthrough" with "drop" in the grid search parameter dictionary as follows:
Change from
params = {
    "features__lengths": [TextLengthExtractor(), "passthrough"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
to
params = {
    "features__lengths": [TextLengthExtractor(), "drop"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
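With this change, the grid search call from above runs with no failed fits. A minimal sketch, reusing the objects defined earlier:

grid_search = GridSearchCV(pipe_full, params, cv=5, n_jobs=-1, verbose=10)
grid_search.fit(X_train_dummy, y_train_dummy)  # all 10 fits now succeed
print(grid_search.best_params_)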

Related

Is it possible to use a custom generator to train a multi-input architecture with Keras / TensorFlow 2.0.0?

With TF 2.0.0, I can train an architecture with one input, I can train an architecture with one input using a custom generator, and I can train an architecture with two inputs. But I can't train an architecture with two inputs using a custom generator.
To keep it minimalist, here's a simple example, with no generator and no multiple inputs to start with:
from tensorflow.keras import layers, models, Model, Input, losses
from numpy import random, array, zeros
input1 = Input(shape=2)
dense1 = layers.Dense(5)(input1)
fullModel = Model(inputs=input1, outputs=dense1)
fullModel.summary()
# Generate random examples:
nbSamples = 21
X_train = random.rand(nbSamples, 2)
Y_train = random.rand(nbSamples, 5)
batchSize = 4
fullModel.compile(loss=losses.LogCosh())
fullModel.fit(X_train, Y_train, epochs=10, batch_size=batchSize)
It's a simple dense layer that takes input vectors of size 2. The randomly generated dataset contains 21 examples and the batch size is 4. Instead of loading all the data and handing it to model.fit(), we can also pass a custom generator as input. The main advantage (for RAM consumption) is that only one batch is loaded at a time rather than the whole dataset. Here is a simple example with the previous architecture and a custom generator:
import json

# Save the last dataset in a file:
with open("./dataset1input.txt", 'w') as file:
    for i in range(nbSamples):
        example = {"x": X_train[i].tolist(), "y": Y_train[i].tolist()}
        file.write(json.dumps(example) + "\n")

def generator1input(datasetPath, batch_size, inputSize, outputSize):
    X_batch = zeros((batch_size, inputSize))
    Y_batch = zeros((batch_size, outputSize))
    i = 0
    while True:
        with open(datasetPath, 'r') as file:
            for line in file:
                example = json.loads(line)
                X_batch[i] = array(example["x"])
                Y_batch[i] = array(example["y"])
                i += 1
                if i % batch_size == 0:
                    yield (X_batch, Y_batch)
                    i = 0

fullModel.compile(loss=losses.LogCosh())
my_generator = generator1input("./dataset1input.txt", batchSize, 2, 5)
fullModel.fit(my_generator, epochs=10, steps_per_epoch=int(nbSamples/batchSize))
Here, the generator opens the dataset file but loads only batch_size examples (not nbSamples examples) each time it is called, advancing through the file as it loops.
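A quick way to check that behavior is to pull a single batch manually; a small sketch, reusing the file and sizes from above:

gen = generator1input("./dataset1input.txt", batchSize, 2, 5)
X_batch, Y_batch = next(gen)
print(X_batch.shape, Y_batch.shape)  # (4, 2) and (4, 5): one batch, not the whole dataset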
Now, I can build a simple functional architecture with 2 inputs, and no generator:
input1 = Input(shape=2)
dense1 = layers.Dense(5)(input1)
subModel1 = Model(inputs=input1, outputs=dense1)
input2 = Input(shape=3)
dense2 = layers.Dense(5)(input2)
subModel2 = Model(inputs=input2, outputs=dense2)
averageLayer = layers.average([subModel1.output, subModel2.output])
fullModel = Model(inputs=[input1, input2], outputs=averageLayer)
fullModel.summary()
# Generate random examples:
nbSamples = 21
X1 = random.rand(nbSamples, 2)
X2 = random.rand(nbSamples, 3)
Y = random.rand(nbSamples, 5)
fullModel.compile(loss=losses.LogCosh())
fullModel.fit([X1, X2], Y, epochs=10, batch_size=batchSize)
Up to this point, all models compile and run, but I'm not able to use a generator with the last architecture and its 2 inputs. I tried the following code (which should logically work, in my opinion):
# Save data in a file:
with open("./dataset.txt", 'w') as file:
    for i in range(nbSamples):
        example = {"x1": X1[i].tolist(), "x2": X2[i].tolist(), "y": Y[i].tolist()}
        file.write(json.dumps(example) + "\n")

def generator(datasetPath, batch_size, inputSize1, inputSize2, outputSize):
    X1_batch = zeros((batch_size, inputSize1))
    X2_batch = zeros((batch_size, inputSize2))
    Y_batch = zeros((batch_size, outputSize))
    i = 0
    while True:
        with open(datasetPath, 'r') as file:
            for line in file:
                example = json.loads(line)
                X1_batch[i] = array(example["x1"])
                X2_batch[i] = array(example["x2"])
                Y_batch[i] = array(example["y"])
                i += 1
                if i % batch_size == 0:
                    yield ([X1_batch, X2_batch], Y_batch)
                    i = 0

fullModel.compile(loss=losses.LogCosh())
my_generator = generator("./dataset.txt", batchSize, 2, 3, 5)
fullModel.fit(my_generator, epochs=10, steps_per_epoch=(nbSamples//batchSize))
I obtain the following error:
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 729, in fit
use_multiprocessing=use_multiprocessing)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 224, in fit
distribution_strategy=strategy)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 547, in _process_training_inputs
use_multiprocessing=use_multiprocessing)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 606, in _process_inputs
use_multiprocessing=use_multiprocessing)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\keras\engine\data_adapter.py", line 566, in __init__
reassemble, nested_dtypes, output_shapes=nested_shape)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 540, in from_generator
output_types, tensor_shape.as_shape, output_shapes)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\data\util\nest.py", line 471, in map_structure_up_to
results = [func(*tensors) for tensors in zip(*all_flattened_up_to)]
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\data\util\nest.py", line 471, in <listcomp>
results = [func(*tensors) for tensors in zip(*all_flattened_up_to)]
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py", line 1216, in as_shape
return TensorShape(shape)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py", line 776, in __init__
self._dims = [as_dimension(d) for d in dims_iter]
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py", line 776, in <listcomp>
self._dims = [as_dimension(d) for d in dims_iter]
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py", line 718, in as_dimension
return Dimension(value)
File "C:\Anaconda\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py", line 193, in __init__
self._value = int(value)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'
As explained in the docs, the x argument of model.fit() can be "A generator or keras.utils.Sequence returning (inputs, targets)", and "The iterator should return a tuple of length 1, 2, or 3, where the optional second and third elements will be used for y and sample_weight respectively." Thus, I think it cannot take more than one generator as input. Perhaps multiple inputs are not possible with a custom generator. Please, would you have an explanation? A solution?
(Otherwise, it seems possible to go through tf.data.Dataset.from_generator() with a less custom approach, but I have difficulty understanding what to indicate in the output_signature argument.)
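For reference, a hedged sketch of that tf.data route, assuming TF >= 2.4 (where output_signature is available) and assuming the generator yields tuples rather than lists, i.e. ((X1_batch, X2_batch), Y_batch):

import tensorflow as tf

# Each element is ((x1, x2), y); None stands for the batch dimension.
dataset = tf.data.Dataset.from_generator(
    lambda: generator("./dataset.txt", batchSize, 2, 3, 5),
    output_signature=(
        (tf.TensorSpec(shape=(None, 2), dtype=tf.float64),
         tf.TensorSpec(shape=(None, 3), dtype=tf.float64)),
        tf.TensorSpec(shape=(None, 5), dtype=tf.float64),
    ),
)
fullModel.fit(dataset, epochs=10, steps_per_epoch=nbSamples // batchSize)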
[EDIT] Thank you for your response, @Francis Tang. In fact, it is possible to use a custom generator: your answer made me realize I just had to change the line:
yield ([X1_batch, X2_batch], Y_batch)
To:
yield (X1_batch, X2_batch), Y_batch
Nevertheless, it is perhaps indeed better to use tf.keras.utils.Sequence, though I find it a bit restrictive.
In particular, I understand from the example given (as well as from most examples I could find about Sequence) that __init__() is first used to load the full dataset, which defeats the purpose of a generator.
But maybe that was just this particular example, and there is no need to use __init__() like that: you could read the file directly and load only the desired batch inside __getitem__().
In that case, however, it seems you would have to scan the data file on every call, or else create one file per batch beforehand (not really optimal).
You need to write a class using Sequence: https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence

import pickle
from tensorflow.keras.utils import Sequence

class generator(Sequence):
    def __init__(self, filename, batch_size):
        data = pickle.load(open(filename, 'rb'))
        self.X1 = data['X1']
        self.X2 = data['X2']
        self.y = data['y']
        self.bs = batch_size

    def __len__(self):
        return (len(self.y) - 1) // self.bs + 1

    def __getitem__(self, idx):
        start, end = idx * self.bs, (idx + 1) * self.bs
        return (self.X1[start:end], self.X2[start:end]), self.y[start:end]
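Hypothetical usage, assuming a dict {'X1': X1, 'X2': X2, 'y': Y} was pickled to a file beforehand:

# The Sequence instance is passed to fit() directly, like a generator.
seq = generator("dataset.pkl", batchSize)
fullModel.fit(seq, epochs=10)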

How to predict unseen data?

Hi, I am practicing ML models and facing an issue while trying to predict on unseen data.
The error occurs while doing the one-hot encoding of the categorical data.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_x_1 = LabelEncoder() #will encode country
X[:,1] = labelencoder_x_1.fit_transform(X[:,1])
labelencoder_x_2 = LabelEncoder() #will encode Gender
X[:,2] = labelencoder_x_2.fit_transform(X[:,2])
onehotencoder_x = OneHotEncoder(categorical_features=[1])
X= onehotencoder_x.fit_transform(X).toarray()
X = X[:,1:]
My X has 11 columns; columns 2 and 3 are categorical (Country and Gender).
The model runs fine, but testing it against a random input fails at the one-hot encoding step.
input = [[619], ['France'], ['Male'], [42], [2], [0.0], [1], [1], [1],[101348.88]]
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input= onehotencoder_x.fit_transform(input).toarray()
Error:
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:451:
DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20
and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-44-44a43edf17aa>", line 1, in <module>
input= onehotencoder_x.fit_transform(input).toarray()
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in
fit_transform
self._handle_deprecations(X)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in
_handle_deprecations
n_features = X.shape[1]
AttributeError: 'list' object has no attribute 'shape'
I believe this is because you have nested lists.
You should flatten your input list and use that for the prediction.
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input = [item for sublist in input for item in sublist]
input = onehotencoder_x.fit_transform(input).toarray()
If you have a nested list, each element of the list is considered a separate item that must go through the fit_transform function, but since each is a single element, it does not match the shape fit_transform expects, which is [1, 10] (1 row, 10 columns).
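Concretely, a sketch of that shape fix (hedged: with this legacy encoder, unseen data would normally also go through transform() rather than fit_transform(), so that the encoding learned on the training set is reused):

import numpy as np

# Flatten the nested list, then reshape to the 2-D (1 row, n columns)
# layout the encoder expects.
flat = [item for sublist in input for item in sublist]
row = np.array(flat, dtype=object).reshape(1, -1)
encoded = onehotencoder_x.transform(row).toarray()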

Bugs when fitting Multi label text classification models

I am now trying to fit a classification model for a multi-label text classification problem.
I have a train set X_train that contains a list of cleaned texts, like
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a train tag set y, with multiple tags corresponding to each text in X_train, like
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
Now I want to fit a model that can predict the tags for a text set X_test that has the same format as X_train.
I have used MultiLabelBinarizer to convert the tags and TfidfVectorizer to convert the cleaned texts in the train set.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)

vectorizer = TfidfVectorizer(stop_words=stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)
But when I try to fit the model, I always get errors. I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model I get errors like
Traceback (most recent call last):
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
When I fit a LogisticRegression model I get a warning like
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Does anyone know where the problem is and how to solve it? Many thanks.
OneVsRestClassifier fits one classifier per class. You need to tell it which type of classifier you want (for example, logistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)
X_test= ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
output: array([[0, 1, 1, 0, 0, 1]])
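To map the 0/1 indicator row back to tag names, the fitted MultiLabelBinarizer from the question can be reused; a small sketch:

# inverse_transform returns one tuple of tag strings per test document.
predicted_tags = multilabel_binarizer.inverse_transform(classifier.predict(x_test))
print(predicted_tags)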

Tensorflow Adagrad optimizer isn't working

When I run the following script, I notice the following couple of errors:
import tensorflow as tf
import numpy as np
import seaborn as sns
import random

# set random seed:
random.seed(42)

def potential(N):
    points = np.random.rand(N, 2) * 10
    values = np.array([np.exp((points[i][0] - 5.0)**2 + (points[i][1] - 5.0)**2) for i in range(N)])
    return points, values

def init_weights(shape, var_name):
    """
    Xavier initialisation of neural networks
    """
    init = tf.contrib.layers.xavier_initializer()
    return tf.get_variable(initializer=init, name=var_name, shape=shape)

def neural_net(X):
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        w_h = init_weights([2, 10], "w_h")
        w_h2 = init_weights([10, 10], "w_h2")
        w_o = init_weights([10, 1], "w_o")
        ### bias terms:
        bias_1 = init_weights([10], "bias_1")
        bias_2 = init_weights([10], "bias_2")
        bias_3 = init_weights([1], "bias_3")
        h = tf.nn.relu(tf.add(tf.matmul(X, w_h), bias_1))
        h2 = tf.nn.relu(tf.add(tf.matmul(h, w_h2), bias_2))
        return tf.nn.relu(tf.add(tf.matmul(h2, w_o), bias_3))

X = tf.placeholder(tf.float32, [None, 2])

with tf.Session() as sess:
    model = neural_net(X)
    ## define optimizer:
    opt = tf.train.AdagradOptimizer(0.0001)
    values = tf.placeholder(tf.float32, [None, 1])
    squared_loss = tf.reduce_mean(tf.square(model - values))
    ## define model variables:
    model_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "model")
    train_model = opt.minimize(squared_loss, var_list=model_vars)
    sess.run(tf.global_variables_initializer())
    for i in range(10):
        points, val = potential(100)
        train_feed = {X: points, values: val.reshape((100, 1))}
        sess.run(train_model, feed_dict=train_feed)
        print(sess.run(model, feed_dict={X: points}))
    ### plot the approximating model:
    res = 0.1
    xy = np.mgrid[0:10:res, 0:10:res].reshape(2, -1).T
    values = sess.run(model, feed_dict={X: xy})
    sns.heatmap(values.reshape((int(10/res), int(10/res))), xticklabels=False, yticklabels=False)
On the first run I get:
[nan] [nan] [nan] [nan] [nan] [nan] [nan]]
Traceback (most recent call last):
...
File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/seaborn/matrix.py", line 485, in heatmap
    yticklabels, mask)
File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/seaborn/matrix.py", line 167, in init
    cmap, center, robust)
File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/seaborn/matrix.py", line 206, in _determine_cmap_params
    vmin = np.percentile(calc_data, 2) if robust else calc_data.min()
File "/Users/aidanrockea/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity
On the second run I have:
ValueError: Variable model/w_h/Adagrad/ already exists, disallowed.
Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope?
It's not clear to me why I get either of these errors. Furthermore, when I use:
for i in range(10):
    points, val = potential(10)
    train_feed = {X: points, values: val.reshape((10, 1))}
    sess.run(train_model, feed_dict=train_feed)
    print(sess.run(model, feed_dict={X: points}))
I find that on the first run, I sometimes get a network that has collapsed to the constant function with output 0. Right now my hunch is that this might simply be a numerics problem but I might be wrong.
If so, it's a serious problem as the model I have used here is very simple.
Right now my hunch is that this might simply be a numerics problem
Indeed: when running potential(100) I sometimes get values as large as 1e21. The largest points will dominate your loss function and will drive the network parameters.
Even when normalizing your target values e.g. to unit variance, the problem of the largest values dominating the loss would still remain (look e.g. at plt.hist(np.log(potential(100)[1]), bins = 100)).
If you can, try learning the log of val instead of val itself. Note however that then you are changing the assumption of the loss function from 'predictions follow a normal distribution around the target values' to 'log predictions follow a normal distribution around log of the target values'.
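A minimal sketch of that change, reusing the training loop from the question (note that predictions then live in log space, so np.exp() maps them back):

for i in range(10):
    points, val = potential(100)
    # Train against log-targets; potential() returns strictly positive
    # values, so the log is well-defined (and non-negative here).
    train_feed = {X: points, values: np.log(val).reshape((100, 1))}
    sess.run(train_model, feed_dict=train_feed)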

Building my own tf.Estimator, how did model_params overwrite model_dir? RuntimeWarning?

Recently I built a customized deep neural net model using TFLearn, which claims to bring deep learning to the scikit-learn estimator API. I could train models and make predictions, but I couldn't get the scoring (evaluate) function to work, so I couldn't do cross-validation. I tried to ask questions about TFLearn in various places, but I got no responses.
It appears that TensorFlow itself has an estimator class. So I am putting TFLearn aside, and I'm trying to follow the guide at https://www.tensorflow.org/extend/estimators. Somehow I'm managing to get variables where they don't belong. Can anyone spot my problem? I will post code and the output.
Note: Of course, I can see the RuntimeWarning at the top of the output. I have found references to this warning online, but so far everyone claims it's harmless. Maybe it is not...
CODE:
import tensorflow as tf
from my_library import Database, l2_angle_distance

def my_model_function(topology, params):
    # This function will eventually be a function factory. This should
    # allow easy exploration of hyperparameters. For now, this just
    # returns a single, fixed model_fn.
    def model_fn(features, labels, mode):
        # Input layer
        net = tf.layers.conv1d(features["x"], topology[0], 3, activation=tf.nn.relu)
        net = tf.layers.dropout(net, 0.25)
        # The core of the network is here (convolutional layers only for now).
        for nodes in topology[1:]:
            net = tf.layers.conv1d(net, nodes, 3, activation=tf.nn.relu)
            net = tf.layers.dropout(net, 0.25)
        sh = tf.shape(features["x"])
        net = tf.reshape(net, [sh[0], sh[1], 3, 2])
        predictions = tf.nn.l2_normalize(net, dim=3)
        # PREDICT EstimatorSpec
        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode=mode,
                                              predictions={"vectors": predictions})
        # TRAIN or EVAL EstimatorSpec
        loss = l2_angle_distance(labels, predictions)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=params["learning_rate"])
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, predictions, loss, train_op)
    return model_fn

##===================================================================

window = "whole"
encoding = "one_hot"
db = Database("/home/bwllc/Documents/Files for ML/compact")
traindb, testdb = db.train_test_split()
train_features, train_labels = traindb.values(window, encoding)
test_features, test_labels = testdb.values(window, encoding)

# Create the model.
tf.logging.set_verbosity(tf.logging.INFO)
LEARNING_RATE = 0.01
topology = (60, 40, 20)
model_params = {"learning_rate": LEARNING_RATE}
model_fn = my_model_function(topology, model_params)
model = tf.estimator.Estimator(model_fn, model_params)
print("\nmodel_dir? No? Why not? ", model.model_dir, "\n")  # This documents the error

# Input function.
my_input_fn = tf.estimator.inputs.numpy_input_fn({"x": train_features}, train_labels, shuffle=True)

# Train the model.
model.train(input_fn=my_input_fn, steps=20)
OUTPUT
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': {'learning_rate': 0.01}, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0b55279048>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
model_dir? No? Why not? {'learning_rate': 0.01}
INFO:tensorflow:Create CheckpointSaverHook.
Traceback (most recent call last):
File "minimal_estimator_bug_example.py", line 81, in <module>
model.train(input_fn=my_input_fn, steps=20)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 756, in _train_model
scaffold=estimator_spec.scaffold)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 411, in __init__
self._save_path = os.path.join(checkpoint_dir, checkpoint_basename)
File "/usr/lib/python3.6/posixpath.py", line 78, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not dict
------------------
(program exited with code: 1)
Press return to continue
I can see exactly what went wrong: model_dir (which I left as the default) somehow got bound to the value I intended for model_params. How did this happen in my code? I can't see it.
If anyone has advice or suggestions, I would greatly appreciate them. Thanks!
Simply because you're passing your model_params as model_dir when you construct your Estimator.
From the tensorflow documentation:
Estimator __init__ function:
__init__(
    model_fn,
    model_dir=None,
    config=None,
    params=None
)
Notice how the second positional argument is model_dir. If you want to specify only params, you need to pass it as a keyword argument.
model = tf.estimator.Estimator(model_fn, params=model_params)
Or specify all the previous positional arguments:
model = tf.estimator.Estimator(model_fn, None, None, model_params)
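With the keyword form, model_dir keeps its default of None and the Estimator falls back to a temporary directory, which the print line from the question can confirm; a short sketch:

model = tf.estimator.Estimator(model_fn, params=model_params)
print(model.model_dir)  # now a usable directory path, not the params dict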
