How to add ArgMax to the end of an ONNX model?

I have an ONNX model whose output is a 1xN array of probabilities. I want to add ArgMax to the end of the model so that I get the index instead.
I tried doing this with onnx.helper but could not find a good way to do it. I can create an ArgMax node using:
argmax = onnx.helper.make_node(
    'ArgMax',
    inputs=['inp'],
    outputs=['out'],
    axis=0,
    keepdims=0)
but how do I append this node to the end of the graph?

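I usually use onnx.compose.merge_models. If you have an existing model original_model with a named output (output_original_model in this example), you can wrap your node in a small graph, turn that graph into a model, and use compose to combine the two. Note that ArgMax produces int64 indices, so the output of the new graph has to be declared as INT64.
import onnx

original_model = ...  # onnx.ModelProto with an output named "output_original_model"
argmax = onnx.helper.make_node(
    'ArgMax',
    inputs=['inp'],
    outputs=['out'],
    axis=0,  # use the axis that holds the classes (axis=1 for a 1xN output)
    keepdims=0)
graph = onnx.helper.make_graph(
    nodes=[
        argmax,
    ],
    name="argmax",
    inputs=[
        onnx.helper.make_tensor_value_info(
            "inp", onnx.TensorProto.FLOAT, [None, None, input_size]
        )
    ],
    outputs=[
        # ArgMax returns indices, so the element type is INT64, not FLOAT
        onnx.helper.make_tensor_value_info(
            "out", onnx.TensorProto.INT64, [None, out_size]
        )
    ],
)
# merge_models takes two ModelProto objects whose IR and opset versions match
argmax_model = onnx.helper.make_model(
    graph,
    opset_imports=original_model.opset_import,
    ir_version=original_model.ir_version,
)
combined_model = onnx.compose.merge_models(
    original_model, argmax_model, io_map=[("output_original_model", "inp")]
)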
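As a sanity check on the axis attribute, here is a plain-Python sketch of what an ArgMax node with keepdims=0 returns for a 2-D input. It does not use onnx at all; the helper name argmax_2d and the sample values are made up for illustration. For a 1xN probability output, the class index comes from axis=1, while axis=0 compares across the single row and returns N zeros.

```python
def argmax_2d(rows, axis):
    """Index of the maximum along `axis` of a 2-D list (keepdims=0 behaviour)."""
    if axis == 1:
        # One index per row: position of the largest value inside each row.
        return [max(range(len(row)), key=row.__getitem__) for row in rows]
    if axis == 0:
        # One index per column: which row holds the largest value in that column.
        columns = list(zip(*rows))
        return [max(range(len(col)), key=col.__getitem__) for col in columns]
    raise ValueError("axis must be 0 or 1 for a 2-D input")


probabilities = [[0.1, 0.7, 0.2]]  # hypothetical 1xN model output, N = 3

print(argmax_2d(probabilities, axis=1))  # [1]
print(argmax_2d(probabilities, axis=0))  # [0, 0, 0]
```

If the merged model returns a vector of zeros, the axis attribute is the first thing to check.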

Related

After using an Embedding layer in Keras, why do I get: Input to reshape is a tensor with 2 values, but the requested shape has 4 [Op:Reshape]

I created the sample polars data frame below. I want to use Keras's Normalization and Embedding layers to preprocess my data. sum_cost and sum_gmv are my numerical columns, and I normalize each of them with a Normalization layer. category is my categorical column, and I want to use an Embedding layer to get an embedding vector for each category.
import polars as pl
import tensorflow as tf
df = pl.DataFrame(
    {'sum_cost':[1.,4.,7.,3.,2.],
     'category':[311,210,450,311,567],
     'sum_gmv':[-4.,-2.,0.,2.,4.],
    }
)
numeric_col = ['sum_cost','sum_gmv']
categorical_col = ['category']
all_inputs = []
inputs = {}
for col in numeric_col + categorical_col:
    if col in numeric_col:
        inputs[col] = tf.keras.Input(shape=(), name=col,dtype=tf.float32)
        normalizer = tf.keras.layers.Normalization(axis=None)
        normalizer.adapt(df[col].to_numpy())
        all_inputs.append(normalizer(inputs[col])[:,tf.newaxis])
    elif col in categorical_col:
        inputs[col] = tf.keras.Input(shape=(), name=col,dtype=tf.int32)
        embedding = tf.keras.layers.Embedding(
            567 + 1,
            output_dim=2,
            name='cat_embedding')(inputs[col])
        em_model = tf.keras.layers.Reshape((2,))(embedding)
        all_inputs.append(em_model)
outputs = tf.keras.layers.Concatenate(axis=1)(all_inputs)
model=tf.keras.Model([inputs[col] for col in numeric_col+categorical_col],outputs)
When I test my preprocessing model on a single data point with model(dict(df.to_pandas().iloc[1,:])), I receive the error in the title.
On the other hand, when I pass this input:
model({'sum_cost':1.,'sum_gmv':1,'category':np.array([[1.]])})
it works well. I don't understand why I have to provide an array for category but scalars for the numerical columns. In the original dataset they are all scalars, and I don't define a shape for my Input tensors. Why is this happening and how can I solve it?
Thanks!

ValueError: y should be a 1d array, got an array of shape (74216, 2) instead

I am trying to apply a logistic regression model to text.
I vectorized my data with TF-IDF:
vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df['text_column'])
vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())
df.drop('text_column', axis=1, inplace=True)
result = pd.concat([df, vectorizer_df], axis=1)
I split my data:
x = result.drop('target', 1)
y = result['target']
and finally:
x_raw_train, x_raw_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
I build a classifier:
classifier = Pipeline([('clf', LogisticRegression(solver="liblinear"))])
classifier.fit(x_raw_train, y_train)
And I get this error:
ValueError: y should be a 1d array, got an array of shape (74216, 2) instead.
This is strange, because with max_features=1000 it works well, but with max_features=1500 I get the error.
Can someone help me, please?
Basically, the text_column column in df contains at least one occurrence of the word target. This word becomes a column name when you convert the TF-IDF feature matrix to a dataframe with columns=vectorizer.get_feature_names(). When you then concatenate df with vectorizer_df, the final dataframe ends up with two columns named target.
Therefore, result['target'] returns two columns instead of one. This leads to the ValueError because, as the error message says, you need a 1d target array to fit your estimator, whereas your target array has two columns.
You only hit this with the higher max_features value because, with the lower value, the word target does not make the cut for the vocabulary, so the process runs as it should.
Unless you have a reason to vectorize separately, the best solution is to combine all your steps in a pipeline. It's as simple as:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])
# x_train and y_train are dataframes split from the original df, before any vectorizing
pipeline.fit(x_train.text_column, y_train.target)
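The max_features effect described above can be reproduced without scikit-learn. The sketch below is a simplified stand-in, assuming only that the vocabulary keeps the most frequent terms in the corpus; the corpus, the helper names top_terms and colliding_columns, and the whitespace tokenizer are all made up for illustration.

```python
from collections import Counter


def top_terms(texts, max_features):
    """Keep the `max_features` most frequent whitespace-separated tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [term for term, _ in counts.most_common(max_features)]


def colliding_columns(existing_columns, feature_names):
    """Names that would appear twice after concatenating both frames."""
    return sorted(set(existing_columns) & set(feature_names))


# Hypothetical corpus: "target" is a rare word, so it only enters the
# vocabulary once max_features is large enough.
texts = [
    "price drop price alert",
    "price alert today",
    "drop the target price",
]
existing_columns = ["text_column", "target"]

print(colliding_columns(existing_columns, top_terms(texts, 3)))  # []
print(colliding_columns(existing_columns, top_terms(texts, 6)))  # ['target']
```

If you do need the separate dataframe, running a check like this before pd.concat, or giving the feature columns a prefix, should keep the label column unique.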

Saving a model that uses tensorflow.lookup.StaticVocabularyTable in .pb format in Tensorflow 2

I am building a model that accepts as input a 2d array of string tokens, then uses a lookup table to get the assigned indices of the input tokens in the vocabulary. The model then uses those indices to compute an embedded representation of the input tokens by fetching the associated token embeddings and adding them together. The compounded embedding is then compared against another matrix using a nearest-neighbors lookup, and the indices of the top-k most similar entries are returned.
The model is saved in .pb format and is then used in a container running the TensorFlow Serving image for inference.
At the moment I have something that works just fine in TensorFlow 1.15, however I am trying to migrate my code to TensorFlow 2.4 and can't find a way to make it work.
Here is a slightly modified version of the code I am working with at the moment in TensorFlow 1.15:
import numpy as np
import tensorflow as tf
graph = tf.get_default_graph()
session = tf.Session()
tf.global_variables_initializer()
vocabulary = ['one', 'two', 'three', 'four', 'five', 'six']
embedding_dimension = 512
n_tokens = len(vocabulary)
token_embeddings = np.random.random((n_tokens, embedding_dimension))
matrix = np.random.random((100, embedding_dimension))
lookup_table_initializer = tf.lookup.KeyValueTensorInitializer(vocabulary, np.arange(n_tokens))
lookup_table = tf.lookup.StaticVocabularyTable(lookup_table_initializer, num_oov_buckets=1)
# Extra all-zero row for the out-of-vocabulary bucket
token_embeddings_with_oov_token = np.vstack([token_embeddings, np.zeros(embedding_dimension)])
token_embeddings_tensor = tf.convert_to_tensor(token_embeddings_with_oov_token, dtype=tf.float32)
matrix = tf.convert_to_tensor(matrix, dtype=tf.float32)
model_input = tf.placeholder(tf.string, [None, None], name="input")
input_tokens_indices = lookup_table.lookup(model_input)
input_token_indices_one_hot = tf.one_hot(input_tokens_indices, tf.dtypes.cast(lookup_table.size(), dtype=tf.int32))
encoded_text = tf.math.reduce_sum(input_token_indices_one_hot, axis=1, keepdims=True)
embedded_text = tf.linalg.matmul(encoded_text, token_embeddings_tensor)
embedded_text_pooled = tf.math.reduce_sum(embedded_text, axis=1)
embedded_text_normed = tf.divide(embedded_text_pooled, tf.norm(embedded_text_pooled, ord=2))
# matrix holds the entries the pooled embedding is compared against
neighbors = tf.linalg.matmul(embedded_text_normed, matrix, transpose_b=True, name="output")
tf.saved_model.simple_save(
    session,
    "model.pb",
    inputs={"input": model_input},
    outputs={"output": neighbors},
    legacy_init_op=tf.tables_initializer(),
)
The issue that I am facing is when converting the above code to TensorFlow 2. First of all, the tf.placeholder is no more and I have read on other posts suggestions to replace that with tf.keras.layers.Input((), dtype=tf.dtypes.string), however then I get an error when I try to carry out the lookup_table.lookup() step, as apparently I cannot pass a symbolic tensor to that function. As a result I am stuck and do not know which way to proceed to make my model compatible with tf2 and after hours searching online for solutions I can't seem to find something that works.

Keras: Get True labels (y_test) from ImageDataGenerator or predict_generator

I am using ImageDataGenerator().flow_from_directory(...) to generate batches of data from directories.
After the model builds successfully I'd like to get a two column array of True and Predicted class labels. With model.predict_generator(validation_generator, steps=NUM_STEPS) I can get a numpy array of predicted classes. Is it possible to have the predict_generator output the corresponding True class labels?
To add: validation_generator.classes does print the True labels but in the order that they are retrieved from the directory, it doesn't take into account the batching or sample expansion by augmentation.
You can get the prediction labels by:
y_pred = numpy.rint(predictions)
and you can get the true labels by:
y_true = validation_generator.classes
You should set shuffle=False in the validation generator before this.
Finally, you can print the confusion matrix with:
print(confusion_matrix(y_true, y_pred))
There's another, slightly "hackier" way of retrieving the true labels. This approach also works when you set shuffle=True in your generator (shuffling your data is generally a good idea, whether you do it manually where the data is stored or through the generator, which is probably easier). You will need your model for this approach, though.
# Create lists for storing the predictions and labels
predictions = []
labels = []
# Get the total number of labels in generator
# (i.e. the length of the dataset where the generator generates batches from)
n = len(generator.labels)
# Loop over the generator
for data, label in generator:
    # Make predictions on data using the model. Store the results.
    predictions.extend(model.predict(data).flatten())
    # Store corresponding labels
    labels.extend(label)
    # We have to break out from the generator when we've processed
    # the entire dataset once (otherwise we would end up with duplicates).
    # Checking only the count also covers datasets whose size is an exact
    # multiple of the batch size, where the last batch is not shorter.
    if len(predictions) >= n:
        break
Your predictions and corresponding labels should now be stored in predictions and labels, respectively.
Lastly, remember that we should NOT add data augmentation on the validation and test sets/generators.
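The stopping rule matters because the generator never ends on its own. Here is a minimal sketch of the same loop without Keras, using a hypothetical stand-in generator (endless_batches) and a toy threshold function in place of model.predict. It stops as soon as the number of collected predictions reaches n, which works for both a short last batch and a dataset that divides evenly into batches.

```python
import itertools


def endless_batches(samples, labels, batch_size):
    """Hypothetical stand-in for a Keras generator: yields batches forever."""
    n = len(samples)
    for start in itertools.cycle(range(0, n, batch_size)):
        end = start + batch_size
        yield samples[start:end], labels[start:end]


def collect_one_epoch(generator, n, predict):
    """Pair predictions with labels for exactly one pass over n samples."""
    predictions, labels = [], []
    for data, label in generator:
        predictions.extend(predict(data))
        labels.extend(label)
        if len(predictions) >= n:  # one full pass done, stop before repeats
            break
    return predictions, labels


def threshold_model(batch):
    """Toy replacement for model.predict: class 1 when the value exceeds 0.5."""
    return [int(x > 0.5) for x in batch]


samples = [0.2, 0.9, 0.4, 0.7, 0.1]
true_labels = [0, 1, 0, 1, 0]

preds, labs = collect_one_epoch(
    endless_batches(samples, true_labels, batch_size=2), len(samples), threshold_model
)
print(preds)  # [0, 1, 0, 1, 0]
print(labs)   # [0, 1, 0, 1, 0]
```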
Using np.rint() gives a one-hot encoded result like [1., 0., 0.], which caused an error when I tried to create a confusion matrix with confusion_matrix(y_true, y_pred), because validation_generator.classes returns each class label as a single number.
To get a class number such as 0, 1 or 2 as the class label, I found the selected answer in this topic useful.
You should try this to turn the class probabilities into a single class based on the highest score:
if Y_preds.ndim != 1:
    Y_preds = np.argmax(Y_preds, axis=1)
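To see the difference between rounding and taking the highest score, here is a plain-Python sketch with made-up softmax outputs. The helper names round_rows and to_class_indices are hypothetical; round_rows plays the role of np.rint and to_class_indices the role of np.argmax(..., axis=1).

```python
def round_rows(probability_rows):
    """Element-wise rounding, the plain-Python analogue of np.rint."""
    return [[round(p) for p in row] for row in probability_rows]


def to_class_indices(probability_rows):
    """Index of the highest score in each row (argmax over axis 1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probability_rows]


# Hypothetical softmax outputs for three samples and three classes.
y_prob = [
    [0.8, 0.1, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
]

print(round_rows(y_prob))        # [[1, 0, 0], [0, 0, 0], [0, 0, 1]]
print(to_class_indices(y_prob))  # [0, 1, 2]
```

Rounding keeps one value per class and can even produce a row with no class at all (the second sample), whereas the argmax always yields one label per sample, which is the shape validation_generator.classes has.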

python feature selection in pipeline: how determine feature names?

I used a pipeline and grid_search to select the best parameters, and then used these parameters to fit the best pipeline ('best_pipe'). However, since the feature selection step (SelectKBest) is inside the pipeline, I never fit SelectKBest directly.
I need to know the feature names of the 'k' selected features. Any ideas on how to retrieve them? Thank you in advance.
from sklearn import feature_selection, linear_model, pipeline, preprocessing
from sklearn.model_selection import GridSearchCV, StratifiedKFold

folds = 5
split = StratifiedKFold(n_splits=folds, shuffle=False)
scores = []
for k, (train, test) in enumerate(split.split(X, target)):
    X_train, X_test, y_train, y_test = X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]
    top_feat = feature_selection.SelectKBest()
    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                              ('feat', top_feat),
                              # liblinear supports both the l1 and l2 penalties
                              ('clf', linear_model.LogisticRegression(solver='liblinear'))])
    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']
    param_grid = [{'feat__k': K,
                   'clf__C': C,
                   'clf__penalty': penalty}]
    scoring = 'precision'
    gs = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring)
    gs.fit(X_train, y_train)
    best_score = gs.best_score_
    scores.append(best_score)
    print("Fold: {} {} {:.4f}".format(k+1, scoring, best_score))
    print(gs.best_params_)
best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                               ('feat', feature_selection.SelectKBest(k=80)),
                               ('clf', linear_model.LogisticRegression(C=.0001, penalty='l2'))])
best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)
You can access the feature selector by name in best_pipe:
features = best_pipe.named_steps['feat']
Then you can call transform() on an index array to get the names of the selected columns (transform() expects a 2D array, so the indices are passed as a single row):
X.columns[features.transform(np.arange(len(X.columns)).reshape(1, -1))[0]]
The output here will be the eighty column names selected in the pipeline.
Jake's answer totally works. But depending on what feature selector you're using, there's another option that I think is cleaner. This one worked for me:
X.columns[features.get_support()]
It gave me the same result as Jake's answer. You can read more about it in the docs: get_support returns an array of True/False values indicating whether each column was selected. It's also worth noting that X must have the same columns as the training data used to fit the feature selector.
This could be an instructive alternative: I had a need similar to the OP's. If you want to get the indices of the k best features directly from GridSearchCV:
finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)
Then, via index manipulation, you can get your finalFeatureList:
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
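Both lookups come down to the same thing: mapping the selector's output back onto the list of column names. Here is a small sketch without scikit-learn, where the names, the mask and the indices are hypothetical stand-ins for X.columns, get_support() and get_support(indices=True).

```python
from itertools import compress

# Hypothetical column names and selector outputs; with scikit-learn these
# would come from X.columns, get_support() and get_support(indices=True).
feature_names = ["age", "income", "height", "weight", "score"]
support_mask = [True, False, True, False, True]
support_indices = [0, 2, 4]

# Boolean mask: keep the names whose mask entry is True.
selected_by_mask = list(compress(feature_names, support_mask))

# Integer indices: look each selected position up in the name list.
selected_by_index = [feature_names[i] for i in support_indices]

print(selected_by_mask)   # ['age', 'height', 'score']
print(selected_by_index)  # ['age', 'height', 'score']
```

Either form works only if the name list is in the same column order as the data the selector was fitted on.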
