Simple prediction from frozen .pb saved model - python-3.x

I have been trying for days to use a TensorFlow model exported as a .pb file for prediction. The model was generated with BestExporter as follows:
features_specs = tf.feature_column.make_parse_example_spec(serving_features)
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec=features_specs,default_batch_size=None)
exporter[n] = tf.estimator.BestExporter(name="best_exporter", serving_input_receiver_fn=serving_input_receiver_fn,event_file_pattern='eval/*.tfevents.*',exports_to_keep=1)
if train_params["use_early_stop"] == True:
    hookModel[n] = tf.estimator.experimental.stop_if_no_decrease_hook(model[n], metric_name='average_loss', max_steps_without_decrease=train_params["early_stop_max_steps_without_decrease"], min_steps=train_params["early_stop_min_steps"], run_every_secs=train_params["early_stop_run_every_secs"], run_every_steps=train_params["early_stop_run_every_steps"])
else:
    hookModel[n] = None
train_spec[n] = tf.estimator.TrainSpec(input_fn=input_fn_["train"+m],hooks=[hookModel[n]])
eval_spec[n] = tf.estimator.EvalSpec(input_fn=input_fn_["test"+m],start_delay_secs = train_params["eval_specs_start_delay_secs"],throttle_secs = train_params["eval_specs_throttle_secs"],exporters=[exporter[n]])
tf.estimator.train_and_evaluate(model[n], train_spec[n], eval_spec[n])
I think the input dict names are referenced this way...
I can load the model successfully with:
model_[model_stage+"_"+model_type] = tf.saved_model.load(model_path)
but I don't know how to correctly pass my features dictionary to the wrapped function model_XX['prediction'](example).
I saw this topic, but it didn't help: TensorFlow v2: Replacement for tf.contrib.predictor.from_saved_model
There seems to be no equivalent of the old tf.contrib.predictor.from_saved_model that I used before...
Thanks for any answers.

I found a solution for passing a dict to the wrapped model. It is a slightly modified synthesis of the solutions given in the two links below, adapted for TF2-4/Python 3.7:
TensorFlow v2: Replacement for tf.contrib.predictor.from_saved_model
https://www.programcreek.com/python/example/90440/tensorflow.Example
The second is particularly complete and covers a lot of cases.
So:
my_dict = {"feature_1" : str(something), "feature_2" : int(an_int), "feature_3" : float(a_float), ...}
# Load the model
my_model = tf.saved_model.load(model_path)
# Creates a serialized example from dict
def create_serialized_example(name_to_values):
    example = tf.train.Example()
    for name, values in name_to_values.items():
        feature = example.features.feature[name]
        if isinstance(values, str):
            values = values.encode() # Modified because in new tf versions strings have to be encoded
            add = feature.bytes_list.value.extend
        elif isinstance(values, float):
            add = feature.float_list.value.extend # Modified : float_list instead of float_32 in TF 2
        elif isinstance(values, int):
            add = feature.int64_list.value.extend
        else:
            # values is a scalar here, so report its own type
            raise AssertionError('Unsupported type: %s' % type(values))
        add([values]) # Modified : have to be a list, not variable
    return example.SerializeToString()
# Predict function
pred = my_model.signatures["predict"](examples=tf.constant([create_serialized_example(my_dict)]))
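The function above accepts only scalar values. If some features hold several values, the type dispatch could be factored out roughly as below. This is only a sketch in plain Python with no TensorFlow dependency: the helper name normalize_feature is hypothetical, and it assumes every value in a list has the same type as the first one.

```python
def normalize_feature(value):
    """Return (kind, values), where kind names the tf.train.Feature list to fill."""
    # Treat a scalar as a one-element list so both shapes share one code path.
    values = list(value) if isinstance(value, (list, tuple)) else [value]
    if not values:
        raise ValueError("empty feature value")
    first = values[0]
    if isinstance(first, (str, bytes)):
        # Strings have to be encoded to bytes before going into bytes_list.
        return "bytes_list", [v.encode() if isinstance(v, str) else v for v in values]
    if isinstance(first, bool):
        # bool is a subclass of int in Python; store it explicitly as 0/1.
        return "int64_list", [int(v) for v in values]
    if isinstance(first, float):
        return "float_list", [float(v) for v in values]
    if isinstance(first, int):
        return "int64_list", [int(v) for v in values]
    raise TypeError("Unsupported type: %s" % type(first))


print(normalize_feature("abc"))
print(normalize_feature([1, 2, 3]))
print(normalize_feature(0.5))
```

Inside create_serialized_example, the result could then be applied with getattr(feature, kind).value.extend(values), since bytes_list, float_list and int64_list are the three list fields of tf.train.Feature.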

Related

Token indices sequence length is longer than the specified maximum sequence length for this model (28627 > 512)

I am using Hugging Face's DistilBERT model as the backend for a question-answering application. The text I am using to train the model is one very large single text field. Even though the text field is a single string, the punctuation was left in place as a clue for BERT. When I execute the application I get the "Token indices sequence length" error. I am using the tokenizer.encode_plus() method to pass the text into the model. I have tried various mechanisms to truncate the input IDs to a length <= 512.
I am currently using Windows 10 but I will also be porting the code to a Raspberry Pi 4 platform.
The code is failing at this line:
start_scores, end_scores = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))
I am attempting to perform the truncation at this line:
encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True).input_ids)
The entire code is here:
from transformers import AutoTokenizer, DistilBertTokenizer, DistilBertForQuestionAnswering
import torch
# globals - set once used everywhere
tokenizer = None
model = None
context = ''
def establishSettings():
    global tokenizer, model, context
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', return_token_type_ids=True, model_max_length=512)
    model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad', return_dict=False)
    # context = "Some 1,500 volcanoes are still considered potentially active around the world today 161 of those over 10 percent sit within the boundaries of the United States."
    # get the volcano corpus
    with open('volcanic.corpus', encoding="utf8") as file:
        context = file.read().replace('\n', '')
    print(len(tokenizer(context, truncation=True).input_ids))
def askQuestion(question):
    global tokenizer, model, context
    print("\nQuestion ", question)
    encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True).input_ids)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
    start_scores, end_scores = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))
    ans_tokens = input_ids[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens, skip_special_tokens=True)
    #all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return answer_tokens
def main():
    # set the global items once
    establishSettings()
    # ask a question
    question = "How many potentially active volcanoes are there in the world today?"
    answer_tokens = askQuestion(question)
    print("answer_tokens: ", answer_tokens)
    if len(answer_tokens) == 0:
        answer = "Sorry, I don't have an answer for that one. Ask me another question about New Mexico volcanoes."
        print(answer)
    else:
        answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)
        print("\nFinal Answer : ")
        print(answer_tokens_to_string)
if __name__ == '__main__':
    main()
What is the best way to truncate the input_ids to a length <= 512?
Edit this line:
encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True).input_ids)
to
encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True, max_length=512).input_ids)
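One thing to watch with this approach: the context is truncated to 512 tokens on its own, so the combined input may still exceed the limit once the question and the special tokens are added. The sketch below shows the budget arithmetic in plain Python. The function name truncate_pair is hypothetical, and the count of 3 special tokens assumes the BERT-style pair layout [CLS] question [SEP] context [SEP].

```python
def truncate_pair(question_ids, context_ids, max_length=512, num_special_tokens=3):
    """Keep the question whole and trim the context so the pair fits max_length."""
    # Room left for the context after the question and the special tokens.
    budget = max_length - num_special_tokens - len(question_ids)
    if budget <= 0:
        raise ValueError("question alone exceeds the model limit")
    return question_ids, context_ids[:budget]


# Dummy token ids: a 12-token question and a 28627-token context.
question_ids = list(range(12))
context_ids = list(range(28627))

q, c = truncate_pair(question_ids, context_ids)
total = len(q) + len(c) + 3
print(total)  # 512
```

With the Hugging Face tokenizer itself, the same effect can be had in a single call by passing the question and the raw context string as a pair with truncation="only_second" and max_length=512, which trims only the second sequence.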

Training roberta model on imdb movie reviews dataset giving this error?

def convert_data_to_examples(train, test, review, sentiment):
    train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                             text_a = x[review],
                                                             label = x[sentiment]), axis = 1)
    validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                                 text_a = x[review],
                                                                 label = x[sentiment]), axis = 1)
    return train_InputExamples, validation_InputExamples
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, 'review', 'sentiment')
NameError: name 'InputExample' is not defined
Running this part of the code gives the error above. Please tell me how to solve it.
InputExample has to be imported before it is used:
from transformers import InputExample

AllenNLP DatasetReader.read returns generator instead of AllennlpDataset

While studying the AllenNLP framework (version 2.0.1), I tried to implement the example code from https://guide.allennlp.org/training-and-prediction#1.
While reading the data from a Parquet file I got:
TypeError: unsupported operand type(s) for +: 'generator' and 'generator'
for the following line:
vocab = build_vocab(train_data + dev_data)
I suspect the return value should be AllennlpDataset but maybe I got it mixed up.
What did I do wrong?
Full code:
train_path = <some_path>
test_path = <some_other_path>
class ClassificationJobReader(DatasetReader):
    def __init__(self,
                 lazy: bool = False,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_tokens: int = None):
        super().__init__(lazy)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.max_tokens = max_tokens
    def _read(self, file_path: str) -> Iterable[Instance]:
        # read from the path passed in (the original referenced an undefined data_path)
        df = pd.read_parquet(file_path)
        for idx in df.index:
            # index the DataFrame directly (the original referenced an undefined row)
            text = df['title'][idx] + ' ' + df['description'][idx]
            print(f'text : {text}')
            label = df['class_id'][idx]
            print(f'label : {label}')
            tokens = self.tokenizer.tokenize(text)
            if self.max_tokens:
                tokens = tokens[:self.max_tokens]
            text_field = TextField(tokens, self.token_indexers)
            label_field = LabelField(label)
            fields = {'text': text_field, 'label': label_field}
            yield Instance(fields)
def build_dataset_reader() -> DatasetReader:
    return ClassificationJobReader()
def read_data(reader: DatasetReader) -> Tuple[Iterable[Instance], Iterable[Instance]]:
    print("Reading data")
    training_data = reader.read(train_path)
    validation_data = reader.read(test_path)
    return training_data, validation_data
def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)
dataset_reader = build_dataset_reader()
train_data, dev_data = read_data(dataset_reader)
vocab = build_vocab(train_data + dev_data)
Thanks for your help
Please find below the code fix first and the explanation afterwards.
Code Fix
# previously: vocab = build_vocab(train_data + dev_data)
# extend_from_instances expands the vocabulary with the instances passed as an argument,
# so the calls below are equivalent to Vocabulary.from_instances(train_data + dev_data)
vocab = Vocabulary()
vocab.extend_from_instances(train_data)
vocab.extend_from_instances(dev_data)
Explanation
This is because the AllenNLP API had a couple of breaking changes in allennlp==2.0.1. You can find the changelog here and the upgrade guide here. As far as I understand, the guide is outdated (it reflects allennlp<=1.4).
DatasetReader.read now returns a generator rather than a list. DatasetReader used to have a parameter called "lazy" for lazily loading data. It was False by default, so dataset_reader.read used to return a list. However, as of v2.0 (if I remember correctly), lazy loading is applied by default, so read returns a generator. The "+" operator is not defined for generator objects, so you cannot simply add two generators.
So you can use vocab.extend_from_instances to get the same behavior as before. I hope this helps. If you need a full code snippet, please leave a comment below and I can post a related gist.
Good day!
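The generator behavior described above can be reproduced without AllenNLP. The sketch below uses a stand-in read function (hypothetical, not the AllenNLP reader) to show the TypeError, and itertools.chain from the standard library as another way to combine two generators. chain is not part of the answer above, just an alternative worth knowing.

```python
from itertools import chain


def read(n):
    """Stand-in for a lazy reader: yields items one at a time."""
    for i in range(n):
        yield i


train_data = read(3)
dev_data = read(2)

# Generators do not support "+", which is the error from the question.
try:
    train_data + dev_data
    message = ""
except TypeError as err:
    message = str(err)
print(message)

# chain walks the first generator, then the second, without building a list first.
combined = list(chain(train_data, dev_data))
print(combined)  # [0, 1, 2, 0, 1]
```

Keep in mind that a generator can be consumed only once. After the call to list above, train_data and dev_data are exhausted, so the data would have to be read again (or kept in a list) before it can be iterated a second time.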

Why can't I split files when generating some TFrecord files?

I am working on predicting protein structures. As you may know, one protein molecule can have several strands, so I need to split the list of atoms into different TFRecord files by strand name.
The problem is that this code ends up generating several TFRecord files with nothing written in them. All blank.
Alternatively, is there a way to split the strands while training my model? Then I could sidestep this problem and put the strand name in the TFRecords as a feature.
'''
with all module imported and no errors raised
'''
def generate_TFrecord(intPosition, endPosition, path):
    CrtS = x #x is the name of the current strand
    path = path + CrtS
    writer = tf.io.TFRecordWriter('%s.tfrecord' %path)
    for i in range(intPosition, endPosition):
        if identifyCoreCarbon(i):
            vectros = getVectors(i)
            features = {}
            '''
            feeding this dict
            '''
            tf_features = tf.train.Features(feature = features)
            tf_example = tf.train.Example(features = tf_features)
            tf_serialized = tf_example.SerializeToString()
            writer.write(tf_serialized)
        '''
        if checkStrand(i) == False:
            writer.write(tf_serialized)
            intPosition = i
        '''
    writer.close()
'''
strand_index is a list of the start points of each strand
'''
for loop in strand_index:
    generate_TFrecord(loop, endPosition, path)
'''
________division___________
The code below works, but only generates a single tfrecord containing all the atom information.
writer = tf.io.TFRecordWriter('%s.tfrecord' %path)
for i in range(0, endPosition):
    if identifyCoreCarbon(i):
        vectros = getVectors(i)
        features = {}
        '''
        feeding features
        '''
        tf_features = tf.train.Features(feature = features)
        tf_example = tf.train.Example(features = tf_features)
        tf_serialized = tf_example.SerializeToString()
        writer.write(tf_serialized)
writer.close()
'''
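One detail that may be worth checking in the loop above: every call receives the same endPosition, so each file would cover everything from its strand's start to the end of the molecule rather than a single strand. The sketch below derives a per-strand (start, end) range from the list of start positions. The helper name strand_ranges is hypothetical, it assumes strand_index holds integer start positions, and it does not by itself explain why the files come out empty.

```python
def strand_ranges(strand_index, end_position):
    """Pair each strand start with the start of the next strand (or the overall end)."""
    starts = sorted(strand_index)
    # The end of one strand is the start of the next; the last one runs to end_position.
    ends = starts[1:] + [end_position]
    return list(zip(starts, ends))


# Three strands starting at atoms 0, 5 and 9 in a molecule of 12 atoms.
ranges = strand_ranges([0, 5, 9], 12)
print(ranges)  # [(0, 5), (5, 9), (9, 12)]
```

Each pair could then be passed on as generate_TFrecord(start, end, path), so that every writer only sees the atoms of its own strand.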

AttributeError: 'DType' object has no attribute 'type' Tensorflow Serving

I am trying to use a function (from another module) inside TensorFlow. The function accepts a NumPy array and returns the changepoints. My main goal is to deploy this model on TensorFlow Serving. I am running into this error:
AttributeError: 'DType' object has no attribute 'type'
There are two functions: create_data(), which creates a NumPy array and returns it, and change(), which accepts a NumPy array and uses the aforementioned function to return changepoints. I have created a placeholder to accept the input data and an operation to execute the function. The problem is that if I try to send data through the placeholder, I run into the error. If I pass the data directly to the function, it runs. My code follows.
def create_data():
    np.random.seed(0)
    size = 100
    mean_a = 0.0
    mean_b = 10.0
    mean_c = 0
    var = 0.1
    data_a = np.random.normal(mean_a, var, size)
    data_b = np.random.normal(mean_b, var, size)
    data_c = np.random.normal(mean_c, var, size)
    data = np.concatenate([data_a, data_b, data_c])
    return data
def change(data):
    # what else i tried
    # data = np.array(data, dtype=np.float)
    # above line gives another error mentioned after code
    cpts = (pelt(normal_mean(x, np.var(x)), len(x)))
    return cpts
sess = tf.Session()
x = tf.placeholder(tf.float32, shape=[300, ], name="myInput")
y = tf.convert_to_tensor(change(x),np.float32,name="myOutput")
z = sess.run(y,feed_dict={x:create_data()})
If I try data = np.array(data, dtype=np.float) in the function change(), I get the error
ValueError: setting an array element with a sequence.
I also tried data = np.hstack((data)).astype(np.float) and data = np.vstack((data)).astype(np.float), but that runs into a separate error that says to use tf.map_fn. I also tried tf.eval() to convert the numbers, but I couldn't get it to run inside a function with placeholders.
But if I pass in the output directly,
y = tf.convert_to_tensor(change(create_data()),np.float32,name="myOutput")
it works.
How should I send in the input to make it work?
EDIT: The function in question is this, if anyone wants to know.
This error is raised when you try to pass a Tensor into a NumPy function.
You need to use tf.py_func to include a Python function in the TensorFlow graph.
(Also, your change() function takes data as its argument but uses x in its body.)
Here is the code that worked for me:
import numpy as np
import tensorflow as tf
from changepy import pelt
from changepy.costs import normal_mean
def create_data():
    np.random.seed(0)
    size = 100
    mean_a = 0.0
    mean_b = 10.0
    mean_c = 0
    var = 0.1
    data_a = np.random.normal(mean_a, var, size)
    data_b = np.random.normal(mean_b, var, size)
    data_c = np.random.normal(mean_c, var, size)
    data = np.concatenate([data_a, data_b, data_c])
    return data
def change(x):
    # what else i tried
    # data = np.array(data, dtype=np.float)
    # above line gives another error mentioned after code
    cpts = (pelt(normal_mean(x, np.var(x)), len(x)))
    return cpts
sess = tf.Session()
x = tf.placeholder(tf.float32, shape=[300, ], name="myInput")
y = tf.convert_to_tensor(tf.compat.v1.py_func(change, [x], 3*[tf.int64]),np.float32,name="myOutput")
z = sess.run(y,feed_dict={x:create_data()})
print(z)
