I have 3 working Keras models, and I saved each of them as two files: structure.json (the model structure) and weight.h5 (the weights).
I've built a Flask app to load these models, but currently I can't pass threaded=True to run() like this:
api.run(threaded=True)
So I am only able to use:
api.run()
In this Flask app I have two different API endpoints (one GET and one POST). Because it runs on a single thread, it is too slow: my system receives over 100 connections per second, and for each connection I have to load a different model.
Note that all of my Keras models share the same structure, so I only need to load the structure once; whenever a new connection comes in, I just load that connection's weights into the structure.
My API code looks like this:
# Compile the shared model structure
json_file = open(formal_model_path, 'r')
loaded_model_json = json_file.read()
json_file.close()

model = model_from_json(loaded_model_json)
model.compile(loss='mean_absolute_error', optimizer='adam')

@api.route("/predict", methods=["POST"])
def predict():
    try:
        model.load_weights(os.path.join(cointainer_folder, 'weight.h5'))
    except Exception:
        return jsonify(
            error_code='model_file_reading_failed'
        )
My code raises errors at the model.load_weights(...) line when I enable threaded=True.
Is there a way to build a multithreaded API that loads many different Keras models?
I think you are running into two separate problems:
1. You are loading model weights on every request. That's a bad idea and will make every request very slow.
2. Flask uses multiple threads. A TensorFlow model loaded in one thread must be used in that same thread (more precisely, with the graph it was loaded into).
The right place to load the models is an init method. You also need to use tf.get_default_graph() to make sure you load the models and run predictions against the same graph.
Here's what your code might look like:
import os
import tensorflow as tf
from keras.models import model_from_json

def init():
    global models
    models = {}
    # model_paths: list of paths to your saved structure.json files
    for idx, model_path in enumerate(model_paths):
        with open(model_path, "r") as fp:
            model = model_from_json(fp.read())
        model.compile(loss='mean_absolute_error', optimizer='adam')
        model.load_weights(os.path.join(model_path, "weights.h5"))
        models[idx] = model

    # save the default graph in a global var so request threads can reuse it
    global graph
    graph = tf.get_default_graph()
And inside your request handler:
@api.route("/predict", methods=["POST"])
def predict():
    # select your model based on something inside the request
    # (func here is a made-up placeholder)
    model = models[func(request)]
    with graph.as_default():
        model.predict(...)
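For completeness, a minimal sketch of how this might be wired up at startup (assuming api is your Flask app and init() is the function above):

if __name__ == "__main__":
    init()                   # build the `models` and `graph` globals once, in the main thread
    api.run(threaded=True)   # request threads then reuse the preloaded models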
I'm using the pytorch-forecasting library (which is based on pytorch-lightning) to run a TFT model for time series forecasting. My training routine is split into three phases: first I perform HPO using Optuna, then training+validation, and finally retraining with the full data (no validation).
Currently, both training+validation and retraining start from fresh models trained from scratch, so the runtime is quite high. I'm trying to reduce the runtime of the whole routine by leveraging incremental training, where I load the checkpointed model from phase 2 and retrain it for a few more epochs in phase 3.
I have a method fit_model() which is used in both training/validation and retraining, but with different args. The core part of fit_model() looks something like this:
def fit_model(self, **kwargs):
    ...
    to_retrain = kwargs.get('to_retrain', False)
    ckpt_path = kwargs.get('ckpt_path', None)

    trainer = self._get_trainer(cluster_id, gpu_id, to_retrain)  # returns a pl.Trainer object
    tft_lightning_module = self._prepare_for_training(cluster_id, to_retrain)

    train_dtloaders = ...
    val_dtloaders = ...

    if not to_retrain:
        trainer.fit(
            tft_lightning_module,
            train_dataloaders=train_dtloaders,
            val_dataloaders=val_dtloaders
        )
    else:
        trainer.fit(
            tft_lightning_module,
            train_dataloaders=train_dtloaders,
            val_dataloaders=val_dtloaders,
            ckpt_path=ckpt_path
        )

    best_model_path = trainer.checkpoint_callback.best_model_path
    return best_model_path
When I call the above method in the retraining phase, I can see the log line saying that the checkpointed model has been restored:
Restored all states from the checkpoint file at /tft/incremental_training/tft_training_20230206/171049/lightning_logs_3/lightning_logs/version_0/checkpoints/epoch=4-step=5.ckpt
But unfortunately, no further training happens in phase 3. If I look at the best_model_path returned by the method, it points to the old checkpoint from the train/validation phase, not to one from the retraining phase. How can I resolve this issue?
I'm using the following libraries:
pytorch-lightning==1.6.5
pytorch-forecasting==0.9.0
I finally got it working. The key thing to keep in mind is not to use the same number of epochs for both training and retraining. If we train for x epochs and intend to retrain for y more epochs, then max_epochs has to be set to x + y, not y, in the retraining phase.
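For illustration, a rough sketch of that idea (module, train_dl, val_dl, and full_train_dl are made-up stand-ins for your LightningModule and dataloaders):

import pytorch_lightning as pl

x = 5  # epochs used in the training+validation phase
y = 3  # additional epochs wanted in the retraining phase

# Phase 2: training + validation
trainer = pl.Trainer(max_epochs=x)
trainer.fit(module, train_dataloaders=train_dl, val_dataloaders=val_dl)
ckpt_path = trainer.checkpoint_callback.best_model_path

# Phase 3: retraining resumed from the checkpoint.
# The restored checkpoint already reports epoch x, so a Trainer whose
# max_epochs is not larger than x considers training finished immediately;
# max_epochs therefore has to be x + y here.
retrainer = pl.Trainer(max_epochs=x + y)
retrainer.fit(module, train_dataloaders=full_train_dl, ckpt_path=ckpt_path)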
I currently load data with torch.load() because it is saved as a pickle. Pickle can only load everything into memory at once. The data has shape [2000, 3, 32, 32].
Can I write a DataLoader that loads the data incrementally? I have limited CPU memory, and loading everything at once would be too much.
Here is an example:
import torch
from torch.utils.data import DataLoader

data = torch.load('clean_data.pkl')
test_loader = DataLoader(data, batch_size=32, shuffle=True)

result = []
for img, label in test_loader:
    # do something
    result.append(img.cuda())
torch.save(result, 'result.pt')  # output path just for illustration
Well, when I write a data loader, I still need to use torch.load. As I understand it, the data loader would also open the pickle file all at once, right? So I get no memory advantage.
What can I do to load just one file/batch after another, instead of the whole pickle at once?
I have found similar threads here:
https://discuss.pytorch.org/t/loading-pickle-files-with-pytorch-dataloader/129405
https://localcoder.org/how-to-load-pickle-file-in-chunks
How does one create a dataset in PyTorch and save it into a file to be used later?
I am grateful for any help. Thanks.
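One way to do this (a sketch, assuming clean_data.pkl holds an iterable of (img, label) pairs, which may not match your exact layout) is to split the pickle once into per-sample files and then load them lazily from __getitem__. The one-time split itself still needs enough memory to open the pickle once:

import os
import torch
from torch.utils.data import Dataset, DataLoader

# One-time conversion: split the monolithic pickle into per-sample files.
data = torch.load('clean_data.pkl')
os.makedirs('samples', exist_ok=True)
for i, (img, label) in enumerate(data):
    torch.save((img, label), f'samples/{i}.pt')

class LazyDataset(Dataset):
    """Loads one (img, label) pair per __getitem__ call instead of the whole pickle."""
    def __init__(self, folder: str):
        self.files = sorted(
            os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.pt')
        )

    def __getitem__(self, index: int):
        return torch.load(self.files[index])

    def __len__(self) -> int:
        return len(self.files)

test_loader = DataLoader(LazyDataset('samples'), batch_size=32, shuffle=True)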
I am experimenting with the TensorFlow Model Optimization library, trying to reduce the size of a SavedModel running in a production cluster, with the goal of cutting operating costs while keeping as much performance as possible.
A few things I've read suggested I should try pruning the model's weights.
I've tried it, and so far I have gotten very mixed results.
Here is the code for the model I am trying to prune.
import tensorflow_model_optimization as tfmot
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n = 300000  # input vector dimension; it's not exactly 300k but it's close
code_dimension = 512

inputs = Input(shape=(n,))
outputs = Dense(code_dimension, activation='relu')(inputs)
outputs = Dense(code_dimension, activation='relu')(outputs)
outputs = Dense(n, activation='softmax')(outputs)
model = Model(inputs, outputs)

model.compile("adam", "cosine_similarity")
model.fit(training_data_generator, epochs=10, validation_data=validation_data_generator)
model.save("base_model.pb")

# model pruning starts here
pruning_schedule = tfmot.sparsity.keras.ConstantSparsity(
    target_sparsity=0.95, begin_step=0, end_step=-1, frequency=100
)
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]

model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)
model_for_pruning.compile(optimizer="adam", loss="cosine_similarity")
model_for_pruning.fit(training_data_generator, validation_data=validation_data_generator, epochs=2, callbacks=callbacks)

print(f"Mean model sparsity post-pruning: {mean_model_sparsity(model_for_pruning): .4f}")

# strip pruning so we don't carry around the extra pruning wrapper variables
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
model_for_export.save("pruned_model.pb")
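For reference, mean_model_sparsity is not part of tfmot; a rough sketch of what such a helper might look like (assuming a TF 2.x eager model):

import numpy as np

def mean_model_sparsity(model):
    # Hypothetical helper: fraction of exactly-zero values, averaged over
    # all trainable weight tensors in the model.
    sparsities = [np.mean(w.numpy() == 0.0) for w in model.trainable_weights]
    return float(np.mean(sparsities))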
Here's the problem: when I set code_dimension to 32 or 64, after pruning and saving, the pruned_model.pb file is about 2-3 times smaller than the base_model.pb file. However, when I use a code_dimension of 256 or 512, my pruned model is actually bigger than the base model.
I run this from a script, and each time I run it I do a full reset of my environment.
Has anyone who has used the TensorFlow Model Optimization library experienced this?
My dataset depends on a 3 GB tensor, which could live either on the CPU or the GPU. The bottleneck of my code is the data loading/preprocessing, but I can't add more than a few workers without exhausting my RAM.
This seems silly to me: why does each worker receive its own copy of the 3 GB tensor when it is exactly the same for every worker?
Is there any way to let the workers share a single copy of this tensor?
Thanks,
The PyTorch documentation explicitly mentions this issue of the DataLoader duplicating the underlying dataset (at least on Windows and macOS, as I understand it).
In general, you should not eagerly load your whole dataset into memory, precisely because of this issue. The dataset should be lazily loaded, i.e. samples should only be loaded when they are accessed in the __getitem__ method.
If your whole dataset is stored on disk as a monolithic tensor, you could fragment it into individual samples and save them into a folder, for instance.
You could then define your dataset as:
import torch
from torch.utils.data import Dataset, DataLoader
from glob import glob
from os.path import abspath

class MyDataset(Dataset):
    def __init__(self, folder: str):
        # Retrieve all tensor file names
        folder = abspath(folder)
        self.files = glob(f"{folder}/*.pt")

    def __getitem__(self, index: int):
        # Load tensors on demand, one sample at a time
        return torch.load(self.files[index])

    def __len__(self) -> int:
        return len(self.files)
Another solution is to memory-map the dataset. This is what HuggingFace does for huge datasets (see their datasets library). It avoids loading the whole dataset into RAM and also allows it to be shared across multiple processes without copies.
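A rough sketch of the memory-mapping idea using numpy.memmap (the file path, dtype, and shape are assumptions, and this is not how the HuggingFace Arrow-backed implementation works internally):

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, path: str, shape=(2000, 3, 32, 32)):
        # The array stays on disk; pages are read only when indexed, and the
        # mapping is shared by DataLoader workers instead of being copied.
        self.data = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

    def __getitem__(self, index: int):
        # Copy just this one sample into RAM and convert it to a tensor.
        return torch.from_numpy(np.array(self.data[index]))

    def __len__(self) -> int:
        return self.data.shape[0]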
Ray may be an interesting option for you. Check out Ray training datasets!
Additionally, you could use
data_id = ray.put(data)
to put your data into Ray's shared object store, and
data = ray.get(data_id)
to retrieve it without copying it between functions.
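A small sketch of how that might look (the names are illustrative; ray.put stores a single copy in Ray's shared object store, and object-ref arguments are resolved from that store when a task runs, with zero-copy reads for numpy arrays):

import numpy as np
import ray

ray.init()

data = np.random.randn(2000, 3, 32, 32).astype(np.float32)  # stand-in for the big tensor
data_id = ray.put(data)  # stored once in the shared object store

@ray.remote
def process(data, index):
    # `data` is resolved from the object store; numpy arrays are read
    # without being duplicated per worker.
    return float(data[index].sum())

results = ray.get([process.remote(data_id, i) for i in range(4)])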
I have my model written and trained in Keras. I'm trying to use it for inference in production. I receive SQS "task" messages containing a tuple of (path_in, path_out).
I can obviously use:
BATCH_SIZE = 10

batch_messages = []
while True:
    while len(batch_messages) < BATCH_SIZE:
        msg = sqs.read_message()
        batch_messages.append(msg)

    assert len(batch_messages) == BATCH_SIZE

    batch = np.array([read_image(msg.path_in) for msg in batch_messages])
    output_batch = model.predict(batch)

    for i in range(BATCH_SIZE):
        write_output(output_batch[i], path=batch_messages[i].path_out)

    batch_messages = []
The problem with that is that the code spends most of its time reading from SQS, reading the image from disk, and writing the output back at the end. This means the GPU is idle during all that time.
I'm aware of Keras' Sequence, but I'm not sure whether it is intended for this case as well, i.e. for inference rather than training.
I would suggest using TensorFlow Serving, as it implements a server-side batching strategy that optimizes inference speed and GPU utilization. Also, if you'd like to speed up your pipeline further, you could convert the model into a TensorRT model, which optimizes the model's operations for a specific GPU (and does a lot more).
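If you stay with in-process Keras inference rather than TensorFlow Serving, one common pattern is a producer/consumer queue that prefetches batches so the GPU is not idle while SQS and the disk are being read. A rough sketch, reusing the sqs, read_image, write_output, and model objects assumed in the question:

import queue
import threading
import numpy as np

BATCH_SIZE = 10
prefetch = queue.Queue(maxsize=4)  # a few batches queued ahead of the GPU

def producer():
    while True:
        batch_messages = [sqs.read_message() for _ in range(BATCH_SIZE)]
        batch = np.array([read_image(msg.path_in) for msg in batch_messages])
        prefetch.put((batch_messages, batch))  # blocks when the queue is full

threading.Thread(target=producer, daemon=True).start()

while True:
    batch_messages, batch = prefetch.get()  # waits only if nothing is prefetched yet
    output_batch = model.predict(batch)
    for msg, output in zip(batch_messages, output_batch):
        write_output(output, path=msg.path_out)  # could also be moved to a writer thread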