AutoGluon TextPredictor.fit gives BrokenPipeError: [Errno 32] Broken pipe - possible solution

I'm posting this as another potential solution to the AutoGluon TextPredictor.fit() "BrokenPipeError: [Errno 32] Broken pipe" error, and to get feedback on it. I searched broadly and looked at related responses on SO, GitHub, the PyTorch discussion forum, etc., and did not see option 2 below proposed for AutoGluon, which is why I'd like to propose it.
For the AutoGluon Text Prediction quick start tutorial I used an AWS EC2 p3.2xlarge instance with 8 vCPUs, 61 GB RAM, and 1 NVIDIA V100 GPU, running Windows Server 2016.
Running the tutorial, TextPredictor.fit() raised "BrokenPipeError: [Errno 32] Broken pipe".
Option 1 - One way to fix it is to wrap your code in a function and call it under a main guard, as mentioned here:
def run():
    # TextPredictor.fit() code goes here
    pass

if __name__ == '__main__':
    run()
Option 2 - Earlier in that same SO post there is another resolution: set the PyTorch 'num_workers' to 0. In my case, however, PyTorch runs via AutoGluon. I looked through the AutoGluon docs here and noticed I could set both env.num_workers and env.num_workers_evaluation. In TextPredictor.fit() I set both to zero and it worked great on my EC2 instance.
predictor.fit(train_data,
              time_limit = TRAIN_TIME_LIMIT,
              hyperparameters = {"env.num_workers": 0, "env.num_workers_evaluation": 0})
Option 2 only takes ~3 minutes longer to run vs option 1 (17 mins vs 14 mins).
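Since this kind of broken-pipe error with data-loading workers is typically reported on Windows, one possible refinement is to apply the override only there and keep the defaults elsewhere. The following is a minimal sketch under that assumption: the helper name dataloader_worker_overrides is hypothetical (it is not part of AutoGluon), and the two hyperparameter keys are simply the ones used above.

```python
import platform


def dataloader_worker_overrides(system=None):
    """Return fit() hyperparameter overrides that disable worker subprocesses on Windows.

    `system` defaults to the current platform name; it is a parameter only so
    the behaviour can be checked without actually running on Windows.
    """
    system = system if system is not None else platform.system()
    if system == "Windows":
        # Keys taken from the post above; 0 keeps data loading in the main process.
        return {"env.num_workers": 0, "env.num_workers_evaluation": 0}
    # On other platforms, return no overrides so the library defaults apply.
    return {}


print(dataloader_worker_overrides("Windows"))
print(dataloader_worker_overrides("Linux"))
```

The result could then be passed as hyperparameters = dataloader_worker_overrides() in the fit() call, at the cost of the slower single-process loading only where it seems to be needed.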
My full code is below. As mentioned earlier most of it was borrowed from the AutoGluon site.
# -*- coding: utf-8 -*-
# -----
# Libs
from datetime import datetime
import torch
from autogluon.text import TextPredictor
import autogluon as ag
from autogluon.core.utils.loaders import load_pd
import pandas as pd
# -----
# Constants.
# This is in seconds. Ex: 5 mins * 60 secs per minute.
TRAIN_TIME_LIMIT = 1*60
# Data source
TRAIN_DATA = '.\\data\\train.parquet'
TEST_DATA = '.\\data\\train.parquet'
TRAIN_SAMPLE_SIZE = 67300
PRED_LABEL_COL = 'label'
# -----
# Main
print("\nStart time: ", datetime.now())
# Sanity test pytorch:
print("Is Cuda available? ", torch.cuda.is_available()) # Should be True
print("Is Cuda device count > 0? ", torch.cuda.device_count()) # Should be > 0
# My debug stuff
from autogluon.text.version import __version__
print("autogluon.text.version: ", __version__)
train_data = load_pd.load(TRAIN_DATA)
test_data = load_pd.load(TEST_DATA)
train_data = train_data.sample(n = TRAIN_SAMPLE_SIZE, random_state = 42)
predictor = TextPredictor(label = PRED_LABEL_COL, eval_metric = 'acc', path = '.\\ag_sst')
# Set num workers to zero.
predictor.fit(train_data,
              time_limit = TRAIN_TIME_LIMIT,
              hyperparameters = {"env.num_workers": 0, "env.num_workers_evaluation": 0})
test_score = predictor.evaluate(test_data)
print("\n\nTest score:", test_score)
test_score = predictor.evaluate(test_data, metrics=['acc'])
print(test_score)
print("\nEnd time: ", datetime.now())
Question - I'm not familiar with writing multiprocessing code. Is there anything wrong or inefficient about setting the fit() num_workers hyperparameters as above, within the context of the full code?

Related

Stablebaselines3 and Pettingzoo

I am trying to understand how to train agents in a pettingzoo environment using the single agent algorithm PPO implemented in stablebaselines3.
I'm following this tutorial where the agents act in a cooperative environment and they are all trained with (parameter sharing) PPO. However when I pass the pettingzoo environment into the PPO constructor of stablebaselines3, I get the following error message:
The algorithm only supports (<class 'gym.spaces.box.Box'>, <class 'gym.spaces.discrete.Discrete'>, <class 'gym.spaces.multi_discrete.MultiDiscrete'>, <class 'gym.spaces.multi_binary.MultiBinary'>) as action spaces but Box(-1.0, 1.0, (1,), float32) was provided
Here is my full code:
from pettingzoo.butterfly import pistonball_v6
from pettingzoo.utils.conversions import aec_to_parallel
import supersuit as ss
from stable_baselines3.ppo import CnnPolicy
from stable_baselines3 import PPO
def main():
    # Initialize environment
    env = pistonball_v6.env(n_pistons=20,
                            time_penalty=-0.1,
                            continuous=True,
                            random_drop=True,
                            random_rotate=True,
                            ball_mass=0.75,
                            ball_friction=0.3,
                            ball_elasticity=1.5,
                            max_cycles=125)
    # Reduce the complexity of the observation by considering only the blue channel
    env = ss.color_reduction_v0(env, mode='B')
    # Resize the observation to reduce dimension
    env = ss.resize_v1(env, x_size=84, y_size=84)
    # In order to let the policy learn based on the ball's velocity and acceleration,
    # we include the last 3 consecutive frames in the observation
    env = ss.frame_stack_v1(env, 3)
    # This is for using stable baselines
    env = aec_to_parallel(env)
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    # Prepare the environment to use stable baselines
    env = ss.concat_vec_envs_v1(env, 2, num_cpus=1, base_class='stable_baselines3')
    # PPO
    model = PPO(CnnPolicy,
                env,
                verbose=3,
                gamma=0.95,
                n_steps=256,
                ent_coef=0.0905168,
                learning_rate=0.00062211,
                vf_coef=0.042202,
                max_grad_norm=0.9,
                gae_lambda=0.99,
                n_epochs=5,
                clip_range=0.3,
                batch_size=256)
    model.learn(total_timesteps=100000)

if __name__ == "__main__":
    main()

How long does load_dataset take time in huggingface?

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 32_000
input_sentence_size = None
# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]
# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)
# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still on the load_dataset line.
What is it doing? Is it normal for load_dataset to take this long? Should I interrupt it and run it again?

Cannot export PyTorch model to ONNX

I am trying to convert a pre-trained torch model to ONNX, but receive the following error:
RuntimeError: step!=1 is currently not supported
I'm trying this on a pre-trained colorization model: https://github.com/richzhang/colorization
Here is the code I ran in Google Colab:
!git clone https://github.com/richzhang/colorization.git
cd colorization/
import torch
import colorizers
model = colorizer_siggraph17 = colorizers.siggraph17(pretrained=True).eval()
input_names = [ "input" ]
output_names = [ "output" ]
dummy_input = torch.randn(1, 1, 256, 256, device='cpu')
torch.onnx.export(model, dummy_input, "test_converted_model.onnx", verbose=True,
                  input_names=input_names, output_names=output_names)
I appreciate any help :)
UPDATE 1: @Proko's suggestion solved the ONNX export issue. Now I have a new, possibly related problem when I try to convert the ONNX model to TensorRT. I get the following error:
[TensorRT] ERROR: Network must have at least one output
Here is the code I used:
import torch
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import onnx
TRT_LOGGER = trt.Logger()
def build_engine(onnx_file_path):
    # initialize TensorRT engine and parse ONNX model
    builder = trt.Builder(TRT_LOGGER)
    builder.max_workspace_size = 1 << 25
    builder.max_batch_size = 1
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True
    network = builder.create_network()
    parser = trt.OnnxParser(network, TRT_LOGGER)
    # parse ONNX
    with open(onnx_file_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        parser.parse(model.read())
    print('Completed parsing of ONNX file')
    # generate TensorRT engine optimized for the target platform
    print('Building an engine...')
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()
    print("Completed creating Engine")
    return engine, context
ONNX_FILE_PATH = 'siggraph17.onnx' # Exported using the code above
engine,_ = build_engine(ONNX_FILE_PATH)
I tried to force the build_engine function to use the output of the network by:
network.mark_output(network.get_layer(network.num_layers-1).get_output(0))
but it did not work.
I appreciate any help!

As I mentioned in a comment, this is because slicing in torch.onnx supports only step = 1, but the model contains slices with a step of 2:
self.model2(conv1_2[:,:,::2,::2])
Your only option for now is to rewrite the slicing using other ops. You can do it by using arange and reshape to obtain the proper indices. Consider the following function "step-less-arange" (I hope it is generic enough for anyone with a similar problem):
def sla(x, step):
    diff = x % step
    x += (diff > 0)*(step - diff) # add length to be able to reshape properly
    return torch.arange(x).reshape((-1, step))[:, 0]
Usage:
>>> sla(11, 3)
tensor([0, 3, 6, 9])
Now you can replace every slice like this:
conv2_2 = self.model2(conv1_2[:,:,self.sla(conv1_2.shape[2], 2),:][:,:,:, self.sla(conv1_2.shape[3], 2)])
NOTE: you should optimize it. Indices are calculated for every call so it might be wise to pre-compute it.
I have tested it with my fork of the repo and I was able to save the model:
https://github.com/prokotg/colorization
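To see why these indices reproduce a strided slice, the arithmetic of sla can be mirrored with plain Python lists. This is only a dependency-free sketch for checking the index logic, not the code from the fork: the name sla_indices is hypothetical, and lists stand in for tensors.

```python
def sla_indices(length, step):
    """Plain-Python mirror of sla: pad length up to a multiple of step,
    split 0..padded-1 into rows of `step` entries, keep each row's first entry."""
    remainder = length % step
    padded = length + (step - remainder if remainder else 0)
    rows = [list(range(start, start + step)) for start in range(0, padded, step)]
    return [row[0] for row in rows]


data = list(range(100, 111))  # 11 elements, like the sla(11, 3) example
picked = [data[i] for i in sla_indices(len(data), 3)]

# Indexing with the computed positions selects the same elements as data[::3].
print(sla_indices(len(data), 3))
print(picked == data[::3])
```

The largest index produced is always below the original length, so the padding added for the reshape never leads to an out-of-range access.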

What worked for me was adding opset_version=11 to torch.onnx.export.
I first tried opset_version=10, but the API suggested 11, and with that it works.
So your call should be:
torch.onnx.export(model, dummy_input, "test_converted_model.onnx", verbose=True, opset_version=11,
                  input_names=input_names, output_names=output_names)

Why does creating a single tensor on the GPU take 2.5 seconds in PyTorch?

I'm just going through the beginner tutorial on PyTorch and noticed that one of the many different ways to put a tensor (basically the same as a numpy array) on the GPU takes a suspiciously long time compared to the other methods:
import time
import torch
if torch.cuda.is_available():
    print('time =', time.time())
    x = torch.randn(4, 4)
    device = torch.device("cuda")
    print('time =', time.time())
    y = torch.ones_like(x, device=device) # directly create a tensor on GPU => 2.5 secs??
    print('time =', time.time())
    x = x.to(device) # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double)) # ``.to`` can also change dtype together!
    a = torch.ones(5)
    print(a.cuda())
    print('time =', time.time())
else:
    print('I recommend you get CUDA to work, my good friend!')
Output (just times):
time = 1551809363.28284
time = 1551809363.282943
time = 1551809365.7204516 # (!)
time = 1551809365.7236063
Version details:
1 CUDA device: GeForce GTX 1050, driver version 415.27
CUDA = 9.0.176
PyTorch = 1.0.0
cuDNN = 7401
Python = 3.5.2
GCC = 5.4.0
OS = Linux Mint 18.3
Linux kernel = 4.15.0-45-generic
As you can see this one operation ("y = ...") takes much longer (2.5 seconds) than the rest combined (.003 seconds). I'm confused about this as I expect all these methods to basically do the same. I've tried making sure the types in this line are 32 bit or have different shapes but that didn't change anything.

When I re-order the commands, whatever command is on top takes 2.5 seconds. So this leads me to believe there is a delayed one-time setup of the device happening here, and future on-GPU allocations will be faster.
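One way to check the one-time-setup hypothesis is to time the same operation several times in a row and compare the first call with the later ones. The sketch below is generic and needs no GPU: time_calls is a hypothetical helper, and fake_gpu_op is a stand-in that simulates a lazily initialized device with a short sleep.

```python
import time


def time_calls(fn, repeats=3):
    """Return the wall-clock duration in seconds of each of `repeats` consecutive calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for an operation with lazy one-time setup (simulated, not real CUDA).
_state = {"ready": False}


def fake_gpu_op():
    if not _state["ready"]:
        time.sleep(0.05)  # simulated one-time initialization cost
        _state["ready"] = True


timings = time_calls(fake_gpu_op)
print(["%.4f" % t for t in timings])
```

With PyTorch, the same helper could wrap something like lambda: torch.ones(1, device="cuda"). Because CUDA work is launched asynchronously, calling torch.cuda.synchronize() inside the timed callable gives more trustworthy numbers.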

How to translate deprecated tf.train.QueueRunners tensorflow approach to importing data to new tf.data.Dataset approach

Although TensorFlow strongly recommends not using deprecated functions that are going to be replaced by tf.data objects, there seems to be no good documentation for cleanly replacing the deprecated approach with the modern one. Moreover, TensorFlow tutorials still use the deprecated functionality for file processing (Reading data tutorial: https://www.tensorflow.org/api_guides/python/reading_data).
On the other hand, though there is good documentation for using the 'modern' approach (Importing data tutorial: https://www.tensorflow.org/guide/datasets), the old tutorials still exist and will probably lead many, like me, to use the deprecated approach first. That is why one would like to cleanly translate the deprecated approach into the 'modern' one, and an example of this translation would probably be very useful.
#!/usr/bin/env python3
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import shutil
import os
# Start from a clean 'example' directory
if os.path.exists('example'):
    shutil.rmtree('example')
os.mkdir('example')
batch_sz = 10; epochs = 2; buffer_size = 30; samples = 0
for i in range(50):
    _x = np.random.randint(0, 256, (10, 10, 3), np.uint8)
    plt.imsave("example/image_{}.jpg".format(i), _x)
images = tf.train.match_filenames_once('example/*.jpg')
fname_q = tf.train.string_input_producer(images, epochs, True)
reader = tf.WholeFileReader()
_, value = reader.read(fname_q)
img = tf.image.decode_image(value)
img_batch = tf.train.batch([img], batch_sz, shapes=([10, 10, 3]))
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for _ in range(epochs):
        try:
            while not coord.should_stop():
                sess.run(img_batch)
                samples += batch_sz
                print(samples, "samples have been seen")
        except tf.errors.OutOfRangeError:
            print('Done training -- epoch limit reached')
        finally:
            coord.request_stop()
    coord.join(threads)
This code runs perfectly well for me, printing to console:
10 samples have been seen
20 samples have been seen
30 samples have been seen
40 samples have been seen
50 samples have been seen
60 samples have been seen
70 samples have been seen
80 samples have been seen
90 samples have been seen
100 samples have been seen
110 samples have been seen
120 samples have been seen
130 samples have been seen
140 samples have been seen
150 samples have been seen
160 samples have been seen
170 samples have been seen
180 samples have been seen
190 samples have been seen
200 samples have been seen
Done training -- epoch limit reached
As can be seen, it uses deprecated functions and objects such as tf.train.string_input_producer() and tf.WholeFileReader(). An equivalent implementation using the 'modern' tf.data.Dataset is needed.
EDIT:
I found an existing example for importing CSV data: Replacing Queue-based input pipelines with tf.data. I would like to be as complete as possible here and assume that more examples are better, so I don't consider this a duplicate question.

Here is the translation, which prints exactly the same to standard output.
#!/usr/bin/env python3
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
import shutil
# Start from a clean 'example' directory
if os.path.exists('example'):
    shutil.rmtree('example')
os.mkdir('example')
batch_sz = 10; epochs = 2; buffer_sz = 30; samples = 0
for i in range(50):
    _x = np.random.randint(0, 256, (10, 10, 3), np.uint8)
    plt.imsave("example/image_{}.jpg".format(i), _x)
fname_data = tf.data.Dataset.list_files('example/*.jpg')\
    .shuffle(buffer_sz).repeat(epochs)
img_batch = fname_data.map(lambda fname: \
    tf.image.decode_image(tf.read_file(fname),3))\
    .batch(batch_sz).make_initializable_iterator()
with tf.Session() as sess:
    sess.run([img_batch.initializer,
              tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    next_element = img_batch.get_next()
    try:
        while True:
            sess.run(next_element)
            samples += batch_sz
            print(samples, "samples have been seen")
    except tf.errors.OutOfRangeError:
        pass
    print('Done training -- epoch limit reached')
The main points are:
Use of tf.data.Dataset.list_files() to load filenames as a dataset, instead of generating a queue with the deprecated tf.train.string_input_producer() for consuming filenames.
Use of iterators to process datasets, which also require initialization, instead of sequential reads from a deprecated tf.WholeFileReader, batched with the deprecated tf.train.batch() function.
A Coordinator is not needed because threads for queues (tf.train.QueueRunners created by tf.train.string_input_producer()) are no longer used, but the end of the dataset iterator must be detected (here by catching tf.errors.OutOfRangeError).
I hope this will be as useful for others as it was for me once I got it working.
Ref:
Importing data: https://www.tensorflow.org/guide/datasets
Medium Datasets Tutorial: https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428
BONUS: Dataset + Estimator
#!/usr/bin/env python3
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
import shutil
# Start from a clean 'example' directory
if os.path.exists('example'):
    shutil.rmtree('example')
os.mkdir('example')
batch_sz = 10; epochs = 2; buffer_sz = 10000; samples = 0
for i in range(50):
    _x = np.random.randint(0, 256, (10, 10, 3), np.uint8)
    plt.imsave("example/image_{}.jpg".format(i), _x)
def model(features,labels,mode,params):
    return tf.estimator.EstimatorSpec(
        tf.estimator.ModeKeys.PREDICT,{'images': features})
estimator = tf.estimator.Estimator(model,'model_dir',params={})
def input_dataset():
    return tf.data.Dataset.list_files('example/*.jpg')\
        .shuffle(buffer_sz).repeat(epochs).map(lambda fname: \
            tf.image.decode_image(tf.read_file(fname),3))\
        .batch(batch_sz)
predictions = estimator.predict(input_dataset,
    yield_single_examples=False)
for p_dict in predictions:
    samples += batch_sz
    print(samples, "samples have been seen")
print('Done training -- epoch limit reached')
The main points are:
Definition of a model function for a custom estimator for processing images, which in this case does nothing because we are just passing them through.
Definition of an input_dataset function for retrieving the dataset to be used (for prediction in this case) by the estimator.
Use of tf.estimator.Estimator.predict() on the estimator instead of using tf.Session() directly, plus yield_single_examples=False to retrieve batches of elements instead of single elements in the predictions list of dictionaries.
This seems to me like more modular and reusable code.
Ref:
Datasets for estimators: https://www.tensorflow.org/guide/datasets_for_estimators,
Custom estimators: https://www.tensorflow.org/guide/custom_estimators
