How long does load_dataset take in huggingface? - python-3.x

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 32_000
input_sentence_size = None
# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]
# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)
# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still stuck on load_dataset.
What is it doing? Is it normal for load_dataset to take so much time? Should I interrupt it and run it again?

Related

How does the queue in Pytorch DataLoader work with num_workers >= 2?

Do I understand the following correctly?
When num_workers >=1, the main process pre-loads prefetch_factor * num_workers batches. When the training loop consumes one batch, the corresponding worker loads the next batch in its queue.
If this is the case, let's go through an example.
NOTE: I have chosen the numeric values for illustration purposes and
have ignored various overheads in this example. The accompanying code example uses these numbers.
Say I have num_workers=4, prefetch_factor=4, batch_size=128. Further assume, it takes 0.003125 s to fetch an item from a source database and the train step takes 0.05 s.
Now, each batch would take 0.003125 * 128 = 0.4 s to load.
With a prefetch_factor=4 and num_workers=4, first, 4*4=16 batches will be loaded.
Once the 16 batches are loaded, the first train step consumes 1 batch and takes 0.05 s. Say worker[0] provided this batch and will start the process to generate a new batch to replenish the queue. Recall fetching a new batch takes 0.4 s.
Similarly, the second step consumes one more batch and the corresponding worker (worker[1] in this example) starts the data fetching process.
The first 8 train steps would take 0.05*8=0.4s. By this time, 8 batches have been
consumed and worker[0] has produced 1 batch. In the next step, 1 batch is consumed and worker[1] produces a new batch. worker[1] had started the data fetching process in the second train step which would now be completed.
Following this we can see, each subsequent train step will consume 1 batch and one of the workers will produce 1 batch, keeping the dataloader queue to have always 8 batches. This means that the train step is never waiting for the data loading process as there are always 8 batches in the buffer.
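For reference, the arithmetic above in a few lines of Python (same illustrative numbers, overheads ignored):

fetch_time_per_item = 0.003125   # s per item from the source database
batch_size = 128
train_step_time = 0.05           # s per train step
num_workers = 4
prefetch_factor = 4

batch_load_time = fetch_time_per_item * batch_size          # 0.4 s to load one batch
initial_prefetch = num_workers * prefetch_factor            # 16 batches loaded up front
steps_per_worker_batch = batch_load_time / train_step_time  # 8 train steps while one worker builds a batch
print(batch_load_time, initial_prefetch, steps_per_worker_batch)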
I would expect this behavior regardless of the data size of the batch, given num_workers and prefetch_factor are large enough. However, in the following code example that is not the case.
In the code below, I define a custom iterable that returns a numpy array. As the size of the numpy array increases, increasing num_workers or prefetch_factor does not improve the time taken to run through a batch.
I'm guessing this is because each worker serializes the batch to send to the main process, where it is de-serialized. As the data size increases, this process would take more time. However, I would think that if the queue size is large enough (num_workers, prefetch_factor), there should be a break-even point where each training step's consumption of a batch is accompanied by replenishment via one of the workers, as I illustrated in the above example.
In the code below, when MyIterable returns a small object (np array of size (10, 150)), increasing num_workers helps as expected. But when the returned object is larger (np array of size (1000, 150)), num_workers or prefetch_factor does not do much.
# small np object
avg time per batch for num workers=0: 0.47068126868714444
avg time per batch for num workers=2: 0.20982365206225495
avg time per batch for num workers=4: 0.10560789656221914
avg time per batch for num workers=6: 0.07202646931250456
avg time per batch for num workers=8: 0.05311137337469063
# large np object
avg time per batch for num workers=0: 0.6090951558124971
avg time per batch for num workers=2: 0.4594530961876444
avg time per batch for num workers=4: 0.45023533212543043
avg time per batch for num workers=6: 0.3830978863124983
avg time per batch for num workers=8: 0.3811495694375253
Am I missing something here? Why doesn't the data loader queue have enough buffer such that data loading is not the bottleneck?
Even if the serialization and de-serialization process takes longer in the latter case, I'd expect there to be a large enough buffer where the consumption and replenishment rates of the batches are almost equal. Otherwise, what is the point of having prefetch_factor?
If the code is behaving as expected, are there any other ways to pre-load the next n batches in a buffer such that it is large enough and never depleted?
Thanks
import time
import torch
import numpy as np
from time import sleep
from torch.utils.data import DataLoader, IterableDataset


def collate_fn(records):
    # some custom collation function
    return records


class MyIterable(object):
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            self.i += 1  # advance the counter so the iterator stops after n items
            sleep(0.003125)  # simulates data fetch time
            # return np.random.random((10, 150))  # small data item
            return np.random.random((1000, 150))  # large data item
        else:
            raise StopIteration


class MyIterableDataset(IterableDataset):
    def __init__(self, n):
        super().__init__()
        self.n = n

    def __iter__(self):
        return MyIterable(self.n)


def get_performance_metrics(num_workers):
    ds = MyIterableDataset(n=10000)
    if num_workers == 0:
        dl = DataLoader(ds, num_workers=0, batch_size=128, collate_fn=collate_fn)
    else:
        dl = DataLoader(ds, num_workers=num_workers, prefetch_factor=4, persistent_workers=True,
                        batch_size=128, collate_fn=collate_fn,
                        multiprocessing_context='spawn')
    warmup = 5
    times = []
    t0 = time.perf_counter()
    for i, batch in enumerate(dl):
        sleep(0.05)  # simulates train step
        e = time.perf_counter()
        if i >= warmup:
            times.append(e - t0)
        t0 = time.perf_counter()
        if i >= 20:
            break
    print(f'avg time per batch for num workers={num_workers}: {sum(times) / len(times)}')


if __name__ == '__main__':
    num_worker_options = [0, 2, 4, 6, 8]
    for n in num_worker_options:
        get_performance_metrics(n)

Setting task slots with pyspark on an individual machine

I am trying to run the optimization of an ML model using SparkTrials from the hyperopt library. I am running this on a single machine with 16 cores, but when I run the following code, which sets the number of cores to 8, I get a warning that seems to indicate that only one core is used.
SparkTrials accepts a spark_session argument, which in theory is where I set the number of cores.
Can anyone help me?
Thanks!
import os, shutil, tempfile
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
import numpy as np
from sklearn import linear_model, datasets, model_selection
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").config('spark.local.dir', './').config("spark.executor.cores", 8).getOrCreate()


def gen_data(bytes):
    """
    Generates train/test data with target total bytes for a random regression problem.
    Returns (X_train, X_test, y_train, y_test).
    """
    n_features = 100
    n_samples = int(1.0 * bytes / (n_features + 1) / 8)
    X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features, random_state=0)
    return model_selection.train_test_split(X, y, test_size=0.2, random_state=1)


def train_and_eval(data, alpha):
    """
    Trains a LASSO model using training data with the input alpha and evaluates it using test data.
    """
    X_train, X_test, y_train, y_test = data
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    loss = model.score(X_test, y_test)
    return {"loss": loss, "status": STATUS_OK}


def tune_alpha(objective):
    """
    Uses Hyperopt's SparkTrials to tune the input objective, which takes alpha as input and returns loss.
    Returns the best alpha found.
    """
    best = fmin(
        fn=objective,
        space=hp.uniform("alpha", 0.0, 10.0),
        algo=tpe.suggest,
        max_evals=8,
        trials=SparkTrials(parallelism=8, spark_session=spark))
    return best["alpha"]


data_small = gen_data(10 * 1024 * 1024)  # ~10MB

def objective_small(alpha):
    # For small data, you might reference it directly.
    return train_and_eval(data_small, alpha)

tune_alpha(objective_small)
Parallelism (8) is greater than the current total of Spark task slots
(1). If dynamic allocation is enabled, you might see more executors
allocated.
If you are in a cluster: a core in Spark nomenclature is unrelated to a physical core in your CPU. With spark.executor.cores you specified that the maximum number of threads (= tasks) each executor (you have one here) can run is 8. If you want to increase the number of executors, you have to use --num-executors on the command line or the spark.executor.instances configuration property in your code.
I suggest trying something like the following configuration if you are in a YARN cluster:
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors","2")
spark.conf.set("spark.dynamicAllocation.maxExecutors","10")
Please note that the above options are not available in local mode.
Local mode: in local mode you only have one executor, and if you want to change the number of its worker threads (which is one by default), you have to set your master like local[*] or local[16].
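As an illustration, a minimal sketch adapting the SparkSession from the question (local[16] is just an example value for a 16-core machine):

from pyspark.sql import SparkSession

# In local mode the master URL controls the number of worker threads (= task slots),
# not spark.executor.cores; local[16] gives 16 task slots, local[*] uses all cores.
spark = (SparkSession.builder
         .master("local[16]")
         .config("spark.local.dir", "./")
         .getOrCreate())

# SparkTrials(parallelism=8, spark_session=spark) should then see 16 task slots
# instead of 1, so the parallelism warning should no longer appear.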

BoostedTreeClassifier gets stuck on loss on the first step

I'm trying to run a simple boostedTreeClassifier on my dataset from the example, but it seems to get stuck on the first step:
2019-06-28 11:20:31.658689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 84090 of 85873
2019-06-28 11:20:32.908425: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
I0628 11:20:34.904214 140220602029888 basic_session_run_hooks.py:262] loss = 0.6931464, step = 0
W0628 11:21:03.421219 140220602029888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W0628 11:21:05.555618 140220602029888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
The same dataset seems to work fine when I pass it to other keras based model or xgboost model.
Here's the relevant code:
def make_input_fn(self, X, y, shuffle=True, num_epochs=None):
    num_samples = len(self.y_train)

    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(num_samples).repeat(num_epochs).batch(self.batch_size)
        else:
            dataset = dataset.repeat(num_epochs).batch(self.batch_size)
        return dataset

    return input_fn

def ens_train(self):
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)
    train_input_fn = self.make_input_fn(self.X_train, self.y_train, num_epochs=self.epochs)
    self.model = tf.estimator.BoostedTreesClassifier(self.feature_columns,
                                                     n_batches_per_layer=int(0.5 * len(self.y_train) / self.batch_size),
                                                     model_dir=self.ofolder,
                                                     max_depth=10,
                                                     n_trees=1000)
    self.model.train(train_input_fn, max_steps=1000)
I was able to get a result by playing around with learning rates and the number of epochs. The "best" parameters obtained by hyperparameter tuning on xgboost don't give similar results in BoostedTreeClassifier. It took a large number of epochs to get around 84% accuracy (balanced dataset); xgboost had given 95% without even hyperparameter tuning.
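The answer does not give the exact values used; as a rough sketch adapting the snippet from the question, the learning_rate and num_epochs values below are placeholders rather than the answerer's settings:

# tf.estimator.BoostedTreesClassifier exposes a learning_rate argument (default 0.1);
# the number of passes over the data is controlled by num_epochs in the input_fn.
train_input_fn = self.make_input_fn(self.X_train, self.y_train, num_epochs=200)  # placeholder epoch count
self.model = tf.estimator.BoostedTreesClassifier(
    self.feature_columns,
    n_batches_per_layer=int(0.5 * len(self.y_train) / self.batch_size),
    model_dir=self.ofolder,
    max_depth=10,
    n_trees=1000,
    learning_rate=0.05)  # placeholder value; tune for your data
self.model.train(train_input_fn, max_steps=1000)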

Is sklearn learning_curve function supported by dask?

I'm computing learning curves out of random forests using sklearn. I need to do it for a lot of different RFs, therefore I want to use a cluster and Dask to reduce the time of the RF fits.
Currently I implemented the following algorithm:
from sklearn.externals import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, learning_curve
from dask.distributed import Client, LocalCluster

worker_kwargs = dict(memory_limit="2GB", ncores=4)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, **worker_kwargs)  # processes=False?
client = Client(cluster)

X, Y = ..., ...
estimator = RandomForestRegressor(n_jobs=-1, **rf_params)
cv = ShuffleSplit(n_splits=5, test_size=0.2)
train_sizes = [...]  # 20 different values

with joblib.parallel_backend('dask', scatter=[X, Y]):
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, Y, cv=cv, n_jobs=-1,
                                                            train_sizes=train_sizes)
Here are 2 levels of parallelism:
One for the fitting of a RF (n_jobs=-1)
One for the looping over all the training set sizes (n_jobs=-1)
My problem is: if the backend is loky, then it takes around 23s.
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 22.8s finished
Now, if the backend is dask, then it takes more time:
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 30.3s finished
I know that Dask introduces overhead, but I don't expect that this explains all of the difference in running time.
Dask is being developed quickly, and I find a lot of different ways to do the same thing, without knowing which one is up to date.

Memory leak evaluating CNN model for text classification

I've been adapting some code from this blog post about CNNs for text classification:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
Everything works fine! But when I try to use the trained model to predict new instances, it consumes all the available memory. It seems that it's not freeing any memory during evaluation, and it loads the whole model again and again. As far as I know, memory should be freed after every sess.run command.
Here is the part of the code I'm working with:
with graph.as_default():
    session_conf = tf.ConfigProto(
        allow_soft_placement=FLAGS.allow_soft_placement,
        log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # Load the saved meta graph and restore variables
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)

        # Get the placeholders from the graph by name
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        # input_y = graph.get_operation_by_name("input_y").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]

        # Tensors we want to evaluate
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]

        # Add a vector for probas
        probas = graph.get_operation_by_name("output/scores").outputs[0]

        # Generate batches for one epoch
        print("\nGenerating batches...\n")
        gc.collect()
        #mem0 = proc.get_memory_info().rss
        batches = data_helpers.batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False)
        #mem1 = proc.get_memory_info().rss
        print("\nBatches done...\n")
        #pd = lambda x2, x1: 100.0 * (x2 - x1) / mem0
        #print "Allocation: %0.2f%%" % pd(mem1, mem0)

        # Collect the predictions here
        all_predictions = []
        all_probas = []

        for x_test_batch in batches:
            # Calculate probability of the prediction being good
            gc.collect()
            batch_probas = sess.run(tf.reduce_max(tf.nn.softmax(probas), 1),
                                    {input_x: x_test_batch, dropout_keep_prob: 1.0})
            batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])
            all_probas = np.concatenate([all_probas, batch_probas])
            # Add summary ops to collect data
            with tf.name_scope("eval") as scope:
                p_h = tf.histogram_summary("eval/probas", batch_probas)
                summary = sess.run(p_h)
                eval_summary_writer.add_summary(summary)
Any help will be much appreciated
Cheers
Your evaluation loop creates new TensorFlow operations (tf.reduce_max(), tf.nn.softmax() and tf.histogram_summary()) in each iteration, which leads to more memory being consumed over time. TensorFlow is most efficient when you run the same graph many times, because it can amortize the cost of optimizing the graph over multiple executions. Therefore, to get the best performance, you should revise your program so that you create each of these operations once, before the for x_test_batch in batches: loop, and then re-use the same operations in each iteration.
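A minimal sketch of that restructuring, reusing the variable names from the question's snippet:

# Build the extra ops once, before the loop, so the graph stops growing.
top_probas = tf.reduce_max(tf.nn.softmax(probas), 1)
with tf.name_scope("eval"):
    p_h = tf.histogram_summary("eval/probas", top_probas)

all_predictions = []
all_probas = []
for x_test_batch in batches:
    feed_dict = {input_x: x_test_batch, dropout_keep_prob: 1.0}
    # Only run the pre-built tensors; no new ops are created per iteration.
    batch_probas, batch_predictions, summary = sess.run([top_probas, predictions, p_h], feed_dict)
    all_predictions = np.concatenate([all_predictions, batch_predictions])
    all_probas = np.concatenate([all_probas, batch_probas])
    eval_summary_writer.add_summary(summary)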

Resources