PyTorch: purpose of addmm function - pytorch

What is the purpose of the following PyTorch function (doc):
torch.addmm(beta=1, mat, alpha=1, mat1, mat2, out=None)
More specifically, is there any reason to prefer this function instead of just using
beta * mat + alpha * (mat1 @ mat2)

The addmm function is an optimized version of the equation beta*mat + alpha*(mat1 @ mat2). I ran some tests and timed their execution.
If beta=1 and alpha=1, then the execution time of both statements (addmm and manual) is approximately the same (addmm is just a little faster), regardless of matrix size.
If beta and alpha are not 1, then addmm is about two times faster than the manual version for smaller matrices (total elements on the order of 10^5). But if the matrices are large (on the order of 10^6 elements), the speedup seems negligible (39 ms vs. 41 ms).
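For reference, a minimal CPU timing sketch along these lines (the matrix sizes and the beta/alpha values are arbitrary illustrative choices, not the ones used in my tests):

import time
import torch

# Illustrative sizes only, roughly the "10^5 elements" regime discussed above.
mat = torch.randn(300, 300)
mat1 = torch.randn(300, 400)
mat2 = torch.randn(400, 300)
beta, alpha = 0.5, 2.0

def bench(fn, n=100):
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

t_addmm = bench(lambda: torch.addmm(mat, mat1, mat2, beta=beta, alpha=alpha))
t_manual = bench(lambda: beta * mat + alpha * (mat1 @ mat2))
print(f"addmm: {t_addmm * 1e3:.3f} ms, manual: {t_manual * 1e3:.3f} ms")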

On batch size, epochs, and learning rate of DistributedDataParallel

I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the batch size to 8, so each epoch will do 100 gradient steps.
Now I am in a multi-node multi-gpu scenario, with 2 nodes and 4 GPUs (so world size is 8).
I understand that I need to pass batches of 8 / 8 = 1, because each update will aggregate the gradients from the 8 GPUs. In each worker, the data loader will still load 100 batches, but each with 1 sample, so the whole dataset is parsed exactly once per epoch.
I checked, and everything seems to work like that.
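For reference, this is roughly the per-worker data-loading setup (a minimal sketch with dummy tensors; DistributedSampler assumes the process group is already initialized):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 100 * 8 = 800 dummy "images", as in the example above.
dataset = TensorDataset(torch.randn(800, 3, 32, 32), torch.randint(0, 10, (800,)))

# DistributedSampler reads the world size from the initialized process group.
# With world_size=8 each rank gets 800 / 8 = 100 samples; with batch_size=1
# that is 100 batches per rank, so each epoch covers the dataset exactly once.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=1, sampler=sampler)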
But what about the learning rate?
According to the official doc
When a model is trained on M nodes with batch=N, the gradient will be
M times smaller when compared to the same model trained on a single
node with batch=M*N if the loss is summed (NOT averaged as usual)
across instances in a batch (because the gradients between different
nodes are averaged). [...] But in most cases, you can just treat a
DistributedDataParallel wrapped model, a DataParallel wrapped model
and an ordinary model on a single GPU as the same (E.g. using the same
learning rate for equivalent batch size).
I understand that the gradients are averaged, so if the loss is averaged over samples nothing changes, while if it is summed we need to account for that. But does 'nodes' refer to the total number of GPUs across all cluster nodes (world size) or just to the cluster nodes? In my example, would M be 2 or 8? Some posts in the threads I linked say that the gradient is divided 'by the number of GPUs'. How exactly is the gradient aggregated?
Please refer to the following discussion:
https://github.com/PyTorchLightning/pytorch-lightning/discussions/3706
"As far as I know, learning rate is scaled with the batch size so that the sample variance of the gradients is kept approx. constant.
Since DDP averages the gradients from all the devices, I think the LR should be scaled in proportion to the effective batch size, namely, batch_size * num_accumulated_batches * num_gpus * num_nodes
In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, thus the LR should be scaled by sqrt(2), compared to a single GPU with effective batch size 512."
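To make the quoted rule concrete, here is a hypothetical helper (the function name, the baseline values, and the choice between linear and sqrt scaling are assumptions for illustration, not something the PyTorch docs prescribe):

import math

def scaled_lr(base_lr, reference_batch_size, per_gpu_batch_size,
              world_size, num_accumulated_batches=1, rule="linear"):
    """Hypothetical helper: scale an LR that was tuned at reference_batch_size.

    world_size is the total number of GPU processes across all nodes
    (8 in the example above), not the number of cluster nodes.
    """
    effective_batch = per_gpu_batch_size * world_size * num_accumulated_batches
    scale = effective_batch / reference_batch_size
    return base_lr * (scale if rule == "linear" else math.sqrt(scale))

# The scenario above: LR tuned for batch 8 on one GPU, now batch 1 on each of
# 8 GPUs. The effective batch size is unchanged, so the LR stays the same.
print(scaled_lr(1e-3, reference_batch_size=8, per_gpu_batch_size=1, world_size=8))  # 0.001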

Map function in Keras

In tutorial 'Text classification from scratch',
# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
How to understand this map function here?
In this case, the map function helps with asynchronous preprocessing. The example in the tutorial uses text-only data, so it discards the labels with the lambda function lambda x, y: x. This transformation is applied to each sample on the CPU of the host machine while your GPU is processing the previous batch of data. That asynchronous processing is what the map step in the input pipeline gives you: since the GPU doesn't have to wait for the next batch of data, you get full utilization.
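A self-contained sketch of the same pattern (the two-sample dataset here is a stand-in for the tutorial's raw_train_ds; tf.data.AUTOTUNE assumes a recent TensorFlow 2.x):

import tensorflow as tf

# Stand-in for the tutorial's raw_train_ds: batches of (text, label) pairs.
texts = tf.constant(["a good movie", "a bad movie"])
labels = tf.constant([1, 0])
raw_train_ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)

# Drop the labels: each (x, y) element is mapped to just x. num_parallel_calls
# and prefetch are what actually overlap this preprocessing with GPU work.
text_ds = (raw_train_ds
           .map(lambda x, y: x, num_parallel_calls=tf.data.AUTOTUNE)
           .prefetch(tf.data.AUTOTUNE))

for batch in text_ds:
    print(batch)  # only the text tensors, no labels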

Strange performance issue Spark LSH MinHash approxSimilarityJoin

I'm joining 2 datasets using Apache Spark ML LSH's approxSimilarityJoin method, but I'm seeing some strange behaviour.
After the (inner) join the dataset is a bit skewed; however, every time one or more tasks take an inordinate amount of time to complete.
The median is 6 ms per task (I'm running it on a smaller source dataset to test), but one task takes 10 minutes. It hardly uses any CPU cycles; it actually joins data, but very, very slowly.
The next slowest task runs in 14s, has 4x more records & actually spills to disk.
The join itself is an inner join between the two datasets on pos & hashValue (the minhash), in accordance with the MinHash specification, plus a UDF to calculate the Jaccard distance between match pairs.
Explode the hashtables:
modelDataset.select(
  struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
Jaccard distance function:
override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
  val xSet = x.toSparse.indices.toSet
  val ySet = y.toSparse.indices.toSet
  val intersectionSize = xSet.intersect(ySet).size.toDouble
  val unionSize = xSet.size + ySet.size - intersectionSize
  assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
  1 - intersectionSize / unionSize
}
Join of processed datasets:
// Do a hash join on where the exploded hash values are equal.
val joinedDataset = explodedA.join(explodedB, explodeCols)
  .drop(explodeCols: _*).distinct()
// Add a new column to store the distance of the two rows.
val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
val joinedDatasetWithDist = joinedDataset.select(col("*"),
  distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
)
// Filter the joined datasets where the distance are smaller than the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold)
I've tried combinations of caching, repartitioning and even enabling spark.speculation, all to no avail.
The data consists of shingled address text that has to be matched:
53536, Evansville, WI => 53, 35, 36, ev, va, an, ns, vi, il, ll, le, wi
will have a short distance to records with a typo in the city or zip code.
This gives pretty accurate results, but may be the cause of the join skew.
My questions are:
What may cause this discrepancy? (One task takes very, very long, even though it has fewer records.)
How can I prevent this skew in MinHash without losing accuracy?
Is there a better way to do this at scale? (I can't do a Jaro-Winkler / Levenshtein comparison of millions of records against every record in the location dataset.)
It might be a bit late, but I will post my answer here anyway to help others out. I recently had similar issues with matching misspelled company names (All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster). Someone helped me out by suggesting to use n-grams to reduce the data skew, and it helped a lot. You could also try using e.g. 3-grams or 4-grams.
I don't know how dirty the data is, but you could try to make use of the state field. That already reduces the number of possible matches substantially.
What really helped me improve the accuracy of the matches is to postprocess the connected components (groups of connected matches made by MinHashLSH) by running a label propagation algorithm on each component. This also allows you to increase N (of the n-grams), mitigating the problem of skewed data, and to set the Jaccard distance parameter in approxSimilarityJoin less tightly, since the label propagation postprocessing cleans up the spurious matches.
Finally, I am currently looking into using skip-grams for the matching. I found that in some cases it works better and reduces the data skew somewhat.
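A rough PySpark sketch of the n-gram idea (the column names, the 3-gram choice, the number of hash tables, the hashing dimension and the 0.6 threshold are all illustrative assumptions; dfA and dfB stand for the two address DataFrames):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml import Pipeline
from pyspark.ml.feature import NGram, HashingTF, MinHashLSH

# Hypothetical input: DataFrames dfA and dfB with an "address" string column.
to_chars = F.udf(lambda s: list(s.lower()) if s else [], ArrayType(StringType()))
dfA = dfA.withColumn("chars", to_chars("address"))
dfB = dfB.withColumn("chars", to_chars("address"))

# Character 3-grams instead of 2-grams: rarer shingles, so hash buckets are
# smaller and the join is less skewed.
pipeline = Pipeline(stages=[
    NGram(n=3, inputCol="chars", outputCol="shingles"),
    HashingTF(inputCol="shingles", outputCol="features", numFeatures=1 << 18),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
])
model = pipeline.fit(dfA)

matches = model.stages[-1].approxSimilarityJoin(
    model.transform(dfA), model.transform(dfB), 0.6, distCol="jaccardDist")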

What's the difference between "samples_per_epoch" and "steps_per_epoch" in fit_generator

I was confused by this problem for several days...
My question is why the training time differs so massively when I set the batch_size of my generator to 1 versus 20.
If I set the batch_size to be 1, the training time of 1 epoch is approximately 180 ~ 200 sec.
If I set the batch_size to be 20, the training time of 1 epoch is approximately 3000 ~ 3200 sec.
However, this huge difference between the training times seems abnormal, since I would expect the reversed result:
batch_size = 1, training time -> 3000 ~ 3200 sec.
batch_size = 20, training time -> 180 ~ 200 sec.
The input to my generator is not file paths, but numpy arrays that are already loaded into memory via np.load().
So I think the I/O trade-off issue doesn't exist.
I'm using Keras-2.0.3 and my backend is tensorflow-gpu 1.0.1
I have seen the update in this merged PR, but it seems that this change won't affect anything at all (the usage is just the same as the original one).
The link here is the gist of my self-defined generator and the relevant part of my fit_generator call.
When you use fit_generator, the number of samples processed for each epoch is batch_size * steps_per_epoch. From the Keras documentation for fit_generator: https://keras.io/models/sequential/
steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
This is different from the behaviour of fit, where increasing batch_size typically speeds things up.
In conclusion, when you increase batch_size with fit_generator, you should decrease steps_per_epoch by the same factor if you want training time to stay the same or lower.
Let's clear this up:
Assume you have a dataset with 8000 samples (rows of data) and you choose batch_size = 32 and epochs = 25.
This means that the dataset will be divided into 8000 / 32 = 250 batches, with 32 samples/rows in each batch. The model weights will be updated after each batch.
One epoch will train on 250 batches, i.e. 250 updates to the model.
Here, steps_per_epoch = number of batches.
With 25 epochs, the model will pass through the whole dataset 25 times.
Ref - https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
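Putting that arithmetic into code (model and my_generator are hypothetical placeholders; the point is only the batch_size / steps_per_epoch bookkeeping):

# Hypothetical generator and model, just to show the bookkeeping.
num_samples = 8000
batch_size = 32
steps_per_epoch = num_samples // batch_size   # 8000 / 32 = 250 updates per epoch

model.fit_generator(my_generator(batch_size=batch_size),
                    steps_per_epoch=steps_per_epoch,
                    epochs=25)

# If batch_size is raised to 64, steps_per_epoch must drop to 125, otherwise
# each "epoch" silently covers twice as much data and takes twice as long.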
You should also take into account the following function parameters when working with fit_generator:
max_queue_size, use_multiprocessing and workers
max_queue_size - might cause the generator to load more data than you actually expect, which, depending on your generator code, may do something unexpected or unnecessary and slow down your execution times.
use_multiprocessing together with workers - might spin up additional processes, which leads to extra work for serialization and interprocess communication. First your data is serialized with pickle, then sent to the worker processes, then the processing is done inside those processes, and then the whole communication procedure repeats in reverse: the results are pickled and sent back to the main process. In most cases this should be fast, but if you're processing dozens of gigabytes of data, or your generator is implemented in a sub-optimal fashion, you might get the slowdown you describe.
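For reference, this is where those parameters go in the call (a sketch; model, my_generator and the values are placeholders):

model.fit_generator(my_generator(batch_size=20),   # hypothetical generator
                    steps_per_epoch=250,
                    epochs=10,
                    max_queue_size=10,         # how many batches may be prefetched
                    workers=4,                 # parallel generator workers
                    use_multiprocessing=True)  # processes instead of threads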
In short:
fit() works faster than fit_generator() because it can access the data directly in memory.
fit() takes numpy arrays already in memory, while fit_generator() pulls data from a sequence generator such as keras.utils.Sequence, which is slower.

Random Forest: Running out of memory

I'm using scikit-learn's Random Forest to fit training data (~30 MB), and my laptop keeps crashing, running out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air, 2 GHz, 8 GB memory.
What are some of the ways to deal with this?
rf = RandomForestClassifier(n_estimators = 100, n_jobs=4)
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc'))
Your best choice is to tune the arguments.
n_jobs=4
This makes the computer compute four train-test cycles simultaneously. Different Python jobs run in separate processes, so the full dataset is also copied. Try reducing n_jobs to 2 or 1 to save memory; n_jobs==4 uses four times the memory that n_jobs==1 uses.
cv=20
This splits the data into 20 pieces and the code does 20 train-test iterations. This means that the training data is the size of 19 pieces of the original data. You can quite safely reduce it to 10; however, your accuracy estimate might get worse. It won't save much memory, but it makes the runtime shorter.
n_estimators = 100
Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
To sum up, I'd recommend reducing n_jobs to 2 to save memory (at the cost of a 2-fold increase in runtime). To compensate for the runtime, I'd suggest changing cv to 10 (a 2-fold saving in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (another 2-fold speedup).
I was dealing with a ~4 MB dataset, and a Random Forest from scikit-learn with default hyper-parameters was ~50 MB (so more than 10 times the size of the data). By setting max_depth = 6, the memory consumption decreased 66-fold. The performance of the shallow Random Forest on my dataset even improved!
I wrote this experiment down in the blog post.
From my experience, in the case of regression tasks the memory usage can grow even more, so it is important to control the tree depth. The tree depth can be controlled directly with max_depth or by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, or max_leaf_nodes.
The memory footprint of the Random Forest can of course also be controlled with the number of trees in the ensemble.
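Pulling the two answers together, a hedged sketch of a lower-memory setup (the exact values are illustrative, the import uses the modern sklearn.model_selection module rather than the old cross_validation one, and X_train_a / y_train are the question's arrays):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Memory-conscious variant of the snippet above, combining both suggestions.
rf = RandomForestClassifier(
    n_estimators=50,   # fewer trees -> smaller model, faster fit
    max_depth=6,       # shallow trees are dramatically smaller in memory
    n_jobs=2,          # fewer parallel copies of the data held at once
)
scores = cross_val_score(rf, X_train_a, y_train, cv=10, scoring='roc_auc')
print("10 Fold CV Score:", np.mean(scores))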
