I've written a program to process a large number of data samples using Repa. Performance is key for this program. A large part of the operations require parallel maps/folds over multi-dimensional arrays, and Repa is perfect for this. However, there is still a part of my program that only uses one-dimensional arrays and doesn't require parallelism (i.e. the overhead of parallelism would harm performance). Some of these operations require functions like take or folds with custom accumulators, which Repa doesn't support. So I'm writing these operations myself by iterating over the Repa array.
Am I better off re-writing these operations by using Vector instead of Repa? Would they result in better performance?
I've read somewhere that one-dimensional Repa arrays are implemented as Vectors 'under the hood' so I doubt that Vectors result in better performance. On the other hand, Vector does have some nice built-in functions that I could use instead of writing them myself.
I've implemented some parts of my program with Data.Vector.Unboxed instead of using one-dimensional Data.Array.Repa. Except for some minor improvements, the algorithms are the same. Data.Vector.Unboxed seems to be 4 times faster than one-dimensional Data.Array.Repa for sequential operations.
What's the difference between setting CV=some integer vs cv=PredefinedSplit(test_fold=your_test_fold)?
Is there any advantage of one over the other? Does cv=some integer set the splits randomly?
Specifying an integer will produce k-fold cross-validation without shuffling, as described in the documentation for sklearn.model_selection.KFold (for a classifier with binary or multiclass targets, StratifiedKFold is used instead, also without shuffling). Shuffling before splitting may or may not be preferred: if your data is sorted, shuffling is necessary to randomize the distribution of samples, while if the samples are correlated due to spatial or temporal sampling effects, shuffling may give an overly optimistic view of performance.
I would avoid using PredefinedSplit unless you have a very good reason to predefine your splits. There are other CV generators that can probably meet your needs, for example StratifiedKFold if you want to maintain your class distribution.
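To make the difference concrete, here is a minimal pure-Python sketch of the two splitting rules. It only mimics the documented behaviour (contiguous, unshuffled folds for an integer; folds read from a test_fold array, with -1 meaning "never in a test set", for PredefinedSplit). The function names kfold_indices and predefined_split_indices are made up for this illustration and are not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits):
    """Contiguous, unshuffled folds; the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds


def predefined_split_indices(test_fold):
    """test_fold[i] is the fold in which sample i is tested; -1 keeps sample i in every training set."""
    folds = []
    for fold_id in sorted({f for f in test_fold if f != -1}):
        test = [i for i, f in enumerate(test_fold) if f == fold_id]
        train = [i for i, f in enumerate(test_fold) if f != fold_id]
        folds.append((train, test))
    return folds


# Integer cv: the folds follow the order of the data, nothing is randomized.
for train, test in kfold_indices(6, 3):
    print("kfold      train", train, "test", test)

# Predefined split: you decide which sample is tested in which fold.
for train, test in predefined_split_indices([0, 1, -1, 1, 0]):
    print("predefined train", train, "test", test)
```

With an integer, the test folds are simply consecutive blocks of the data, which is why sorted data needs shuffling first.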
I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to variational GP. This github issue explains how to execute multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation for a cluster or the Cloud?
In terms of computation, GPflow can do whatever TensorFlow does. In other words, if TensorFlow supported cloud evaluations, GPflow would support them as well. But that doesn't mean you cannot implement your own version of the TensorFlow computation, possibly a more efficient one that is able to run on the cloud. You can start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
Linear algebra operations like Cholesky are hard to parallelise, and the time saved by doing so would be questionable. Memory-wise, though, the advantage of cluster computing is obvious.
If you're interested in MVM-based inference we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic lanczos quadrature for logdet, and preconditioned CG for the solve, but so far have not committed those into TFP.
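For readers unfamiliar with MVM-based inference, the idea is to solve K x = y using only matrix-vector products with K, so the matrix never has to be factorized (and the products themselves can be distributed). Below is a rough pure-Python sketch of a plain, unpreconditioned conjugate gradient solve. It is not the TFP code linked above and not the preconditioned variant I mentioned; the helper names (matvec, dot, conjugate_gradient) and the tiny 2x2 system are made up for illustration.

```python
def matvec(A, v):
    """Dense matrix-vector product; in a real setting this is the part you would distribute."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(mvm, b, tol=1e-10, max_iter=None):
    """Solve K x = b for a symmetric positive definite K, given only mvm(v) = K v."""
    n = len(b)
    max_iter = max_iter or 10 * n
    x = [0.0] * n
    r = list(b)          # residual b - K x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Kp = mvm(p)
        alpha = rs_old / dot(p, Kp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * kpi for ri, kpi in zip(r, Kp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy "covariance" matrix and targets, purely illustrative.
K = [[4.0, 1.0], [1.0, 3.0]]
y = [1.0, 2.0]
x = conjugate_gradient(lambda v: matvec(K, v), y)
print(x)
```

Each iteration costs one product with K, so memory and compute scale with how you implement mvm rather than with a full Cholesky factor.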
I'm working on extracting text features from a large dataset of documents (about 15 million documents) using CountVectorizer. I also looked at HashingVectorizer as an alternative, but I think CountVectorizer is what I need, as it provides more information about text features and other stuff.
The problem here is kinda common: I don't have enough memory when fitting the CountVectorizer model.
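from sklearn.feature_extraction.text import CountVectorizer

def getTexts():
    # an iterator that will yield each document from the database
    yield from ()  # placeholder: replace with the actual database query

vectorizer = CountVectorizer(max_features=500, ngram_range=(1, 3))
X = vectorizer.fit_transform(getTexts())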
Here, let's say I have an iterator that will yield one document at a time from a database. If I pass this iterator as a parameter to CountVectorizer fit() function, how is the vocabulary built? Does it wait until finishing loading all the documents and then do the fit() once, or does it load one document at a time, do the fit, and then load the next one? What's a possible solution to resolve the memory overhead here?
CountVectorizer consumes much more memory because it needs to store a vocabulary dictionary in memory. HashingVectorizer has better memory behaviour because it does not need to store a vocabulary dictionary at all. The main differences between the two vectorizers are described in the documentation of HashingVectorizer:
This strategy has several advantages:
it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
no IDF weighting as this would render the transformer stateful.
As for the iterator: CountVectorizer consumes it one document at a time, so it does not need all the raw documents loaded at once. While it does so, however, it keeps building its vocabulary dictionary, which is why the memory usage surges.
To reduce memory usage, you may need to reduce the size of the document dataset. Giving a lower max_features parameter may also help, although that limit is applied only after the full vocabulary has been counted, so it mainly shrinks the result rather than the peak during fitting. If you want to resolve this memory problem completely, try HashingVectorizer instead of CountVectorizer.
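To show why hashing needs no vocabulary, here is a small pure-Python sketch of the hashing trick applied to a stream of documents. It is only an illustration of the idea: it uses zlib.crc32 as a stand-in hash and a naive whitespace tokenizer, whereas the real HashingVectorizer uses its own hash function and tokenization. The names hashed_counts and stream_features are made up for this example.

```python
import zlib
from collections import Counter


def hashed_counts(text, n_features=2 ** 18):
    """Map each token straight to a column index via a hash; no vocabulary is stored."""
    counts = Counter()
    for token in text.lower().split():
        counts[zlib.crc32(token.encode("utf-8")) % n_features] += 1
    return dict(counts)


def stream_features(docs, n_features=2 ** 18):
    """Lazily yield one sparse row (column index -> count) per document."""
    for doc in docs:
        yield hashed_counts(doc, n_features)


# Stand-in for the database iterator from the question.
docs = iter(["the cat sat", "the dog sat down"])
for row in stream_features(docs):
    print(row)
```

Because the column index depends only on the token itself, documents can be processed in batches or in parallel, and the only state is n_features. The price is the one listed in the quoted documentation: you cannot map a column back to its token.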
I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering if PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not utilizing a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any actions to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on matrix shapes, a multiplication ABCD may have different performance computed as A(B(CD)) than if computed as (AB)(CD), etc.
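To illustrate how much the order can matter, here is a small pure-Python sketch of the classic matrix-chain ordering computation. It only counts scalar multiplications from the shapes; it does not touch PyTorch at all, and the function name best_order and the example shapes are made up for illustration.

```python
from functools import lru_cache


def best_order(dims):
    """Matrix i has shape dims[i] x dims[i + 1].

    Returns (number of scalar multiplications, parenthesization) for the cheapest order.
    """
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == j:
            return 0, f"M{i}"
        best = None
        for k in range(i, j):
            left_cost, left = solve(i, k)
            right_cost, right = solve(k + 1, j)
            # Cost of multiplying the (dims[i] x dims[k+1]) and (dims[k+1] x dims[j+1]) results.
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            if best is None or cost < best[0]:
                best = (cost, f"({left} {right})")
        return best

    return solve(0, n - 1)


# Three matrices of shapes 10x30, 30x5 and 5x60.
print(best_order((10, 30, 5, 60)))
```

For these shapes, (M0 M1) M2 needs 4500 scalar multiplications while M0 (M1 M2) needs 27000, a 6x difference for the same result. Recent PyTorch versions also provide torch.linalg.multi_dot, which picks the order for a whole chain; worth checking whether your version has it.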
I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can (or have to) specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Ideas:
Using Accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (I have implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together in the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Do the Spark optimization algorithms realize that they do not need to save each net but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to this: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). But I would have to use some kind of loop over each row to update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are all of these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means is implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. In fact, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator, and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the underlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from Spark's Transformer/Model. It contains the map prototypes (center vectors), and a transform() method that can operate on DataFrames by taking an input feature column and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDD and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as a RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data, select/filter/aggregation operations, DO USE Dataframes, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey/and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
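To give an idea of what "distribute your computations yourself" looks like for one batch SOM iteration, here is a pure-Python sketch in which plain lists stand in for RDD partitions. It is not my Scala implementation: it uses a per-partition sum followed by a merge (what mapPartitions plus reduce would do in PySpark, with the prototypes broadcast) rather than accumulators, it assumes a Gaussian neighbourhood, and all the names (bmu, partition_stats, merge, batch_som_step) are made up for illustration.

```python
import functools
import math


def bmu(x, prototypes):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, prototypes[k])))


def partition_stats(rows, prototypes, coords, sigma):
    """Per-partition numerators and denominators of the batch SOM update."""
    dim = len(prototypes[0])
    num = [[0.0] * dim for _ in prototypes]
    den = [0.0] * len(prototypes)
    for x in rows:
        c = coords[bmu(x, prototypes)]
        for k, ck in enumerate(coords):
            d2 = sum((a - b) ** 2 for a, b in zip(c, ck))
            h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighbourhood on the map grid
            den[k] += h
            for j in range(dim):
                num[k][j] += h * x[j]
    return num, den


def merge(s1, s2):
    """Combine the statistics of two partitions (the 'reduce' step)."""
    num = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1[0], s2[0])]
    den = [a + b for a, b in zip(s1[1], s2[1])]
    return num, den


def batch_som_step(partitions, prototypes, coords, sigma):
    """One batch update: each new prototype is a neighbourhood-weighted mean of the data."""
    stats = [partition_stats(p, prototypes, coords, sigma) for p in partitions]
    num, den = functools.reduce(merge, stats)
    # Keep the old prototype if no data point gave it any weight.
    return [[v / d for v in row] if d > 0 else list(old)
            for row, d, old in zip(num, den, prototypes)]


data = [[0.0], [1.0], [9.0], [10.0]]
prototypes = [[0.0], [10.0]]
coords = [(0,), (1,)]          # positions of the two units on a 1-D map
print(batch_som_step([data[:2], data[2:]], prototypes, coords, sigma=0.1))
```

The point is that each worker only ships one small (numerator, denominator) pair per iteration, instead of one update per row, and the result does not depend on how the rows are partitioned.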
I hope this answers your question. Concerning performance, since you asked for an efficient implementation: I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.