I implemented a regression model with interactions using the Julia GLM package:
Reg = lm(@formula(dep_var ~ var1 & var2 & var3), data, true)
Fitting this formula requires a lot of RAM (> 80 GB), and I noticed that the calculations are performed on one core, although my machine (x86_64-pc-linux-gnu) has 8 CPU cores.
Is it possible to implement linear regression using a multiprocessing/parallel approach?
I suppose it could also improve the model's runtime.
Fitting a regression model basically comes down to a lot of matrix operations. By default Julia uses BLAS for these, and the easiest thing you can do is try to configure it to be multi-threaded. This requires running Julia in a multi-threaded setting and then calling BLAS.set_num_threads().
Before starting Julia, set the number of threads. On Windows run:
set JULIA_NUM_THREADS=4
or on Linux:
export JULIA_NUM_THREADS=4
Once Julia has started, run:
using LinearAlgebra
BLAS.set_num_threads(4)
You should observe increased performance of your linear regression models.
I want to use MLPRegressor from sklearn with all 12 cores available to me; however, I do not see any option to select the number of cores (such as the n_jobs option in RandomForestClassifier).
Is there another way to make sure it uses all 12 cores? I have vaguely heard about joblib, but how would I use it correctly?
MLPRegressor does not contain any multithreading per se, though the matrix operations are vectorized and parallelized via numpy.
You may be able to get better performance by varying your batch size, but if performance is critical you should use a deep learning library such as TensorFlow.
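As a hedged illustration (not part of the original answer): the parallelism you do get comes from the BLAS/OpenMP backend behind numpy, which is typically controlled through environment variables, and batch_size is the main MLPRegressor knob worth tuning. The thread count and hyperparameter values below are placeholder assumptions.
# Sketch: size the thread pool numpy's BLAS backend uses, then tune batch_size.
import os
os.environ["OMP_NUM_THREADS"] = "12"  # must be set before numpy/sklearn are imported

from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(100,), batch_size=512, max_iter=200)
# model.fit(X_train, y_train)  # X_train, y_train: your training data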
I built my own model with the Keras premade models in TensorFlow Lattice using Python 3.7 and saved the trained model. However, when I use the trained model for prediction, it takes on the order of milliseconds per data point, which seems very slow. Is there any way to speed up prediction for TFL?
There are multiple ways to improve speed, but they may involve a tradeoff with prediction accuracy. I think the three most promising options are:
Reduce the number of features
Reduce the number of lattices per feature
Use an ensemble of lattice models where every lattice model only gets a subset of the features, and then average the predictions of the different models (as described here; a sketch follows below)
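For illustration, a minimal sketch of that ensemble option using the tfl.premade CalibratedLatticeEnsemble API; the feature names, keypoints, and ensemble parameters below are placeholder assumptions, not values from the question.
# Hedged sketch: split the features across several small lattices and average
# their outputs instead of building one large lattice over all features.
import numpy as np
import tensorflow_lattice as tfl

# Placeholder feature configs; replace names, sizes and keypoints with your own.
feature_configs = [
    tfl.configs.FeatureConfig(
        name=f"feature_{i}",
        lattice_size=2,
        pwl_calibration_num_keypoints=5,
        pwl_calibration_input_keypoints=np.linspace(0.0, 1.0, 5),
    )
    for i in range(10)
]

model_config = tfl.configs.CalibratedLatticeEnsembleConfig(
    feature_configs=feature_configs,
    num_lattices=5,   # number of small lattices in the ensemble
    lattice_rank=3,   # how many features each lattice sees
)
model = tfl.premade.CalibratedLatticeEnsemble(model_config)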
As the lattice model is a standard Keras model, I recommend trying OpenVINO. It optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input (input_image is your preprocessed input)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
The docs (see also this) for autocast in PyTorch only discuss training. Does it speed things up if I also use autocast for inference?
Yes, it can (though it may not in every case):
You are processing data at lower precision (e.g. float16 vs float32).
Your program therefore has to read and process less data.
This may help with cache locality and hardware-specific acceleration (e.g. Tensor Cores when using CUDA).
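As a minimal sketch (the model and input below are placeholders, not from the question), autocast at inference time is just the context manager wrapped around the forward pass, typically together with inference_mode:
# Hedged sketch: autocast during inference; model and input are placeholders.
import torch

model = torch.nn.Linear(128, 10).cuda().eval()   # placeholder model
x = torch.randn(32, 128, device="cuda")          # placeholder input batch

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)  # forward pass runs in mixed precision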
I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to variational GP. This github issue explains how to execute multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation for a cluster or the Cloud?
In terms of computation, GPflow can do whatever TensorFlow does; in other words, if TensorFlow supported cloud evaluation, GPflow would support it as well. That doesn't mean you cannot implement your own, perhaps more efficient, version of the TensorFlow computation and run it on the cloud. You can start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
Linear-algebra operations such as the Cholesky decomposition are hard to parallelise, and the time savings would be questionable, although memory-wise the advantage of cluster computing is obvious.
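To illustrate the point that GPflow inherits TensorFlow's execution controls, here is a hedged sketch; the data, kernel, thread count, and device string are placeholder assumptions, not a distributed setup:
# Hedged sketch: a GPR model runs on whatever TensorFlow is configured to use,
# so standard TF threading/device controls apply. Data and kernel are placeholders.
import numpy as np
import tensorflow as tf
import gpflow

tf.config.threading.set_intra_op_parallelism_threads(8)  # CPU threads for linalg ops

X = np.random.rand(200, 3)   # placeholder: 2D space + 1D time inputs
Y = np.random.rand(200, 1)
kernel = gpflow.kernels.Matern32(lengthscales=[1.0, 1.0, 1.0])

with tf.device("/CPU:0"):     # or "/GPU:0" if one is available
    model = gpflow.models.GPR((X, Y), kernel=kernel)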
If you're interested in MVM-based inference, we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic Lanczos quadrature for the logdet, and preconditioned CG for the solve, but so far have not committed those into TFP.
I have been looking for a maximum entropy classification implementation which can deal with an output size of 500 classes and 1000 features. My training data has around 30,000,000 lines.
I have tried MegaM, the 64-bit R maxent package, and the maxent tool from the University of Edinburgh, but as expected, none of them can handle the size of the data. However, a data set of this size doesn't seem too unusual for NLP tasks of this nature.
Are there any techniques that I should be employing? Or any suggestion for a toolkit which I may use?
I am trying to run this on a 64-bit Windows machine with 8 GB of RAM, using Cygwin where required.
Vowpal Wabbit is currently regarded as the fastest large-scale learner. LibLinear is an alternative, but I'm not sure whether it can handle matrices of 3e10 elements (30,000,000 lines × 1000 features).
Note that the term "MaxEnt" is used almost exclusively by NLP people; machine learning folks call it logistic regression or logit, so if you search for that you might find many more tools than when you search for MaxEnt.
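As a hedged illustration of that terminology point (not from the answer, and assuming a recent scikit-learn): treating the problem as multinomial logistic regression lets you use an out-of-core learner that streams the data in chunks; the dummy chunk generator below is a placeholder for a real data pipeline.
# Hedged sketch: MaxEnt is multinomial logistic regression, so an out-of-core
# learner can stream the 30M lines instead of loading everything into 8 GB of RAM.
import numpy as np
from sklearn.linear_model import SGDClassifier

N_CLASSES, N_FEATURES = 500, 1000
classes = np.arange(N_CLASSES)
clf = SGDClassifier(loss="log_loss")   # logistic-regression objective

def load_chunks(n_chunks=10, chunk_size=10_000):
    # Placeholder generator; replace with streaming reads of your real data.
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        X = rng.random((chunk_size, N_FEATURES))
        y = rng.integers(0, N_CLASSES, size=chunk_size)
        yield X, y

for X_chunk, y_chunk in load_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)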