sklearn and Tensorflow with dual CPU machine - scikit-learn

I am thinking about building a dual-CPU machine for machine learning. I already have a fast GPU in my current rig, but I am limited to 32GB of DDR3. I have an i7-4790K and am planning to upgrade to dual E5-2683 v3s.
I need CPU computing power for sklearn and grid search. Does sklearn work on two CPUs the same way it does on one? Will it use all the cores on both CPUs when n_jobs=-1?
Will TensorFlow only use one CPU when training on my GPU? If I just copied and pasted the MNIST for experts tutorial from the TF website, would it use both CPUs and my GPU without my specifying the devices?
I chose not to post this on the Super User forum because it is more about the software than the hardware.

From what I have read, even if you add an instruction like
    with tf.Session() as sess:
        with tf.device("/cpu:0"):
            ...
TensorFlow treats it as a recommendation and might use the GPU when it sees fit.
I guess it might use the other CPU as well.
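As far as I understand it, whether a tf.device block is a hard constraint or only a recommendation depends on the session option allow_soft_placement: with it switched off, TensorFlow raises an error if it cannot honor the request, and with it switched on, it may move the op to another device. The sketch below pins a tiny graph to /cpu:0 and logs where each op actually runs. It assumes TensorFlow 1.13+ or 2.x (it goes through tf.compat.v1), and run_pinned_to_cpu is a hypothetical helper name, not part of any API.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style API, also shipped with TF 2.x


def run_pinned_to_cpu(values):
    """Double `values` with every op explicitly placed on /cpu:0."""
    graph = tf1.Graph()
    with graph.as_default():
        with tf1.device("/cpu:0"):
            x = tf1.constant(values, dtype=tf.float32)
            y = x * 2.0
    config = tf1.ConfigProto(
        allow_soft_placement=False,  # fail rather than silently move ops
        log_device_placement=True,   # log the device chosen for each op
    )
    with tf1.Session(graph=graph, config=config) as sess:
        return sess.run(y).tolist()


doubled = run_pinned_to_cpu([1.0, 2.0, 3.0])
```

The placement log is the quickest way to check what a copied tutorial really does on a given machine, instead of guessing.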

Related

Clearing memory when training Machine Learning models with Tensorflow 1.15 on GPU

I am training a fairly intensive ML model on a GPU. Often I start training, let it run for a couple of epochs, and notice that my changes have not made a significant difference in the loss/accuracy. I then make edits, re-initialize the model, and restart training from epoch 0. In this case, I often get OOM errors.
My guess is that, despite my overwriting all the model variables, something is still taking up space in memory.
Is there a way to clear the memory of the GPU in Tensorflow 1.15 so that I don't have to keep restarting the kernel each time I want to start training from scratch?
It depends on exactly which GPUs you're using. I'm assuming you're using NVIDIA, but even then, depending on the exact GPU, there are three ways to do this:
nvidia-smi -r works on Tesla and other modern variants.
nvidia-smi --gpu-reset works on a variety of older GPUs.
Rebooting is the only option for the rest, unfortunately.
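If resetting the GPU is not practical, a different workaround (not one of the three options above) is process isolation: TensorFlow 1.x generally keeps its GPU memory until the owning process exits, so running each training attempt in a fresh Python process hands the memory back when that process ends, without restarting the notebook kernel. A minimal sketch using only the standard library follows. run_in_fresh_process and TRAINING_CODE are hypothetical names, and the one-line "training" is a stand-in for a real script.

```python
import subprocess
import sys


def run_in_fresh_process(python_code, timeout=None):
    """Run `python_code` in a new interpreter and return its exit code.

    Any GPU memory the child allocates is released when the child exits.
    """
    completed = subprocess.run(
        [sys.executable, "-c", python_code],
        timeout=timeout,
    )
    return completed.returncode


# Stand-in for real training code (assumption: yours would build and fit a model).
TRAINING_CODE = "print('training run finished')"
exit_code = run_in_fresh_process(TRAINING_CODE)
```

The parent process (for example the notebook) should not import TensorFlow itself, otherwise it holds GPU memory of its own.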

AWS, Cuda, Tensorflow

Why is my Python code only 2-3 times faster on the most powerful AWS GPU instances (with 1 or 8 x Tesla V100 16GB, i.e. p3.2xlarge or p3.16xlarge) than on my Dell XPS laptop with a GeForce 1050 Ti?
I'm using Windows, Keras, CUDA 9, TensorFlow 1.12 and the newest NVIDIA drivers.
When I check the GPU load via GPU-Z, the GPU runs at a maximum of 43% load, and only for a very short period each time. The controller runs at up to 100%...
The dataset I use consists of matrices in JSON format, and the files are located on a 10TB Nitro drive with a maximum of 64,000 IOPS. No matter whether the folder contains 10TB, 1TB or 100MB, training is still very slow per iteration.
All advice is more than welcome!
UPDATE 1:
From the Tensorflow docs:
"To start an input pipeline, you must define a source. For example, to construct a Dataset from some tensors in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data are on disk in the recommended TFRecord format, you can construct a tf.data.TFRecordDataset."
Previously I had the matrices stored in JSON format (made by Node). My TF code runs in Python.
I will now save only the coordinates in Node, in JSON format.
The question now is: what is the best way to load the data in Python? Can TF use the coordinates directly, or do I have to convert the coordinates back to matrices?
The performance of any machine learning model depends on many things, including but not limited to: how much pre-processing you do, how much data you copy from CPU to GPU, op bottlenecks, and many more. Check out the TensorFlow performance guide as a first step. There are also a few videos from the TensorFlow Dev Summit 2018 that talk about performance; the ones on how to properly use tf.data and on how to debug performance are two that I recommend.
The only thing I can say for sure is that JSON is a bad format for this purpose. You should switch to the TFRecord format, which uses protobuf (better than JSON).
Unfortunately, performance tuning and optimisation of any system take a lot of effort and time, and can be a rabbit hole that just keeps going down.
First off, you should have a really good reason to accept the increased computational overhead of a Windows-based AMI.
If your CPU is at ~100% while your GPU is below 100%, then your CPU is likely the bottleneck. If you are in the cloud, consider moving to instances with a larger CPU count (CPU is cheap, GPU is scarce). If you can't increase the CPU count, moving some parts of your graph to the GPU is an option. However, a tf.data-based input pipeline runs entirely on the CPU (but is highly scalable thanks to its C++ implementation). Prefetching to the GPU might also help here, but the cost of spawning another background thread to populate the buffer for downstream might dampen this effect. Another option is to do some or all pre-processing steps offline (i.e. prior to training).
A word of caution on using Keras as the input pipeline. Keras relies on Python's multithreading (and optionally multiprocessing) libraries, which may lack both performance (when doing heavy I/O or on-the-fly augmentation) and scalability (when running on multiple CPUs) compared to GIL-free implementations. Consider performing preprocessing offline, pre-loading input data, or using alternative input pipelines (such as the aforementioned TF-native tf.data, or third-party ones like Tensorpack).
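To make the TFRecord and tf.data suggestions above concrete, here is a minimal sketch that converts dense matrices to a TFRecord file once, offline, and then reads them back through a tf.data pipeline with parallel parsing and prefetching. It assumes TensorFlow 1.15 or 2.x and 2-D float matrices of identical shape. The feature names ("matrix", "shape", "label"), the helper names and the parallel_calls default are illustrative choices, not anything prescribed by TensorFlow.

```python
import numpy as np
import tensorflow as tf


def matrix_to_example(matrix, label):
    """Pack one 2-D float matrix and its integer label into a tf.train.Example."""
    matrix = np.asarray(matrix, dtype=np.float32)
    feature = {
        "matrix": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[matrix.tobytes()])),
        "shape": tf.train.Feature(
            int64_list=tf.train.Int64List(value=list(matrix.shape))),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_tfrecord(path, matrices, labels):
    """One-off offline conversion, e.g. from matrices decoded out of JSON."""
    with tf.io.TFRecordWriter(path) as writer:
        for matrix, label in zip(matrices, labels):
            writer.write(matrix_to_example(matrix, label).SerializeToString())


def parse_record(serialized):
    """Decode one serialized Example back into (matrix, label) tensors."""
    spec = {
        "matrix": tf.io.FixedLenFeature([], tf.string),
        "shape": tf.io.FixedLenFeature([2], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    flat = tf.io.decode_raw(parsed["matrix"], tf.float32)
    return tf.reshape(flat, parsed["shape"]), parsed["label"]


def make_dataset(path, batch_size, parallel_calls=4):
    """Parse records on several CPU threads and keep one batch prefetched."""
    dataset = tf.data.TFRecordDataset(path)
    dataset = dataset.map(parse_record, num_parallel_calls=parallel_calls)
    return dataset.batch(batch_size).prefetch(1)
```

The point of the design is that the expensive JSON decoding happens once in write_tfrecord, so the training loop only pays for cheap binary parsing.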

Is training in tensorflow multithreaded?

Suppose I'm trying to train the following network for CIFAR-10:
https://www.tensorflow.org/tutorials/images/deep_cnn
I would like to know whether the basic operations involved in stochastic gradient descent (or some other optimization technique), such as computing the gradient, are multithreaded.
More precisely, if I run the above code on a single-core machine and on a many-core machine like an Intel Xeon Phi, will it run faster on the many-core machine? (One can assume that a single core is similar on both machines.) If yes, what exactly causes the speed-up, or which computations run faster on the many-core machine?
There are three kinds of parallelism in TensorFlow:
Input pipeline: input, preprocessing and the network run in different threads. This can be achieved with tf.train.QueueRunner, or with tf.data.Dataset in newer versions of TensorFlow.
Inter-op parallelism: if one node is independent of another, the two can be executed at the same time. So this depends on the structure of your network.
Intra-op parallelism: a single node can use multiple threads; for example, ops implemented with Eigen can run multithreaded within one node.
BTW, the inter- and intra-op parallelism thread counts are set in the Session config, and most CPU ops can benefit from them. However, training on a GPU is highly recommended because of its much higher speed.
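The two thread-pool settings mentioned above can be sketched as follows. This is a minimal illustration, assuming TensorFlow 1.13+ or 2.x (via tf.compat.v1). The choice of one intra-op thread per logical core and two inter-op threads is only an example, and make_threaded_config is a hypothetical helper name.

```python
import os

import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style API, also shipped with TF 2.x


def make_threaded_config(intra_threads, inter_threads):
    """Session config with explicit thread-pool sizes."""
    return tf1.ConfigProto(
        intra_op_parallelism_threads=intra_threads,  # threads inside one op
        inter_op_parallelism_threads=inter_threads,  # independent ops at once
    )


logical_cores = os.cpu_count() or 1
config = make_threaded_config(intra_threads=logical_cores, inter_threads=2)

graph = tf1.Graph()
with graph.as_default():
    a = tf1.ones([64, 64])
    b = tf1.ones([64, 64]) * 2.0
    # These two matmuls do not depend on each other, so the runtime is free
    # to schedule them concurrently (inter-op); each matmul may additionally
    # be split across threads internally (intra-op).
    left = tf1.matmul(a, b)
    right = tf1.matmul(b, a)
    total = tf1.reduce_sum(left + right)

with tf1.Session(graph=graph, config=config) as sess:
    value = sess.run(total)
```

Leaving either setting at 0 lets TensorFlow pick a value itself, so explicit numbers mainly matter when you want to cap or compare thread usage.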

Tensorflow. How to distribute ops between GPUs

I am running a very large Tensorflow model on google cloud ml-engine.
When using the scale tier basic_gpu (with batch_size=1) I get errors like:
Resource exhausted: OOM when allocating tensor with shape[1,155,240,240,16]
because the model is too large to fit in one GPU.
Using the tier complex_model_m_gpu, which provides 4 GPUs, I can spread the operations between the 4 GPUs.
However, I remember reading that communication between GPUs is slow and can create a bottleneck in training. Is this true?
If so, is there a recommended way of spreading operations between the GPUs that prevents this problem?
I recommend the following guide:
Optimizing for GPU
From the guide:
The best approach to handling variable updates depends on the model,
hardware, and even how the hardware has been configured.
A few suggestions based on the guide:
Try using P100s which have 16 GB of RAM (compared to 12 on the K80s). They are also significantly faster, although they also cost more
Place the variables on CPU: tf.train.replica_device_setter(worker_device=worker, ps_device='/cpu:0', ps_tasks=1)
Using Tesla P100 GPUs instead of Tesla K80 GPUs fixes this issue because P100s have something called Page Migration Engine.
Page Migration Engine frees developers to focus more on tuning for
computing performance and less on managing data movement. Applications
can now scale beyond the GPU's physical memory size to virtually
limitless amount of memory.
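For the question of how to spread ops between GPUs by hand, one simple pattern is to give each GPU a contiguous slice of the layers, so that only the activations at the slice boundaries have to cross between devices. The sketch below is a toy version of that idea, not the guide's recommendation: it assumes TensorFlow 1.13+ or 2.x (via tf.compat.v1), uses all-ones weights purely so the result is easy to check, and build_split_model is a hypothetical helper name. allow_soft_placement lets the same script fall back to whatever devices exist on a machine with fewer GPUs.

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style API, also shipped with TF 2.x


def build_split_model(devices, width=8):
    """Chain one matmul layer per device; activations hop between devices."""
    graph = tf1.Graph()
    with graph.as_default():
        inputs = tf1.placeholder(tf.float32, shape=[None, width])
        hidden = inputs
        for device in devices:
            # Both the weights and the matmul of this layer live on `device`.
            with tf1.device(device):
                weights = tf1.Variable(tf1.ones([width, width]))
                hidden = tf1.matmul(hidden, weights)
        init = tf1.global_variables_initializer()
    return graph, inputs, hidden, init


graph, inputs, outputs, init = build_split_model(["/gpu:0", "/gpu:1"])

# Soft placement: ops requested on a missing GPU are moved to an available device.
config = tf1.ConfigProto(allow_soft_placement=True)
with tf1.Session(graph=graph, config=config) as sess:
    sess.run(init)
    result = sess.run(outputs, feed_dict={inputs: [[1.0] * 8]})
```

Keeping the number of device boundaries small is what limits the inter-GPU traffic the question is worried about.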

Running tensorflow on top of Keras with CPU on Windows 10

I have installed TensorFlow and Keras with an Anaconda installation on Windows 10. I'm using an Intel i7 processor. I'm trying to do predictive analytics with an LSTM RNN on data from a CSV file, and it takes 40 minutes to train on 4000 samples.
Is this the expected training time on a CPU? Can we make it faster on the CPU, or by switching to a GPU?
Yes, this does seem like a reasonable amount of time for your code to run when you're training using only a CPU. If you used an NVIDIA GPU, it would run much faster.
However, you might not be using every core on your CPU; if you were, it might run faster. You can change the number of threads that TensorFlow uses by running
sess = tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
If you set the number of threads equal to those provided by your CPU, it should run faster.
