I am trying to troubleshoot my non-working Apache Spark and netlib setup, and I don't know what to do next.
Here is some info:
Spark 1.3.1 (but also tried 1.5.1)
Mesos Cluster with 3 Nodes
Ubuntu Trusty on every node, with the following BLAS package installed:
$ dpkg -l | grep 'blas\|atlas\|lapack'
ii libopenblas-base 0.2.8-6ubuntu1 amd64 Optimized BLAS (linear algebra) library based on GotoBLAS2
$ update-alternatives --get-selections | grep 'blas\|lapack'
libblas.so.3 auto /usr/lib/openblas-base/libblas.so.3
I have built a sample jar to test whether netlib-java can detect these libraries, with the following code:
object Main extends App {
  println(com.github.fommil.netlib.BLAS.getInstance().getClass().getName())
  println(com.github.fommil.netlib.LAPACK.getInstance().getClass().getName())
}
When I execute this code I get the following response:
$ java -jar artifacts/BLAStest-assembly-1.0.jar
Mar 29, 2016 3:43:33 PM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Mar 29, 2016 3:43:33 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /tmp/jniloader6790966128222263615netlib-native_ref-linux-x86_64.so
com.github.fommil.netlib.NativeRefBLAS
Mar 29, 2016 3:43:33 PM com.github.fommil.netlib.LAPACK <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
Mar 29, 2016 3:43:33 PM com.github.fommil.jni.JniLoader load
INFO: already loaded netlib-native_ref-linux-x86_64.so
com.github.fommil.netlib.NativeRefLAPACK
So it seems to work just fine here.
But Spark can't detect the libraries. I have added this Java dependency to my assembly jar:
com.github.fommil.netlib:all:1.1.2
It also doesn't work if I try to start a Spark shell with this package:
spark-shell --packages com.github.fommil.netlib:all:1.1.2
It looks like your netlib-java implementation is loading the NativeRefBLAS, but not the NativeSystemBLAS. This means that your inclusion of "com.github.fommil.netlib:all" is working, since without it you would be using the non-native F2J implementation. The issue is that you want to use the system-provided BLAS (OpenBLAS) instead of the reference implementation that ships with netlib-java. This is probably just a matter of getting the right shared libraries into a location that is visible to your Spark executors.
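If the libraries are installed on every node, a minimal sketch of one way to do that (using the standard spark.driver.extraLibraryPath / spark.executor.extraLibraryPath properties; the OpenBLAS directory below is taken from your update-alternatives output and may differ on your nodes):
spark-shell --packages com.github.fommil.netlib:all:1.1.2 \
  --conf spark.driver.extraLibraryPath=/usr/lib/openblas-base \
  --conf spark.executor.extraLibraryPath=/usr/lib/openblas-base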
You said that you linked libblas.so.3, but as described in the netlib-java README, you also need to configure libblas.so, liblapack.so, and liblapack.so.3:
sudo apt-get install libatlas3-base libopenblas-base
sudo update-alternatives --config libblas.so
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so
sudo update-alternatives --config liblapack.so.3
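As a sanity check (a minimal sketch; the exact alternative names and paths depend on your distribution), confirm where each alternative now points:
# every blas/lapack entry should now resolve under /usr/lib/openblas-base
update-alternatives --get-selections | grep 'blas\|lapack'
ls -l /etc/alternatives/libblas.so /etc/alternatives/liblapack.so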
I installed TensorFlow 1.6.0 (GPU version) with Anaconda in a Python 3.6.4 environment.
When I do import tensorflow as tf, I get the following error:
ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory
The different versions:
cudnn : 7.1.1
cuda : 9.0.176
tensorflow : 1.6.0
Ubuntu : 16.04
I am aware of this but it did not solve my problem.
The accepted answer there is wrong (installing nvidia-cuda-toolkit). By installing the toolkit you are basically installing a second CUDA on top of the CUDA already installed from the NVIDIA guide.
The problem turned out to be an issue with symbolic links. The inspiration came from this topic: http://queirozf.com/entries/installing-cuda-tk-and-tensorflow-on-a-clean-ubuntu-16-04-install, but the actual resolution is different.
At one point during the cuDNN installation, the NVIDIA tutorial asks you to do this:
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
The problem with this approach is that copying the files matched by libcudnn* breaks the symbolic links among the copied files. Instead, I suggest running the following command, but it will still break the links:
sudo cp --preserve=links cuda/lib64/libcudnn* /usr/local/cuda/lib64
You can verify the links by running ls -lha libcudnn* in the /usr/local/cuda/lib64 folder. If you do not see output like this:
lrwxrwxrwx 1 root root 13 May 2 20:02 libcudnn.so -> libcudnn.so.7
lrwxrwxrwx 1 root root 17 May 2 20:02 libcudnn.so.7 -> libcudnn.so.7.6.5
-rwxr-xr-x 1 root root 409M May 2 20:02 libcudnn.so.7.6.5
-rw-r--r-- 1 root root 386M May 2 20:02 libcudnn_static.a
then you have just found the problem. The actual solution involves the following:
sudo rm /usr/local/cuda/lib64/libcudnn.so
sudo rm /usr/local/cuda/lib64/libcudnn.so.7
cd /usr/local/cuda/lib64/
sudo ln -s libcudnn.so.7.6.5 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
Remove the old "links" and create new ones. Verify the links again with ls -lha libcudnn*. After that, run the following command in verbose mode:
sudo ldconfig -v
Check the logs. ldconfig rebuilds the shared-library cache used by the dynamic linker, and it turned out to be very important here. If the log says that a symbolic link is broken, or something along those lines, then TensorFlow will continue to show the error mentioned in the subject.
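A hedged one-liner to pull just the relevant lines out of that verbose output:
# show only the cuDNN entries and any symlink complaints
sudo ldconfig -v 2>&1 | grep -i 'cudnn\|not a symbolic link'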
Bonus: make sure you have the following paths appended as the last lines of ~/.bashrc (nano ~/.bashrc):
export PATH=/usr/local/cuda/bin:/opt/nvidia/nsight-compute/2019.4.0${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDADIR=/usr/local/cuda${CUDADIR:+:${CUDADIR}}
export CUDA_HOME=/usr/local/cuda
and then run the command source ~/.bashrc
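To confirm the new environment took effect (a minimal check, not part of the original NVIDIA guide):
echo "$LD_LIBRARY_PATH"
# libcudnn.so.7 should appear here once the loader can see /usr/local/cuda/lib64
ldconfig -p | grep libcudnn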
All of the above steps assume that you did NOT use the nvidia-cuda-toolkit package, but instead used the NVIDIA CUDA repo.
Also, when installing CUDA, make sure you are not targeting 10.2. At the moment of writing, TF supports versions only up to CUDA 10.1, so the following is the right way to install the necessary version:
sudo apt-cache policy cuda
sudo apt-get install cuda=10.1.243-1
Verifications by:
nvcc --version
nvidia-smi
EDIT: I found the error that you should AVOID seeing after running the ldconfig command:
/usr/local/cuda-10.1/targets/x86_64-linux/lib:
...
libnppist.so.10 -> libnppist.so.10.2.0.243
libcuinj64.so.10.1 -> libcuinj64.so.10.1.243
/sbin/ldconfig.real: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7 is not a symbolic link
libcudnn.so.7 -> libcudnn.so.7.6.5
libnppc.so.10 -> libnppc.so.10.2.0.243
libnppicom.so.10 -> libnppicom.so.10.2.0.243
libnvgraph.so.10 -> libnvgraph.so.10.1.243
/usr/lib/x86_64-linux-gnu/libfakeroot:
...
If you see it, then something is still misconfigured.
I don't have enough reputation to comment on Alex's answer, but note that on Ubuntu 20.04 the paths have changed! Also, there is no need for --preserve=links now when doing the cp. So I should probably post a new answer:
Install cuDNN library 7.6 for TensorFlow 2.3.1 with CUDA 10.1, in an environment created by conda create --name tfgpu10.1 python=3.8:
Go to https://developer.nvidia.com/cuDNN
Download "cuDNN Library for Linux" in "Download cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.1"
Extract using tar -xvzf cudnn-10.1-linux-x64-v7.6.5.32.tgz
"Install" files:
sudo cp cuda/include/cudnn.h /usr/lib/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/lib/cuda/lib64/
Set permission:
sudo chmod a+r /usr/lib/cuda/include/cudnn.h /usr/lib/cuda/lib64/libcudnn*
Output of testing:
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-12-02 03:58:41.089993: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.config.list_physical_devices("GPU")
2020-12-02 03:58:48.538295: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-12-02 03:58:48.587523: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-02 03:58:48.587838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2020-12-02 03:58:48.587860: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-12-02 03:58:48.589111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-12-02 03:58:48.590284: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-12-02 03:58:48.590488: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-12-02 03:58:48.591785: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-12-02 03:58:48.592520: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-12-02 03:58:48.595129: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-12-02 03:58:48.595213: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-02 03:58:48.595555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-02 03:58:48.595815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I installed the nvidia-cuda-toolkit package:
$ sudo apt install nvidia-cuda-toolkit
and it worked.
I found the solution neither on the TensorFlow website nor on the NVIDIA installation page. I found it by luck while looking for a way to get the CUDA version from the command line: How to get the cuda version?
This didn't work for me. In my case it was because I had multiple versions of CUDA installed, and the cuDNN version I had was for an older version than the one I was trying to use, so I installed the cuDNN for the new version following NVIDIA's instructions, and that did it for me.
I am getting the following Warning when I run the PySpark job:
17/10/06 18:27:16 WARN ARPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK
17/10/06 18:27:16 WARN ARPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefARPACK
My code is:
from pyspark.mllib.linalg.distributed import RowMatrix

mat = RowMatrix(tf_rdd_vec.cache())
svd = mat.computeSVD(num_topics, computeU=False)
I am using an Ubuntu 16.04 EC2 instance, and I have installed all of the following libraries on my system:
sudo apt install libarpack2 Arpack++ libatlas-base-dev liblapacke-dev libblas-dev gfortran libblas-dev liblapack-dev libnetlib-java libgfortran3 libatlas3-base libopenblas-base
I have adjusted LD_LIBRARY_PATH to point to the shared library path as follows:
export LD_LIBRARY_PATH=/usr/lib/
Now when I list the $LD_LIBRARY_PATH directory, it shows me the following .so files:
ubuntu:~$ ls $LD_LIBRARY_PATH/*.so | grep "pack\|blas"
/usr/lib/libarpack.so
/usr/lib/libblas.so
/usr/lib/libcblas.so
/usr/lib/libf77blas.so
/usr/lib/liblapack_atlas.so
/usr/lib/liblapacke.so
/usr/lib/liblapack.so
/usr/lib/libopenblasp-r0.2.18.so
/usr/lib/libopenblas.so
/usr/lib/libparpack.so
But I am still not able to use the native ARPACK implementation. I am also caching the RDD that is passed to the matrix, but it still throws a cache warning. Any suggestions on how to solve these 3 warnings?
I downloaded the compiled version of spark-2.2.0 from the Spark download page.
After exploring, I was able to remove these warnings and use native ARPACK in the following way.
The solution was to rebuild Spark with the -Pnetlib-lgpl argument.
Build Spark for Native Support
The following are my steps on Ubuntu 16.04:
# Make sure you use the correct download link, from spark download section
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0.tgz
tar -xpf spark-2.2.0.tgz
cd spark-2.2.0/
./dev/make-distribution.sh --name custom-spark --pip --tgz -Psparkr -Phadoop-2.7 -Pnetlib-lgpl
When I ran it the first time, it failed with the following error:
Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly installed.
[ERROR] Command execution failed.
[TRUNCATED]
[INFO] BUILD FAILURE
[INFO] Total time: 02:38 min (Wall Clock)
[INFO] Finished at: 2017-10-13T21:04:11+00:00
[INFO] Final Memory: 59M/843M
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.5.0:exec (sparkr-pkg) on project spark-core_2.11: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]
So I installed the R language:
sudo apt install r-base-core
Then I re-ran the above build command and it completed successfully.
The following are the relevant versions from when I built this release:
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ python --version
Python 2.7.12
$ R --version
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
$ make --version
GNU Make 4.1
Built for x86_64-pc-linux-gnu
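Finally, to confirm the LGPL netlib artifacts actually made it into the build (a hedged check; the exact jar names vary between Spark versions):
# make-distribution.sh assembles the distribution under dist/ in the source root
ls dist/jars/ | grep -i netlib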
I met this error when compiling a modified Caffe version:
OpenCV static library was compiled with CUDA 7.5 support. Please, use the same version or rebuild OpenCV with CUDA 8.0
I have some old code that may not be compatible with CUDA 8.0, so I want to change my CUDA version to get rid of this error.
I modified my ~/.bash_profile like this:
# export PYTHONPATH=$PYTHONPATH:/usr/local/cuda-8.0/lib64/
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64
export PYTHONPATH=$PYTHONPATH:/usr/local/cuda-7.5/targets/x86_64-linux/lib/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.5/targets/x86_64-linux/lib/
But it didn't work; I still get the same error. What should I do? Thanks.
Change your CUDA soft link to point to your desired CUDA version. For example:
ll /usr/local/cuda
lrwxrwxrwx 1 root root 19 Sep 06 2017 /usr/local/cuda -> /usr/local/cuda-8.0/
Simply relink it with:
ln -s /usr/local/cuda-7.5 /usr/local/cuda
(with the proper installation location)
Update: if the symlink already exists, use this other command:
[jalal@goku ~]$ ls /usr/local/cuda
lrwxrwxrwx. 1 root root 20 Sep 14 08:03 /usr/local/cuda -> /usr/local/cuda-10.2
[jalal@goku ~]$ sudo ln -sfT /usr/local/cuda/cuda-11.1/ /usr/local/cuda
[jalal@goku ~]$ ls /usr/local/cuda
lrwxrwxrwx. 1 root root 26 Sep 14 13:25 /usr/local/cuda -> /usr/local/cuda/cuda-11.1/
Perhaps cleaner:
sudo update-alternatives --display cuda
sudo update-alternatives --config cuda
Maybe a bit late, but I thought it might still be helpful for anyone who comes across this question. I wrote a simple bash script for switching to a different version of CUDA within the current bash session: https://github.com/phohenecker/switch-cuda
This solution explains how you can have multiple CUDA versions installed, e.g. 10.2, 11.3, and 11.6, and switch between them. It is an extension of @w.t's answer and makes use of update-alternatives.
AFAIK, since CUDA 11.x, CUDA installations on Ubuntu 20.04 are added to the update-alternatives maintenance automatically.
Let's say you installed CUDA 10.2, CUDA 11.3, and CUDA 11.6 (following the official NVIDIA installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html). They will all reside in:
/usr/local/cuda-10.2/...
/usr/local/cuda-11.3/...
/usr/local/cuda-11.6/...
Your update-alternatives will have two entries:
$ sudo update-alternatives --query cuda
...
/usr/local/cuda-11-3 - priority 113
/usr/local/cuda-11-6 - priority 116
Solution 1: If you want to make use of update-alternatives, make sure that your cuda symbolic link points to /etc/alternatives/cuda.
# Change the symbolic link target.
$ sudo ln -sfT /etc/alternatives/cuda /usr/local/cuda
# Check the path.
$ ll /usr/local/cuda
lrwxrwxrwx 1 root root /usr/local/cuda -> /etc/alternatives/cuda/
Now, all that is left is to make sure /etc/alternatives/cuda points to the version you want to use, e.g. 11.3.
You can make that update with:
$ sudo update-alternatives --config cuda
and follow the instructions to change the version.
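If you prefer a non-interactive switch, update-alternatives also accepts the target directly; note the path must exactly match one of the paths listed by --query above:
sudo update-alternatives --set cuda /usr/local/cuda-11-3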
Check the path:
$ ll /etc/alternatives/cuda
lrwxrwxrwx root root /etc/alternatives/cuda -> /usr/local/cuda-11.3
Almost done: always make sure to load the correct library paths in your ~/.bashrc.
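A minimal sketch of those ~/.bashrc lines: point them at the /usr/local/cuda symlink rather than at a versioned directory, so a switch via update-alternatives takes effect without editing the file again:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}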
Solution 2:
Directly set your /usr/local/cuda symbolic link to the correct version.
$ ln -sfT /usr/local/cuda-11.3 /usr/local/cuda
Reboot your machine and double-check that everything is set properly:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:14_PDT_2021
Cuda compilation tools, release 11.3 V11.3.109
Build cuda 11.3.r11.3/compiler.29920130_0
I finally solved the problem.
Modifying ~/.bash_profile to change the path to CUDA is the correct way. But when you change the file, you need to relaunch the shell.
Simply running source ~/.bash_profile won't work, because source only appends the content of the file to the already existing path variables rather than replacing them.
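A hedged way to see the problem and recover from it:
# after sourcing twice, both CUDA versions can end up on the path; the first match wins
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep cuda
# then open a fresh terminal (or log out and back in) so the profile is read into a clean environment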
I switched from CUDA 10.2 to 11.7. export PATH=/etc/alternatives/cuda/bin:$PATH did the trick for me. I found the solution in this thread: https://stackoverflow.com/a/40599478/7924573
I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the cluster.
Currently I'm just trying to get some basic examples up and running to validate that I've got everything configured properly. The problem I am having is that I'm not seeing the performance I expect from the Atlas BLAS libraries on the Amazon machine instance.
Below is a code snippet of my simple benchmark. It's just a square matrix multiply, followed by a short-fat multiply and a tall-thin multiply, to yield a small matrix that can be printed (I wanted to be sure Scala would not skip any part of the computation due to lazy evaluation).
I'm using Breeze for the linear algebra library and netlib-java to pull in the local native libraries for BLAS/LAPACK
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.SparkConf
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import scala.reflect.ClassTag
object App {
  def NaiveMultiplication(n: Int) : Unit = {
    val vl = java.text.NumberFormat.getIntegerInstance.format(n)
    println(f"Naive Multipication with vector length " + vl)
    println(blas.getClass().getName())
    val sm: DenseMatrix[Double] = DenseMatrix.rand(n, n)
    val a: DenseMatrix[Double] = DenseMatrix.rand(2, n)
    val b: DenseMatrix[Double] = DenseMatrix.rand(n, 3)
    val c: DenseMatrix[Double] = sm * sm
    val cNormal: DenseMatrix[Double] = (a * c) * b
    println(s"Dot product of a and b is \n$cNormal")
  }
}
Based on a web survey of benchmarks I'm expecting a 3000x3000 matrix multiply to take approx. 2-4s using a native, optimized BLAS library. When I run locally on my MacBook Air this benchmark completes in 1.8s. When I run this on EMR it completes in approx. 11s (using a g2.2xlarge instance, though similar results were obtained on a m3.xlarge instance). As another cross check I ran a prebuilt EC2 AMI from the BIDMach project on the same EC2 instance type, g2.2xlarge, and got 2.2s (note, the GPU benchmark for the same calculation yielded 0.047s).
At this point I suspect that netlib-java is not loading the correct lib, but this is where I am stuck. I've gone through the netlib-java README many times and it seems the ATLAS libs are already installed as required (see below)
[hadoop@ip-172-31-3-69 ~]$ ls /usr/lib64/atlas/
libatlas.a libcblas.a libclapack.so libf77blas.so liblapack.so libptcblas.so libptf77blas.so
libatlas.so libcblas.so libclapack.so.3 libf77blas.so.3 liblapack.so.3 libptcblas.so.3 libptf77blas.so.3
libatlas.so.3 libcblas.so.3 libclapack.so.3.0 libf77blas.so.3.0 liblapack.so.3.0 libptcblas.so.3.0 libptf77blas.so.3.0
libatlas.so.3.0 libcblas.so.3.0 libf77blas.a liblapack.a libptcblas.a libptf77blas.a
[hadoop@ip-172-31-3-69 ~]$ cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
[hadoop@ip-172-31-3-69 ~]$ ls /etc/ld.so.conf.d
atlas-x86_64.conf kernel-4.4.11-23.53.amzn1.x86_64.conf kernel-4.4.8-20.46.amzn1.x86_64.conf mysql55-x86_64.conf R-x86_64.conf
[hadoop@ip-172-31-3-69 ~]$ cat /etc/ld.so.conf.d/atlas-x86_64.conf
/usr/lib64/atlas
Below I've shown two examples of running the benchmark on the Amazon EMR instance. The first shows the case where the native system BLAS supposedly loads correctly. The second shows the case where the native BLAS does not load and the package falls back to the reference implementation. So it does appear to be loading a native BLAS, based on the messages and the timing. Compared to running locally on my Mac, the no-BLAS case runs in approximately the same time, but the native BLAS case runs in 1.8 s on my Mac versus 15 s in the case below. The info messages are the same on my Mac as on EMR (other than specific dir/file names, etc.).
[hadoop@ip-172-31-3-69 ~]$ spark-submit --class "com.cyberatomics.simplespark.App" --conf "spark.driver.extraClassPath=/home/hadoop/simplespark-0.0.1-SNAPSHOT-jar-with-dependencies.jar" --master local[4] simplespark-0.0.1-SNAPSHOT-jar-with-dependencies.jar 3000 naive
Naive Multipication with vector length 3,000
Jun 16, 2016 12:30:39 AM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /tmp/jniloader2856061049061057802netlib-native_system-linux-x86_64.so
com.github.fommil.netlib.NativeSystemBLAS
Dot product of a and b is
1.677332076284315E9 1.6768329748988206E9 1.692150656424957E9
1.6999000993276503E9 1.6993872020220244E9 1.7149145239563465E9
Elapsed run time: 15.1s
[hadoop@ip-172-31-3-69 ~]$
[hadoop@ip-172-31-3-69 ~]$ spark-submit --class "com.cyberatomics.simplespark.App" --master local[4] simplespark-0.0.1-SNAPSHOT-jar-with-dependencies.jar 3000 naive
Naive Multipication with vector length 3,000
Jun 16, 2016 12:31:32 AM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Jun 16, 2016 12:31:32 AM com.github.fommil.netlib.BLAS <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
com.github.fommil.netlib.F2jBLAS
Dot product of a and b is
1.6640545115052865E9 1.6814609592261212E9 1.7062846398842275E9
1.64471099826913E9 1.6619129531594608E9 1.6864479674870768E9
Elapsed run time: 28.7s
At this point my best guess is that it is actually loading a native lib, but a generic one. Any suggestions on how I can verify which shared library it is picking up at run time? I tried ldd, but that seems not to work with spark-submit. Or maybe my expectations for Atlas are wrong, but it seems hard to believe AWS would pre-install the libs if they weren't running at reasonably competitive speeds.
If you see that the libs are not linked up correctly on EMR, please provide guidance on what I need to do in order for the Atlas libs to get picked up by netlib-java.
Thanks,
tim
Follow-up:
My tentative conclusion is that the Atlas libs installed by default on the Amazon EMR instance are simply slow. Either it is a generic build that has not been optimized for the specific machine type, or it is fundamentally slower than other libraries. Using this script as a guide, I built and installed OpenBLAS for the specific machine type where I was running the benchmarks (I also found some helpful info here). Once OpenBLAS was installed, my 3000x3000 matrix multiply benchmark completed in 3.9 s (compared to the 15.1 s listed above when using the default Atlas libs). This is still slower than the same benchmark on my Mac (by a factor of 2x), but that difference falls in a range that could credibly be due to underlying h/w performance.
Here is a complete listing of the commands I used to install the OpenBLAS libs on Amazon's EMR Spark instance:
sudo yum install git
git clone https://github.com/xianyi/OpenBlas.git
cd OpenBlas/
make clean
make -j4
sudo mkdir /usr/lib64/OpenBLAS
sudo chmod o+w,g+w /usr/lib64/OpenBLAS/
make PREFIX=/usr/lib64/OpenBLAS install
sudo rm /etc/ld.so.conf.d/atlas-x86_64.conf
sudo ldconfig
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/libblas.so
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/libblas.so.3
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/libblas.so.3.5
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/liblapack.so
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/liblapack.so.3
sudo ln -sf /usr/lib64/OpenBLAS/lib/libopenblas.so /usr/lib64/liblapack.so.3.5
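A hedged verification that the links now resolve to OpenBLAS and that the loader cache picked them up:
ls -l /usr/lib64/libblas.so.3 /usr/lib64/liblapack.so.3
ldconfig -p | grep -i openblas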
I am trying to debug gcov code. I wrote a simple C program which calls the __gcov_flush() function, which is part of gcc/gcov.
After confirming that libgcov.a library has not been built with debug symbols, I have installed debuginfo packages for gcc on my machine (SLES 10).
# gcc -v
Using built-in specs.
Target: x86_64-suse-linux
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.1.2 --enable-ssp --disable-libssp --disable-libgcj --with-slibdir=/lib64 --with-system-zlib --enable-shared --enable-__cxa_atexit --enable-libstdcxx-allocator=new --program-suffix= --enable-version-specific-runtime-libs --without-system-libunwind --with-cpu=generic --host=x86_64-suse-linux
Thread model: posix
gcc version 4.1.2 20070115 (SUSE Linux)
# rpm -qi gcc-debuginfo-4.1.2_20070115-0.29.6.x86_64
Name : gcc-debuginfo Relocations: (not relocatable)
Version : 4.1.2_20070115 Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
Release : 0.29.6 Build Date: Sat Sep 5 03:04:50 2009
Install Date: Thu Apr 24 05:25:32 2014 Build Host: bingen
Group : Development/Debug Source RPM: gcc-4.1.2_20070115-0.29.6.src.rpm
Size : 251823743 License: GPL v2 or later
Signature : DSA/SHA1, Sat Sep 5 03:06:59 2009, Key ID a84edae89c800aca
Packager : http://bugs.opensuse.org
URL : http://gcc.gnu.org/
Summary : Debug information for package gcc
Description :
This package provides debug information for package gcc.
Debug information is useful when developing applications that use this
package or when debugging this package.
Distribution: SUSE Linux Enterprise 10
/usr/lib/debug/usr/bin # ls -lrt gcov.debug
-rw-r--r-- 1 root root 94216 Sep 5 2009 gcov.debug
However, even after installing the proper version of the debuginfo (gcov.debug) package, GDB still cannot recognize the line number information; it just passes control to the next line without reporting a line number (or stepping into the function).
(gdb)s
26 i++;
(gdb)s
27 __gcov_flush();
(gdb)s
28 printf("%d",i);
(gdb)
(gdb) show debug-file-directory
The directory where separate debug symbols are searched for is "/usr/lib/debug".
Why can GDB not identify line number information for gcov? And if I have not installed the proper version of the debuginfo packages for gcc/gcov, how can I confirm that?
After confirming that libgcov.a library has not been built with debug symbols, I have installed debuginfo packages
You don't appear to understand how debuginfo packages work. They can't magically add debuginfo to an archive library that was built without debug symbols (or one that was stripped).
The usual build flow is:
build everything with -g
prepare separate debuginfo packages for all fully-linked binaries (executables and shared libraries)
strip fully-linked binaries (but not archive libraries)
This allows binaries and shared libraries to be small, but still debuggable after installing the debuginfo package.
Apparently, on SLES10 the "but not archive libraries" part was not honored, and libgcov.a got stripped as well. Since separate debuginfo packages do not work for archive libraries, you can't get that info back. Your only option is to rebuild GCC from source.
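For what it's worth, you can check whether the archive carries debug info without guessing at paths (a sketch; gcc -print-file-name and objdump are standard tools):
# locate the archive the compiler actually links against
LIBGCOV=$(gcc -print-file-name=libgcov.a)
# no .debug_* sections among the members means no line-number info
objdump -h "$LIBGCOV" | grep -i debug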
P.S. Why would they strip libgcov.a?
It's a trade-off: binaries that end-users link will be smaller, but code in libgcov.a will not be debuggable.
Since most end-users never debug libgcov.a, I'd say it was not an unreasonable trade-off.