Installing Spark MLLib on Mac OS X - apache-spark

I am trying to install MLLib on Mac OS X. On Linux I just had to install gfortran by following this post (Apache Spark -- MlLib -- Collaborative filtering). I have gfortran installed on my Mac. However, when I run:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import SVMWithSGD
# `sc` is the SparkContext created automatically by the pyspark shell
data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0])
]
svm = SVMWithSGD.train(sc.parallelize(data))
I am getting:
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
I am not sure what steps to follow to install MLLib successfully on my Mac. I am running Mac OS 10.9 with Spark 1.1.0 (pre-built).

Installing Apache Spark implicitly installs MLlib. Try installing Homebrew, xcode-select, Java, Scala and Spark; refer to the linked guide for a step-by-step process.

MLlib is part of Apache Spark; you do not need to install it separately.
The warnings tell you that Spark cannot find a native BLAS implementation and is falling back to the pure-Java F2J one. The most likely reason is a Spark installation via brew or the tar.gz from spark.apache.org: both distributions are built without the compile flag needed to use vecLib.
To fix this, you can either supply the dependency com.github.fommil.netlib:all:1.1.2 or compile Spark from source with -Pnetlib-lgpl (see "Failed to load implementation NativeSystemBLAS HiBench" for a basic how-to, or read https://spark.apache.org/docs/latest/building-spark.html for more details).
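A hedged sketch of the first option from PySpark, assuming Spark 1.3 or later (--packages did not exist in 1.1.0) and that this particular artifact resolves cleanly in your environment:
import os
# ask spark-submit to fetch the netlib artifact when the JVM is launched;
# the trailing "pyspark-shell" token is required by pyspark's launcher
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages com.github.fommil.netlib:all:1.1.2 pyspark-shell"
from pyspark import SparkContext
sc = SparkContext("local[*]", "blas-check")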

I followed this article https://medium.freecodecamp.org/installing-scala-and-apache-spark-on-mac-os-837ae57d283f
Install Homebrew, then run:
xcode-select --install
brew cask install java
brew install scala
brew install apache-spark
You now have Spark 🎉. To run a Scala shell:
spark-shell
To run the Python shell:
pyspark
To run a Scala file (it must have a main method):
spark-submit file.scala
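As a quick smoke test, in the pyspark shell (where sc is created for you) you can run, for example:
# inside pyspark; `sc` already exists
print(sc.parallelize(range(100)).sum())  # should print 4950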

Related

Problem installing pyspark version 2.3

I have been trying to install pyspark 2.3 for the last couple of days, but so far I have only found versions 3.0.1 and 2.4.7. I am trying to run code implemented in pyspark 2.3 as part of my project. Is that version still available? If it is, please point me to the resources needed to install it, as porting the code to version 3.0.1 looks difficult to me.
Pyspark 2.3 should still be available via conda-forge.
Please check out https://anaconda.org/conda-forge/pyspark/files?version=2.3.2
There you will find the following packages (and more) for direct download:
linux-64/pyspark-2.3.2-py36_1000.tar.bz2
win-64/pyspark-2.3.2-py36_1000.tar.bz2
If you don't want the raw packages, you can also install it via conda:
conda install -c conda-forge pyspark=2.3.2
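You can then confirm that you actually got the 2.3 line rather than a newer release:
import pyspark
print(pyspark.__version__)  # expect '2.3.2'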

PyArrow >= 0.8.0 must be installed; however, it was not found

I am on the Cloudera platform and trying to use pandas UDFs in pyspark, but I am getting the error below.
PyArrow >= 0.8.0 must be installed; however, it was not found.
Installing pyarrow 0.8.0 on the platform will take time.
Is there any workaround to use pandas UDFs without installing pyarrow?
I can install it in my personal Anaconda environment; is it possible to export that conda environment and use it in pyspark?
No, you can't simply install it on your own machine and use it, because pyspark runs distributed across the cluster.
But you can pack your virtualenv and ship it to the pyspark workers, without installing custom packages like pyarrow on every machine of the platform.
To do this, follow the venv-pack package's instructions:
https://jcristharif.com/venv-pack/spark.html
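Once pyarrow is available inside the packed environment the workers use, a minimal Spark 2.x scalar pandas UDF along these lines should run (names are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-check").getOrCreate()
df = spark.range(0, 10).toDF("x")

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(x):
    # x arrives as a pandas.Series; pyarrow moves the data between the JVM and Python
    return (x + 1).astype("float64")

df.select(plus_one(df["x"])).show()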

Spark installation for production, pip install or not?

I would like to install PySpark 2.4.4. I have seen that I can download the Spark package or use pip install. I only need PySpark; are the two installations the same?
You can pip install pyspark, but that package doesn't come with the Hadoop binaries, which Spark needs to function properly.
The easiest way to install is by using findspark:
Download the .tgz file from the Spark website, which comes with the Hadoop binaries.
pip install findspark
In Python:
import findspark
findspark.init('/path/to/extracted/binaries/folder')
import pyspark
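After findspark.init() points PySpark at the extracted distribution, a quick check that the Hadoop-backed install works (a sketch; the app name is arbitrary):
from pyspark.sql import SparkSession

# start a local session backed by the extracted Spark + Hadoop binaries
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.version)  # e.g. 2.4.4
spark.stop()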

Jupyter + Apache toree - scala kernel is busy

I've installed Jupyter Notebook on Python 3.5.2 on Ubuntu Server 16.04.
I have also installed Apache Toree to run Spark jobs from Jupyter.
I run:
pip3 install toree
jupyter toree install --spark_home=/home/arik/spark-2.0.1-bin-hadoop2.7/ # My Spark directory
The output was a success:
[ToreeInstall] Installing Apache Toree version 0.1.0.dev8
[ToreeInstall] Apache Toree is an effort undergoing incubation at the
Apache Software Foundation (ASF), sponsored by the Apache Incubator
PMC.
Incubation is required of all newly accepted projects until a further
review indicates that the infrastructure, communications, and decision
making process have stabilized in a manner consistent with other
successful ASF projects.
While incubation status is not necessarily a reflection of the
completeness or stability of the code, it does indicate that the
project has yet to be fully endorsed by the ASF.
Additionally, this release is not fully compliant with Apache release
policy and includes a runtime dependency that is licensed as LGPL v3
(plus a static linking exception). This package is currently under an
effort to re-license (https://github.com/zeromq/jeromq/issues/327).
[ToreeInstall] Creating kernel Scala [ToreeInstall] Removing existing
kernelspec in /usr/local/share/jupyter/kernels/apache_toree_scala
[ToreeInstall] Installed kernelspec apache_toree_scala in
/usr/local/share/jupyter/kernels/apache_toree_scala
I thought everything was successful, but every time I create an Apache Toree notebook I see the following:
It says "Kernel busy" and all of my commands are ignored.
I couldn't find anything about this issue online.
Alternatives to toree would also be accepted.
Thank you
Toree unfortunately does not work with Scala 2.11. You can either downgrade to a Scala 2.10 build of Spark or use a more recent version of Toree (still in beta). This is how I made it work with Spark 2.1 and Scala 2.11:
#!/bin/bash
pip install -i https://pypi.anaconda.org/hyoon/simple toree
jupyter toree install --spark_home=$SPARK_HOME --user #will install scala + spark kernel
jupyter toree install --spark_home=$SPARK_HOME --interpreters=PySpark --user
jupyter kernelspec list
jupyter notebook #launch jupyter notebook
Look at this post and this post for more info.

Error installing h5py using pip

Environment: Spark service on IBM Bluemix.
!pip install --user h5py fails with a gcc error.
I even tried downloading the package and then running !python setup.py install
Here is an example notebook that illustrates how to:
Grab a copy of the native HDF5 library source and extract it
Configure, make, and install the native HDF5 libraries
Install h5py into a notebook within the Bluemix Spark service
Run some of the h5py quick-start guide sample code
Hope this helps.
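If the notebook is unavailable, the general idea is to build HDF5 into a prefix you own and point the h5py build at it. A hedged sketch (the ~/hdf5 prefix is just an example; HDF5_DIR is the environment variable the h5py build looks for):
import os, subprocess

# assumes the native HDF5 libraries were configured, made and installed under ~/hdf5
os.environ["HDF5_DIR"] = os.path.expanduser("~/hdf5")
# force a source build of h5py against that prefix, into the user site-packages
subprocess.check_call(["pip", "install", "--user", "--no-binary", "h5py", "h5py"])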
