Installing external packages in Spark on the spylon kernel - apache-spark

I'm looking for a way to install outside packages on the spylon kernel. I already tried initializing spark-shell with the --packages option inside spylon, but that just creates another instance. I also tried %%init_spark and launcher.packages, but that didn't work either. Is there any way to install an external package, from spark-packages for example?

You just need to move the jar to the $SPARK_HOME/jars folder. I wish I could download and use it from within the notebook, but this works for now.
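For reference, a minimal sketch of that workaround (the jar path is a placeholder for whatever artifact you downloaded from spark-packages):
cp /path/to/your-package.jar $SPARK_HOME/jars/   # everything in this folder ends up on the Spark classpath
# restart the spylon kernel afterwards so the new jar is picked up by the Spark session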

Related

Can't make action calls through an Anaconda py35 env in Spark on HDInsight

As per the documentation (https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-python-package-installation), we installed several external Python modules through a new Anaconda environment, 'py35_data_prof'. However, as soon as we invoke any RDD action call such as rdd.count() or rdd.avg() in our Python code, Spark 2 throws:
Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory
FYI, the Python indicated in the error path, '/usr/bin/anaconda/envs/py35_data_prof/bin/python', is actually a symlink rather than a Python directory.
I have been looking up the HDInsight docs but can't seem to find the fix. Please let us know if there is a way around it.
The error message “Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory” clearly says that Spark is unable to find the Python executable of that environment. Make sure the environment is set up with all of the requirements mentioned below.
• Create Python virtual environment using conda.
• Install external Python packages in the created virtual environment if needed.
• Change the Spark and Livy configs to point to the created virtual environment.
I would request you to follow each and every step mentioned here: “Safely install external Python packages”.
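For reference, a rough sketch of what those steps boil down to, assuming the default HDInsight Anaconda location and the env name from the question (the linked doc runs these through a script action so they apply to every node, and it lists the exact config keys to change):
sudo /usr/bin/anaconda/bin/conda create --prefix /usr/bin/anaconda/envs/py35_data_prof python=3.5 anaconda --yes
sudo /usr/bin/anaconda/envs/py35_data_prof/bin/pip install <your-packages>   # replace with the modules you need
# then, in Ambari, point the Spark and Livy configs at
# /usr/bin/anaconda/envs/py35_data_prof/bin/python and restart the affected services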
Hope this helps.

No start-history-server.sh when pyspark installed through conda

I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of spark. I'm not sure where that is supposed to be when spark is installed through conda, but I have searched through the entire miniconda directory and I can't seem to locate start-history-server.sh. For what it's worth, this is for both python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific; the sbin directory isn't included in pyspark at all.
The good news is that you can manually copy this folder from GitHub into your Spark folder (I'm not sure how to download just one directory, so I simply cloned all of Spark). If you're using Miniconda or Anaconda, the Spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
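For example, a rough sketch of that manual copy (paths assume the Miniconda layout above; adjust the env name and Python version to yours):
git clone https://github.com/apache/spark.git
cp -r spark/sbin ~/miniconda3/envs/myenv/lib/python3.7/site-packages/pyspark/
# note: the history server only shows jobs that were run with spark.eventLog.enabled=true
~/miniconda3/envs/myenv/lib/python3.7/site-packages/pyspark/sbin/start-history-server.sh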

Install an external library instance of libv8-3.14 to folder

I need libv8-3.14 to run some R packages on Linux, but I don't have root/sudo access on the Linux computer I'm using, so I'd like to install libv8-3.14 into an external folder. I've seen R packages reference this as external, e.g. CDFLAG="folder/v8-3.14", so I know it is possible.
I'm new(ish) to Linux, but I've installed external libraries before from tar.gz files that contain a configure script, setting the external folder with ./configure --prefix=/folder/loc. The only downloads of libv8 I can find, however, are .git repositories (which I can't get to work either).
How can I install libv8-3.14 into a folder of my own so that I can set:
export PATH=$PATH:/path/to/install/
and
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/install/
I had the exact same problem. In case somebody comes across this post in the future, I will leave my suggestions and how it worked out in the end. Also, all credit goes to an experienced colleague of mine.
The surest thing to do is to consult IT, or someone who has already had the same problem; there is usually a workaround for these issues.
A way you can do it yourself:
• Create an anaconda environment; you can name it 'V8' or something (make sure the environment is based on the latest Python version, or one recent enough for r-v8).
• Activate it.
• Install the conda version of the V8 R interface with conda install -c conda-forge r-v8.
That's it. Whenever you need V8, fire up your environment beforehand, and it should be A-OK.
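Put together, the commands might look like this ('V8' is just an example name):
conda create -n V8 python=3          # a fresh environment based on a recent Python
conda activate V8
conda install -c conda-forge r-v8    # pulls in R and its dependencies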
Further advice: If you run into errors when installing r-v8, it may be a good idea to update your conda and all the packages. However, depending on your conda version conda update conda and conda upgrade --all MAY BREAK your conda installation, so be careful. (For further information on this problem, see the endless complaints of people in this issue: https://github.com/conda/conda/issues/8920).
V8 doesn't use autotools, so it has no ./configure. In fact, it provides no installation facilities at all, because it is meant for embedding, not installing.
What I would try is to download the Ubuntu package (guessing from your other question, you are on Ubuntu, right?) for the right architecture from https://packages.ubuntu.com/trusty/libv8-3.14.5 and extract it manually. .deb files are just ar archives and can be unpacked without root.
As a side note, there's no point in setting PATH, because libv8, being a library, provides no executables. LD_LIBRARY_PATH is all you need.
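A sketch of that manual extraction, assuming you grabbed the amd64 .deb from the page above (the filename glob is illustrative; use whatever the package page gives you):
dpkg-deb -x libv8-3.14.5_*_amd64.deb $HOME/libv8            # unpack the package without root
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/libv8/usr/lib   # check which lib subdirectory was actually created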

Can run pyspark.cmd but not pyspark from command prompt

I am trying to get pyspark set up on Windows. I have Java, Python, Hadoop, and Spark all installed, and I believe the environment variables are set up as I've been instructed elsewhere. In fact, I am able to run this from the command prompt:
pyspark.cmd
And it will load up the pyspark interpreter. However, I should be able to run pyspark unqualified (without the .cmd), and python importing won't work otherwise. It does not matter whether I navigate directly to spark\bin or not, because I do have spark\bin added to the PATH already.
.cmd is listed in my PATHEXT variable, so I don't get why the pyspark command by itself doesn't work.
Thanks for any help.
While I still don't know exactly why, I think the issue somehow stemmed from how I unzipped the Spark tar file. Within the spark\bin folder, I was unable to run any .cmd programs without including the .cmd extension, but I could do so in basically any other folder. I re-extracted the archive and the problem no longer existed.
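If someone else hits the same symptom, a quick sanity check from the command prompt (nothing here is specific to my setup) would be:
REM confirm that spark\bin is on PATH and that .CMD is listed in PATHEXT
echo %PATH%
echo %PATHEXT%
REM list every pyspark file the shell can actually resolve
where pyspark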

How to use cassandra-loader on Ubuntu

I want to use cassandra-loader on Ubuntu 14.04.
I have Cassandra installed on my machine along with the other prerequisites required for the loader.
I am following this link:
https://github.com/brianmhess/cassandra-loader
I downloaded the cassandra-loader tool, but when I try to run any cassandra-loader command it says 'cassandra-loader: command not found'.
Kindly guide me if I am missing anything or need to install any other prerequisites as well.
In the README it says:
To run cassandra-loader, simply run the cassandra-loader executable (e.g., located at build/cassandra-loader)
So wherever you see cassandra-loader on its own, run build/cassandra-loader instead, or just copy build/cassandra-loader somewhere on your PATH and use it from there.
Done. I just needed to run chmod +x on the utility to make it executable, and it works properly now.
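For anyone else landing here, a minimal sketch of the fix (~/bin is just an example; any directory already on your PATH will do):
chmod +x cassandra-loader        # the downloaded file is not executable by default
mkdir -p ~/bin && mv cassandra-loader ~/bin/
export PATH=$PATH:~/bin          # only needed if ~/bin is not already on PATH
cassandra-loader                 # should now be found instead of 'command not found'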
