Can't make action calls through Anaconda py35 env in Spark HDInsight - Azure

As per the documentation - https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-python-package-installation
we installed several external Python modules through a new Anaconda environment, 'py35_data_prof'. However, as soon as we invoke any RDD action call such as rdd.count() or rdd.avg() in our Python code, Spark 2 throws:
Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory
FYI, the Python indicated in the error path, '/usr/bin/anaconda/envs/py35_data_prof/bin/python', is actually a symlink rather than a Python directory.
I have been looking through the HDInsight docs but can't seem to find a fix. Please let us know if there is a way around it.

The error message “Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory” clearly says that Spark is unable to find/locate the Python interpreter of the environment you installed. Make sure the environment is set up with all of the requirements mentioned below.
• Create Python virtual environment using conda.
• Install external Python packages in the created virtual environment if needed.
• Change Spark and Livy configs and point to the created virtual environment.
I would request you to follow each and every step mentioned here: “Safely install external Python packages”.
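For illustration only, here is a hedged sketch of what “pointing Spark at the virtual environment” amounts to. The supported path on HDInsight is to change these settings through Ambari as the linked doc describes, and the interpreter path below (taken from the error message) is assumed to exist on every node of the cluster, not just the head node:
# Hedged sketch, not the official Ambari steps: point the driver and the YARN
# executors at the conda environment's interpreter before creating the session.
import os
from pyspark.sql import SparkSession

env_python = "/usr/bin/anaconda/envs/py35_data_prof/bin/python"  # assumed present on all nodes
os.environ["PYSPARK_PYTHON"] = env_python
os.environ["PYSPARK_DRIVER_PYTHON"] = env_python

spark = (SparkSession.builder
         .appName("py35-data-prof-check")
         .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", env_python)
         .config("spark.executorEnv.PYSPARK_PYTHON", env_python)
         .getOrCreate())

print(spark.sparkContext.range(100).count())  # an action call like the ones that failed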
Hope this helps.

Related

IBM-federated-learning-lib_ModuleNotFound: No module named 'examples.constants'

For some days now I have been working with the IBM federated learning library from GitHub. After finishing the setup, I wanted to run the files for the aggregator and the parties in my Jupyter notebook and ran into a ModuleNotFoundError, as you can see in the attached picture.
What cannot be the cause:
a) Jupyter using a different Python than pip: both are using 3.8.
b) Jupyter trying to load from the wrong directory: as you can see in the code, we are loading generate.data from the directory /home/jovyan/FL_MNIST/federated-learning-lib/federated-learning-lib/examples, and the file generate.data is indeed in the directory /home/jovyan/FL_MNIST/federated-learning-lib/federated-learning-lib/examples.
Has anyone run into this while using the IBM library and managed to solve it? Or do you have a well-educated guess as to what the cause could be?
Best regards,
Solaris
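One hedged guess: ModuleNotFoundError: No module named 'examples.constants' usually means that the directory containing the examples package (the repository root, not the examples folder itself) is missing from sys.path in the notebook's kernel. A minimal check, using the path from the question as an assumption:
# Hedged sketch: make the repository root (the directory *containing* 'examples')
# importable from the notebook. The path below is taken from the question and is
# an assumption about this particular setup.
import sys

repo_root = "/home/jovyan/FL_MNIST/federated-learning-lib/federated-learning-lib"
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

import examples.constants  # should now resolve, provided examples/ has an __init__.py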

No start-history-server.sh when pyspark installed through conda

I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need in order to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of Spark. I'm not sure where that is supposed to be when Spark is installed through conda, but I have searched through the entire miniconda directory and can't seem to locate start-history-server.sh. For what it's worth, this is the case for both Python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?
EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific: the sbin directory isn't included in pyspark at all.
The good news is that it's possible to manually download this folder from GitHub into your Spark folder (I'm not sure how to download just one directory; I just cloned all of Spark). If you're using mini- or anaconda, the Spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
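If it helps, a small sketch of my own (assuming the goal is simply to find where the conda-installed pyspark lives, so the sbin/ scripts can be copied next to it):
# Locate the conda-installed pyspark package directory.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)  # .../site-packages/pyspark
print("Copy Spark's sbin/ directory into:", spark_home)
print("Expected script location:", os.path.join(spark_home, "sbin", "start-history-server.sh"))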

Can run pyspark.cmd but not pyspark from command prompt

I am trying to get pyspark set up for Windows. I have Java, Python, Hadoop, and Spark all set up, and I believe the environment variables are configured as I've been instructed elsewhere. In fact, I am able to run this from the command prompt:
pyspark.cmd
And it will load up the pyspark interpreter. However, I should be able to run pyspark unqualified (without the .cmd), and importing it from Python won't work otherwise. It does not matter whether I navigate directly to spark\bin or not, because I already have spark\bin added to the PATH.
.cmd is listed in my PATHEXT variable, so I don't get why the pyspark command by itself doesn't work.
Thanks for any help.
While I still don't know exactly why, I think the issue somehow stemmed from how I unzipped the Spark tar file. Within the spark\bin folder, I was unable to run any .cmd program without including the .cmd extension, but I could do so in basically any other folder. I redid the unzip and the problem no longer existed.
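Not from the original answer, but a quick way to double-check the two assumptions above (spark\bin on PATH, .CMD in PATHEXT) from any Python prompt on Windows might look like this:
# Diagnostic sketch (assumes Windows; prints which assumptions hold).
import os

pathext = os.environ.get("PATHEXT", "").upper().split(os.pathsep)
print(".CMD listed in PATHEXT:", ".CMD" in pathext)

for entry in os.environ.get("PATH", "").split(os.pathsep):
    if "spark" in entry.lower():
        print("Spark-related PATH entry:", entry)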

Properly configuring PySpark and Anaconda3 on Linux

Here are the steps I have taken so far:
1. I installed Anaconda3 and everything included in the directory $HOME/anaconda3/bin.
2. I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
3. I didn't do anything else. More specifically, there are no variables set in my .bashrc.
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside of my home folder that I have yet to discover but I might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
To add whatever variables or configuration files are needed so that I can run pyspark in a Jupyter notebook.
What other steps do I need to take after step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and since, as you say, Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required: you should be able to open a new notebook and import pyspark.
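For example, a minimal smoke test in a notebook cell might look like the following (a sketch, assuming local mode is enough for a first check):
# Minimal smoke test: the conda-installed pyspark ships its own Spark jars,
# so no SPARK_HOME is needed for local mode.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.version)
spark.range(10).show()
spark.stop()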
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the
other use cases. This Python packaged version of Spark is suitable for
interacting with an existing cluster (be it Spark standalone, YARN, or
Mesos) - but does not contain the tools required to setup your own
standalone Spark cluster. You can download the full version of Spark
from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent add-on, aimed at the cases described in the docs above. I don't know what limitations you may face (I have never tried it), but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).

Loading python modules through a computing cluster

I have an account on a computing cluster that uses Scientific Linux. Of course, I only have user access. I'm working with Python and I need to run Python scripts, so I need to import some Python modules. Since I don't have root access, I installed a local Python copy in my $HOME with all the required modules. When I run the scripts on my account (the hosting node), they run correctly. But in order to submit jobs to the computing queues (to process on much faster machines), I need to submit a bash script that has a line that executes the scripts. The computing cluster uses Sun Grid Engine. However, when I submit the bash script, I get an error that the modules I installed can't be found. I can't figure out what is wrong; I hope you can help.
You could simply call your Python program from the bash script with something like: PYTHONPATH=$HOME/lib/python /path/to/my/python my_python_script
I don't know how Sun Grid Engine works, but if it runs jobs as a different user than yours, you'll need global read access to your $HOME, or at least to the Python libraries.
First, whether or not this solution works for you depends heavily on how the cluster is set up. That said, the general solution to your problem is below. If the compute cluster has access to the same files as you do in your home directory, I see no reason why this would not work.
You need to be using a virtualenv. Install your software inside that virtualenv along with any additional Python packages you need. Then, in your batch bash script, provide the full path to the Python interpreter within that virtualenv.
Note: to install Python packages inside your virtualenv, you need to use the pip instance that is in your virtualenv, not the system pip.
Example:
$ virtualenv foo
$ cd foo
$ ./bin/pip install numpy
Then in your bash script:
/path/to/foo/bin/python /path/to/your/script.py
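As a small addition (not part of the original answer), you could put a couple of diagnostic lines at the top of script.py to confirm that the batch job really picked up the virtualenv interpreter:
# Diagnostic sketch: show which interpreter and search path the job is using.
import sys

print("Interpreter:", sys.executable)  # should point into /path/to/foo/bin/
print("sys.path head:", sys.path[:5])  # should include the virtualenv's site-packages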
Have you tried adding these lines to your Python code?
import sys
sys.path.append("..")
from myOtherPackage import myPythonFile
This works very well for my code when I run it on the cluster and want to call myPythonFile from another package, myOtherPackage.
