No start-history-server.sh when pyspark installed through conda - apache-spark

I have installed pyspark in a miniconda environment on Ubuntu through conda install pyspark. So far everything works fine: I can run jobs through spark-submit and I can inspect running jobs at localhost:4040. But I can't locate start-history-server.sh, which I need to look at jobs that have completed.
It is supposed to be in {spark}/sbin, where {spark} is the installation directory of spark. I'm not sure where that is supposed to be when spark is installed through conda, but I have searched through the entire miniconda directory and I can't seem to locate start-history-server.sh. For what it's worth, this is for both python 3.7 and 2.7 environments.
My question is: is start-history-server.sh included in a conda installation of pyspark?
If yes, where? If no, what's the recommended alternative way of evaluating spark jobs after the fact?

EDIT: I've filed a pull request to add the history server scripts to pyspark. The pull request has been merged, so this should tentatively show up in Spark 3.0.
As @pedvaljim points out in a comment, this is not conda-specific: the sbin directory isn't included in pyspark at all.
The good news is that you can manually download this folder from GitHub into your Spark folder (I'm not sure how to download just one directory, so I cloned all of Spark). If you're using Miniconda or Anaconda, the Spark folder is e.g. miniconda3/envs/{name_of_environment}/lib/python3.7/site-packages/pyspark.
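To check whether a given pyspark install already contains the script, something along these lines may help. It is only a sketch: it assumes pyspark is importable from the active environment, and find_package_dir and has_history_server_script are hypothetical helper names, not part of pyspark.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_dir(package: str = "pyspark") -> Optional[Path]:
    """Return the directory of an installed package, or None if it cannot be found."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent


def has_history_server_script(spark_dir: Path) -> bool:
    """Check whether sbin/start-history-server.sh exists under spark_dir."""
    return (spark_dir / "sbin" / "start-history-server.sh").is_file()


if __name__ == "__main__":
    spark_dir = find_package_dir()
    if spark_dir is None:
        print("pyspark is not importable from this environment")
    else:
        print(f"Spark folder: {spark_dir}")
        print(f"History server script present: {has_history_server_script(spark_dir)}")
```

If the second line prints False, the sbin folder is what needs to be copied into the printed Spark folder.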

Related

where is local hadoop folder in pyspark (mac)

I have installed pyspark on my local Mac using Homebrew. I can see Spark under /usr/local/Cellar/apache-spark/3.2.1/
but I can't find a Hadoop folder. If I run pyspark in the terminal, it starts the Spark shell.
Where can I see its path?
I am trying to connect S3 to pyspark and I have dependency jars.
You do not need to know the location of Hadoop to do this.
You should use a command like spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 app.py instead, which pulls in all necessary dependencies, so you don't have to download all the JARs (with their dependencies) yourself.
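If you start Spark from Python rather than through spark-submit, the same dependency can be requested with the spark.jars.packages setting. This is a sketch under assumptions: the app name is made up, build_session is a hypothetical helper, the hadoop-aws version is copied from the command above and generally has to match the Hadoop version your Spark build uses, and the setting only takes effect if it is set before the session's JVM starts.

```python
# Maven coordinate taken from the spark-submit example above.
HADOOP_AWS = "org.apache.hadoop:hadoop-aws:3.3.1"


def build_session(app_name: str = "s3-example"):
    """Create a SparkSession that resolves hadoop-aws and its dependencies at startup."""
    # Imported lazily so this module can still be loaded where pyspark is absent.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # hypothetical application name
        .config("spark.jars.packages", HADOOP_AWS)
        .getOrCreate()
    )
```

Reading an s3a:// path from the returned session should then go through the S3A connector, provided credentials are configured separately.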

Can't make action calls through anaconda py35 env in spark HdInsight

As per the documentation - https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-python-package-installation
we installed several external Python modules through a new Anaconda env 'py35_data_prof'. However, as soon as we invoke any RDD action such as rdd.count() or rdd.avg() in our Python code, Spark 2 throws:
Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory
FYI, the python indicated in the error path, '/usr/bin/anaconda/envs/py35_data_prof/bin/python', is actually a symlink rather than a python directory.
I have been looking up the HDInsight docs but can't seem to find the fix. Please let us know if there is a way around it.
The error message “Cannot run program "/usr/bin/anaconda/envs/py35_data_prof/bin/python": error=2, No such file or directory” says that the program at that path could not be found. Make sure the environment is set up following all of the steps below.
• Create Python virtual environment using conda.
• Install external Python packages in the created virtual environment if needed.
• Change Spark and Livy configs and point to the created virtual environment.
I would ask you to follow each and every step mentioned here: “Safely install external Python packages”.
Hope this helps.
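Since error=2 means the path could not be found on the machine that tried to launch it, one quick check is to inspect that path on every node: a symlink that resolves on the head node may be missing or dangling on a worker. A minimal sketch, where describe_interpreter is a hypothetical helper and not part of Spark or HDInsight:

```python
import os


def describe_interpreter(path: str) -> dict:
    """Report whether `path` can be used as a Python executable on this machine."""
    resolved = os.path.realpath(path)  # follows symlinks; does not fail if the target is missing
    return {
        "path": path,
        "is_symlink": os.path.islink(path),
        "resolved": resolved,
        "exists": os.path.exists(path),  # False for a dangling symlink
        "executable": os.path.isfile(resolved) and os.access(resolved, os.X_OK),
    }


if __name__ == "__main__":
    # Path taken from the error message; run this on each node that launches Python workers.
    target = "/usr/bin/anaconda/envs/py35_data_prof/bin/python"
    for key, value in describe_interpreter(target).items():
        print(f"{key}: {value}")
```

If exists or executable comes back False on any node, the environment was probably not created there, or the symlink points at something that only exists elsewhere.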

Properly configuring PySpark and Anaconda3 on Linux

Here are the steps I have taken so far:
1. I installed Anaconda3, and everything included, in the directory $HOME/anaconda3/bin.
2. I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
3. I didn't do anything else. More specifically, there are no variables set in my .bashrc.
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside of my home folder that I have yet to discover but I might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
Goal. To do something along the lines of adding variables or configuring some files so that I can run pyspark on Jupyter Notebook.
What other steps do I need to take after Step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required - you should be able to open a new notebook and import pyspark.
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the
other use cases. This Python packaged version of Spark is suitable for
interacting with an existing cluster (be it Spark standalone, YARN, or
Mesos) - but does not contain the tools required to setup your own
standalone Spark cluster. You can download the full version of Spark
from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent add-on, aimed at the cases described in the docs above. I don't know what limitations you may face (have never tried it) but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).

Setting python in workers in SPARK YARN with anaconda

I went through this post on setting the Python path for workers/drivers in standalone Spark mode. Apparently, the straightforward way is to set the PYSPARK_PYTHON environment variable in the spark-env.sh file located in Spark's conf folder, which is /opt/cloudera/parcels/CDH/lib/spark/conf/ in my case. However, I was trying to do the same for Spark in YARN cluster mode. I tried playing around for quite some time, and then I found this Cloudera blog on adding the Anaconda package.
Now all that is left to do is to add the Anaconda path in the spark-env.sh file instead of the standard Python path. It finally worked. Please share if there is a better/alternative way to set up or update Python for Spark and pyspark.
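One alternative that avoids editing spark-env.sh is to pass the interpreter per job through Spark configuration: spark.yarn.appMasterEnv.<NAME> sets an environment variable for the YARN application master (which hosts the driver in cluster mode), and spark.executorEnv.<NAME> sets it for the executors. The sketch below assumes Anaconda is installed at the same path on every node; that path, app.py, and the helper names are hypothetical.

```python
from typing import Dict, List


def python_env_conf(python_path: str) -> Dict[str, str]:
    """Spark settings that point the YARN application master and executors at one interpreter."""
    return {
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a settings dict into repeated `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Hypothetical Anaconda location; adjust to wherever it lives on your nodes.
    conf = python_env_conf("/opt/anaconda3/bin/python")
    command = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
               *to_submit_args(conf), "app.py"]
    print(" ".join(command))
```

This keeps the choice of interpreter with the job rather than with the cluster-wide configuration, which can be useful when different jobs need different environments.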

Copying the Apache Spark installation folder to another system will work properly?

I am using Apache Spark. It is working properly in a cluster with 3 machines. Now I want to install Spark on another 3 machines.
What I did: I tried to just copy the Spark folder that I am currently using.
Problem: ./bin/spark-shell and all other Spark commands are not working and throw the error 'No Such Command'.
Question: 1. Why is it not working?
2. Is it possible to build a Spark installation on one machine and then distribute that installation to other machines?
I am using Ubuntu.
We looked into the problem and found that the copied Spark installation folder contained the .sh files, but they were not executable. We made the files executable and now Spark is running.
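Execute bits can be lost when a folder is copied with a tool or through a filesystem that does not preserve permissions. Below is a sketch of restoring them, assuming the launcher scripts live in bin and sbin under the Spark folder and that the Windows .cmd files can be skipped; make_scripts_executable is a hypothetical helper.

```python
import stat
from pathlib import Path
from typing import Iterable, List, Union


def make_scripts_executable(spark_home: Union[str, Path],
                            subdirs: Iterable[str] = ("bin", "sbin")) -> List[Path]:
    """Add execute permission to the launcher scripts and return the files that were changed."""
    changed: List[Path] = []
    for sub in subdirs:
        directory = Path(spark_home) / sub
        if not directory.is_dir():
            continue
        for path in sorted(directory.iterdir()):
            # Skip subdirectories and Windows batch files.
            if not path.is_file() or path.suffix == ".cmd":
                continue
            mode = path.stat().st_mode
            new_mode = mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
            if new_mode != mode:
                path.chmod(new_mode)
                changed.append(path)
    return changed
```

Calling make_scripts_executable with the copied Spark folder on each new machine has roughly the effect of running chmod +x on the files in bin and sbin.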
Yes, it would work, but you should ensure that you have set all the environment variables required for Spark to work,
like SPARK_HOME, WEBUI_PORT, etc.
Also use a Hadoop-integrated Spark build, which comes with the supported versions of Hadoop.
