PySpark + Hadoop + Anaconda install - apache-spark

I'm new to Anaconda, Spark and Hadoop. I wanted to get a standalone dev environment set up on my Ubuntu 16.04 machine but was getting confused about what I should do within conda and what is external.
So far I have installed Anaconda and created a TensorFlow environment (I should add that I will be using TF too), and installed PySpark separately, outside of Anaconda.
However, I wasn't sure if that's what I'm supposed to do or whether I should use conda install instead. I'm also eventually going to want to install Hortonworks Hadoop too, and I hear that it may already come bundled with Spark.
Long story short - I just want to get a dev environment set up with all these technologies that I can play around with and have data flow from one to the other as seamlessly as possible.
Appreciate any advice on the "correct" way to set everything up.
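For reference, one common pattern is to keep everything inside a single conda environment (installing PySpark there with conda or pip) rather than maintaining a separate system-wide Spark. The snippet below is a minimal sketch of a sanity check for such a setup; it assumes PySpark and TensorFlow are already installed in the active environment and that a Java runtime is available (PySpark needs a JVM), and the app name is just a placeholder.
# sanity_check.py - run inside the activated conda environment (minimal sketch)
from pyspark.sql import SparkSession
import tensorflow as tf

spark = (SparkSession.builder
         .master("local[*]")        # local standalone mode, no cluster required
         .appName("dev-env-check")  # placeholder app name
         .getOrCreate())

print("Spark version:", spark.version)
print("TensorFlow version:", tf.__version__)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()
spark.stop()
If this runs, Spark and TensorFlow are living happily in the same environment, and data can be passed between them as plain Python objects or files on disk.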

Related

How to have dedicated Jupyter notebook configuration files on one machine

I am running a Windows 10 machine, where I have a Python installation installed from one of the programs I am working with. This leads to dependencies of this program to specific versions of Python packages, including Jupyter and Jupyterlab, and I cannot update/upgrade them without breaking the functionality in the original program.
Hence, I decided to install a more recent version of Python in addition to the one I already have on my machine. That was also not the issue, and installing all the packages I was after went fine so far.
However, even though I installed Node.js and npm for the new Python installation, JupyterLab still does not recognize those packages when I attempt to install a widget.
In addition, when running jupyter-lab.exe --generate-config, I am asked whether I want to overwrite the existing configuration file.
I have no intention of doing so, but I would like to be able to configure the different Jupyter notebook environments separately from each other.
Is there a possibility to do so?
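One way to do this (not from the original thread) is to point each installation at its own configuration directory via the JUPYTER_CONFIG_DIR environment variable, which Jupyter honors when reading and generating config files. A minimal sketch, with a hypothetical directory path:
# minimal sketch: give the new Python installation its own Jupyter config directory
import os
os.environ["JUPYTER_CONFIG_DIR"] = r"C:\Users\me\.jupyter-new"  # hypothetical path

from jupyter_core.paths import jupyter_config_dir
print(jupyter_config_dir())  # should now report the directory set above
Setting the variable in the shell (or in a per-environment activation script) before running jupyter-lab.exe --generate-config keeps the two configurations separate, so generating a config for the new installation no longer touches the existing one.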

Python, Anaconda & PyCharm multiple versions of Python3

I just installed Anaconda3-2019-10 on my MacBook.
I tried to make sure that my previous Python 3 version was totally uninstalled / removed from my system. Typing python3 into the terminal didn't work anymore.
After installing Anaconda and PyCharm (pycharm-community-anaconda-2019.3.3) I started a new Project to test everything. For that I selected to create a new Conda environment:
After I created the project, I checked the Preferences and the "Project Interpreter". This is what I found:
I expected to find two interpreters: 1) my Python 3.7 version and 2) the conda environment just created.
Does finding three versions mean that I didn't correctly uninstall Python 3 before installing Anaconda, or is there anything that I don't understand here?
Do I need both versions?
If not is there a safe way to remove one of them?
For removing Python 3 from my system I did almost everything suggested in numerous posts on Stack Overflow.
Upon creating a venv (virtual environment), you no longer need to worry about the existing interpreter. https://docs.python.org/3/tutorial/venv.html might be of help.
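As a quick way to see which interpreter and environment PyCharm (or a terminal session) is actually using, a small check like the following can help; this is just an illustrative snippet, not part of the original answer:
# print the interpreter and environment currently in use
import sys
print(sys.executable)  # full path of the running interpreter
print(sys.prefix)      # root of the active venv / conda environment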

RHEL 7.6 - Built Python3.6 from Source Broke Network

I have a RHEL system which by default was running Python 2.7 and Python 3.4.
I needed Python 3.6 for a project I wanted to work on, so I downloaded it and built it from source. I ran make and make install, which in hindsight may have been the wrong decision.
Now I do not seem to have any internet connectivity. Does anyone know what I may have overwritten to cause this, or at least where specifically I can look to track this issue down?
Note: I can PuTTY into the Linux machine, but it doesn't seem to have any other connectivity, specifically HTTPS.
It's a bit weird that this would break network connectivity. One possible explanation is that the system has networking scripts or a network manager that relies on Python, and it got broken after make install replaced your default Python installation. It may be possible to fix this by reinstalling your RHEL Python packages (sorry, cannot offer more detailed help there, as I don't have access to a RHEL box).
I guess the lesson is "be careful about running make install as superuser". To easily install and manage different Python versions (separate from the system Python), the Anaconda Python distribution would be a good solution.
I suggest undoing that 3.6 installation and using the Software Collections version of Python 3.6. See here for Python 3.6 installation. Software Collections install "alongside" the original versions so as not to affect the OS, and they are included in the subscription.
So after a lot of time slamming my head against the wall, I got it worked out. My best guess is that the system (RHEL 7) relied on something from its default Python 2.7 installation to handle SSL negotiations, and installing 3.6 with make install must have overwritten some pointer. Had I done this correctly, with make altinstall, all would likely have been fine.
The most frustrating part of this is that there were no error messages, connections just timed out.
To fix this, I had to uninstall all Python versions and then reinstall Python 2.7. Once Python 2 was back on the system, it all seemed to work well.
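To confirm that a properly installed Python (e.g. a 3.6 built with make altinstall) can handle SSL, a quick check along these lines can be useful; this is a minimal sketch and the URL is just an example:
# minimal SSL sanity check after reinstalling Python
import ssl
from urllib.request import urlopen

print(ssl.OPENSSL_VERSION)  # the import itself fails if the ssl module is broken
with urlopen("https://www.python.org", timeout=10) as resp:  # any HTTPS URL works
    print(resp.status)      # expect 200 if the HTTPS negotiation succeeds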

Anaconda environment creation

I am new to Python and machine learning. I am trying to create separate Anaconda environments for TensorFlow, Keras, etc. on my Windows machine. I have successfully installed Anaconda with automatic environment path setup. I created a test environment, but when I try to create another one I get the following error. I have Python 3.6.2 and 3.7.2 installed in conda. When I run the command
conda create -n t_tensor
it gives the following error. Not sure where the problem is. Any help is appreciated.
Thank you.
Error Image

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using
pip install pyspark
I was a bit surprised that I can already run pyspark from the command line or use it in Jupyter notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/#GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain:
what is the exact connection between these two technologies?
why is installing PySpark enough to make it run? Does it actually install Spark under the hood? If yes, where?
if you install only PySpark, is there something you miss (e.g. I cannot find the sbin folder, which contains, for example, the script to start the history server)?
As of v2.2, executing pip install pyspark will install Spark.
If you're going to use PySpark, it's clearly the simplest way to get started.
On my system, Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars.
The PySpark installed by pip is a subfolder of the full Spark distribution. You can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from Apache Spark and install it.
PySpark comes with a Spark installation bundled. If installed through pip3, you can find it with pip3 show pyspark. For example, for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration, so it can't be used for managing clusters like a full Spark installation.
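Building on the answers above, a small illustrative snippet can show where the pip/conda-installed PySpark package and its bundled Spark jars live on your own system:
# locate the pip/conda-installed PySpark package and its bundled Spark jars
import os
import pyspark

print(pyspark.__file__)      # site-packages location of the package
print(pyspark.__version__)   # the Spark version shipped with it

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(len(os.listdir(jars_dir)), "bundled jar files")
As noted above, this bundled copy is fine for local/standalone use, but the cluster-management scripts (the sbin folder, history server, etc.) only come with the full Apache Spark download.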
