How to run Python 3 on Google's Dataproc PySpark - python-3.x

I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of the default 2.7.
The best I've been able to find is adding these initialization commands.
However, when I SSH into the cluster,
(a) the python command is still python2,
(b) my job fails due to a Python 2 incompatibility.
I've tried uninstalling python2 and also adding alias python='python3' in my init.sh script, but alas, no success: the alias doesn't seem to stick.
I create the cluster like this (imports shown for completeness):
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

cluster_config = {
    "projectId": self.project_id,
    "clusterName": cluster_name,
    "config": {
        "gceClusterConfig": gce_cluster_config,
        "masterConfig": master_config,
        "workerConfig": worker_config,
        "initializationActions": [
            {
                "executableFile": executable_file_uri,
                "executionTimeout": execution_timeout,
            }
        ],
    },
}
credentials = GoogleCredentials.get_application_default()
api = build('dataproc', 'v1', credentials=credentials)
response = api.projects().regions().clusters().create(
    projectId=self.project_id,
    region=self.region, body=cluster_config
).execute()
My executable_file_uri sits on Google Storage; init.sh:
apt-get -y update
apt-get install -y python-dev
wget -O /root/get-pip.py https://bootstrap.pypa.io/get-pip.py
python /root/get-pip.py
apt-get install -y python-pip
pip install --upgrade pip
pip install --upgrade six
pip install --upgrade gcloud
pip install --upgrade requests
pip install numpy

I found an answer to this here, and my initialization script now looks like this:
#!/bin/bash
# Install tools
apt-get -y install python3 python-dev build-essential python3-pip
easy_install3 -U pip
# Install requirements
pip3 install --upgrade google-cloud==0.27.0
pip3 install --upgrade google-api-python-client==1.6.2
pip3 install --upgrade pytz==2013.7
# Setup python3 for Dataproc
echo "export PYSPARK_PYTHON=python3" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "export PYTHONHASHSEED=0" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf

Configure the Dataproc cluster's Python environment explains this in detail. Basically, you need init actions on image versions before 1.4, while in 1.4+ the default is Python 3 from Miniconda3.
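So on a recent image version no init action is needed at all. A minimal sketch (cluster name and region are placeholders):
# Image version 1.4+ ships Miniconda3, so PySpark picks up Python 3 by default
gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --image-version 1.4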

You can also use the Conda init action to set up Python 3 and optionally install pip/conda packages: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/conda.
Something like:
gcloud dataproc clusters create foo --initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh

There are a couple of ways to select the Python interpreter for PySpark.
1. If you want to set python3 as the default, set export PYSPARK_PYTHON=python3 while creating the Dataproc cluster. I added a couple of lines to the init scripts:
sudo echo "export PYSPARK_PYTHON=python3" | sudo tee -a /etc/profile.d/effective-python.sh
source /etc/profile.d/effective-python.sh
2. Otherwise, it's also possible to specify the Python version through --properties when submitting a PySpark job to the Dataproc cluster (see the full command sketch below). The Python version can be passed in the following way:
--properties spark.pyspark.python=python3.7,spark.pyspark.driver.python=python3.7
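A full submission might then look like the sketch below (cluster name, region and script path are placeholders):
gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
    --cluster my-cluster \
    --region us-central1 \
    --properties spark.pyspark.python=python3.7,spark.pyspark.driver.python=python3.7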

Related

Cannot connect to EC2 ubuntu 18.04 instance after upgrading to Python3.9

I am using EC2 Ubuntu 18.04 VM.
Due to CVE-2021-3177, Python needs to be upgraded to the latest version of Python3.9 which would be 3.9.9 currently.
I did that using the deadsnakes option as per the steps mentioned below:
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get install python3.9
sudo apt-get update
sudo apt upgrade -y
The above ensures that Python 3.9.9 is now available. But now both python3.6 and python3.9 are available, so next we will use the update-alternatives command to make python3.9 the default version.
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 2
Now that alternatives are defined, we will switch to Option 2 as the default option i.e. Python3.9
sudo update-alternatives --config python3
Once done, the following command would point to the latest version.
sudo python3 -V
However, if you use the sudo apt update command, you will see an error stating that
Traceback (most recent call last):
File "/usr/lib/cnf-update-db", line 8, in <module>
from CommandNotFound.db.creator import DbCreator
File "/usr/lib/python3/dist-packages/CommandNotFound/db/creator.py", line 11, in <module>
import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'
Reading package lists... Done
E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'
E: Sub-process returned an error code
To fix this we will have to add a link using the following command
cd /usr/lib/python3/dist-packages/
sudo ln -s apt_pkg.cpython-{36m,39m}-x86_64-linux-gnu.so
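A quick sanity check (my addition, not from the original post) that the symlinked module imports again under the new default interpreter:
# Should print the path of the apt_pkg extension module instead of ModuleNotFoundError
python3 -c "import apt_pkg; print(apt_pkg.__file__)"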
Also, the following is optional; I tried with and without these commands:
apt purge python3-apt
apt install python3-apt
sudo apt install python3.9-distutils python3.9-dev
Once done, the following command will no longer result in any errors:
sudo apt update
This means that the issue is fixed.
But for some reason, I cannot connect to the machine afterwards, and if I create an AMI from it, I cannot connect to the launched instance using PuTTY or SCP.
The same issue persists with Ubuntu-20.x too.
Appreciate your help.
After upgrading Python, the following Python modules that cloud-init depends on break, which in turn prevents EC2 from correctly configuring your newly booted instance via cloud-init, and that is why it is inaccessible:
setuptools
urllib3
requests
jinja2
netifaces
You can debug this issue by going to your EC2 instance in the AWS Web Console and clicking:
Actions -> Monitor and troubleshoot -> Get system log
Sometimes it takes a while to update, so click the refresh button until your logs appear. It is easier to read the logs if you download them. This is what helped me solve the issues that I was having.
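If you prefer the command line, the same log can be pulled with the AWS CLI from another machine; a sketch, where the instance ID is a placeholder you replace with your own:
# Fetch the console log of the unreachable instance and save it for easier reading
aws ec2 get-console-output \
    --instance-id i-0123456789abcdef0 \
    --output text > console.log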
The following steps resolved the issue for me on Ubuntu 18.04 LTS:
For Ubuntu 20.04 LTS, change the 36m in the symbolic links to 38.
# Add deadsnakes ppa repository
sudo add-apt-repository ppa:deadsnakes/ppa
# Install new python version
sudo apt update
sudo apt install python3.10
# Fix broken apt_inst after python upgrade
sudo ln -s /usr/lib/python3/dist-packages/apt_inst.cpython-36m-x86_64-linux-gnu.so /usr/lib/python3/dist-packages/apt_inst.so
# Fix broken apt_pkg after python upgrade
sudo ln -s /usr/lib/python3/dist-packages/apt_pkg.cpython-36m-x86_64-linux-gnu.so /usr/lib/python3/dist-packages/apt_pkg.so
# Make installed python version an alternative with a priority of 2
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 2
# Make upgraded python version an alternative with a priority of 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
# Reinstall python3-apt
sudo apt remove --purge python3-apt
sudo apt autoclean
sudo apt install python3-apt
# Install required packages
sudo apt install \
build-essential \
python3.10-distutils \
python3.10-venv \
libpython3.10-dev
# Install latest pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python3.10 get-pip.py
# Upgrade outdated python libraries that break cloud-init
sudo -i
pip3 install --upgrade setuptools
pip3 install --upgrade urllib3
pip3 install --upgrade requests
pip3 install --upgrade jinja2
pip3 install --upgrade netifaces
pip3 install --upgrade --ignore-installed pyyaml
exit
# Upgrade cloud-init to latest version
sudo apt install --only-upgrade cloud-init
If you use Ansible, it is also affected by the upgrade.
Ansible can be fixed as follows:
Edit /usr/lib/python3/dist-packages/apt/package.py and change the following line:
from collections import Mapping, Sequence
to:
from collections.abc import Mapping, Sequence
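If you prefer to script that edit (my own addition, not part of the original answer), a sed one-liner with a backup file does the same thing:
# Rewrite the broken import in place; the original file is kept as package.py.bak
sudo sed -i.bak 's/^from collections import /from collections.abc import /' /usr/lib/python3/dist-packages/apt/package.py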
It would be useful if the deadsnakes repository could provide an update for python3-apt (eg. python3.10-apt) to solve this issue.
Reference:
https://cloudbytes.dev/snippets/upgrade-python-to-latest-version-on-ubuntu-linux

AWS EMR psycopg2 module not found

I am executing a Spark job in an AWS Elastic MapReduce cluster. I had the same problem with boto3; however, I fixed that with my bootstrap script, which is as follows:
#!/bin/bash
sudo pip-3.6 install boto3
sudo yum update -y
I tried to do the same thing with psycopg2 but it is not working.
#!/bin/bash
sudo pip-3.6 install boto3 psycopg2
sudo python3 -m pip install psycopg2
sudo yum update -y
emr version: 5.29.0
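A likely culprit (an assumption on my part, not from the original thread): psycopg2 is compiled from source at install time and needs the PostgreSQL client headers, so either install those first or fall back to the prebuilt wheel. A bootstrap sketch:
#!/bin/bash
# Option A: install the build dependency, then build psycopg2 from source
sudo yum install -y postgresql-devel
sudo python3 -m pip install psycopg2
# Option B: skip the native build entirely and use the prebuilt wheel
# sudo python3 -m pip install psycopg2-binary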

How to Run Python 3.6 on GCP AI Platform Notebook

I have a dependency for my project that requires python v3.6+. It therefore throws an error during installation via pip in a python 3 kernel, because AI Platform Notebooks ship with v3.5 by default. How can I run a GCP AI Platform Notebook with the latest version of python?
The answer is simpler than I thought. Since the AI notebook is a GCE instance, I simply ssh'ed into the machine, and followed the instructions here to install Python 3.7.
Click on the AI Platform notebook name and you will reach the VM instance details page with a remote-access SSH option (the option is enabled only if the AI Platform notebook is running, not stopped).
Once you SSH into the notebook VM, you can install it using the following commands (from How do I install Python 3.7 in google cloud shell):
# Install requirements
sudo apt-get install -y build-essential checkinstall libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev zlib1g-dev openssl libffi-dev python3-dev python3-setuptools wget
# Prepare to build
mkdir /tmp/Python37
cd /tmp/Python37
# Pull down Python 3.7, build, and install
wget https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tar.xz
tar xvf Python-3.7.0.tar.xz
cd /tmp/Python37/Python-3.7.0
./configure
sudo make altinstall
Now you can create a kernel in the notebook using the commands below; you can do this inside a virtual environment:
Open up your terminal and enter the following line by line
virtualenv -p python3.6 py_36_env
. py_36_env/bin/activate # if . does not work then use source py_36_env/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=py_36_env
jupyter notebook
Then in Jupyter Notebook you can select the 3.6 environment (py_36_env) from the 'New' drop-down menu or from the 'Kernel' drop-down menu within a given notebook.

How to create virtual environment for python 3.7.0?

I'm able to install it as the root user, but I want to install it in a clean environment. My use case is to test the installation of another application with pip for a customer who is using Python 3.7.0.
sudo apt-get update
sudo apt-get install build-essential libpq-dev libssl-dev openssl libffi-dev zlib1g-dev
sudo apt-get install python3-pip python3-dev
sudo apt-get install python3.7
Thanks.
(assuming python3.7 is installed)
Install virtualenv package:
pip3.7 install virtualenv
Create new environment:
python3.7 -m virtualenv MyEnv
Activate environment:
source MyEnv/bin/activate
To help anyone else who runs into the chicken & egg situation trying to use the above chosen answer, here's what solved it for me:
sudo apt install python3.7-venv
python3.7 -m venv env37
source env37/bin/activate
deactivate (when done using the environment)
I had installed Python 3.7 using deadsnakes rather than from source:
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
In doing so I could run python3.7 --version, but since I had no pip3.7 I could not install virtualenv as directed in the solution above. As luck would have it, deadsnakes has venv! Once I installed venv I could create my environment and be on my merry way.
Handy official python page with venv info
So why didn't I just use:
python3.7 -m ensurepip
That was giving me:
ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/local/lib/python3.7/dist-packages/easy_install.py'
Consider using the --user option or check the permissions.
Which left me with 3 choices:
use sudo (which is simple but I keep being told is frowned upon)
install with --user option which wasn't ideal in that I may not always be logged in as the same user
or install it in an environment which I'm told is the recommended route.
But see the chicken-and-egg problem above: how do I install pip in an environment when I can't create a venv or virtualenv? Hence my workaround of installing venv from deadsnakes, which allowed me to create the virtual environment and then install pip3.7:
(env37) user@ubuntu:~$ python3.7 -m ensurepip
(env37) user@ubuntu:~$ pip3.7 --version
pip 19.2.3 from /home/user/env37/lib/python3.7/site-packages/pip (python 3.7)
Some added information: if you are trying a version like Python 3.7.10, executing pip3.7.10 install virtualenv might give the following error:
.pyenv/versions/3.7.10/bin/python: No module named virtualenv
So, in a general sense you can do the following steps:
[commands are specific to macOS; I am currently using it with the new M1 chip]
After installing 3.7.10 using pyenv, make it global.
brew update
brew install pyenv
set environment variables
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
source ~/.bash_profile
Look at the pyenv list to see whether the version you want to install is there, then install it and make it global:
pyenv install --list
pyenv install 3.7.10
pyenv global 3.7.10
create your virtual environment now with this version
python -m venv MyEnv
activate it
source MyEnv/bin/activate
Using pip on Windows, you can do the following:
1. virtualenv --python "C:\Python37\python.exe" venv  # use your own path
You will see something like this:
Running virtualenv with interpreter C:\Python37\python.exe
Using base prefix 'C:\Python37'
New python executable in C:\Users\XXXX\Documents\GitHub\MyProject\venv\Scripts\python.exe
Installing setuptools, pip, wheel...
done.
2. C:\Users\XXXXX\Documents\GitHub\MyProject>cd venv
C:\Users\XXXXX\Documents\GitHub\MyProject\venv>cd Scripts
C:\Users\XXXXX\Documents\GitHub\MyProject\venv\Scripts>activate
When you see the environment name, in this case (venv), at the beginning of the command prompt, it is a sign that your virtual environment is activated.
(venv) C:\Users\tuscar2001\Documents\GitHub\MyProject\venv\Scripts>
Please check the following link for more details: http://www.datasciencetopics.com/2020/03/how-to-set-up-virtual-environment-in.html
Figure out the python3.7 path on your system. For Mac with python3.7 installed via brew, you can use the following:
virtualenv env -p /usr/local/opt/python@3.7/bin/python3
source ./env/bin/activate

How to upgrade AWS CLI to the latest version?

I recently noticed that I am running an old version of AWS CLI that is lacking some functionality I need:
$ aws --version
aws-cli/1.2.9 Python/3.4.3 Linux/3.13.0-85-generic
How can I upgrade to the latest version of the AWS CLI (1.10.24)?
Edit:
Running the following command fails to update AWS CLI:
$ pip install --upgrade awscli
Requirement already up-to-date: awscli in /usr/local/lib/python2.7/dist-packages
Cleaning up...
Checking the version:
$ aws --version
aws-cli/1.2.9 Python/3.4.3 Linux/3.13.0-85-generic
From http://docs.aws.amazon.com/cli/latest/userguide/installing.html#install-with-pip
To upgrade an existing AWS CLI installation, use the --upgrade option:
pip install --upgrade awscli
On Linux and MacOS X, here are the three commands that correspond to each step:
$ curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
$ unzip awscli-bundle.zip
$ sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
This does not work:
pip install --upgrade awscli
This worked fine on Ubuntu 14.04 (no need to reboot either; you would have to install pip3 first):
pip3 install --upgrade awscli
For Ubuntu 16.04 I used parts of the other answers and comments and just reloaded bash instead of rebooting.
I installed the aws-cli using apt so I removed that first:
sudo apt-get remove awscli
Then I could pip install (I chose to use sudo to install globally with pip2):
sudo pip install -U awscli
Since I was doing this on a server I didn't want to reboot it, but reloading bash did the trick:
source ~/.bashrc
At this point I could use the new version of aws cli
aws --version
Update: Upgrade instance using AWS CLI v1 to AWS CLI v2:
This question and answer were initially created when there was only AWS CLI v1. There is now an AWS CLI v2; its installation instructions can be found here.
The new AWS CLI v2 has different installation instructions based on whether your EC2 instance is using Linux x86 (64-bit) or Linux ARM architecture.
To upgrade to AWS CLI v2, on an EC2 instance using Linux ARM, I had to issue the following commands:
rm -rf /bin/aws
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install -i /usr/local/aws -b /bin
Subsequently test your AWS CLI version by executing: aws --version
For the Linux x86 (64-bit) architecture I'm hoping the commands are the same except for replacing the curl command with the following: (as per the installation instructions)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o
"awscliv2.zip"
The AMI I used was the most recent one currently available and it was still using the AWS CLI v1. In the future if AWS starts packaging AWS CLI v2 with their AMIs this answer might require an update.
Original answer: Upgrade instance using AWS CLI v1 to use the most recent version of AWS CLI v1:
If you are having trouble installing the AWS CLI using pip you can use the "Bundled Installer" as documented here.
The steps discussed there are as follows:
$ curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
$ unzip awscli-bundle.zip
$ sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
Check your AWS CLI version subsequently as a sanity-check that everything executed correctly:
$ aws --version
If the AWS CLI didn't update to the latest version as expected, maybe the AWS CLI binaries are located somewhere other than where the previously given commands assume.
Determine where AWS CLI is being executed from:
$ which aws
In my case, AWS CLI was being executed from /bin/aws, so I had to install the
"Bundled Installer" using that location as follows:
$ sudo ./awscli-bundle/install -i /usr/local/aws -b /bin/aws
Try
sudo pip install --upgrade awscli, and open a new shell.
This worked well for me (no need to reboot).
Simply use
sudo pip install awscli --force-reinstall --upgrade
This will upgrade all the required modules.
On Mac you can use homebrew:
to install: brew install awscli
to upgrade: brew upgrade awscli
Make sure you don't have multiple installations: where aws
pip install awscli --upgrade --user
The --upgrade option tells pip to upgrade any requirements that are already installed. The --user option tells pip to install the program to a subdirectory of your user directory to avoid modifying libraries used by your operating system.
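One follow-up worth checking after a --user install (my note, not from the original answer): the aws entry point lands in ~/.local/bin, which must be on your PATH, otherwise the old binary keeps being found first.
# Put the per-user bin directory ahead of the system one, then confirm which binary runs
export PATH="$HOME/.local/bin:$PATH"
which aws
aws --version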
We can follow the commands below to install the AWS CLI on Ubuntu:
sudo apt install curl
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
rm -rf awscli-bundle.zip awscli-bundle
To test: aws --version
For More Info :
https://gurudathbn.wordpress.com/2018/03/31/installing-aws-cli-on-ubuntu/
When using sudo pip install --upgrade awscli I got the following error:
ERROR: Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: '/lib'
By using sudo with -H option, I could fix the problem.
sudo -H pip install --upgrade awscli
Currently, using pip will get you the old version of awscli, 1.18.103.
The latest version of aws-cli, 2.0.33 is on the v2 branch. You can download the installer for Linux, Windows and macOS from here.
I was trying to install awscli on one of my ec2 instances where I tried both
sudo pip install --upgrade awscli,
sudo pip3 install --upgrade awscli
which didn't work, as I was getting errors like
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-2nh71cs2/cryptography/
And rebooting the servers was not an option.
Luckily, a simple
sudo apt update
sudo apt install awscli worked.
I do it by removing and reinstalling the awscli as described in this video, basically:
pip uninstall awscliv2
pip install awscliv2
pip install awscliv2==your-version
(you can keep v1 along with v2 if you want)
pip install --upgrade ...
works as well, sure.
By the way, I do not install it globally (like some people still seem to do), because sometimes I need different versions for different cases, so I keep it in a separate Python virtual environment.
For windows you can try this command
msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi
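For an unattended install, the standard Windows Installer quiet flag can be appended (/qn is a generic msiexec flag, not something specific to the AWS package):
msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi /qn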
Install or update the AWS CLI on macOS
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg ./AWSCLIV2.pkg -target /
Done!
You can verify it with the command below:
aws --version
Try AWS CloudShell, quick and easy.
AWS CloudShell is a browser-based shell that makes it easy to securely manage, explore, and interact with your AWS resources. CloudShell is pre-authenticated with your console credentials.
Benefits
No extra credentials to manage
Always up to date
No cost
Customizable
More details here: https://aws.amazon.com/cloudshell/
To install globally, get sudo access with sudo su, and then upgrade the AWS CLI with:
pip3 install --upgrade awscli
