I am executing a Spark job on an AWS Elastic MapReduce (EMR) cluster. I had the same problem with boto3; however, I fixed that with my bootstrap script, which is as follows:
#!/bin/bash
sudo pip-3.6 install boto3
sudo yum update -y
I tried to do the same thing with psycopg2, but it is not working:
#!/bin/bash
sudo pip-3.6 install boto3 psycopg2
sudo python3 -m pip install psycopg2
sudo yum update -y
EMR version: 5.29.0
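For what it's worth, a bootstrap script along these lines may work; this is only a sketch, assuming the psycopg2 build fails because a C compiler and the PostgreSQL client headers are missing on the EMR nodes (package names are assumptions for the Amazon Linux AMI used by EMR 5.x):
#!/bin/bash
# Sketch only: install a compiler and the libpq headers first, since psycopg2
# builds a C extension against libpq (package names are assumptions)
sudo yum install -y gcc postgresql-devel
sudo pip-3.6 install boto3 psycopg2
# Alternative: the pre-built wheel avoids compiling from source
# sudo pip-3.6 install psycopg2-binary
sudo yum update -y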
I am using an EC2 Ubuntu 18.04 VM.
Due to CVE-2021-3177, Python needs to be upgraded to the latest version of Python 3.9, which is currently 3.9.9.
I did that using the deadsnakes option as per the steps mentioned below:
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get upgrade -y
sudo apt-get install python3.9
sudo apt-get update
sudo apt upgrade -y
The above ensures that Python 3.9.9 is now available. However, both python3.6 and python3.9 are now installed, so next we will use the update-alternatives command to make python3.9 the default version.
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 2
Now that the alternatives are defined, we will switch to option 2 (Python 3.9) as the default:
sudo update-alternatives --config python3
Once done, the following command will report the latest version.
sudo python3 -V
However, if you run the sudo apt update command, you will see an error like the following:
Traceback (most recent call last):
File "/usr/lib/cnf-update-db", line 8, in <module>
from CommandNotFound.db.creator import DbCreator
File "/usr/lib/python3/dist-packages/CommandNotFound/db/creator.py", line 11, in <module>
import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'
Reading package lists... Done
E: Problem executing scripts APT::Update::Post-Invoke-Success 'if /usr/bin/test -w /var/lib/command-not-found/ -a -e /usr/lib/cnf-update-db; then /usr/lib/cnf-update-db > /dev/null; fi'
E: Sub-process returned an error code
To fix this, we have to add a symbolic link using the following commands:
cd /usr/lib/python3/dist-packages/
sudo ln -s apt_pkg.cpython-{36m,39}-x86_64-linux-gnu.so
The following is optional; I tried with and without these commands:
sudo apt purge python3-apt
sudo apt install python3-apt
sudo apt install python3.9-distutils python3.9-dev
Once done, the following command will no longer result in any errors:
sudo apt update
This means that the issue is fixed.
But for some reason, I cannot connect to the machine afterwards, and if I create an AMI from it, I cannot connect to the launched instance using PuTTY or SCP either.
The same issue occurs with Ubuntu 20.x too.
Appreciate your help.
After upgrading Python, several Python modules that cloud-init depends on break. This prevents cloud-init from correctly configuring your newly booted EC2 instance, which is why it is inaccessible. The affected modules are:
setuptools
urllib3
requests
jinja2
netifaces
You can debug this issue by going to your EC2 instance in the AWS Web Console and clicking:
Actions -> Monitor and troubleshoot -> Get system log
Sometimes it takes a while to update, so click the refresh button until your logs appear. It is easier to read the logs if you download them. This is what helped me solve the issues that I was having.
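If you prefer the command line, roughly the same log can be fetched with the AWS CLI (the instance ID below is a placeholder):
# Fetch the instance's console/system log as plain text
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text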
The following steps resolved the issue for me on Ubuntu 18.04 LTS:
For Ubuntu 20.04 LTS, change the 36m in the symbolic links to 38.
# Add deadsnakes ppa repository
sudo add-apt-repository ppa:deadsnakes/ppa
# Install new python version
sudo apt update
sudo apt install python3.10
# Fix broken apt_inst after python upgrade
sudo ln -s /usr/lib/python3/dist-packages/apt_inst.cpython-36m-x86_64-linux-gnu.so /usr/lib/python3/dist-packages/apt_inst.so
# Fix broken apt_pkg after python upgrade
sudo ln -s /usr/lib/python3/dist-packages/apt_pkg.cpython-36m-x86_64-linux-gnu.so /usr/lib/python3/dist-packages/apt_pkg.so
# Make installed python version an alternative with a priority of 2
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 2
# Make upgraded python version an alternative with a priority of 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
# Reinstall python3-apt
sudo apt remove --purge python3-apt
sudo apt autoclean
sudo apt install python3-apt
# Install required packages
sudo apt install \
build-essential \
python3.10-distutils \
python3.10-venv \
libpython3.10-dev
# Install latest pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python3.10 get-pip.py
# Upgrade outdated python libraries that break cloud-init
sudo -i
pip3 install --upgrade setuptools
pip3 install --upgrade urllib3
pip3 install --upgrade requests
pip3 install --upgrade jinja2
pip3 install --upgrade netifaces
pip3 install --upgrade --ignore-installed pyyaml
exit
# Upgrade cloud-init to latest version
sudo apt install --only-upgrade cloud-init
If you use Ansible, it is also affected by the upgrade.
Ansible can be fixed as follows:
Edit /usr/lib/python3/dist-packages/apt/package.py and change the following line:
from collections import Mapping, Sequence
to:
from collections.abc import Mapping, Sequence
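If you want to script that edit, a one-liner along these lines should do it (a sketch; it keeps a backup of the original file):
# Patch the import in place, keeping a backup as package.py.bak
sudo sed -i.bak 's/^from collections import Mapping, Sequence$/from collections.abc import Mapping, Sequence/' /usr/lib/python3/dist-packages/apt/package.py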
It would be useful if the deadsnakes repository could provide an update for python3-apt (e.g. python3.10-apt) to solve this issue.
Reference:
https://cloudbytes.dev/snippets/upgrade-python-to-latest-version-on-ubuntu-linux
I have to install pyodbc module in Databricks.
I have tried using this command (pip install pyodbc), but it failed with the error shown below.
[Error message screenshot]
I was having the same issue for installation. This is what I tried and it worked.
Databricks does not ship with a default ODBC driver. Run the following commands in a single cell to install the MS SQL ODBC driver:
%sh
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get -q -y install msodbcsql17
Run this in notebook
dbutils.fs.put("/databricks/init/<YourClusterName>/pyodbc-install.sh","""
#!/bin/bash
sudo apt-get update
sudo apt-get -q -y install unixodbc unixodbc-dev
sudo apt-get -q -y install python3-dev
/databricks/python/bin/pip install pyodbc
""", True)
Restart the cluster
Import pyodbc in Code
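Before importing pyodbc, it can help to confirm that unixODBC actually sees the driver; a quick check in a notebook cell (just a sketch) is:
%sh
# List the ODBC drivers registered with unixODBC; "ODBC Driver 17 for SQL Server"
# should appear after installing msodbcsql17 above
odbcinst -q -d
# Show where unixODBC looks for its configuration files
odbcinst -j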
I had some problems a while back with connecting using pyodbc; details of my fix are here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
I think the problem stems from PYTHONPATH on the Databricks clusters being set to the Python 2 install.
I suspect the lines:
%sh
apt-get -y install unixodbc-dev
/databricks/python/bin/pip install pyodbc
will work for you.
Update: Even simpler (though you will still need unixodbc-dev from above):
%sh
sudo apt-get install python3-pip -y
pip3 install --upgrade pyodbc
Right-click the Workspace folder where you want to store the library.
Select Create > Library.
See https://docs.databricks.com/user-guide/libraries.html for detailed information.
I am trying to set up a skill using Shapely on Lambda. I got the error
module initialization error: Could not find lib geos_c or load any of its variants ['libgeos_c.so.1', 'libgeos_c.so'].
There's a similar question for Python 2.7. I can't use the lambda-packs by ryfeus because I'm on Python 3.6, but I figured the EC2 approach described by Graeme should work.
So I started up an EC2 instance using the public Amazon Linux AMI version from the AWS docs.
I then ran these commands:
$ sudo yum -y update
$ sudo yum -y install python36 python36-virtualenv python36-pip
$ mkdir ~/forlambda
$ cd ~/forlambda
$ virtualenv -p python3 venv
$ source venv/bin/activate
and then installed Shapely and a few other packages I needed.
$ sudo yum -y groupinstall "Development Tools"
$ pip install python-dateutil
$ pip install shapely
$ pip install pyproj
$ pip install pyshp
I then ran my skill (on the EC2 instance), and it works! So I copied the files in venv/lib/python3.6/site-packages, plus myskill.py, zipped them up, uploaded to Lambda, and still get the geos_c error shown above :(
I have been able to upload a scaled-down version of my skill (minus Shapely, but including other packages that don't come with Lambda) and it works on Lambda, so I don't think it's an error in how I am zipping or uploading.
Am I missing something? Does it make a difference that the Development Tools were installed using "sudo yum install" instead of "pip install"?
For some reason, the pip install of Shapely and Pyproj didn't end up in the virtualenv site-packages. From a fresh EC2 instance, I ran these commands:
$ sudo yum -y update
$ sudo yum -y install python36 python36-virtualenv python36-pip
$ mkdir ~/forlambda
$ cd ~/forlambda
$ virtualenv -p python3 venv
$ source venv/bin/activate
(venv) $ sudo yum -y groupinstall "Development Tools"
(venv) $ pip install python-dateutil
(venv) $ pip install shapely -t ~/forlambda/venv/lib/python3.6/site-packages/
(venv) $ pip install pyproj -t ~/forlambda/venv/lib/python3.6/site-packages/
(venv) $ pip install pyshp
and then zipped up all the contents of site-packages/ plus myskill.py, uploaded to Lambda, and it worked.
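A sketch of that packaging and upload step (the zip path and function name are placeholders, and the AWS CLI upload is just one option):
# Zip the site-packages contents plus the handler, then upload to Lambda
cd ~/forlambda/venv/lib/python3.6/site-packages/
zip -r9 ~/forlambda/lambda_deploy.zip .
cd ~/forlambda
zip -g lambda_deploy.zip myskill.py
aws lambda update-function-code --function-name my-skill --zip-file fileb://lambda_deploy.zip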
I want to run a pyspark job through Google Cloud Platform Dataproc, but I can't figure out how to set up pyspark to run Python 3 instead of 2.7 by default.
The best I've been able to find is adding these initialization commands
However, when I SSH into the cluster:
(a) the python command is still Python 2, and
(b) my job fails due to a Python 2 incompatibility.
I've tried uninstalling Python 2 and also adding alias python='python3' in my init.sh script, but alas, no success. The alias doesn't seem to stick.
I create the cluster like this:
cluster_config = {
    "projectId": self.project_id,
    "clusterName": cluster_name,
    "config": {
        "gceClusterConfig": gce_cluster_config,
        "masterConfig": master_config,
        "workerConfig": worker_config,
        "initializationActions": [
            {
                "executableFile": executable_file_uri,
                "executionTimeout": execution_timeout,
            }
        ],
    },
}
credentials = GoogleCredentials.get_application_default()
api = build('dataproc', 'v1', credentials=credentials)
response = api.projects().regions().clusters().create(
    projectId=self.project_id,
    region=self.region,
    body=cluster_config,
).execute()
My executable_file_uri sits on Google Storage; here is init.sh:
apt-get -y update
apt-get install -y python-dev
wget -O /root/get-pip.py https://bootstrap.pypa.io/get-pip.py
python /root/get-pip.py
apt-get install -y python-pip
pip install --upgrade pip
pip install --upgrade six
pip install --upgrade gcloud
pip install --upgrade requests
pip install numpy
I found an answer to this here; my initialization script now looks like this:
#!/bin/bash
# Install tools
apt-get -y install python3 python-dev build-essential python3-pip
easy_install3 -U pip
# Install requirements
pip3 install --upgrade google-cloud==0.27.0
pip3 install --upgrade google-api-python-client==1.6.2
pip3 install --upgrade pytz==2013.7
# Setup python3 for Dataproc
echo "export PYSPARK_PYTHON=python3" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "export PYTHONHASHSEED=0" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf
The Dataproc documentation page Configure the cluster's Python environment explains this in detail. Basically, you need init actions for image versions before 1.4; in 1.4+ the default is Python 3 from Miniconda3.
You can also use the Conda init action to setup Python 3 and optionally install pip/conda packages: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/conda.
Something like:
gcloud dataproc clusters create foo --initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh
There are a couple of ways to select the Python interpreter for pyspark.
1. If you want to set python3 as the default, set export PYSPARK_PYTHON=python3 while creating the Dataproc cluster. I added a couple of lines to the init script:
sudo echo "export PYSPARK_PYTHON=python3" | sudo tee -a /etc/profile.d/effective-python.sh
source /etc/profile.d/effective-python.sh
2. Otherwise, it's also possible to specify the Python version through --properties when submitting a pyspark job to the Dataproc cluster. The Python version can be passed in the following way:
--properties spark.pyspark.python=python3.7,spark.pyspark.driver.python=python3.7
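For example, a job submission might look roughly like this (the cluster, region, and file names are placeholders):
# Override the Python interpreter for a single pyspark job
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster --region=us-central1 \
    --properties="spark.pyspark.python=python3.7,spark.pyspark.driver.python=python3.7"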
I am trying to list the contents of an Amazon S3 bucket using the following command (documentation):
aws s3 ls s3://mybucket --recursive
However, I get the following error:
Unknown options: --recursive
The following is the version information for my Ubuntu Linux EC2 instance:
$ aws s3 ls --version
aws-cli/1.2.9 Python/3.4.3 Linux/3.13.0-85-generic
How can I enable the --recursive option on my aws-cli?
Support for aws s3 ls --recursive was added in version 1.2.11; you are using the outdated version 1.2.9. Please upgrade to the latest version:
pip install -U awscli
If you have installed aws-cli using apt-get install awscli on Ubuntu, it installs an older version of the AWS CLI.
You can install the latest aws-cli using pip; make sure pip is installed on your system. Install aws-cli using this command:
pip install -U awscli
To install pip, you can use the following commands:
sudo apt-get install python-pip
sudo apt-get install python3-pip   # for Python 3
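Putting it together, a rough sequence (using the bucket from the question) would be:
# Install pip, upgrade the AWS CLI, and verify the new version supports --recursive
sudo apt-get install -y python3-pip
pip3 install -U awscli
aws --version
aws s3 ls s3://mybucket --recursive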