spark-submit python packages with venv cannot run program

spark-submit python packages with venv cannot run program - apache-spark

I was following this article to encapsule the fuzzy-c-means lib to run on a spark cluster, I'm using bitnami/spark image on docker. I've used a python image to build a venv with python 3.7 and install the fuzzy-c-means lib. then i used the venv-pack to compress the venv in a environment.tar.gz file.
I have a app.py file:
from pyspark.sql import SparkSession
def main(spark):
import fcmeans
print('-')
if __name__ == "__main__":
print('log')
spark = (
SparkSession.builder
.getOrCreate()
)
main(spark)
So when I run my spark-submit code I got the error: Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory.
spark-sumit code:
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives ./environment.tar.gz#environment ./app.py
I can run the app.py with the .tar.gz file if I remove the statement PYSPARK_PYTHON but I'll have the no module named 'fcmeans' for the import in my app.py.
The thing is, when run --archives ./environment.tar.gz#environment it unpack the tar.gz files in the /tmp/spark-uuid-code/userFiles-uuid-code/environment/
And when i set the PYSPARK_PYTHON it not recongnizes the path to the file has a valid file, but it seens that the spark should manage this.
Any hints of what I should do?

I've managed to make it work by creating the virtualenv inside the EMR cluster, then exporting the .tar.gz file with venv-pack to a S3 bucket. This article helped: gist.github.
Inside the EMR shell:
# Create and activate our virtual environment
virtualenv -p python3 venv-datapeeps
source ./venv-datapeeps/bin/activate
# Upgrade pip and install a couple libraries
pip3 install --upgrade pip
pip3 install fuzzy-c-means boto3 venv-pack
# Package the environment and upload
venv-pack -o pyspark_venv.tar.gz
aws s3 cp pyspark_venv.tar.gz s3://<BUCKET>/artifacts/pyspark/

Related

Python packages for pyspark on AWS EMR notebook will not import

I recently started using pyspark on an AWS EMR cluster to do some analytics. I want to install and use some packages for this, mostly for plotting. Some of the installs happen in this bootstrap script
#!/bin/sh
set -e -x
# boto 3
pip3 install boto3 --user
# pandas
pip3 install Cython --user
pip3 install numpy==1.21.1 --user
pip3 install pandas==1.3.0 --user
# plotnine
pip3 install plotnine --user
# matplotlib
pip3 install plotnine --user
# delta table jar
spark-shell --packages io.delta:delta-core_2.11:0.3.0
The packages do not get installed for either python3 or pyspark. I can however go into a python3 console and pip install. They will then work with python3, but not pyspark, and installing them on a pyspark console does not work either.
My last ditch effort was using sc.install_pypi_package(...) on pyspark but this won't work with all the packages.
For context here is some info on the EMR
release label: emr-6.3.1
Hadoop dist: Amazon 3.2.1
Applications: spark 3.1.1, livy 0.7.0, Hive 3.1.2, JupyterEnterpirceGateway 2.1.0
Running on 1 c4.4xl instance as the master node

Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark

I installed pyspark 3.2.0 via pip install pyspark. I have installed pyspark in a conda environment named pyspark. I cannot find spark-defaults.conf. I am searching for it in ~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark since that is my understanding of what SPARK_HOME should be.
Where can I find spark-defaults.conf? I want to modify it
Am I right in setting SPARK_HOME to the installation location of pyspark ~/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark?

2. The SPARK_HOME environment variables are configured correctly.
1. In the pip installation environment, the $SPARK_HOME/conf directory needs to be created manually, then copy the configuration file template to this directory and modify each configuration file.

Why do I get error ModuleNotFoundError: No module named 'azure.storage' during execution of my azure python function?

I am currently deploying using Python 3.9 to install the depencies and setup the azure function to also use Python 3.9
Here is the requirements file I am currently using
msrest==0.6.16
azure-core==1.6.0
azure-functions
azure-storage-blob==12.5.0
pandas
numpy
pyodbc
requests==2.23.0
snowflake-connector-python==2.4.0
azure.identity
azure.keyvault.secrets==4.1.0
azure.servicebus==0.50.3
pyarrow==3.0.0
stopit==1.1.2
The bash script to install the required dependencies during the build definition
python3.9 -m venv worker_venv
source worker_venv/bin/activate
pip3.9 install setuptools
pip3.9 install --upgrade pip
pip3.9 install -r requirements.txt
My python scripts are using the following imports
import logging
from azure.storage.blob import *
import datetime
import azure.functions as func
import json
The most helpful article I could find was
https://learn.microsoft.com/en-us/azure/azure-functions/recover-python-functions?tabs=coretools
As a work-around I tried the remote build option using command:
func azure functionapp publish . Interestingly enough when I use that command the error disappears during execution and the function works as expected. I would like to enable the automatic build and deploy process again which did work until I needed to include the pyarrow library.
Any suggestions on what I am doing incorrectly?

I was able to download the content which was generated by the remote build. I then discovered it has a .python_packages folder. I now updated my install dependencies bash script to the example below which mimics how the remote build creates the the .python_packages .In essence I am copying the downloaded packages from worker_venv/lib64/python3.9/site-packages to .python_packages/lib/site-packages. My function is now executing without any errors anymore.
python3.9 -m venv worker_venv
source worker_venv/bin/activate
pip3.9 install setuptools
pip3.9 install --upgrade pip
pip3.9 install -r requirements.txt
mkdir .python_packages
cd .python_packages
mkdir lib
cd lib
mv ../../worker_venv/lib64/python3.9/site-packages .

Spark installation for production, pip install or not?

I would like to install Pyspark 2.4.4. I have seen that I can download the Spark package or use pip install. I only need Pyspark, are they the same with both installations?

you could do python pip install pyspark but it doesn't come with Hadoop binaries which is necessary for the spark to function properly.
The easiest way to install is by using python findspark
download .tgz file from the spark website which comes with Hadoop binaries
pip install findspark
In Python:
import findspark
finspark.init('\path\to\extracted\binaries\folder')
import pyspark

ImportError: No module named 'psycopg2._psycopg'

When I try to import psycopg2 it show below log for me:
Traceback (most recent call last):
File "D:/Desktop/learn/python/webcatch/appserver/testpgsql.py", line 2, in <module>
import psycopg2
File "D:/Desktop/learn/python/webcatch/appserver/webcatch/lib/site-packages/psycopg2-2.6.1-py3.5-win32.egg/psycopg2/__init__.py", line 50, in <module>
from psycopg2._psycopg import BINARY, NUMBER, STRING, DATETIME, ROWID
ImportError: No module named 'psycopg2._psycopg'
How can I solve it?
My platform is win10 (64) and version is python 3.5

Eureka! I pulled my hair out for 2 days trying to get this to work. Enlightenment came from this SO Question. Simply stated, you probably installed psycopg2 x64 version like I did, not realizing your python version was 32-bit. Unistall your current psycopg2, then:
Download: psycopg2-2.6.1.win32-py3.4-pg9.4.4-release.exe from HERE, then run the following in a Terminal:
C:\path\to\project> easy_install /path/to/psycopg2-2.6.1.win32-py3.4-pg9.4.4-release.exe
C:\path\to\project> python manage.py makemigrations
C:\path\to\project> python manage.py migrate
You may also need to (re)create super user with:
C:\path\to\project> python manage.py createsuperuser

I had the same problem, solved it in this way:
Reinstall the package psycopg2 using pip (by default installed with python 3)
On Linux:
pip uninstall psycopg2
Confirm with (y) and then:
pip install psycopg2
On Windows I add the prefix ('python -m') to the commands above.
I think the problem occurs when you change the version of Python. (Even between minor versions such as Python 3.5 and 3.6).

I am using psycopg in an AWS Glue Job, where is harder to follow the instructions listed in the other answers.
What I did is installing psycopg2-binary into a directory and zip up the contents of that directory:
mkdir psycopg2-binary
cd psycopg2-binary
pip install psycopg2-binary -t .
# in case using python3:
# python3 -m pip install --system psycopg2-binary -t .
zip -r9 psycopg2.zip *
I then copied psycopg2.zip to an S3 bucket and add it as an extra Python library under "Python library path" in the Glue Spark job.
I then launched the job with the following script to verify if psycopg2 is present (the zip file will be downloaded by Glue into the directory in which the Job script is located)
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import sys
import os
import zipfile
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
zip_ref = zipfile.ZipFile('./psycopg2.zip', 'r')
print os.listdir('.')
zip_ref.extractall('/tmp/packages')
zip_ref.close()
sys.path.insert(0, '/tmp/packages')
import psycopg2
print(psycopg2.__version__)
job.commit()

Download the compiled version of psycopg2 from this link https://github.com/jkehler/awslambda-psycopg2. As psycopg2 is C library for python, which need to be compiled on linux to make it work. The compile instruction also given on that link. Thanks to the https://github.com/jkehler.

This also happens to me in new Ubuntu 18.04. It is caused by missing one file _psycopg.py in the /usr/local/lib/python3.7/site-packages/psycopg2.
It is fixed by:
remove the old psycopg2 from your machine pip3 uninstall psycopg2.
download new pyscopg2 manually from the official page http://initd.org/psycopg/tarballs/PSYCOPG-2-7/psycopg2-2.7.7.tar.gz
tar xvf psycopg2-2.7.7.tar.gz
python setup.py build
sudo python setup.py install

I had this happen in Linux using Python 3.7. It is caused by missing one file _psycopg.cpython-37m-x86_64-linux-gnu.so in the /usr/local/lib/python3.7/site-packages/psycopg2.
I downloaded _psycopg.cpython-37m-x86_64-linux-gnu.so from https://github.com/jkehler/awslambda-psycopg2/tree/master/psycopg2-3.7, and Copied this file into my anaconda lib.

I had this happen in Linux using Python 2 because I had accidentally had my PYTHONPATH set to Python 3 libraries, and it was trying to load the python3 version of psycopg2. Solution was to unset PYTHONPATH.

I had the same error on Windows, this worked for me:
pip install -U psycopg2
I had an older version installed, must have depreciated

For lambda functions on Python 3.7, I ended up using the psycopg2-binary library mentioned in these threads:
https://github.com/jkehler/awslambda-psycopg2/issues/51
Using psycopg2 with Lambda to Update Redshift (Python)
pip3 install psycopg2-binary==2.8.3
Snippet from these links:
I ended up using a different library: psycopg2-binary in my requirement.txt file and it working fine now.
solved it by using psycopg2-binary==2.8.3

I came to know that most times the WINDOWS packaging does not go fine with LAMBDA.
I faced same issue while running LAMBDA with WINDOWS installed 3rd party pscyopg2 packaging.
Solution:
step1>
I installed psycopg2 in Linux.
Copied both the directories psycopg2_binary-2.8.2.dist-info and psycopg2 from Linux to windows.
step2>
Along with source *.py, packaged with copied 3rd party dependencies psycopg2 in windows to *.zip file
step3>
Upload the file to LAMBDA - Here it goes, It runs successfully without any error.

Windows 10 with conda environment manager (fresh install of Django, wagtail with PostgreSQL), had the same error. Removed psycopg2
conda remove -n myenv psycopg2
it updated some packages, removed others (it also removed django, wagtail...). Then installed psycopg2 back
conda install -n myenv psycopg2
Tested it, import worked
python
>>> import psycopg2
Installed django, wagtail back. python manage.py migrate now populated PostgreSQL.

In my case, it was other site-packages that was exposed by installing pgcli, uninstalling pgcli resolved the issue for the time being.
This somehow penetrated virtualenv too.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

spark-submit python packages with venv cannot run program - apache-spark

Related

Python packages for pyspark on AWS EMR notebook will not import

Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark

Why do I get error ModuleNotFoundError: No module named 'azure.storage' during execution of my azure python function?

Spark installation for production, pip install or not?

ImportError: No module named 'psycopg2._psycopg'

Categories

Resources