I used the apache/airflow:2.3.3 Docker image to set up Airflow. I have a pyenv installed on another drive, and I bind-mounted the pyenv volume when creating the Airflow containers. I also updated the PATH environment variable in the containers to point to the Python versions available in the mounted pyenv volume. But Airflow still uses the Python (3.7.7) that comes with the apache/airflow:2.3.3 image. Is it possible to make Airflow use a different Python environment? I'm looking for any additional environment variables that need to be updated so that Airflow picks up another Python environment.
Have you taken a look at
PythonVirtualenvOperator
ExternalPythonOperator <-- this requires an upgrade to Airflow 2.4.0 or later
Both of these offer the flexibility to call Python callables inside a different Python environment.
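For example, once on Airflow 2.4+, a minimal DAG sketch using ExternalPythonOperator and pointing at an interpreter from your mounted pyenv volume could look like this (the interpreter path below is hypothetical; substitute whatever you mounted):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import ExternalPythonOperator


def print_python_version():
    import sys
    print(sys.version)


with DAG(dag_id="external_python_example", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    ExternalPythonOperator(
        task_id="run_in_pyenv",
        python="/opt/pyenv/versions/3.10.8/bin/python",  # hypothetical path to the mounted interpreter
        python_callable=print_python_version,
    )

The scheduler and webserver keep running on the image's bundled Python; only the task's callable executes in the external interpreter.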
I am attempting to run a Dataflow job using Apache Beam v2.25 and Python 3.7. Everything runs OK when using DirectRunner, but the job errors out when it attempts to invoke a function from another private Python module.
The error is
AttributeError: Can't get attribute '_create_code' on <module 'dill._dill' from '/usr/local/lib/python3.7/site-packages/dill/_dill.py'>
My setup file looks like this:
import setuptools
from setuptools import setup

# REQUIRED_PACKAGES is defined elsewhere in this file
setup(
    name="Rich Profile and Activiation reports",
    version="0.1",
    description="Scripts for reports",
    author="Kim Merino",
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    include_package_data=True,
    package_data={"": ["*.json"]},
)
My question is: what version of dill should I be using for Apache Beam v2.25? I am currently using dill v0.3.3.
I have an external dependency that requires dill in order to work.
Have you tried setting --save_main_session=True when running the pipeline?
https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
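If you construct the pipeline options in code rather than on the command line, a minimal sketch with the standard Beam SDK options looks like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()  # or PipelineOptions(your_argv)
options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=options) as p:
    ...  # build your pipeline here

save_main_session pickles the state of the main module (imports, globals) so that workers can resolve names defined there.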
I had a similar issue with the same error message just now. What solved it was to make sure that I was using a Python virtual environment dedicated to my Beam/Dataflow code, in other words something like:
cd [code dir]
python -m venv env
source ./env/bin/activate
pip install apache-beam[gcp]
pip freeze > requirements.txt
Previously I was lazily using my global copy of Python 3. I'm not entirely sure why this fixed it, but I might have been using an incompatible version of dill (which is just a pickling-related library used by Beam).
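A quick way to confirm which versions actually end up in the environment you launch from (assuming both packages import cleanly) is:

import apache_beam
import dill

# Compare these against the versions your Beam release expects.
print("apache-beam:", apache_beam.__version__)
print("dill:", dill.__version__)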
When you either package your Beam code as a Classic Template or invoke Dataflow directly from your laptop, your code ends up getting pickled and sent to Dataflow. So if you are using a Python installation with an incompatible module, that can cause issues.
And of course in any case, it's a best practice to have a virtual environment for your repo, so this can narrow down the universe of possible issues.
I removed the external dependency and kept the requirements to a minimum
I have been working in Databricks notebooks using Python/R. Once a job is done we need to terminate the cluster to save cost (since we are paying for the running machine).
So we also have to start the cluster again whenever we want to work on a notebook. I have seen that this takes a lot of time and installs the packages again on the cluster. Is there any way to avoid the installation every time we start the cluster?
Update: Databricks now allows custom docker containers.
Unfortunately not.
When you terminate a cluster its memory state is lost, so when you start it again it comes up with a clean image. Even if you add the desired packages to an init script, they will have to be installed at each initialization.
You may ask Databricks support to check if it is possible to create a custom cluster image for you.
I am using a conda env to install the packages. After the first installation, I save the environment as a YAML file in DBFS and use the same YAML file in all other runs. This way I don't have to install the packages again.
Save the environment as a conda YAML specification.
%conda env export -f /dbfs/filename.yml
Import the file to another notebook using conda env update.
%conda env update -f /dbfs/filename.yml
List the packages -
%conda list
Here are the steps I have taken so far:
1. I installed Anaconda3, with everything included under the directory $HOME/anaconda3/bin.
2. I cd'ed into $HOME/anaconda3/bin and ran the command ./conda install -c conda-forge pyspark. It was successful.
3. I didn't do anything else. More specifically, there are no variables set in my .bashrc.
Here are some important details:
I am on a distributed cluster running Hadoop, so there might be other directories outside of my home folder that I have yet to discover but might need. I also don't have admin access.
Jupyter notebook runs just fine.
Here is my goal:
Goal: to do something along the lines of adding variables or configuring some files so that I can run pyspark in a Jupyter Notebook.
What are other steps I need to do after Step 3 in order to achieve this goal?
Since you have installed pyspark with conda, and as you say Jupyter notebook runs fine (presumably for the same Anaconda distribution), there are no further steps required - you should be able to open a new notebook and import pyspark.
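As a quick sanity check, a minimal cell in a new notebook (the app name is arbitrary) might be:

import pyspark
from pyspark.sql import SparkSession

# Start a local Spark session just to confirm the conda-installed pyspark works.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()
print(pyspark.__version__)
spark.range(5).show()
spark.stop()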
Notice though that installing pyspark that way (i.e. with pip or conda) gives only limited functionality; from the package docs:
The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Installing pyspark with pip or conda is a relatively recent add-on, aimed at the cases described in the docs above. I don't know what limitations you may face (have never tried it) but if you need the full functionality, you should download the full Spark distribution (of which pyspark is an integral part).
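If you do end up with the full Spark distribution, one common way to make it importable from a Jupyter notebook is the third-party findspark helper (a sketch, assuming SPARK_HOME points at the unpacked distribution and findspark has been installed with pip or conda):

import findspark
findspark.init()  # reads SPARK_HOME; you can also pass the path explicitly, e.g. findspark.init("/path/to/spark")

import pyspark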
What's the "correct" way to set the sys path for the Python worker nodes?
Is it a good idea for worker nodes to "inherit" the sys path from the master?
Is it a good idea to set the path on the worker nodes through .bashrc? Or is there some standard Spark way of setting it?
A standard way of setting environment variables, including PYSPARK_PYTHON, is to use the conf/spark-env.sh file. Spark comes with a template file (conf/spark-env.sh.template) which explains the most common options.
It is a normal bash script, so you can use it the same way as you would use .bashrc.
You'll find more details in the Spark Configuration Guide.
With the following command you can change the Python path for the current job only, which also allows a different Python path for the driver and the executors:
PYSPARK_DRIVER_PYTHON=/home/user1/anaconda2/bin/python PYSPARK_PYTHON=/usr/local/anaconda2/bin/python pyspark --master ..
You may do either of the below -
In config:
Update SPARK_HOME/conf/spark-env.sh and add the lines below:
# for pyspark
export PYSPARK_PYTHON="path/to/python"
# for driver, defaults to PYSPARK_PYTHON
export PYSPARK_DRIVER_PYTHON="path/to/python"
OR
In the code, add:
import os

# Set the Spark Python environment; this must happen before the SparkContext/SparkSession is created.
os.environ['PYSPARK_PYTHON'] = 'path/to/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'path/to/python'
The error in my case was:
Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions
The solution that helped:
export PYSPARK_PYTHON=python2.7
export PYSPARK_DRIVER_PYTHON=python2.7
jupyter notebook
Of course, I installed python2.7 locally on the workers.
I suppose it is also important that I set the PATH.
I did not rely on the local workers' settings. The path was inherited from the settings on the edge node where jupyter-notebook runs.
I have an account on a computing cluster that uses Scientific Linux. Of course I only have user access. I'm working with Python and I need to run Python scripts, so I need to import some Python modules. Since I don't have root access, I installed a local Python copy in my $HOME with all the required modules. When I run the scripts from my account (on the host node), they run correctly. But in order to submit jobs to the computing queues (to process on much faster machines), I need to submit a bash script that has a line that executes the scripts. The computing cluster uses Sun Grid Engine. However, when I submit the bash script, I get an error that the modules I installed can't be found! I can't figure out what is wrong. I hope you can help.
You could simply call your python program from the bash script with something like:
PYTHONPATH=$HOME/lib/python /path/to/my/python my_python_script
I don't know how Sun Grid Engine works, but if it uses a different user than yours, you'll need global read access to your $HOME, or at least to the Python libraries.
First, whether or not this solution works for you depends heavily on how the cluster is set up. That said, the general solution to your problem is below. If the compute cluster has access to the same files as you do in your home directory, I see no reason why this would not work.
You need to be using a virtualenv. Install your software inside that virtualenv along with any additional python packages you need. Then in your batch bash script, provide the full path to the python interpreter within that virtualenv.
Note: to install python packages inside your virtualenv, you need to use the pip instance that is in your virtualenv, not the system pip.
Example:
$ virtualenv foo
$ cd foo
$ ./bin/pip install numpy
Then in your bash script:
/path/to/foo/bin/python /path/to/your/script.py
Have you tried adding these lines to your Python code:
import sys
sys.path.append("..")
from myOtherPackage import myPythonFile
This works very well for my code when I run it on the cluster and want to call myPythonFile from the other package myOtherPackage.
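If the relative "..", which is resolved against the current working directory, turns out to be fragile on cluster nodes, a slightly more robust variant (assuming myOtherPackage lives one directory above the running script) is:

import os
import sys

# Resolve the parent directory relative to this file, not the current working directory.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from myOtherPackage import myPythonFile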