Cannot call Keras from Slurm - keras

I want to use Keras on a cluster using Slurm as the job engine.
If I open a terminal and run the following commands, everything works fine:
$python
>>> import tensorflow
>>> import keras
However, if I place import tensorflow and import keras in a Python file that I then call from Slurm:
srun [bunch of parameters for my cluster] python mypythonfile.py
Then I get the following error: ImportError: No module named keras.
Is there something specific to do when using Keras in a cluster with Slurm?

I'm reiterating my comment just to show that this question has been answered:
It's common to module load xxxx where xxxx is a different Python installation than the default. You usually stick this in your .bash_profile or a similar file to make sure that you have the Python version you want, always available.
When you submit a job with Slurm, it doesn't call your .bash_profile. It just executes the script. You need to make sure that loading your Python distribution is part of that script.

I also met this problem and solved it recently. As the above answer mentioned, the slurm command won't execute .bash_profile, and the reason you can import keras directly in an interactive python session is the environment setup line in .bash_profile. As a result, I added export PATH="/n/home01/username/miniconda3/bin:$PATH" to my sbatch file and then everything worked.
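To make that concrete, a minimal sbatch script along these lines might look like the sketch below; the module name and the Slurm options are placeholders for whatever your cluster actually provides:
#!/bin/bash
#SBATCH --job-name=keras_job
#SBATCH --ntasks=1
# Load the same Python environment you use interactively, since .bash_profile is not sourced.
module load python/3.9   # placeholder module name, adjust to your cluster
# ...or put your conda/miniconda bin directory on PATH instead:
# export PATH="/n/home01/username/miniconda3/bin:$PATH"
python mypythonfile.py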

Related

Why are these import errors occurring when running python scripts from cmd or windows task scheduler, but not anaconda?

I am encountering import errors, but only when running my Python scripts from cmd or Windows Task Scheduler (effectively the same issue, I assume). I have researched answers already and attempted various solutions (detailed below), but nothing has worked yet. I need to understand the problem in any case so that I can manage anything like it in the future.
Here is the issue:
Windows 10. Anaconda Python 3.9.7. Virtual environment.
I have a script that works fine if I open an anaconda prompt, activate the virtual environment and run it.
However, this is where the fun starts. If I try to run the script from a non-anaconda cmd prompt with the command "C:\Users\user\anaconda3\envs\venv\python.exe" "C:\Users\user\scripts\script.py", I get the following error:
ImportError: DLL load failed while importing etree: The specified module could not be found.
Traceback includes:
"C:\Users\user\anaconda3\envs\venv\lib\site-packages\lxml\html\__init__.py", line 53, in <module>
from .. import etree
This is not as simple as one specific module not being installed, because of course running the script from within the anaconda prompt and the virtual environment works. Similar also happens when I run other scripts. Other errors I have seen include, for example:
ImportError: DLL load failed while importing _imaging: The specified module could not be found.
Traceback includes:
"C:\Users\user\anaconda3\envs\venv\lib\site-packages\PIL\Image.py", line 114, in <module>
from . import _imaging as core
Also, I think this may be somehow related: importing numpy (1.22.3) from within the Python interpreter in the virtual environment works fine, but when I try to run a test script that imports numpy, it fails both from anaconda and from cmd with the following error:
ImportError: cannot import name SystemRandom
The overall issue was noted originally when trying to run various scripts from Windows Task Scheduler, with the path to python "C:\Users\user\anaconda3\envs\venv\python.exe" entered as the Program/script and the script "script.py" entered as an argument. The above errors were produced, then reproduced by running the scripts from a non-anaconda cmd.
I am looking to understand what is happening here and for a solution that can get the scripts running from the virtual environment via Windows Task Scheduler.
Update:
I have uninstalled and reinstalled numpy (and pandas) using conda. This has left the venv with numpy==1.20.3 (and pandas=1.4.2). On attempting to re-run one of the scripts, it runs fine from within the venv in anaconda, but produces the following error when attempting to run from cmd or from within Windows Task Scheduler as above:
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for many reasons, often due to issues with your setup or how NumPy was installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.9 from "C:\Users\user\anaconda3\envs\venv\python.exe"
* The NumPy version is "1.20.3"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: DLL load failed while importing _multiarray_umath: The specified module could not be found.
I have looked into the solutions suggested, but I am still completely at a loss, especially as to why the script runs from the venv in one place but NOT the other.
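One workaround often suggested for this class of "DLL load failed" errors (not from the original thread, so treat it as an untested sketch) is to have Task Scheduler call a small wrapper batch file that activates the conda environment before running the script, so the environment's DLL directories end up on PATH:
@echo off
REM Hypothetical wrapper (run_script.bat): activate the env, then run the script with it.
call "C:\Users\user\anaconda3\Scripts\activate.bat" "C:\Users\user\anaconda3\envs\venv"
python "C:\Users\user\scripts\script.py"
Task Scheduler would then point at run_script.bat instead of calling python.exe directly.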

Run python script with module/ modules already imported

I have tried nearly every solution I can think of, but the fact is I can't run a Python script with modules already imported.
Here's my code for the module cls.py:
import os
def cls():
    os.system('cls')
Given below is the code for opening python in cmd:
@echo off
python
pause
What I need is to open the python in the command prompt with the module cls imported. Also, when I try python -m <module> way, it doesn't show any errors, but the program ends.
Any help would be greatly appreciated and thanks in advance.
Saw this question related to mine, but it is not my problem: Run python script that imports modules with batch file
I think what you're looking for is the interactive mode:
-i
When a script is passed as first argument or the -c option is used, enter interactive mode after executing the script or the command
So just use python -i cls.py in your batch file. This would import your Python file and stay in the Python interpreter prompt. You could then just call cls() because it's already imported.
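So the batch file from the question would become something like this (a sketch, assuming cls.py sits next to the batch file):
@echo off
REM -i keeps the interpreter open after running cls.py, with cls() already defined.
python -i cls.py
pause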
Alternatively, you could set the environment variable PYTHONSTARTUP in your batch file:
If this is the name of a readable file, the Python commands in that file are executed before the first prompt is displayed in interactive mode. The file is executed in the same namespace where interactive commands are executed so that objects defined or imported in it can be used without qualification in the interactive session.
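As a sketch of that alternative (the path to cls.py is an assumption), the batch file could set the variable just before starting Python:
@echo off
REM PYTHONSTARTUP points at a file that is executed before the first interactive prompt.
set PYTHONSTARTUP=C:\path\to\cls.py
python
pause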

nextflow does not find all my python modules

I am trying to make a Nextflow script that uses a Python script. My Python script imports a number of modules, but when it runs inside Nextflow, python3 fails to find two of the seven modules (cv2 and matplotlib) and crashes. If I call the script directly from bash it works fine. I would like to avoid creating a Docker image to run this script.
Error executing process > 'grab_images (1)'
Caused by:
Process `grab_images (1)` terminated with an error exit status (1)
Command executed:
python3 --version
echo 'processing image-1.npy'
python3 /home/hq/cv_proj/k_means2.py image-1.npy
Command exit status:
1
Command output:
Python 3.7.3
processing image-1.npy
Command error:
Traceback (most recent call last):
File "/home/hq/cv_proj/k_means2.py", line 5, in <module>
import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
Work dir:
/home/hq/cv_proj/work/7f/b787c62ec420b2b5eb490603ef913f
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
I think there is a path issue, as modules like numpy, sys, re and time are loaded successfully. How can I fix this?
Thanks in advance
UPDATE
To assist others who may have problems using Python in Nextflow scripts: make sure your shebang is correct. I was using
#!/usr/bin/python
instead of
#!/usr/bin/python3
Since all of my packages were installed with pip3 and I exclusively use python3, the shebang has to match.
Best to avoid absolute paths to your script(s) in your process declarations. This section of the docs is worth taking some time to read: https://www.nextflow.io/docs/latest/sharing.html#manage-dependencies, particularly the subsection on how to manage third party scripts:
Any third party script that does not need to be compiled (Bash,
Python, Perl, etc) can be included in the pipeline project repository,
so that they are distributed with it.
Grant the execute permission to these files and copy them into a
folder named bin/ in the root directory of your project repository.
Nextflow will automatically add this folder to the PATH environment
variable, and the scripts will automatically be accessible in your
pipeline without the need to specify an absolute path to invoke them.
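For illustration only: assuming k_means2.py is copied into bin/ at the project root, given execute permission, and has a #!/usr/bin/env python3 shebang, the process can then invoke it by name (a sketch in DSL2 syntax, not the asker's original process):
process grab_images {
    input:
    path npy

    script:
    """
    echo "processing ${npy}"
    k_means2.py ${npy}
    """
}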
Then the problem is how to manage your Python dependencies. You mentioned Docker is not an option. Is Conda also not an option? The config for Conda might look something like:
name: myenv
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- conda-forge::matplotlib-base=3.4.3
- conda-forge::numpy=1.21.2
- conda-forge::opencv=4.5.2
Then if the above is in a file called environment.yml, create the environment with:
conda env create
See also the best practices for using Conda.
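You could then attach that environment to the pipeline with Nextflow's conda support, for example in nextflow.config (a sketch; the exact settings depend on your Nextflow version):
// nextflow.config
conda.enabled = true                          // needed on Nextflow 22.08+ (assumption about your version)
process.conda = "$baseDir/environment.yml"    // use the environment.yml above for all processes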

NLTK called from Databricks PySpark gets a "punkt not found" error

I would like to call NLTK to do some NLP on Databricks via PySpark.
I have installed NLTK from the library tab of databricks. It should be accessible from all nodes.
My py3 code :
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')
def get_keywords1(col):
    sentences = []
    sentence = nltk.sent_tokenize(col)

get_keywords_udf = F.udf(get_keywords1, StringType())
I run the above code and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show() # error here !
I got error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have updated the PATH in the Spark environment with the folder location.
Why can it not be found?
I have tried the solutions at NLTK. Punkt not found and How to config nltk data directory from code?,
but none of them work for me.
I have tried updating
nltk.data.path.append('/root/nltk_data/tokenizers/')
but it does not work.
It seems that nltk cannot see the newly added path!
I also copied punkt to a path where nltk will search:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
but nltk still cannot see it.
thanks
When spinning up a Databricks single-node cluster, this will work fine: installing nltk via pip and then using nltk.download() to get the prebuilt models/text works.
Assumptions: User is programming in a Databricks notebook with python as the default language.
When spinning up a multinode cluster, there are a couple of issues you will run into.
You are registering a UDF that relies on code from another module. In order for this UDF to work on every node in the cluster, the module needs to be installed at the cluster level (i.e. nltk installed on the driver and all worker nodes). The module can be installed via an init script at cluster start time or via the Libraries section of the Databricks Compute page. More on that here (I also give code examples below):
https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
Now when you run the UDF the module will exist on all nodes of the cluster.
Using nltk.download() to get data that the module references. When we call nltk.download() interactively in a multinode cluster, it only downloads to the driver node. So when your UDF executes on the other nodes, those nodes will not contain the needed resources in the default paths it searches. To see these default paths, run nltk.data.path.
To overcome this there are two possibilities I have explored. One of them works.
(doesn't work) Using an init script, install nltk, then in that same init script call nltk.download via a one-line bash/python expression after the install, like...
python -c "import nltk; nltk.download('all')"
I have run into the issue where nltk is installed but not found after installation. I'm assuming virtual environments are playing a role here.
(works) Using an init script, install nltk.
Create the script
dbutils.fs.put('/dbfs/databricks/scripts/nltk-install.sh', """
#!/bin/bash
pip install nltk""", True)
Check it out
%sh
head '/dbfs/databricks/scripts/nltk-install.sh'
Configure cluster to run init script on start up
Databricks Cluster Init Script Config
In the cluster configuration create the environment variable NLTK_DATA="/dbfs/databricks/nltk_data/". This is used by the nltk package to search for data/model dependencies.
Databricks Cluster Env Variable Config
Start the cluster.
After it is installed and the cluster is running, check to make sure the environment variable was created correctly.
import os
os.environ.get("NLTK_DATA")
Then check to make sure that nltk is pointing towards the correct paths.
import nltk
nltk.data.path
If '/dbfs/databricks/nltk_data/' is in the list, we are good to go.
Download the stuff you need.
nltk.download('all', download_dir="/dbfs/databricks/nltk_data/")
Notice that we downloaded the dependencies to Databricks storage. Now every node will have access to the nltk default dependencies. Because we specified the environment variable NLTK_DATA at cluster creation time, when we import nltk it will look in that directory. The only difference here is that we now pointed nltk to our Databricks storage, which is accessible by every node.
Now, since the data exists in mounted storage at cluster start-up, we shouldn't need to re-download the data every time.
After following these steps you should be all set to play with nltk and all of its default data/models.
I recently encountered the same issue when using NLTK in a Glue job.
Adding the 'missing' file to all nodes resolved the issue for me. I'm not sure if it will help in Databricks, but it is worth a shot.
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
Drew Ringo's suggestion almost worked for me.
If you're using a multinode cluster in Databricks, you will face the problems Ringo mentioned. For me a much simpler solution was running the following init_script:
dbutils.fs.put("dbfs:/databricks/scripts/nltk_punkt.sh", """#!/bin/bash
pip install nltk
python -m nltk.downloader punkt""",True)
Make sure to add the filepath under Advanced options -> Init Scripts found within the Cluster Configuration menu.
The first of Drew Ringo's 2 possibilities will work if your cluster's init_script looks like this:
%sh
/databricks/python/bin/pip install nltk
/databricks/python/bin/python -m nltk.downloader punkt
He is correct to assume that his original issue relates to virtual environments.
This helped me to solve the issue:
import nltk
nltk.download('all')

PySpark: No such file or directory error when creating a SparkContext on Jupyter Notebook [duplicate]

I've spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here's what my .bash_profile looks like:
PATH="/my/path/to/anaconda3/bin:$PATH"
export JAVA_HOME="/my/path/to/jdk"
export PYTHON_PATH="/my/path/to/anaconda3/bin/python"
export PYSPARK_PYTHON="/my/path/to/anaconda3/bin/python"
export PATH=$PATH:/my/path/to/spark-2.1.0-bin-hadoop2.7/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
export SPARK_HOME=/my/path/to/spark-2.1.0-bin-hadoop2.7
alias pyspark="pyspark --conf spark.local.dir=/home/puifais --num-executors 30 --driver-memory 128g --executor-memory 6g --packages com.databricks:spark-csv_2.11:1.5.0"
When I type /my/path/to/spark-2.1.0-bin-hadoop2.7/bin/spark-shell, I can launch Spark just fine in my command line shell. And the output sc is not empty. It seems to work fine.
When I type pyspark, it launches my Jupyter Notebook fine. When I create a new Python3 notebook, this error appears:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
And sc in my Jupyter Notebook is empty.
Can anyone help solve this situation?
Just want to clarify: There is nothing after the colon at the end of the error. I also tried to create my own start-up file using this post and I quote here so you don't have to go look there:
I created a short initialization script init_spark.py as follows:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf = conf)
and placed it in the ~/.ipython/profile_default/startup/ directory
When I did this, the error then became:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /my/path/to/spark-2.1.0-bin-hadoop2.7/python/pyspark/shell.py:
[IPKernelApp] WARNING | Unknown error in handling startup files:
Well, it really gives me pain to see how crappy hacks, like setting PYSPARK_DRIVER_PYTHON=jupyter, have been promoted to "solutions" and tend now to become standard practices, despite the fact that they evidently lead to ugly outcomes, like typing pyspark and ending up with a Jupyter notebook instead of a PySpark shell, plus yet-unseen problems lurking downstream, such as when you try to use spark-submit with the above settings... :(
(Don't get me wrong, it is not your fault and I am not blaming you; I have seen dozens of posts here at SO where this "solution" has been proposed, accepted, and upvoted...).
At the time of writing (Dec 2017), there is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run a jupyter kernelspec list command, to get the list of any already available kernels in your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
I have not bothered to change my details to /my/path/to etc., and you can already see that there are some differences between our cases (I use Intel Python 2.7, and not Anaconda Python 3), but hopefully you get the idea (BTW, don't worry about the connection_file - I don't use one either).
Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run the jupyter kernelspec list command again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
If you don't have already a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - haven't tried it, but have a look at this SO answer.
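For reference (again, untested here), installing a kernel directory for the current user would look something like:
$ jupyter kernelspec install /path/to/pyspark2 --user --name pyspark2
after which it should show up in jupyter kernelspec list.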
Finally, don't forget to remove all the PySpark-related environment variables from your bash profile (leaving only SPARK_HOME should be OK). And confirm that, when you type pyspark, you find yourself with a PySpark shell, as it should be, and not with a Jupyter notebook...
UPDATE (after comment): If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
Conda can help correctly manage a lot of dependencies...
Install spark. Assuming spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Create a conda environment with all needed dependencies apart from spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python3 notebook
Try calculating PI with the following script (borrowed from this)
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
I just conda installed sparkmagic (after re-installing a newer version of Spark).
I think that alone simply works, and it is much simpler than fiddling configuration files by hand.
