Error while importing 'en_core_web_sm' for spacy in Azure Databricks - databricks

I am getting an error while loading spaCy's 'en_core_web_sm' model in a Databricks notebook. I have seen a lot of other questions about the same error, but they are of no help.
The code is as follows:
import spacy
!python -m spacy download en_core_web_sm
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
# Process
text = ("This is a test document")
doc = nlp(text)
I get the error "OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory"
The installation details are:
Python - 3.8.10
spaCy version 3.3
It simply does not work. Validating the installation shows:
ℹ spaCy installation:
/databricks/python3/lib/python3.8/site-packages/spacy
NAME SPACY VERSION
en_core_web_sm >=2.2.2 3.3.0 ✔
But the error still remains.
I am not sure if this warning is relevant:
/databricks/python3/lib/python3.8/site-packages/spacy/util.py:845: UserWarning: [W094] Model 'en_core_web_sm' (2.2.5) specifies an under-constrained spaCy version requirement: >=2.2.2. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.3.0,<3.4.0
warnings.warn(warn_msg)
There is also this message when installing 'en_core_web_sm':
"Defaulting to user installation because normal site-packages is not writeable"
Any help will be appreciated
Ganesh

I suspect that you have a cluster with autoscaling, and when autoscaling happened, the new nodes didn't have that module installed. Another reason could be that a cluster node was terminated by the cloud provider and the cluster manager pulled in a new node.
To prevent such situations I would recommend using a cluster init script, as described in the following answer - it guarantees that the module is installed even on new nodes. The content of the script is really simple:
#!/bin/bash
pip install spacy
python -m spacy download en_core_web_sm
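From a notebook, the script can be written to DBFS so the cluster can pick it up at start-up; a minimal sketch (the path is an assumption, and dbutils only exists inside a Databricks notebook, hence the commented call):

```python
# Content of the init script: install spacy and the model on every node.
init_script = """#!/bin/bash
pip install spacy
python -m spacy download en_core_web_sm
"""

# In a Databricks notebook (dbutils is only defined there):
# dbutils.fs.put("dbfs:/databricks/scripts/spacy-install.sh", init_script, True)
```

The script path then goes into the cluster configuration under Advanced options -> Init Scripts.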

Related

Python and SpaCy - cannot download specific version

I am using Python 3.7.7
I ran the following (after pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz) and got this output:
[XXXXX#localhost some-folder]$ python3 -m spacy download en_core_web_sm-2.2.0
2021-02-16 17:58:24.921639: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-02-16 17:58:24.921671: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. Ignoring the CUDA*.
No compatible package found for 'en_core_web_sm-2.2.0'
The only way I have made it work is to remove the version 2.2.0 from the command. But the spaCy documentation suggests that giving the version number should download the correct file.
So, what am I doing wrong?
Your spacy version also matters, not only the Python version.
spaCy models-languages compatibility
If you are already using spaCy v3, you won't be able to download language versions < 3
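For reference, spaCy's CLI can fetch an exact model version with the --direct flag (model name and version joined by a dash); a command sketch, assuming a spaCy v2.x installation that is compatible with the 2.2.0 model:

```shell
# Exact version via the spaCy CLI (note the --direct flag):
python -m spacy download en_core_web_sm-2.2.0 --direct

# Equivalent: pip install straight from the models release URL:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
```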

NLTK is called and got error of "punkt" not found on databricks pyspark

I would like to call NLTK to do some NLP on Databricks via PySpark.
I have installed NLTK from the Libraries tab of Databricks; it should be accessible from all nodes.
My Python 3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import nltk
nltk.download('punkt')

def get_keywords1(col):
    # split the text into sentences and return them as a single string
    sentences = nltk.sent_tokenize(col)
    return str(sentences)

get_keywords_udf = F.udf(get_keywords1, StringType())
I ran the above code and got:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
When I run the following code:
t = spark.createDataFrame(
    [(2010, 1, 'rdc', 'a book'), (2010, 1, 'rdc', 'a car'),
     (2007, 6, 'utw', 'a house'), (2007, 6, 'utw', 'a hotel')],
    ("year", "month", "u_id", "objects"))
t1 = t.withColumn('keywords', get_keywords_udf('objects'))
t1.show()  # error here!
I got error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/databricks/python/nltk_data'
- '/databricks/python/share/nltk_data'
- '/databricks/python/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
I have downloaded 'punkt'. It is located at
/root/nltk_data/tokenizers
I have updated the PATH in the Spark environment with the folder location.
Why can it not be found?
I tried the solutions from 'NLTK. Punkt not found' and 'How to config nltk data directory from code?',
but none of them work for me.
I have also tried appending the path:
nltk.data.path.append('/root/nltk_data/tokenizers/')
but it does not work.
It seems that nltk cannot see the newly added path!
I also copied punkt to the path where nltk will search:
cp -r /root/nltk_data/tokenizers/punkt /root/nltk_data
but, nltk still cannot see it.
thanks
When spinning up a Databricks single-node cluster, this will work fine: installing nltk via pip and then using nltk.download to get the prebuilt models/text works.
Assumption: the user is programming in a Databricks notebook with Python as the default language.
When spinning up a multi-node cluster, there are a couple of issues you will run into.
You are registering a UDF that relies on code from another module. In order for this UDF to work on every node in the cluster, the module needs to be installed at the cluster level (i.e. nltk installed on the driver and all worker nodes). The module can be installed via an init script at cluster start time or via the Libraries section of the Databricks Compute page. More on that here (I also give code examples below):
https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries
Now when you run the UDF, the module will exist on all nodes of the cluster.
Using nltk.download() to get data that the module references: when we call nltk.download() interactively in a multi-node cluster, it only downloads to the driver node. So when your UDF executes on the other nodes, those nodes will not contain the needed resources in the default search paths. To see these default paths, run nltk.data.path.
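The lookup that fails on the workers is essentially a walk over those paths; a stdlib-only sketch of the idea (the paths and helper below are illustrative, not nltk's actual implementation):

```python
import os

# Default-style search paths, as reported by nltk.data.path on a node:
search_paths = [
    "/root/nltk_data",              # interactive download lands here, driver only
    "/databricks/python/nltk_data",
    "/usr/share/nltk_data",
]

def find_resource(name, paths):
    """Return the first path containing the resource, or None if absent."""
    for p in paths:
        candidate = os.path.join(p, name)
        if os.path.exists(candidate):
            return candidate
    return None

# On a worker node none of these paths contain punkt, hence the LookupError.
print(find_resource("tokenizers/punkt", search_paths))
```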
To overcome this there are two possibilities I have explored. One of them works.
(doesn't work) Using an init script, install nltk, then in that same init script call nltk.download via a one-liner Python expression after the install, like:
python -c "import nltk; nltk.download('all')"
I have run into the issue where nltk is installed but not found after it has been installed. I'm assuming virtual environments are playing a role here.
(works) Using an init script, install nltk.
Create the script
dbutils.fs.put('dbfs:/databricks/scripts/nltk-install.sh', """
#!/bin/bash
pip install nltk""", True)
Check it out
%sh
head '/dbfs/databricks/scripts/nltk-install.sh'
Configure cluster to run init script on start up
Databricks Cluster Init Script Config
In the cluster configuration create the environment variable NLTK_DATA="/dbfs/databricks/nltk_data/". This is used by the nltk package to search for data/model dependencies.
Databricks Cluster Env Variable Config
Start the cluster.
After it is installed and the cluster is running, check to make sure the environment variable was correctly created.
import os
os.environ.get("NLTK_DATA")
Then check to make sure that nltk is pointing towards the correct paths.
import nltk
nltk.data.path
If '/dbfs/databricks/nltk_data/' is within the list, we are good to go.
Download the stuff you need.
nltk.download('all', download_dir="/dbfs/databricks/nltk_data/")
Notice that we downloaded the dependencies to Databricks storage, so every node will have access to the nltk default dependencies. Because we specified the environment variable NLTK_DATA at cluster creation time, nltk will look in that directory when imported. The only difference here is that we pointed nltk to our Databricks storage, which is accessible by every node.
Since the data exists in mounted storage at cluster start-up, we shouldn't need to re-download it every time.
After following these steps you should be all set to play with nltk and all of its default data/models.
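The steps above can be condensed into a single notebook cell that writes the init script; the -d flag of nltk.downloader mirrors the NLTK_DATA location configured on the cluster (paths are the ones used above; the dbutils call only exists in a notebook, and this variant downloads punkt from the script rather than interactively):

```python
# Init script: install nltk on every node; the download into shared DBFS
# storage is a no-op once the data is already present there.
nltk_init = """#!/bin/bash
pip install nltk
python -m nltk.downloader -d /dbfs/databricks/nltk_data/ punkt
"""

# In a Databricks notebook:
# dbutils.fs.put("dbfs:/databricks/scripts/nltk-install.sh", nltk_init, True)
```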
I recently encountered the same issue when using NLTK in a Glue job.
Adding the 'missing' file to all nodes resolved the issue for me. I'm not sure if it will help in Databricks, but it is worth a shot.
sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
Drew Ringo's suggestion almost worked for me.
If you're using a multi-node cluster in Databricks, you will face the problems Ringo mentioned. For me, a much simpler solution was running the following init script:
dbutils.fs.put("dbfs:/databricks/scripts/nltk_punkt.sh", """#!/bin/bash
pip install nltk
python -m nltk.downloader punkt""",True)
Make sure to add the filepath under Advanced options -> Init Scripts found within the Cluster Configuration menu.
The first of Drew Ringo's 2 possibilities will work if your cluster's init script looks like this:
%sh
/databricks/python/bin/pip install nltk
/databricks/python/bin/python -m nltk.downloader punkt
He is correct to assume that his original issue relates to virtual environments.
This helped me to solve the issue:
import nltk
nltk.download('all')

Spacy es_core_news_sm model not loading

I'm trying to use spaCy for POS tagging in Spanish. I have checked the official documentation and also read various posts on Stack Overflow, but neither has worked for me.
I have Python 3.7 and spaCy 2.2.4 installed, and I'm running my code from a Jupyter notebook.
So, as the documentation suggests, I tried the following from my terminal:
python -m spacy download en_core_web_sm
This gave the result:
Download and installation successful
Then in my jupyter notebook:
import spacy
nlp = spacy.load("es_core_news_sm")
And I got the following error:
ValueError: [E173] As of v2.2, the Lemmatizer is initialized with an instance of Lookups containing the lemmatization tables. See the docs for details: https://spacy.io/api/lemmatizer#init
Additionally, I tried:
import spacy
nlp = spacy.load("es_core_news_sm")
And this gave me a different error:
OSError: Can't find model 'es_core_news_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory
Could you please help me to solve this error?
You downloaded the English model. In order to use the Spanish model, you have to download it: python -m spacy download es_core_news_sm
After downloading the right model, you can import it as follows:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()

spaCy loading model fails

I am trying to load the spaCy model de_core_news_sm without any success. Since our company policy seems to block the python -m spacy download de_core_news_sm prompt command, I downloaded the model manually and used pip install on the local tar.gz archive, which worked out well.
However, calling nlp = spacy.load("de_core_news_sm") in my code throws the following exception:
Exception has occurred: ValueError
[E149] Error deserializing model. Check that the config used to create the
component matches the model being loaded.
File "pipes.pyx", line 642, in
spacy.pipeline.pipes.Tagger.from_disk.load_model
I have no idea how to deal with this. Does anybody know what to do?
Run python -m spacy validate to check whether the model you downloaded is compatible with the version of spacy you have installed. This kind of error happens when the versions aren't compatible. (Probably one is v2.1 and the other is v2.2.)

Unable to load 'en' from spacy in jupyter notebook

I run the following lines of code in a jupyter notebook:
import spacy
nlp = spacy.load('en')
And get following error:
Warning: no model found for 'en_default'
Only loading the 'en' tokenizer.
I am using python 3.5.3, spacy 1.9.0, and jupyter notebook 5.0.0.
I downloaded spacy using conda install spacy and python3 spacy install en.
I am able to import spacy and load 'en' from my terminal but not from a jupyter notebook.
Based on the answer in your comments, it seems fairly clear that the Jupyter kernel and your system Python are not the same interpreter, and therefore likely do not share libraries.
I would recommend re-running the installation, or specifically installing the en model into the correct spaCy environment. Replace the path below with the full path to your interpreter, if this is not the full path:
//anaconda/envs/capstone/bin/python -m spacy download
That should be enough. Let me know if there are any issues.
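To confirm which interpreter the notebook is actually using (and compare it against which python in the terminal), a minimal check run from a notebook cell:

```python
import sys

# The interpreter running this notebook; if it differs from the one where
# the spaCy model was installed, spacy.load will not find the model.
print(sys.executable)
```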
You can also download the en language model from within the Jupyter notebook:
import sys
!{sys.executable} -m spacy download en
