Is there a way to reduce the size of the Spacy installation? - python-3.x

I'm using Spacy in a project and noticed that my Docker images are pretty big. A bit of research led me to find out that just the Spacy installation itself (in /usr/local/lib/python3.6/site-packages/spacy) accounts for 267MB, so I was wondering if there's anything that can be done to reduce that footprint?

Out of interest, spaCy 2.2 was released yesterday (Oct 2nd, 2019).
One of the listed features of 2.2 is "Smaller disk foot-print, better language resource handling". So, upgrading to spaCy 2.2 may be one way to reduce the size of a spaCy installation.
(Although this post doesn't solve your specific problem, I believe it does answer this specific question.)
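To check whether an upgrade actually shrinks the footprint, it may help to measure the installed package directory before and after. The following is only a rough sketch using the standard library; the helper names `directory_size_bytes` and `package_size_mb` are made up for illustration and are not part of spaCy.

```python
import importlib.util
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def package_size_mb(package_name):
    """Return the on-disk size of an installed package in MB, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or a single-file module rather than a package
    total = sum(directory_size_bytes(p) for p in spec.submodule_search_locations)
    return total / (1024 * 1024)


# Hypothetical usage, run inside the Docker image:
# print(package_size_mb("spacy"))
```

Note that this only counts the package directory itself; language models installed as separate packages would have to be measured separately.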

Related

Running stable-diffusion on graphcore IPU's

I have been looking for a version of Stable Diffusion that can run on IPUs. So far I can only find CUDA-based ones (due to the wide availability of CUDA hardware).
Now I wonder if there is a way to run CUDA-based scripts, trainers, and so on on an IPU, for example through a translation layer in between.
I doubt there is, and since I cannot find an IPU version, I suspect I'll have to modify the scripts :(.
There is the HuggingFace optimum library which acts as the interoperability layer for transformers to run on IPUs. You can find Stable Diffusion there.
For other models that are not supported in the library, there's a guide on how you could modify your script to make it IPU-compatible here

What happened to sklearn.datasets.load_boston?

While I was coding a Boston housing model using sklearn.datasets.load_boston, it gave me an error saying that the dataset was deprecated due to 'ethical' issues. What are those issues? I looked online, and could not find anything.
Here's the full error:
DEPRECATED: load_boston is deprecated in 1.0 and will be removed in 1.2.
The Boston housing prices dataset has an ethical problem. You can refer to the documentation of this function for further details.
The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source:
Actually, the reason is exactly what the error message says. You can check https://scikit-learn.org/1.1/modules/generated/sklearn.datasets.load_boston.html for further details.
As I understand, there are 2 problems in the data:
Racism: There is a great article by M. Carlisle, also cited in the scikit-learn documentation, that focuses on the main issue with the Boston Housing dataset: it includes a variable built on the assumption that a neighbourhood's racial composition affects house prices.
No suitable goal: "the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption."
However, you can get the data from the source:
http://lib.stat.cmu.edu/datasets/boston
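As a rough sketch of how that file could be read: it is assumed here that the file starts with 22 lines of description and that each record's 14 values are wrapped over two lines, with the last value being the target (MEDV). The function name `parse_boston` is made up for illustration, and the download step is left as a comment so the parsing can be tried offline.

```python
def parse_boston(raw_text, header_lines=22, n_columns=14):
    """Parse the raw CMU text into (features, target) lists of floats.

    Assumes `header_lines` lines of description, then whitespace-separated
    numbers where every `n_columns` consecutive values form one record.
    """
    lines = raw_text.splitlines()[header_lines:]
    values = [float(token) for line in lines for token in line.split()]
    if len(values) % n_columns != 0:
        raise ValueError("unexpected number of values; file layout may differ")
    rows = [values[i:i + n_columns] for i in range(0, len(values), n_columns)]
    features = [row[:-1] for row in rows]  # first 13 columns
    target = [row[-1] for row in rows]     # last column (assumed to be MEDV)
    return features, target


# Hypothetical usage (requires network access):
# from urllib.request import urlopen
# with urlopen("http://lib.stat.cmu.edu/datasets/boston") as response:
#     X, y = parse_boston(response.read().decode("utf-8", errors="replace"))
```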
I hope these help.
Yes, this dataset was removed in scikit-learn version 1.2. If you want to use it, you can install an earlier version of scikit-learn:
pip install scikit-learn==1.1.3
This will still show the warning and you can use it only for educational purposes.

Legacy torchtext 0.9.0

In the latest release of torchtext, a lot of features were moved to torchtext.legacy. I want to keep doing the same things, such as using torchtext.legacy.data.Field and other features, without using legacy. Can that be done, and how?
EDIT:
here is a release note about 0.9.0 version
here is the migration guide
Also, in the first link, there are counterparts for legacy Datasets.
Old answer (might be useful)
You could go for an alias, namely:
import torchtext.legacy as torchtext
But this is a bad idea for multiple reasons:
It became legacy for a reason (you can always change your existing code to torchtext.legacy.data.Field)
Very confusing - torchtext should be torchtext, not torchtext.legacy
Unable to import torchtext as... torchtext - because this alias is already taken.
You could also do some workarounds like assigning torchtext.legacy.data.Field to torchtext.data.Field and whatnot, but it is a horrible idea.
If you want to work with this legacy stuff, you can always stay with an earlier version of torchtext like 0.8.0, and this would make sense.
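If the same code has to run on both older and newer torchtext versions, one possible approach (not the alias trick above) is to try the legacy location first and fall back to the old one. This is only a sketch; `import_first` is a hypothetical helper, and it assumes Field lives in torchtext.legacy.data from 0.9.0 on and in torchtext.data before that.

```python
import importlib


def import_first(*module_names):
    """Return the first module from `module_names` that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate location
    raise ImportError("none of these modules could be imported: "
                      + ", ".join(module_names))


# Hypothetical usage (requires torchtext to be installed):
# data = import_first("torchtext.legacy.data", "torchtext.data")
# TEXT = data.Field(lower=True)
```

Unlike the alias, this keeps the name torchtext untouched and makes the fallback explicit at the call site.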

Is there any solution if spacy can't be located on my system?

As the picture shows, spacy is installed correctly:
But I still can't import it:
According to the official site, it seems that spacy can't be located on my system (win32):
I want to know if there is a solution. I want to use it to process a French corpus, so if it can't be used, is there a similar tool I can use to lemmatize French text and so on?

Identifying software and version range in a sentence

I have sentences similar to the following format
This vulnerability happened in Firefox 1.x before 1.8, Safari
2.x before 2.8.
Given the above sentence, I want to extract a dictionary
{Firefox: 1.0-1.8, Safari: 2.0-2.8}
Problem is how should I identify the version range with the software they belong to, using NLP techniques?
I'd use a combination of NERs, one for detection of names and one for versions:
You may have to:
- Keep a list of popular software names in case the NER misses one.
- Use a hacky fix for the version numbers, since something like "1.x" is not properly detected.
You can play with it here: http://nlp.cogcomp.org
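For sentences that follow the fixed "Name N.x before N.M" shape from the question, a plain regular expression may be enough as a fallback or post-processing step. Note that this swaps NER for a regex, so it is a sketch under that assumption and will miss other phrasings; the pattern and the function name `extract_version_ranges` are made up for illustration.

```python
import re

# Capitalised name, a lower version that may contain "x", the word "before",
# and an upper version. Only this one phrasing is covered.
VERSION_RANGE = re.compile(
    r"(?P<name>[A-Z][\w-]*(?: [A-Z][\w-]*)*)\s+"
    r"(?P<low>\d+(?:\.(?:\d+|x))*)\s+before\s+"
    r"(?P<high>\d+(?:\.\d+)*)"
)


def extract_version_ranges(sentence):
    """Map each software name to a "low-high" version range string."""
    ranges = {}
    for match in VERSION_RANGE.finditer(sentence):
        low = match.group("low").replace("x", "0")  # treat "1.x" as "1.0"
        ranges[match.group("name")] = f"{low}-{match.group('high')}"
    return ranges


sentence = ("This vulnerability happened in Firefox 1.x before 1.8, "
            "Safari 2.x before 2.8.")
print(extract_version_ranges(sentence))
# {'Firefox': '1.0-1.8', 'Safari': '2.0-2.8'}
```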
