spaCy Natural Language Processing pickle file issue - nlp

For the spaCy package, the model files for deps, ner, and pos throw an "invalid load key" or EOF error when I try to load them with pickle.
I have run the code on both Windows and Linux systems, and I don't think it is a binary-mode transfer issue; I have checked that in detail. I cannot figure out the problem. Most likely the file is corrupt, but I am not sure. Is there a way it can be fixed with a hex editor?
Any help is highly appreciated. It would also be great if someone could explain pickling in a bit more detail.

The English() object in spaCy is not picklable. See issue #125.
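A minimal sketch of the usual workaround, assuming a current spaCy install with en_core_web_sm available: rather than pickling the Language object, use the pipeline's own to_disk/spacy.load round trip.

    import spacy

    # Load a packaged pipeline (assumes en_core_web_sm is installed).
    nlp = spacy.load("en_core_web_sm")

    # Use spaCy's own serialization instead of pickle.
    nlp.to_disk("my_pipeline")                # writes config + model data to a directory
    nlp_restored = spacy.load("my_pipeline")  # reads it back

    doc = nlp_restored("Serializing via to_disk avoids pickling the Language object.")
    print([(token.text, token.pos_) for token in doc])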

Related

Is there a Python library equivalent to the filehash package in R?

I have been using R's filehash package to work around "out of memory" problems: I store large datasets in hash files on disk and load or update them only when they are used. Given that most systems now use SSDs, I found this solution good enough to solve my out-of-memory problem while keeping a reasonable balance with running time. I am now writing my code in Python but could not find an equivalent package. Could anyone shed some light on this? Thanks.
I have been reading about the hashlib package in Python, but I am not sure whether it serves a similar purpose of creating a hash file and loading it back.
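One standard-library option worth comparing (not mentioned in the thread) is shelve, which provides a persistent, dict-like store on disk, roughly analogous to the filehash workflow described above; a minimal sketch:

    import shelve

    # Objects are pickled to a file on disk and only unpickled when accessed by key,
    # so the full dataset never has to sit in memory at once.
    with shelve.open("big_datasets.db") as store:
        store["dataset_a"] = list(range(1_000_000))

    with shelve.open("big_datasets.db") as store:
        first_rows = store["dataset_a"][:5]  # loaded back only when needed
        print(first_rows)

(hashlib, by contrast, only computes hash digests of byte strings; it does not store or load data.)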

What is the difference between spacy.load('en_core_web_sm') and spacy.load('en')?

I have seen both of these written in Colab notebooks. Can someone please explain the difference between them? Thanks.
In spaCy v2, it was possible to use a shorthand to refer to a model in some circumstances, so "en" could be the same as "en_core_web_sm".
The way this worked internally relied on symlinks, which added file-system state and caused issues on Windows. That led to troubleshooting problems and confusion, so it was decided that the convenience of the short names wasn't worth it, and there are no short names in v3.
So if you see code using spacy.load("en"), it's using v2. There's no meaningful difference in how it works, though.
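For illustration, a short sketch contrasting the two call styles (assuming en_core_web_sm is installed):

    import spacy

    # spaCy v3: load a pipeline by its full package name.
    nlp = spacy.load("en_core_web_sm")

    # spaCy v2 only: "en" was a shorthand (backed by a symlink) for the same package.
    # In v3 this raises an error because the short names were removed.
    # nlp = spacy.load("en")

    doc = nlp("The shorthand was removed in spaCy v3.")
    print([token.pos_ for token in doc])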

'Doc2Vec' object has no attribute 'neg_labels' when trying to use a pretrained model

I'm trying to use a pretrained Doc2Vec model for my semantic search project. I tried this one, https://github.com/jhlau/doc2vec (English Wikipedia DBOW), with the forked version of Gensim (0.12.4) and Python 2.7.
It works fine when I use most_similar, but when I try to use infer_vector I get this error:
AttributeError: 'Doc2Vec' object has no attribute 'neg_labels'
What can I do to make this work?
For reasons given in this other answer, I'd recommend against using a years-old custom fork of Gensim; I also find those particular pre-trained models a little fishy, in that their sizes seem unlikely to actually contain all the purported per-article vectors.
But also: that error resembles a very old bug that only showed up if Gensim was not fully installed with the necessary Cython-optimized routines for fast training/inference. (That caused some older, seldom-run code to be executed which depended on the missing neg_labels. Newer versions of Gensim have eliminated that slow code path entirely.)
My comment on an old Gensim issue has more details and a workaround that might help, but really, the much better thing to do for quality results and speedy code is to use a current Gensim and train your own model.
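To make that last recommendation concrete, here is a minimal sketch with a current Gensim (4.x); the tiny corpus is just a stand-in for real training documents:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus standing in for your own documents.
    corpus = [
        "semantic search finds documents by meaning",
        "doc2vec learns a vector for each document",
        "old custom forks of gensim are best avoided",
    ]
    tagged = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

    # Current Gensim ships its Cython-optimized routines, so the slow code path
    # that referenced neg_labels no longer exists.
    model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

    # infer_vector expects a list of tokens, not a raw string.
    vector = model.infer_vector("search documents by meaning".split())
    print(model.dv.most_similar([vector], topn=2))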

Looking for the implementation of the math operations in the PyTorch library (such as torch.add, torch.mm, etc.)

I want to find the source code for the math operations in PyTorch. I know they are not all in the same file, but hopefully someone can help me. I saw that there is an ATen folder in the PyTorch repository on GitHub, but I find it quite confusing to navigate.
It's my first question here. Sorry for anything annoying.
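For reference, the Python-level functions are thin bindings over the C++ kernels in that ATen directory (aten/src/ATen in the PyTorch repository); calling them looks like this:

    import torch

    # torch.mm and torch.add dispatch to C++ implementations in ATen;
    # the Python-visible functions themselves contain no math.
    a = torch.randn(2, 3)
    b = torch.randn(3, 2)
    print(torch.mm(a, b))     # matrix multiplication
    print(torch.add(a, 1.0))  # element-wise addition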

How to get "Universal dependencies, enhanced" in the response from Stanford CoreNLP?

I am playing around with the Stanford CoreNLP parser and I am having a small issue that I assume is just something stupid I'm missing due to my lack of experience. I am currently using the node.js stanford-corenlp wrapper module with the latest full Java version of Stanford CoreNLP.
My current results return something similar to the "Collapsed Dependencies with CC processed" data here: http://nlp.stanford.edu/software/example.xml
I am trying to figure out how I can get the dependencies titled "Universal dependencies, enhanced" as shown here: http://nlp.stanford.edu:8080/parser/index.jsp
If anyone can shed some light on even just what direction I need to research, it would be extremely helpful. So far Google has not been helping much with the specific "enhanced" results, and I am just trying to find out what I need to pass, call, or include in my annotators to get the results shown at the link above. Thanks for your time!
Extra (enhanced) dependencies can be enabled in the depparse annotator by using its 'depparse.extradependencies' option.
According to http://nlp.stanford.edu/software/corenlp.shtml, it is set to NONE by default and can be set to SUBJ_ONLY or MAXIMAL.
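As an illustration, a sketch under the assumption that a CoreNLP server is already running locally on port 9000 and that the option can be passed through the server's properties parameter (the node.js wrapper exposes the same properties):

    import json
    import requests

    text = "The quick brown fox jumped over the lazy dog."

    properties = {
        "annotators": "tokenize,ssplit,pos,depparse",
        "depparse.extradependencies": "MAXIMAL",  # other values: SUBJ_ONLY, NONE
        "outputFormat": "json",
    }

    response = requests.post(
        "http://localhost:9000/",
        params={"properties": json.dumps(properties)},
        data=text.encode("utf-8"),
    )

    annotation = response.json()
    # The per-sentence dependency sections (basic vs. enhanced) show up as keys here.
    print(annotation["sentences"][0].keys())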
