spaCy's annotation specifications page says the models use the Universal Dependencies scheme. But when I parse "I love you", "you" is labelled a "dobj" of "love", and there is no "dobj" in the Universal Dependencies relations documentation. So I have two questions:
How to get spacy to use the universal dependency relations?
How to get the doc for the relations spacy uses?
spaCy's provided models don't use UD dependency labels for English or German. From the docs, where you can find the tables of dependency labels (https://spacy.io/api/annotation#dependency-parsing):
The individual labels are language-specific and depend on the training corpus.
For most other models / languages, UD dependencies are used.
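For example, you can inspect the labels a given model actually produces and ask spaCy to explain each one (a minimal sketch, assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")  # example English model
doc = nlp("I love you")

for token in doc:
    # token.dep_ is the model-specific label; spacy.explain gives a short description
    print(token.text, token.dep_, spacy.explain(token.dep_))

For the English models this prints labels like nsubj and dobj; dobj comes from the training corpus's scheme and is not part of the UD v2 relation set, which uses obj instead.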
How to get spacy to use the universal dependency relations?
According to the official spaCy documentation, all spaCy models are trained on Universal Dependencies corpora, which are language-specific. For English, you can see the full list of labels at this link, where dobj is listed as "direct object".
How to get the doc for the relations spacy uses?
I don't know what you mean by "doc". If you mean documentation, I already provided the official documentation when answering the first question. You can also use spacy.explain() to get a quick description of any label, like so:
>>> import spacy
>>>
>>> spacy.explain('dobj')
'direct object'
>>>
>>> spacy.explain('nsubj')
'nominal subject'
Hope this answers your question!
I want to run entity linking for a project of mine. I used spaCy for NER on a corpus of documents. Is there an existing linking model I can simply use to link the entities it found?
The documentation I have found only seems to cover how to train a custom one.
Examples:
https://spacy.io/api/kb
https://github.com/explosion/spaCy/issues/4511
Thanks!
spaCy does not distribute pre-trained entity linking models. See here for some comments on why not.
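If you do end up training a custom linker, a minimal sketch of setting up a knowledge base with the spacy.kb API linked in the question could look like the following (the entity ID, frequency, vector, and alias below are made up, and in spaCy 3.5+ the concrete class is InMemoryLookupKB rather than KnowledgeBase):

import spacy
from spacy.kb import KnowledgeBase  # spaCy 3.0-3.4; InMemoryLookupKB in newer releases

nlp = spacy.load("en_core_web_sm")  # example model

# The vector length must match the entity linker's configuration
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64)

# Register a (made-up) entity with a frequency and an embedding
kb.add_entity(entity="Q42", freq=100, entity_vector=[0.0] * 64)

# Map a surface form to candidate entities with prior probabilities
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

kb.to_disk("./my_kb")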
I found one - Facebook GENRE:
https://github.com/facebookresearch/GENRE
!pip install transformers
from transformers import InputExample, InputFeatures
What are InputExample and InputFeatures here?
Thanks.
Check out the documentation.
Processors
This library includes processors for several traditional tasks. These
processors can be used to process a dataset into examples that can be
fed to a model.
And
class transformers.InputExample
A single training/test example for simple sequence classification.
As well as
class transformers.InputFeatures
A single set of features of data. Property names are the same names as
the corresponding inputs to a model.
So basically, InputExample is just a raw input, and InputFeatures is the (numerical) feature representation of that input that the model actually uses.
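As a rough illustration (the tokenizer, label, and max length here are placeholders, not tied to any particular task), you could build both objects yourself like this:

from transformers import BertTokenizer, InputExample, InputFeatures

# A single raw example: one sentence plus a (made-up) string label
example = InputExample(guid="train-0", text_a="I love this movie", label="positive")

# Turn the raw text into the numerical inputs a BERT-style model expects
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer(example.text_a, padding="max_length", truncation=True, max_length=32)

features = InputFeatures(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
    label=0,  # numeric id for "positive" in some label map
)
print(features.input_ids[:10])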
I couldn't find any tutorial explicitly explaining this, but you can check out Chapter 4 (From text to features) in this tutorial, where it is nicely explained with an example.
In my experience, the transformers library has an absolute ton of classes and structures, so going too deep into the technical implementation makes it easy to get lost. For starters, I would recommend getting an idea of the broader picture by getting some example projects to work, as well as checking out their 🤗 Course.
I'm trying to get started with the gensim library. My goal is pretty simple: I want to use the keyword extraction provided by gensim on a German text. Unfortunately, I'm failing hard.
Gensim comes with keyword extraction built in, based on TextRank. While the results look good on English text, it doesn't seem to work on German. I simply installed gensim via PyPI and used it out of the box. Such tools are usually driven by a model, so my guess is that gensim ships with an English model. A word2vec model for German is available on a GitHub page.
But here I'm stuck: I can't find out how the summarization module of gensim, which provides the keywords function I'm looking for, can work with an external model.
So the basic question is: how do I load the German model and get keywords from German text?
Thanks
There's nothing in the gensim docs, or the original TextRank paper (from 2004), suggesting that algorithm requires a Word2Vec model as input. (Word2Vec was 1st published around 2013.) It just takes word-tokens.
See examples of its use in the tutorial notebook that's included with gensim:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb
I'm not sure the same algorithm would work as well on German text, given the differing importance of compound words. (To my eyes, TextRank isn't very impressive with English, either.) You'd have to check the literature to see if it still gives respected results. (Perhaps some sort of extra stemming/intraword-tokenizing/canonicalization would help.)
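For reference, a minimal sketch of the built-in keyword extraction (assuming gensim 3.x; the summarization module, and with it the keywords function, was removed in gensim 4.0). Note that it only needs raw text, no word2vec model:

from gensim.summarization import keywords

text = (
    "Natural language processing makes it possible for computers to read text, "
    "hear speech and interpret it. Keyword extraction picks out the most relevant "
    "words and phrases from a text."
)

# TextRank-based keyword extraction on plain word tokens; no embedding model involved
print(keywords(text, split=True))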
According to the authors of the paper "Discrete-State Variational Autoencoders for Joint Discovery and Factorization of Relations", the first field of this dataset is a lexicalized dependency path between the pair of entities in each training sentence.
What tool (preferably in python) can extract such lexicalized path from a sentence with an identified pair of entities?
You can use NLTK
NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."
Using NLTK, you can parse a given sentence to get dependency relations between its words and their POS tags.
It doesn't provide a way to get those lexicalized dependency paths directly,
but it gives you what you need to write your own method to achieve that.
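As a rough sketch of that approach (this assumes a Stanford CoreNLP server is running on localhost:9000 for NLTK's CoreNLPDependencyParser to call, and uses networkx just to find the path; the sentence and entity words are illustrative):

import networkx as nx
from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url="http://localhost:9000")
sentence = "Barack Obama was born in Hawaii"

# Parse and collect (head, relation, dependent) triples
parse, = parser.raw_parse(sentence)
graph = nx.Graph()  # undirected, so the path ignores arc direction
for (head, _), rel, (dep, _) in parse.triples():
    graph.add_edge(head, dep, rel=rel)

# Shortest path between the two entity head words, then lexicalize it
path = nx.shortest_path(graph, source="Obama", target="Hawaii")
pieces = []
for a, b in zip(path, path[1:]):
    pieces.append("{} -{}-> {}".format(a, graph[a][b]["rel"], b))
print(" ".join(pieces))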
I am trying to extract collocations using nltk from a corpus and then use their occurrences as features for a scikit-learn classifier.
Unfortunately I am not so familiar with nltk and I don't see an easy way to do this.
I got this far:
extract collocations using BigramCollocationFinder from corpus
for each document, extract all bigrams (using nltk.bigrams) and check if they are one of the collocations
create a TfidfVectorizer with an analyzer that does nothing
feed it the documents in the form of the extracted bigrams
That seems pretty overcomplicated to me. It also has the problem that BigramCollocationFinder has a window_size parameter that allows bigrams spanning intermediate words, which the standard nltk.bigrams extraction cannot do.
A way to overcome this would be to instantiate a new BigramCollocationFinder for each document, extract the bigrams again, and match them against the ones I found before... but again, that seems way too complicated.
Surely there is an easier way to do this that I'm overlooking.
Thanks for your suggestions!
larsmans has already contributed an NLTK / scikit-learn feature mapper for simple, non-collocation features. That might give you some inspiration for your own problem:
http://nltk.org/_modules/nltk/classify/scikitlearn.html
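For what it's worth, here is a rough sketch of the pipeline described in the question (BigramCollocationFinder over the whole corpus, then a TfidfVectorizer with a pass-through analyzer); the toy corpus and the PMI scoring are just placeholders:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning is fun",
    "machine learning needs data",
    "data is everywhere",
]
tokenized = [doc.split() for doc in docs]

# 1. Find collocations over the whole corpus
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(w for doc in tokenized for w in doc)
collocations = set(finder.nbest(bigram_measures.pmi, 10))

# 2. For each document, keep only the bigrams that are collocations
bigram_docs = [
    [bg for bg in nltk.bigrams(tokens) if bg in collocations]
    for tokens in tokenized
]

# 3. + 4. TfidfVectorizer with an analyzer that passes the bigram lists through untouched
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)
X = vectorizer.fit_transform(bigram_docs)
print(X.shape)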