Where can I get CoNLL-X training data? - nlp

I'm trying to train the Stanford Neural Network Dependency Parser to check phrase similarity.
The way I tried is:
java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile trainPath -devFile devPath -embedFile wordEmbeddingFile -embeddingSize wordEmbeddingDimensionality -model modelOutputFile.txt.gz
The error that I got is:
Train File: C:\Users\rohit\Downloads\CoreNLP-master\CoreNLP-master\data\edu\stanford\nlp\parser\trees\en-onetree.txt
Dev File: null
Model File: modelOutputFile.txt.gz
Embedding File: null
Pre-trained Model File: null
################### Train
#Trees: 1
0 tree(s) are illegal (0.00%).
1 tree(s) are legal but have multiple roots (100.00%).
0 tree(s) are legal but not projective (0.00%).
###################
#Word: 3
#POS:3
#Label: 2
###################
#Transitions: 3
#Labels: 1
ROOTLABEL: null
Random generator initialized with seed 1459831358061
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.parser.nndep.Util.scaling(Util.java:49)
at edu.stanford.nlp.parser.nndep.DependencyParser.readEmbedFile. (DependencyParser.java:636)
at edu.stanford.nlp.parser.nndep.DependencyParser.setupClassifierForTraining(DependencyParser.java:787)
at edu.stanford.nlp.parser.nndep.DependencyParser.train(DependencyParser.java:676)
at edu.stanford.nlp.parser.nndep.DependencyParser.main(DependencyParser.java:1247)
The help embedded within the code says that the training file should be a - "Path to a training treebank in CoNLL-X format".
Does anyone know where I can find some CoNLL-X training data to train?
I gave training file but not embedding file and got this error.
My guess is if I give the embedding file it might work.
Please shed some light on which training file & embedding file I should use and where I can find them.

CoNLL-X treebanks
You can get the training data for Danish, Dutch, Portuguese, and Swedish available for free here. For other languages, you'll probably need to license a treebank from LDC, unfortunately (details for many languages on that page).
Universal Dependencies are in CoNLL-U format, which can usually be converted to CoNLL-X format with some work.
Lastly, there's a large list of treebanks and their availability on this page. You should be able to convert many of the dependency treebanks in this list into CoNLL-X format if they're not already in that format.
Training the Stanford Neural Net Dependency parser
From this page: The embedding file is optional, but the treebank is not. The best treebank and embedding files to use depend on which language and type of text you'd like to parse. Ideally, you would train on as much data as possible in the domain/genre that you're trying to parse.

Related

Bert for relation extraction

i am working with bert for relation extraction from binary classification tsv file, it is the first time to use bert so there is some points i need to understand more?
how can i get an output like giving it a test data and show the classification results whether it is classified correctly or not?
how bert extract features of the sentences, and is there a method to know what are the features that is chosen?
i used once the hidden layers and another time i didn't use i got the accuracy of not using the hidden layer higher than using it, is there an reason for that?

Spacy 3.0 Training custom NER --> Validation of this custom NER model

I trained a custom SpaCy Named entity recognition model to detect biased words in job description. Now that I trained 8 variantions (using different base model, training model, and pipeline setting), I want to evaluate which model is performing best.
But.. I can't find any documentation on the validation of these models.
There are some numbers of recall, f1-score and precision on the meta.json file, in the output folder, but that is no sufficient.
Anyone knows how to validate or can link me to the correct documentation? The documentation seem nowhere to be found.
NOTE: Talking about SpaCy V3.x
During training you should provide "evaluation data" that can be used for validation. This will be evaluated periodically during training and appropriate scores will be printed.
Note that there's a lot of different terminology in use, but in spaCy there's "training data" that you actually train on and "evaluation data" which is not training and just used for scoring during the training process. To evaluate on held-out test data you can use the cli evaluate command.
Take a look at this fashion brands example project to see how "eval" data is configured and used.

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find any answer. Like a real example, I want to build a model with doc2vec and then train it with some ML models. after that, how can I get the vector of a raw string with the exact trained Doc2vec model? because I need to predict with my ML model with the same size and logical vector
There are a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.

Stanford CoreNLP Dependency Parser Usage with Unsupported Languages

I am trying to train CoreNLP's NN based dependency parser in Turkish. I have found the command below in the documentation of the parser:
Train a parser with CoNLL treebank data: java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile trainPath
-devFile devPath -embedFile wordEmbeddingFile -embeddingSize wordEmbeddingDimensionality -model modelOutputFile.txt.gz
I couldn't exactly figure out what the modelOutputFile is. It is stated in the documentation that this file is written in the training phase. Is modelOutputFile a pregenerated file that I should create or just an empty file that will be written automatically in the training phase?
Any help will be appreciated, thank you!
When the training process is done it should write the trained model to modelOutputFile.txt.gz You can then use that trained file to parse new text. Full documentation here: https://nlp.stanford.edu/software/nndep.shtml

unary class text classification in weka?

I have a training dataset (text) for a particular category (say Cancer). I want to train a SVM classifier for this class in weka. But when i try to do this by creating a folder 'cancer' and putting all those training files to that folder and when i run to code i get the following error:
weka.classifiers.functions.SMO: Cannot handle unary class!
what I want to do is if the classifier finds a document related to 'cancer' it says the class name correctly and once i fed a non cancer document it should say something like 'unknown'.
What should I do to get this behavior?
The SMO algorithm in Weka only does binary classification between two classes. Sequential Minimal Optimization is a specific algorithm for solving an SVM and in Weka this a basic implementation of this algorithm. If you have some examples that are cancer and some that are not, then that would be binary, perhaps you haven't labeled them correctly.
However, if you are using training data which is all examples of cancer and you want it to tell you whether a future example fits the pattern or not, then you are attempting to do one-class SVM, aka outlier detection.
LibSVM in Weka can handle one-class svm. Unlike the Weka SMO implementation, LibSVM is a standalone program which has been interfaced into Weka and incorporates many different variants of SVM. This post on the Wekalist explains how to use LibSVM for this in Weka.

Resources