How to print out to a file using Stanford Classifier - nlp

I am using the Stanford Classifier for my project.
The project takes training data to tune the algorithm, then test data to classify text inputs into categories.
The format for both training and test data is tab-delimited text, i.e. predictor -TAB- input text.
The software prints its output to stdout (the command line).
Is there any way to write the output to a text file instead?
I searched the Javadoc on the project site and found the csvOutput property,
but I don't know how to use it.
I tried -csvoutput=%1%n%c on the command line,
but it throws a Java NullPointerException when I run it.

If you want to save it to a file, just redirect stdout by adding this to the end of your command:
> output_file.txt
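If you need to do this programmatically rather than from the shell, here is a minimal Python sketch that captures the classifier's stdout into a file; the jar path, main class invocation, and prop file name below are assumptions about your setup:

import subprocess

# Everything the classifier prints to stdout lands in output_file.txt.
# Jar path, main class, and example.prop are placeholders for your setup.
with open("output_file.txt", "w") as out:
    subprocess.run(
        ["java", "-cp", "stanford-classifier.jar",
         "edu.stanford.nlp.classify.ColumnDataClassifier",
         "-prop", "example.prop"],
        stdout=out,
        check=True,
    )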

Related

Unable to solve multiprocessing.Manager.Lock() error in Python code (VS editor)

I am using machine learning in my Python (version 3.8.5) code. In the preprocessing part, I need to hash-encode a few features, so in the training phase I dumped a hash encoder to a pickle file named 'hash_encoder.pkl'. Now, in the testing phase, I need to transform the features using this pickle file. I'm using the code given in the screenshot to hash-encode the three string features given in its first line.
In the encoder.transform line, I get the error "data_lock=multiprocessing.Manager().Lock()".
At the end I also get 'raise EOFError'.
I have tried using the same version of pandas (1.1.3) both to dump the hash_encoder file and to load it. I'm not sure why this is coming up.
Can someone help me understand or debug this part?
I have added the screenshot of the error.

Stanford CoreNLP Output

Is there any way to generate this output from the Stanford CoreNLP server?
https://drive.google.com/drive/folders/1K2g7nBzHgOpiBQZFRQBNWbylIvCANsdQ?usp=sharing
I have tried running the server on sample sentences with the following annotators:
'tokenize', 'ssplit', 'pos', 'lemma', 'depparse', 'natlog', 'openie', 'ner', 'parse'
and I get similar data, just in a different format.
I am assuming that the format I am trying to get the output into is the default output of an older version of CoreNLP. Is there any way to get the output in the format needed?
The output format is fixed and cannot be configured out of the box. However, you can deserialize the output and convert it into the required format yourself; a dataclasses-based converter is one clean way to do this.
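A minimal sketch of that approach, assuming you requested JSON from the server (outputFormat=json) and saved the response to a file; the file name is a placeholder, while the field names follow CoreNLP's JSON output schema:

import json
from dataclasses import dataclass

@dataclass
class Token:
    word: str
    lemma: str
    pos: str
    ner: str

# corenlp_output.json is a placeholder for the saved server response
with open("corenlp_output.json", encoding="utf-8") as f:
    doc = json.load(f)

# Flatten the document into Token objects
tokens = [
    Token(t["word"], t["lemma"], t["pos"], t["ner"])
    for sentence in doc["sentences"]
    for t in sentence["tokens"]
]

# Re-emit in whatever format you need, e.g. word/POS pairs
print(" ".join(f"{t.word}/{t.pos}" for t in tokens))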

How to do language model training on BERT

I want to train BERT on a target corpus. I am looking at this HuggingFace implementation.
They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?
The .raw extension only indicates that they use the raw version of WikiText; they are regular text files containing the raw text:
We're using the raw WikiText-2 (no tokens were replaced before the tokenization).
The description of the data file options also says that they are text files. From run_language_modeling.py, L86-L88:
train_data_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a text file)."}
)
Therefore you can just specify your .txt files via --train_data_file; the extension makes no difference.
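As a quick check, here is a minimal sketch (assuming a transformers version that still ships LineByLineTextDataset, one of the dataset classes used by run_language_modeling.py) showing that any plain text file is accepted:

from transformers import AutoTokenizer, LineByLineTextDataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# file_path accepts any plain text file; nothing requires a .raw extension
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",  # placeholder for your own training file
    block_size=128,
)
print(len(dataset))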

How to Train GloVe algorithm on my own corpus

I tried to follow this.
But somehow I wasted a lot of time and ended up with nothing useful.
I just want to train a GloVe model on my own corpus (a ~900 MB corpus.txt file).
I downloaded the files provided in the link above and compiled them using Cygwin (after editing the demo.sh file and changing it to VOCAB_FILE=corpus.txt; should I leave CORPUS=text8 unchanged?).
the output was:
cooccurrence.bin
cooccurrence.shuf.bin
text8
corpus.txt
vectors.txt
How can I use those files to load a GloVe model in Python?
You can do it using the glove_python library:
Install it: pip install glove_python
Then:
from glove import Corpus, Glove

# Read the corpus: one list of tokens per line
with open('corpus.txt', encoding='utf-8') as f:
    lines = [line.split() for line in f]

# Fit the corpus to build the word co-occurrence matrix used by GloVe
corpus = Corpus()
corpus.fit(lines, window=10)

# Train the embeddings and attach the vocabulary
glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
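Once saved, the model can be reloaded and queried later; a short usage sketch, assuming glove_python's load and most_similar methods:

from glove import Glove

model = Glove.load('glove.model')
# Nearest neighbours of an example word; assumes 'language' is in the vocabulary
print(model.most_similar('language', number=5))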
Reference: word vectorization using glove
This is how you get and compile GloVe:
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
To train it on your own corpus, you just have to make changes to one file: demo.sh.
Remove the script from if to fi after make (that block just downloads the example corpus).
Replace the CORPUS name with your file name 'corpus.txt'.
There is another if block at the end of demo.sh:
if [ "$CORPUS" = 'text8' ]; then
Replace text8 there with your file name.
Run the demo.sh once the changes are made.
$ ./demo.sh
Make sure your corpus file is in the correct format. You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by newline characters.
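For instance, a minimal preparation sketch (the documents list is a placeholder for your own data):

documents = ["first example document", "second example document"]  # placeholders

with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        # words separated by single spaces, one document per line
        f.write(" ".join(doc.split()) + "\n")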
Your corpus should go into the CORPUS variable. The vectors.txt file is the output, which is what you are supposed to use. You can train GloVe in Python, but it takes more time and you need a C compilation environment; I tried it before and would not recommend it.
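Once training finishes, vectors.txt is plain text (one word plus its vector per line), so it can be loaded without any extra library; a minimal sketch:

import numpy as np

def load_glove_vectors(path="vectors.txt"):
    # Each line: the word, then its vector components, all space-separated
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

embeddings = load_glove_vectors()
# 'the' is just an example query; any word from your corpus works
print(embeddings["the"][:5])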
Here is my take on this:
After cloning the repository, edit the demo.sh file: since you have to train it on your own corpus, replace the CORPUS name with your file's name.
Then remove the script between make and CORPUS, as that part only downloads an example corpus for you.
Then run make, which will create the four files in the build folder.
Now run ./demo.sh, which will train and do all the stuff mentioned in the script on your own corpus, and the output will be generated as a vectors.txt file.
Note: Don't forget to keep your corpus file directly inside the GloVe folder.

CrfSharp file not found

When I try to run CRFSharp, I get the following error in VS2012:
+err{"Could not find file 'C:\codeplex\POIParser\data\training\POIParser_corpus.train.tag'.":"C:\codeplex\POIParser\data\training\POIParser_corpus.train.tag"} System.Exception {System.IO.FileNotFoundException}
Where can I find this file "POIParser_corpus.train.tag"? I have downloaded both the source code and the main program of CRFSharp and am running it in VS2012.
Also, I want to ask: can I use CRFSharp to extract aspects by using training templates?
How do you run it?
To train a CRF model, you need to prepare a training corpus and a template file first, then run CRFSharpConsole.exe with some parameters. CRFSharpConsole.exe will show its usage if you run it without any parameters.
Actually, I recommend you download the demo package from the [DOWNLOADS] section of the CRFSharp project website (http://crfsharp.codeplex.com) first, and then play with the demos. The demo package shows how to run CRFSharp on the command line. For example, you can download the English named-entity-recognition demo and run its batch file to train a new model and test it.
As for the POIParser_corpus.train.tag file you mentioned, it is the training corpus for the Chinese POI inner-structure parser. You can download it as well, then run build_model.bat to train the model and test_model.bat to test it.
