How to do language model training on BERT

I want to train BERT on a target corpus. I am looking at this HuggingFace implementation.
They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?

The .raw extension only indicates that they use the raw version of WikiText; these are regular text files containing the raw text:
We're using the raw WikiText-2 (no tokens were replaced before the tokenization).
The description of the data file options also says that they are text files. From run_language_modeling.py, L86-L88:
train_data_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a text file)."}
)
Therefore you can just specify your text files.
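For example, a training run could look roughly like this (a sketch only: the flag set of run_language_modeling.py has changed between transformers versions, and the corpus path, model name, and output directory below are placeholders):
python run_language_modeling.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --do_train \
    --train_data_file /path/to/my_corpus.txt \
    --mlm \
    --output_dir ./bert-lm-output
The --mlm flag is there because BERT is trained with a masked language modeling objective; the script refuses to run a BERT model type without it.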

Related

How To Import The MNIST Dataset From Local Directory Using PyTorch

I am writing code for the well-known MNIST database of handwritten digits in PyTorch. I downloaded the training and test datasets (from the main website), including the labels. The dataset format is t10k-images-idx3-ubyte.gz and, after extraction, t10k-images-idx3-ubyte. My dataset folder looks like:
MNIST
  Data
    train-images-idx3-ubyte.gz
    train-labels-idx1-ubyte.gz
    t10k-images-idx3-ubyte.gz
    t10k-labels-idx1-ubyte.gz
Now, I wrote code to load the data like below:
def load_dataset():
    data_path = "/home/MNIST/Data/"
    xy_trainPT = torchvision.datasets.ImageFolder(
        root=data_path, transform=torchvision.transforms.ToTensor()
    )
    train_loader = torch.utils.data.DataLoader(
        xy_trainPT, batch_size=64, num_workers=0, shuffle=True
    )
    return train_loader
My code throws the error Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif,.tiff,.webp
How can I solve this problem? I also want to check that my images are loaded correctly (e.g. a figure containing the first 5 images from the dataset).
Read this: Extract images from .idx3-ubyte file or GZIP via Python
Update
You can import the data like this:
xy_trainPT = torchvision.datasets.MNIST(
    root="~/Handwritten_Deep_L/",
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
)
Now, what happens with download=True: first, your code checks whether the root directory (the path you gave) already contains the dataset.
If not, the dataset will be downloaded from the web.
If the path already contains the dataset, your code will use the existing data and will not download it from the internet.
You can check this yourself: first give a path without any dataset (the data will be downloaded from the internet), then give a path that already contains the dataset (the data will not be downloaded).
Welcome to Stack Overflow!
The MNIST dataset is not stored as images, but in a binary format (as indicated by the ubyte extension). Therefore, ImageFolder is not the type of dataset you want. Instead, you will need to use the MNIST dataset class. It can even download the data for you if you haven't done so already :)
This is a dataset class, so just instantiate it with the proper root path, then pass it as the parameter to your dataloader and everything should work just fine.
If you want to check the images, just index the dataset (its get method) and save the result as a PNG file (you may need to convert the tensor to a numpy array first).
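For a quick look at the first 5 images, a minimal sketch (assuming matplotlib is installed; the root path is the one from the snippet above) could be:
import matplotlib.pyplot as plt
import torchvision

# Load MNIST; download=True fetches it if the root path does not contain it yet
dataset = torchvision.datasets.MNIST(
    root="~/Handwritten_Deep_L/",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)

# Plot the first 5 images with their labels and save the figure
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for i, ax in enumerate(axes):
    image, label = dataset[i]  # image is a 1x28x28 tensor
    ax.imshow(image.squeeze().numpy(), cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
fig.savefig("first_five_mnist.png")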

UnpicklingError: invalid load key, '`'

I tried to use a pretrained model for the Russian language from
https://wikipedia2vec.github.io/wikipedia2vec/pretrained/
But I can't load the model from the .pkl file.
I tried other encodings such as cp1251, latin1, and windows-1252. Unfortunately, it still fails.
model = Word2Vec.load_word2vec_format('ruwiki_20180420_100d.pkl')
UnpicklingError: invalid load key, '`'
According to the text on the page you've referenced, https://wikipedia2vec.github.io/wikipedia2vec/pretrained/, the binary files there should be loaded with Wikipedia2Vec.load().
Only the other text files there, with suffixes .txt, can be loaded with gensim's load_word2vec_format() method.
Either use Wikipedia2Vec.load() with the file you've mentioned, or try the text file variants instead.
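A minimal sketch of both options (the file names below are the one from the question and its .txt counterpart; the first option assumes the wikipedia2vec package is installed):
# Option 1: load the binary .pkl file with the wikipedia2vec package
from wikipedia2vec import Wikipedia2Vec
wiki2vec = Wikipedia2Vec.load("ruwiki_20180420_100d.pkl")

# Option 2: load the plain-text variant with gensim
from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format("ruwiki_20180420_100d.txt")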

Tensorflow object detection API tfrecord

I'm new to TensorFlow's TFRecord format, so I'm studying the TensorFlow Object Detection API code:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md
but I can't find the code that loads the TFRecord files.
I think they use the .config file to load the TFRecords, because I found this in the config file:
tf_record_input_reader {
  input_path: "/path/to/train_dataset.record-?????-of-00010"
}
Can anyone help?
Have you converted your dataset to TFRecord format yet?
If so, you should have a path which contains your training dataset, sharded into a few record files with the format
<path_to_training_data>/<train_dataset>.record-?????-of-xxxxx
where <path_to_training_data> is the above-mentioned path to your training dataset, <train_dataset> is the file name you gave to each file, xxxxx is the number of record files created (e.g. 00010), and ????? should be left as is; it acts as a pattern matching all the record files.
Once you've replaced <path_to_training_data>, <train_dataset> and xxxxx with the correct values for your dataset, the TF OD API should handle everything else (finding all records, reading them, etc.).
Note there's usually tf_record_input_reader for both training dataset and eval dataset (validation/test), and each should have the corresponding above-mentioned values (path, dataset name, number of files).
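You normally don't load these records yourself, but if you want to sanity-check them, a minimal sketch with tf.data (the glob pattern below is an assumption matching the naming scheme above) could be:
import tensorflow as tf

# Collect all shards of the training set
filenames = tf.io.gfile.glob("/path/to/train_dataset.record-?????-of-00010")

# TFRecordDataset yields the raw serialized tf.train.Example protos
dataset = tf.data.TFRecordDataset(filenames)

# Inspect the feature keys of the first record
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(list(example.features.feature.keys()))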

How to print out to a file using Stanford Classifier

I am using Stanford Classifier for my project.
This project takes training data to tune the algorithm then test data to classify text inputs into categories.
So the format for the training and test data is tab-delimited text, i.e. predictor -TAB- input text.
The software prints out the output to stdout (command line).
Is there any way to output to a text file?
I searched the Javadoc on the project site, and I found the csvOutput property, but I don't know how to use it.
I tried -csvoutput=%1%n%c on the command line, but it gives me a Java NullPointerException when I try to run it.
If you want to save it to a file just add this to the end of your command:
> output_file.txt
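So a complete run might look like this (a sketch only: the jar name, the input file names, and the -trainFile/-testFile options are assumptions based on the usual ColumnDataClassifier examples, not taken from your setup):
java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier \
    -trainFile project.train -testFile project.test > output_file.txt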

How can I create my own model in the Stanford POS tagger?

I want to add new tagged words (local words that are used in our region) and create a new model. I created a .props file from the command line, but how can I create a .tagger file?
When I tried to create such a file as described on the Stanford website, it shows an error like
"No model specified"
What is the -model argument, is it the corpus? How can I add my new tagged words to it?
How do I train a tagger, then?
The Stanford site says that:
You need to start with a .props file which contains options for the
tagger to use. The .props files we used to create the sample taggers
are included in the models directory; you can start from whichever one
seems closest to the language you want to tag.
For example, to train a new English tagger, start with the left3words
tagger props file. To train a tagger for a western language other than
English, you can consider the props files for the German or the French
taggers, which are included in the full distribution. For languages
using a different character set, you can start from the Chinese or
Arabic props files. Or you can use the -genprops option to
MaxentTagger, and it will write a sample properties file, with
documentation, for you to modify. It writes it to stdout, so you'll
want to save it to some file by redirecting output (usually with >).
The # at the start of the line makes things a comment, so you'll want
to delete the # before properties you wish to specify.
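As that quote describes, you can generate a documented template properties file and redirect it to disk (the jar name here is an assumption):
java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops > myconfig.props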
Here are two links that can help you, describing step-by-step instructions on how to create (train) your tagger:
https://medium.com/@klintcho/training-a-swedish-pos-tagger-for-stanford-corenlp-546e954a8ee7
http://www.florianboudin.org/wiki/doku.php?id=nlp_tools_related&DokuWiki=9d6b70b2ee818e600edc0359e3d7d1e8
Please note that inside the config file you should point to your treebank (that is, real-world sentences parsed in a dependency-tree format, with POS tags and dependency relations). On the same line you should specify its format:
TEXT // represents a tokenized file separated by text
TSV // represents a tsv file such as a conll file
TREES // represents a file in PTB format
In my case, I used a CoNLL file, which is a tab-separated-values (TSV) format. I must confess that I couldn't find clear documentation and had to resort to the source code.
My config:
model = portuguese.tagger
arch = left3words,naacl2003unknowns,allwordshapes(-1,1)
trainFile = format=TSV,wordColumn=1,tagColumn=4,C:\\path\\universal-dev.conll
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
tagSeparator = _
encoding = utf-8 # that's because I based my config on spanish!
iterations = 100
lang = spanish
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags =
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tokenize = true
tokenizerOptions = asciiQuotes
verbose = false
verboseResults = false
veryCommonWordThresh = 250
xmlInput = null
outputFormat = slashTags
nthreads = 16
The model property specifies the file to which the built model will be saved. You can provide any valid path, e.g. mymodel.tagger.
You can use this same properties file at test time, and MaxentTagger will then load from the specified model file rather than saving to it.
To be clear: your training corpus should be provided with the property trainFile. See the tagger properties files included with the Stanford Tagger for examples.
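With such a properties file in place, training is then just a matter of running MaxentTagger on it (a sketch; the jar name and heap size are assumptions, and myconfig.props stands for the properties file shown above):
java -Xmx4g -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -props myconfig.props
This writes the trained model to the path given by the model property (portuguese.tagger in the config above), which is the .tagger file the question asks about.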
