UnpicklingError: invalid load key, '`' - python-3.x

I tried to use the pretrained model for the Russian language from
https://wikipedia2vec.github.io/wikipedia2vec/pretrained/
but I can't load the model from the .pkl file.
I also tried other encodings such as cp1251, latin1, and windows-1252. Unfortunately, it still fails.
model = Word2Vec.load_word2vec_format('ruwiki_20180420_100d.pkl')
UnpicklingError: invalid load key, '`'

According to the text on the page you've referenced, https://wikipedia2vec.github.io/wikipedia2vec/pretrained/, the binary files there should be loaded with Wikipedia2Vec.load().
Only the other text files there, with suffixes .txt, can be loaded with gensim's load_word2vec_format() method.
Either use Wikipedia2Vec.load() with the file you've mentioned, or try the text file variants instead.
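For illustration, a minimal sketch of both options (assuming the wikipedia2vec package is installed, the .txt variant has been downloaded and decompressed, and the example word is arbitrary):

from wikipedia2vec import Wikipedia2Vec
from gensim.models import KeyedVectors

# Option 1: load the binary .pkl file with the library that produced it
wiki2vec = Wikipedia2Vec.load('ruwiki_20180420_100d.pkl')
print(wiki2vec.get_word_vector('москва'))

# Option 2: load the plain-text variant with gensim's word2vec loader
kv = KeyedVectors.load_word2vec_format('ruwiki_20180420_100d.txt')
print(kv['москва'][:5])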

Related

Confusion regarding joblib.dump()

One way to save sklearn models is to use joblib.dump(model, filename). I am confused about the filename argument. One way to call this function is:
joblib.dump(model,"model.joblib")
This saves the model successfully, and the model loads correctly with:
model=joblib.load("model.joblib")
Another way is:
joblib.dump(model,"model")
with no ".joblib" extension this time. This also runs successfully, and the model loads correctly with:
model=joblib.load("model")
What confuses me is the file extension in the filename. Is there a particular extension I should use when saving the model, or is it unnecessary to use one, as I did above? If it is unnecessary, why?
There is no file extension that "must" be used to serialize a model. You can enable compression by using one of the supported filename extensions (.z, .gz, .bz2, .xz or .lzma); if you just pass compress=True, joblib defaults to zlib compression.
Therefore you can use any file extension, or none at all. However, it is good practice to use the library name as the extension so you know how to load the file later.
I name my serialized model model.pickle when I'm using the pickle library and model.joblib when I'm using joblib.
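A short sketch of the point above (the model and file names are only examples):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# The extension is purely a naming convention; all three files load fine
joblib.dump(model, "model")            # no extension, uncompressed
joblib.dump(model, "model.joblib")     # conventional extension, uncompressed
joblib.dump(model, "model.joblib.gz")  # .gz extension enables gzip compression

restored = joblib.load("model.joblib.gz")  # compression is detected automatically
print(restored.score(X, y))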

How to do language model training on BERT

I want to train BERT on a target corpus. I am looking at this HuggingFace implementation.
They are using .raw files for the training data. If I have .txt files of my training data, how can I use their implementation?
The .raw extension only indicates that they use the raw version of WikiText; these are regular text files containing the raw text:
We're using the raw WikiText-2 (no tokens were replaced before the tokenization).
The description of the data file options also says that they are text files. From run_language_modeling.py, lines 86-88:
train_data_file: Optional[str] = field(
    default=None, metadata={"help": "The input training data file (a text file)."}
)
Therefore you can just specify your text files.
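If you want to see what the script does with such a file, here is a rough sketch using the LineByLineTextDataset helper that (in older transformers versions) the example script builds internally when --line_by_line is passed; my_corpus.txt is a hypothetical file name:

from transformers import BertTokenizer, LineByLineTextDataset

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A plain .txt file is all that's needed: one training example per line
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",  # hypothetical path to your own corpus
    block_size=128,
)
print(len(dataset))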

Creating a file in node.js, using an encoding (CP437 / IBM) which is not part of the supported standard node encodings [ascii/base64/latin1/...]

I am processing files with different encoding types.
Right now, every encoded file is transformed to UTF-8 and saved to my SQL DB.
My goal is to generate new files with the same encoding as the original data.
I am able to decode hex as CP437/IBM, but I am unable to write the resulting string to a file while maintaining the desired encoding.
const fs = require('fs');
const cptable = require('codepage'); // the npm "codepage" package
const decodedString = cptable.utils.decode(437, myHexString);
// Node's built-in fs encodings don't include CP437, so the string gets re-encoded on write
fs.appendFile(filename, decodedString, options.encoding, (err) => {
  console.log("please help me");
});
The result is a file with faulty encoding, but also contains a hidden message.

Tensorflow object detection API tfrecord

I'm new to TensorFlow's TFRecord format, so I'm studying the TensorFlow Object Detection API code:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md
But I can't find the code that loads the TFRecord files.
I think they use the .config file to load the TFRecords, because I found this in the config file:
tf_record_input_reader {
  input_path: "/path/to/train_dataset.record-?????-of-00010"
}
Can anyone help?
Have you converted your dataset to TFRecord format yet?
If so, you should have a path containing your training dataset, sharded into several record files with the format
<path_to_training_data>/<train_dataset>.record-?????-of-xxxxx
Where <path_to_training_data> is the above-mentioned path to your training dataset, <train_dataset> is the name you gave each file, xxxxx is the number of record files created (e.g. 00010), and ????? should be left as is; it is the shard-index pattern that matches all the record files.
Once you've replaced <path_to_training_data>, <train_dataset> and xxxxx with the correct values for your dataset, the TF OD API should handle everything else (finding all the records, reading them, etc.).
Note that there is usually a tf_record_input_reader for both the training dataset and the eval dataset (validation/test), and each should have its own corresponding values (path, dataset name, number of files).
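If you want to sanity-check the shards outside the OD API, a minimal sketch (assuming TensorFlow 2.x and the path pattern from the config above):

import glob
import tensorflow as tf

# Find all shards that match the pattern used in the config
shards = sorted(glob.glob("/path/to/train_dataset.record-*-of-00010"))
print(f"Found {len(shards)} shard(s)")

# Parse a couple of serialized tf.Example records to confirm the files are readable
dataset = tf.data.TFRecordDataset(shards)
for raw_record in dataset.take(2):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(sorted(example.features.feature.keys()))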

How to print out to a file using Stanford Classifier

I am using the Stanford Classifier for my project.
The project takes training data to tune the algorithm, then test data to classify text inputs into categories.
The format for both training and test data is tab-delimited text, i.e. predictor -TAB- input text.
The software prints out the output to stdout (command line).
Is there any way to output to a text file?
I searched the Javadoc on the project site and found the csvOutput property,
but I don't know how to use it.
I tried -csvoutput=%1%n%c on the command line,
but it gives me a Java NullPointerException when I try to run it.
If you want to save the output to a file, just add this to the end of your command:
> output_file.txt
