I have LMDB and HDF5 data for image classification, and I want to use Theano to train some convnets. Does Theano support LMDB or HDF5 data?
Theano does not "natively" support any file format; input data must be in the form of lists or NumPy arrays. You just need to load the data with another library and then use it with Theano.
h5py is a popular library for reading/writing HDF5 files from Python. LMDB also has Python bindings, so reading that kind of data should not be an issue either.
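For example, a minimal sketch of that loading step (the file name data.h5 and the dataset names images/labels are placeholders) that reads HDF5 data with h5py into NumPy arrays and hands them to Theano as shared variables:
import h5py
import numpy as np
import theano

# Read the HDF5 datasets into plain NumPy arrays (file and dataset names are placeholders)
with h5py.File('data.h5', 'r') as f:
    X = np.asarray(f['images'], dtype=theano.config.floatX)
    y = np.asarray(f['labels'], dtype='int32')

# Wrap them as shared variables so Theano functions can use them directly
X_shared = theano.shared(X, borrow=True)
y_shared = theano.shared(y, borrow=True)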
If we want to use the weights from a pretrained BioBERT model, we can execute the following terminal command (wrapped here in os.system) after downloading all the required BioBERT files.
import os

os.system('python3 extract_features.py \
--input_file=trial.txt \
--vocab_file=vocab.txt \
--bert_config_file=bert_config.json \
--init_checkpoint=biobert_model.ckpt \
--output_file=output.json')
The above command reads a single file containing the text and writes the extracted vectors to another file. The problem is that this does not scale easily to very large datasets containing thousands of sentences/paragraphs.
Is there a way to extract these features on the fly (using an embedding layer), as can be done for word2vec vectors in PyTorch or TF 1.3?
Note: BioBERT checkpoints do not exist for TF2.0, so I guess there is no way it could be done with TF2.0 unless someone generates TF2.0-compatible checkpoint files.
I will be grateful for any hint or help.
You can get the contextual embeddings on the fly, but the total time spent on getting the embeddings will always be the same. There are two ways to do it: 1. import BioBERT into the Transformers package and use it in PyTorch (which I would do), or 2. use the original codebase.
1. Import BioBERT into the Transformers package
The most convenient way of using pre-trained BERT models is the Transformers package. It was primarily written for PyTorch but also works with TensorFlow. It does not ship with BioBERT out of the box, so you need to convert the model from the TensorFlow format yourself. There is a convert_tf_checkpoint_to_pytorch.py script that does that; a sketch of the invocation is shown below. People had some issues with this script and BioBERT (which seem to be resolved).
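Roughly, the conversion call looks like this (the flag names follow the older pytorch-pretrained-bert/transformers versions of the script and may differ in yours; the file paths are placeholders):
python convert_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path=biobert_model.ckpt \
--bert_config_file=bert_config.json \
--pytorch_dump_path=directory_with_converted_model/pytorch_model.bin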
After you convert the model, you can load it like this.
import torch
from transformers import BertTokenizer, BertModel

# Load tokenizer and model from the directory with the converted checkpoint
tokenizer = BertTokenizer.from_pretrained('directory_with_converted_model')
model = BertModel.from_pretrained('directory_with_converted_model')

# Call the model in a standard PyTorch way
input_ids = torch.tensor([tokenizer.encode("Cool biomedical tetra-hydro-sentence.", add_special_tokens=True)])
embeddings = model(input_ids)[0]  # last hidden states, shape (1, sequence_length, hidden_size)
2. Use the BioBERT codebase directly
You can get the embeddings on the fly basically by reusing the code in extract_features.py. On lines 346-382, they initialize the model. You get the embeddings by calling estimator.predict(...).
For that, you need to format the input: first turn the string into examples (using the code on lines 326-337) and then call convert_examples_to_features on it, as sketched below.
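A rough sketch of that flow, assuming the estimator from lines 346-382 and the tokenizer are already built; the helper names (InputExample, convert_examples_to_features, input_fn_builder) are the ones defined in extract_features.py, so verify them against your copy of the script:
from extract_features import InputExample, convert_examples_to_features, input_fn_builder

def embed_on_the_fly(sentences, estimator, tokenizer, seq_length=128):
    # Wrap the raw strings as InputExamples, the same way extract_features.py does internally
    examples = [InputExample(unique_id=i, text_a=s, text_b=None) for i, s in enumerate(sentences)]
    features = convert_examples_to_features(examples, seq_length, tokenizer)
    input_fn = input_fn_builder(features=features, seq_length=seq_length)
    # Each prediction contains the requested layer activations for one sentence
    return list(estimator.predict(input_fn, yield_single_examples=True))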
LightGBM and XGBoost models can be dumped to plain text files containing human-readable model structure. In the end, they are just tree ensembles.
Is there any library to load these dumped models to the scikit-learn framework, e.g. construct sklearn ensembles with same splits and values?
That would be quite convenient, as there are some nice libraries built on the sklearn API, e.g. treeinterpreter.
For XGBoost you can use the xgbfir library, which parses the XGBoost model and reports feature interactions and rankings. Install it with:
pip install xgbfir
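A minimal usage sketch, assuming an already-trained XGBoost model bst and a list feature_names (the saveXgbFI call follows xgbfir's documented interface, but check it against the version you install):
import xgbfir

# Writes an Excel report ranking features and feature interactions by depth
xgbfir.saveXgbFI(bst, feature_names=feature_names, OutputXlsxFile='feature_interactions.xlsx')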
For LightGBM, I'm not aware of good options. Microsoft's LightGBM library allows PMML export, so perhaps you could export the model and then use a PMML parser.
I have trained a Caffe model in ".caffemodel.h5" format. I want to parse it to extract the parameters and feed them to a Lasagne model. How can I do it?
You will first have to replicate the architecture of your caffemodel in terms of Lasagne layers, either by hand or with a script that does this (AFAIK a script that takes a Caffe protobuf and translates it to Lasagne exists, e.g. in sklearn-theano).
By hand, you need to open the HDF5 file using e.g. PyTables, and then replicate the architecture in Lasagne. A rough sketch of the loading step is below.
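This sketch assumes h5py (instead of PyTables) and a Lasagne output layer net that already mirrors the Caffe architecture; the 'data/<layer>/<0|1>' group layout and the layer names are assumptions, so inspect the file's structure first:
import h5py
import lasagne

with h5py.File('model.caffemodel.h5', 'r') as f:
    f.visit(print)  # list all groups/datasets to see how the layers are actually stored

    # Collect weight/bias arrays in the order of the Lasagne layers (names are placeholders)
    params = []
    for layer_name in ['conv1', 'conv2', 'fc1']:
        params.append(f['data'][layer_name]['0'][...])  # weights
        params.append(f['data'][layer_name]['1'][...])  # biases

lasagne.layers.set_all_param_values(net, params)
Depending on the layer types, you may also need to reshape or flip the weight arrays to match Lasagne's parameter conventions.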
Most algorithms that use matrix operations in Spark have to use either Vectors or store their data in a different way. Is there support for building matrices directly in Spark?
Apache recently released Spark 1.0, which has support for creating matrices in Spark, a really appealing idea. Right now it is experimental and supports only a limited set of operations on the matrices you create, but this is sure to grow in future releases. The idea of matrix operations being performed at the speed of Spark is amazing.
The way I use matrices in Spark is through Python with NumPy/SciPy: pull the data into matrices from a CSV file and use them as needed. I treated the matrices the same as I would in plain SciPy; it is how you parallelize the data that makes it slightly different.
Something like this:
from pyspark.mllib.regression import LabeledPoint

data = []
for i in range(na + 2):
    data.append(LabeledPoint(b[i], A[i, :]))
model = WhatYouDo.train(sc.parallelize(data), iterations=40, step=0.01, initialWeights=wa)
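For context, a minimal way to get A and b out of a CSV with NumPy (the file name and the assumption that the last column holds the labels are placeholders):
import numpy as np

raw = np.loadtxt('data.csv', delimiter=',')
A, b = raw[:, :-1], raw[:, -1]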
The pain was getting NumPy/SciPy into Spark. I found the best way to make sure all the other libraries and files needed were included was to use:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
I have trained an SVM (SVC) using scikit-learn over half a terabyte of data. The model is working fine and I need to port it to C, but I don't want to re-train the SVM from scratch because it takes way too long. Is there a way to easily export the model generated by scikit-learn and import it into LibSVM? Internally scikit-learn uses LibSVM, so theoretically it should be possible, but I haven't been able to find anything in the documentation. Any suggestion?
Is there a way to easily export the model generated by scikit-learn and import it into LibSVM?
No. The scikit-learn version of LIBSVM has been hacked up severely to fit into the Python environment, and the model is stored as NumPy/SciPy data structures.
Your best shot is to study the SVM decision function and reimplement it in C. The support vectors can be obtained from the SVC object as NumPy arrays, which are easily translated to C arrays.
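As an illustration, here is a sketch in Python (as a reference for the C port) of how those attributes combine into the decision function of a fitted binary SVC with an RBF kernel; the attribute names are scikit-learn's, the gamma value is the kernel parameter you trained with, and the function itself is a hand-rolled re-implementation to translate to C:
import numpy as np

def rbf_decision_function(svc, X, gamma):
    # Binary RBF-kernel SVC: f(x) = sum_i alpha_i * exp(-gamma * ||sv_i - x||^2) + b
    sv = svc.support_vectors_      # support vectors, shape (n_sv, n_features)
    alpha = svc.dual_coef_[0]      # signed dual coefficients y_i * alpha_i, shape (n_sv,)
    b = svc.intercept_[0]
    sq_dists = ((X[:, None, :] - sv[None, :, :]) ** 2).sum(axis=-1)
    return (alpha * np.exp(-gamma * sq_dists)).sum(axis=1) + b
The output should match svc.decision_function(X), which is a convenient sanity check before porting the same arithmetic to C.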