adding and accessing auxiliary tf.Dataset attributes with Keras - keras

I use a tf.py_func call to parse data (features, labels and sample_weights) from file to a tf.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
lambda filename, label, sample_weight: tuple(tf.py_func(
self._my_parse_function, [filename, label, sample_weights], [tf.float32, label.dtype, tf.float32])))
The data is variable-length 1-D sequences, so I also pad the sequences to a fixed length in my_parse_function.
I use tensorflow.python.keras.models.Sequential.fit(...) to train the data (which now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to predict outputs.
Once I have predictions I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. Also, I'd like to know for sure which file the data came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-workers are involved (I hope to try the multi- scenarios). Even if order was 'guaranteed' this is a decent sanity check.
This information, filename (i.e, a string) and sequence length (i.e, an integer), is not currently conveniently accessible, so I'd like to add these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.
What is the best approach to do this?
Thanks

As a workaround, I store this auxiliary information in a 'global' dictionary in my_parse_fn, so it stores (and re-stores) on every iteration through the tf.Dataset. This is ok for now since there are only about 1000 examples in the training set, so storing 1000 strings and integers is not a problem. But if this auxiliary information were larger or the training set were larger, this approach would not be very scalable. In my case, the input data for each training example is significantly large, about 50MB in size, which is why reading a tf.Dataset from file (i.e., on every epoch) is important.
I still think that it would be helpful to be able to more conveniently extend a tf.Dataset with this information. Also I noticed that when I adding a field to a tf.Dataset like dataset.tag to identify, say, dataset.tag = 'training', dataset.tag ='validation' or dataset.tag = 'test' sets, the field did not survive the iterations of training.
So again in this case I'm wondering how a tf.Dataset can be extended.
On the other question, it looks like the order of tf.Dataset elements is respected through iterations, so predictions, say, from tensorflow.python.keras.models.Sequential.predict(...) are ordered as the file ids were presented to my_parse_fn (at least batching respects this ordering, but I still don't know about whether a multi-GPU scenario would as well).
Thanks for any insights.

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to train multiple texts supplied by myself iteratively. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading, then more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. (inding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.

Is there a built in sklearn cross validation function for only validating against higher indices?

I'd like to use GridSearchCV, but with the condition that the lowest index of the data in the validation set be greater than the largest in the training set. The reason being that the data is in time, and future data gives unfair insight that would inflate the score. There's some discussion on this:
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross- validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
but it's not clear to me whether any of the splitting methods listed can accomplish what i'm looking for. It seems to be the case that I can define an itterable of indices and pass that into cv, but in that case it's not clear how many I should define (does it always use all of them? do different tests get different indices?)

How Does the Hashing Trick in Machine Learning Work?

I have a large categorical dataset and a feedforward ANN that I am using for classification purposes. I programmed the machine learning model using Excel VBA (the only programming language I have access too currently).
I have 150 categories in my dataset that I need to process. I have tried using Binary Encoding and One-Hot Encoding, however because of the number of categories I need to process, these vectors are often too large for VBA to handle and I end up with a memory error.
I’d like to give the Hashing trick a go, and see if it works any better. I don't understand how to do this with Excel however.
I have reviewed the following links to try and understand it:
https://learn.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit
I still don’t completely understand it. Here is what I have done so far. I used the following code example to create a hash sequence for my categorical date:
Generate short hash string based using VBA
Using the code above, I have been able to produce collision free numerical hash sequences. However, what do I do now? Does the hash sequence need to be converted to a binary vector now? This is where I get lost.
I provided a small example of my data thus far. Would somebody be able to show me step by step how the hashing trick works (preferably for Excel)?
'CATEGORY 'HASH SEQUENCE
STEEL 37152
PLASTIC 31081
ALUMINUM 2310
BRONZE 9364
So what the hashing trick does is it prevents ~fake words from taking up extra memory. In a regular Bag-Of-Words (BOW) model, you have 1 dimension per word in the vocabulary. This means that a misspelled word and the regular word can both take up separate dimensions - if you have the misspelled word in the model at all. If the misspelled word is not in the model, (depending on your model) you might ignore it completly. This adds up over time. And by misspelled word, I'm just using an example of any word not in the vocabulary you use to create the vectors to train your model with. Meaning any model trained this way cannot adapt to new vocab without being trained all over again.
The hashing method allows you to incorporate out-of-vocab words, with some potential accuracy loss. It also ensures that you can bound your memory. Essentially the hashing method starts by defining a hash function that takes some input (typically a word) and mapping it to an output value Within an Already Determined Range. You would choose your hash function to output somewhere between say 0-2^16. Thus you know your output vectors will always be capped at size 2^16 (arbitrary value really), so you can prevent memory issues. Further, hash functions have "collisions" - what this means is that hash(a) might equal hash(b) - very rarely with an appropriate output range, but its possible. This means that you lose some accuracy - but since the hash function is theoretically able to take any input string, it can work with out of vocabulary words to get a new vector Of the Same Size as the original vectors used to train the model. Since your new data vector is the Same Size as those used to train the model previously, you can use it to refine your model instead of being forced to train a new model.

How to get batch size for `tf.PaddingFIFOQueue.dequeu_up_to` in case of fewer elements?

I'm referring to the example codes in
http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/
and
https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/
for standard approaches in feeding the data to the language models (rnn) using TensorFlow.
So before feeding the input, I'm making sure they are padded to the max input length in the current batch and then randomly shuffled within the current batch. so far good.
However, I have difficulty in specifying the initial state for tf.nn.dynamic_rnn whose shape depends on the current batch_size.
If the batch size is fixed there is should not be any problem.
However, if I'm using tf.PaddingFIFOQueue.dequeue_up_to(batch_size), it may be possible to return less than the batch_size if not many elements in the queue. (which is possible if dequeuing the last set of elements).
In this case how do we specify the initial state with the exact batch size returned.
You can use the None dimension in your Tensors to specify TensorFlow that the batch dimension can be different from one run to another.
You might want to read this faq on tensor shapes to get a better sense, and in particular this section (formatting is mine):
How do I build a graph that works with variable batch sizes?
It is often useful to build a graph that works with variable batch sizes, for example so that the same code can be used for (mini-)batch training, and single-instance inference. The resulting graph can be saved as a protocol buffer and imported into another program.
When building a variable-size graph, the most important thing to remember is not to encode the batch size as a Python constant, but instead to use a symbolic Tensor to represent it. The following tips may be useful:
Use batch_size = tf.shape(input)[0] to extract the batch dimension from a Tensor called input, and store it in a Tensor called batch_size.
Use tf.reduce_mean instead of tf.reduce_sum(...) / batch_size.
If you use placeholders for feeding input, you can specify a variable batch dimension by creating the placeholder with tf.placeholder(..., shape=[None, ...]). The None element of the shape corresponds to a variable-sized dimension.

Vectorizer (Hashing, Count, etc.): reusing after fit_transform

Suppose I have two features which are both text based; for example, say I'm trying to predict sports games, and I've got:
1) Excerpt from sports commentary (a body of text)
2) Excerpt from Internet fan predictions (also a body of text).
If I were to use a text vectorizer (say HashingVectorizer) on feature 1), with fit_transform(), would it be bad to use it again (fit_transform()) on feature 2, or should I create a new vectorizer for that? I'm just wondering whether reusing fit_transform() on multiple features with the same vectorizer might perhaps have bad side effects.
I would say it depends on whether or not you want reproducibility of the text-to-vector conversion step. For example, if you want to use the same classifier (or whatever) you made from the first data set, you need to reuse the vectorizer. If you fit a new one on a different data set, it will build a different vocabulary, ie pull out different tokens, and make the vectors differently. That might be what you want with a very different data set (if you're going to retrain). It could be that the second data set contains new words that are critical for predictions. Those would be missed if you reused the vectorizer.
By the way, the vectorizers can be pickled if you want to save to disk. For an example, see: how to pickle customized vectorizer?.

Resources