When you prompt GPT3, what happens to the input data? - nlp

For example, let's say I open up the playground and type "Quack". What does the model do with those 5 characters to figure out what letters or words should come next?
(As it happens, GPT3 filled in that prompt with "Quackery", then a tirade against cell therapy. Weird).

It is hard to give a good summary of everything that happens inside GPT-3, but I will try.
First the model encodes the word "Quack" into tokens, and each token gets an embedding representation. These embeddings are then passed through the decoder components of the model, i.e. through a stack of transformer blocks. Once the first decoder block processes the tokens, it sends its resulting vectors up the stack to be processed by the next block. The process is identical in each block, but each block has its own weights in both its self-attention and feed-forward sublayers. In the end you get an array of output token probabilities, and you use that array (or parts of it) to select what the model considers the most likely tokens for the output. Those tokens are decoded back into normal text, and you get your rant against cell therapy back.
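As a rough illustration of that first encoding step, here is a minimal sketch using the open GPT-2 tokenizer from the transformers library. GPT-3 itself is not open, but it uses a very similar byte-pair-encoding tokenizer, so treat the exact split as an approximation:
from transformers import GPT2TokenizerFast

# The prompt is broken into a handful of sub-word ids before any neural network runs.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tokenizer("Quack")["input_ids"]
print(ids)                                   # a short list of token ids, not 5 characters
print(tokenizer.convert_ids_to_tokens(ids))  # the sub-word pieces those ids stand for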
The result varies depending on the engine, temperature, and logit biases that are fed in the request.
I recommend reading the following two links to get more insight into what happens internally, both written by the brilliant Jay Alammar.
https://jalammar.github.io/how-gpt3-works-visualizations-animations/
https://jalammar.github.io/illustrated-gpt2/

Related

The size of Logits of Roberta model is weird

My input size is [8, 22]: a batch of 8 tokenized sentences, each of length 22.
I don't want to use the default classifier.
model = RobertaForSequenceClassification.from_pretrained("xlm-roberta-large")
model.classifier=nn.Identity()
After model(batch)
the size of the result is torch.Size([8, 22, 1024]). I have no idea why. Shouldn't it be [8, 1024]?
The model.classifier object you have replaced used to be an instance of RobertaClassificationHead. If you take a look at its source code [1], that head is hard-coded to index the first item of the second dimension of its input, which is supposed to be the [CLS] token.
By replacing it with an Identity module you miss out on that indexing operation, hence your output shape.
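A minimal sketch of how to recover the [8, 1024] shape yourself, assuming batch holds the input ids and the classifier has been replaced with nn.Identity() as in your snippet:
import torch

with torch.no_grad():
    outputs = model(batch)              # outputs.logits has shape [8, 22, 1024]
    cls_repr = outputs.logits[:, 0, :]  # keep only the <s>/[CLS] position -> [8, 1024]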
Long story short, don't assume functionality you haven't verified when it comes to code you didn't write yourself, Hugging Face in particular (lots of ad-hoc classes and spaghetti interfaces, at least as far as I'm concerned).
[1] source

Extracting hidden representations for each token - PyTorch LSTM

I am currently working on an NLP project involving recurrent neural networks. I implemented an LSTM with PyTorch, following the tutorial here.
For my project, I need to extract the hidden representation for every token of an input text. I thought the easiest way would be to test using a batch size and sequence length of 1, but when I do that the loss gets orders of magnitude larger than in the training phase (during training I used a batch size of 64 and a sequence length of 35).
Is there any other way I can easily access these word-level hidden representations? Thank you.
Yes, that is possible with nn.LSTM as long as it is a single-layer LSTM. If you check the documentation (here), you can see that an LSTM outputs a tensor plus a tuple of tensors. The tuple contains the hidden and cell states for the last sequence step only, while the first output tensor holds the hidden state for every step. What each dimension of that output means depends on how you initialized your network: either the first or the second dimension is the batch dimension, and the rest is the sequence of per-token hidden representations you want.
If you use a packed sequence as input, it is a bit of a different story.
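A minimal sketch with made-up dimensions matching the ones in the question (batch 64, sequence length 35, single layer):
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=1, batch_first=True)
x = torch.randn(64, 35, 100)     # (batch, seq_len, embedding_dim)

output, (h_n, c_n) = lstm(x)
print(output.shape)              # (64, 35, 256): one hidden vector per token
print(h_n.shape, c_n.shape)      # (1, 64, 256): hidden/cell state of the last step only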

How to use forward() method instead of model.generate() for T5 model

For my use case, I need to use model.forward() instead of the model.generate() method,
i.e. instead of the code below
outs = model.model.generate(input_ids=batch['source_ids'],
attention_mask=batch['source_mask'],
output_scores=True,
max_length=model.model_arguments.max_output_seq_length)
preds_cleaned = [model.tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True) for ids in outs]
I need to use
model_outputs = model.model(
input_ids=batch["source_ids"],
attention_mask=batch["source_mask"],
labels=lm_labels.to(device),
decoder_attention_mask=batch['target_mask']
)
logits = model_outputs.logits
softmax_logits = m(logits)
max_logits = torch.max(softmax_logits, dim=2)
Decoding these logits gives unprocessed text that has many issues, like repetition of words at the end, etc.
What do I need to do to get the same result as model.generate() ?
The two methods do something completely different.
Calling the model (which means the forward method) uses the labels for teacher forcing. This means the inputs to the decoder are the labels shifted by one (see documentation). With teacher forcing, the decoder always gets the ground-truth token in the next step, no matter what the prediction was. Teacher forcing is used for model training, where all steps are fully differentiable.
When you call the generate method, the model is used in an autoregressive fashion: any token it generates is fed back as the input in the next step. However, selecting the token is a "hard" decision, and the gradient cannot be propagated through this decision, so the generate method cannot be used for training. The output is coherent because the decoder reacts to what it previously generated.
With teacher forcing, the model might prefer to generate some token and continue consistently with that token. However, it cannot, because it is forced to continue as if it had generated the token that is actually in the labels argument. This is why you observe incoherent output (which was never intended to be an output at all, only to be used for training).
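If you really need to avoid generate(), you have to reproduce the autoregressive loop yourself. A minimal greedy-decoding sketch, assuming a plain T5ForConditionalGeneration and its tokenizer (your wrapper's attribute names may differ):
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids

# T5 starts decoding from its decoder start token (the pad token).
decoder_input_ids = torch.full((input_ids.size(0), 1), model.config.decoder_start_token_id, dtype=torch.long)

with torch.no_grad():
    for _ in range(40):                                                   # max output length
        out = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedy choice
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if (next_token == model.config.eos_token_id).all():
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
This greedy loop corresponds to generate() with num_beams=1 and do_sample=False; beam search or sampling would need extra bookkeeping on top of it.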

How to find number of tokens in gensim model

This is the code for my model using Gensim. I ran it and it returned a tuple. I want to know which value is the number of tokens.
model = gensim.models.Word2Vec(mylist5, size=100, sg=0, window=5, alpha=0.05, min_count=5, workers=12, iter=20, cbow_mean=1, hs=0, negative=15)
model.train(mylist5, total_examples=len(mylist5), epochs=10)
The value returned by my model is:
(167131589, 208757070)
I need to know what this is. Which one is the number of tokens?
Since you already passed in your mylist5 corpus when you instantiated the model, it will have automatically done all the steps to train the model with that data.
(You don't need to, and almost certainly should not, be calling .train() again. Typically .train() should only be called if you didn't provide any corpus at instantiation, and in that case you'd then call both .build_vocab() and .train().)
As noted by other answerers, the numbers reported by .train() are two tallies of the total tokens seen by the training process. (Most users won't actually need this info.)
If you want to know the number of unique tokens for which the model learned word-vectors, len(model.wv) is one way. (Before Gensim 4.0, len(model.wv.vocab) would have worked.)
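For example, a minimal sketch with a toy corpus (note that Gensim 4.x renamed size to vector_size and iter to epochs, which your snippet still uses):
import gensim

mylist5 = [["hello", "world"], ["hello", "gensim", "example"]] * 200   # toy stand-in corpus

model = gensim.models.Word2Vec(mylist5, vector_size=100, window=5, min_count=5, epochs=20)

print(len(model.wv))   # number of unique tokens that received a word-vector (Gensim 4.x)
# Before Gensim 4.0 this would have been: len(model.wv.vocab)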
Gensim Code
The Gensim GitHub source (line 573) shows that model.train returns two values: trained_word_count, raw_word_count.
"raw_word_count" is the total number of raw words encountered during training.
"trained_word_count" is the number of those words actually used for training, after ignoring unknown (out-of-vocabulary) words and trimming sentence length.

How Does the Hashing Trick in Machine Learning Work?

I have a large categorical dataset and a feedforward ANN that I am using for classification purposes. I programmed the machine learning model using Excel VBA (the only programming language I have access to currently).
I have 150 categories in my dataset that I need to process. I have tried using Binary Encoding and One-Hot Encoding, however because of the number of categories I need to process, these vectors are often too large for VBA to handle and I end up with a memory error.
I’d like to give the Hashing trick a go, and see if it works any better. I don't understand how to do this with Excel however.
I have reviewed the following links to try and understand it:
https://learn.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit
I still don't completely understand it. Here is what I have done so far. I used the following code example to create a hash sequence for my categorical data:
Generate short hash string based using VBA
Using the code above, I have been able to produce collision-free numerical hash sequences. However, what do I do now? Does the hash sequence need to be converted to a binary vector now? This is where I get lost.
I provided a small example of my data thus far. Would somebody be able to show me step by step how the hashing trick works (preferably for Excel)?
CATEGORY    HASH SEQUENCE
STEEL       37152
PLASTIC     31081
ALUMINUM    2310
BRONZE      9364
So what the hashing trick does is prevent "fake" words from taking up extra memory. In a regular bag-of-words (BOW) model, you have one dimension per word in the vocabulary. This means that a misspelled word and the regular word can each take up a separate dimension, if you have the misspelled word in the model at all. If the misspelled word is not in the model, then (depending on your model) you might ignore it completely. This adds up over time. By "misspelled word" I'm just giving an example of any word not in the vocabulary you used to create the vectors that train your model. It means any model trained this way cannot adapt to new vocabulary without being trained all over again.
The hashing method allows you to incorporate out-of-vocabulary words, with some potential accuracy loss. It also ensures that you can bound your memory. Essentially, the hashing method starts by defining a hash function that takes some input (typically a word) and maps it to an output value within a predetermined range. For example, you might choose your hash function to output values between 0 and 2^16. Thus you know your output vectors will always be capped at size 2^16 (an arbitrary value, really), so you can prevent memory issues. Further, hash functions have "collisions": hash(a) might equal hash(b), rarely with an appropriate output range, but it is possible. This means you lose some accuracy. But since the hash function can theoretically take any input string, it can map out-of-vocabulary words to a new vector of the same size as the original vectors used to train the model. Since your new data vector is the same size as those used to train the model previously, you can use it to refine your model instead of being forced to train a new one.
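You asked about Excel, but the idea is easiest to show in a few lines of Python; the steps (hash the category, take the remainder, increment that position) map one-to-one onto VBA. A minimal sketch, with a bucket count chosen arbitrarily for illustration:
import hashlib

N_BUCKETS = 256   # fixed vector length; pick whatever bound your memory allows

def hashed_vector(categories, n_buckets=N_BUCKETS):
    # Turn a list of category strings into one fixed-length count vector.
    vec = [0] * n_buckets
    for cat in categories:
        h = int(hashlib.md5(cat.encode("utf-8")).hexdigest(), 16)   # any stable hash works
        vec[h % n_buckets] += 1        # colliding categories simply share a bucket
    return vec

print(hashed_vector(["STEEL", "PLASTIC", "ALUMINUM", "BRONZE"]))
The resulting vector plays the same role as your one-hot vector, except its length stays fixed at N_BUCKETS no matter how many categories you ever encounter.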
