I know the context size supported by GPT-2 is 1024 tokens, but I assume they used some technique to train on and generate text longer than that in their results. I have also seen many GPT-2-based repos training on text longer than 1024 tokens. However, when I tried to generate text longer than 1024 tokens using run_generation.py, it threw a runtime error: The size of tensor a (1025) must match the size of tensor b (1024) at non-singleton dimension 3. I have the following questions:
Shouldn't it be possible to generate longer text since a sliding window is used?
Can you please explain what's necessary to generate longer text? What changes will I have to make to the run_generation.py code?
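For context, here is the kind of sliding-window approach I have in mind (a rough sketch only, not taken from run_generation.py; the prompt, loop length, and greedy decoding are placeholders): once the context reaches the model's 1024-token limit, only the most recent tokens are fed to the model before each new token is sampled.

```python
# Sliding-window generation sketch with Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

max_context = model.config.n_positions          # 1024 for GPT-2
generated = tokenizer.encode("Some prompt text", return_tensors="pt")

with torch.no_grad():
    for _ in range(1500):                        # generate past the 1024 limit
        # Keep only the most recent tokens so the input never exceeds the window.
        context = generated[:, -(max_context - 1):]
        logits = model(context)[0]               # (batch, seq_len, vocab)
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)

print(tokenizer.decode(generated[0]))
```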
Related
I wanted to generate long text with OpenAI, but the problem arises with the stop sequence: it only generates a few sentences for a topic. Is there any way I can bypass the stop sequence for OpenAI?
You should increase the Maximum length parameter (max_tokens) for this; the higher the value, the longer the generated text.
max_tokens: integer, Optional, Defaults to 16.
The maximum number of tokens to generate in the completion.
The token count of your prompt plus max_tokens cannot exceed the model's context length. Most models have a context length of 2048 tokens (except for the newest models, which support 4096).
See the official docs for more information.
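For example (a sketch only; the model name, prompt, and API key handling are placeholders, and this uses the legacy Completions endpoint of the openai Python library):

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-003",            # placeholder; any completions model works
    prompt="Write a detailed article about renewable energy.",
    max_tokens=1500,                     # default is 16; raise it for longer output
    temperature=0.7,
)

print(response["choices"][0]["text"])
```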
Does anyone have an idea what the input text size limit is for the predict(passage, question) method of the AllenNLP Predictors?
I have tried passages of 30-40 sentences, which works fine, but it fails when I pass a significantly larger amount of text, around 5K sentences.
Which model are you using? Some models truncate the input; others try to handle arbitrary-length input using a sliding-window approach. With the latter, the limit will depend on the memory available on your system.
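If your model is one that truncates, a simple workaround (a sketch only, not part of the original answer; the model path, window sizes, and output field names are assumptions about a BiDAF-style reading-comprehension predictor) is to window the passage yourself and keep the best-scoring span:

```python
from allennlp.predictors.predictor import Predictor

# Path/URL of whichever reading-comprehension model archive you are using (placeholder).
predictor = Predictor.from_path("path/to/your/rc-model.tar.gz")

def predict_long(passage, question, window=400, stride=300):
    """Run the predictor over overlapping word windows and keep the best-scoring span."""
    words = passage.split()
    best_score, best_answer = float("-inf"), None
    for start in range(0, len(words), stride):
        chunk = " ".join(words[start:start + window])
        result = predictor.predict(passage=chunk, question=question)
        score = max(result.get("span_start_probs", [0.0]))  # assumed output field
        if score > best_score:
            best_score, best_answer = score, result["best_span_str"]
        if start + window >= len(words):  # last window already covers the end
            break
    return best_answer
```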
I am using Python 3, and when I add a random crop transform of size 224 it gives a mismatch error.
Here is my code:
What did I do wrong?
Your code makes variations on ResNet: you changed the number of channels and the number of bottlenecks at each "level", and you removed a "level" entirely. As a result, the feature map at the end of layer3 does not have the dimensions you expect: its spatial size is larger than what the nn.AvgPool2d(8) anticipates. The error message actually tells you that the output of layer3 has shape 64x56x56; after average pooling with kernel and stride 8 you get a 64x7x7 = 3136-dimensional feature vector, instead of the 64 you are expecting.
What can you do?
As opposed to "standard" ResNet, you removed the stride from conv1 and you do not have a max pool after conv1. Moreover, you removed layer4, which also has a stride. Therefore, you can add pooling to your net to reduce the spatial dimensions of layer3.
Alternatively, you can replace nn.AvgPool2d(8) with nn.AdaptiveAvgPool2d([1, 1]), an average pool that outputs a single value per channel regardless of the spatial dimensions of the input feature map.
Example:
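A minimal sketch (not from the original answer), using the 64x56x56 shape from the error message, showing that adaptive pooling always yields a 64-dimensional feature vector:

```python
import torch
import torch.nn as nn

# AdaptiveAvgPool2d squeezes any spatial size down to the requested output size,
# so the flattened feature vector is always 64-dimensional here.
pool = nn.AdaptiveAvgPool2d((1, 1))

x = torch.rand(1, 64, 56, 56)     # e.g. the 64x56x56 output of layer3
y = pool(x)                       # shape: (1, 64, 1, 1)
features = y.view(y.size(0), -1)  # shape: (1, 64)
print(features.shape)             # torch.Size([1, 64])
```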
Given n images numbered 1 to n, where n is unknown, I can calculate a property of every image that is a scalar quantity. Now I have to represent this property across all images in a fixed-size vector (of size, say, 5 or 10).
One naive approach could be this vector: [avg, max, min, std_deviation].
I also want to include the effect of the relative positions of those images.
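For concreteness, here is that naive vector as a sketch (the example values are made up; values stands for the scalar property computed for images 1 to n):

```python
import numpy as np

values = np.array([0.3, 1.7, 0.9, 2.1, 0.4])  # one scalar per image; n can vary

# Naive fixed-size summary: [avg, max, min, std_deviation]
summary = np.array([values.mean(), values.max(), values.min(), values.std()])
print(summary)  # always length 4, regardless of n
```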
What you are looking for is called feature extraction.
There are many techniques for this when working with images.
For your purpose, try:
PCA
Auto-encoders
Convolutional Auto-encoders
You could also look into conventional (older) methods like SIFT, HOG, or edge detection, but they all need an extra step to reduce their output to a smaller, fixed size.
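For example, a minimal PCA sketch (not from the original answer; the image array and component count are placeholders) that maps each image to a fixed-size feature vector regardless of its pixel count:

```python
import numpy as np
from sklearn.decomposition import PCA

images = np.random.rand(100, 64, 64)    # stand-in for n grayscale images
flat = images.reshape(len(images), -1)  # shape: (n, 4096)

pca = PCA(n_components=10)              # 10 = desired fixed vector size
features = pca.fit_transform(flat)      # shape: (n, 10)
print(features.shape)
```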
I understand (and please correct me if my understanding is wrong) that the primary purpose of a CNN is to reduce the number of parameters compared to what you would need with a fully connected NN, and that a CNN achieves this by extracting "features" of images.
A CNN can do this because, in a natural image, there are small features such as lines and elementary curves that may occur in an "invariant" fashion and constitute the image much like elementary building blocks.
My question is this: suppose we create several feature maps, say 5 of them, by sliding a window of size, say, 5x5 over an image of, say, 100x100 pixels. Initially, these feature maps correspond to weight matrices initialized with random numbers, which must be progressively adjusted with gradient descent, right? But then, if we obtain these feature maps using windows of exactly the same size, sliding in exactly the same way (same starting point and same stride), over exactly the same image, how can the maps learn different features of the image? Won't they all come out the same, say, a line or a curve?
Is it due to the different initial values of the weight matrices? (I.e., are some weight matrices more receptive to learning a particular feature than others?)
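For reference, a quick sketch of what I mean (PyTorch; not part of the original question): the 5 filters of a single convolutional layer start with independent random weights, which is what would let them diverge during training.

```python
import torch
import torch.nn as nn

# One conv layer producing 5 feature maps from a single-channel image.
conv = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=5)

print(conv.weight.shape)                               # torch.Size([5, 1, 5, 5])
print(torch.allclose(conv.weight[0], conv.weight[1]))  # False: filters start different
```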
Thanks!! I wrote my 4 questions/opinions and indexed them for ease of addressing them separately!