I am getting the following error in my Dialogflow agent:

com.google.dialogflow.designtime.exceptions.DesigntimeException: generic::FAILED_PRECONDITION: Errors in 'Intent' intent:
The number of training phrases exceeds 10.
I looked at the documentation and found that the limit is 2000 training phrases. This problem started just today; everything had been working fine for the past year.
Does anyone know what the reason could be?
The phrasing of this error is awkward. I think it is saying that you should have at least 10 training phrases (which is the suggested minimum), though I haven't seen this enforced as a requirement before.
In the documentation for the GPT-3 API, it says: "One limitation to keep in mind is that, for most models, a single API request can only process up to 2,048 tokens (roughly 1,500 words) between your prompt and completion."
In the documentation for fine-tuning a model, it says: "The more training samples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality."
My question is: does the 1,500-word limit also apply to a fine-tuned model? And does "doubling of the dataset size" refer to the number of training examples rather than the size of each training example?
As far as I understand...
GPT-3 models have token limits because you can only provide 1 prompt and you only get 1 completion. Therefore, as stated in the official OpenAI article:
Depending on the model used, requests can use up to 4097 tokens shared between prompt and completion. If your prompt is 4000 tokens, your completion can be 97 tokens at most.
Whereas fine-tuning as such does not have a token limit (i.e., you can have a million training examples, a million prompt-completion pairs), as stated on the official OpenAI website:
The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality.
However, each individual fine-tuning prompt-completion pair does have a token limit: no single pair should exceed the model's token limit. A rough sketch of checking this is below.
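For illustration, here is a rough sketch of how you could check that each pair stays under the limit. It assumes the OpenAI prompt/completion JSONL training format and the tiktoken tokenizer; the file name and the 2,048-token limit are placeholders for your own values.

    import json
    import tiktoken  # OpenAI's tokenizer library

    # r50k_base is the encoding used by the original GPT-3 models.
    enc = tiktoken.get_encoding("r50k_base")
    TOKEN_LIMIT = 2048  # placeholder; use your model's documented limit

    with open("training_data.jsonl") as f:
        for i, line in enumerate(f):
            pair = json.loads(line)
            n = len(enc.encode(pair["prompt"])) + len(enc.encode(pair["completion"]))
            if n > TOKEN_LIMIT:
                print(f"example {i}: {n} tokens, over the {TOKEN_LIMIT}-token limit")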
I just want a little guidance. I have 3 IOB files: dev, test, and train.
Dev has 1 million lines.
Test has 4 million lines.
Train has 30 million lines.
For now I am only converting the dev file, because I wasn't sure whether it contains any errors (the IOB format itself is correct). The conversion has been running for over 3 hours now. Any idea whether this file will work, or should I use a different approach?
I am fine-tuning a BERT model using spaCy in Google Colab, with the runtime hardware set to GPU. For reference, I have followed this article:
https://towardsdatascience.com/how-to-fine-tune-bert-transformer-with-spacy-3-6a90bfe57647
I have followed the exact steps of the article.
I am not familiar with the NLP domain, nor do I have deep knowledge of pipelines. Can someone please help with this? It's really important.
Below I have attached an image showing the elapsed time and the command executed for the conversion.
[Image showing time elapsed and the command executed]
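For what it's worth, one way to make a conversion like this more tractable (a sketch, under the assumption that sentences in the IOB files are separated by blank lines; the file name and chunk size are placeholders) is to split the big file into chunks on sentence boundaries and convert the chunks one at a time, so errors and progress become visible:

    # Split a large IOB file into chunks on sentence boundaries (blank lines),
    # so each chunk can be converted and validated separately.
    def write_chunk(lines, path, part):
        with open(f"{path}.part{part}", "w", encoding="utf-8") as out:
            out.writelines(lines)

    def split_iob(path, sentences_per_chunk=100_000):
        part, count, chunk = 0, 0, []
        with open(path, encoding="utf-8") as f:
            for line in f:
                chunk.append(line)
                if not line.strip():  # a blank line ends a sentence
                    count += 1
                    if count == sentences_per_chunk:
                        write_chunk(chunk, path, part)
                        part, count, chunk = part + 1, 0, []
        if chunk:  # write any trailing partial chunk
            write_chunk(chunk, path, part)

    split_iob("dev.iob")  # then run spaCy's conversion on each .part file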
I am trying to use the TensorFlow object detection API to recognize a specific object (guitars) in pictures and videos.
As for the data, I downloaded the images from the OpenImage dataset, and derived the .tfrecord files. I am testing with different numbers, but for now let's say I have 200 images in the training set and 100 in the evaluation one.
I'm training the model using "ssd_mobilenet_v1_coco" as a starting point, together with the "model_main.py" script, so that I can get both training and validation results.
When I visualize the training progress in TensorBoard, I get the following plots for the training loss and the validation loss, respectively:
[TensorBoard plot: training loss]
[TensorBoard plot: validation loss]
I am generally new to computer vision and trying to learn, so I was trying to figure out the meaning of these plots.
The training loss goes as expected, decreasing over time.
In my (probably simplistic) view, I was expecting the validation loss to start at high values, decrease as training goes on, and then start increasing again if the training goes on for too long and the model starts overfitting.
But in my case, I don't see this behavior for the validation curve, which seems to be trending upwards basically all the time (excluding fluctuations).
Have I been training the model for too little time to see the behavior I'm expecting? Are my expectations wrong in the first place? Am I misinterpreting the curves?
Ok, I fixed it by decreasing the initial_learning_rate from 0.004 to 0.0001.
It was the obvious solution, considering the wild oscillations of the validation loss, but at first I thought it wouldn't work, since there already seemed to be a learning rate scheduler in the config file; the relevant block is sketched below.
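For context, the scheduler lives inside train_config in the pipeline config. Trimmed down (structure from the sample ssd_mobilenet_v1 config; the only value I changed is the learning rate), it looks roughly like this:

    train_config: {
      optimizer {
        rms_prop_optimizer: {
          learning_rate: {
            exponential_decay_learning_rate {
              initial_learning_rate: 0.0001  # decreased from 0.004
              decay_steps: 800720
              decay_factor: 0.95
            }
          }
        }
      }
    }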
However, immediately below (in the config file) there's a num_steps option, and it's stated that
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
Honestly, I don't remember if I commented out the num_steps option... If I didn't, it seems my learning rate was kept at the initial value of 0.004, which turned out to be too high.
If I did comment it out (so that the learning rate scheduler was active), I guess the schedule still started from too high a value.
Anyway, it's working much better now, I hope this can be useful if anyone is experiencing the same problem.
I have a dataset of more than 5 million records, which contains many noisy features (words), so I thought of doing spell correction and abbreviation handling.
When I googled for spell correction packages in Python, I found packages like autocorrect, textblob, hunspell, etc., as well as Peter Norvig's method.
Below is a sample of my dataset:
Id description
1 switvch for air conditioner..............
2 control tfrmr...........
3 coling pad.................
4 DRLG machine
5 hair smothing kit...............
I tried the spell correction functions from the above packages using this code:
dataset['description'] = dataset['description'].apply(lambda x: list(set([spellcorrection_function(item) for item in x])))
For the entire dataset, it took more than 12 hours to complete spell correction, and it also introduced noise for roughly 20% of the total words, which are important words.
For example, in the last row, "smothing" was corrected to "something", but it should be "smoothing" ("something" makes no sense in this context).
Approaching Further
When I observed the dataset, the spelling of a word was not always wrong; there were also correctly spelled instances elsewhere in the dataset. So I tokenized the entire dataset, split the tokens into correct words and wrong words using a dictionary, applied the Jaro-Winkler similarity method between all pairs of words, and selected the pairs with a similarity value of 0.93 or more (a sketch of this step follows the table below):
Wrong word    Correct word    Similarity score
switvch switch 0.98
coling cooling 0.98
smothing smoothing 0.99
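For reference, this pairing step can be sketched with the jellyfish library (my assumption; any Jaro-Winkler implementation would do, and the word lists here are placeholders for the tokens split out of the dataset):

    import jellyfish  # pip install jellyfish

    # Placeholder word lists; in practice these come from the dictionary split above.
    wrong_words = ["switvch", "coling", "smothing"]
    correct_words = ["switch", "cooling", "smoothing"]

    similar_word_dictionary = {}
    for wrong in wrong_words:
        # In older jellyfish versions this function is jellyfish.jaro_winkler.
        best = max(correct_words,
                   key=lambda c: jellyfish.jaro_winkler_similarity(wrong, c))
        if jellyfish.jaro_winkler_similarity(wrong, best) >= 0.93:
            similar_word_dictionary[wrong] = best

    print(similar_word_dictionary)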
I got more than 50k pairs of similar words, which I put in a dictionary with the wrong word as the key and the correct word as the value.
I also kept words with their abbreviations (~3k pairs) in a dictionary:
key value
tfrmr transformer
drlg drilling
I search and replace the key-value pairs using this code:
dataset['description'] = dataset['description'].replace(similar_word_dictionary, regex=True)
dataset['description'] = dataset['description'].replace(abbreviation_dictionary, regex=True)
This code took more than a day to complete for only 10% of my entire dataset, so it is not an efficient approach.
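For reference, one likely speed-up here (a sketch, assuming descriptions split cleanly on whitespace and both dictionaries map whole words to whole words) is a single dictionary lookup per token instead of one regex pass over the column per dictionary entry:

    # Merge both lookup tables, then replace token by token: one dict lookup
    # per word instead of one regex scan per dictionary entry.
    corrections = {**similar_word_dictionary, **abbreviation_dictionary}

    def correct_description(text):
        return " ".join(corrections.get(word, word) for word in text.split())

    dataset['description'] = dataset['description'].map(correct_description)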
Along with the Python packages, I also found Deep Spelling, which looks like a very efficient way of doing spelling correction. It gives a very clear explanation of using an RNN-LSTM as a spell checker.
As I don't know much about RNNs and LSTMs, I only got a very basic understanding from the above link.
Question
I am confused about how to build the training set for an RNN for my problem. Should I:
use the correctly spelled words from the entire dataset as the training set and the full descriptions of my dataset as the test set, or
use the pairs of similar words and the abbreviation list as the training set and the descriptions of my dataset as the test set (where the model finds the wrong word in a description and corrects it), or
some other way? Could someone please tell me how to proceed?
Could you give some more information about the model you are building?
It makes sense to use a character-level sequence-to-sequence model, similar to one you would use for translation. There are already some approaches trying to do the same (1, 2, 3).
Maybe draw on them for some inspiration?
Now, with regard to the dataset: it seems that the one you are trying to use mostly contains errors? If you don't have the correct version of each phrase, I don't think you can use this dataset.
A simple approach would be to get an existing dataset and introduce random noise into it. The Deep Spelling blog talks about how you can do that with an existing text corpus. Also, a recommendation from myself would be to use small-ish standalone sentences as the training set. A good place to find those is machine translation datasets (like the Tatoeba project), using only the English phrases. Out of those you can create pairs of (input_phrase, target_phrase) where the input_phrase is potentially noisy (but not always), as sketched below.
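For illustration, a minimal sketch of that noise-injection idea (the three edit types and the 10% noise rate are arbitrary choices of mine, not from the blog):

    import random
    import string

    def add_noise(phrase, noise_level=0.1):
        # Corrupt a clean phrase with random character deletions, substitutions,
        # and insertions to create the noisy input of an (input, target) pair.
        out = []
        for c in phrase:
            r = random.random()
            if r < noise_level / 3:
                continue  # delete this character
            elif r < 2 * noise_level / 3:
                out.append(random.choice(string.ascii_lowercase))  # substitute it
            else:
                out.append(c)
                if r < noise_level:
                    out.append(random.choice(string.ascii_lowercase))  # insert after it
        return "".join(out)

    clean = "hair smoothing kit"
    pairs = [(add_noise(clean), clean) for _ in range(5)]
    print(pairs)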
With regard to performance: firstly, 12 hours for one pass over a 5M-record dataset sounds about right for a home PC. You can use a GPU or a cloud solution (1, 2) for faster training.
Now, for false-positive corrections, the dictionary you have created could indeed be handy: if a word already exists in this dictionary, don't accept a "correction" of it from the model.
I just wanted to understand (from your experience) what a good training data size would be for a sentiment analysis classification model (using NLTK). For instance, if my training data is going to contain tweets, and I intend to classify them as positive, negative, or neutral, how many tweets per category should I ideally have to get a reasonable model working?
I understand that there are many factors involved, like the quality of the data, but what would be a good number to get started with?
That's a really hard question to answer for people who are not familiar with the exact data, its labelling, and the application you want to use it for. But as a ballpark estimate, I would say start with 1,000 examples of each class and go from there; a minimal starting point is sketched below.
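For reference, a minimal sketch of a starting point with NLTK's NaiveBayesClassifier (the bag-of-words features and the toy tweets are placeholders; a real training set would have on the order of 1,000 labelled tweets per class, as suggested above):

    from nltk.classify import NaiveBayesClassifier

    # Toy feature extractor: bag-of-words presence features.
    def features(tweet):
        return {word: True for word in tweet.lower().split()}

    # Placeholder labelled tweets; replace with ~1,000 examples per class.
    train_data = [
        ("i love this phone", "positive"),
        ("this is terrible", "negative"),
        ("it arrived today", "neutral"),
    ]

    classifier = NaiveBayesClassifier.train(
        [(features(text), label) for text, label in train_data]
    )
    print(classifier.classify(features("i love it")))  # prints the predicted label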