Using GPT2 to find commonalities in text records - nlp

I have a dataset with many incidents, and most of the data is free text: one row per incident with a text field describing what happened. I fine-tuned a GPT-2 model on the free text and then tried prompts such as
"The person got burned because", hoping to find the most common causes of burns.
The causes can be written in many different ways, so I thought capturing the meaning of each one might work.
The prompts work, but they produce some funny made-up reasons, so I don't think this approach is working well for what I want to do.
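For reference, this is roughly my setup, as a sketch using Hugging Face transformers ("./incident-gpt2" is just a placeholder for wherever I saved the fine-tuned checkpoint):

# Sketch of prompting a GPT-2 model fine-tuned on the incident texts.
# "./incident-gpt2" is a placeholder path for the fine-tuned checkpoint.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("./incident-gpt2")
model = GPT2LMHeadModel.from_pretrained("./incident-gpt2")

prompt = "The person got burned because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, do_sample=True, top_k=50, max_length=40,
                         num_return_sequences=5, pad_token_id=tokenizer.eos_token_id)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))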

Related

GPT-J and GPT-Neo generate sentences that are too long

I fine-tuned GPT-J and GPT-Neo models on my texts and am trying to generate new text. But very often the sentences are very long (sometimes 300 characters each), although in the dataset the sentences are of normal length (usually 50-100 characters). I have tried a lot of things and adjusted the temperature and top_k, but half of the results still come out as long phrases, and I need shorter ones.
What can I try?
Here are some examples of the overly long generated results:
The support system that they have built has allowed us as users who
are not code programmers or IT administrators some ability to create
our own custom solutions without needing much programming experience
ourselves from scratch!
All it requires are documents about your inventory process but
I've found them helpful as they make sure you do everything right for
maximum efficiency because their knowledge base keeps reminding me
there's new ways i can be doing some things wrong since upgrading my
license so even though its good at finding errors with documentation
like an auditor may bring up later downline someone else might benefit
if those files dont exist anymore after one year when upgrades renews
automatically!
With all GPT models you can specify the "max_length" parameter during generation. This caps the output at max_length tokens (prompt included), so the model cannot run on past that point; it can still stop earlier when it produces an end-of-sequence token. You could also play with num_return_sequences and use a helper function to choose the shortest sequence.
Example:
output = model.generate(input_ids, do_sample=True, top_k=50, max_length=100, top_p=0.95, num_return_sequences=1)
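To act on the num_return_sequences suggestion, you could sample several candidates and keep the shortest one; a sketch, assuming model, tokenizer and input_ids are whatever you already have loaded for your fine-tuned checkpoint:

# Sample several candidates and keep the shortest one (assumes model, tokenizer
# and input_ids are already set up for your fine-tuned GPT-J/GPT-Neo checkpoint).
outputs = model.generate(
    input_ids,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    max_length=100,          # hard cap on total length, prompt included
    num_return_sequences=5,  # sample several candidates
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(min(candidates, key=len))  # keep the shortest candidate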
These large language models are trained on massive amounts of data, and fine-tuning them can take patience as they learn to adapt to what you're feeding them. Try different things: adjust your training data format, try different samples, use a pre-prompt during generation to guide the model, and so on. A model like GPT-J does a mind-numbingly large amount of calculations just to spit out a single word, so it is hard to predict exactly what is causing it to say one thing over another.

BERT models: how robust are they to typos?

Let me introduce the context briefly: I'm fine-tuning a generic BERT model for the food and beverage domain. The final goal is a classification task.
To train this model, I'm using a corpus of text gathered from blog posts, articles, magazines, etc. that cover the topic.
I am, however, facing a predicament that I don't know how to handle: sometimes words contain a typo, or are written with different accents, but are semantically the same.
Let me give you an example to briefly illustrate what I mean:
The wine Gewürztraminer is correctly written with the ü; however, you sometimes find it written with just a normal u, or even shortened to Gewurtz. There are several situations like this one.
Now, a human being would obviously know that we're talking exactly about the same thing, but I have absolutely no idea about how BERT would handle these situations. Would it understand that they're the same thing? Would it consider them instead to be completely different words?
I am currently in the process of cleaning my training data, fixing the typos and trying to even out all these inconsistencies, but at this point I'm not even sure if I should do that at all, considering that the text that will need to be classified can potentially contain typos and situations like the one described above.
What would you guys suggest?
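One quick check I could run is how the tokenizer actually splits these variants, e.g. something like this sketch (the exact subword pieces will depend on the vocabulary of the checkpoint being fine-tuned):

# Compare how a BERT tokenizer splits the different spellings.
# The subword pieces depend on the checkpoint's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
for word in ["Gewürztraminer", "Gewurztraminer", "Gewurtz"]:
    print(word, "->", tokenizer.tokenize(word))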

Any way to get past the minimum of 20 tokens for text classification - Google NLP API

Is there any way to get past the minimum token requirement for Google's NLP API text classification method? I'm trying to input a short, simple sentence such as "I can't wait for the presidential debates", but this returns an error saying:
Invalid text content: too few tokens (words) to process.
Is there any way to get around this? I've tried inputting random words until the string reached 20 tokens, but that messes up the labels and confidence a lot of the time. If there is a way around this, such as setting an option or adding something, that would be awesome! If there is no workaround, let me know if you know of another pre-trained text classification model that would work for me!
Also, I can't create the categories and labels I want myself; there would just be too many needed for what I'm doing, which is why the predefined categories in the NLP API are great. I just need to get rid of that 20-token requirement.
As clarified in the official Content Classification documentation:
Important: You must supply a text block (document) with at least twenty tokens (words) to the classifyText method.
Considering that, and after checking for possible alternatives, it seems that, unfortunately, there isn't a way to work around this. You will indeed need to supply at least 20 words.
For this reason, searching around, I found this one and this other one - the latter in Chinese, but it might still help you :) - both of them lists of pre-trained models for text classification that I believe might help you.
Anyway, feel free to raise a Feature Request in Google's Issue Tracker so they can look into the possibility of removing this limitation.
Let me know if the information helped you!
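For completeness, this is the kind of classifyText call being discussed - a sketch with the google-cloud-language Python client, assuming credentials are configured and the text already meets the 20-token minimum:

# Sketch of a classifyText request with the Cloud Natural Language Python client.
# Assumes application default credentials are set up and the text has 20+ tokens.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
text = ("I can't wait for the presidential debates because the candidates will "
        "finally have to explain their positions on the economy, healthcare and education.")
document = language_v1.Document(content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
response = client.classify_text(request={"document": document})
for category in response.categories:
    print(category.name, category.confidence)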

Excel - how to store several words into one combined substring

I am working with a document where each row contains the description of a specific incident (fire incidents, where firefighters turn up and afterwards write a report).
The incidents/reports are written by several different people, so the language varies a lot, which makes it difficult to code for one specific context using one word: ISNUMBER(SEARCH(substring;text))
Even if the word is in the text, the context is often not related to what I am trying to analyse.
I want to broaden my word search and make it more flexible by being able to "put" or "store" several different words/phrases into my "substring", so I can get closer to the specific context that I wish to analyse.
This way I can cover more data that is in fact related, but described differently in the individual incident reports.
I have tried to search for a solution myself, but am unsure how to phrase this specific inquiry.
So far I have only been able to use the formula above, which is a bit insufficient when trying to comb through 2000 rows.
I hope that someone is able to help me!
Thank you
An example:
Store the following phrases: "stopped fire", "killed fire", "fire was put out" under one label: "Killed fire".
So that when I search for "Killed fire", all the above wordings are included in my search.
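To make the idea concrete, here is roughly what I mean, written as a Python-style sketch rather than an Excel formula (the column name "description" is just an example):

# Sketch of the grouping idea: several phrasings stored under one label,
# and each incident description checked against the whole group.
import pandas as pd

killed_fire = ["stopped fire", "killed fire", "fire was put out"]

df = pd.DataFrame({"description": [
    "Crew arrived and the fire was put out within minutes.",
    "Smoke detected, no open flames found.",
]})

pattern = "|".join(killed_fire)  # match any phrasing in the group
df["killed_fire"] = df["description"].str.contains(pattern, case=False)
print(df)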

TensorFlow for classification of strings vs Elasticsearch

So, a little bit on my problem.
TL;DR
Can I use machine learning instead of Elasticsearch to find results based on the user's text input? Is it a good idea?
I am working on a car spare parts project, and we have split the car into 300 parts that we store in the database, with some data for each part (weight, availability, etc.).
When the customer types in a description of their part, we need to be able to classify it and map it to one of the parts in our database.
Currently, people on our team manually map the customer text to the parts in our database; we want to automate that process.
We tried using MongoDB text search, but it was often inaccurate since parts have different names in different parts of the country.
So we wanted something that gives more accurate results and improves as we gather more data, and we immediately considered TensorFlow. After some research and working through part of Google's Machine Learning Crash Course, I got to the point where it states:
Models can't learn from string values, so you'll have to perform some feature engineering to convert those values to something numeric
That would be useful if we had a limited number of string features, but we don't know what the user will input as text.
So, my questions are:
1- Can we use machine learning to map text input by the user to documents in our database?
2- If we can do that, is it a good idea to favor it over other search tools like ElasticSearch?
3- Can ElasticSearch improve its results the more data we have? How?
4- How would you go about this problem?
Note: I'd be doing this in Node.js, and since TensorFlow.js is new, I am inclined to go for other solutions, but if push comes to shove and the results are much better, I would definitely go that route.
TL;DR: Yes and yes.
TS;WM:
This is a problem perfectly suited to machine learning, especially if you have a database of past customer texts that have already been mapped to parts. Ideally, you have hundreds of texts mapped to each part. If that is available, you can design and train a network. And models can learn from string values with some feature engineering; it's not that bad.
I'm not sure Elasticsearch would improve much on the network. I don't know much about auto parts trading, but as a wild guess, "the large round thingy that helps change direction" would never be mapped to "steering wheel" by ES, but it could be learned easily by a network - provided there are at least some examples of people using that text to refer to a steering wheel.
You can, but don't necessarily have to, use TensorFlow.js for your network. The model could run on your server as a web service: you'd just send over the customer's text, and it would send back its recommendations of part SKUs and names.
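To make that concrete, here is a very rough sketch of such a classifier in Keras (the example texts and labels are made up; in practice you would train on your full set of manually mapped customer texts and all ~300 part classes):

# Rough sketch: map free-text part descriptions to one of ~300 part classes.
# The training pairs below are made-up placeholders for the existing manual mappings.
import tensorflow as tf

texts = [
    "front brake pads", "pads for the front brakes",
    "steering wheel", "the wheel you steer with",
]
labels = [0, 0, 1, 1]   # integer part IDs in the range 0..num_parts-1
num_parts = 300         # number of parts in the database

# Turn raw strings into padded integer sequences.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=32)
vectorizer.adapt(tf.constant(texts))

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(input_dim=20000, output_dim=64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(num_parts, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(tf.constant(texts), tf.constant(labels), epochs=10)

# At inference time the server-side service would just run the customer's text
# through the model and return the most likely part IDs.
probs = model.predict(tf.constant(["round thing that helps change direction"]))[0]
top_part_ids = tf.argsort(probs, direction="DESCENDING")[:3]
print(top_part_ids.numpy())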
