Creating a smart text generator - text

I'm doing this for fun (or as 4chan says "for teh lolz") and if I learn something on the way all the better. I took an AI course almost 2 years ago now and I really enjoyed it but I managed to forget everything so this is a way to refresh that.
Anyway I want to be able to generate text given a set of inputs. Basically this will read forum inputs (or maybe Twitter tweets) and then generate a comment based on the learning.
Now the simplest way would be to use a Markov Chain Text Generator but I want something a little bit more complex than that as the MKC basically only learns by word order (which word is more likely to appear after word x given the input text). I'm trying to see if there's something I can do to make it a little bit more smarter.
For example I want it to do something like this:
Learn from a large selection of posts in a message board but don't weight it too much
For each post:
Learn from the other comments in that post and weigh these inputs higher
Generate comment and post
See what other users' reaction to your post was. If good weigh it positively so you make more posts that are similar to the one made, and vice versa if negative.
It's the weighing and learning from mistakes part that I'm not sure how to implement. I thought about Artificial Neural Networks (mainly because I remember enjoying that chapter) but as far as I can tell that's mainly used to classify things (i.e. given a finite set of choices [x1...xn] which x is this given input) not really generate anything.
I'm not even sure if this is possible or if it is what should I go about learning/figuring out. What algorithm is best suited for this?
To those worried that I will use this as a bot to spam or provide bad answers to SO, I promise that I will not use this to provide (bad) advice or to spam for profit. I definitely will not post it's nonsensical thoughts on SO. I plan to use it for my own amusement.
Thanks!

I was thinking about something like this, too. I think it could pose a significant improvement to use a grammatical analyzer together with a Markov Chain Generator. Then the MC can be trained on text phrases (verb "drive" often together with object "car") and produce grammatically correct sentences.

Related

Python sentiment / text analysis advice

I don't know if this is the right place to ask this but, i am trying to build a bot in Python that will read incoming messages on a Slack channel where customer post their issues such as 'unable to connect to VPN', 'can someone reply to my ticket' etc…
The bot will analyze the message, determine if the customer is angry or not, and then propose a solution until an agent is free to actually check the issue.
Now, I was experimenting with TextBlob for the sentiment analysis part, but I don't know which technologies to actually use to determine the issue based on specific keywords and provide a solution to the user. Can someone propose me some python libraries/technologies that I could use to achieve this ?
To be honest your question is to generic to answer in one go.
Nontheless, you first have to clearly define the scope of your project. In doing so, you might want to first do a quick literaty review (Google Scholar) to familiarize with the state of the art technologies and methods.
From my little experience, a common (maybe simple) technique (lexicon-based approach) used to determine the sentiment of a word, is to use a pre-compiled dictionary (you can create your own though) that contains words - sentiment mappings. For example:
word:tired, sentiment:negative, score:5
So each time the bot finds the keyword "tired" in a sentence it will assign its corresponding negative value (polarity) to the sentence.
You might want to consider applying POS tags in the input text, as sometimes nouns or ``verbs carry significant meaning, compared to adjectives for example.
Keep in mind though, that negative comments can be written in the form of sarcasm. Sarcasm detectioin is a more difficult task though.
Alternatively, you could try using a pre-trained model such as bert-base-multilingual-uncased-sentiment that can be found here in Hugging Face.
For more information on the matter you have a look at this post.
Again as I mentioned, you have to clearly define your goals. This will enable you to specify the libraries or methodology available to solve your problem. Hope my answer helps.

BERT models: how robust are they to typos?

let me introduce the context briefly: I'm fine tuning a generic BERT model for the context of food and beverage. The final goal is a classification task.
To train this model, I'm using a corpus of text gathered from blog posts, articles, magazines etc... that cover the topic.
I am however facing a predicament that I don't know how to handle: specifically, there are sometimes words that either contain a typo, or maybe different accents, but that are semantically the same.
Let me give you an example to briefly illustrate what I mean:
The wine Gewürztraminer is correctly written with the ü, however sometimes you also find it written with just a normal u, or some other times even just Gewurtz. There are several situations like this one.
Now, a human being would obviously know that we're talking exactly about the same thing, but I have absolutely no idea about how BERT would handle these situations. Would it understand that they're the same thing? Would it consider them instead to be completely different words?
I am currently in the process of cleaning my training data, fixing the typos and trying to even out all these inconsistencies, but at this point I'm not even sure if I should do that at all, considering that the text that will need to be classified can potentially contain typos and situations like the one described above.
What would you guys suggest?

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an f1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that deal well when given small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is Checklist. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList" is very readable and goes into depth on this type of issue. They have a very relevant table... but need some terms:
We prompt users to evaluate each capability with
three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a
behavior within a capability. MFTs are similar to
creating small and focused testing datasets, and are
particularly useful for detecting when models use
shortcuts to handle complex inputs without actually
mastering the capability.
...An Invariance test (INV) is when we apply
label-preserving perturbations to inputs and expect
the model prediction to remain the same.
A Directional Expectation test (DIR) is similar,
except that the label is expected to change in a certain way. For example, we expect that sentiment
will not become more positive if we add “You are
lame.” to the end of tweets directed at an airline
(Figure 1C).
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. Starting with the fact that in my corner of Europe, the general sentiment toward peer review requirements for any kind of NLG project are higher by several orders of magnitude compared to other sciences - and likely not without reason or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and seniority level/control you have over the project will eventually compose your vector decision for you. I's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase because of those precise reasons you've mentioned. I'd recommend the same in terms of potential open source involvement but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough and will need to be promulgated far and wide before community-driven tooling will give you serious options on a micro level.
Blasphemy, sure, but the performance is already stacked against you As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in a "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, blapshemous as this might sound to a 2021 data scientist
. Some might ARight nowccounting forr the ridicule this would receive xou sltrsfx hsbr s fszs drz zhsz s mrnzsl rcrtvidr, sz lrsdz
due to performance easons.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines) and the text is usually badly written and contains so many spelling mistakes.
So i must go through some text cleaning and vectorizing. To do so, i considered two approaches:
First one:
cleaning text by replacing bad words using hunspell package which is a spell checker and morphological analyzer
+
tokenization
+
convert sentences to vectors using tf-idf
The problem here is that sometimes, Hunspell fails to provide the correct word and changes the misspelled word with another word that don't have the same meaning. Furthermore, hunspell does not reconize acronyms or abbreviation (which are very important in my case) and tends to replace them.
Second approache:
tokenization
+
using some embeddings methode (like word2vec) to convert words into vectors without cleaning text
I need to know if there is some (theoretical or empirical) way to compare this two approaches :)
Please do not hesitate to respond If you have any ideas to share, I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in a longer form and give you a bit more commentary. No sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work out, for example, is US vs. UK spelling or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should do that) is that you are not looking to build a machine learning model here. You could consider the word embeddings that you could generate, a sort of a machine learning model but it's not. It's just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using hunspell introduces new mistakes. It will be no doubt also the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task as #lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly we use another task which is dependent on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the efficiency of your approaches. It is also a chance for you to see if they do more harm than good by comparing to the baseline of "dirty" data. Remember that it's about relative differences and the actual performance of the task is of no importance.

NLP classify sentences/paragraph as funny

Is there a way to classify a particular sentence/paragraph as funny. There are very few pointers as to where one should go further on this.
There is research on this, it's called Computational Humor. It's an interdisciplinary area that takes elements from computational linguistics, psycholinguistics, artificial intelligence, machine learning etc. They are trying to find out what it is that makes stories or jokes funny (e.g. the unexpected connection, or using a taboo topic in a surprising way etc) and apply it to text (either to generate a funny story or to measure the 'funniness' of text).
There are books and articles about it (e.g. by Graeme Ritchie).
Yes, you should use a Training Corpora to build a predictive model able to detect funny sentences. Sometimes this is known as "Sentiment Analysis" in the literature. Take a look at this article about Sentiment Analysis with LingPipe.
If you can use Java, you can use their library (see license matrix). I found it very useful, not exactly in the same context than you.
The only way to pull this off is to get a couple of thousand people (monkeys won't do, sorry) to look through thousands of funny sentences/stories, rate them, and then build some sort of expert system/neural network out of it. Given the problem scope and the subjectivity of it (a thing funny to one person might not be funny - even offensive - to another), I'd say it's an impossible task.
You can use the same technique as spam filters. Instead of spam/non-spam you classify on funny/not-funny. Look into naive bayesian classifiers for more information.
http://en.wikipedia.org/wiki/Naive_Bayesian_classification
Also, try Computational Humor # Google Scholar if you're serious about getting into the field. Sentiment Analysis has been mentioned too, see wikipedia on that.
Of course, this all depends on what your scope and aims are...

Resources