I want to manually implement a classifier for certain short strings of words, getting a "goodness" rank for each of them. I have made a naive Bayesian classifier which is basically spam-filter-like and scores strings based on previous "good"/"bad" ratings. So far so good.
Now, there are two problems that I want to solve (by properly understanding things)...
The question is - what would be good introductory material for below, not of "cookbook" variety but more systematic, and yet ideally shorter than a university statistics course :) Set of articles that is shorter than the book, or a good book. Aimed at programmers ideally.
The problems are:
first, in my system there are actually 3 types of user feedback - "good", "bad", and "neutral". Most items are neutral, and right now I simply don't include them in the ranking. I am wondering how these things are properly handled (I still need to obtain a single "goodness probability" per item, so if I calculate probability of good and bad separately, are there any pitfalls/proper methods to combining those).
Then, I want to remove the naive part from my classifier (i.e. take relations between words into account), so some different classifier may be in order. Or, I could add all pairs-triples-etc. of words as features, since the strings are short - this feels like a hack, but then again my CS/maths background is rusty enough and/or insufficient to say whether this is a valid technique.
Related
I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an f1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that deal well when given small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is Checklist. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList" is very readable and goes into depth on this type of issue. They have a very relevant table... but need some terms:
We prompt users to evaluate each capability with
three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a
behavior within a capability. MFTs are similar to
creating small and focused testing datasets, and are
particularly useful for detecting when models use
shortcuts to handle complex inputs without actually
mastering the capability.
...An Invariance test (INV) is when we apply
label-preserving perturbations to inputs and expect
the model prediction to remain the same.
A Directional Expectation test (DIR) is similar,
except that the label is expected to change in a certain way. For example, we expect that sentiment
will not become more positive if we add “You are
lame.” to the end of tweets directed at an airline
(Figure 1C).
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. Starting with the fact that in my corner of Europe, the general sentiment toward peer review requirements for any kind of NLG project are higher by several orders of magnitude compared to other sciences - and likely not without reason or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and seniority level/control you have over the project will eventually compose your vector decision for you. I's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase because of those precise reasons you've mentioned. I'd recommend the same in terms of potential open source involvement but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough and will need to be promulgated far and wide before community-driven tooling will give you serious options on a micro level.
Blasphemy, sure, but the performance is already stacked against you As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in a "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, blapshemous as this might sound to a 2021 data scientist
. Some might ARight nowccounting forr the ridicule this would receive xou sltrsfx hsbr s fszs drz zhsz s mrnzsl rcrtvidr, sz lrsdz
due to performance easons.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/
I am currently working on a search ranking algorithm which will be applied to elastic search queries (domain: e-commerce). It assigns scores on several entities returned and finally sorts them based on the score assigned.
My question is: Has anyone ever tried to introduce a certain level of randomness to any search algorithm and has experienced a positive effect of it. I am thinking that it might be useful to reduce bias and promote the lower ranking items to give them a chance to be seen easier and get popular if they deserve it. I know that some machine learning algorithms are introducing some randomization to reduce the bias so I thought it might be applied to search as well.
Closest I can get here is this but not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID
Your intuition is right - randomization can help surface results that get a lower than deserved score due to uncertainty in the estimation. Empirically, Google search ads seemed to have sometimes been randomized, and e.g. this paper is hinting at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
Suppose I have a set of phrases - about 10 000 - of average length - 7-20 words in which I want to find some given phrase. The phrase I am looking for could have some errors - for example miss one or two words, have some words misplaced, or some random words - for example my database contains "As I was riding my red bike, I saw Christine", and I want it to much "As I was riding my blue bike, saw Christine", or "I was riding my bike, I saw Christine and Marion". What could be some good approach to this problem? I know about Levenhstein's distance, and I also suppose that this problem may have no easy, good solution.
A good text search engine will provide capabilities such as you describe, fsh. A typical approach would be to create a query that matches if any of the words occurs and orders the results using a weight based on number of terms occurring in proximity to each other and weighted inversely to their probability of occurring, since uncommon words will be less likely to co-occur by chance. There's a whole theory of this sort of thing called information retrieval, but maybe you know about that. Furthermore you'd like to make sure that word-level fuzziness gets accounted for by normalizing case, punctuation and the like and applying some basic linguistic transformations (stemming), and in some cases introducing a dictionary of synonyms, especially when there is domain knowledge available to condition it.
If you're interested in messing around with this stuff, try an open-source search engine, this article by Vik gives a reasonable survey from the perspective of 2009, and this one by Middleton and Baeza-Yates gives a good detailed introduction to the topic.
Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved by a bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or, should I just be looking for keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There is definitely a lot of books and articles you can read about it but I will try to provide a short introduction. I am not a big expert but I worked on some of this stuff.
First you need to decide whether you are want to classify essays into predefined topics/categories (classification problem) or you want the algorithm to decide on different groups on its own (clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
The TF-IDF is a short for TermFrequence - InverseDocumentFrequency. Basically the idea is for any given document to find a number of documents in training set that are most similar to it, and then figure out it's category based on that. For example if document D is similar to T1 which is physics and T2 which is physics and T3 which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is you apply the most importance to rare words and no importance to common words. For instance 'nuclei' is rare physics word, but 'work' is very common non-interesting word. (That's why it's called inverse term frequency). If you can work with Java, there is a very very good Lucene library which provides most of this stuff out of the box. Look for API for 'similar documents' and look into how it is implemented. Or just google for 'TF-IDF' if you want to implement your own
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf
Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.
Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.
How do you go about finding the possibly related documents?
I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.
edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.
This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:
The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.
These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
You should read the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications" (ISBN 0596529325)!
For some method and code: First ask yourself, whether you want to find direct similarities based on word matches, or whether you want to show similar articles that may not directly relate to the current one, but belong to the same cluster of articles.
See Cluster analysis / Partitional clustering.
A very simple (but theoretical and slow) method for finding direct similarities would be:
Preprocess:
Store flat word list per article (do not remove duplicate words).
"Cross join" the articles: count number of words in article A that match same words in article B. You now have a matrix int word_matches[narticles][narticles] (you should not store it like that, similarity of A->B is same as B->A, so a sparse matrix saves almost half the space).
Normalize the word_matches counts to range 0..1! (find max count, then divide any count by this) - you should store floats there, not ints ;)
Find similar articles:
select the X articles with highest matches from word_matches
This is a typical case of Document Classification which is studied in every class of Machine Learning. If you like statistics, mathematics and computer science, I recommend that you have a look at the unsupervised methods like kmeans++, Bayesian methods and LDA. In particular, Bayesian methods are pretty good at what are you looking for, their only problem is being slow (but unless you run a very large site, that shouldn't bother you much).
On a more practical and less theoretical approach, I recommend that you have a look a this and this other great code examples.
A small vector-space-model search engine in Ruby. The basic idea is that two documents are related if they contain the same words. So we count the occurrence of words in each document and then compute the cosine between these vectors (each terms has a fixed index, if it appears there is a 1 at that index, if not a zero). Cosine will be 1.0 if two documents have all terms common, and 0.0 if they have no common terms. You can directly translate that to % values.
terms = Hash.new{|h,k|h[k]=h.size}
docs = DATA.collect { |line|
name = line.match(/^\d+/)
words = line.downcase.scan(/[a-z]+/)
vector = []
words.each { |word| vector[terms[word]] = 1 }
{:name=>name,:vector=>vector}
}
current = docs.first # or any other
docs.sort_by { |doc|
# assume we have defined cosine on arrays
doc[:vector].cosine(current[:vector])
}
related = docs[1..5].collect{|doc|doc[:name]}
puts related
__END__
0 Human machine interface for Lab ABC computer applications
1 A survey of user opinion of computer system response time
2 The EPS user interface management system
3 System and human system engineering testing of EPS
4 Relation of user-perceived response time to error measurement
5 The generation of random, binary, unordered trees
6 The intersection graph of paths in trees
7 Graph minors IV: Widths of trees and well-quasi-ordering
8 Graph minors: A survey
the definition of Array#cosine is left as an exercise to the reader (should deal with nil values and different lengths, but well for that we got Array#zip right?)
BTW, the example documents are taken from the SVD paper by Deerwester etal :)
Some time ago I implemented something similiar. Maybe this idea is now outdated, but I hope it can help.
I ran a ASP 3.0 website for programming common tasks and started from this principle: user have a doubt and will stay on website as long he/she can find interesting content on that subject.
When an user arrived, I started an ASP 3.0 Session object and recorded all user navigation, just like a linked list. At Session.OnEnd event, I take first link, look for next link and incremented a counter column like:
<Article Title="Cookie problem A">
<NextPage Title="Cookie problem B" Count="5" />
<NextPage Title="Cookie problem C" Count="2" />
</Article>
So, to check related articles I just had to list top n NextPage entities, ordered by counter column descending.