Sentence Transformers Using BOW? - nlp

I have a collection of terms that appear on or are somehow related to web pages (e.g. keywords from the HTML tags). These are not sentences; they are just collections of keywords, words in a title, etc. Given such a web page, I want to find the pages most similar to it. If I had sentences or paragraphs, I would think of using a sentence transformer or even Doc2Vec. But in this case I only have the set of words for a page, with no real context or sentences. Am I correct that this precludes me from using a sentence transformer or Doc2Vec?

Nothing precludes you from using anything. The relevant test is: does using it work, for your unique data & goals?
Doc2Vec and other shallow techniques work fine on things like lists-of-keywords that aren't perfect grammatical sentences: they're generally using the presence or absence of words, without rigorous grammatical understanding, as signals. And that's plenty for many purposes!
Some deeper transformers might have more order-dependent reliance on coherent natural-language utterances – but I wouldn't be sure of that until it was tried and shown lacking. It might work! And no one with only the vaguest sketch (from your question) of your data & goals can give you hints better than your own experiments.
Try things – including super-simple things like cosine-similarities on bag-of-words representation, or keyword searches based on some measure of most significant terms – then evaluate the results according to your needs/desired results.
You might start some evaluations via ad-hoc eyeballing – "this seems good, this seems wrong" – but would ideally record judgements of which docs "should" be more-similar than others, in your desired end-system, so that eventually you can run an automatic, quantitative comparison of alternate approaches.
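As a rough illustration of the "super-simple" baseline mentioned above, here is a minimal sketch of cosine similarity over bag-of-words counts, using only the standard library. The page names and keyword lists are made up for the example; real data would come from your own pages.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical keyword collections, one per web page.
pages = {
    "page_a": ["python", "nlp", "embeddings", "tutorial"],
    "page_b": ["python", "nlp", "tokenizer", "tutorial"],
    "page_c": ["recipes", "baking", "bread"],
}
vectors = {name: Counter(words) for name, words in pages.items()}

def most_similar(query: str, k: int = 2):
    """Rank all other pages by similarity to the query page."""
    scores = [
        (other, cosine_similarity(vectors[query], vectors[other]))
        for other in vectors
        if other != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

print(most_similar("page_a"))
```

Word order plays no role here, which is why a baseline like this is indifferent to whether the input was ever a sentence.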

Related

Embeddings vs text cleaning (NLP)

I am a graduate student focusing on ML and NLP. I have a lot of data (8 million lines) and the text is usually badly written and contains many spelling mistakes.
So I must go through some text cleaning and vectorizing. To do so, I considered two approaches:
First one:
cleaning text by replacing bad words using hunspell package which is a spell checker and morphological analyzer
+
tokenization
+
convert sentences to vectors using tf-idf
The problem here is that Hunspell sometimes fails to provide the correct word and replaces the misspelled word with another word that doesn't have the same meaning. Furthermore, Hunspell does not recognize acronyms or abbreviations (which are very important in my case) and tends to replace them.
Second approach:
tokenization
+
using some embedding method (like word2vec) to convert words into vectors without cleaning the text
I need to know if there is some (theoretical or empirical) way to compare these two approaches :)
Please do not hesitate to respond if you have any ideas to share, I'd love to discuss them with you.
Thank you in advance
I post this here just to summarise the comments in a longer form and give you a bit more commentary. Not sure it will answer your question. If anything, it should show you why you should reconsider it.
Points about your question
Before I talk about your question, let me point a few things about your approaches. Word embeddings are essentially mathematical representations of meaning based on word distribution. They are the epitome of the phrase "You shall know a word by the company it keeps". In this sense, you will need very regular misspellings in order to get something useful out of a vector space approach. Something that could work out, for example, is US vs. UK spelling or shorthands like w8 vs. full forms like wait.
Another point I want to make clear (or perhaps you should do that) is that you are not looking to build a machine learning model here. You could consider the word embeddings you generate to be a sort of machine learning model, but they are not. They are just a way of representing words with numbers.
You already have the answer to your question
You yourself have pointed out that using hunspell introduces new mistakes. That will no doubt also be the case with your other approach. If this is just a preprocessing step, I suggest you leave it at that. It is not something you need to prove. If for some reason you do want to dig into the problem, you could evaluate the effects of your methods through an external task, as @lenz suggested.
How does external evaluation work?
When a task is too difficult to evaluate directly, we use another task that depends on its output to draw conclusions about its success. In your case, it seems that you should pick a task that depends on individual words, like document classification. Let's say that you have some sort of labels associated with your documents, say topics or types of news. Predicting these labels could be a legitimate way of evaluating the effectiveness of your approaches. It is also a chance for you to see if they do more harm than good by comparing against the baseline of "dirty" data. Remember that it's about relative differences; the absolute performance on the task is of no importance.
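To make the external-evaluation idea concrete, here is a minimal sketch that scores two preprocessing pipelines by the accuracy of the same downstream classifier. Everything in it is an assumption for illustration: the toy documents and labels, the `CORRECTIONS` table standing in for a real spell checker, and the deliberately crude word-vote classifier. Only the relative difference between the two scores is of interest.

```python
from collections import Counter, defaultdict

# Hypothetical correction table standing in for a spell checker such as hunspell.
CORRECTIONS = {"goverment": "government", "parliment": "parliament"}

def raw_tokens(text):
    """Baseline pipeline: lowercase and split, no cleaning."""
    return text.lower().split()

def cleaned_tokens(text):
    """Cleaning pipeline: apply corrections before vectorizing."""
    return [CORRECTIONS.get(tok, tok) for tok in raw_tokens(text)]

def train_word_votes(examples, preprocess):
    """Count how often each token appears under each label."""
    votes = defaultdict(Counter)
    for text, label in examples:
        for token in preprocess(text):
            votes[token][label] += 1
    return votes

def predict(votes, text, preprocess, default):
    """Sum the label counts of every known token; fall back to a default label."""
    tally = Counter()
    for token in preprocess(text):
        tally.update(votes.get(token, {}))
    return tally.most_common(1)[0][0] if tally else default

def accuracy(train, test, preprocess):
    """Downstream score used as the external measure of a pipeline."""
    votes = train_word_votes(train, preprocess)
    default = Counter(label for _, label in train).most_common(1)[0][0]
    hits = sum(predict(votes, text, preprocess, default) == label
               for text, label in test)
    return hits / len(test)

# Toy labelled data (made up for this sketch).
TRAIN = [
    ("the goverment raised taxes", "politics"),
    ("the team won the match", "sports"),
    ("parliment passed the law", "politics"),
    ("the striker scored a goal", "sports"),
]
TEST = [
    ("government passed a new law", "politics"),
    ("the team scored", "sports"),
]

scores = {name: accuracy(TRAIN, TEST, fn)
          for name, fn in [("raw", raw_tokens), ("cleaned", cleaned_tokens)]}
print(scores)
```

On real data you would use a proper held-out split and a stronger classifier, but the comparison logic stays the same: same task, same model, only the preprocessing changes.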

NLP: retrieve vocabulary from text

I have some texts in different languages, potentially with some typos or other mistakes, and I want to retrieve their vocabulary. I'm not experienced with NLP in general, so maybe I use some words improperly.
By vocabulary I mean a collection of words of a single language in which every word is unique and the inflections for gender, number, or tense are not considered (e.g. think, thinks and thought are all considered think).
This is the master problem, so let's reduce it to retrieving the vocabulary of one language, English for example, and without mistakes.
I think there are (at least) three different approaches and maybe the solution consists of a combination of them:
search in a database of words stored in relation with each others. So, I could search for thought (considering the verb) and read the associated information that thought is an inflection of think
compute the "base form" (a word without inflections) of a word by processing the inflected form. Maybe it can be done with stemming?
use a service through an API. Yes, I also accept this approach, but I'd prefer to do it locally
For a first approximation, it's not necessary that the algorithm distinguishes between nouns and verbs. For instance, if the word thought appeared in the text as both noun and verb, it could be considered already present in the vocabulary at the second match.
We have reduced the problem to retrieving the vocabulary of an English text without mistakes, and without considering the tags of the words.
Any ideas about how to do that? Or just some tips?
Of course, if you have suggestions about this problem with the other constraints as well (mistakes and multiple languages, not only Indo-European ones), they would be much appreciated.
You need lemmatization - it's similar to your 2nd item, but not exactly (difference).
Try the nltk lemmatizer for Python or Stanford NLP/Clear NLP for Java. Actually, nltk uses WordNet, so it is really a combination of the 1st and 2nd approaches.
In order to cope with mistakes use spelling correction before lemmatization. Take a look at related questions or Google for appropriate libs.
About part-of-speech tags: unfortunately, nltk doesn't consider the POS tag (or context in general), so you should provide it with the tag, which can be found by nltk POS tagging. Again, this is already discussed here (and in related/linked questions). I'm not sure about Stanford NLP here. I guess it should consider context, but I was sure that NLTK does so. As I can see from this code snippet, Stanford doesn't use POS tags, while Clear NLP does.
About other languages: google for lemmatization models, since the algorithm for most languages (at least from the same family) is almost the same; the differences are in the training data. Take a look here for an example for German; it is a wrapper for several lemmatizers, as far as I can see.
However, you can always use a stemmer at the cost of precision, and stemmers are more easily available for different languages.
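To show how the lookup and stemming approaches combine when building a vocabulary, here is a toy sketch. It is not a real lemmatizer: the `IRREGULAR` table and the suffix rules are hypothetical stand-ins for what a WordNet-backed lemmatizer or a proper stemmer would provide, and the naive rules will mangle many words.

```python
import re

# Hypothetical, tiny lookup table of irregular forms (the "database" approach).
IRREGULAR = {"thought": "think", "went": "go", "mice": "mouse",
             "was": "be", "were": "be"}

# Naive suffix rules (a crude stand-in for a stemmer, not a real one).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def base_form(word):
    """Return a rough base form: lookup first, then suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

def vocabulary(text):
    """Set of unique base forms found in the text."""
    return {base_form(token) for token in re.findall(r"[A-Za-z]+", text)}

print(sorted(vocabulary("He thinks; she thought. They think.")))
```

Swapping `base_form` for a library lemmatizer (plus a POS tagger, if noun/verb distinctions matter) keeps the rest of the pipeline unchanged.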

Finding how relevant a text is, given a whitelist and blacklist of words/phrases

This is a case of me wanting to search for something online but not knowing what it's called.
I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.
For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:
want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position
What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions are not relevant)? Some ideas I was considering:
Count the occurrences of the phrases in the text file that are also in my "rule book", and reject those that contain words that I do not want. This doesn't always work, though, because what if a description says "web designing not required"? Then my algorithm would say "That contains the word designing so it is not relevant" when it really was!
When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, designing and design should be treated the same way, as well as misspellings of words, such as programing.
I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?
You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem, and there are many tools for it. Basically, you give a set of good documents and bad documents to a learning or training process, which finds words that correlate strongly with positive and negative documents and outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms like Logistic Regression and Support Vector Machines, which will probably do somewhat better, but they are more complicated.
To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
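For a feel of what such a training process looks like, here is a minimal multinomial Naive Bayes sketch with add-one smoothing, written with the standard library only. The job descriptions and the "pass"/"fail" labels are invented for the example, tokenization is a plain whitespace split, and no stemming is applied.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes over whitespace tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labelled job descriptions.
docs = [
    "php web programming telecommuting contract",
    "part-time php developer remote telecommuting",
    "full-time web designing position",
    "graphic designing full-time office",
]
labels = ["pass", "pass", "fail", "fail"]

model = NaiveBayes().fit(docs, labels)
print(model.predict("php programming telecommuting"))
print(model.predict("full-time designing"))
```

Because the model learns from labelled examples rather than fixed rules, it does not hard-reject on a single word; whether that is enough to handle phrases like "web designing not required" depends on the training data.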

Document Analysis and Tagging

Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian classifier or even Latent Semantic Analysis, but I'm not really familiar with either beyond what I've found in a few Ruby gems.
Can something like this be solved by a Bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or should I just be looking at keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There are definitely a lot of books and articles you can read about it, but I will try to provide a short introduction. I am not a big expert, but I have worked on some of this stuff.
First you need to decide whether you want to classify essays into predefined topics/categories (a classification problem) or you want the algorithm to decide on different groups on its own (a clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
TF-IDF is short for Term Frequency - Inverse Document Frequency. Basically, the idea is, for any given document, to find the documents in the training set that are most similar to it, and then figure out its category based on that. For example, if document D is similar to T1, which is physics, T2, which is physics, and T3, which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is that you assign the most importance to rare words and no importance to common words. For instance, 'nuclei' is a rare physics word, but 'work' is a very common, uninteresting word. (That's why it's called inverse document frequency.) If you can work with Java, there is a very good Lucene library which provides most of this stuff out of the box. Look for the API for 'similar documents' and look into how it is implemented. Or just google for 'TF-IDF' if you want to implement your own.
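The scheme described above can be sketched roughly as follows: weight terms by TF-IDF, rank training documents by cosine similarity to the new document, and let the top few vote on the category. This is a minimal standard-library sketch, not how Lucene does it; the training sentences, the `classify` helper, and the choice of `k=3` are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector per document, plus the IDF table."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # A term that occurs in every document gets weight log(1) = 0.
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, train_vecs, train_labels, idf, k=3):
    """Vote among the k training documents most similar to the text."""
    counts = Counter(text.lower().split())
    query = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training essays, shortened to one line each.
train = [
    ("nuclei decay and particles work", "physics"),
    ("quantum particles and nuclei work", "physics"),
    ("acids and molecules work", "chemistry"),
    ("molecules bond and acids work", "chemistry"),
]
train_vecs, idf = tfidf_vectors([text.split() for text, _ in train])
train_labels = [label for _, label in train]

print(classify("nuclei particles work", train_vecs, train_labels, idf))
```

In this toy corpus 'work' appears in every document, so its weight is zero and it contributes nothing to the similarity, while 'nuclei' carries real weight.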
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf

Finding related words (specifically physical objects) to a specific word

I am trying to find words (specifically physical objects) related to a single word. For example:
Tennis: tennis racket, tennis ball, tennis shoe
Snooker: snooker cue, snooker ball, chalk
Chess: chessboard, chess piece
Bookcase: book
I have tried to use WordNet, specifically the meronym semantic relationship; however, this method is not consistent as the results below show:
Tennis: serve, volley, foot-fault, set point, return, advantage
Snooker: nothing
Chess: chess move, checkerboard (whose own meronym relationships show 'square' & 'diagonal')
Bookcase: shelve
Weighting of terms will eventually be required, but that is not really a concern now.
Anyone have any suggestions on how to do this?
Just an update: Ended up using a mixture of both Jeff's and StompChicken's answers.
The quality of information retrieved from Wikipedia is excellent, specifically how (unsurprisingly) there is so much relevant information (in comparison to some corpora where terms such as 'blog' and 'ipod' do not exist).
The range of results from Wikipedia is the best part. The software is able to match terms such as (lists cut for brevity):
golf: [ball, iron, tee, bag, club]
photography: [camera, film, photograph, art, image]
fishing: [fish, net, hook, trap, bait, lure, rod]
The biggest problem is classifying certain words as physical artefacts; default WordNet is not a reliable resource as many terms (such as 'ipod', and even 'trampolining') do not exist in it.
I think what you are asking for is a source of semantic relationships between concepts. For that, I can think of a number of ways to go:
Semantic similarity algorithms. These algorithms usually perform a tree walk over the relationships in WordNet to come up with a real-valued score of how related two terms are. These will be limited by how well WordNet models the concepts that you are interested in. WordNet::Similarity (written in Perl) is pretty good.
Try using OpenCyc as a knowledge base. OpenCyc is an open-source version of Cyc, a very large knowledge base of 'real-world' facts. It should have a much richer set of semantic relationships than WordNet does. However, I have never used OpenCyc, so I can't speak to how complete it is or how easy it is to use.
n-gram frequency analysis. As mentioned by Jeff Moser. A data-driven approach that can 'discover' relationships from large amounts of data, but can often produce noisy results.
Latent Semantic Analysis. A data-driven approach similar to n-gram frequency analysis that finds sets of semantically related words.
[...]
Judging by what you say you want to do, I think the last two options are more likely to be successful. If the relationships are not in WordNet then semantic similarity won't work, and OpenCyc doesn't seem to know much about snooker other than the fact that it exists.
I think a combination of both n-grams and LSA (or something like it) would be a good idea. N-gram frequencies will find concepts tightly bound to your target concept (e.g. tennis ball) and LSA would find related concepts mentioned in the same sentence/document (e.g. net, serve). Also, if you are only interested in nouns, filtering your output to contain only nouns or noun phrases (by using a part-of-speech tagger) might improve results.
In the first case, you probably are looking for n-grams where n = 2. You can get them from places like Google or create your own from all of Wikipedia.
For more information, check out this related Stack Overflow question.
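As a small illustration of the bigram (n = 2) idea, the sketch below counts which words directly follow a target word in a text. The corpus string is invented for the example; in practice you would feed in something large, such as a Wikipedia dump, and would likely filter the results with a part-of-speech tagger.

```python
import re
from collections import Counter

def related_by_bigram(text, target, top=5):
    """Most frequent words that immediately follow the target word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target
    )
    return followers.most_common(top)

# Tiny made-up corpus standing in for a large text collection.
corpus = (
    "She bought a tennis racket and a tennis ball. "
    "The tennis ball bounced over the net. "
    "He needs new tennis shoes before the match."
)

print(related_by_bigram(corpus, "tennis"))
```

With so little text the counts are tiny; the noise mentioned in the answers above only becomes visible (and manageable through frequency thresholds) on a large corpus.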

Resources