How to find the most meaningful words in a text using word2vec? - nlp

So, for instance, I type some sentence with some semantic meaning as input, and as output I get a list of the closest words (in cosine distance), mostly single words.
But I want to understand which cluster my sentence belongs to, compute how far each word is located from it, and eliminate the non-meaningful words from the sentence.
For example:
"I want to buy a pizza";
"pizza": 0.99123
"buy": 0.7834
"want": 0.1443
How can such a requirement be achieved out of the box, without any C coding?
Maybe I need to compute a cosine-distance equation for this?
Thank you!

It seems like you need topic modeling instead of word2vec. Word2vec is used to capture local information; it is not a good idea to use it directly to classify or cluster words or sentences.
Another aspect to consider is stop-word removal, since you mention non-meaningful words. By the way, they are not non-meaningful; they are simply not aligned with any topic, which is why you perceive them as non-meaningful.
I believe you should use an LDA topic-modeling approach, and you don't need to implement anything yourself since there are many implementations of LDA out there.
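For reference, a minimal sketch of the LDA route with gensim (not part of the original answer; the toy corpus, topic count, and example sentence below are made up purely for illustration):

```python
# A minimal LDA sketch with gensim; the toy corpus and num_topics are assumptions.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["want", "buy", "pizza"],
    ["order", "pizza", "online"],
    ["buy", "new", "car"],
    ["car", "needs", "oil", "change"],
]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train a small LDA model; num_topics is a guess you would tune on real data.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=50, random_state=0)

# For a new sentence, see which topic it belongs to ...
new_doc = dictionary.doc2bow("i want to buy a pizza".lower().split())
doc_topics = lda.get_document_topics(new_doc)
print(doc_topics)

# ... and how strongly each word is tied to that topic, which is one way to
# rank "meaningful" vs. "non-meaningful" words for that sentence.
topic_id = max(doc_topics, key=lambda t: t[1])[0]
for word_id, _ in new_doc:
    probs = dict(lda.get_term_topics(word_id, minimum_probability=0.0))
    print(dictionary[word_id], probs.get(topic_id, 0.0))
```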

Related

Cluster similar words using word2vec

I have various restaurant labels with me, and I have some words that are unrelated to restaurants as well, like the ones below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick the labels related to food choices and leave out words like oil and lube or transportation.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way to handle them.
The brute-force approach is to tag them manually. But I want to know whether there is a way, using NLP or Word2Vec, to cluster all the related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like, say, the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often reflects other forms of close relation, including oppositeness and other ways words can be interchangeable or used in similar contexts. So whether the word-vector similarity values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out and tinker with. (For example, whether words that are drop-in replacements for each other end up closest, or words that are common in the same topics end up closest, can be influenced by whether the window parameter is smaller or larger. So you could find that tuning Word2Vec training parameters improves the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
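As one possible starting point, here is a rough sketch of the similarity-threshold idea, assuming you already have trained or downloaded word vectors; the vector file name, seed words, and cutoff below are placeholders, not recommendations:

```python
# Score each label by its average similarity to a few "food" seed words.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("your_vectors.bin", binary=True)  # hypothetical path

food_seeds = ["pizza", "burger", "coffee", "vegan", "vegetarian"]
labels = ["pizza", "burger", "transportation", "coffee", "bookstores", "oil_and_lube"]

def food_score(label):
    """Average similarity of a label to the food seeds; None if out of vocabulary."""
    if label not in wv:
        return None
    sims = [wv.similarity(label, seed) for seed in food_seeds if seed in wv]
    return sum(sims) / len(sims) if sims else None

for label in labels:
    print(label, food_score(label))

# You would then pick a cutoff by inspecting the scores, e.g. keep labels scoring above ~0.4.
```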
OK, thank you for the details.
In order to train word2vec, you should take into account the following facts:
1. You need a huge and varied text dataset. Review your training set and make sure it contains the useful data you need in order to obtain what you want.
2. Set one sentence/phrase per line.
3. For preprocessing, delete punctuation and lowercase all strings.
4. Do NOT lemmatize or stem, because the text will become less complex (you lose information)!
5. Try different settings:
5.1 Algorithm: I used word2vec and I can say that continuous bag-of-words (CBOW) provided better results, on different training sets, than skip-gram.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: a vector length of 300 is OK.
6. Now run the training algorithm. Then use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. vectors) with cosine similarity. From my experience, cosine provides a satisfactory result: the distance between two words is given by a value between 0 and 1. Synonyms have high cosine values; you must find the cutoff between words which are synonyms and others that are not.
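A minimal gensim sketch of this recipe (one sentence per line, lowercasing, punctuation stripped, CBOW, vector length 300); the corpus file name and the window/min_count values are assumptions you would adjust:

```python
# Train word2vec on a one-sentence-per-line plain-text file ("corpus.txt" is a placeholder).
import string
from gensim.models import Word2Vec

def sentences(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Lowercase and strip punctuation, as recommended above; no lemmatizing/stemming.
            line = line.lower().translate(str.maketrans("", "", string.punctuation))
            tokens = line.split()
            if tokens:
                yield tokens

model = Word2Vec(
    sentences=list(sentences("corpus.txt")),
    vector_size=300,  # vector length 300, as suggested above
    sg=0,             # sg=0 selects CBOW; sg=1 would select skip-gram
    window=5,         # assumed default; tune for your needs
    min_count=5,
    workers=4,
)

# Cosine similarity between two words falls out of the trained vectors:
print(model.wv.similarity("pizza", "burger"))  # assumes both words occur in your corpus
```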

Unsupervised sentiment Analysis using doc2vec

Folks,
I have searched Google for different types of papers/blogs/tutorials, etc., but haven't found anything helpful. I would appreciate it if anyone can help me. Please note that I am not asking for step-by-step code, but rather an idea/blog/paper or some tutorial.
Here's my problem statement:
Just like sentiment analysis is used to identify the positive or negative tone of a sentence, I want to find out whether a sentence is a forward-looking (future outlook) statement or not.
I do not want to use a bag-of-words approach that sums up the number of forward-looking words/phrases such as "going forward", "in the near future" or "in 5 years from now", etc. I am not sure whether word2vec or doc2vec can be used. Please enlighten me.
Thanks.
It seems what you are interested in doing is finding temporal statements in texts.
Not sure of your final output, but let's assume you want to find temporal phrases or sentences which contain them.
One methodology could be the following:
Create list of temporal terms [days, years, months, now, later]
Pick only sentences with key terms
Use sentences in doc2vec model
Infer vector and use distance metric for new sentence
GMM Cluster + Limit
Distance from average
Another methodology could be:
Create list of temporal terms [days, years, months, now, later]
Do Bigram and Trigram collocation extraction
Keep relevant collocations with temporal terms
Use relevant collocations in a kind of bag-of-collocations approach
Matched binary feature vectors for relevant collocations
Train classifier to recognise higher level text
This sounds like a good case for a bootstrapping approach if you have large amounts of text.
Both are really semi-supervised, since there is some need to find initial temporal terms, but even that could be automated using a word2vec scheme and bootstrapping.
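A compact sketch of the first methodology (temporal-term filtering, doc2vec, distance from the average vector); the corpus, term list, and model sizes below are illustrative assumptions only:

```python
# Filter sentences by temporal terms, train doc2vec on them, then measure how
# close a new sentence's inferred vector is to the average of those sentences.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

temporal_terms = {"days", "years", "months", "now", "later", "future", "forward"}

corpus = [
    "we expect revenue to grow over the next five years",
    "the company was founded in a small garage",
    "going forward we plan to expand into new markets",
    "our headquarters are located downtown",
]

def has_temporal(sentence):
    return any(term in sentence.split() for term in temporal_terms)

temporal_sentences = [s for s in corpus if has_temporal(s)]

tagged = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(temporal_sentences)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

# "Distance from average": cosine between a new sentence and the centroid of temporal sentences.
centroid = np.mean([model.dv[i] for i in range(len(temporal_sentences))], axis=0)
new_vec = model.infer_vector("in five years from now we will double output".split())
cosine = np.dot(new_vec, centroid) / (np.linalg.norm(new_vec) * np.linalg.norm(centroid))
print(cosine)
```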

Why are TF-IDF vocabulary words represented as axes/dimensions?

I want an intuitive way of understanding why each word in a TF-IDF vocabulary is represented as a separate dimension.
Why can't I just add the TF-IDF values of all the words together and use that as a representation of the document?
I have a basic understanding of why we do this. Apples =/= Oranges.
But apparently I don't know it well enough to convince someone else!
Ultimately all of NLP is arbitrary. If you wanted to add up the tf-idf values for all words in a phrase/sentence/document and found the resulting number useful for some task you were trying to do, you are free to do so. But that number probably won't be very useful for most standard NLP tasks such as search, summarization, sentiment analysis, etc. It's hard to represent the meaning of a phrase/sentence/document with a single number.
By representing a phrase/sentence/document as a vector which has a separate row for each word in your vocabulary, you can leverage vector/matrix algebra to represent some standard operations you might want to do when solving NLP problems. For example, you could compute the cosine similarity between the vectors representing 2 documents and use that to judge how similar those 2 documents are.
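For instance, a short scikit-learn sketch of that cosine-similarity computation, with two made-up documents:

```python
# Represent each document as a TF-IDF vector, then compare them with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "apples and oranges are fruit",
    "oranges are a citrus fruit",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # each row is a document, each column a vocabulary word

# This per-word-dimension structure is what a single summed-up number per document cannot give you.
print(cosine_similarity(tfidf[0], tfidf[1]))
```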
Something else you might be interested in: There is an NLP concept called word2vec which lets you represent every word as a different vector of numbers and then lets you add/subtract them to discover semantic relations between them.
For example, it might say
king - man + woman ≈ queen
You can read more about this at https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
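If you want to try the analogy yourself, here is a small sketch using gensim's downloader; the pretrained vector set named below is one of several options and takes a while to fetch on the first run:

```python
# Vector arithmetic on pretrained word vectors: king - man + woman ≈ queen.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # a smallish pretrained vector set
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Typically 'queen' appears at or near the top of the results.
```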

What is the best way to split a sentence for a keyword extraction task?

I'm doing keyword extraction using TF-IDF on a large number of documents. Currently I'm splitting each sentence based on n-grams, more specifically tri-grams. However, this is not the best way to split each sentence into its constituent keywords. For example, a noun phrase like 'triple heart bypass' may not always get detected as one term.
The other alternative for chunking each sentence into its constituent elements looks to be part-of-speech tagging and chunking in OpenNLP. In this approach a phrase like 'triple heart bypass' always gets extracted as a whole, but the downside is that in TF-IDF the frequency of the extracted terms (phrases) drops dramatically.
Does anyone have any suggestions on either of these two approaches, or any other ideas to improve the quality of the keywords?
What is:
the goal of your application? -- This impacts the tokenization rules and defines the quality of your keywords.
the type of documents? -- Chunking is not the same if you have forum data or news-article data.
You can implement a boundary recognizer yourself, or use a statistical model as in OpenNLP.
The typical pipeline is to first tokenize as simply as possible, apply stop-word removal (language-dependent), and then, if needed, POS-tagging-based filtering (but this is a costly operation).
Other options: java.text.BreakIterator, com.ibm.icu.text.BreakIterator, com.ibm.icu.text.RuleBasedBreakIterator...
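A minimal Python sketch of that pipeline (simple tokenization, stop-word removal, optional POS-based filtering), using NLTK as one possible toolkit rather than the Java classes above:

```python
# Simple keyword-candidate pipeline: tokenize, drop stop words, keep nouns/adjectives.
import nltk
from nltk.corpus import stopwords

# One-time downloads, if you haven't already:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("averaged_perceptron_tagger")

sentence = "The patient underwent a triple heart bypass last year."

tokens = nltk.word_tokenize(sentence.lower())            # 1. tokenize as simply as possible
stops = set(stopwords.words("english"))                  # 2. language-dependent stop words
content_tokens = [t for t in tokens if t.isalpha() and t not in stops]

tagged = nltk.pos_tag(content_tokens)                    # 3. POS filtering (costly, only if needed)
keywords = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]
print(keywords)  # nouns/adjectives as keyword candidates
```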

How to find similarity of sentence?

How to find the semantic similarity between any two given sentences?
Eg:
what movies did ron howard direct?
movies directed by ron howard.
I know it's a hard problem. But I would like to ask the views of experts.
I don't know how to use the Parts of Speech to achieve this.
http://nlp.stanford.edu:8080/parser/index.jsp
It's a broad problem. I would personally go for cosine similarity.
You need to convert your sentences into vectors. For converting a sentence into a vector you can consider several signals, like the number of occurrences, word order, synonyms, etc. Then take the cosine distance as mentioned here.
You can also explore Elasticsearch for finding associated words. You can create custom analyzers, stemmers, tokenizers, filters (like synonyms), etc., which can be very helpful in finding similar sentences. Elasticsearch also provides a more_like_this query which finds similar documents using tf-idf scores.
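As a rough sketch of the sentence-to-vector idea above: here the sentence vector is simply an average of pretrained word vectors (loaded via gensim's downloader), which is an assumption for illustration, not the only way to build sentence vectors:

```python
# Average pretrained word vectors into a sentence vector, then compare with cosine.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

def sentence_vector(sentence):
    words = [w for w in sentence.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0)

a = sentence_vector("what movies did ron howard direct")
b = sentence_vector("movies directed by ron howard")

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # closer to 1.0 means more similar
```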

Resources