I am working on a dataset of Amazon Alexa reviews and wish to cluster them into positive and negative clusters. I am using Word2Vec for vectorization, so I wanted to know the difference between a text embedding and a word embedding. Also, which one of them will be useful for clustering the reviews? (Please consider that I want to predict the cluster of any review that I enter.)
Thanks in advance!
Text Embeddings are typically a way to aggregate a number of Word Embeddings for a sentence or a paragraph of text. There are various ways this can be done. The easiest way is to average the word embeddings, though averaging does not necessarily yield the best results.
Application-wise:
Doc2vec from gensim
par2vec vs. doc2vec
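As an illustration of the averaging approach for the review-clustering use case, here is a minimal sketch; the toy reviews, the parameter values, and the hope that two KMeans clusters will line up with positive/negative sentiment are my own illustrative assumptions, not guaranteed behaviour:

```python
# Sketch: average Word2Vec vectors per review, cluster with KMeans,
# then assign new reviews to a cluster. Toy data and parameters only.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

reviews = [
    "i love my echo it works great",
    "alexa is useless and keeps misunderstanding me",
    "great sound quality very happy",
    "terrible device stopped working after a week",
]
tokenized = [r.split() for r in reviews]

# Train Word2Vec on the tokenized reviews (tiny settings for the sketch).
w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, epochs=50)

def review_vector(tokens, model):
    """Average the word vectors of the tokens present in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([review_vector(t, w2v) for t in tokenized])

# Two clusters; which one is "positive" has to be inspected manually.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_review = "alexa is great and very helpful".split()
print(km.predict(review_vector(new_review, w2v).reshape(1, -1)))
```

For potentially better review vectors than plain averaging, gensim's Doc2Vec can be trained on the same tokenized reviews and used in place of the averaging step.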
I am trying to build a question answering system where I have a set of predefined questions and their answers. For any question from the user, I have to check whether a similar question already exists among the predefined questions and send its answer. If it doesn't exist, the system has to reply with a generic response. Any ideas on how to implement this using NLP would be really helpful.
Thanks in advance!!
As you have already mentioned in the question, this calls for a solution that computes text similarity. In this case question-question similarity. You have got a bunch of questions and for an incoming query/question, a similarity score has to be computed with every available question in hand. From a previous answer of mine, to do simple sentence similarity,
1. Convert the sentences into a suitable representation
2. Compute some distance metric between the two representations and find the closest match
To achieve 1, you can consider converting every word in a sentence to corresponding vectors. There are libraries/algorithms like fasttext that provide vector mapping. A vector representation of the entire sentence is obtained by taking an average over all word vectors. Use cosine similarity to compute a score between the query and each question in the available list.
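A rough sketch of those two steps follows; the pretrained vector set loaded through gensim's downloader, the example predefined questions, and the 0.7 threshold are all assumptions on my part:

```python
# Sketch of averaged word vectors + cosine similarity for question matching.
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

kv = api.load("fasttext-wiki-news-subwords-300")  # downloads on first use

predefined = {
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "what are your opening hours": "We are open 9am-5pm, Monday to Friday.",
}

def sentence_vector(text):
    """Average the word vectors of the in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in kv]
    return np.mean([kv[t] for t in tokens], axis=0) if tokens else np.zeros(kv.vector_size)

def answer(query, threshold=0.7):
    """Return the answer of the closest predefined question, or a fallback."""
    qv = sentence_vector(query).reshape(1, -1)
    questions = list(predefined)
    scores = cosine_similarity(qv, np.vstack([sentence_vector(q) for q in questions]))[0]
    best = int(np.argmax(scores))
    return predefined[questions[best]] if scores[best] >= threshold else "Sorry, I don't know that one."

print(answer("how can I reset the password"))
```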
I am curious how many ways there are to normalize data in the data preprocessing step, before using it to train a machine learning model, deep learning model, and so on.
All I know is
Z-score normalization = (data - mean) / standard deviation
Min-Max normalization = (data - min)/(max - min)
Do we have other ways except these two that I know?
There are many ways to normalize the data prior to training a model; some depend on the task, the data type (tabular, images, signals), and the data distribution. You can find the most important ones in scikit-learn's preprocessing subpackage.
To highlight a few that I have been using consistently: the Box-Cox or Yeo-Johnson transformations are used when your feature's distribution is skewed; they minimize the skewness by fitting the transformation via maximum likelihood.
Another normalization technique is the RobustScaler, which can perform better than Z-score normalization if your dataset contains many outliers, as outliers can falsely influence the sample mean and variance.
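A short sketch showing these options as they exist in scikit-learn's preprocessing module (the random skewed data is purely illustrative):

```python
# Sketch of the scalers discussed above, on synthetic skewed data.
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer,
)

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 3))  # skewed, strictly positive

X_z = StandardScaler().fit_transform(X)        # Z-score: (x - mean) / std
X_minmax = MinMaxScaler().fit_transform(X)     # scales each feature to [0, 1]
X_robust = RobustScaler().fit_transform(X)     # median and IQR, outlier-resistant
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)  # reduces skewness
X_bc = PowerTransformer(method="box-cox").fit_transform(X)      # needs positive inputs
```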
https://nlp.stanford.edu/projects/glove/
I'm trying to use GloVe for summarizing music reviews, but I'm wondering which version is the best for my project. Will "glove.840B.300d.zip" give me a more accurate text summarization since it used way more tokens? Or perhaps the Wikipedia 2014 + Gigaword 5 is more representative than Common Crawl? Thanks!
Unfortunately I don't think anyone can give you a better answer for this than:
"try several options, and see which one works the best"
I've seen work using the Wikipedia 2014 + Gigaword 100d vectors that produced state-of-the-art results for reading comprehension. Without experimentation, it's difficult to say conclusively which corpus is closer to your music-review set, or what the impact of larger-dimensional word embeddings will be.
This is just random advice, but I guess I would suggest trying in this order:
100d from Wikipedia+Gigaword
300d from Wikipedia+Gigaword
300d from Common Crawl
You might as well start with the smaller dimensional embeddings while prototyping, and then you could experiment with larger embeddings to see if you get a performance enhancement.
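For what it's worth, a hedged sketch of how you might keep the embedding file swappable while prototyping, assuming gensim >= 4.0 and the standard file names from the GloVe download page:

```python
# Sketch: load whichever GloVe text file you downloaded, kept in one variable
# so you can swap 100d/300d or Common Crawl vectors without touching the rest.
from gensim.models import KeyedVectors

# e.g. "glove.6B.100d.txt", "glove.6B.300d.txt", "glove.840B.300d.txt"
GLOVE_PATH = "glove.6B.100d.txt"

# gensim >= 4.0 can read the GloVe text format directly with no_header=True.
vectors = KeyedVectors.load_word2vec_format(GLOVE_PATH, binary=False, no_header=True)
print(vectors.most_similar("guitar", topn=5))
```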
And in the spirit of promoting other groups' work, I would definitely say you should look at these ELMo vectors from AllenNLP:
http://allennlp.org/elmo
They look very promising!
I am trying to remove stop words before performing topic modeling. I noticed that some negation words (not, nor, never, none, etc.) are usually considered to be stop words. For example, NLTK, spaCy and sklearn include "not" on their stop word lists. However, if we remove "not" from the sentences below, they lose their essential meaning, and that would not be accurate for topic modeling or sentiment analysis.
1). StackOverflow is helpful => StackOverflow helpful
2). StackOverflow is not helpful => StackOverflow helpful
Can anyone please explain why these negation words are typically considered to be stop words?
Per IMSoP's advice above, please reference https://datascience.stackexchange.com/questions/15765/nlp-why-is-not-a-stop-word for the answer.
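If you do decide to keep negations for sentiment-sensitive work, one common workaround (a sketch of my own, not taken from the linked answer) is to subtract them from the stop-word list before filtering:

```python
# Sketch: filter stop words but keep negation words, using NLTK.
# The exact set of negations to preserve is a choice you make for your task.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

negations = {"not", "no", "nor", "never", "none"}
stops = set(stopwords.words("english")) - negations

sentence = "StackOverflow is not helpful"
kept = [w for w in sentence.lower().split() if w not in stops]
print(kept)  # ['stackoverflow', 'not', 'helpful']
```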
I have a huge string of raw text that is about 200,000 words long. It's a book.
I want to use these words to analyze the word relationships, so that I can apply those relationships to other applications.
Is this called a "corpus"?
A corpus, in linguistics, is any coherent body of real-life(*) text or speech being studied. So yes, a book is a corpus. The fact that it's in one string doesn't matter, as long as you don't randomly shuffle the characters.
(*) As opposed to a bunch of made up phrases being shown to test subjects to measure their responses, as is commonly done in psycholinguistics.
Yes.
http://en.wikipedia.org/wiki/Text_corpus
Specifically, because it's used for statistics.
Usually "corpus" is used to refer to a structured collection, but linguists would know what you're talking about.