Suggestions for question answering system NLP [closed]

I am trying to build a question answering system where I have a set of predefined questions and their answers. For any given user question, I have to find whether a similar question already exists among the predefined ones and, if so, send back its answer. If it doesn't exist, the system should reply with a generic response. Any ideas on how to implement this using NLP would be really helpful.
Thanks in advance!!

As you have already mentioned in the question, this calls for a solution that computes text similarity, in this case question-question similarity. You have a bunch of questions, and for an incoming query/question a similarity score has to be computed against every available question. From a previous answer of mine, to do simple sentence similarity:
1. Convert the sentences into a suitable representation.
2. Compute some distance metric between the two representations and figure out the closest match.
To achieve 1, you can consider converting every word in a sentence to a corresponding vector. Libraries/algorithms like fastText provide such vector mappings. A vector representation of the entire sentence is obtained by averaging over all of its word vectors. Use cosine similarity to compute a score between the query and each question in the available list.
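A minimal sketch of that pipeline, assuming gensim is installed and using its downloadable pretrained fastText vectors; the model name, threshold, and FAQ contents here are illustrative, not prescribed by the answer:

```python
import numpy as np
import gensim.downloader as api

# Pretrained fastText vectors via gensim-data (large download on first use).
wv = api.load("fasttext-wiki-news-subwords-300")

def sentence_vector(sentence):
    """Average the word vectors of all in-vocabulary tokens."""
    tokens = [t for t in sentence.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size, dtype=np.float32)
    return np.mean([wv[t] for t in tokens], axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical predefined question/answer pairs.
faq = {
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "how do i delete my account": "Go to Settings -> Account -> Delete.",
}

def answer(query, threshold=0.8):
    qv = sentence_vector(query)
    best_q = max(faq, key=lambda q: cosine(qv, sentence_vector(q)))
    if cosine(qv, sentence_vector(best_q)) >= threshold:
        return faq[best_q]
    return "Sorry, I don't have an answer for that."  # generic fallback

print(answer("How can I reset my password?"))
```

The threshold is something you would tune on held-out paraphrases: too low and unrelated questions get matched, too high and everything falls through to the generic response.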

Related

How to remove unnecessary words from string for better search [closed]

I have different strings for searching related data, but due to unnecessary words the retrieved results are not good. For example, in "Working of genetic algorithm" the words "working of" are not important. I can remove "of" by treating it as a stop word, but what about "working"? I can do stemming, but that will just remove "ing", which doesn't solve the problem. Similarly, in another string "Determination of.....", I consider the other words in the string important and "Determination of" unimportant, so I want to remove them before proceeding further. Any ideas or hints on how I can remove these words? There are a lot of words of this type and I cannot hardcode them.
Well, instead of removing such terms, I would suggest focusing on n-grams. Using n-grams you can build different combinations of search strings, which can help you find the related information efficiently. How many tokens per combination you use, i.e. bigrams or trigrams, is up to you. To do this, you can use the Python NLTK library.
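A short sketch of what that looks like with NLTK, using the sample query from the question; the `nltk.download` call is only needed once per environment:

```python
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data, needed once

query = "Working of genetic algorithm"
tokens = word_tokenize(query.lower())

# Build candidate search strings from bigrams and trigrams.
bigrams = [" ".join(g) for g in ngrams(tokens, 2)]
trigrams = [" ".join(g) for g in ngrams(tokens, 3)]

print(bigrams)   # ['working of', 'of genetic', 'genetic algorithm']
print(trigrams)  # ['working of genetic', 'of genetic algorithm']
```

Searching with each candidate and ranking the results lets the informative n-grams ("genetic algorithm") dominate without ever having to hardcode which words are filler.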

Difference between Text Embedding and Word Embedding [closed]

I am working on a dataset of Amazon Alexa reviews and wish to cluster them into positive and negative clusters. I am using Word2Vec for vectorization, so I wanted to know the difference between text embeddings and word embeddings. Also, which of them will be useful for my clustering of reviews? (Please consider that I want to predict the cluster of any review that I enter.)
Thanks in advance!
Text embeddings are typically a way to aggregate a number of word embeddings for a sentence or a paragraph of text. There are various ways this can be done. The easiest way is to average the word embeddings, though that does not necessarily yield the best results (see the sketch after the links below).
Application-wise:
Doc2vec from gensim
par2vec vs. doc2vec
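To make the averaging approach concrete, here is a minimal sketch using gensim's Word2Vec and scikit-learn's KMeans, assuming both are installed; the toy reviews and all parameter values are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy tokenized reviews standing in for the Alexa dataset.
reviews = [
    ["love", "this", "device", "great", "sound"],
    ["terrible", "speaker", "poor", "sound", "quality"],
    ["great", "value", "love", "it"],
    ["poor", "quality", "stopped", "working"],
]

model = Word2Vec(sentences=reviews, vector_size=50, min_count=1, seed=0)

def text_embedding(tokens):
    """Average the word embeddings to get one text embedding."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.array([text_embedding(r) for r in reviews])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Predict the cluster of a new, unseen review.
new_review = ["really", "poor", "sound"]
print(kmeans.predict([text_embedding(new_review)]))
```

Note that KMeans produces unlabeled clusters; you still have to inspect which cluster corresponds to positive and which to negative, and averaged embeddings may not separate sentiment cleanly on real data.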

What is the difference between the different GloVe models? [closed]

https://nlp.stanford.edu/projects/glove/
I'm trying to use GloVe for summarizing music reviews, but I'm wondering which version is the best for my project. Will "glove.840B.300d.zip" give me a more accurate text summarization since it used way more tokens? Or perhaps the Wikipedia 2014 + Gigaword 5 is more representative than Common Crawl? Thanks!
Unfortunately I don't think anyone can give you a better answer for this than:
"try several options, and see which one works the best"
I've seen work using the Wikipedia 2014 + Gigaword 100d vectors that produced SOTA results for reading comprehension. Without experimentation, it's difficult to say conclusively which corpus is closer to your music review set, or what the impact of larger-dimensional word embeddings will be.
This is just random advice, but I guess I would suggest trying in this order:
100d from Wikipedia+Gigaword
300d from Wikipedia+Gigaword
300d from Common Crawl
You might as well start with the smaller dimensional embeddings while prototyping, and then you could experiment with larger embeddings to see if you get a performance enhancement.
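Since the advice is to swap embeddings and compare, a small loader helps. A minimal sketch, assuming the GloVe zip files have been downloaded and extracted (file names follow the official releases):

```python
import numpy as np

def load_glove(path, dim):
    """Parse a GloVe .txt file into a {word: vector} dict.

    Splitting from the right handles the few tokens in the
    840B Common Crawl file that themselves contain spaces.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().rsplit(" ", dim)
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Swap files to compare corpora and dimensions while prototyping:
# vectors = load_glove("glove.6B.100d.txt", 100)    # Wikipedia 2014 + Gigaword 5
# vectors = load_glove("glove.6B.300d.txt", 300)    # Wikipedia 2014 + Gigaword 5
# vectors = load_glove("glove.840B.300d.txt", 300)  # Common Crawl
```

Keeping the loader signature fixed means the rest of your pipeline stays unchanged while you run the comparison suggested above.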
And in the spirit of promoting other groups' work, I would definitely say you should look at these ELMo vectors from AllenNLP:
http://allennlp.org/elmo
They look very promising!

NLP - why is "not" a stop word? [closed]

I am trying to remove stop words before performing topic modeling. I noticed that some negation words (not, nor, never, none, etc.) are usually considered stop words. For example, NLTK, spaCy, and sklearn include "not" on their stop word lists. However, if we remove "not" from the sentences below, they lose their significant meaning, and that would not be accurate for topic modeling or sentiment analysis.
1). StackOverflow is helpful => StackOverflow helpful
2). StackOverflow is not helpful => StackOverflow helpful
Can anyone please explain why these negation words are typically considered to be stop words?
Per IMSoP's advice above, please reference https://datascience.stackexchange.com/questions/15765/nlp-why-is-not-a-stop-word for the answer.
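If you do want stop word removal without losing the negations, one common workaround is to subtract them from the list before filtering. A sketch assuming NLTK's English stop word list; the negation set is the one from the question:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # stop word data, needed once

# Keep the negations so the two example sentences stay distinguishable.
negations = {"not", "no", "nor", "never", "none"}
stop_words = set(stopwords.words("english")) - negations

tokens = "stackoverflow is not helpful".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['stackoverflow', 'not', 'helpful']
```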

Is this the correct definition of a "corpus"? [closed]

I have a huge string of raw text that is about 200,000 words long. It's a book.
I want to use these words to analyze the word relationships, so that I can apply those relationships to other applications.
Is this called a "corpus"?
A corpus, in linguistics, is any coherent body of real-life(*) text or speech being studied. So yes, a book is a corpus. The fact that it's in one string doesn't matter, as long as you don't randomly shuffle the characters.
(*) As opposed to a bunch of made up phrases being shown to test subjects to measure their responses, as is commonly done in psycholinguistics.
Yes.
http://en.wikipedia.org/wiki/Text_corpus
Specifically, because it's used for statistics.
Usually "corpus" is used to refer to a structured collection, but linguists would know what you're talking about.
