NLP - why is "not" a stop word? [closed] - nlp

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 6 years ago.
I am trying to remove stop words before performing topic modeling. I noticed that some negation words (not, nor, never, none, etc.) are usually considered stop words. For example, NLTK, spaCy and sklearn include "not" in their stop word lists. However, if we remove "not" from the sentences below, they lose significant meaning, which would make topic modeling or sentiment analysis inaccurate.
1). StackOverflow is helpful => StackOverflow helpful
2). StackOverflow is not helpful => StackOverflow helpful
Can anyone please explain why these negation words are typically considered to be stop words?
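To make the problem concrete, here is a minimal sketch of the effect described above. The stop list is a small hand-written set assumed for illustration, not the actual list shipped by NLTK, spaCy or sklearn, and `remove_stop_words` and `NEGATIONS` are hypothetical names. It also shows one possible workaround: subtracting the negation words from the stop list before filtering.

```python
# Hypothetical, hand-written stop list for illustration only;
# real libraries ship much longer lists.
STOP_WORDS = {"is", "are", "the", "a", "an", "not", "nor", "never", "none"}
NEGATIONS = {"not", "nor", "never", "none", "no"}

def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token whose lowercase form is a stop word."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]

sentences = ["StackOverflow is helpful", "StackOverflow is not helpful"]

# With the full list, both sentences collapse to the same tokens.
collapsed = [remove_stop_words(s, STOP_WORDS) for s in sentences]

# Keeping negations: subtract them from the stop list before filtering.
keep_negation = STOP_WORDS - NEGATIONS
preserved = [remove_stop_words(s, keep_negation) for s in sentences]

print(collapsed)
print(preserved)
```

With the full list the two sentences become indistinguishable, while the reduced list keeps "not" in the second one.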

Per IMSoP's advice above, please reference https://datascience.stackexchange.com/questions/15765/nlp-why-is-not-a-stop-word for the answer.

Related

Difference between Text Embedding and Word Embedding [closed]

Closed 2 years ago.
I am working on a dataset of Amazon Alexa reviews and want to cluster them into positive and negative clusters. I am using Word2Vec for vectorization, so I wanted to know the difference between Text Embedding and Word Embedding. Also, which of them would be more useful for clustering my reviews? (Please consider that I want to predict the cluster of any review that I enter.)
Thanks in advance!
Text Embeddings are typically a way to aggregate a number of Word Embeddings for a sentence or a paragraph of text. There are various ways this can be done. The easiest is to average the word embeddings, though this does not necessarily yield the best results.
Application-wise:
Doc2vec from gensim
par2vec vs. doc2vec
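The averaging approach mentioned above can be sketched in a few lines. This is only an illustration: the 3-dimensional vectors in `WORD_VECTORS` are invented toy values (a trained Word2Vec model would supply real ones), and `text_embedding` is a hypothetical helper name.

```python
# Toy 3-dimensional word vectors, invented for illustration.
# In practice these would come from a trained model such as Word2Vec.
WORD_VECTORS = {
    "love":   [0.9, 0.1, 0.0],
    "great":  [0.8, 0.2, 0.1],
    "hate":   [-0.9, 0.1, 0.0],
    "broken": [-0.7, 0.3, 0.2],
    "alexa":  [0.0, 0.9, 0.4],
}

def text_embedding(text, word_vectors, dim=3):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    # zip(*vectors) groups the values per dimension so each can be averaged.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

positive = text_embedding("I love Alexa", WORD_VECTORS)
negative = text_embedding("I hate Alexa", WORD_VECTORS)
print(positive)
print(negative)
```

The resulting fixed-length vectors can then be passed to any clustering algorithm. Note that plain averaging ignores word order and gives every word equal weight, which is one reason it may not give the best results.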

Suggestions for question answering system NLP [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
I am trying to build a question answering system where I have a set of predefined questions and their answers. For any given question from the user, I have to find out whether a similar question already exists among the predefined questions and, if so, send its answer. If it doesn't exist, the system has to reply with a generic response. Any ideas on how to implement this using NLP would be really helpful.
Thanks in advance!!
As you have already mentioned in the question, this calls for a solution that computes text similarity, in this case question-question similarity. You have a set of questions, and for an incoming query/question a similarity score has to be computed against every available question. From a previous answer of mine, to do simple sentence similarity:
1. Convert the sentences into a suitable representation
2. Compute some distance metric between the two representations and figure out the closest match
To achieve 1, you can convert every word in a sentence to its corresponding vector. There are libraries/algorithms like fasttext that provide this word-to-vector mapping. A vector representation of the entire sentence is obtained by taking an average over all word vectors. Use cosine similarity to compute a score between the query and each question in the available list.
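The matching step, including the generic fallback the question asks for, might look roughly like the sketch below. To keep it self-contained, the encoder `embed` is a plain bag-of-words count vector rather than averaged fasttext vectors; it is a stand-in that would be swapped for a real sentence representation. The names `answer` and `FAQ`, the threshold of 0.5 and the sample questions are all assumptions for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in encoder: bag-of-words counts (replace with averaged word vectors)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def answer(query, faq, threshold=0.5, fallback="Sorry, I don't know that one."):
    """Return the answer of the most similar known question, or the fallback."""
    q = embed(query)
    best_question = max(faq, key=lambda known: cosine(q, embed(known)))
    if cosine(q, embed(best_question)) < threshold:
        return fallback  # nothing similar enough: generic response
    return faq[best_question]

# Hypothetical predefined questions and answers.
FAQ = {
    "How do I reset my password?": "Use the 'Forgot password' link.",
    "What are your opening hours?": "We are open 9am to 5pm.",
}
print(answer("how can I reset my password", FAQ))
print(answer("do you sell bicycles", FAQ))
```

The threshold decides when a query counts as "already existing"; it would normally be tuned on a handful of real queries rather than fixed up front.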

Finding presence of palindromic sequence in a string [closed]

Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
How do I find out whether a string contains a contiguous palindromic sequence? I could try the naive solution in O(n^2) time, where n is the string size, but are there any more efficient algorithms for it?
Well, looking for just any palindrome isn't particularly interesting, since every one-character string is a palindrome. If you are looking for the longest palindrome, you may be interested in Manacher's Algorithm.
A good description of the algorithm can be found here.
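For reference, a compact sketch of Manacher's Algorithm follows. It is one common formulation (interleaving a separator character so that odd and even lengths are handled the same way), written under the assumption that `#` does not need special treatment in the input; the function name is hypothetical.

```python
def longest_palindromic_substring(s):
    """Return a longest palindromic substring of s in O(n) time (Manacher)."""
    if not s:
        return ""
    # Interleave separators so even- and odd-length palindromes look alike.
    t = "#" + "#".join(s) + "#"
    n = len(t)
    radius = [0] * n    # radius[i]: palindrome radius centered at t[i]
    center = right = 0  # center and right edge of the rightmost palindrome
    for i in range(n):
        if i < right:
            # Reuse the mirrored position's radius, clipped to the known edge.
            radius[i] = min(right - i, radius[2 * center - i])
        # Expand around i for as long as the characters keep matching.
        lo, hi = i - radius[i] - 1, i + radius[i] + 1
        while lo >= 0 and hi < n and t[lo] == t[hi]:
            radius[i] += 1
            lo -= 1
            hi += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    # A radius in t equals the palindrome's length in the original string.
    best_len, best_center = max((r, i) for i, r in enumerate(radius))
    start = (best_center - best_len) // 2
    return s[start:start + best_len]

print(longest_palindromic_substring("forgeeksskeegfor"))
print(longest_palindromic_substring("abba"))
```

When several palindromes share the maximum length, this version returns the one that ends furthest to the right.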
This is quite a common problem, and there are ample results on Google:
http://en.wikipedia.org/wiki/Longest_palindromic_substring
Rather than using Manacher's Algorithm you should use one of the parallel algorithms.
Duplicate of: how to find longest palindromic subsequence?

NLP tools for right-to-left languages? [closed]

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I'm trying to use NLP within a web application. What I want to do is a little information extraction on Persian sentences, so I need some RTL-friendly NLP tools. I've tried Python's NLTK before, but I don't know whether it supports RTL languages as well. It would be great if it does, because I am already comfortable with Django. Any information on this topic is appreciated.
I have never tried using NLTK for RTL, but I think it is perfectly capable of serving your needs, as it is a toolkit, not a system per se.
I could not find any restrictions regarding this. In fact, I have found some other references to people using it for Arabic:
Tokenization of Arabic words using NLTK
Python Arabic NLP
Now, you do need to find some Persian corpora. I could not find any during my brief research, but you can always hit the NLTK Users Mailing List.
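As a small illustration of why text direction is rarely an obstacle at the tokenization level, here is a sketch that uses only Python's standard `re` module (not NLTK itself). Strings are stored in logical reading order and the right-to-left layout is applied only at display time, so a generic Unicode-aware pattern already splits a simple Persian sentence. The sample sentence and the pattern are assumptions chosen for illustration.

```python
import re

# Persian sentence ("I went to school."), stored in logical (reading) order.
text = "من به مدرسه رفتم."

# In Python 3, \w matches Unicode letters, so Arabic-script words are captured;
# any other non-space character becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", text)

print(tokens)
print(len(tokens))
```

Real Persian text often contains the zero-width non-joiner inside words, which this simple pattern would treat as a separate token, so a language-aware tokenizer is still worth looking for.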

News Article Data Sets [closed]

Closed 4 years ago.
I am doing a project in news classification. Basically, the system will classify news articles based on pre-defined topics (e.g. sports, politics, international). To build the system, I need free data sets for training.
So far, after a few hours of googling and following links from here, the only suitable data set I could find is this. While this will hopefully be enough, I think I will try to find more.
Note that I want data sets that:
Contain full news articles, not just titles
Are in English
Are in .txt format, not XML or db
Can anybody help me?
Have you tried Reuters21578? It is one of the most commonly used datasets for text classification. It is formatted in SGML, but it is quite simple to parse and transform to a txt format.
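A rough sketch of such a conversion is shown below. It assumes each article sits in a `<REUTERS ...>` element with `<TOPICS>`, `<TITLE>` and `<BODY>` children, which should be checked against the actual files; `parse_articles` is a hypothetical helper and `SAMPLE` is a made-up snippet, not real data from the corpus.

```python
import html
import re

# Tag names below are assumptions about the SGML layout; verify against the files.
ARTICLE_RE = re.compile(r"<REUTERS.*?</REUTERS>", re.DOTALL)
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")

def parse_articles(sgml):
    """Extract topics, title and body text from every article block."""
    articles = []
    for block in ARTICLE_RE.findall(sgml):
        title = TITLE_RE.search(block)
        body = BODY_RE.search(block)
        topics = TOPICS_RE.search(block)
        if body is None:
            continue  # skip entries without body text
        articles.append({
            "title": html.unescape(title.group(1).strip()) if title else "",
            "topics": D_RE.findall(topics.group(1)) if topics else [],
            "body": html.unescape(body.group(1).strip()),
        })
    return articles

# Made-up sample in the assumed layout.
SAMPLE = """<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>SAMPLE HEADLINE</TITLE>
<BODY>First line of a made-up article.
Second line &lt;quoted&gt;.</BODY>
</TEXT>
</REUTERS>"""

parsed = parse_articles(SAMPLE)
for article in parsed:
    print(article["topics"], article["title"])
    print(article["body"])
```

Each returned dictionary can then be written to its own .txt file, with the topic list serving as the class label.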
You can build it yourself: write a Python/Perl/PHP script that runs a search, and when you find the results, isolate the attributes with regex... I think that is the best option. It is not easy but should be fun, and afterwards you can share the dataset with us.
