My requirement is taking in news articles and determining whether they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading that NLP may be of use here. Everything I have read points at NLP distinguishing opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work, and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again I don't see that occurring much in the type of news we will be getting)
2) How would NLP help, and why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words; a minimal code sketch follows the list):
1) Count number of positive and negative words in article
2) If a negation word is found within 2 or 3 words of the positive or negative word (e.g. "NOT the best"), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
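In code, a minimal sketch of what I mean (the two-word lexicons, uniform 1.0 weights, and naive whitespace tokenization are just placeholders for my real dictionaries):

```python
# Sketch of steps 1-4; the tiny lexicons and 1.0 weights are placeholders.
POSITIVE = {"good": 1.0, "best": 1.0}
NEGATIVE = {"bad": 1.0, "terrible": 1.0}
NEGATIONS = {"not", "never", "no"}
WINDOW = 3  # look this many words back for a negation

def sentiment_score(text):
    words = text.lower().split()
    score = 0.0
    for i, word in enumerate(words):
        weight = POSITIVE.get(word, 0.0) - NEGATIVE.get(word, 0.0)
        if weight == 0.0:
            continue
        # Step 2: flip the sign if a negation occurs within WINDOW words.
        if NEGATIONS & set(words[max(0, i - WINDOW):i]):
            weight = -weight
        score += weight
    return score

print(sentiment_score("this is not the best product"))  # -1.0
```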
I don't think there's anything particularly wrong with your algorithm; it's a fairly straightforward and practical way to go, but there are a lot of situations where it will make mistakes:
1) Ambiguous sentiment words - "This product works terribly" vs. "This product is terribly good"
2) Missed negations - "I would never in a million years say that this product is worth buying"
3) Quoted/indirect text - "My dad says this product is terrible, but I disagree"
4) Comparisons - "This product is about as useful as a hole in the head"
5) Anything subtle - "This product is ugly, slow and uninspiring, but it's the only thing on the market that does the job"
I'm using product reviews for examples instead of news stories, but you get the idea. In fact, news articles are probably harder because they will often try to show both sides of an argument and tend to use a certain style to convey a point. The final example is quite common in opinion pieces, for example.
As far as NLP helping you with any of this, word sense disambiguation (or even just part-of-speech tagging) may help with (1), syntactic parsing might help with the long-range dependencies in (2), and some kind of chunking might help with (3). It's all research-level work though; there's nothing that I know of that you can directly use. Issues (4) and (5) are a lot harder; I throw up my hands and give up at this point.
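As a rough illustration of the part-of-speech point, an off-the-shelf tagger/parser (spaCy here, chosen only for brevity) at least distinguishes the two "terribly" examples from (1) by tag and dependency head:

```python
# Illustration only: POS tags and dependency heads from spaCy show what
# "terribly" attaches to in each sentence.
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
for text in ("This product works terribly.", "This product is terribly good."):
    doc = nlp(text)
    print([(token.text, token.pos_, token.head.text) for token in doc])
```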
I'd stick with the approach you have and look at the output carefully to see if it is doing what you want. Of course, that then raises the issue of what you understand the definition of "sentiment" to be in the first place...
My favorite example is "just read the book". It contains no explicit sentiment word, and it is highly dependent on context. If it appears in a movie review, it means the-movie-sucks-it's-a-waste-of-your-time-but-the-book-is-good. However, if it is in a book review, it delivers a positive sentiment.
And what about "this is the smallest [mobile] phone on the market"? Back in the '90s, that was great praise. Today it may indicate that the phone is way too small.
I think this is the place to start in order to appreciate the complexity of sentiment analysis: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html (by Lillian Lee of Cornell).
You may find the OpinionFinder system and the papers describing it useful.
It is available at http://www.cs.pitt.edu/mpqa/ with other resources for opinion analysis.
It goes beyond polarity classification at the document level and tries to find individual opinions at the sentence level.
I believe the best answer to all of the questions you mentioned is reading the book "Sentiment Analysis and Opinion Mining" by Professor Bing Liu. This book is the best of its kind in the field of sentiment analysis; it is amazing. Just take a look at it and you will find the answers to all your 'why' and 'how' questions!
Machine-learning techniques are probably better.
Whitelaw, Garg, and Argamon have a method that achieves 92% accuracy, using an approach similar to yours for dealing with negation, and support vector machines for text classification.
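Not their exact feature set (the paper builds "appraisal group" features), but a minimal scikit-learn sketch of the SVM-for-text-classification part, with toy data standing in for a real labeled corpus:

```python
# Minimal SVM text classification with scikit-learn; plain bag-of-words
# TF-IDF features rather than the paper's appraisal groups, and the four
# toy examples stand in for a real labeled corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great product, works well", "awful, broke after a day",
         "really happy with this", "terrible waste of money"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["works great", "what a waste"]))
```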
Why don't you try something similar to how the SpamAssassin spam filter works? There's really not much difference between intention mining and opinion mining.
I have training in pure math but not in statistics, computer science, or information theory, so I am a bit lost here and would really appreciate any guidance.
I am looking for some helpful ways to frame a general search approach which would minimize the time complexity of the search.
For example, let's say I was playing a modified version of 20-questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly, and I want to develop a strategy that minimizes my average win time (as measured by the number of questions asked).
Sample Space: 329.5 million humans currently alive in the US
Rule: Ask any question. The question can have a yes-or-no answer or even a descriptive answer. So, for instance, it is allowed to ask the first name of the person.
Intuitively, it seems to me that immediately (as a first question) asking a question like "Is it Barack Obama?" is a terrible question, because it splits the sample space (or search space) into two sets, one with 1 person, namely the former US President, and the second containing the rest of the US population.
Asking what their sex is (or, old school, their gender) may be a better question, as it will split the answers into sets of roughly equal sizes.
Instead of asking a binary question, asking an n-ary question is likely better because it will split the sample space into n sub-spaces of varying sizes and if the sizes are similar then that's fantastic. For instance, the question could be, what is the first letter of their last name? There are 26 possible answers, although we know that people in the US are much more likely to have their last name begin with "J" rather than "X".
Of course, I can conceivably ask a 329.5 million-ary question whereby I'll have the answer in one-shot.
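A quick back-of-the-envelope calculation of how the branching factor changes the question count, assuming every question splits the remaining space perfectly evenly (the numbers, not the strategy, are the point):

```python
import math

N = 329_500_000  # people in the sample space

# With perfectly even n-way splits, identifying one person takes
# about log_n(N) questions.
for n in (2, 3, 26):
    print(f"n={n}: {math.ceil(math.log(N, n))} questions")
# n=2: 29, n=3: 18, n=26: 7
```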
My questions for you guys are as follows:
1) If we fix "n", so asking only binary or ternary or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions which divide the sample space into "n" roughly equal parts, if I am minimizing time complexity. How can I prove this? What is the right approach or mathematical framework to prove this? Assume that I am only minimizing time complexity, i.e., the average number of questions I need to ask to get to the solution.
2) If we don't fix "n", then what would be a general way to frame this mathematically? Now I have two variables over which I am operating, "n" and the relative sizes of the subsets into which an n-ary question splits the sample space, to minimize the time complexity. How can I frame this problem mathematically?
3) Is my intuition even correct? Or are there faster ways to approach this?
4) What I am describing sounds an awful lot like a classification decision tree in machine learning. Is minimizing entropy the right way to frame my question? (I sketch a candidate entropy bound just after this list.)
5) Who would know or think about this type of stuff? Information theorists? Computer scientists? Statisticians? Probability theorists? Machine learning folks? Someone else?
6) What's the right forum on the internet to get help on this question? Reddit? Some specific Stack Exchange? Anything else?
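For reference, the bound I believe I'm groping toward in questions 1) and 4), stated as an assumption for others to correct: if the hidden person is drawn with probabilities $p_i$, then any strategy of n-ary questions needs, on average, at least

$$\frac{H(X)}{\log_2 n} \text{ questions, where } H(X) = -\sum_i p_i \log_2 p_i,$$

and the bound is approachable only when each question splits the remaining probability mass into $n$ roughly equal parts.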
Thx
I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set, negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few-word edit. I found, to my surprise, that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. That's surprising, as these models claim an F1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
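For concreteness, my probe looks roughly like this (a minimal sketch; textattack/roberta-base-MRPC is one of the models mentioned above, and I'm assuming the GLUE MRPC convention that label index 1 means "is paraphrase"):

```python
# Sketch of the perturbation probe; assumes label index 1 = "is paraphrase"
# (the GLUE MRPC convention), which is worth double-checking per model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/roberta-base-MRPC"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def paraphrase_prob(a: str, b: str) -> float:
    inputs = tokenizer(a, b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

s = "Bloomberg said profits rose in the second quarter."
print(paraphrase_prob(s, s))                                  # near 1.0
print(paraphrase_prob(s, s.replace("rose", "did not rise")))  # often barely moves
```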
My question is: are there datasets, techniques, or models that deal well with small edits? I know that this is an extremely generic question, much more so than is typically asked on StackOverflow, but my concern is finding practical tools. If there is a theoretical technique, it might not be suitable, as I'm in the category of "available tools define your approach" rather than vice versa. So I hope the community has a recommendation.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is CheckList. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList", is very readable and goes into depth on this type of issue. They have a very relevant table, but it needs some terms defined first:
> We prompt users to evaluate each capability with three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT) is a collection of simple examples (and labels) to check a behavior within a capability. MFTs are similar to creating small and focused testing datasets, and are particularly useful for detecting when models use shortcuts to handle complex inputs without actually mastering the capability.
>
> ...An Invariance test (INV) is when we apply label-preserving perturbations to inputs and expect the model prediction to remain the same.
>
> A Directional Expectation test (DIR) is similar, except that the label is expected to change in a certain way. For example, we expect that sentiment will not become more positive if we add "You are lame." to the end of tweets directed at an airline (Figure 1C).
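The CheckList library ships helpers for all three test types, but the INV idea is simple enough to hand-roll. A minimal sketch (`predict` and the example pairs are placeholders for your model and data):

```python
# Hand-rolled invariance (INV) test in the CheckList spirit: each pair is an
# original input plus a label-preserving perturbation, so the model's
# prediction should not change. `predict` stands in for the model under test.
def run_inv_test(predict, pairs):
    failures = [(a, b) for a, b in pairs if predict(a) != predict(b)]
    print(f"{len(failures)}/{len(pairs)} invariance failures")
    return failures

pairs = [
    ("We flew to Chicago on Monday.", "We flew to Boston on Monday."),
    ("Great service from @united!", "Great service from @delta!"),
]
# run_inv_test(my_model.predict, pairs)  # my_model is hypothetical
```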
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. Starting with the fact that in my corner of Europe, the general sentiment is that peer-review requirements for any kind of NLG project are higher by several orders of magnitude compared to other sciences - and likely not without reason, or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and the seniority level/control you have over the project will eventually compose your decision vector for you. It's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase, for precisely the reasons you've mentioned. I'd recommend the same in terms of potential open source involvement, but if your profile information is accurate, that probably won't happen no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough, and will need to be promulgated far and wide before community-driven tooling gives you serious options on a micro level - at least so long as we collectively remain in an "oh, I was waiting for you to start doing something about it" phase.
P.S. Blasphemous as this might sound to a 2021 data scientist, I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, due to performance reasons - performance is already stacked against you, and accounting for the ridicule this would receive, it's a worthwhile mental exercise at least.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/
Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want the software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and then my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files, containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snippet of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, regarding this question. And nicknames should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that try to consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
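For the misspelling heuristic, you don't even need an NLP library; a sketch using Python's standard difflib (the token list is invented for illustration):

```python
# Find corpus tokens within a small edit distance of known handles, using
# difflib's similarity ratio; the token list here is invented.
import difflib

handles = ["past_returns", "DougL"]
tokens = ["Doug", "past_ret", "placeholder", "forecast", "MNLA"]

for handle in handles:
    matches = difflib.get_close_matches(
        handle.lower(), [t.lower() for t in tokens], n=3, cutoff=0.6)
    print(handle, "->", matches)
```

(The asker works in R, but the same idea is available there via e.g. base R's adist or the stringdist package.)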
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.
Suppose I have a set of phrases - about 10,000 - of average length 7-20 words, in which I want to find some given phrase. The phrase I am looking for could have some errors - for example, it may be missing one or two words, have some words misplaced, or contain some random words. For example, my database contains "As I was riding my red bike, I saw Christine", and I want it to match "As I was riding my blue bike, saw Christine", or "I was riding my bike, I saw Christine and Marion". What could be a good approach to this problem? I know about the Levenshtein distance, and I also suppose that this problem may have no easy, good solution.
A good text search engine will provide capabilities such as you describe, fsh. A typical approach would be to create a query that matches if any of the words occurs, and to order the results using a weight based on the number of terms occurring in proximity to each other, weighted inversely to their probability of occurring, since uncommon words will be less likely to co-occur by chance. There's a whole theory of this sort of thing called information retrieval, but maybe you know about that. Furthermore, you'd like to make sure that word-level fuzziness gets accounted for by normalizing case, punctuation and the like, applying some basic linguistic transformations (stemming), and in some cases introducing a dictionary of synonyms, especially when there is domain knowledge available to condition it.
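A toy version of that weighting idea: score candidate phrases by the rarity of the words they share with the query (an IDF-style weight); real engines add proximity scoring, stemming, and normalization on top:

```python
# Score phrases by shared words, weighting rare words more (IDF-style).
# Toy data; a real engine adds proximity, stemming, and normalization.
import math
import re
from collections import Counter

phrases = [
    "As I was riding my red bike, I saw Christine",
    "Christine and Marion went to the market",
    "The weather was lovely, so I was outside all day",
]

def tokens(s):
    return re.findall(r"[a-z']+", s.lower())

doc_freq = Counter(w for p in phrases for w in set(tokens(p)))
N = len(phrases)

def score(query, phrase):
    shared = set(tokens(query)) & set(tokens(phrase))
    return sum(math.log((N + 1) / doc_freq[w]) for w in shared)

query = "As I was riding my blue bike, saw Christine"
print(max(phrases, key=lambda p: score(query, p)))
```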
If you're interested in messing around with this stuff, try an open-source search engine. This article by Vik gives a reasonable survey from the perspective of 2009, and this one by Middleton and Baeza-Yates gives a good, detailed introduction to the topic.
Opinion mining/sentiment analysis is a somewhat recent subtask of natural language processing. Some compare it to text classification; some take a deeper stance towards it. What do you think are the most challenging issues in sentiment analysis (opinion mining)? Can you name a few?
The key challenges for sentiment analysis are:
1) Named Entity Recognition - What is the person actually talking about, e.g. is 300 Spartans a group of Greeks or a movie?
2) Anaphora Resolution - the problem of resolving what a pronoun, or a noun phrase refers to. "We watched the movie and went to dinner; it was awful." What does "It" refer to?
3) Parsing - What is the subject and object of the sentence, which one does the verb and/or adjective actually refer to?
4) Sarcasm - If you don't know the author you have no idea whether 'bad' means bad or good.
5) Twitter - abbreviations, lack of capitals, poor spelling, poor punctuation, poor grammar, ...
I agree with Hightechrider that those are areas where sentiment analysis accuracy can see improvement. I would also add that sentiment analysis tends to be done on closed-domain text for the most part. Attempts to do it on open-domain text usually wind up having very bad accuracy/F1 measure/what have you, or else are pseudo-open-domain because they only look at certain grammatical constructions. So I would say topic-sensitive sentiment analysis that can identify context and make decisions based on that is an exciting area for research (and industry products).
I'd also expand his 5th point from Twitter to other social media sites (e.g. Facebook, YouTube), where short, ungrammatical utterances are commonplace.
I think the answer is language complexity, mistakes in grammar, and spelling. There is a vast number of ways people express their opinions; e.g., sarcasm could be wrongly interpreted as extremely positive sentiment.
The question may be too generic, because there are several types of sentiment analysis (document level, sentence level, comparative sentiment analysis, etc.) and each type has some specific problems.
Generally speaking, I agree with the answer by @Ian Mercer, and I would add 3 other issues:
How to detect more in-depth sentiment/emotion. Positive and negative is a very simple analysis; one of the challenges is how to extract emotions, like how much hate there is inside the opinion, how much happiness, how much sadness, etc.
How to detect the object that the opinion is positive about and the object that the opinion is negative about. For example, if you say "She won him!", this expresses a positive sentiment for her and a negative sentiment for him at the same time.
How to analyze very subjective sentences or paragraphs. Sometimes it is very hard even for humans to agree on the sentiment of these highly subjective texts. Imagine how hard it is for a computer...
Although this is a bit of an old question, let me add a note about Arabic sentiment analysis specifically. The Arabic language has morphological complexities and dialectal varieties which require advanced preprocessing and lexicon-building processes that surpass what is needed for English.
Please refer to:
https://www.researchgate.net/publication/280042139_Survey_on_Arabic_Sentiment_Analysis_in_Twitter
https://link.springer.com/chapter/10.1007/978-3-642-35326-0_14