NLP - subject of sentence [closed] - text

Closed 10 years ago.
I am trying to get the main subject of a sentence, i.e., what a sentence is talking about (not the grammatical subject, which may be different).
So far I have got
1.) OpenNLP in Java, which gives me sentence detection, tokenization, POS tagging, parsing, and name finding.
2.) MaltParser and the Stanford Parser, which can give the grammatical subject of a simple sentence by dependency parsing.
I think a noun or a noun phrase will always be the subject in this more general sense, but a sentence can have many nouns and noun phrases.
Any help would be much appreciated.

As you correctly pointed out, syntax is not sufficient. One would have to use some form of shallow semantic analysis to identify what you call the "subject". I believe it is more often referred to as Agent in the context of SRL (Semantic Role Labeling). There are open source tools (e.g. UIUC SRL parser) to perform semantic role labeling, at least for English, but they usually work on separate predicates, of which in a sentence there may be several, so one has to somehow figure out which "subject" is the "main" one.
I do not think that the latter notion is well defined, in fact, as in a complex sentence it might not be clear which subject is the "main" one. It might make more sense for a particular kind of sentences, but not in general. I think it would help if you described the data you're working with and/or gave some examples.
P.S. you might consider asking this on https://linguistics.stackexchange.com/

Related

Determining Grammatical Validity of Text Input

I am looking for some way to determine if textual input takes the form of a valid sentence; I would like to provide a warning to the user if not. Examples of input I would like to warn the user about:
"dog hat can ah!"
"slkj ds dsak"
It seems like this is a difficult problem, since grammars are usually derived from treebanks, and the words in the provided sentence input might not appear in the grammar. It also seems that parsers assume the textual input consists of valid English words to begin with (just my brief takeaway from playing around with Stanford NLP's GUI tool). My questions are as follows:
Is there some tool available to scan through text input and determine if it is made up of valid English words, or at least offer a probability on that? If not, I can write this, just wondering if it already exists. I figure this would be step 1 before determining grammatical correctness.
My understanding is that determining whether a sentence is grammatically correct is done simply by attempting to parse the sentence and see if it is possible. Is that accurate? Are there probabilistic parsers that offer a degree of confidence when ambiguity is encountered? (e.g., a proper noun not recognized)
I hesitate to ask this last question, since I saw it was asked on SO over a decade ago, but any updates as to whether there is a basic, readily available grammar for NLTK? I know English isn't simple, but I am truly just looking to parse relatively simple, single sentence input.
Thanks!
A starting point is a classification model trained on the Corpus of Linguistic Acceptability (CoLA) task. There are several recent blog articles on how to fine-tune the BERT models from HuggingFace (Python) for this task. Here is one such blog article. You can also find models already fine-tuned on CoLA for various BERT flavors in the HuggingFace model zoo.
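For the "step 1" check mentioned in the question (is the input made up of known English words at all?), a minimal sketch could look like the following. The function name `known_word_ratio` and the tiny `vocabulary` set are hypothetical, chosen only for illustration; in practice the vocabulary would be loaded from a real word list. This does not judge grammaticality: "dog hat can ah!" still scores high because most of its tokens are real words.

```python
import re


def known_word_ratio(text, vocabulary):
    """Return the fraction of alphabetic tokens in `text` found in `vocabulary`."""
    # Lowercase and keep only runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Toy vocabulary for illustration only (an assumption, not a real lexicon).
vocabulary = {"dog", "hat", "can", "the", "a"}

print(known_word_ratio("dog hat can ah!", vocabulary))
print(known_word_ratio("slkj ds dsak", vocabulary))
```

A low ratio is a cheap signal for gibberish input such as "slkj ds dsak"; an acceptability classifier such as a CoLA-trained model would still be needed for word salad made of valid words.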

About subject,predicate and object in RDF [closed]

Closed 8 years ago.
This is slightly off-topic, but please answer this question.
I have studied lots of articles and materials on the net about RDF, but one thing I can't understand is how a natural English sentence is programmatically divided into subject, predicate, and object.
Ex: Scott Directed Runner.
If I give the above line, how is it divided into subject, predicate, and object programmatically? Please answer.
Thx...
Subject, predicate, and object are used in NLP to describe parts of sentences in some languages, as you mentioned. Do not conflate that with their usage in RDF, where they are names for the three components of a triple/statement.
Read RDF 1.1 Concepts and Abstract Syntax and note that one major takeaway is that a statement is formally defined as a 3-tuple (triple) consisting of:
subject := the node the statement/edge starts at
predicate := a semantically important label for the statement/edge
object := the node that the statement/edge terminates at
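The 3-tuple definition above can be sketched in code as follows. The `Triple` type, the `naive_triple` helper, and the `:name` spelling of the terms are all hypothetical, introduced only for illustration. The helper works on the question's example solely because that sentence happens to be three tokens in subject-verb-object order; it is not a general solution to the NLP side of the problem.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """An RDF-style statement: an edge from `subject` to `object` labelled `predicate`."""
    subject: str
    predicate: str
    object: str


def naive_triple(sentence):
    """Map a three-word 'Subject Verb Object' sentence to a Triple (toy assumption)."""
    words = sentence.rstrip(".").split()
    if len(words) != 3:
        raise ValueError("this toy mapping only handles three-word sentences")
    subject, verb, obj = words
    # Lowercase the first letter of the verb, as property names conventionally are.
    predicate = verb[0].lower() + verb[1:]
    return Triple(f":{subject}", f":{predicate}", f":{obj}")


print(naive_triple("Scott Directed Runner."))
```

Anything longer than three words raises an error on purpose, which is a reminder that real sentences need actual parsing before they can be mapped to statements.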
As you learn more about RDF, you'll learn that you have two major problems:
The Pure NLP problem that you have asked earlier, consisting of "How does one map a sentence in a natural language to a statement in RDF". This is not a trivial task, and requires that one study a great deal of NLP in order to solve.
The RDF problem, which will be "what should I define as my representation for this content once I know what I am extracting". This will include direct mapping of language expressions ("bob is a cat" -> :bob rdf:type :Cat) and mapping of more arbitrary concepts.
An example of mapping a more arbitrary concept: "All cats have at least one owner" ->
:Cat rdfs:subClassOf _:x .
_:x rdf:type owl:Restriction .
_:x owl:onProperty :hasOwner .
_:x owl:minCardinality "1"^^xsd:nonNegativeInteger .
At the risk of understating the point, the general problem that you have formulated is an extraordinarily large task that may not be well suited to StackOverflow. You will need to break this task up into many much smaller issues while you develop an understanding of the domain, and then ask specific technical questions as you work on them.

nlp: alternate spelling identification

Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and then my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files, containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snippet of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much, much more by one team than by other teams. A nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, on this question. And nicknames should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that try to consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
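As a rough illustration of the Levenshtein and same-first-letter heuristics, here is a minimal sketch. The function `nickname_candidates`, the `max_distance=2` threshold, and the sample tokens are assumptions made for this example, not something taken from the corpus; tokens are assumed to have already been filtered against an English dictionary.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nickname_candidates(tokens, handles, max_distance=2):
    """Pair non-dictionary tokens with known handles that look similar."""
    lowered = {handle.lower(): handle for handle in handles}
    pairs = set()
    for token in tokens:
        t = token.lower()
        if t in lowered:
            continue  # exact use of a handle, not a nickname
        for handle_lower, handle in lowered.items():
            same_start = t[:1] == handle_lower[:1]
            if same_start and levenshtein(t, handle_lower) <= max_distance:
                pairs.add((token, handle))
    return sorted(pairs)


# Hypothetical tokens for illustration; the handles come from the sample log.
tokens = ["Doug", "dougL", "talks", "pastreturns"]
print(nickname_candidates(tokens, ["DougL", "past_returns"]))
```

The output is only a candidate list for manual review, which matches the workflow described in the question; the threshold would need tuning, since short handles match many unrelated tokens at distance 2.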
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.

recognize words in a sequence of characters [closed]

Closed 11 years ago.
I need an algorithm that can recognize words (dictionary-based) in a sequence of characters that has no white space.
Let's say, for example, the sequence is:
spaceless
It should recognize space and less.
There might be situations where more than one set of words can be recognized. It's hard to give such an example, but I'll give it a try:
example: spaceslight
recognized words: space and slight (1)
recognized words: spaces and light (2)
So the algorithm should be able to find those kinds of variations too.
If you need multiple queries on the same string a suffix trie is a good solution. This will store the string very efficiently and allows lookup of queries in O(n) where n is the length of the query (note that you cannot do better unless you have more knowledge of the queries).
If a suffix trie still is using up too much space, you can use a DAWG, but this is much more complicated to build.
You can also try the Knuth-Morris-Pratt algorithm. It searches for a string in text, and its running time is linear in the combined length of the text and the pattern. Have a look here:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
PS: You might need to tweak it a little bit to your needs...
You might want to look at the Rabin-Karp algorithm. It allows a single pass through the text to search for all the n-letter words in the dictionary for some value of n. Standard Rabin-Karp will find overlaps: spaceslight -> spaces, a, ace, aces, slight, light, i. You would need to modify it if you didn't want overlapping words.
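None of the string-search algorithms above directly yields the non-overlapping splits the question asks for. One way to get them, sketched here under the assumption that the dictionary fits in an in-memory set, is memoized recursion over start positions. This is a different technique from suffix tries, KMP, or Rabin-Karp, and the function name `segmentations` and the toy dictionary are hypothetical.

```python
from functools import lru_cache


def segmentations(text, dictionary):
    """Return every way to split `text` into a sequence of dictionary words."""
    words = frozenset(dictionary)
    max_len = max(map(len, words), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # All segmentations of text[start:]; the empty suffix has one: no words.
        if start == len(text):
            return [[]]
        results = []
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in words:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


# Toy dictionary for illustration only.
dictionary = {"space", "spaces", "slight", "light", "less"}
print(segmentations("spaceslight", dictionary))
print(segmentations("spaceless", dictionary))
```

Memoizing on the start position means each suffix is solved once, although the number of returned segmentations can still grow exponentially for long inputs with many short dictionary words.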

Best Algorithmic Approach to Sentiment Analysis [closed]

Closed 9 years ago.
My requirement is taking in news articles and determining if they are positive or negative about a subject. I am taking the approach outlined below, but I keep reading NLP may be of use here. All that I have read has pointed at NLP detecting opinion from fact, which I don't think would matter much in my case. I'm wondering two things:
1) Why wouldn't my algorithm work, and/or how can I improve it? (I know sarcasm would probably be a pitfall, but again I don't see that occurring much in the type of news we will be getting.)
2) How would NLP help, why should I use it?
My algorithmic approach (I have dictionaries of positive, negative, and negation words):
1) Count number of positive and negative words in article
2) If a negation word is found within 2 or 3 words of the positive or negative word (i.e., NOT the best), negate the score.
3) Multiply the scores by weights that have been manually assigned to each word. (1.0 to start)
4) Add up the totals for positive and negative to get the sentiment score.
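The four steps above might be sketched as follows. The lexicons, the weights, and the choice to look for negation words only in the three tokens before a sentiment word are assumptions made for this illustration; the description above does not pin those details down.

```python
import re

# Toy lexicons with manual weights (1.0 to start); illustrative assumptions only.
POSITIVE = {"good": 1.0, "best": 1.0, "great": 1.0}
NEGATIVE = {"bad": 1.0, "terrible": 1.0, "worst": 1.0}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text, window=3):
    """Sum weighted sentiment words, flipping the sign after a nearby negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            value = POSITIVE[token]
        elif token in NEGATIVE:
            value = -NEGATIVE[token]
        else:
            continue
        # Negation check: any negation word within `window` tokens before this one.
        preceding = tokens[max(0, i - window):i]
        if any(word in NEGATIONS for word in preceding):
            value = -value
        score += value
    return score


print(sentiment_score("This is not the best product"))
print(sentiment_score("A good and great result"))
```

A positive total suggests positive sentiment and a negative total the opposite; a score near zero can mean either neutral text or positive and negative words cancelling out.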
I don't think there's anything particularly wrong with your algorithm. It's a fairly straightforward and practical way to go, but there are a lot of situations where it will make mistakes:
(1) Ambiguous sentiment words - "This product works terribly" vs. "This product is terribly good"
(2) Missed negations - "I would never in a million years say that this product is worth buying"
(3) Quoted/Indirect text - "My dad says this product is terrible, but I disagree"
(4) Comparisons - "This product is about as useful as a hole in the head"
(5) Anything subtle - "This product is ugly, slow and uninspiring, but it's the only thing on the market that does the job"
I'm using product reviews for examples instead of news stories, but you get the idea. In fact, news articles are probably harder because they will often try to show both sides of an argument and tend to use a certain style to convey a point. The final example is quite common in opinion pieces, for example.
As far as NLP helping you with any of this, word sense disambiguation (or even just part-of-speech tagging) may help with (1), syntactic parsing might help with the long-range dependencies in (2), and some kind of chunking might help with (3). It's all research-level work though; there's nothing that I know of that you can directly use. Issues (4) and (5) are a lot harder; I throw up my hands and give up at this point.
I'd stick with the approach you have and look at the output carefully to see if it is doing what you want. Of course, that then raises the issue of what you understand the definition of "sentiment" to be in the first place...
My favorite example is "just read the book". It contains no explicit sentiment word and it depends heavily on the context. If it appears in a movie review, it means that the-movie-sucks-it's-a-waste-of-your-time-but-the-book-is-good. However, if it is in a book review, it delivers a positive sentiment.
And what about "this is the smallest [mobile] phone in the market"? Back in the '90s, it was great praise. Today it may indicate that the phone is way too small.
I think this is the place to start in order to get the complexity of sentiment analysis: http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html (by Lillian Lee of Cornell).
You may find the OpinionFinder system and the papers describing it useful.
It is available at http://www.cs.pitt.edu/mpqa/ with other resources for opinion analysis.
It goes beyond polarity classification at the document level and tries to find individual opinions at the sentence level.
I believe the best answer to all of the questions you mentioned is to read the book "Sentiment Analysis and Opinion Mining" by Professor Bing Liu. It is the best book of its kind in the field of sentiment analysis. It is amazing. Just take a look at it and you will find the answer to all your 'why' and 'how' questions!
Machine-learning techniques are probably better.
Whitelaw, Garg, and Argamon have a technique that achieves 92% accuracy, using an approach similar to yours for dealing with negation and support vector machines for text classification.
Why don't you try something similar to how the SpamAssassin spam filter works? There is really not much difference between intention mining and opinion mining.

Resources