Is there any CFG available (with POS tags, i.e. part-of-speech tags) to validate the grammar of sentences in English?

It may not be 100% accurate, but is there any written and tested CFG?
Is it available with the NLTK data?

See this, and search for the word "probability". There are options for printing the probability of parse trees, and you can interpret them as a rough confidence measure.
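For example, with NLTK's probabilistic parsing tools you can print a parse tree together with its probability. This is only a minimal sketch; the toy grammar below is invented purely for illustration.

```python
# A minimal sketch, assuming NLTK is installed. The grammar is a toy example;
# each parse tree's probability can be read as a rough confidence score.
from nltk import PCFG, ViterbiParser

grammar = PCFG.fromstring("""
    S  -> NP VP        [1.0]
    NP -> DT NN [0.7] | NN [0.3]
    VP -> VB NP        [1.0]
    DT -> 'the'        [1.0]
    NN -> 'dog' [0.5] | 'ball' [0.5]
    VB -> 'chases'     [1.0]
""")

parser = ViterbiParser(grammar)
for tree in parser.parse("the dog chases the ball".split()):
    print(tree.prob())   # probability of this parse
    tree.pretty_print()  # the tree itself
```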

Short answer: no. Long answer: noooooooooooooo
This is a huge issue. A single CFG couldn't come close to capturing the complexity that English presents. Even POS taggers aren't terribly accurate.
The very best spelling and grammar checkers look for invariant rules of English and violations of them. Search for English grammar rules and think about how you might use software to detect such violations.
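As a trivial illustration of that rule-based approach, here is a sketch that flags one such violation. The function name and the vowel-letter heuristic are my own; real article choice depends on the vowel sound, not the letter.

```python
# Toy rule check: flag "a" before a word starting with a vowel letter.
# (Deliberately simplistic; "a university" would be a false positive.)
import re

def check_articles(sentence):
    issues = []
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        if prev.lower() == "a" and re.match(r"[aeiou]", word, re.IGNORECASE):
            issues.append(f'"a {word}" should probably be "an {word}"')
    return issues

print(check_articles("She ate a apple and a banana"))
# ['"a apple" should probably be "an apple"']
```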

Related

Stanford Core NLP Tree Parser Sentence Limits wrong - suggestions?

I'm dealing with German law documents and would like to generate parse trees for their sentences. I found and used the Stanford CoreNLP parser. However, it does not recognize sentence limits as well as other tools (e.g. spaCy) when parsing the sentences of a document. For example, it would break sentences at every single '.' character, including the dot at the end of abbreviations such as "incl.".
Since it is crucial to cover the whole sentence for creating syntax trees, this does not really work out for me.
I would appreciate any suggestions for tackling this problem, especially pointers to other software that might be better suited to it. If I have overlooked a way to tweak the Stanford parser, I would be very grateful for any hints on how to make it detect sentence limits better.
A quick glance into the docs did the trick: you can run your pipeline, which might include the sentence splitter, with the property ssplit.isOneSentence = true to basically disable it. This means you can split the sentences beforehand, e.g. using spaCy, and then feed single sentences into the pipeline.
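A rough sketch of that setup, assuming spaCy with a German model, the stanza CoreNLPClient wrapper, and a local CoreNLP installation (with German models) reachable via CORENLP_HOME; the sample text is mine.

```python
# Sketch only: pre-split sentences with spaCy, then hand each one to CoreNLP
# with ssplit.isOneSentence so CoreNLP's own splitter is effectively disabled.
import spacy
from stanza.server import CoreNLPClient

nlp = spacy.load("de_core_news_sm")  # used only for sentence splitting
text = "Der Vertrag endet am 31.12. des Folgejahres. Näheres regelt das Gesetz."

with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "parse"],
    properties={"ssplit.isOneSentence": "true"},  # one input = one sentence
    be_quiet=True,
) as client:
    for sent in nlp(text).sents:          # spaCy tends to handle abbreviations better
        ann = client.annotate(sent.text)
        print(ann.sentence[0].parseTree)  # constituency tree for this sentence
```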

How to take the suffix in smoothing of Part of speech tagging

I am writing a part-of-speech tagger and I am handling unknown words via their suffixes.
The main issue is how to decide the suffix length: should it be fixed in advance (as in Weischedel's approach), or should I take the last few letters of the word (as in Samuelsson's approach)?
Which approach would be better?
Quick googling suggests that the Weischedel approach is sufficient for English, which has only rudimentary morphological inflection. The Samuelsson approach seems to work better (which makes sense intuitively) when it comes to processing inflecting languages.
A Resource-light Approach to Morpho-syntactic Tagging (Google Books, p. 9) puts it this way:
To handle unknown words Brants (2000) uses Samuelsson's (1993) suffix analysis, which seems to work best for inflected languages.
(This is not in a direct comparison to Weischedel's approach, though.)
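As a toy illustration of the general idea (not a faithful implementation of either paper's method), a suffix-based guesser with a pre-decided maximum suffix length might look like this; all names and the tiny corpus are made up.

```python
# Estimate tag counts per suffix from training data, then back off from
# longer to shorter suffixes when guessing the tag of an unknown word.
from collections import defaultdict, Counter

MAX_SUFFIX = 4  # fixed cap, in the spirit of a pre-decided suffix length

def train_suffix_model(tagged_words):
    counts = defaultdict(Counter)                 # suffix -> tag counts
    for word, tag in tagged_words:
        for k in range(1, MAX_SUFFIX + 1):
            counts[word[-k:]][tag] += 1
    return counts

def guess_tag(word, counts):
    # Back off from the longest matching suffix to shorter ones.
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):
        suffix = word[-k:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return "NN"  # default guess for completely unseen shapes

corpus = [("running", "VBG"), ("eating", "VBG"), ("happiness", "NN")]
model = train_suffix_model(corpus)
print(guess_tag("sleeping", model))  # likely "VBG" via the "-ing" suffix
```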

unicode characters

In my application I have Unicode strings and I need to tell which language each string is in.
I want to do this by narrowing down the list of possible languages based on which ranges the characters of the string fall into.
The ranges I have are from http://jrgraphix.net/research/unicode_blocks.php and the possible languages from http://unicode-table.com/en/.
The problem is that the algorithm has to detect all languages. Does anyone know of a wider mapping of Unicode ranges to languages?
Thanks,
Wojciech
This is not really possible, for a couple of reasons:
Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a specific piece of text contains them. German, for example, uses the Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters are not particularly rare, you can easily create whole sentences without them. So, a short text might not contain them. Thus, again, looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters because of a French loanword (e.g. "déjà vu"). Or it may contain foreign words, because the text is talking about foreign things (e.g. "Götterdämmerung is an opera by Richard Wagner...", or "The Great Wall of China (万里长城) is..."). Looking at code points alone would be misleading.
To sum up, no, you cannot reliably map code point ranges to languages.
What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
But this is hard and a lot of work. You would be better off using an existing solution, such as those recommended by deceze and Esailija.
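For a sense of what the character-frequency idea above involves, here is a very rough sketch. The fingerprints are built from tiny made-up samples; real ones would need large corpora per language.

```python
# Compare a text's character-frequency profile against per-language
# fingerprints using cosine similarity. Illustrative only.
from collections import Counter
import math

def profile(text):
    counts = Counter(c.lower() for c in text if c.isalpha())
    total = sum(counts.values()) or 1
    return {c: n / total for c, n in counts.items()}

def cosine(p, q):
    dot = sum(p[c] * q.get(c, 0.0) for c in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# In practice these fingerprints would be built from large corpora.
fingerprints = {
    "english": profile("the quick brown fox jumps over the lazy dog and then some"),
    "german":  profile("über den schönen grünen Wäldern weht ein kühler Wind"),
}

sample = "der Hund läuft über die Straße"
best = max(fingerprints, key=lambda lang: cosine(profile(sample), fingerprints[lang]))
print(best)  # likely "german" for this sample, but treat the whole thing as a toy
```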
I like the suggestion of using something like Google Translate, as it will be doing all the work for you.
You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see if that is sufficient. Certain Tibetan characters do indicate Tibetan, and there are unique characters in many languages that will be a giveaway. But as the other answer pointed out, a limited sample of text may not be that accurate, as you may not have a clear indicator.
Languages will however differ in the frequencies that each character appears, so you could have a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This probably goes a bit further than a rule-based system. Probably a good tool to build this would be a text classification algorithm, which will do all the analysis for you. You would train an algorithm on different languages, instead of having to articulate the actual rules yourself.
A much more sophisticated version of this is presumably what Google does.
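As a toy version of the "unique characters" rule mentioned above: check which Unicode block ranges occur in the text. The ranges here are taken from the Unicode block charts and are far from exhaustive.

```python
# Flag scripts that strongly hint at a language (or language family)
# via Unicode block ranges. Only a handful of blocks for illustration.
SCRIPT_RANGES = {
    "tibetan": (0x0F00, 0x0FFF),  # Tibetan block
    "greek":   (0x0370, 0x03FF),  # Greek and Coptic block
    "hangul":  (0xAC00, 0xD7AF),  # Hangul Syllables block
}

def scripts_in(text):
    found = set()
    for ch in text:
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= ord(ch) <= hi:
                found.add(name)
    return found

print(scripts_in("བོད་སྐད is Tibetan"))  # {'tibetan'}
```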

Can NLTK be used to Analyse the sentiment a certain word has within a sentence?

I have a quick question and could not find the answer anywhere on the internet:
Can NLTK be used to analyse the sentiment a certain word has within a sentence?
For example, sentiment for "iPhone": "Even though it is terrible weather outside, my iPhone makes me feel good again." = Sentiment: positive
Have you thought of breaking down the text into clauses ("it is terrible weather outside", "my iPhone makes me feel good again") and evaluating them separately? You can use NLTK's parsers for that. This will reduce the amount of text each judgement is based on, though, so it might end up doing more harm than good.
This won't help you in cases like "Microsoft Surface is no iPad, it's terrible" (where your target is "iPad"), since the sentiment is negative but the iPad wins the comparison. So perhaps you'll also want to check the syntactic analysis, and only examine sentences where your target word is the subject or object. Whether these will give you better performance is anybody's guess, I think.
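To make the clause idea above concrete, here is a simplistic sketch using NLTK's bundled VADER analyser, splitting on commas instead of a real clause parser; the sentence and target word are the example from the question.

```python
# Requires the 'vader_lexicon' NLTK data. Scores only the clauses that
# mention the target word; a comma split stands in for proper parsing.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

sentence = ("Even though it is terrible weather outside, "
            "my iPhone makes me feel good again.")
target = "iPhone"

clauses = [c.strip() for c in sentence.split(",")]
relevant = [c for c in clauses if target.lower() in c.lower()]

for clause in relevant:
    print(clause, sia.polarity_scores(clause))  # compound > 0 suggests positive
```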
I do not have much experience with NLTK, but I have done some concept-level sentiment analysis using NLP libraries in Java. Here is how I did it. The same approach should work for you if you are able to identify dependencies in NLTK. This approach works fine for simple rules but may not work well for complicated sentences.

Libraries or tools for generating random but realistic text

I'm looking for tools for generating random but realistic text. I've implemented a Markov Chain text generator myself and while the results were promising, my attempts at improving them haven't yielded any great successes.
I'd be happy with tools that consume a corpus or that operate based on a context-sensitive or context-free grammar. I'd like the tool to be suitable for inclusion into another project.
Most of my recent work has been in Java so a tool in that language is preferred, but I'd be OK with C#, C, C++, or even JavaScript.
This is similar to this question, but larger in scope.
Extending your own Markov chain generator is probably your best bet, if you want "random" text. Generating something that has context is an open research problem.
Try (if you haven't):
Tokenising punctuation separately, or including punctuation in your chain if you aren't already. This includes paragraph marks.
If you're using a 2- or 3-word history Markov chain, try resetting to a 1-word history when you encounter full stops or newlines (a minimal sketch of such a generator follows below).
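A minimal sketch of such a generator, with punctuation tokenised separately as suggested; the corpus is a placeholder and the reset-on-full-stop refinement is left out for brevity.

```python
# Word-level Markov chain generator; punctuation marks become tokens too.
import random
import re
from collections import defaultdict

def tokenize(text):
    # Words and punctuation as separate tokens (paragraph marks could be added).
    return re.findall(r"\w+|[.,!?;]", text)

def build_chain(tokens, order=2):
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        key = tuple(tokens[i:i + order])
        chain[key].append(tokens[i + order])
    return chain

def generate(chain, length=30):
    key = random.choice(list(chain))
    out = list(key)
    for _ in range(length):
        choices = chain.get(tuple(out[-len(key):]))
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

corpus = ("The quick brown fox jumps over the lazy dog. "
          "The lazy dog sleeps. The quick fox runs.")
print(generate(build_chain(tokenize(corpus))))
```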
Alternatively, you could use WordNet in two passes with your corpus:
Analyse sentences to determine common sequences of word types, i.e. nouns, verbs, adjectives, and adverbs. WordNet includes these. Everything else (pronouns, conjunctions, whatever) is excluded, but you could essentially pass those straight through.
This would turn "The quick brown fox jumps over the lazy dog" into "The [adjective] [adjective] [noun] [verb(s)] over the [adjective] [noun]"
Reproduce sentences by randomly choosing a template sentence and replacing [adjective], [noun] and [verb] with actual adjectives, nouns and verbs.
There are quite a few problems with this approach too: for example, you need context from the surrounding words to know which homonym to choose. Looking up "quick" in WordNet yields the sense about being fast, but also the bit of your fingernail.
I know this doesn't solve your requirement for a library or a tool, but might give you some ideas.
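For what it's worth, here is my own rough simplification of the template pass, using WordNet synset counts to guess a word's type; the homonym problem mentioned above applies in full, and all names are invented.

```python
# Guess each word's type from WordNet, build a template, then refill the
# slots from pools of words seen in that slot. Needs the NLTK 'wordnet' data.
import random
from nltk.corpus import wordnet as wn

SLOT_FOR_POS = {"n": "[noun]", "v": "[verb]", "a": "[adjective]", "s": "[adjective]"}

def slot(word):
    synsets = wn.synsets(word)
    if not synsets:
        return None  # articles, pronouns etc. pass straight through
    pos_counts = {}
    for s in synsets:
        pos_counts[s.pos()] = pos_counts.get(s.pos(), 0) + 1
    return SLOT_FOR_POS.get(max(pos_counts, key=pos_counts.get))

words = "The quick brown fox jumps over the lazy dog".split()
template = [slot(w) or w for w in words]
print(" ".join(template))  # roughly: The [adjective] [adjective] [noun] [verb] ...

pools = {}
for w, t in zip(words, template):
    if t.startswith("["):
        pools.setdefault(t, []).append(w)

print(" ".join(random.choice(pools[t]) if t in pools else t for t in template))
```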
For this purpose I've used many data sets, including Wikinews articles. I've extracted text from them using this tool:
http://alas.matf.bg.ac.rs/~mr04069/WikiExtractor.py
