I've been working with NLTK for the past three days to get familiar with it and reading the "Natural Language Processing with Python" book to understand what's going on. I'm curious if someone could clarify the following for me:
Note that the first time you run this command, it is slow because it
gathers statistics about word sequences. Each time you run it, you
will get different output text. Now try generating random text in the
style of an inaugural address or an Internet chat room. Although the
text is random, it re-uses common words and phrases from the source
text and gives us a sense of its style and content. (What is lacking
in this randomly generated text?)
This part of the text, in chapter 1, simply says that generate() "gathers statistics" and produces "different output text".
What specifically does generate() do, and how does it work?
This example of generate() uses text3, which is the Book of Genesis:
In the beginning , between me and thee and in the garden thou mayest
come in unto Noah into the ark , and Mibsam , And said , Is there yet
any portion or inheritance for us , and make thee as Ephraim and as
the sand of the dukes that came with her ; and they were come . Also
he sent forth the dove out of thee , with tabret , and wept upon them
greatly ; and she conceived , and called their names , by their names
after the end of the womb ? And he
Here, the generate() function seems simply to output phrases created by cutting off text at punctuation and randomly reassembling it, but the result has a bit of readability to it.
type(text3) will tell you that text3 is of type nltk.text.Text.
To cite the documentation of Text.generate():
Print random text, generated using a trigram language model.
That means that NLTK has created an N-gram model for the Genesis text, counting each occurrence of sequences of three words so that it can predict the most likely successor of any given two words in this text. N-gram models will be explained in more detail in chapter 5 of the NLTK book.
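As a rough sketch of the idea (not NLTK's actual implementation, whose sampling strategy may differ), you can build a conditional frequency distribution over the trigrams of text3 yourself and draw successors from it:

import random
import nltk
from nltk.book import text3  # Genesis; requires the NLTK 'book' data to be downloaded

# For every pair of consecutive words, count which third words follow it.
cfd = nltk.ConditionalFreqDist(
    ((w1, w2), w3) for w1, w2, w3 in nltk.trigrams(text3)
)

def generate_sketch(seed=('In', 'the'), length=50):
    """Extend the seed by repeatedly sampling a continuation of the last two words."""
    w1, w2 = seed
    words = [w1, w2]
    for _ in range(length):
        candidates = cfd[(w1, w2)]
        if not candidates:
            break
        # Sample in proportion to observed trigram frequency.
        w3 = random.choices(list(candidates), weights=candidates.values())[0]
        words.append(w3)
        w1, w2 = w2, w3
    return ' '.join(words)

print(generate_sketch())

Because each continuation is drawn at random in proportion to how often it was seen, the output changes on every run, which matches the "different output text" the book describes.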
See also the answers to this question.
I wonder why words like "therefore", "however" or "etc." are not included in NLTK's stopword list, for instance.
Can you suggest a strategy to make this list automatically more general?
One obvious solution would be to include every word that occurs in all documents. However, "therefore" may simply not occur in some documents.
Just to be clear, I am not talking about augmenting the list with words from specific data sets. For instance, in some data sets it may be interesting to filter out certain proper names. I am not talking about that. I am talking about the inclusion of general words that can appear in any English text.
The problem with tinkering with a stop word list is that there is no good way to gather all texts about a certain topic and then automatically discard everything that occurs too frequently. It may lead to inadvertently removing just the topic that you were looking for, because in a limited corpus it occurs relatively frequently. Also, any list of stop words may already contain just the phrase you are looking for. As an example, automatically creating a list of 1980s music groups would almost certainly discard the group The The.
The NLTK documentation refers to where their stopword list came from as:
Stopwords Corpus, Porter et al.
However, that reference is not very precise. It seems to state that this was part of the 1980s Porter stemmer (PDF: http://stp.lingfil.uu.se/~marie/undervisning/textanalys16/porter.pdf; thanks go to alexis for the link), but that paper does not actually mention stop words. Another source states that:
The Porter et al refers to the original Porter stemmer paper I believe - Porter, M.F. (1980): An algorithm for suffix stripping. Program 14 (3): 130—37. - although the et al is confusing to me. I remember being told the stopwords for English that the stemmer used came from a different source, likely this one - "Information retrieval" by C. J. Van Rijsbergen (Butterworths, London, 1979).
https://groups.google.com/forum/m/#!topic/nltk-users/c8GHEA8mq8A
The full text of Van Rijsbergen can be found online (PDF: http://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf); it mentions several approaches to preprocessing text and so may well be worth a full read. From a quick glance-through it seems the preferred algorithm to generate a stop word list goes all the way back to research such as
LUHN, H.P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-317 (1957).
dating back to the very early stages of automated text processing.
The title of your question asks about the criteria that were used to compile the stopwords list. A look at stopwords.readme() will point you to the Snowball source code, and based on what I read there I believe the list was basically hand-compiled, and its primary goal was the exclusion of irregular word forms in order to provide better input to the stemmer. So if some uninteresting words were excluded, it was not a big problem for the system.
As for how you could build a better list, that's a pretty big question. You could try computing a TF-IDF score for each word in your corpus. Words that never get a high TF-IDF score (for any document) are uninteresting, and can go in the stopword list.
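As an illustrative sketch of that idea (scikit-learn, the sample documents and the cutoff are all assumptions made up for the example, not part of NLTK):

from sklearn.feature_extraction.text import TfidfVectorizer

# docs is assumed to be a list of raw text documents from your corpus.
docs = ["first document text ...", "second document text ..."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)           # shape: (n_docs, n_terms)

# For each term, take its highest TF-IDF score across all documents.
max_scores = tfidf.max(axis=0).toarray().ravel()
terms = vectorizer.get_feature_names_out()       # get_feature_names() on older versions

threshold = 0.1                                  # purely illustrative cutoff
stopword_candidates = [t for t, s in zip(terms, max_scores) if s < threshold]
print(stopword_candidates)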
I'm writing a Java app that uses NLP to detect named entities. I'm using the Stanford University named entity recognition code in my application. I've already written an application that detects the names and compares them with a database. But I have a problem with the text itself.
I want to classify sentences in a text that merely mention a name, and ignore them.
Example:
'... This writer has the same writing style as Herman Melville. ...'
The named entity is Herman Melville, but the text is not about Herman Melville; it is about other writers. Herman Melville is a true negative in that case.
Another example:
The Orb.
Alex Paterson prides the Orb on manipulating obscure samples beyond recognition on its albums and during its concerts; his unauthorised use of other artists’ works has led to disputes with musicians, most notably with Rickie Lee Jones. During its live shows of the 1990s, the Orb performed using digital audio tape machines optimised for live mixing and sampling before switching to laptops and digital media. Despite changes in performance method, the Orb maintained its colourful light shows and psychedelic imagery in concert. These visually intensive performances prompted critics to compare the group to Pink Floyd.
The artists that are detected are 'The Orb' and 'Pink Floyd'. The text is about The Orb, but the group is compared with Pink Floyd. So I want to use NLP to ignore 'Pink Floyd' and detect 'The Orb' as the named entity that is the subject.
I already have a database with example texts in which the writers are already identified; I could use this as a test set. And I have a database of all the writers that exist.
I would like to have some examples or stuff to read on how to solve this problem. Even a discussion would be nice.
OK, for your problem I would prefer to add a constraint, such as handling only the sentences that explicitly have a name in them. This would help you reduce the set of sentences that go through your final processing. Because your requirement is to decide what the text is actually about (removing true negatives), I think looking for the root and nsubj relations in the grammatical dependency structure generated by the Stanford parser would put you on the right track. For your example, the root and nsubj relations from the grammatical dependency structure look something like:
nsubj(performed-11, Orb-10)
root(ROOT-0, performed-11)
nn(Paterson-2, Alex-1)
nsubj(prides-3, Paterson-2)
root(ROOT-0, prides-3)
nsubj(maintained-9, Orb-8)
root(ROOT-0, maintained-9)
nsubj(prompted-5, performances-4)
root(ROOT-0, prompted-5)
Now from here you can check which of the names (Orb or Pink Floyd) appears in this structure the greatest number of times. If Orb shows up more often, then Orb is your output; if Pink Floyd does, then Pink Floyd is. You also have to take into account that a name may span more than one word (Orb is one word while Pink Floyd is two, for which you can also look at the nn relation), and that gives you your final output.
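A rough sketch of that counting step in Python, run over the textual dependency output above (the entity list is assumed to come from your NER step, and the matching is deliberately crude):

import re
from collections import Counter

dependencies = [
    "nsubj(performed-11, Orb-10)",
    "root(ROOT-0, performed-11)",
    "nn(Paterson-2, Alex-1)",
    "nsubj(prides-3, Paterson-2)",
    "root(ROOT-0, prides-3)",
    "nsubj(maintained-9, Orb-8)",
    "root(ROOT-0, maintained-9)",
    "nsubj(prompted-5, performances-4)",
    "root(ROOT-0, prompted-5)",
]
entities = ["Orb", "Pink Floyd"]              # output of the NER step

counts = Counter()
for dep in dependencies:
    match = re.match(r"nsubj\([^,]+,\s*(\w+)-\d+\)", dep)
    if match:
        word = match.group(1)
        for entity in entities:
            # Crude check: the subject word is one of the entity's words.
            # Multi-word names would also need the nn/compound relations.
            if word in entity.split():
                counts[entity] += 1

subject = counts.most_common(1)[0][0] if counts else None
print(subject)                                # 'Orb' for the text above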
Hope this helps.
I want to generate plausible (or less than plausible is okay too) nonsense text, similar to the way a Markov chain approach would do it, but I want the nouns and verbs of the generated text to come from a different source than the analyzed text.
So, for example, let's say that text 1 is from Little Red Riding Hood, and my list of nouns/verbs is something like the ones listed here: nouns, verbs. I'm looking for a way to swap out some/all of the nouns/verbs in text 1 with the new nouns/verbs. Then I would generate a new text from the mashup (perhaps using the Markov chain approach).
I'm guessing that I need some sort of initial grammar analysis of text 1, and then perhaps swap in appropriately tagged words from the insertion noun/verb lists?
I'm not familiar with text generation, but I'd suggest a language modelling approach. You should check out the first 1-2 lectures for inspiration :)
You can try creating a language model that is independent of the nouns and verbs (i.e. replacing them with _noun and _verb). Then you can try generating text from it with a degree of randomness, since the suggested model just counts words and phrases.
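As a minimal sketch of the swapping part only (not the language-model generation), assuming NLTK for tagging and made-up replacement lists:

import random
import nltk   # requires the 'punkt' and 'averaged_perceptron_tagger' data

text1 = "The wolf saw the girl in the forest."
new_nouns = ["robot", "engine", "satellite"]     # your insertion lists
new_verbs = ["computes", "launches", "ignores"]

tokens = nltk.word_tokenize(text1)
tagged = nltk.pos_tag(tokens)                    # Penn Treebank tags

output = []
for word, tag in tagged:
    if tag.startswith("NN"):                     # noun slots
        output.append(random.choice(new_nouns))
    elif tag.startswith("VB"):                   # verb slots
        output.append(random.choice(new_verbs))
    else:
        output.append(word)

print(" ".join(output))
# e.g. "The robot ignores the satellite in the engine ."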
I haven't tried it and I hope it works for you.
I'm trying to analyse a set of phrases, and I don't know exactly how natural language processing can help me, or whether someone can share their knowledge with me.
The objective is to extract streets and locations. Often this kind of information is not presented to the reader in a structured way, and it's hard to find a way of parsing it. I have two main objectives.
First, the extraction of the streets themselves. As far as I know, NLP libraries can help me tokenize a phrase and perform an analysis that will find nouns (for example). But where does a street name begin and where does it end? I assume that I will need to compare that analysis against a street database, but I don't know which is the optimal method.
Also, I would like to deduce the level of severity, for example, in car accidents. I'm assuming that the only way is to establish some heuristic based on the words present in the phrase (for example, if the word "deceased" appears, +100). Am I correct?
Thanks a lot as always! :)
The first part of what you want to do ("First, the extraction of the streets themselves. [...] But where does a street name begin and where does it end?") is a subfield of NLP called Named Entity Recognition. There are many libraries available which can do this; I like NLTK for Python myself. Depending on your choice, I assume that a street-name database would be useful for training the recognizer, but you might be able to get reasonable results with the default corpus. Read the documentation of your NLP library for that.
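As a quick baseline sketch with NLTK's default chunker (the sentence is invented; street names may well be missed or mislabelled by the default model, which is exactly where a street-name database or custom training comes in):

import nltk   # requires punkt, averaged_perceptron_tagger, maxent_ne_chunker, words

sentence = "Two cars collided on Baker Street near Regent's Park yesterday."

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)          # default chunker: PERSON, GPE, ORGANIZATION, ...

# Print whatever chunks the default model recognizes.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))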
The second part, recognizing accident severity, can be treated as an independent problem at first. You could take the raw words or their part-of-speech tags as features and train a classifier on them (SVM, HMM, KNN, your choice). You would need a fairly large, correctly labelled training set for that; from your description I'm not certain you have that?
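A bare-bones sketch of such a classifier with scikit-learn (the phrases, labels and the choice of a linear SVM are all just placeholders for illustration; your real training set would be far larger):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented training set: phrase -> severity label.
phrases = [
    "minor collision, no injuries reported",
    "driver taken to hospital with serious injuries",
    "one person deceased at the scene",
    "vehicles suffered light damage only",
]
labels = ["low", "high", "high", "low"]

model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(phrases, labels)

print(model.predict(["two people taken to hospital after the crash"]))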
"I'm assuming that the only way is to stablish some heuristic by the present words in the phrase " is very vague, and could mean a lot of things. Based on the next sentence it kind of sounds like you think scanning for a predefined list of keywords is the only way to go. In that case, no, see the paragraph above.
Once you have both parts working, you can combine them and count the number of accidents and their severity per street. Using some geocoding library you could even generalize to neighborhoods or cities. Another challenge is the detection of synonyms ("Smith Str" vs "John Smith Street") and homonyms ("Smith Street" in London vs "Smith Street" in Leeds).
I'm writing an Elman Simple Recurrent Network. I want to give it sequences of words, where each word is a sequence of phonemes, and I want a lot of training and test data.
So, what I need is a corpus of English words, together with the phonemes they're made up of, written as something like ARPAbet or SAMPA. British English would be nice but is not essential so long as I know what I'm dealing with. Any suggestions?
I do not currently have the time or the inclination to code something that derives the phonemes a word is composed of from spoken or written data, so please don't propose that.
Note: I'm aware of the CMU Pronouncing Dictionary, but it claims only that it's based on the ARPAbet symbol set - does anyone know if there are actually any differences, and if so, what they are? (If there aren't any, then I could just use that...)
EDIT: CMUPD 0.7a Symbol list - vowels may have lexical stress, and there are variants (of ARPABET standard symbols) indicating this.
CMUdict should be fine. "Arpabet symbol set" just means Arpabet. If there are any minor differences, they should be explained in the CMUdict documentation.
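If you do go with CMUdict, it ships as an NLTK corpus, so a lookup is straightforward (assuming the cmudict data has been downloaded):

from nltk.corpus import cmudict   # needs nltk.download('cmudict') once

pron = cmudict.dict()             # maps a word to a list of pronunciations

print(pron["record"])             # usually several entries (noun vs. verb stress)
# Each pronunciation is a list of ARPAbet symbols; the trailing digits on the
# vowels (0/1/2) encode the lexical stress mentioned in the EDIT above.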
If you need data that's closer to real life than stringing together dictionary pronunciations of individual words, look for phonetically transcribed corpora, e.g., TIMIT.