Building your own text corpus


It may sound stupid, but do you know how to build a text corpus? I have searched everywhere and found plenty of existing corpora, but I wonder how they were built. For example, if I want to build a corpus of positive and negative tweets, do I just have to make two files? And what goes inside those files? I don't get it.
In this example he stores positive and negative tweets in a Redis DB.

But what about inner of those files?
This depends mostly on what library you're using. XML (with a variety of tags) is common, as is one sentence per line. The tricky part is getting the data in the first place.
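As a rough illustration of the "one sentence per line" layout, here is a minimal sketch that writes one plain-text file per label and reads it back as (text, label) pairs. The file names (pos.txt, neg.txt), the function names and the sample tweets are all made up for the example; a real library may expect a different layout.

```python
import tempfile
from pathlib import Path

def write_corpus(directory, labelled_texts):
    """Write one file per label (e.g. pos.txt, neg.txt), one text per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for label, texts in labelled_texts.items():
        # Collapse internal whitespace so every text stays on a single line.
        lines = [" ".join(text.split()) for text in texts]
        (directory / f"{label}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

def read_corpus(directory):
    """Return (text, label) pairs; the label is the file name without .txt."""
    pairs = []
    for path in sorted(Path(directory).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                pairs.append((line, path.stem))
    return pairs

# Hypothetical sample data, written to a throwaway directory.
with tempfile.TemporaryDirectory() as corpus_dir:
    write_corpus(corpus_dir, {
        "pos": ["I love this phone", "great day at the beach"],
        "neg": ["worst service ever"],
    })
    for text, label in read_corpus(corpus_dir):
        print(label, text)
```

Keeping the label in the file name means the files themselves stay plain text, which most tools can read without any special parser.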
For example, if I want to build corpus with positive and negative tweets
Does this mean that you want to know how to mark the tweets as positive and negative? If so, what you're looking for is called text classification or sentiment analysis.
If you want to find a bunch of tweets, I'd check one of these pages (just from a quick search of my own).
Clickonf5: http://clickonf5.org/5438/download-tweets-pdf-xml-format-local-machine-server/
Quora: http://quora.com/What-is-the-best-tool-to-download-and-archive-Twitter-data-of-certain-hashtags-and-mentions-for-academic-research
Google Groups: http://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/kfislDfxunI
For general learning about how to create a corpus, I would check out the Handbook of Natural Language Processing Wiki by Richard Xiao.

Related

Method to generate keywords out of a scientific text?

Which method of text analysis should I use if I need to get a number of multiword keywords, say (up to) 5 per text, analysing a scientific text of some length? In particular, the text could be
a title,
or an abstract.
Preferably a method that is already implemented in Python.
Thank you!

You could look into keyword extraction, collocation finding or text summarization. Depending on what you want to use it for, you could also look into general terminology extraction. These are just some of the methods; there are also other approaches such as topic modeling.
Collocation finding and terminology extraction are more about finding domain-specific terminology and require a larger amount of text, but they can help to unify the generated tags. Basically, you would first run this kind of analysis to find n-grams that are domain-specific (and therefore, in scientific literature, indicative of the topic), and in a second step you would mark the occurrences of these extracted n-grams in the original texts.
Keyword extraction and text summarization lean more towards being applied to single texts, but obviously the resulting tags are going to be less unified.
It's difficult to say which method makes the most sense for you as this depends on the amount of data you have, the diversity of topics within the data you have, what you are planning to do with the keywords/tags and how much time you want to spend to optimize this extraction.
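To make the single-text keyword extraction idea a bit more concrete, here is a small sketch that loosely follows the RAKE idea: split the text into candidate phrases at stopwords and punctuation, then rank phrases by the summed degree/frequency score of their words. The stopword list is a tiny placeholder and the function names are invented for this example, so treat it as a starting point rather than a tuned method.

```python
import re
from collections import defaultdict

# Placeholder stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "as", "for", "from", "in", "is",
             "of", "on", "or", "the", "to", "we", "with"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords; punctuation also ends a run."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def extract_keywords(text, top_n=5):
    """Return up to top_n phrases ranked by summed word degree/frequency."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    # Sort by score (highest first), then alphabetically to keep ties stable.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]

abstract = "Keyword extraction from scientific abstracts. Keyword extraction is useful."
print(extract_keywords(abstract, top_n=2))
```

Because longer phrases collect more degree, this kind of scoring tends to favour multiword keywords, which fits the "multiword keywords per title or abstract" requirement in the question.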

Techniques other than RegEx to discover 'intent' in sentences

I'm embarking on a project for a non-profit organization to help process and classify 1000's of reports annually from their field workers / contractors the world over. I'm relatively new to NLP and as such wanted to seek the group's guidance on the approach to solve our problem.
I'll highlight the current process, and our challenges and would love your help on the best way to solve our problem.
Current process: Field officers submit reports from locally run projects in the form of best practices. These reports are then processed by a full-time team of curators who (i) ensure they adhere to a best-practice template and (ii) edit the documents to improve language/style/grammar.
Challenge: As the number of field workers has increased, the volume of reports has grown, and our editors are now becoming the bottleneck.
Solution: We would like to automate the first step of our process, i.e., checking each document for compliance with the organizational best-practice template.
Basically, we need to ensure every report has 3 components, namely:
1. States its purpose: What topic / problem does this best practice address?
2. Identifies Audience: Who is this for?
3. Highlights Relevance: What can the reader do after reading it?
Here's an example of a good report submission.
"This document introduces techniques for successfully applying best practices across developing countries. This study is intended to help low-income farmers identify a set of best practices for pricing agricultural products in places where there is no price transparency. By implementing these processes, farmers will be able to get better prices for their produce and raise their household incomes."
As of now, our approach has been to use RegEx and check for keywords, i.e., to check for compliance we use the following logic:
1. To check "states purpose", we do a regex match on 'purpose', 'intent'
2. To check "identifies audience", we do a regex match on 'identifies', 'is for'
3. To check "highlights relevance", we do a regex match on 'able to', 'allows', 'enables'
The current RegEx approach seems very primitive and limited, so I wanted to ask the community whether there is a better way of solving this problem using something like NLTK or CoreNLP.
Thanks in advance.

Interesting problem, and I believe it is a genuine research problem! In natural language processing there are a few techniques that learn and extract templates from text and then use them as gold annotations to decide whether a document follows the template structure. Researchers have used this kind of system for automatic question answering (extract templates from questions and then answer them). Your case is more difficult, because you need to learn the structure from a whole report. No simple NLP task matches your problem definition exactly, but you may not need a fancy (complex) model to solve it.
You can start with simple document matching and compute a similarity score. If you have a large collection of positive examples (well-formatted, well-specified reports), you can build a dictionary based on tf-idf weights and then check a new report for the presence of the dictionary tokens. You can also treat this as a binary classification problem. Classifiers such as SVMs and logistic regression work well on text data. With Python and scikit-learn you can build such programs quickly, and they are pretty easy to use. For text pre-processing, you can use NLTK.
Since the reports are written by field workers and only have to answer a few questions (you mentioned 3 specific components), I guess simple keyword matching techniques are a good start for your research. You can gradually move in other directions based on what you observe.
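The similarity-score idea can be sketched with the standard library alone. The snippet below builds smoothed tf-idf weights from a handful of known-good reports and scores a new report by its best cosine similarity to any of them. The example sentences and all function names are invented for illustration; for real use, scikit-learn's vectorizers and classifiers would probably be the more practical choice.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def idf_table(documents):
    """Smoothed inverse document frequency for every term in the examples."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse tf-idf vector; terms unseen in the examples are ignored."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_to_examples(report, examples):
    """Best cosine similarity between the report and any good example."""
    idf = idf_table(examples)
    target = tfidf_vector(report, idf)
    return max(cosine(target, tfidf_vector(example, idf)) for example in examples)

# Hypothetical good reports, standing in for the curated collection.
good_reports = [
    "This document is intended to help farmers price their produce.",
    "The purpose of this study is to help health workers track vaccines.",
]
print(similarity_to_examples("This study is intended to help farmers.", good_reports))
print(similarity_to_examples("Lunch was great yesterday.", good_reports))
```

A threshold on this score would have to be chosen by looking at real accepted and rejected reports; nothing here suggests a particular cut-off.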

This seems like a perfect scenario to apply some machine learning to your process.
First of all, the data annotation problem is covered. This is usually the most annoying problem. Thankfully, you can rely on the curators. The curators can mark the specific sentences that specify: audience, relevance, purpose.
Train some models to identify these types of clauses. If all the classifiers fire for a certain document, it means that the document is properly formatted.
If errors are encountered, make sure to retrain the models with the specific examples.

Unless you give yourself some hints about the format of the document, this is an open problem.
What you can do, though, is ask the people writing reports to conform to a fixed document format, for example 3 parts that each have a pre-defined title, like so:
1. Purpose
Explains the purpose of the document in several paragraphs.
2. Topic / Problem
This addresses the foobar problem, also known as lorem ipsum feeling text.
3. Take away
What can the reader do after reading it?
You then parse the document (from .doc format, for instance) and extract the three parts. After that you can run spell checking, grammar checking and text complexity algorithms on each part. Finally, you can extract, for instance, named entities (cf. Named Entity Recognition) and low TF-IDF words.
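Assuming the report has already been converted to plain text, extracting the three titled parts and checking that none is missing or empty could look roughly like this. The heading pattern ("1. Title" on its own line) and the names below are assumptions based on the layout suggested above, and a body line that itself starts with "1." would be mistaken for a heading.

```python
import re

# Section titles taken from the template suggested above.
REQUIRED_SECTIONS = ("Purpose", "Topic / Problem", "Take away")

# A heading is a line such as "2. Topic / Problem".
HEADING = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")

def split_sections(text):
    """Map each heading title to the text that follows it."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}

def missing_sections(text):
    """Return required titles that are absent or have an empty body."""
    sections = split_sections(text)
    return [title for title in REQUIRED_SECTIONS if not sections.get(title)]

report = """1. Purpose
Explains how to price agricultural products.
2. Topic / Problem

3. Take away
Farmers can negotiate better prices."""
print(missing_sections(report))  # the second section has no body
```

Each extracted part can then be passed on separately to the spell checking, grammar and entity extraction steps.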

I've been trying to do something very similar with clinical trials, where most of the data is again written in natural language.
If you do not care about past data, and have control over what the field officers write, maybe you can have them provide these 3 extra fields in their reports, and you would be done.
Otherwise, CoreNLP and OpenNLP, the libraries that I'm most familiar with, have some tools that can help you with part of the task. For example, if your regex pattern matches a word that starts with the prefix "inten", the actual word could be "intention", "intended", "intent", "intentionally", etc., and you wouldn't necessarily know whether it is a verb, a noun, an adjective or an adverb. The POS taggers and parsers in these libraries can tell you the part of speech (POS) of the word, so you could keep only the verbs that start with "inten" or, more strictly, only the third person singular verb forms.
CoreNLP has another tool called OpenIE, which attempts to extract relations in a sentence. For example, given the following sentence
Born in a small town, she took the midnight train going anywhere
CoreNLP can extract the triple
she, took, midnight train
Combined with the POS tagger, for example, you would also know that "she" is a personal pronoun and "took" is a past tense verb.
These libraries can accomplish many other tasks such as tokenization, sentence splitting, and named entity recognition and it would be up to you to combine all of these tools with your domain knowledge and creativity to come up with a solution that works for your case.

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
I can also think of having a lookup hash table with the names of countries and cities and then comparing every token extracted from the text against it.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high number of tweets might also affect my choice of method.

All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.

Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitive, make sure the case of your list already is normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
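Here is one way the sorted-list idea could look in Python, using the bisect module for the binary search. Multi-word names are handled by extending the phrase one word at a time for as long as some entry in the list still starts with it, and falling back to the next word when the longer match fails. The tokenizer, the function names and the tiny location list are assumptions made for the example, and punctuation between words is simply ignored.

```python
import bisect
import re

def build_gazetteer(names):
    """Sorted, lowercased, whitespace-normalized list of location names."""
    return sorted({" ".join(name.lower().split()) for name in names})

def _is_entry(gazetteer, phrase):
    i = bisect.bisect_left(gazetteer, phrase)
    return i < len(gazetteer) and gazetteer[i] == phrase

def _has_longer_entry(gazetteer, phrase):
    """True if some entry consists of this phrase plus at least one more word."""
    prefix = phrase + " "
    i = bisect.bisect_left(gazetteer, prefix)
    return i < len(gazetteer) and gazetteer[i].startswith(prefix)

def find_locations(text, gazetteer):
    words = re.findall(r"[\w']+", text.lower())
    found = []
    i = 0
    while i < len(words):
        phrase = words[i]
        longest = None  # (matched name, index of the word after the match)
        j = i + 1
        while True:
            if _is_entry(gazetteer, phrase):
                longest = (phrase, j)
            if j >= len(words) or not _has_longer_entry(gazetteer, phrase):
                break
            phrase += " " + words[j]
            j += 1
        if longest:
            found.append(longest[0])
            i = longest[1]
        else:
            i += 1  # no match: continue with the word right after the start
    return found

gazetteer = build_gazetteer(["New York", "York", "People's Republic of China", "Paris"])
print(find_locations("Flying from new york to Paris, not New Zealand", gazetteer))
```

Each lookup is a binary search, so the cost per word grows only logarithmically with the size of the location list.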

How fast are the tweets coming in? Is it the full Twitter firehose or the result of some filtering queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well on tweets because of all the leet speak. The NLP can be tuned for precision or recall depending on your needs, to cut down on the number of lookups performed in the gazetteer.
I recommend looking at Rosoka (also available as Rosoka Cloud through Amazon AWS) and GeoGravy.

Natural Language Generation - how to test if it sounds natural

I have a set of sentences which I generated based on painting analysis, and I need to test how natural they sound. Is there any API or application which does this?
I am using the Stanford Parser to give me a breakdown, but this doesn't exactly do the job I want!
Also, can one test how similar sentences are? I am randomly generating parts of sentences and want to check the variety of the sentences produced.

A lot of NLP stuff works using things called 'Language Models'.
A language model is something that can take in some text and return a probability. This probability should typically be indicative of how "likely" the given text is.
You typically build a language model by taking a large chunk of text (which we call the "training corpus") and computing some statistics out of it (which represent your "model"), and then using those statistics to take in new, previously unseen sentences and returning probabilities for them.
You should probably google "language models", "unigram models" and "n-gram models" and click on some of the results to find an article or presentation that helps you understand the previous sentence. (It's hard for me to recommend an appropriate tutorial because I don't know what your existing background is.)
Anyway, one way to think about language models is that they are systems that take in new text and tell you how similar it is to the training corpus the model was built from. Say you build 2 language models, one out of all the plays written by Shakespeare and another out of a large number of legal documents. The second one should give a much higher probability than the first to sentences from some new legal document that just got released. The first one should give a much higher probability to some other old English play (written by some other author), because that play is probably more similar to Shakespeare (in terms of the kind of words used, sentence lengths, grammar, etc.) than it is to modern legal language.
Everything the Stanford parser gives you back for a sentence is generated using language models. One way to think about how those outputs are built is to pretend that the computer tried every possible combination of tags and every possible parse tree for the sentence you gave it, used some clever language model to identify the most probable sequence of tags and the most probable parse tree, and returned those to you.
Getting back to your problem, you need to build a language model out of what you consider natural sounding text and then use that language model to evaluate the sentences you want to measure the naturalness of. To do this, you will have to identify a good training corpus and decide on what type of language model you want to build.
If you can't think of anything better, a collection of Wikipedia articles might serve as a good training corpus representing what natural sounding English looks like.
As for model type, an "n-gram model" would probably be good enough for your task. More complicated models like "Hidden Markov Models" and "PCFGs" (the stuff that powers the Stanford parser you mentioned) would probably make things even better, but n-grams are the simplest thing you could start with.
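To give a feel for the n-gram option, here is a toy bigram model with add-one smoothing that returns the average log-probability per token of a sentence. The three training sentences are made up and far too few for real use; in practice the counts would come from a large corpus such as the Wikipedia dump mentioned above, and the class and method names are just placeholders.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase words wrapped in sentence start/end markers."""
    return ["<s>"] + re.findall(r"[a-z']+", sentence.lower()) + ["</s>"]

class BigramModel:
    def __init__(self, corpus_sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus_sentences:
            tokens = tokenize(sentence)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot stands in for every word never seen in training.
        self.vocab_size = len(self.unigrams) + 1

    def log_prob(self, prev, word):
        """Add-one smoothed log P(word | prev)."""
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + self.vocab_size
        return math.log(numerator / denominator)

    def score(self, sentence):
        """Average log-probability per bigram; higher means more corpus-like."""
        tokens = tokenize(sentence)
        pairs = list(zip(tokens, tokens[1:]))
        return sum(self.log_prob(prev, word) for prev, word in pairs) / len(pairs)

# Tiny stand-in training corpus.
model = BigramModel([
    "the painting shows a river",
    "the painting shows a bridge",
    "a river runs under the bridge",
])
print(model.score("the painting shows a river"))
print(model.score("river a shows painting the"))
```

Dividing by the number of bigrams keeps long and short sentences comparable, which matters when the generated sentences vary in length.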

How can I analyze pieces of text for positive or negative words?

I'm looking for some sort of module (preferably for python) that would allow me to give that module a string about 200 characters long. The module should then return how many positive or negative words that string had. (e.g. love, like, enjoy vs. hate, dislike, bad)
I'd really like to avoid having to reinvent the wheel in natural language processing, so if there is anything you guys know of that would allow me to do what I described above, it'd be a huge time-saver if you could share.
Thanks for the help!

I think you're looking for sentiment analysis. Here's a Twitter sentiment app.
Here's a question about sentiment analysis using Python.

Before you analyse a piece of text you need to preprocess it: strip punctuation, repair the language (e.g. fix misspellings), split on spaces, lowercase the whole text and store the words in an iterable data structure.
For some basic sentiment analysis, the following techniques can be used:
Bag of words
In the bag of words technique we basically go through a bag (file) of words and check whether the iterable we built contains them. If it does, we assign a value to each word's presence in order to weigh the total sentiment of the text.
This link should help you understand more about this
https://en.wikipedia.org/wiki/Bag-of-words_model
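A minimal sketch of that word-list check, returning how many positive and negative words a short string contains. The two word sets are tiny placeholders built from the examples in the question; a real sentiment lexicon would be much larger, and this simple count does not handle negation ("not good").

```python
import re

# Placeholder lexicons; swap in a real sentiment word list for actual use.
POSITIVE = {"love", "like", "enjoy", "good", "great"}
NEGATIVE = {"hate", "dislike", "bad", "awful", "terrible"}

def count_sentiment_words(text):
    """Return (positive_count, negative_count) for the given text."""
    # Lowercase and keep only runs of letters/apostrophes, dropping punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive, negative

print(count_sentiment_words("I love the camera, but I hate the bad battery."))  # (1, 2)
```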
Keyword Extraction and Tagging
Keywords and important information can be extracted from the input text by tagging the elements and then removing unwanted data.
For example:
My name is John.
Here "John" and "name" carry the information, and "is" isn't really needed.
Similarly, verbs and other unimportant things can be removed in order to retain only the main information.
Chunking and chinking help here.
This link should be of help:
http://nltk.org/book/ch07.html

You can tokenize your text and get the sentiment using existing sentiment analysis tools. The most comprehensive resource that I know of is SentiBench. It is basically a survey study comparing many existing sentiment analysis tools, and it comes with the code and examples of how to use it.
