Text classification

Text classification - nlp

I have a trivial understanding of NLP so please keep things basic.
I would like to run some PDFs at work through a keyword extractor/classifier and build a taxonomy - in the hope of delivering some business intelligence.
For example, given a few thousand PDFs to mine I would like to determine the markets they apply to (we serve about 5 major industries with each one having several minor industries. Each industry and sub-industry has a specific market and in most cases those deal with OEMs, which in turn deal models, which further sub divide into component parts, etc.
I would love to crunch these PDFs into a semi-structured (more a graph actually) output like:
Aerospace
Manufacturing
Repair
PT Support
M250
C20
C18
Distribution
Can text classifiers do that? Is this too specific? How do you train a system like this that C18 is a "model" of "manufacturer" Rolls Royce of the M250 series and "PT SUPPORT" is a sub-component?
I could build this data manually but would take forever...
Is there a way I could use a text classifier framework and build something more efficiently than regex and python?
Just looking for ideas at this point... Watched a few tutorials on R and python libs but they didn't sound quite like what I am looking for.

Ok lets break your problem into small sub-problems first , i will break the task as
Read PDF and extract data and meta data from them - take a look at Apache Tikka lib
Any classifier to be more effective need training data - Create training data for text classifier
Then apply any suitable classifier algo .
You can also have look at Carrot2 clustering algo , it will automatically analyse the data and group pdf into different categories.

Related

Haar Cascade Training for Parts of a Known Object

I am working on a project where I am trying to extract key features of a bicycle from an overall image. I am currently investigating the use of Haar Cascades to train my computer to find certain regions of interest from said bicycles, e.g. the pedal-sprocket, seat, handle-bars. Then I will extract local features from these sub regions accordingly. The purpose is to create an overall descriptor of a particular bicycle so I can try to match it throughout a sample set of images of other bicycles.
My questions are as follows: Can I train a Haar classifier to look for a sub-component of an overall object? For example, say I want to look for the handlebars on a bicycle. How should I design the training? Should I detect the bicycle first, and then detect the handlebars within the overall bicycle region (Similar to detecting the eyes within a face in terms of facial recognition)? Since I know beforehand that all my images will contain a picture of a bicycle, I'm not sure if there is any point in detecting the bicycle to begin with and then looking for sub components.
In terms of training a Haar cascade and creating an XML that I can use (in OpenCV 3.1 and Python 3.6), could I just set up the positive and negative images with pictures of bicycles and no bicycles respectively? With the difference being that I isolate the particular area of interest by cropping the image appropriately each time (e.g. where the handlebars are)?
Also open to any recommendations about how others might solve the general problem of extracting key features for object matching. This is just one approach I am currently investigating. Thanks!

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or by establishing features that give rise to some threshold values for inclusion, among other.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!

Several Ideas:
Run LDA and get document-topic and topic-word distributions say (20 topics depending on your dataset coverage of different topics). Assign the top r% of the documents with highest relevant topic as relevant and low nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve top r nearest negihbours to your query (your conceptual space) as relevant and borrom nr percent as not relevant and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into say 100 clusters and then choose then manually label those clusters. Choose 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches for text classification, about which you can more literature.

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is: I'd like to categorize content on the words included in the text. But how can I import data like raw Tweets to PredictionIO? Is it possible to let PredictionIO run over the content and find strong words and to sort them in categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...

So I'll add on a little to what Thomas said. He's right, it all depends whether or not you have labels associated to your tweets. If your data is labeled then this will be a Text Classification problem. Look at this for more detailed info:
If you're instead looking to cluster, or group, a set of unlabeled observations then, as Thomas said, your best bet is to incorporate LDA into the works. Look at the latter documentation for more information, but basically once you run the LDA model you'll obtain an object of type DistributedLDAModel which has a method topicDistributions which gives you, for each tweet, a vector where each component is associated to a topic, and the component entry gives you the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with highest probability.
You also have access to a matrix of size MxN, where M is the number of words in your vocabulary, and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the ij th entry of this Topics Matrix as the probability that the word i appears in a document given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated to your tweets as a vector of counts. Then, you can interpret the ij entry of the product of your word matrix (tweets as rows, words as columns) and the Topics Matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions, feel free to ask if you want more details). Again now you assign tweet i to the topic associated to the largest numerical value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!

So, really depends on if you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classifying Tweets against these keywords. You might be interested in templates for MLlib Naive Bayes, Decision Trees.
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA but being an active open source project it wouldn't surprise me if someone has already implemented this so might be a good idea to ask on PredictionIO user or dev forums.

Image Categorization Using Gist Descriptors

I created a multi-class SVM model using libSVM for categorizing images. I optimized for the C and G parameters using grid search and used the RBF kernel.
The classes are 1) animal 2) floral 3) landscape 4) portrait.
My training set is 100 images from each category, and for each image, I extracted a 920-length vector using Lear's Gist Descriptor C code: http://lear.inrialpes.fr/software.
Upon testing my model on 50 images/category, I achieved ~50% accuracy, which is twice as good as random (25% since there are four classes).
I'm relatively new to computer vision, but familiar with machine learning techniques. Any suggestions on how to improve accuracy effectively?
Thanks so much and I look forward to your responses!

This is very very very open research challenge. And there isn't necessarily a single answer that is theoretically guaranteed to be better.
Given your categories, it's not a bad start though, but keep in mind that Gist was originally designed as a global descriptor for scene classification (albeit has empirically proven useful for other image categories). On the representation side, I recommend trying color-based features like patch-based histograms as well as popular low-level gradient features like SIFT. If you're just beginning to learn about computer vision, then I would say SVM is plenty for what you're doing depending on the variability in your image set, e.g. illumination, view-angle, focus, etc.

Can I use NLTK to determine if a comment is a positive one or a negative one?

Can you show me a simple example using http://www.nltk.org/code to determine if a string about a happy or upset mood?

NLTK cannot out of the box, but if you are looking for some related research on that area, take a look at this paper on Offensive Language Detection. The same methods could be adapted to detect comments which are not offensive/unoffensive, but instead happy/unhappy. The primary software package being used in this project for text classification is called WEKA and uses multiple classifiers, trained on previous examples, to determine whether language is offensive or not (and in this method uses a tunable threshold).

Pattern is something worthwhile a test drive too: you can see two opinion mining experiments right on the project homepage.
http://www.clips.ua.ac.be/pages/pattern-examples-100days
http://www.clips.ua.ac.be/pages/pattern-examples-elections

Nopey.
This is a task far beyond the capabilities of NLTK or any grammatical parser that is known or can be realistically imagined. Look at the NLTK Book to see what sorts of tasks it can accomplish which are far, far from your stated purpose.
As a cheap example:
I really enjoyed using your paper to train my dog.
Parse that up with NLTK and you can get
[('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'),
('using', 'VBG'), ('your', 'PRP$'), ('paper', 'NN'),
('to', 'TO'), ('train', 'VB'), ('my', 'PRP$'), ('dog', 'NN')]
Where the parse tree would tell me that 'enjoyed' is the central (past-tense) verb of the simple sentence. To enjoy something is good. To train something is generally a good thing. Gerunds, nouns, comparatives, and such are relatively neutral. So give this a Good score of 0.90.
Except I really mean that I either hit my dog with your paper or let it excrete on the paper which you'd probably consider a not Good thing.
Hire a person for this recognition task.
Added for those who imagine that even trained classifiers are of much use:
Classify this real entry from a real customer review corpus using any classifier you like trained on any dataset you like:
This camera keeps on autofocussing in
auto mode with a buzzing sound which
can't be stopped. It would be really
good if they have given an option to
stop this autofocussing. If you want
to have the date and time on the
image, it's only through their
software which reads the image's date
and time from the image's meta-data.
So if you use your card reader and
copy images - you got to once again
open them through their software to
put the date and time. In that too,
there isn't a direct way to add date
and time
- you got to say 'print images' to a different directory in which there is
an option to specify the date and time
. Even the slightest of the shakes
totally distorts your image. Indoor
images weren't so clear. You got to
have flash 'on' to get it even though
your room is well lit. The lens cap is
a really annoying. the movie clips
taken will always have some 'noise' in
it - you can't avoid that.
The worst mood classification I obtained was "totally equivocal" yet humans can easily determine that this is anything but complimentary. This wasn't a randomly picked datum, rather one that was selected for negative bias without "hate" or "suxz" or similar.

You're looking for a technique that uses a machine learning classifier to determine whether a piece of text is positive or negative. There have been various different attempts at this by a number of research teams (e.g. http://research.yahoo.com/pub/2387 and http://lingcog.iit.edu/doc/appraisal_sentiment_cikm.pdf) we can get about 80% to 90% accuracy at determining whether a product review is positive or negative.
Due to the brevity of your question, it's not obvious to me whether determining whether a product review is positive or negative is the same task you're trying to accomplish, or merely a related task, but I'd suggest starting simple with bag-of-words classification with a Bayesian classifier (which NLTK should be able to handle), and then improve your techniques from there depending on how the accuracy turns out.
Unfortunately, I've never used NLTK (nor Python for that matter) so I can't give you a code example of how to use NLTK for this.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string