I'm trying to understand PLSA (probabilistic latent semantic analysis) for text modeling (NLP). The problem is that every article I have read contains only math (probabilities), with no pseudo-algorithm or anything else to help you understand it. Is there a link where I can learn PLSA, please?
The P in PLSA stands for probabilistic, so I am afraid you will not find an article that avoids probability. The model itself is a probabilistic model, and some knowledge of joint and conditional distributions, independence, etc. is expected. I would recommend https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05, which I found to be the best online resource. There is a bit of math, but most of it is explained well. As for a PLSA algorithm, I am not sure. PLSA is not used that often, and LDA is usually preferred. I did find a GitHub implementation that fits PLSA using EM here: https://github.com/laserwave/plsa.
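Since the question asks for something closer to an algorithm, here is a minimal sketch of the usual EM procedure for PLSA in plain Python. It is not taken from the linked repository. It assumes the formulation P(w|d) = sum over z of P(z|d) * P(w|z), and the function names (plsa, normalise, log_likelihood) and the toy word counts are made up for illustration.

```python
import math
import random


def normalise(row):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(row)
    if total == 0:  # e.g. an empty document: fall back to a uniform distribution
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with EM. counts[d][w] = occurrences of word w in document d."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random starting points for P(z|d) and P(w|z).
    p_z_d = [normalise([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalise([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(joint)
                for z in range(n_topics):
                    expected = n * joint[z] / norm
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: turn the expected counts back into probabilities.
        p_z_d = [normalise(row) for row in new_z_d]
        p_w_z = [normalise(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over (d, w) of n(d, w) * log P(w|d) under the fitted model."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                total += n * math.log(p)
    return total


# Hypothetical toy corpus: rows are documents, columns are vocabulary words.
vocab = ["ball", "goal", "team", "vote", "party", "law"]
counts = [
    [3, 2, 4, 0, 0, 0],
    [2, 3, 1, 0, 1, 0],
    [0, 0, 1, 4, 3, 2],
    [0, 1, 0, 2, 3, 4],
]
p_z_d, p_w_z = plsa(counts, n_topics=2, n_iter=50)
for z, row in enumerate(p_w_z):
    top_words = sorted(zip(row, vocab), reverse=True)[:3]
    print("topic", z, [word for _, word in top_words])
```

Each row of p_z_d is a document's mixture over topics and each row of p_w_z is a topic's distribution over words. EM only finds a local optimum, so a different seed can give different topics.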
I've tried looking around but I can't seem to find an answer on what math I need before jumping into NLP. I was hoping to get a solid foundation in math before jumping into NLP.
From what I've gathered it's mostly:
Probability,
some Statistics,
Discrete Math
Thank you for your time.
As in most fields, you'll find once you dive in that the title "NLP" covers a fairly broad range of sub-fields. The math requirements vary widely depending on what you're trying to accomplish. So a little more detail about your goals would help.
That said, I can address parsing and the related fields I have some experience in, and offer very general comments on a few others.
You'll find discrete math and automata theory useful in any computer science discipline, so you can't go wrong there.
Some NLP work is closer to linguistics or psychology than to computer science. If that's where your interests lie, some linguistic theory might be helpful, along with some background in statistical hypothesis testing (applied statistics of the sort you might find in a social science department, although the more rigorous the better).
For morphology, tagging, parsing, and related fields, some probability theory is helpful (as is experience thinking about dynamic programming, although that's not really math background). If you're doing anything involving machine learning (which is most of NLP), it helps to understand some linear algebra.
That said, if your goals are more applied, you can accomplish quite a lot by applying existing tools, without detailed knowledge of the underlying math (it doesn't require any linear algebra to train an SVM, if all you need is a classifier).
As for comprehension, there is no mathematics that models language itself. A 'model' here just means a function that maps a language expression to a number or a utility. 'Comprehension' comes from representation and composition, which go beyond what the natural sciences can currently explain.
One simple question (but I haven't quite found an obvious answer in the NLP stuff I've been reading, which I'm very new to):
I want to classify emails with a probability along certain dimensions of mood. Is there an NLP package out there specifically dealing with this? Is there an obvious starting point in the literature where I should start reading?
For example, if I got a short email something like "Hi, I'm not very impressed with your last email - you said the order amount would only be $15.95! Regards, Tom" then it might get 8/10 for Frustration and 0/10 for Happiness.
The actual list of moods isn't so important, but a short list of generally positive vs generally negative moods would be useful.
Thanks in advance!
--Trindaz on Fedang #NLP
You can do this with a number of different NLP tools, but to my knowledge nothing does it out of the box. Perhaps the easiest place to start would be LingPipe (Java), which has a very good sentiment analysis tutorial. You could also use NLTK if Python is more your bent. There are some good blog posts over at Streamhacker that describe how you would use Naive Bayes to implement that.
Check out AlchemyAPI for sentiment analysis tools and scikit-learn or any other open machine learning library for the classifier.
If you have decided not to write the implementation yourself, you can also have the data classified by some other tool. The Google Prediction API may be an alternative.
Either way, you will need some labeled data and some preprocessing. But using an existing tool may help you get better accuracy more easily.
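To make the Naive Bayes suggestion concrete, here is a minimal sketch in plain Python of a multinomial Naive Bayes classifier that returns a probability per mood. It is not how LingPipe or NLTK implement it. The class name NaiveBayesMood, the two mood labels and the six training emails are all made up for illustration, and a real system would need far more labeled data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesMood:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # mood -> word -> count
        self.doc_counts = Counter()              # mood -> number of emails
        self.vocab = set()

    def train(self, labelled_emails):
        for text, mood in labelled_emails:
            self.doc_counts[mood] += 1
            for token in tokenize(text):
                self.word_counts[mood][token] += 1
                self.vocab.add(token)

    def predict_proba(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for mood, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[mood].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                count = self.word_counts[mood][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[mood] = score
        # Convert log scores to probabilities in a numerically stable way.
        top = max(log_scores.values())
        unnormalised = {m: math.exp(s - top) for m, s in log_scores.items()}
        norm = sum(unnormalised.values())
        return {m: v / norm for m, v in unnormalised.items()}


# Hypothetical labeled data, invented for this example.
training = [
    ("I am not impressed with your last email", "frustrated"),
    ("This is the wrong amount, very disappointed", "frustrated"),
    ("Still waiting for my refund, not happy with the delay", "frustrated"),
    ("Thanks so much, great service", "happy"),
    ("The order arrived early, I am delighted", "happy"),
    ("Wonderful support, thank you", "happy"),
]
model = NaiveBayesMood()
model.train(training)

email = ("Hi, I'm not very impressed with your last email - you said the "
         "order amount would only be $15.95! Regards, Tom")
probs = model.predict_proba(email)
print(probs)
```

Multiplying each probability by 10 gives scores in the 8/10 style from the question. Note that with only two labels the probabilities are forced to sum to 1, so independent scores per mood would need one binary classifier per mood instead.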
Is there a research paper or book I can read that tells me which feature selection algorithm would work best for the problem at hand?
I am simply trying to classify Twitter messages as pos/neg (to begin with). I started out with frequency-based feature selection (having started with the NLTK book), but soon realised that for similar problems various people have chosen different algorithms.
Although I can try frequency-based selection, mutual information, information gain and various other algorithms, the list seems endless, and I was wondering whether there is a more efficient way than trial and error.
Any advice?
Have you tried the book I recommended in answer to your last question? It's freely available online and entirely about the task you are dealing with: Opinion Mining and Sentiment Analysis by Pang and Lee. Chapter 4 ("Classification and Extraction") is just what you need!
I did an NLP course last term, and it became pretty clear that sentiment analysis is something that nobody really knows how to do well (yet). Doing this with unsupervised learning is of course even harder.
There's quite a lot of research going on regarding this, some of it commercial and thus not open to the public. I can't point you to any research papers but the book we used for the course was this (google books preview). That said, the book covers a lot of material and might not be the quickest way to find a solution to this particular problem.
The only other thing I can point you towards is to try googling around, maybe in scholar.google.com for "sentiment analysis" or "opinion mining".
Have a look at the NLTK movie_reviews corpus. The reviews are already pos/neg categorized and might help you with training your classifier. Although the language you find in Twitter is probably very different from those.
As a last note, please post any successes (or failures for that matter) here. This issue will come up later for sure at some point.
Unfortunately, there is no silver bullet in machine learning. This is usually referred to as the "No Free Lunch" theorem. Basically, a number of algorithms work for a problem, and some do better on some problems and worse on others. Averaged over all problems, they perform about the same. The same feature set may cause one algorithm to perform better and another to perform worse for a given data set. For a different data set, the situation could be completely reversed.
Usually what I do is pick a few feature selection algorithms that have worked for others on similar tasks and then start with those. If the performance I get using my favorite classifiers is acceptable, scrounging for another half percentage point probably isn't worth my time. But if it's not acceptable, then it's time to re-evaluate my approach, or to look for more feature selection methods.
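As one example of a feature selection method to try first, here is a minimal sketch of mutual-information scoring in plain Python. It scores each word by the mutual information between "word is present in the tweet" and the class label. The function name mutual_information and the toy tweets are made up for illustration, and this is only one of the methods mentioned in the question, not a recommendation that it is the best.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return {word: MI in bits} between word presence and the class label.

    docs is a list of token lists, labels is the parallel list of classes.
    """
    n = len(docs)
    label_counts = Counter(labels)
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for word in vocab:
        # Count how often each (presence, label) combination occurs.
        joint = Counter((word in s, y) for s, y in zip(doc_sets, labels))
        present = sum(1 for s in doc_sets if word in s)
        mi = 0.0
        for (has_word, y), n_xy in joint.items():
            n_x = present if has_word else n - present
            # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
            mi += (n_xy / n) * math.log2(n_xy * n / (n_x * label_counts[y]))
        scores[word] = mi
    return scores


# Hypothetical, already tokenised tweets.
tweets = [
    ["love", "this", "phone"],
    ["great", "day", "love"],
    ["this", "is", "great"],
    ["hate", "this", "phone"],
    ["awful", "day"],
    ["this", "is", "awful"],
]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

scores = mutual_information(tweets, labels)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:4])  # the highest-scoring words would be kept as features
```

In this toy data "this" appears equally often in both classes, so it scores zero, while words such as "love" or "awful" that occur in only one class score higher. Swapping in a different scoring function and comparing classifier accuracy is the trial-and-error loop described above.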
I have a problem related to graph.
I am not a computer science graduate, so I need a quick intro to what a graph is, where I can read about graphs, and how to solve graph-related problems in C++ or in general.
The Boost Graph Library may be a starting point and give you some code for solving your graph-related problems.
Please see Graph problems in the Stony Brook Algorithm Repository,
and a cute lecture by Xavier Llora.
I would start by studying a few specific algorithms. Dijkstra's algorithm and the graph closure algorithm are good places to start. Also, most introductory computer science texts (e.g. Data Structures) have a section on graphs. I used this book, though mostly after I was already pretty comfortable with most of the material. It takes a pretty formal approach, so if your maths is strong you might like it.
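For a first feel of what a graph and a graph algorithm look like in code, here is a minimal sketch of Dijkstra's shortest-path algorithm. It is written in Python for brevity rather than C++. The graph is stored as a dictionary that maps each node to its neighbours and edge weights; the node names and weights are made up, and all weights are assumed to be non-negative.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node.

    graph maps node -> {neighbour: edge weight}; weights must be >= 0.
    """
    dist = {source: 0}
    heap = [(0, source)]  # priority queue of (distance so far, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry: a shorter path to node was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


# Hypothetical directed graph with four nodes.
graph = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 7},
    "D": {},
}
distances = dijkstra(graph, "A")
print(distances)
```

Here the cheapest route from A to D goes A, C, B, D with total cost 4, which beats both A, B, D (cost 5) and A, C, D (cost 8). In C++ the same idea is available ready-made as dijkstra_shortest_paths in the Boost Graph Library.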
The community might be able to give you better pointers if you mentioned something specific that you're trying to solve (if there is such a thing).
This is a very cool tool for representing graphs
I need your help in determining the best approach for analyzing industry-specific sentences (e.g. movie reviews) as "positive" vs "negative". I've seen libraries such as OpenNLP before, but it's too low-level - it just gives me the basic sentence composition; what I need is a higher-level structure:
- hopefully with wordlists
- hopefully trainable on my set of data
Thanks!
What you are looking for is commonly dubbed Sentiment Analysis. Typically, sentiment analysis is not able to handle delicate subtleties, like sarcasm or irony, but it fares pretty well if you throw a large set of data at it.
Sentiment analysis usually needs quite a bit of pre-processing: at least tokenization, sentence boundary detection and part-of-speech tagging. Sometimes syntactic parsing can be important as well. Doing it properly is an entire branch of research in computational linguistics, and I wouldn't advise you to come up with your own solution unless you take the time to study the field first.
OpenNLP has some tools to aid sentiment analysis, but if you want something more serious, you should look into the LingPipe toolkit. It has some built-in SA-functionality and a nice tutorial. And you can train it on your own set of data, but don't think that it is entirely trivial :-).
Googling for the term will probably also give you some resources to work with. If you have any more specific question, just ask, I'm watching the nlp-tag closely ;-)
Some approaches to sentiment analysis use strategies popular in other text classification tasks. The most common is to transform each film review into a word vector and feed it into a classifier algorithm as training data. Most popular data mining packages can help you here. You could have a look at this tutorial on sentiment classification, which illustrates how to run an experiment using the open source RapidMiner toolkit.
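To illustrate the "word vector plus classifier" idea, here is a minimal sketch in plain Python. It turns each review into a vector of word counts and trains a perceptron on those vectors. The perceptron is chosen only because it fits in a few lines; it is not what RapidMiner or any particular package uses, and the function names and the four training reviews are made up for illustration.

```python
def build_vocab(reviews):
    """Map every word seen in the training reviews to a vector position."""
    words = sorted({w for text in reviews for w in text.lower().split()})
    return {w: i for i, w in enumerate(words)}


def vectorize(text, vocab):
    """Bag-of-words vector: one count per vocabulary word."""
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] += 1
    return vector


def train_perceptron(vectors, labels, epochs=20):
    """labels are +1 (positive) or -1 (negative)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            if y * activation <= 0:  # misclassified: nudge towards y
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(text, vocab, weights, bias):
    x = vectorize(text, vocab)
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return "pos" if activation > 0 else "neg"


# Hypothetical training reviews.
reviews = [
    "a wonderful moving film",
    "a dull boring film",
    "great acting and a great story",
    "terrible acting and a weak story",
]
labels = [1, -1, 1, -1]

vocab = build_vocab(reviews)
vectors = [vectorize(r, vocab) for r in reviews]
weights, bias = train_perceptron(vectors, labels)

print(predict("a great film", vocab, weights, bias))
print(predict("a boring story", vocab, weights, bias))
```

With real data you would replace the whitespace split with proper tokenization and the perceptron with whatever classifier your toolkit offers; the vectorization step stays essentially the same.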
Incidentally, there is a good data set available for research on detecting opinion in film reviews. It is based on IMDB user reviews, and you can find a lot of related research in the area that shows how the data set is used.
It's worth bearing in mind that the effectiveness of these methods can only be judged from a statistical viewpoint, so you can pretty much assume there will be misclassifications and cases where opinion is hard to detect. As already noted in this thread, detecting things like irony and sarcasm can be very difficult indeed.
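Judging a classifier statistically usually means comparing its predictions with hand-labeled answers on held-out reviews. Here is a minimal sketch that computes accuracy, precision and recall in plain Python; the function name evaluate and the gold and predicted label lists are made up for illustration.

```python
def evaluate(gold, predicted, positive="pos"):
    """Compare predicted labels with gold labels for a two-class task."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_pos = sum(1 for g, p in pairs if g != positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)
    return {
        "accuracy": correct / len(pairs),
        # Of the reviews predicted positive, how many really were positive?
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the truly positive reviews, how many did the classifier find?
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Hypothetical held-out labels and classifier output.
gold = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

metrics = evaluate(gold, predicted)
print(metrics)
```

The reviews used for this check should not have been seen during training, otherwise the numbers will look better than the classifier really is.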