I am trying to use the Vowpal Wabbit LDA model, but I am getting very bad results and I think something is wrong with my process. My vocabulary size is 100,000.
I run the code like this
vw --data train.txt --lda 50 --lda_alpha 0.1 --lda_rho 0.1 --lda_D 262726 -b 20 -pions.dat --readable_model wordtopics.dat
I was expecting the wordtopics.dat file to contain the topic proportions for those 100,000 words, but the file is huge: it contains about 1,048,587 lines.
I think that is because of -b 20 (2^20 = 1,048,576 hash buckets), and the lines at the end look like a uniform probability distribution.
However, when I look at the topics obtained, they do not make sense at all, so I think something is wrong. What could be going wrong?
This does not answer your question directly, but the Applied Data Science group at Columbia University has made a helper for working with VW's LDA, especially for viewing the results.
Also try the --passes option so that VW can improve its result over several passes through the training data.
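For example, something along these lines (the file names are placeholders and the values are illustrative, not a verified recipe; -b 17 gives 2^17 = 131,072 hash buckets, just above your 100k vocabulary, which also keeps the readable model much smaller, and --passes needs a cache file):
vw --data train.txt --lda 50 --lda_D 262726 --lda_alpha 0.1 --lda_rho 0.1 \
    -b 17 --passes 10 --cache_file lda.cache \
    -p predictions.dat --readable_model wordtopics.dat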
I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.
Would throwing more sentences into the process help fast_align be more accurate? Say I take some OPUS data with 100k aligned sentence pairs, add my 1000 sentences at the end of it, and feed it to fast_align. Will that help? I can't seem to find any info on whether this would make sense.
[Disclaimer: I know next to nothing about alignment and have not used fast_align.]
Yes.
You can prove this to yourself, and also plot the accuracy/scale curve, by removing data from your dataset to try it at an even lower scale.
That said, 1000 is already absurdly low; for these purposes 1000 ≈ 0, and I would not expect it to work.
Better would be to try 10K, 100K, and 1M. For results more comparable to others', use a standard corpus, e.g. Wikipedia or data from the research workshops.
Adding data very different from the data that is important to you can have mixed results, but in this case more data can hardly hurt. We could give more helpful suggestions if you mentioned a specific domain, dataset, or goal.
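As a rough sketch (file names here are placeholders; the flags are the ones recommended in the fast_align README), you could concatenate the OPUS pairs with your own sentences and then keep only the alignments for the lines you appended:
cat opus.de-en my1000.de-en > combined.de-en        # one pair per line: "german sentence ||| english sentence"
fast_align -i combined.de-en -d -o -v > forward.align
tail -n 1000 forward.align > my1000.align           # alignments for your 1000 sentences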
I have been asked by the company to specifically use a convolutional neural network to predict the type of medication (RxNorm code) prescribed based on the diagnoses given (ICD9 codes). I will be given a million prescriptions written by doctors. Each prescription is independent of the next one.
So an example would be: 110, 670, 890, BB2344
The first 3 items are ICD9 codes, the last one is the output, the RxNorm code. There are a million of these.
Honestly the task seems nonsensical to me. I do not have any idea regarding how to structure the inputs.
There is no inherent order to the diagnoses and no timestamps.
One diagnosis may make another diagnosis more likely; but there are plenty of examples where they are just independent.
The ICD9 coding system has hierarchical structure, such that codes 110 and 120 (both infections) are more closely related to each other than, say, codes 110 and 890 (an infection and a wound).
Basically, what should my input "image" look like? Or does a CNN not make sense at all for this problem?
Thanks!
CNNs require spatial (or temporal) correlation in the inputs. There is no such thing here, so the short answer is no, it makes no sense. In general, given how simple the data is, I would actually expect a basic linear model (on one-hot encoded data) or even basic rule induction to work well.
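For illustration, a minimal sketch of that one-hot/linear baseline with scikit-learn, using made-up codes rather than the asker's data:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data: each row is the set of ICD9 codes on one prescription,
# and the label is the RxNorm code that was prescribed.
diagnoses = [["110", "670", "890"], ["120", "401"], ["110", "120"]]
rxnorm = ["BB2344", "CC1001", "BB2344"]

encoder = MultiLabelBinarizer()                # one column per distinct ICD9 code, order ignored
X = encoder.fit_transform(diagnoses)           # multi-hot input matrix
clf = LogisticRegression(max_iter=1000).fit(X, rxnorm)
print(clf.predict(encoder.transform([["110", "890"]])))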
The only possible use of CNN-like structures here would be to exploit the graph nature of the codes through graph-CNNs, since the hierarchical structure in the input could be treated as a kind of "spatial" correlation.
I was trying my hand at sentiment analysis in Python 3, using the TF-IDF vectorizer with the bag-of-words model to vectorize a document.
So, to anyone who is familiar with that, it is quite evident that the resulting matrix representation is sparse.
Here is a snippet of my code. Firstly, the documents.
tweets = [('Once you get inside you will be impressed with the place.',1),('I got home to see the driest damn wings ever!',0),('An extensive menu provides lots of options for breakfast.',1),('The flair bartenders are absolutely amazing!',1),('My first visit to Hiro was a delight!',1),('Poor service, the waiter made me feel like I was stupid every time he came to the table.',0),('Loved this place.',1),('This restaurant has great food',1),
('Honeslty it did not taste THAT fresh :(',0),('Would not go back.',0),
('I was shocked because no signs indicate cash only.',0),
('Waitress was a little slow in service.',0),
('did not like at all',0),('The food, amazing.',1),
('The burger is good beef, cooked just right.',1),
('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.',0),
('The cocktails are all handmade and delicious.',1),('This restaurant has terrible food',0),
('Both of the egg rolls were fantastic.',1),('The WORST EXPERIENCE EVER.',0),
('My friend loved the salmon tartar.',1),('Which are small and not worth the price.',0),
('This is the place where I first had pho and it was amazing!!',1),
('Horrible - do not waste your time and money.',0),('Seriously flavorful delights, folks.',1),
('I loved the bacon wrapped dates.',1),('I dressed up to be treated so rudely!',0),
('We literally sat there for 20 minutes with no one asking to take our order.',0),
('you can watch them preparing the delicious food! :)',1),('In the summer, you can dine in a charming outdoor patio - so very delightful.',1)]
X_train, y_train = zip(*tweets)
And the following code to vectorize the documents.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)
print(vectorized)
When I print vectorized, it does not output a normal matrix. Instead, this:
If I'm not wrong, this must be a sparse matrix representation. However, I am not able to comprehend its format, and what each term means.
Also, there are 30 documents, which explains the 0-29 in the first column. If that's the trend, then I'm guessing the second column is the index of the word, and the last value is its tf-idf? It just struck me while I was typing my question, but kindly correct me if I'm wrong.
Could anyone with experience in this help me understand it better?
Yes: the first two numbers in each tuple give the row and column position, and the third value is the tf-idf score stored at that position. So it is basically showing the positions and values of the nonzero entries.
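To make that concrete, here is a small sketch continuing from the code above (get_feature_names_out may be get_feature_names on older scikit-learn versions):
feature_names = tfidfvec.get_feature_names_out()    # column index -> vocabulary word
rows, cols = vectorized.nonzero()                    # positions of the nonzero entries
for r, c in zip(rows, cols):
    print(r, feature_names[c], vectorized[r, c])     # document index, word, tf-idf weight
# Or, since 30 short documents easily fit in memory, convert to an ordinary dense matrix:
dense = vectorized.toarray()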
I need a reliable and accurate method to filter tweets as subjective or objective. In other words I need to build a filter in something like Weka using a training set.
Are there any training sets available which could be used as a subjective/objective classifier for Twitter messages or other domains which may be transferable?
For research and non-profit purposes, SentiWordNet gives you exactly what you want. A commercial license is available too.
SentiWordNet : http://sentiwordnet.isti.cnr.it/
Sample Java Code: http://sentiwordnet.isti.cnr.it/code/SWN3.java
Related Paper: http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf
The other approach I would try:
Example
Tweet 1: #xyz u should see the dark knight. Its awesme.
1) First do a dictionary lookup for the word meanings.
"u" and "awesme" will not return anything.
2) Then go against the known abbreviations/shorthands and substitute matches with the expansions
(Some resources: netlingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations)
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesme.
3) Then feed the remaining unknown words into a spell checker and substitute each with the best match (not always ideal, and error-prone for short words)
Related Link:
Looking for Java spell checker library
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesome.
4) Split the tweet into words, feed them into SWN3, and aggregate the result
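A rough Python sketch of step 4, using NLTK's SentiWordNet reader as a stand-in for the Java SWN3 class (the scoring here is deliberately crude: first sense only, no negation handling):
# Assumed helper, not part of the steps above; requires nltk.download('sentiwordnet') and nltk.download('wordnet').
from nltk.corpus import sentiwordnet as swn

def tweet_score(words):
    pos, neg = 0.0, 0.0
    for w in words:
        synsets = list(swn.senti_synsets(w))
        if synsets:                          # take the first sense as a crude approximation
            pos += synsets[0].pos_score()
            neg += synsets[0].neg_score()
    return pos - neg                         # > 0 leans positive, < 0 leans negative

print(tweet_score("you should see the dark knight it is awesome".split()))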
The problem with this approach is that
a) Negations should be handled outside SWN3.
b) Information in emoticons and exaggerated punctuations will be lost or they need to be handled separately.
There is sentiment training data at CMU somewhere. I can't remember the link. CMU has done a lot on twitter and sentiment analysis:
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public Opinion Polls
I wrote an English vs. not-English Naive Bayes classifier for Twitter, made an example dev/test set, and it was 98% accurate. I think that sort of thing is always pretty good if you are just trying to understand the problem, but a package like SentiWordNet might give you a head start.
The problem is defining what makes a tweet subjective or objective! It's important to understand that machine learning is less about the algorithm and more about the quality of the data.
You mention that 75% accuracy is all you need... what about recall? With the right training data you might be able to reach that accuracy, at the cost of lower recall.
The DynamicLMClassifier in LingPipe works pretty well.
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
Can you show me a simple example using http://www.nltk.org/code to determine whether a string is about a happy or upset mood?
NLTK cannot do this out of the box, but if you are looking for related research in that area, take a look at this paper on Offensive Language Detection. The same methods could be adapted to classify comments as happy/unhappy instead of offensive/not offensive. The primary software package used in that project for text classification is WEKA; it uses multiple classifiers, trained on previous examples, to determine whether language is offensive or not (with a tunable threshold).
Pattern is also worth a test drive: you can see two opinion mining experiments right on the project homepage.
http://www.clips.ua.ac.be/pages/pattern-examples-100days
http://www.clips.ua.ac.be/pages/pattern-examples-elections
Nopey.
This is a task far beyond the capabilities of NLTK or any grammatical parser that is known or can be realistically imagined. Look at the NLTK Book to see what sorts of tasks it can accomplish which are far, far from your stated purpose.
As a cheap example:
I really enjoyed using your paper to train my dog.
Parse that up with NLTK and you can get
[('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'),
('using', 'VBG'), ('your', 'PRP$'), ('paper', 'NN'),
('to', 'TO'), ('train', 'VB'), ('my', 'PRP$'), ('dog', 'NN')]
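(For reference, tags like those above can be produced with something along these lines; the commented download calls are only needed once on a fresh NLTK install:)
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time model downloads
tokens = nltk.word_tokenize("I really enjoyed using your paper to train my dog")
print(nltk.pos_tag(tokens))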
A parse tree would tell me that 'enjoyed' is the central (past-tense) verb of this simple sentence. To enjoy something is good. To train something is generally a good thing. Gerunds, nouns, comparatives, and such are relatively neutral. So give this a Good score of 0.90.
Except I really mean that I either hit my dog with your paper or let it excrete on the paper, which you'd probably consider not a Good thing.
Hire a person for this recognition task.
Added for those who imagine that even trained classifiers are of much use:
Classify this real entry from a real customer review corpus using any classifier you like trained on any dataset you like:
This camera keeps on autofocussing in auto mode with a buzzing sound which can't be stopped. It would be really good if they have given an option to stop this autofocussing. If you want to have the date and time on the image, it's only through their software which reads the image's date and time from the image's meta-data. So if you use your card reader and copy images - you got to once again open them through their software to put the date and time. In that too, there isn't a direct way to add date and time - you got to say 'print images' to a different directory in which there is an option to specify the date and time. Even the slightest of the shakes totally distorts your image. Indoor images weren't so clear. You got to have flash 'on' to get it even though your room is well lit. The lens cap is a really annoying. the movie clips taken will always have some 'noise' in it - you can't avoid that.
The worst mood classification I obtained was "totally equivocal" yet humans can easily determine that this is anything but complimentary. This wasn't a randomly picked datum, rather one that was selected for negative bias without "hate" or "suxz" or similar.
You're looking for a technique that uses a machine learning classifier to determine whether a piece of text is positive or negative. There have been various attempts at this by a number of research teams (e.g. http://research.yahoo.com/pub/2387 and http://lingcog.iit.edu/doc/appraisal_sentiment_cikm.pdf); we can get about 80% to 90% accuracy at determining whether a product review is positive or negative.
Due to the brevity of your question, it's not obvious to me whether determining whether a product review is positive or negative is the same task you're trying to accomplish, or merely a related task, but I'd suggest starting simple with bag-of-words classification with a Bayesian classifier (which NLTK should be able to handle), and then improve your techniques from there depending on how the accuracy turns out.
Unfortunately, I've never used NLTK (nor Python for that matter) so I can't give you a code example of how to use NLTK for this.
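As a starting point (not from the answer above, just an illustration of the bag-of-words Bayesian approach it describes), a minimal sketch with NLTK might look roughly like this:
import nltk

# Tiny made-up training set; in practice you would use a real labelled corpus.
train = [("I loved the bacon wrapped dates", "pos"),
         ("The WORST EXPERIENCE EVER", "neg"),
         ("Seriously flavorful delights, folks", "pos"),
         ("Horrible - do not waste your time and money", "neg")]

def bag_of_words(text):
    # Every lowercased token becomes a boolean feature.
    return {word: True for word in text.lower().split()}

featuresets = [(bag_of_words(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(bag_of_words("what a delight, loved it")))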