I was trying my hand at sentiment analysis in Python 3, and was using the TF-IDF vectorizer with the bag-of-words model to vectorize a document.
So, to anyone who is familiar with that, it is quite evident that the resulting matrix representation is sparse.
Here is a snippet of my code. Firstly, the documents.
tweets = [('Once you get inside you will be impressed with the place.',1),('I got home to see the driest damn wings ever!',0),('An extensive menu provides lots of options for breakfast.',1),('The flair bartenders are absolutely amazing!',1),('My first visit to Hiro was a delight!',1),('Poor service, the waiter made me feel like I was stupid every time he came to the table.',0),('Loved this place.',1),('This restaurant has great food',1),
('Honeslty it did not taste THAT fresh :(',0),('Would not go back.',0),
('I was shocked because no signs indicate cash only.',0),
('Waitress was a little slow in service.',0),
('did not like at all',0),('The food, amazing.',1),
('The burger is good beef, cooked just right.',1),
('They have horrible attitudes towards customers, and talk down to each one when customers do not enjoy their food.',0),
('The cocktails are all handmade and delicious.',1),('This restaurant has terrible food',0),
('Both of the egg rolls were fantastic.',1),('The WORST EXPERIENCE EVER.',0),
('My friend loved the salmon tartar.',1),('Which are small and not worth the price.',0),
('This is the place where I first had pho and it was amazing!!',1),
('Horrible - do not waste your time and money.',0),('Seriously flavorful delights, folks.',1),
('I loved the bacon wrapped dates.',1),('I dressed up to be treated so rudely!',0),
('We literally sat there for 20 minutes with no one asking to take our order.',0),
('you can watch them preparing the delicious food! :)',1),('In the summer, you can dine in a charming outdoor patio - so very delightful.',1)]
X_train, y_train = zip(*tweets)
And the following code to vectorize the documents.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer(lowercase=True)
vectorized = tfidfvec.fit_transform(X_train)
print(vectorized)
When I print vectorized, it does not output a normal matrix; instead, it prints one line per nonzero entry, of the form (row, column) value.
If I'm not wrong, this must be a sparse matrix representation. However, I am not able to comprehend its format, and what each term means.
Also, there are 30 documents, so that explains the 0-29 in the first column. If that's the trend, then I'm guessing the second column is the index of the word, and the last value is its tf-idf? It just struck me while I was typing my question, but kindly correct me if I'm wrong.
Could anyone with experience in this help me understand it better?
Yes: in each line, the first two numbers are the (row, column) position and the third is the value stored at that position. So it is basically listing the positions and values of the nonzero entries.
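If it helps, you can also inspect the same information more explicitly. A small sketch, using scikit-learn's vocabulary accessor and SciPy's COO view of the sparse matrix (on older scikit-learn versions the accessor is get_feature_names() instead of get_feature_names_out()):

feature_names = tfidfvec.get_feature_names_out()  # get_feature_names() on older scikit-learn

# Dense view of the same matrix (only sensible for a corpus this small)
print(vectorized.toarray())

# Walk the nonzero entries: document index, term, tf-idf weight
coo = vectorized.tocoo()
for row, col, value in zip(coo.row, coo.col, coo.data):
    print(f"doc {row}, term '{feature_names[col]}': {value:.3f}")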
I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different question answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with keys: answer (the text) and answer_start (the character index where the answer starts).
I'm trying to understand:
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example: \n')
print(train['answers'][0])
Your question is a bit broad to give you a specific answer, but I will try my best to point you in some directions.
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:
context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
question='How did Republicans feel about Mann in 1919?'
answer='angered' #-> starting at character 365
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder (code example). The last hidden state holds one vector for each input token (tokens != words), and the linear layer is trained to assign high probabilities to tokens that could potentially be the start and end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the answer start and end tokens).
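To make that concrete, here is a rough sketch (not the exact code of any particular training script) of how a character-level answer_start is typically mapped to token start/end labels with a fast Hugging Face tokenizer, reusing the context, question, and answer from the example above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

answer = "angered"
answer_start = 365
answer_end = answer_start + len(answer)

enc = tokenizer(question, context, return_offsets_mapping=True, truncation=True)

# offset_mapping holds (char_start, char_end) per token; sequence_ids() marks question (0) vs context (1) tokens
start_token = end_token = None
for i, (char_start, char_end) in enumerate(enc["offset_mapping"]):
    if enc.sequence_ids()[i] != 1:
        continue  # skip question and special tokens
    if char_start <= answer_start < char_end:
        start_token = i
    if char_start < answer_end <= char_end:
        end_token = i

# start_token and end_token are the labels the start/end classification heads are trained against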
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
You should go through this process; otherwise, how should someone know where the answer starts in your context? They can of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer-start position avoids that ambiguity and lets your users use the dataset right away with one of the many extractive question answering scripts that are already available out there.
If so, is there a programmatic way to do this to avoid manual effort?
You could simply loop through your dataset and use str.find:
context.find(answer)
Output:
365
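For example, something along these lines; the field names here (context, answer) are just placeholders for whatever your custom dataset uses, and with a Hugging Face datasets.Dataset you could apply it via .map:

def add_answer_start(example):
    # str.find returns -1 if the answer text does not appear verbatim in the context
    example["answer_start"] = example["context"].find(example["answer"])
    return example

my_dataset = my_dataset.map(add_answer_start)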
I am new to word/paragraph embeddings and am trying to understand them via Doc2Vec in gensim. I would like to seek advice on whether my understanding is incorrect. My understanding is that Doc2Vec is potentially able to return documents that have semantically similar content. As a test, I tried the following and have the following questions.
Question 1: I noted that every training run with the exact same parameters and examples results in a model that produces very different results from previous runs (e.g. different vectors and a different ranking of similar documents every time). Why is this so non-deterministic? As such, can this be reliably used for any practical work?
Question 2: Why am I not getting the tag IDs of the top similar documents instead?
Results: [('day',0.477),('2016',0.386)....
Question 2 answer: The problem was due to using model.most_similar; model.docvecs.most_similar should be used instead.
Please advise if I have misunderstood anything.
Data prep
I created multiple documents with one sentence each, deliberately made so that they are distinctly different semantically.
A: It is a fine summer weather, with the birds singing and sun shining bright.
B: It is a lovely day indeed, if only i had a degree in appreciating.
C: 2016-2017 Degree in Earth Science Earthly University
D: 2009-2010 Dip in Life and Nature Life College
Query: Degree in Philosophy from Thinking University from 2009 to 2010
Training
I trained the documents (tokens as words, running index as tag)
import gensim
from gensim.models.doc2vec import TaggedDocument

tdlist = []
docstring = ['It is a fine summer weather, with the birds singing and sun shining bright.',
             'It is a lovely day indeed, if only i had a degree in appreciating.',
             '2016-2017 Degree in Earth Science Earthly University',
             '2009-2010 Dip in Life and Nature Life College']
counter = 1
for para in docstring:
    tokens = tokenize(para)  # This will also strip punctuation
    td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(counter))
    tdlist.append(td)
    counter = counter + 1

model = gensim.models.Doc2Vec(tdlist, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)
Inference
I then attempted to infer the query. Although there are many words in the query missing from the vocabulary, I would expect the closest document similarity results to be C and D. But the results only gave me a list of 'words' followed by a similarity score. I am unsure if my understanding is wrong. Below is my code extract.
mydocvector=model.infer_vector(['Degree' ,'in' ,'Philosophy' ,'from' ,'Thinking' ,'University', 'from', '2009', 'to', '2010'])
print(model.docvecs.most_similar(positive=[mydocvector]))
Doc2Vec doesn't work well on toy-sized datasets - few documents, few total words, few words per document. You'll absolutely want more documents than vector dimensions (size), and ideally tens-of-thousands of documents or more.
The second argument to TaggedDocument should be a list of tags. By supplying a single string-of-an-int, each of its elements (characters) will be seen as a tag. (With just documents 1 to 4 this won't yet hurt, but as soon as you have document 10, Doc2Vec will see its tags as '1' and '0', unless you supply it as ['10'], a single-element list.)
Yes, to find most-similar documents you use model.docvecs.most_similar() rather than model.most_similar() (which only operates on learned words, if any).
You are using dm=0 mode, which is a pretty good starting idea – it's fast and often a top-performer. But note that this mode doesn't train word-vectors too. So anything you ask for from the top model, like model['summer'] or model.most_similar('sun'), will be nonsense results based on randomly-initialized but never-trained words. (If you need words trained too, either add dbow_words=1 to the dm=0 mode, or use a dm=1 mode. But for pure doc-vectors, dm=0 is a pretty good choice.)
There's no need to call train() in a loop - or indeed at all, given the line above it. The form you've used to instantiate Doc2Vec, with an actual corpus (tdlist) as the first argument, already triggers model setup and training, using the default number of iter passes (5) and the supplied alpha and min_alpha. Now, for Doc2Vec training you often want more passes (10 to 20 are common, though smaller datasets might benefit from even more). And for any training, for proper gradient descent, you want the effective learning rate alpha to gradually decline to a negligible value, such as the default 0.0001 (rather than being forced to stay at its starting value).
The only situation where you'd usually call train() explicitly is if you instantiate the model without a corpus. In that case, you'd need to both call model.build_vocab(tdlist) (to let the model initialize with a discovered vocabulary), and then some form of train() - but you'd still need only one call to train, supplying the desired number of passes. (Allowing the default model.iter 5 passes, inside an outer loop of 200 iterations, means a total of 1000 passes over the data... and all at the same fixed alpha, which is not proper gradient-descent.)
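Putting those points together, a corrected setup might look roughly like the sketch below (parameter names follow current gensim, which uses vector_size/epochs where older releases used size/iter; tokenize() is the same helper as in the question):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tags must be a list of strings, not a bare string
tdlist = [TaggedDocument(words=tokenize(para), tags=[str(i)])
          for i, para in enumerate(docstring)]

# One instantiation performs vocabulary setup and training; no extra train() loop is needed
model = Doc2Vec(tdlist, dm=0, vector_size=20, epochs=20,
                alpha=0.025, min_alpha=0.0001, min_count=1)  # raise min_count on a larger corpus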
When you have a beefier dataset, you may find results improve with a higher min_count. Usually words that appear only a few times can't contribute much meaning, and thus only serve as noise slowing training and interfering with other vectors becoming more expressive. (Don't assume "more words must equal better results".) Throwing out the singletons, or more, usually helps.
Regarding inference, almost none of the words in your inference text are in the training set. (I only see 'Degree', 'in', and 'University' repeated.) So in addition to all the issues above, inferring a good vector for the example text would be hard. With a richer training set, you'd likely get better results. It also often helps to increase the steps optional parameter to infer_vector() far above its default of 5.
I'm doing a small research project where I should try to split financial news article headers into positive and negative classes. For classification I'm using an SVM approach. The main problem I see now is that not many features can be produced for ML. News article headers contain a lot of named entities and other "garbage" elements (from my point of view, of course).
Could you please suggest ML features which can be used for training? Current results are: precision = 0.6, recall = 0.8.
Thanks
The task is not trivial at all.
The straightforward approach would be to find or create a training set. That is a set of headers with positive news and a set of headers with negative news.
You turn the training set into a TF-IDF representation and then you train a linear SVM to separate the two classes. Depending on the quality and size of your training set you can achieve something decent - I'm not sure about reaching a 0.7 break-even point, though.
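As a minimal sketch of that baseline, assuming headers is a list of headline strings and labels a list of 0/1 classes (both names are placeholders for your own data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# TF-IDF features feeding a linear SVM
clf = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
clf.fit(headers, labels)
print(clf.predict(["Shares plunge after profit warning"]))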
Then, to get better results you need to go for NLP approaches. Try using a part-of-speech tagger to identify adjectives (trivial), and then score them using some sentiment DB like SentiWordNet.
There is an excellent overview of sentiment analysis by Bo Pang and Lillian Lee that you should read: "Opinion Mining and Sentiment Analysis" (Foundations and Trends in Information Retrieval, 2008).
How about these features?
1. Length of the article header in words
2. Average word length
3. Number of words in a dictionary of "bad" words, e.g. dictionary = {terrible, horrible, downturn, bankruptcy, ...}. You may have to generate this dictionary yourself.
4. Ratio of words in that dictionary to total words in the sentence
5. Similar to 3, but with a "good"-word dictionary, e.g. dictionary = {boon, booming, employment, ...}
6. Similar to 4, but with the "good"-word dictionary
7. Time of the article's publication
8. Date of the article's publication
9. The medium through which it was published (you'll have to do some subjective classification)
10. A count of certain punctuation marks, such as the exclamation point
If you're allowed access to the actual article, you could use surface features from it, such as its total length and perhaps even the number of responses or the level of opposition to the article. You could also look at other dictionaries online, such as Ogden's 850-word Basic English list, and see whether good or bad articles are likely to draw many words from them. I agree that it seems difficult to come up with a long list (e.g. 100 features) of useful features for this purpose.
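A rough sketch of how several of the listed features could be computed for a single header; the seed dictionaries here are tiny hypothetical stand-ins that you would have to curate yourself:

import string

bad_words = {"terrible", "horrible", "downturn", "bankruptcy"}
good_words = {"boon", "booming", "employment"}

def header_features(header):
    words = [w.strip(string.punctuation).lower() for w in header.split()]
    n = len(words) or 1
    return {
        "length": len(words),                                  # feature 1
        "avg_word_len": sum(len(w) for w in words) / n,        # feature 2
        "bad_count": sum(w in bad_words for w in words),       # feature 3
        "bad_ratio": sum(w in bad_words for w in words) / n,   # feature 4
        "good_count": sum(w in good_words for w in words),     # feature 5
        "good_ratio": sum(w in good_words for w in words) / n, # feature 6
        "exclamations": header.count("!"),                     # feature 10
    }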
iliasfl is right, this is not a straightforward task.
I would use a bag of words approach but use a POS tagger first to tag each word in the headline. Then you could remove all of the named entities - which as you rightly point out don't affect the sentiment. Other words should appear frequently enough (if your dataset is big enough) to cancel themselves out from being polarised as either positive or negative.
One step further, if you still aren't close, would be to select only the adjectives and verbs from the tagged data, as they are the words that tend to convey emotion or mood.
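A short sketch of that idea with NLTK's default tokenizer and tagger (requires the punkt and averaged_perceptron_tagger data packages; the headline is just an invented example):

import nltk

headline = "Company shares plunge after disappointing quarterly results"
tagged = nltk.pos_tag(nltk.word_tokenize(headline))

# Keep only adjectives (JJ*) and verbs (VB*), which tend to carry the sentiment
sentiment_words = [word for word, tag in tagged if tag.startswith(("JJ", "VB"))]
print(sentiment_words)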
I wouldn't be too disheartened by your precision and recall figures, though; an F-score of 0.8 and above is actually quite good.
I need a reliable and accurate method to filter tweets as subjective or objective. In other words I need to build a filter in something like Weka using a training set.
Are there any training sets available which could be used as a subjective/objective classifier for Twitter messages or other domains which may be transferable?
For research and non-profit purposes, SentiWordNet gives you exactly what you want. A commercial license is available too.
SentiWordNet : http://sentiwordnet.isti.cnr.it/
Sample Java Code: http://sentiwordnet.isti.cnr.it/code/SWN3.java
Related Paper: http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf
The other approach I would try:
Example
Tweet 1: #xyz u should see the dark knight. Its awesme.
1) First, do a dictionary lookup of the words for their meanings.
"u" and "awesme" will not return anything.
2) Then check against the known abbreviations/shorthands and substitute matches with their expansions
(Some resources: netlingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations)
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesme.
3) Then feed the remaining words into the spell checker and substitute each with the best match (not always ideal, and error-prone for short words)
Related Link:
Looking for Java spell checker library
Now the original tweet will look like:
Tweet 1: #xyz you should see the dark knight. Its awesome.
4) Split and feed the tweet into SWN3, aggregate the result
The problems with this approach are that:
a) Negations have to be handled outside SWN3.
b) Information in emoticons and exaggerated punctuation will be lost, or has to be handled separately.
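If you would rather stay in Python than use the Java SWN3 helper, NLTK's SentiWordNet corpus reader can play the same role. A rough sketch of steps 2-4, with a tiny hypothetical abbreviation dictionary standing in for a resource like netlingo, and crudely scoring only the first sense of each word:

from nltk.corpus import sentiwordnet as swn  # requires the 'sentiwordnet' and 'wordnet' NLTK data

abbreviations = {"u": "you", "gr8": "great"}  # hypothetical stand-in for a full shorthand dictionary

tweet = "#xyz u should see the dark knight. Its awesome."
words = [abbreviations.get(w.lower(), w) for w in tweet.split()]

score = 0.0
for w in words:
    synsets = list(swn.senti_synsets(w.strip(".,!#").lower()))
    if synsets:  # crude: take only the first sense of each word
        score += synsets[0].pos_score() - synsets[0].neg_score()

print("aggregate polarity:", score)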
There is sentiment training data at CMU somewhere. I can't remember the link. CMU has done a lot on twitter and sentiment analysis:
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public Opinion Polls
I wrote an English vs. not-English Naive Bayes classifier for Twitter, made an example dev/test set, and it was 98% accurate. I think that sort of thing is always pretty good if you are just trying to understand the problem, but a package like SentiWordNet might give you a head start.
The problem is defining what makes a tweet subjective or objective! It's important to understand that machine learning is less about the algorithm and more about the quality of the data.
You mention 75% accuracy is all you need... but what about recall? If you provide the right training data you might be able to get that, at the cost of lower recall.
The DynamicLMClassifier in LingPipe works pretty well.
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
Can you show me a simple example using NLTK (http://www.nltk.org/) code to determine whether a string is about a happy or upset mood?
NLTK cannot do this out of the box, but if you are looking for some related research in that area, take a look at this paper on Offensive Language Detection. The same methods could be adapted to detect comments which are happy/unhappy rather than offensive/unoffensive. The primary software package used in that project for text classification is called WEKA; it uses multiple classifiers, trained on previous examples, to determine whether language is offensive or not (and in this method uses a tunable threshold).
Pattern is also worth a test drive: you can see two opinion-mining experiments right on the project pages.
http://www.clips.ua.ac.be/pages/pattern-examples-100days
http://www.clips.ua.ac.be/pages/pattern-examples-elections
Nopey.
This is a task far beyond the capabilities of NLTK or any grammatical parser that is known or can be realistically imagined. Look at the NLTK Book to see what sorts of tasks it can accomplish which are far, far from your stated purpose.
As a cheap example:
I really enjoyed using your paper to train my dog.
Parse that up with NLTK and you can get
[('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'),
('using', 'VBG'), ('your', 'PRP$'), ('paper', 'NN'),
('to', 'TO'), ('train', 'VB'), ('my', 'PRP$'), ('dog', 'NN')]
Where the parse tree would tell me that 'enjoyed' is the central (past-tense) verb of the simple sentence. To enjoy something is good. To train something is generally a good thing. Gerunds, nouns, comparatives, and such are relatively neutral. So give this a Good score of 0.90.
Except I really mean that I either hit my dog with your paper or let it excrete on the paper which you'd probably consider a not Good thing.
Hire a person for this recognition task.
Added for those who imagine that even trained classifiers are of much use:
Classify this real entry from a real customer review corpus using any classifier you like trained on any dataset you like:
This camera keeps on autofocussing in auto mode with a buzzing sound which can't be stopped. It would be really good if they have given an option to stop this autofocussing. If you want to have the date and time on the image, it's only through their software which reads the image's date and time from the image's meta-data. So if you use your card reader and copy images - you got to once again open them through their software to put the date and time. In that too, there isn't a direct way to add date and time - you got to say 'print images' to a different directory in which there is an option to specify the date and time. Even the slightest of the shakes totally distorts your image. Indoor images weren't so clear. You got to have flash 'on' to get it even though your room is well lit. The lens cap is a really annoying. the movie clips taken will always have some 'noise' in it - you can't avoid that.
The worst mood classification I obtained was "totally equivocal" yet humans can easily determine that this is anything but complimentary. This wasn't a randomly picked datum, rather one that was selected for negative bias without "hate" or "suxz" or similar.
You're looking for a technique that uses a machine learning classifier to determine whether a piece of text is positive or negative. There have been various attempts at this by a number of research teams (e.g. http://research.yahoo.com/pub/2387 and http://lingcog.iit.edu/doc/appraisal_sentiment_cikm.pdf); we can get about 80% to 90% accuracy at determining whether a product review is positive or negative.
Due to the brevity of your question, it's not obvious to me whether determining whether a product review is positive or negative is the same task you're trying to accomplish, or merely a related task, but I'd suggest starting simple with bag-of-words classification using a Bayesian classifier (which NLTK should be able to handle), and then improving your techniques from there depending on how the accuracy turns out.
Unfortunately, I've never used NLTK (nor Python for that matter) so I can't give you a code example of how to use NLTK for this.
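For what it's worth, a minimal bag-of-words Naive Bayes sketch with NLTK might look something like this; the four training sentences are toy examples purely for illustration:

import nltk  # requires the 'punkt' tokenizer data

train = [("I am so happy about this", "happy"),
         ("this made my day, wonderful", "happy"),
         ("I am really upset and angry", "upset"),
         ("what a terrible, upsetting day", "upset")]

def bag_of_words(text):
    # Each token becomes a boolean feature
    return {word.lower(): True for word in nltk.word_tokenize(text)}

classifier = nltk.NaiveBayesClassifier.train([(bag_of_words(text), label) for text, label in train])
print(classifier.classify(bag_of_words("I feel wonderful and happy today")))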