Understanding TF-IDF scores - python-3.x

I am trying to understand how the tf and idf scores are calculated when we vectorize a text document using TfidfVectorizer.
I understand tf-idf ranking in two possible ways, which I describe below.
1. tf = ranking a single word based on how often it repeats in this document, and idf = ranking the same word based on how often it is repeated in a built-in "database-like" collection in scikit-learn where almost all possible words are collected. Here I assume this built-in database to be the corpus.
2. tf = ranking a single word based on how often it repeats in the line of the document currently being read by TfidfVectorizer, and idf = ranking based on how many times it is repeated in the entire document that is being vectorized.
Could someone please explain whether either of these understandings is correct, and if not, correct what is wrong?

The exact answer is in sklearn documentation:
... the term frequency, the number of times a term occurs in a given document, is multiplied with idf component, which is computed as
idf(t) = log[(1 + n_d) / (1+df(d,t))] + 1,
where n_d is the total number of documents, and df(d,t) is the number of documents that contain term t.
So your first item is correct about tf, but both items miss the point that idf is the inverse document frequency: it is based on the ratio of the total number of documents to the number of documents that contain the term at least once. The formula takes the log of that ratio to make it flatter, and its exact form can be adjusted through the class arguments (for example, smooth_idf).
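For example, here is a minimal sketch (the toy documents are made up) showing that the idf_ attribute of a fitted TfidfVectorizer matches this formula with the default smooth_idf=True:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["the cat sat", "the cat sat on the mat", "the dog barked"]
vec = TfidfVectorizer()  # smooth_idf=True by default
vec.fit(docs)

# "cat" appears in 2 of the 3 documents, so n_d = 3 and df(d, t) = 2.
manual_idf = np.log((1 + 3) / (1 + 2)) + 1
print(manual_idf)                        # ~1.2877
print(vec.idf_[vec.vocabulary_["cat"]])  # same value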

Related

Quanteda: Removing documents with low occurrence of word x

When reading on methods of textual analysis, some eliminate documents with "10% lowest density score", that is, documents that are relatively long compared to the occurrence of a certain keyword. How can I achieve a similar result in quanteda?
I've created a corpus using a query of the words "refugee" and "asylum seeker". Now I would like to remove all documents where the count frequency of refugee|asylum_seeker is below 3. However, I imagine it is also possible to use the relative frequency if document length is to be taken into account.
Could someone help me? The solution in my head looks like this, however I don't know how to implement it.
For count frequency: Add counts of occurrences of refugee|asylum_seeker per document and remove documents with an added count below 3.
For relative frequency: Inspect the overall average relative frequency of both words refugee and asylum_seeker, to then calculate the per row relative frequencies of the features and apply a function to remove all documents with a relative frequency of both features below X.
Create a dfm from your tokenised corpus, using dfmat <- dfm(your_tokens).
Then drop the documents that fall below the threshold with dfm_subset(), which subsets a dfm by document (dfm_remove() operates on features, not documents):
dfmat_sub <- dfm_subset(
  dfmat,
  as.logical(dfmat[, "refugee"] >= 3 &
             dfmat[, "asylum_seeker"] >= 3)
)
This keeps only the documents in which each of the two terms occurs at least 3 times; if you want the combined count instead, sum the two columns before comparing.

Turn list of tuples into binary tensors?

I have a list of tuples in the form below. Each tuple represents a pair of movies that a given user liked. Together, the tuples capture every combination of movie likes found in my data.
[(movie_a,movie_b),...(movie_a,movie_b)]
My task is to create movie embeddings, similar to word embeddings. The idea is to train a single hidden layer NN to predict the most likely movie which any user might like, given a movie supplied. Much like word embeddings, the task is inconsequential; it's the weight matrix I'm after, which maps movies to vectors.
Reference: https://arxiv.org/vc/arxiv/papers/1603/1603.04259v2.pdf
In total, there are 19,000,000 tuples (training examples). Likewise, there are 9,000 unique movie IDs in my data. My initial goal was to create an input variable X, where each row represented a unique movie_id and each column represented a unique observation. In any given column, only one cell would be set to 1, with all other values set to 0.
As an intermediate step, I tried creating a matrix of zeros of the right dimensions
X = np.zeros([9000,19000000])
Understandably, my computer crashed, simply trying to allocate sufficient memory to X.
Is there a memory efficient way to pass my list of values into PyTorch, such that a binary vector is created for every training example?
I also tried randomly sampling 500,000 observations, but passing (9000, 500000) to np.zeros() similarly resulted in another crash.
My university has a GPU server available, and that's my next stop. But I'd like to know if there's a memory efficient way that I should be doing this, especially since I'll be using shared resources.
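One memory-friendly pattern (a sketch with made-up toy data; names such as movie_to_idx and MoviePairs are just illustrative) is to store only integer indices and build the one-hot vectors per mini-batch inside a DataLoader, so the full 9,000 x 19,000,000 matrix is never materialised:

import torch
from torch.utils.data import Dataset, DataLoader

# Toy stand-ins for the real data: (movie_a, movie_b) like-pairs and a
# mapping from movie ID to a dense integer index (about 9,000 in practice).
pairs = [("m1", "m2"), ("m2", "m3"), ("m1", "m3"), ("m3", "m4")]
movie_to_idx = {"m1": 0, "m2": 1, "m3": 2, "m4": 3}
n_movies = len(movie_to_idx)

class MoviePairs(Dataset):
    def __init__(self, pairs, movie_to_idx):
        self.pairs = pairs
        self.movie_to_idx = movie_to_idx
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, i):
        a, b = self.pairs[i]
        return self.movie_to_idx[a], self.movie_to_idx[b]  # indices only

def collate(batch):
    inputs, targets = zip(*batch)
    # One-hot vectors exist only for the current mini-batch: shape (batch, n_movies).
    x = torch.nn.functional.one_hot(torch.tensor(inputs), num_classes=n_movies).float()
    y = torch.tensor(targets)  # class indices, as expected by CrossEntropyLoss
    return x, y

loader = DataLoader(MoviePairs(pairs, movie_to_idx), batch_size=2,
                    shuffle=True, collate_fn=collate)
for x, y in loader:
    print(x.shape, y.shape)  # e.g. torch.Size([2, 4]) torch.Size([2])

An alternative that avoids one-hot vectors entirely is to feed the integer indices into torch.nn.Embedding; its weight matrix is exactly the movie-to-vector mapping you are after.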

What does "document" mean in an NLP context?

As I was reading about tf-idf on Wikipedia, I was confused by what is meant by the word "document". Does it mean a paragraph?
"The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient."
A document in the tf-idf context can typically be thought of as a bag of words. In a vector space model, each word is a dimension in a very high-dimensional space, where the magnitude of a word vector is the number of occurrences of the word (term) in the document. A document-term matrix is a matrix in which the rows represent documents and the columns represent terms, with each cell holding the number of occurrences of that term in that document. Hope it's clear.
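As a quick illustration (the toy texts are made up), scikit-learn's CountVectorizer builds exactly such a document-term matrix, with one row per document and one column per term:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat", "the dog chased the cat"]
vec = CountVectorizer()
dtm = vec.fit_transform(docs)  # rows = documents, columns = terms

print(vec.get_feature_names_out())  # the terms, in column order
print(dtm.toarray())                # each cell: count of that term in that document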
A "document" is a distinct text. This generally means that each article, book, or so on is its own document.
If you wanted, you could treat an individual paragraph or even sentence as a "document". It's all a matter of perspective.

Document similarity- Odd one out

Let's say I have n documents on a specific topic, each giving certain details. I want to find the documents that are not similar to the majority of the documents. As vague as this might seem, I know how to find the cosine similarity between 2 documents. But let's say I know I have 10 documents that are similar to each other, and I introduce an 11th document; I need a way to judge how similar this document is to those 10 collectively, and not just to each individual document.
I am working with scikit learn, so an answer or technique with its reference will help!
Represent each document as a bag of words and use the tf-idf weight to represent a word in a particular document. Then compute the cosine similarity of your target document with all n documents, sum the similarity values, and normalize (divide the final value by n). That should give you a reasonable similarity between the n documents and your target document.
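For instance, a rough sketch of that approach with scikit-learn (the toy documents stand in for the 10 known documents and the new 11th one):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_docs = [
    "the court granted asylum to the refugee family",
    "refugee resettlement programs expanded this year",
    "asylum seekers face long processing times",
]
new_doc = "quarterly earnings rose on strong chip demand"

vec = TfidfVectorizer()
X = vec.fit_transform(known_docs + [new_doc])

# Cosine similarity of the new document against each known document.
sims = cosine_similarity(X[-1], X[:-1]).ravel()

# Sum and normalize by n, i.e. the mean similarity to the collection.
score = sims.sum() / len(known_docs)
print(sims, score)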
You can also consider mutual information (sklearn.metrics.mutual_info_score), KL-divergence to measure similarity/difference between two documents. Note that if you want to use them, you need to represent documents as a probability distribution. To compute probability of a term in a document, you can simply use the following formula:
Probability(w) = TF(w) / TTF(w)
where
TF(w) = term frequency of word w in document d
TTF(w) = total term frequency of word w (the sum of its tf over all documents)
I believe any one of these will give you a reasonable idea of the similarity/dissimilarity between the n documents and your target document.
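As a rough sketch of the KL-divergence route under that formula (toy documents; a small smoothing constant is added so the divergence stays finite, and scipy renormalises each row into a proper distribution):

from scipy.stats import entropy
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "refugees were granted asylum",
    "the asylum application of the refugee was approved",
    "stock markets rallied after the earnings report",
]
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Probability(w) = TF(w) in this document / TTF(w) over all documents.
p = tf / tf.sum(axis=0) + 1e-12

# KL divergence of document 0 from documents 1 and 2 (lower = more similar);
# scipy.stats.entropy(p, q) normalises p and q before computing KL(p || q).
print(entropy(p[0], p[1]), entropy(p[0], p[2]))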

Best way to rank sentences based on similarity from a set of Documents

I want to know the best way to rank sentences based on similarity from a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Lets take Document 1 as primary, i.e output will contain sentences from this document.
4. Output should be list of sentences ranked in such a way that sentence with FIRST rank is the most similar sentence in all 5 documents, then 2nd then 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming. Words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims" match.
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words; this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list with such words in it, a "bad list".
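For instance, a tiny sketch of stemming plus a hand-made stop list with NLTK (real stop lists are much longer):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("swimming"), stemmer.stem("swims"))  # both reduce to "swim"

stop_list = {"the", "a", "and", "of"}  # toy stop list
tokens = ["the", "cat", "and", "a", "dog", "swims"]
print([stemmer.stem(t) for t in tokens if t not in stop_list])  # ['cat', 'dog', 'swim']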
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes as input the similar words and gives a value of "similarity". Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are weighted by the total word frequency so that when uncommon words match, they are given more statistical weight.
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful as you can have a huge variety of sentence structures that really mean the same thing. The most useful information of similarity is just in the words. I've talked about document matching, but for your purposes, a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point, you can compute the Soundex code for each word and then compare documents based on the frequencies of those codes.
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.
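For instance, a rough sketch of that idea with scikit-learn (the sentences and documents are toy stand-ins):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc1_sentences = [
    "Refugees applied for asylum at the border.",
    "The weather was unusually warm that spring.",
    "Asylum claims are reviewed by an immigration judge.",
]
other_docs = [
    "The immigration court processes asylum applications from refugees.",
    "Judges review each asylum claim individually.",
    "Border agencies reported more refugee arrivals.",
    "A new policy changed how claims are filed.",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(doc1_sentences + other_docs)
sent_vecs, doc_vecs = X[:len(doc1_sentences)], X[len(doc1_sentences):]

# Average cosine similarity of each Document 1 sentence to the other documents.
scores = cosine_similarity(sent_vecs, doc_vecs).mean(axis=1)

# Rank the sentences from most to least similar across the document set.
for i in np.argsort(scores)[::-1]:
    print(round(float(scores[i]), 3), doc1_sentences[i])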
