Turn list of tuples into binary tensors? - pytorch

I have a list of tuples in the form below. Each tuple represents a pair of movies that a single user liked. Together, the tuples capture every combination of movie likes found in my data.
[(movie_a,movie_b),...(movie_a,movie_b)]
My task is to create movie embeddings, similar to word embeddings. The idea is to train a single-hidden-layer NN to predict the movie a user is most likely to also like, given a supplied movie. As with word embeddings, the prediction task itself is inconsequential; it's the weight matrix I'm after, which maps movies to vectors.
Reference: https://arxiv.org/vc/arxiv/papers/1603/1603.04259v2.pdf
In total, there are 19,000,000 tuples (training examples), and there are 9,000 unique movie IDs in my data. My initial goal was to create an input variable X where each row represented a unique movie_id and each column represented a unique observation. In any given column, exactly one cell would be set to 1, with all other values set to 0.
As an intermediate step, I tried creating a matrix of zeros of the right dimensions:
X = np.zeros([9000,19000000])
Understandably, my computer crashed simply trying to allocate enough memory for X.
Is there a memory-efficient way to pass my list of values into PyTorch, such that a binary vector is created for every training example?
I also tried randomly sampling 500,000 observations, but passing (9000, 500000) to np.zeros() resulted in another crash.
My university has a GPU server available, and that's my next stop. But I'd like to know whether there's a memory-efficient way I should be doing this, especially since I'll be using shared resources.
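For what it's worth, a minimal sketch of one memory-efficient alternative (the dimensions and variable names below are illustrative assumptions, not from the question): since multiplying a one-hot vector by a weight matrix is equivalent to looking up one row of that matrix, torch.nn.Embedding lets you feed integer movie indices directly and never materialize the 9000 x 19,000,000 one-hot matrix.

import torch
import torch.nn as nn

num_movies = 9000     # unique movie IDs, assumed mapped to integers 0..8999
embedding_dim = 64    # hypothetical embedding size

# nn.Embedding(index) == one-hot(index) @ weight matrix,
# but only integer indices are ever held in memory.
model = nn.Sequential(
    nn.Embedding(num_movies, embedding_dim),  # the weight matrix you're after
    nn.Linear(embedding_dim, num_movies),     # scores over all movies
)

pairs = [(0, 42), (17, 3), (8999, 512)]        # toy (movie_a, movie_b) tuples
inputs = torch.tensor([a for a, b in pairs])   # shape: (batch,)
targets = torch.tensor([b for a, b in pairs])  # shape: (batch,)

logits = model(inputs)                         # shape: (batch, num_movies)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()

# After training, each row of this matrix is one movie's vector:
movie_vectors = model[0].weight.detach()       # shape: (num_movies, embedding_dim)

If you really do need explicit one-hot vectors, building them per batch (e.g. with torch.nn.functional.one_hot on a few thousand indices at a time) keeps memory bounded by the batch size rather than the full dataset.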

Related

Get closest text entry of a list from a string

I am trying to build an RNN model for text classification and I am currently building my dataset.
I am trying to do some of the work automatically and I'm using an API that gets me some information for each text I send to it.
So basically:
For each text in my dataframe, I have a df['label'] entry that contains a one- to three-word string.
I have a vocabulary list (my future classes), and for each df['label'] item I want to assign the vocabulary item that is closest to it in meaning.
So I need a way to measure how close in meaning each label is to each item in my vocabulary list.
Any help?
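One hedged sketch of how this could be approached (the model name and vocabulary below are assumptions, not from the question): embed both the labels and the vocabulary items with pretrained word vectors and pick the vocabulary item with the highest similarity, for example with spaCy.

import spacy

nlp = spacy.load("en_core_web_md")  # medium English model ships with word vectors

vocabulary = ["sports", "politics", "technology"]  # hypothetical future classes
vocab_docs = [nlp(v) for v in vocabulary]

def closest_class(label: str) -> str:
    # Return the vocabulary item whose vectors are most similar to the label.
    label_doc = nlp(label)
    scores = [label_doc.similarity(v) for v in vocab_docs]
    return vocabulary[scores.index(max(scores))]

print(closest_class("football match"))  # ideally "sports"

Sentence-embedding models tend to handle multi-word labels better than averaged word vectors, so they are worth trying if this proves too coarse.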

WAG matrix implementation

I am working with some programs in Python 3.4. I want to use the WAG matrix for phylogeny inference, but I am confused about the formula it implements.
For example, in phylogenetics, when a sequence file is used to generate a distance-based matrix, a formula called the "p-distance" is applied; from this formula and some standard values for the sequence data, a matrix is generated that is later used to construct a tree. When a character-based method of tree construction is used, "WAG" is one of the matrices used for likelihood tree construction. My question is: if one wants to implement this matrix, what is its formula basis?
I want to write code for this implementation, but first I need to understand the logic behind the WAG matrix.
I have an aligned protein sequence file and I need to generate a "WAG" matrix from it. I have been studying the literature on the WAG matrix, but I could not work out how it performs its calculation. Does it have a specific formula? (For example, the p-distance is a formula used by distance matrices.) I want to give an aligned protein sequence file as input and have a matrix generated as output.
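A hedged sketch of the underlying logic (the numbers below are placeholders, not the real WAG values): unlike the p-distance, WAG is not computed from your alignment by a closed-form formula. It is a fixed, empirical 20x20 amino-acid rate matrix, estimated once by maximum likelihood from a large database of alignments (Whelan & Goldman, 2001). In likelihood tree construction you use it by turning the rate matrix Q into transition probabilities over a branch length t via P(t) = exp(Qt).

import numpy as np
from scipy.linalg import expm

n_states = 20  # amino acids

# Placeholder exchangeabilities S and equilibrium frequencies pi;
# the real numbers are the published WAG values, not computed here.
rng = np.random.default_rng(0)
S = rng.random((n_states, n_states))
S = (S + S.T) / 2            # exchangeabilities are symmetric
np.fill_diagonal(S, 0.0)
pi = np.full(n_states, 1.0 / n_states)

# Instantaneous rate matrix: Q_ij = S_ij * pi_j for i != j,
# with each row summing to zero.
Q = S * pi
np.fill_diagonal(Q, -Q.sum(axis=1))

# Likelihood calculations use transition probabilities over a branch
# of length t: P(t) = exp(Q t); P[i, j] is the probability that amino
# acid i is replaced by j along that branch.
t = 0.5
P = expm(Q * t)

So "generating WAG from your own alignment" would actually mean re-estimating such a rate matrix from your data by maximum likelihood; if you just want to use WAG, the published S and pi values can be hard-coded or read from a file.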

Understanding TF-IDF scores

I am trying to understand how tf and idf scores are calculated when we vectorize a text document using TfidfVectorizer.
I understand the tf-idf ranking in two possible ways, which I write below.
1. tf = ranking a single word based on how often it repeats in this document, and idf = ranking the same word based on how often it appears in a built-in "database-like" collection in scikit-learn where almost all possible words are collected. Here I assume this built-in database to be the corpus.
2. tf = ranking a single word based on how often it repeats in the line of the document currently being read by TfidfVectorizer, and idf = ranking based on how many times it is repeated in the entire document being vectorized.
Could someone please explain whether either of these is correct? And if not, please correct what is wrong in my understanding.
The exact answer is in sklearn documentation:
... the term frequency, the number of times a term occurs in a given document, is multiplied with idf component, which is computed as
idf(t) = log[(1 + n_d) / (1 + df(d, t))] + 1,
where n_d is the total number of documents, and df(d,t) is the number of documents that contain term t.
So your first item is correct about tf, but both items miss the point that idf is the inverse document frequency: the ratio between the total number of documents and the number of documents that contain the term at least once. The formula takes the log of this ratio to make the function "flatter", and it can be adjusted through the class arguments.
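To make the smoothed formula concrete, here is a small demonstration on a hypothetical three-document corpus (the corpus and variable names are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = TfidfVectorizer(smooth_idf=True)  # default: idf = ln((1+n_d)/(1+df)) + 1
X = vectorizer.fit_transform(docs)

# idf_ holds one value per vocabulary term; rarer terms score higher.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:6s} idf = {idf:.3f}")

# Check the formula by hand for 'cat', which occurs in 1 of the 3 documents:
n_d, df = 3, 1
print(np.log((1 + n_d) / (1 + df)) + 1)  # matches vectorizer.idf_ for 'cat'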

Document similarity - odd one out

Let's say I have n documents on a specific topic, each giving certain details. I want to find the documents that are not similar to the majority of the documents. As vague as this might seem, I know how to find the cosine similarity between 2 documents. But let's say I "know" I have 10 documents that are similar to each other, and I introduce an 11th document; I need a way to judge how similar this document is to those 10 collectively, and not just to each individual document.
I am working with scikit-learn, so an answer or technique with a reference will help!
Represent each document as a bag of words, using tf-idf weights for the words in a particular document. Then compute the cosine similarity of your target document with each of the n documents, sum all the similarity values, and normalize (divide the final value by n). This should give you a reasonable similarity between the n documents and your target document.
You can also consider mutual information (sklearn.metrics.mutual_info_score) or KL divergence to measure the similarity/difference between two documents. Note that to use them, you need to represent the documents as probability distributions. To compute the probability of a term in a document, you can simply use the following formula:
Probability(w) = TF(w) / TTF(w)
where
TF(w) = term frequency of word w in document d
TTF(w) = total term frequency of word w (the sum of its tf over all documents)
I believe any one of these will give you a reasonable idea of the similarity/dissimilarity between the n documents and your target document.
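A minimal sketch of the first suggestion (the documents below are placeholders): vectorize the collection and the target together, then average the target's cosine similarity over the n collection documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

collection = [
    "energy system modeling with python",
    "modeling energy networks in python",
    "python tools for power grid simulation",
]
target = "a sourdough bread recipe"  # the candidate "odd one out"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(collection + [target])

# The last row is the target; compare it against the n collection rows.
sims = cosine_similarity(X[-1], X[:-1])  # shape: (1, n)
avg_sim = sims.mean()                    # the normalized sum suggested above
print(f"average similarity to the collection: {avg_sim:.3f}")

A low average flags the document as dissimilar to the group as a whole.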

Connecting exchange names and codes to LCA inventory results

I'm getting into Brightway2 for some energy system modeling, and I'm still getting used to all of the concepts.
I've created a small custom demo database and run lca.lci() and lca.lcia(). lca.inventory and lca.characterized_inventory both return sparse matrices of the results. My question, which may be very simple, is: how can you connect the values in these matrices to the exchange names and keys? I.e., if I wanted to print the results to a file, how would I match the exchanges to the inventory values?
Thanks.
To really understand what is going on, it is useful to understand the difference between "intermediate" data (stored as structured text files) and "processed" data (stored as numpy structured arrays). These concepts are described both here and here.
However, to answer your question directly: what each row and column stands for in the different matrices and arrays (e.g. the lca.inventory matrix, lca.supply_array, lca.characterized_inventory) is recorded in a set of dictionaries associated with your LCA object. These are:
activity_dict: Columns in the technosphere matrix
product_dict: Rows in the technosphere matrix
biosphere_dict: Rows in the biosphere matrix
For example, lca.product_dict yields, in the case of an LCA I just did:
{('ei32_CU_U', '671c1ae85db847083176b9492f000a9d'): 8397,
('ei32_CU_U', '53398faeaf96420408204e309262b8c5'): 536,
('ei32_CU_U', 'fb8599da19dabad6929af8c3a3c3bad6'): 7774,
('ei32_CU_U', '28b3475e12e4ed0ec511cbef4dc97412'): 3051, ...}
with the key in the dictionary being the actual product in my inventory database and the value being its row in the demand_array or the supply_array.
More useful may be the reverse of these dictionaries. Let's say you want to know what a value in e.g. your supply_array refers to; you can create a reverse dictionary using a dict comprehension:
inv_product_dict = {v: k for k, v in lca.product_dict.items()}
and then use it directly to obtain the information you are after. Say you want to know what is in the 10th row of the supply_array: inv_product_dict[10] in my case yields ('ei32_CU_U', '4110733917e1fcdc7c55af3b3f068c72').
The same logic applies to biosphere (or elementary) flows, found in lca.biosphere_dict (in LCA parlance, rows of the B matrix), and to activities, found in lca.activity_dict (columns of the A or B matrices).
Note that you can generate the reverses of activity_dict, product_dict, and biosphere_dict simultaneously using lca.reverse_dict(). The syntax is:
rev_act_dict, rev_product_dict, rev_bio_dict = lca.reverse_dict()
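Putting it together, a minimal sketch of writing results to a file (assuming an lca object already computed as above; the file name is arbitrary): convert the sparse matrix to COO format so each stored value comes with its row/column coordinates, then map those back to keys with the reverse dictionaries.

import csv

rev_act_dict, rev_product_dict, rev_bio_dict = lca.reverse_dict()

coo = lca.characterized_inventory.tocoo()  # row/column/value triples
with open("characterized_inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["biosphere flow key", "activity key", "score"])
    for row, col, value in zip(coo.row, coo.col, coo.data):
        # Rows are biosphere flows, columns are activities.
        writer.writerow([rev_bio_dict[row], rev_act_dict[col], value])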
