WAG matrix implementation - python-3.x

I am working with some programs in Python 3.4. I want to use the WAG matrix for phylogeny inference, but I am confused about the formula behind it.
For example, in phylogenetics, when a sequence file is used to generate a distance-based matrix, a formula called "p-distance" is applied; from this formula and some standard values for sequence data, a matrix is generated and later used to construct a tree. When a character-based method is used for tree construction, "WAG" is one of the matrices used for likelihood tree construction. My question is: if one wants to implement this matrix, what formula is it based on?
I want to write code for this, but first I need to understand the logic behind the WAG matrix.
I have an aligned protein sequence file and I need to generate a "WAG" matrix from it. I have been studying the literature on the WAG matrix, but I could not work out how it performs its calculation. Does it have a specific formula? (For example, "p-distance" is a formula used by distance matrices.) I want to give an aligned protein sequence file as input and have a matrix generated as output.
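For reference, the p-distance mentioned above is just the proportion of aligned sites at which two sequences differ. A minimal Python sketch follows. The function names `p_distance` and `distance_matrix`, the gap handling (sites with a gap in either sequence are skipped), and the toy sequences are assumptions made for illustration and are not taken from any particular phylogenetics package.

```python
def p_distance(seq_a, seq_b, gap_chars="-?"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: skip any site where either sequence has a gap.
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment):
    """Pairwise p-distances for a {name: aligned_sequence} mapping."""
    names = list(alignment)
    return {a: {b: p_distance(alignment[a], alignment[b]) for b in names}
            for a in names}


# Toy alignment (made-up sequences, for illustration only).
toy_alignment = {
    "seq1": "MKTAYIAK",
    "seq2": "MKTAHIAK",
    "seq3": "MRT-HIAR",
}
matrix = distance_matrix(toy_alignment)
```

Here `matrix["seq1"]["seq2"]` is 1 differing site out of 8 compared, and the gapped column is left out of every comparison involving seq3.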

Related

Is there a way to call the pcfcross function on groups of marks?

I'm using the pcfcross function to estimate the pair correlation functions (PCFs) between pairs of cell types, indicated by marks. I would now like to expand my analysis to include measuring the PCFs between cell types and groups of cell types. Is there a way to use the pcfcross function on a group of marks?
Alternatively, is there a way to relabel a group of marks as a single mark?
You can collapse several levels of a factor to a single level, using the spatstat function mergeLevels. This will group several types of points into a single type.
However, this may not give you any useful new information. The pair correlation function is a second-order summary, so the pair correlation for the grouped data can be calculated from the pair correlations for the un-grouped data. (See Chapter 7 of the spatstat book).

Turn list of tuples into binary tensors?

I have a list of tuples in the form below. A given tuple represents a given pair of movies that a given user liked. All tuples, together, capture every combination of movie likes found in my data.
[(movie_a,movie_b),...(movie_a,movie_b)]
My task is to create movie embeddings, similar to word embeddings. The idea is to train a single hidden layer NN to predict the most likely movie which any user might like, given a movie supplied. Much like word embeddings, the task is inconsequential; it's the weight matrix I'm after, which maps movies to vectors.
Reference: https://arxiv.org/vc/arxiv/papers/1603/1603.04259v2.pdf
In total, there are 19,000,000 tuples (training examples). Likewise, there are 9,000 unique movie IDs in my data. My initial goal was to create an input variable, X, where each row represented a unique movie_id and each column represented a unique observation. In any given column, only one cell would be set to 1, with all other values set to 0.
As an intermediate step, I tried creating a matrix of zeros of the right dimensions
X = np.zeros([9000,19000000])
Understandably, my computer crashed, simply trying to allocate sufficient memory to X.
Is there a memory efficient way to pass my list of values into PyTorch, such that a binary vector is created for every training example?
I also tried randomly sampling 500,000 observations, but passing [9000,500000] to np.zeros() resulted in another crash.
My university has a GPU server available, and that's my next stop. But I'd like to know if there's a memory efficient way that I should be doing this, especially since I'll be using shared resources.
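For scale: np.zeros defaults to 8-byte floats, so a dense 9000 x 19,000,000 array needs about 9000 x 19,000,000 x 8 bytes, roughly 1.37 TB, and the 500,000-column sample still needs about 36 GB. One memory-light direction, offered as a sketch rather than a tested recipe, is to keep each training example as a pair of integer indices and expand to one-hot rows only for the batch currently being processed. The names `build_index` and `one_hot_batches` and the toy movie IDs are hypothetical. The code uses plain Python lists to stay dependency-free; the per-batch rows could be converted to NumPy arrays or PyTorch tensors at the same point.

```python
def build_index(pairs):
    """Map each distinct movie ID to an integer index, in order of first appearance."""
    index = {}
    for pair in pairs:
        for movie in pair:
            index.setdefault(movie, len(index))
    return index


def one_hot_batches(pairs, index, batch_size):
    """Yield (inputs, targets) per batch.

    inputs: one-hot rows for the first movie of each pair.
    targets: integer indices of the second movie of each pair.
    Only one batch of one-hot rows exists in memory at a time.
    """
    n_movies = len(index)
    for start in range(0, len(pairs), batch_size):
        inputs, targets = [], []
        for movie_in, movie_out in pairs[start:start + batch_size]:
            row = [0] * n_movies
            row[index[movie_in]] = 1
            inputs.append(row)
            targets.append(index[movie_out])
        yield inputs, targets


# Toy data (hypothetical movie IDs standing in for the real tuples).
pairs = [("m1", "m2"), ("m1", "m3"), ("m2", "m3"), ("m3", "m1"), ("m2", "m1")]
index = build_index(pairs)
batches = list(one_hot_batches(pairs, index, batch_size=2))
```

With this layout the stored data is two integers per training example, and the one-hot width only matters for the current batch.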

Connecting exchange names and codes to LCA inventory results

I'm getting into Brightway2 for some energy system modeling and I'm still getting used to all of the concepts.
I've created a small custom demo database, and run lca.lci() and lca.lcia(). lca.inventory and lca.characterized_inventory both return sparse matrices of the results. My question, which may be very simple, is how can you connect the values in the matrix to the exchange names and keys. I.e., if I wanted to print the results to a file, how would I match the exchanges to the inventory values?
Thanks.
To really understand what is going on, it is useful to understand the difference between "intermediate" data (stored as structured text files) and "processed" data (stored as numpy structured arrays). These concepts are described both here and here.
However, to answer your question directly: what each row and column stands for in the different matrices and arrays (e.g. the lca.inventory matrix, lca.supply_array, lca.characterized_inventory) is recorded in a set of dictionaries associated with your LCA object. These are:
activity_dict: Columns in the technosphere matrix
product_dict: Rows in the technosphere matrix
biosphere_dict: Rows in the biosphere matrix
For example, lca.product_dict yields, in the case of an LCA I just did:
{('ei32_CU_U', '671c1ae85db847083176b9492f000a9d'): 8397,
('ei32_CU_U', '53398faeaf96420408204e309262b8c5'): 536,
('ei32_CU_U', 'fb8599da19dabad6929af8c3a3c3bad6'): 7774,
('ei32_CU_U', '28b3475e12e4ed0ec511cbef4dc97412'): 3051, ...}
with the key in the dictionary being the actual product in my inventory database and the value being the row in the demand_array or the supply_array.
The reverse of these dictionaries may be more useful. Say you want to know what a value in, e.g., your supply_array refers to: you can create a reverse dictionary using a dict comprehension:
inv_product_dict = {v: k for k, v in lca.product_dict.items()}
and then use it directly to look up the information you are after. For example, to find out what row index 10 of the supply_array refers to, evaluate inv_product_dict[10], which in my case yields ('ei32_CU_U', '4110733917e1fcdc7c55af3b3f068c72')
The same logic applies to biosphere (or elementary) flows, found in lca.biosphere_dict (in LCA parlance, rows in the B matrix), and to activities, found in lca.activity_dict (columns of the A or B matrices).
Note that you can generate the reverse of the activity_dict/product_dict/biosphere_dict simultaneously using lca.reverse_dict(). The syntax then is:
rev_act_dict, rev_product_dict, rev_bio_dict = lca.reverse_dict()
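To tie this back to writing results to a file, here is a minimal sketch using stand-in data only. The dictionaries, the `(row, column, value)` triples, and the helper `label_entries` are made up for illustration, and the numbers are not real results. With a real LCA object the mappings would come from lca.biosphere_dict and lca.activity_dict, and, assuming lca.inventory is a SciPy sparse matrix, its nonzero entries could be read from the `row`, `col` and `data` attributes of `lca.inventory.tocoo()`.

```python
import csv
import io

# Stand-ins for lca.biosphere_dict and lca.activity_dict (hypothetical keys).
biosphere_dict = {("biosphere", "co2"): 0, ("biosphere", "ch4"): 1}
activity_dict = {("demo", "power plant"): 0, ("demo", "coal mine"): 1}

# Reverse the mappings: matrix index -> database key.
rev_bio = {v: k for k, v in biosphere_dict.items()}
rev_act = {v: k for k, v in activity_dict.items()}

# Stand-in for the nonzero entries of the inventory matrix:
# (row index = biosphere flow, column index = activity, value).
entries = [(0, 0, 1.5), (1, 1, 0.02), (0, 1, 0.3)]


def label_entries(entries, rev_rows, rev_cols):
    """Replace row/column indices with the keys they stand for."""
    return [(rev_rows[r], rev_cols[c], value) for r, c, value in entries]


labelled = label_entries(entries, rev_bio, rev_act)

# Write the labelled values as CSV (to an in-memory buffer here).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["flow", "activity", "amount"])
for flow, activity, amount in labelled:
    writer.writerow([flow[1], activity[1], amount])
csv_text = buffer.getvalue()
```

Swapping `io.StringIO()` for an opened file handle would write the same rows to disk.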

Polynominal error in Rapidminer when doing n-gram classification

I am trying to classify different concepts in a text using n-grams. My data typically consists of six columns:
1) The word that needs classification
2) The classification
3) First word on the left of 1)
4) Second word on the left of 1)
5) First word on the right of 1)
6) Second word on the right of 1)
When I try to use an SVM in Rapidminer, I get an error saying that it cannot handle polynominal values. I know that this can be done because I have read about it in different papers. I set the second column to 'label' and have tried to set the rest to 'text' or 'real', but it seems to have no effect. What am I doing wrong?
You have to use the Support Vector Machine (LibSVM) Operator.
In contrast to the classic SVM, which only supports two-class problems, the LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf) supports multi-class classification as well as regression.
One approach could be to create attributes with names equal to the words and values equal to the distance from the word of interest. Of course, all possible words would need to be represented as attributes so the input data would be large.
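As a rough illustration of that encoding, sketched in Python rather than as RapidMiner operators: each distinct context word becomes an attribute whose value is its offset from the word of interest. The signed offsets (-1, -2, 1, 2), the use of 0 for words that are absent, the rule of keeping the nearest occurrence when a word appears twice, and the example rows are all assumptions made for illustration.

```python
# Each row: (word, label, left1, left2, right1, right2), as in the question.
rows = [
    ("bank", "finance", "the", "at", "account", "manager"),
    ("bank", "river", "the", "on", "of", "the"),
]

# Assumed signed offsets for the four context columns, in column order.
OFFSETS = (-1, -2, 1, 2)


def build_vocabulary(rows):
    """All distinct context words, sorted so attribute order is stable."""
    return sorted({word for row in rows for word in row[2:]})


def encode(row, vocabulary):
    """One numeric attribute per vocabulary word: its offset, or 0 if absent."""
    features = dict.fromkeys(vocabulary, 0)
    for word, offset in zip(row[2:], OFFSETS):
        # If a word occurs twice in the window, keep the nearest occurrence.
        if features[word] == 0 or abs(offset) < abs(features[word]):
            features[word] = offset
    return [features[word] for word in vocabulary]


vocabulary = build_vocabulary(rows)
encoded = [encode(row, vocabulary) for row in rows]
labels = [row[1] for row in rows]
```

Every attribute is numeric after this step, at the cost of one column per distinct context word.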

Matrix functions to compute a correlation matrix in conjunction with an IF statement

I am trying to calculate several correlation matrices using matrix functions in Excel. I have no difficulty with a straightforward problem but when I want to compute three matrices based on three unique values of a variable I am not able to get the IF statement to work properly.
Specifically, I have three scenarios ("risk loving", "normal", "risk averse") coded in, say, B2:B253. My return data is in C2:C253. My goal is to create three correlation matrices depending on the values in column B. My code is:
=MMULT(IF(B2:B253="RISK LOVING",TRANSPOSE($C$2:$L$253-$O$3:$X$3),$C$2:$L$253-$O$3:$X$3)/$P$1/MMULT(TRANSPOSE($O$4:$X$4),$O$4:$X$4),0)
Any suggestions?
The left-hand side of the condition is a range (B2:B253). The right-hand side is a value. I believe that the comparison results in TRUE if and only if it is true for B2. Is this really what you want to do? Or would you like to have a column of IF statements?
