Difference between Apply node and Op node in Theano

Theano beginner here. I was just going through the graph structures section on deeplearning.net and I have a question.
The tutorial states: "Apply node represents the application of an op to some variables. It is important to draw the difference between the definition of a computation represented by an op and its application to some actual data, which is represented by the apply node."
In Theano, applying a computation to data is done by first creating a function and then plugging the appropriate values into f(). Where does the Op node come into the picture?

An Op in Theano represents a computation, such as add, dot, or convolution. We define all our work in terms of Ops and then compile it into a concrete graph with theano.function(). From the graph-structures documentation:
Theano represents symbolic mathematical computations as graphs. These graphs are composed of interconnected Apply, Variable and Op nodes. Apply node represents the application of an op to some variables.
So the Op node is the definition of the computation itself, while the Apply node records that this Op was applied to particular variables when the expression was built. I am not sure this answers your question. Let me know if it's still unclear.
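To make the distinction concrete, here is a minimal sketch using the standard Theano API: once you build an expression, the resulting variable's .owner attribute is the Apply node, and that node's .op attribute is the Op that was applied. The Op/Apply structure exists as soon as the graph is built, before theano.function() compiles it or any data is plugged in.

import theano
import theano.tensor as T

x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y                    # building the expression creates an Apply node

apply_node = z.owner         # Apply node: "this Op applied to these inputs"
print(apply_node.op)         # the Op itself (elementwise add)
print(apply_node.inputs)     # the variables it was applied to: [x, y]

# Compiling and calling the function is a separate, later step.
f = theano.function([x, y], z)
print(f([[1.0]], [[2.0]]))   # [[ 3.]]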

Related

How cost complexity pruning works and how random_state works inside DecisionTreeClassifier

I want to ask about two things in a decision tree model:
1- In a decision tree, I know how the tree is built, but there is the problem of overfitting the data, so I need to use cost complexity pruning, and I have two problems with it. When I ask the model to return all the values of alpha, how does the model find those values? And after I find the best value of alpha and build the tree on it, how does the model build the tree from that value? This is the first question.
2- When using DecisionTreeClassifier(random_state=any int), I don't understand the importance of random_state and how it works inside the model; I do understand how it works in train_test_split. I searched on Stack Overflow and found someone saying that the tree uses a heuristic while it is built, but I don't get how that can happen: at every node I compute the information gain of every feature, take the highest value, and split on that feature, so it clearly works as a greedy algorithm. Even if we assume it works with a heuristic, if I give it a random state to control the randomness, what exactly is it storing?
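On the first question: scikit-learn derives the candidate alphas from the fully grown tree itself. Pruning repeatedly removes the subtree whose removal costs the least impurity increase per pruned leaf, and the "effective alpha" at which each removal becomes worthwhile is recorded; cost_complexity_pruning_path returns that sequence, and fitting with ccp_alpha=a then prunes every node whose effective alpha is at most a. A minimal sketch (iris is just a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Ask the model for the sequence of effective alphas.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Refit at each alpha: larger alpha -> more aggressive pruning, smaller tree.
for a in ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
    print(round(a, 5), pruned.tree_.node_count)

On the second question: the greedy search is itself the heuristic people mean, and random_state matters because scikit-learn permutes the features before evaluating splits, so ties between equally good splits (and the behaviour of splitter='random' or max_features) are broken differently from run to run unless the seed is fixed.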

How to compute the iteration matrix for nth NLBGS iteration

I was wondering if there is a direct way of computing the iteration matrix for the nth Linear Block Gauss-Seidel iteration within OpenMDAO?
Thank you.
If I understand you correctly, you are referring to the matrix form of the Gauss-Seidel algorithm, where you take Ax = b, break A up into its diagonal (D), lower (L) and upper (U) parts, and then use those parts to compute the next iterate.
Specifically, you compute (D - L)^-1; the matrix (D - L)^-1 U that maps one iterate to the next is, I believe, what you are referring to as the "iteration matrix" (I am not familiar with this terminology, but based on the algorithm I'm comfortable making an educated guess).
This formulation of the algorithm is useful to think about and simple to implement, but OpenMDAO takes a different approach. The LBGS algorithm implemented in OpenMDAO is set up to work in a matrix-free manner. That means it only interacts with the linear operator methods solve_linear and apply_linear and never explicitly assembles the A matrix at all. Hence there isn't an opportunity to split A up into D, L and U.
Depending on the way you constructed the model, the A matrix you would need might or might not exist at all, because OpenMDAO is capable of working in a completely matrix-free context. However, if all of your components use the compute_partials or linearize methods to provide partial derivatives, then the data you would need for the A matrix does exist in memory.
You'll have to dig for it a bit, and ironically the best place to see how to do that is in the direct solver, which does actually require the matrix to be formed in order to compute a factorization.
Also, in that code you'll see a function that can iteratively call the linear operator to construct a dense matrix even if the underlying components don't provide their partials directly. Please note that this approach to assembling the matrix is extremely slow and is not recommended for normal operations.
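For reference, here is a small NumPy sketch of the matrix-form splitting described above; this is plain textbook linear algebra, not OpenMDAO API, since OpenMDAO's solver never forms these matrices:

import numpy as np

A = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])
b = np.array([1.0, 2.0, 3.0])

D = np.diag(np.diag(A))
L = -np.tril(A, -1)              # strictly lower part, so A = D - L - U
U = -np.triu(A, 1)               # strictly upper part

M_inv = np.linalg.inv(D - L)
G = M_inv @ U                    # the Gauss-Seidel iteration matrix

x = np.zeros_like(b)
for _ in range(25):              # x_{k+1} = G x_k + (D - L)^-1 b
    x = G @ x + M_inv @ b

print(x)                         # matches np.linalg.solve(A, b)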

Pre-aligning molecules in Rdkit before computing shape similarity with ShapeTanimotoDist() possible?

I am building a script to compare the shapes of RDKit-generated conformers for a query ligand to a reference ligand extracted from a template protein-ligand complex. For this I want to use the shape similarity Tanimoto metric ShapeTanimotoDist() provided by RDKit. It seems, however, that this function does not pre-align the molecules when computing shape similarity. While searching I stumbled upon this discussion from 10 years ago in which someone attempted something similar: https://sourceforge.net/p/rdkit/mailman/message/21906484/.
Quoting Greg Landrum:
There is no alignment step. If you want reasonable shape comparisons,
you first need a reasonable alignment of the molecules. The RDKit
doesn't currently provide a practical method of doing this alignment.
So I am wondering whether this issue has since been resolved, and whether it would therefore be reasonable to use this function in a standalone fashion to compare shapes of molecules. The documentation for ShapeTanimotoDist() states that it uses a "predefined alignment", which is not elaborated further. I have looked into the documentation for the two molecule-aligning functions RDKit provides, AlignMol and Open3DAlign (O3A): https://www.rdkit.org/docs/source/rdkit.Chem.rdMolAlign.html. For some reason AlignMol does not work for me (runtime error), although O3A, supported in RDKit since 2014, did allow me to compare the conformers with the reference ligands. However, when creating an O3A object, is there a way to retrieve the coordinates of the aligned conformer and reference molecule to feed into ShapeTanimotoDist()? And perhaps also visualize this using PyMOL?
Cheers
Also perhaps useful to consult: the "3D functionality in the RDKit" section of the RDKit Cookbook, https://www.rdkit.org/docs/Cookbook.html
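On the O3A question: Align() on the O3A object writes the aligned coordinates directly into the probe molecule's conformer, so calling ShapeTanimotoDist afterwards scores the aligned pose, and writing the molecule out lets you inspect that pose in PyMOL. A hedged sketch with standard RDKit calls (the file names and example SMILES are placeholders):

from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign, rdShapeHelpers

ref = Chem.MolFromMolFile('reference_ligand.mol')      # placeholder file
probe = Chem.AddHs(Chem.MolFromSmiles('CCOc1ccccc1'))  # placeholder query
AllChem.EmbedMolecule(probe, randomSeed=42)            # generate a conformer

o3a = rdMolAlign.GetO3A(probe, ref)
print('O3A score:', o3a.Score())
o3a.Align()                  # updates probe's conformer coordinates in place

# The probe now carries the aligned coordinates, so no extra
# coordinate plumbing is needed before scoring.
dist = rdShapeHelpers.ShapeTanimotoDist(probe, ref)
print('Shape Tanimoto distance:', dist)

# Write the aligned pose out for visual inspection in PyMOL.
Chem.MolToMolFile(probe, 'probe_aligned.mol')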

Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection

Summing up my understanding of the topic: 'dummy coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using K dummies would cause redundancy and would have a negative impact, e.g. on logistic regression, as far as I have learned. So far, everything is clear to me.
Yet, two issues are unclear to me:
1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why would that be the case?
2) An issue arises as soon as I consider attribute selection. Whereas the left-out attribute value is implicitly included as the case where all dummies are zero when all dummies are used in the model, it is no longer clearly included if one dummy is missing (because it was not selected during attribute selection). The issue is much easier to understand with the sketch I uploaded. How can this issue be treated?
Images
WEKA Output: The Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11,A12,A13,A14. All of them are included in the logistic regression model. http://abload.de/img/bildschirmfoto2013-089out9.png
Decision Tree Example: Sketch showing the issue when it comes to running decision trees on datasets with dummy-coded instances after attribute selection. http://abload.de/img/sketchziu5s.jpg
The output is generally easier to read, interpret and use when you use k dummies instead of k-1 dummies. I figure that is why everybody seems to actually use k dummies.
But yes, since the k values sum to 1, there is a linear dependence that may cause problems. Correlations in data sets are common, though; you will never completely get rid of them!
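To make the k vs. k-1 trade-off concrete, a small pandas sketch (the attribute values mimic the German Credit screenshot):

import pandas as pd

df = pd.DataFrame({'status': ['A11', 'A12', 'A13', 'A14', 'A11']})

k_dummies = pd.get_dummies(df['status'])                   # k columns
k_minus_1 = pd.get_dummies(df['status'], drop_first=True)  # k-1 columns

print(k_dummies.columns.tolist())    # ['A11', 'A12', 'A13', 'A14']
print(k_minus_1.columns.tolist())    # ['A12', 'A13', 'A14']

# With k dummies every row sums to 1, so the columns are linearly
# dependent -- the redundancy (dummy-variable trap) discussed above.
print(k_dummies.sum(axis=1).tolist())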
I believe feature selection and dummy coding just don't fit together: it amounts to dropping some values from the attribute. Why do you insist on doing feature selection?
You really should be using weighting, or consider more advanced algorithms that can handle such data. In fact, the dummy variables can cause just as much trouble, because they are binary, and many algorithms (e.g. k-means) don't make much sense on binary variables.
As for the decision tree: don't perform feature selection on your output attribute.
Plus, since a decision tree already selects features, it does not make sense to do all this anyway; leave it to the decision tree to decide which attribute to use for splitting. This way, it can learn dependencies, too.

What are "Factor Graphs" and what are they useful for?

A friend is using factor graphs to do text mining (identifying references to people in text), and it got me interested in this tool, but I'm having a hard time finding an intuitive explanation of what factor graphs are and how to use them.
Can anyone provide an explanation of factor graphs that isn't math heavy, and which focuses on practical applications rather than abstract theory?
They are used extensively for breaking down a problem into pieces. One very interesting application of factor graphs (and message passing on them) is the XBox Live TrueSkill algorithm. I wrote extensively about it on my blog where I tried to go for an introductory explanation rather than an overly academic one.
A factor graph is the graphical representation of the dependencies between variables and factors (parts of a formula) that are present in a particular kind of formula.
Suppose you have a function f(x_1, x_2, ..., x_n) and you want to compute its marginalization for some argument x_i, i.e. sum the function over all assignments to the remaining variables. Suppose further that f can be broken into factors, e.g.
f(x_1, x_2, ..., x_n) = f_1(x_1, x_2) * f_2(x_5, x_8, x_9) * ... * f_k(x_1, x_10, x_11)
Then, to compute the marginalization of f over some of the variables, you can use a special algorithm called sum-product (or message passing) that breaks the problem into smaller computations. For this algorithm it is very important which variables appear as arguments of which factor. This information is captured by the factor graph.
A factor graph is a bipartite graph with two kinds of nodes, factor nodes and variable nodes, and there is an edge between a factor node and a variable node exactly when the variable appears as an argument of that factor. In our example there would be an edge between the factor f_2 and the variable x_5, but not between f_2 and x_1.
There is a great article: Factor graphs and the sum-product algorithm.
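To see what the factorization buys you, here is a tiny NumPy sketch with binary variables and made-up factor tables: it computes the same marginal for x_1 once by brute force over the full joint and once by pushing the sums inward, which is exactly the saving the sum-product algorithm exploits:

import numpy as np

# f(x1, x2, x3) = f1(x1, x2) * f2(x2, x3), all variables binary.
f1 = np.array([[1.0, 2.0],
               [0.5, 1.5]])     # f1[x1, x2]
f2 = np.array([[3.0, 1.0],
               [2.0, 4.0]])     # f2[x2, x3]

# Brute force: build the full joint table and sum over x2 and x3.
joint = f1[:, :, None] * f2[None, :, :]          # shape (x1, x2, x3)
brute = joint.sum(axis=(1, 2))

# Factored: first a "message" from f2 summarising x3, then absorb it.
msg_to_x2 = f2.sum(axis=1)                       # sum over x3
factored = (f1 * msg_to_x2[None, :]).sum(axis=1) # sum over x2

print(brute, factored)                           # identical marginals for x1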
A factor graph is a mathematical model, and it is most precisely explained with equations. In a nutshell, it is a way to express complex relations between the variables of interest in your model. Example: A is temperature, B is pressure, components C, D and E depend on A and B in some way, and component K also depends on A and B. You want to predict the value of K based on A and B, yet you only observe the visible states. Basic ML libraries don't let you model such a structure directly; neural networks handle it better, and factor graphs address exactly this problem.
Factor graphs are similar in spirit to deep learning in one respect: when a model cannot be expressed directly as features and an output, factor models allow you to build hidden states, layers and complex structures of variables to fit real-world behaviour. Examples are machine translation alignment, fingerprint recognition, co-reference resolution, etc.
