I'm working on a basic modeling application. I read that you can implement it by having a list of objects with identifiers (such as 1 for a cube and 2 for a sphere) and then applying each object's instance transformation (a combination of translation, rotation, and scaling). Since these transformations are not commutative, order matters. In general, you would define an overall transformation matrix as M=TRS where T=translation, R=rotation, and S=scaling.
My question is: if I perform a series of transformations, is the result the same as combining all the transformations of each type? As in, something like this:
M = t1*r1*t2*s1*r2*s2 =? t1*t2*r1*r2*s1*s2 = TRS
No, it is not the same. Matrix multiplication is not commutative so you cannot change the order of the multiplications for the different transformations.
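A quick numerical check makes this concrete. The sketch below assumes 2D homogeneous coordinates and arbitrarily chosen values for t1, r1, s1 and so on (none of these values come from the question); it only illustrates that the interleaved and grouped products differ.

```python
import numpy as np

def translation(tx, ty):
    # 3x3 homogeneous 2D translation matrix
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def rotation(theta):
    # 3x3 homogeneous 2D rotation by theta radians
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def scaling(sx, sy):
    # 3x3 homogeneous 2D scaling matrix
    return np.diag([sx, sy, 1.0])

# Arbitrary example values (assumptions for illustration only)
t1, t2 = translation(1.0, 0.0), translation(0.0, 2.0)
r1, r2 = rotation(np.pi / 2), rotation(np.pi / 4)
s1, s2 = scaling(2.0, 1.0), scaling(1.0, 3.0)

interleaved = t1 @ r1 @ t2 @ s1 @ r2 @ s2
grouped = (t1 @ t2) @ (r1 @ r2) @ (s1 @ s2)

same = np.allclose(interleaved, grouped)
translations_commute = np.allclose(t1 @ t2, t2 @ t1)
print("interleaved == grouped:", same)
print("t1*t2 == t2*t1:", translations_commute)
```

Transformations of the same type can sometimes be reordered (translations commute with each other, for instance), but mixing types generally cannot.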
I have a big dataset with 2 billion+ samples and approximately 100+ features per sample. About 10% of the features are numerical/continuous variables and the rest are categorical variables (position, language, url, etc.).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...here it would make sense to use a reduction like SVD, because the distance between sud:north > sud:centre. Moreover, it's possible to encode this variable (e.g. OneHotEncoder, StringIndexer) because of the small cardinality of its value set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlib, 90% of the models work only with numerical values (apart from the Frequent Itemset and DecisionTree techniques)
2) Feature transformers/reducers/extractors such as PCA or SVD are not good for this kind of data, and there is no implementation of (e.g.) MCA
a) What would be your approach to dealing with this kind of data in Spark, or using MLlib?
b) Do you have any suggestions for coping with this many categorical values?
c) After reading a lot of the literature and reviewing the models implemented in Spark, my idea is that, to infer one of those features from the other (categorical) ones, the models mentioned in point 1 could be the best choice. What do you think about it?
(As a classical use case, imagine the problem of inferring the gender of a person from visited URLs and other categorical features.)
Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say Stack Overflow works in a different way: you should be the one providing a working example of the problem you are facing, and we help you out using that example.
Anyway, I got intrigued by the use of categorical values like the one you show as position. If this is a categorical value with 3 levels, as you mention (SUD, CENTRE, NORTH), there is no distance between them if they are truly categorical. In this sense I would create dummy variables like:
SUD 1 0 0
CENTRE 0 1 0
NORTH 0 0 1
This is a true dummy representation of a categorical variable.
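The table above can be produced with a few lines of plain Python. This is only a minimal sketch of the idea, not Spark code; the helper name `one_hot` and the sample values are made up for illustration. In Spark you would normally reach for the built-in encoders the question already mentions.

```python
def one_hot(values, levels):
    # Map each level to its column position, e.g. {"SUD": 0, "CENTRE": 1, "NORTH": 2}
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[index[value]] = 1  # exactly one column is "hot" per sample
        rows.append(row)
    return rows

def squared_distance(a, b):
    # Squared Euclidean distance between two encoded rows
    return sum((x - y) ** 2 for x, y in zip(a, b))

levels = ["SUD", "CENTRE", "NORTH"]
encoded = one_hot(["SUD", "NORTH", "CENTRE", "SUD"], levels)
for row in encoded:
    print(row)

sud, centre, north = one_hot(levels, levels)
# Every pair of distinct levels is equally far apart, so no ordering is implied.
print(squared_distance(sud, centre), squared_distance(sud, north))
```

Unlike the 1 | 2 | 3 encoding, all levels end up equidistant here, which is what "no distance between them" means in practice.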
On the other hand if you want to take that distance into account then you have to create another feature which takes this distance into account explicitly, but that is not a dummy representation.
If the problem you are facing is that, after you wrote your categorical features as dummy variables (note that now all of them are numerical), you have very many features and you want to reduce your feature space, then that is a different problem.
As a rule of thumb, I try to use the entire feature space first, which is feasible now that Spark's computing power lets you run modelling tasks on big datasets. If it is too big, then I would go for dimensionality reduction techniques such as PCA.
This question regards making sympy's geometric algebra module use both covariant and contravariant vector forms to make the output much more compact. So far I am able to use one or the other, but not both together. It may be that I don't know the maths well enough, and the answer is in the documentation after all.
Some background:
I have a system of equations that I want to solve in a complicated non-orthogonal coordinate system. The metric tensor elements of this coordinate system are known, but their expressions are unwieldy, so I'd like to keep them hidden and simply use $g_{ij}$, the square root of its determinant $J$, and $g^{ij}$. It is also useful to describe vectors, $\mathbf{V}$, in either their contravariant or their covariant forms,
$\mathbf{V} = \sum_i V^i \mathbf{e}_i = \sum_i V_i \mathbf{e}^i$,
and transform between them where necessary.
Here $\mathbf{e}^i = \nabla u^{(i)}$, where $u^{(i)}$ is the $i$th coordinate, and $\mathbf{e}_i = \partial \mathbf{R} / \partial u^{(i)}$.
This notation is the same as that used in this invaluable text, which I cannot recommend more. Specifically, chapter 2 will be useful for this question.
There are many curl and divergence operations in the system of equations I'm trying to solve. The former is most simply expressed with the covariant components of a vector, and the latter with the contravariant components:
$\nabla \cdot \mathbf{V} = \frac{1}{J} \sum_i \frac{\partial}{\partial u^{(i)}} \left( J V^i \right)$,
$\nabla \times \mathbf{V} = \sum_{i,j,k} \frac{\varepsilon^{ijk}}{J} \frac{\partial V_j}{\partial u^{(i)}} \mathbf{e}_k$,
where $\varepsilon^{ijk}$ is the Levi-Civita symbol. I would consider this question answered if I could print the above two equations using sympy's geometric algebra module.
How does one configure sympy's geometric algebra module to express calculations in this manner i.e. using covariant and contravariant vector expressions in order to hide away the complicated nature of the coordinate system?
Maybe there is an alternative toolbox that does exactly this?
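For reference, the divergence formula from the question can at least be checked with plain sympy, without the geometric algebra module. The sketch below is an assumption-laden illustration rather than the configuration being asked for: it uses spherical coordinates as a stand-in metric (orthogonal, so much simpler than the real system), and the helper name `divergence` is made up.

```python
import sympy as sp

r, theta, phi = sp.symbols("r theta phi", positive=True)
coords = (r, theta, phi)

# Covariant metric tensor g_ij for spherical coordinates (stand-in example).
g = sp.diag(1, r**2, r**2 * sp.sin(theta)**2)

# J is the square root of the determinant of g_ij.
J = sp.sqrt(g.det())

def divergence(V_contra):
    # div V = (1/J) * sum_i d(J * V^i)/du^(i), with contravariant components V^i
    total = sum(sp.diff(J * Vi, ui) for Vi, ui in zip(V_contra, coords))
    return sp.simplify(total / J)

# The position vector R has contravariant components (r, 0, 0) in this basis;
# its divergence in three dimensions is 3.
div_R = divergence((r, 0, 0))
print(div_R)
```

Only `J` and the components enter `divergence`, so the same function would work unchanged if `J` were left as an undefined symbolic function of the coordinates to keep the metric hidden.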
I need to use the SVD form of a matrix to extract concepts from a series of documents. My matrix is of the form A = [d1, d2, d3 ... dN] where di is a binary vector of M components. Then the svd decomposition gives me svd(A) = U x S x V' with S containing the singular values.
I use SVDLIBC to do the processing in nodejs (using a small module I wrote to wrap it). It all seemed to work well, but I noticed something quite weird in the running-time behavior depending on the state of my matrix (N and M are growing, but each is already above 1000).
At first I didn't consider filtering out duplicate document vectors, but after some tests it looks like adding a document twice sometimes speeds up the processing extraordinarily.
Do I have to make sure that the columns of A are pairwise independent? Are they required to be all linearly independent? (I thought not, since SVD seems to do its job well even when some columns are exactly the same; the resulting decomposition simply shows which columns / rows are useless by having 0 components in U or V.)
Now that it sometimes takes far too long to compute the SVD of my big matrix, I tried to reduce its size by removing duplicate columns, but I found that adding dummy duplicate vectors can actually make it much faster. Is that normal? What's happening?
Logically, I'd say that I want my matrix to contain as much information as possible, and thus
[A] Remove all duplicate columns, and in the best case, maybe
[B] Remove linearly dependent columns.
Doing [A] seems pretty simple and not too expensive computationally: I could hash my vectors at construction to find candidate duplicates and then spend time checking only those. But are there good computational techniques for [A] and [B]?
(For [A], I'd like to avoid checking a new vector for equality against all previous vectors the brute-force way; as for [B], I don't know any good way to check it / do it.)
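The hashing idea for [A] can be sketched in a few lines of Python. This assumes each document vector is a list of 0/1 integers; the function name `unique_columns` is hypothetical. A dictionary does the hashing and, on a hash collision, the exact equality check, so each new vector is compared only against candidates with the same hash rather than against every earlier vector.

```python
def unique_columns(columns):
    """Return the indices of the first occurrence of each distinct column.

    `columns` is a list of binary vectors (lists of 0/1 ints).
    """
    seen = {}   # column content (as bytes) -> index of its first occurrence
    kept = []
    for idx, col in enumerate(columns):
        key = bytes(col)  # compact, hashable representation of a 0/1 vector
        if key not in seen:
            seen[key] = idx
            kept.append(idx)
    return kept

docs = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],  # duplicate of the first document
    [0, 0, 0, 1],
]
print(unique_columns(docs))
```

This handles exact duplicates only; it does nothing for [B], where columns are linearly dependent without being identical.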
Added related question: about my second question, why would SVD's running time behavior change so massively by just adding one similar column? Is that a normal possible behavior, or does it mean I should look for a bug in SVDLIBC?
It is difficult to say where the problem is without samples of fast and slow input matrices. But, since one of the primary uses of the SVD is to provide a rotation that eliminates covariance, redundant (or the same) columns should not cause problems.
To find out whether the slow behavior is a bug in the library you're using, I'd suggest computing the SVD of the same matrix with another tool. For example, in Octave, compute an SVD of your matrix to compare runtimes:
[U, S, V] = svd(A)
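The same cross-check can be done in Python with NumPy, if that is more convenient than Octave. The sketch below uses a small random binary matrix as a stand-in for the real data (an assumption; the sizes and seed are arbitrary) and appends a copy of the first column. It illustrates that a duplicated column leaves the rank unchanged, so the decomposition itself is still well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for A = [d1, d2, ..., dN]: M = 6 components, N = 4 binary documents
A = (rng.random((6, 4)) < 0.5).astype(float)

# Same matrix with the first document added a second time
A_dup = np.hstack([A, A[:, [0]]])

s = np.linalg.svd(A, compute_uv=False)          # singular values of A
s_dup = np.linalg.svd(A_dup, compute_uv=False)  # singular values with the duplicate

rank_A = np.linalg.matrix_rank(A)
rank_dup = np.linalg.matrix_rank(A_dup)

print("singular values:", s)
print("with duplicate: ", s_dup)
print("ranks:", rank_A, rank_dup)
```

If NumPy handles both of your real matrices in similar time while SVDLIBC does not, that points at the library or the wrapper rather than at the data.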
In most tutorials discussing decision trees, each attribute is represented by a single value, and these values are then concatenated into a feature vector. This makes sense, since the attributes are normally independent of each other.
However, in practice, some attributes can only be represented as a vector or matrix, for example a GPS coordinate (x,y) on a 2D map. If x and y are correlated (e.g. through a nonlinear dependence), simply concatenating them with the other attributes is not a good solution. Are there better techniques for dealing with them?
I have a multivariate timeseries of "inputs" of dimension N that I want to map to an output timeseries of dimension M, where M < N. The inputs are bounded in [0,k] and the outputs are in [0,1]. Let's call the input vector for some time slice in the series "I[t]" and the output vector "O[t]".
Now if I knew the optimal mapping of pairs <I[t], O[t]>, I could use one of the standard multivariate regression / training techniques (such as NN, SVM, etc) to discover a mapping function.
I do not know the relationship between specific <I[t], O[t]> pairs; rather, I have a view on the overall fitness of the output timeseries, i.e. the fitness is governed by a penalty function on the complete output series.
I want to determine the mapping / regressing function "f", where:
O[t] = f (theta, I[t])
Such that penalty function P(O) is minimized:
argmin_theta P( f(theta, I) )
[Note that the penalty function P is applied to the resulting series generated from multiple applications of f to the I[t]'s across time. That is, f is a function of I[t] and not of the whole timeseries.]
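To make the setup concrete, here is a minimal sketch of the formulation in Python with NumPy. Everything specific in it is an assumption for illustration: f is taken to be a linear map followed by a sigmoid (to keep outputs in [0,1]), the `penalty` function is a made-up stand-in for P, and the optimizer is a naive random search. It is not a proposal for the actual basis functions.

```python
import numpy as np

def f(theta, I_t):
    # Hypothetical per-time-slice mapping: linear map plus bias, squashed into [0, 1].
    W, b = theta[:, :-1], theta[:, -1]
    return 1.0 / (1.0 + np.exp(-(W @ I_t + b)))

def apply_series(theta, I):
    # Apply f independently to every time slice I[t].
    return np.array([f(theta, I_t) for I_t in I])

def penalty(O):
    # Made-up stand-in for P: it only sees the complete output series.
    roughness = np.sum(np.diff(O, axis=0) ** 2)
    level = np.sum((O.mean(axis=0) - 0.5) ** 2)
    return roughness + level

def random_search(I, M, iters=200, step=0.1, seed=0):
    # Derivative-free search over theta; keeps a candidate only if P decreases.
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(M, I.shape[1] + 1))
    best = penalty(apply_series(theta, I))
    history = [best]
    for _ in range(iters):
        candidate = theta + step * rng.normal(size=theta.shape)
        p = penalty(apply_series(candidate, I))
        if p < best:
            theta, best = candidate, p
        history.append(best)
    return theta, best, history

# Synthetic inputs: T = 50 time slices, N = 4 inputs bounded in [0, k] with k = 5
I = np.random.default_rng(1).uniform(0.0, 5.0, size=(50, 4))
theta, best, history = random_search(I, M=2)
O = apply_series(theta, I)
print("initial penalty:", history[0], "final penalty:", best)
```

The point of the sketch is the structure: theta is only ever judged through P applied to the whole series, so any optimizer that needs nothing but function evaluations can be swapped in for the random search.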
The mapping between I and O is complex enough that I do not know what functions should form its basis. I therefore expect to have to experiment with a number of basis functions.
I have a view on one way to approach this, but do not want to bias the proposals.
It depends on your definition of optimal mapping and penalty function. I'm not sure if this is the direction you're taking, but here are a couple of suggestions:
For example, you can find a mapping of the data from the higher-dimensional space to a lower-dimensional space that tries to preserve the original similarity between data points (something like Multidimensional Scaling [MDS]).
Or you can prefer to map the data to a lower dimension that accounts for as much of the variability in the data as possible (Principal Component Analysis [PCA]).
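As an illustration of the PCA option, here is a minimal sketch using NumPy's SVD on centered data. The function name `pca` and the synthetic input are assumptions for the example; a library implementation would normally be preferable in practice.

```python
import numpy as np

def pca(X, n_components):
    # Center each feature, then take the top right-singular vectors as components.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of total variance carried by each singular direction
    explained = (S ** 2) / np.sum(S ** 2)
    return Xc @ components.T, explained[:n_components]

# Synthetic input series: 100 time slices, N = 5 inputs
X = np.random.default_rng(0).uniform(0.0, 1.0, size=(100, 5))
Z, explained = pca(X, n_components=2)
print(Z.shape, explained)
```

Note that PCA is a linear projection chosen without any reference to the penalty function, so it gives a lower-dimensional representation of the inputs rather than a mapping that is optimal for P.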