Is there a way to describe attributes, using just graphs, links and nodes?

I've been thinking about this for a while but I can't seem to get my head around this.
(1) Say you have a simple graph with links and nodes. Some nodes are green and some nodes are red.
(2) It seems to me that we could represent this by adding two special 'color' nodes, and linking them to the nodes that have that color.
(3) However, 'being a color node', is in itself an attribute. So we could represent this, again, by adding a special node that represents this, and linking the color nodes to that one. This could go on ad infinitum.
see this image for illustration
Is there a way to describe attributes using only nodes and links? I.e., is there a way to break out of the infinite regression without using 'special' nodes?

In various data structures (linked lists, trees, graphs, etc.) the nodes always contain some data.
This is also the case in recursive data structures such as linked lists and trees, where a node's data can include the data structure itself: a list can be a list of lists, and a tree can be a tree of trees, and so on.
In a graph you can partly get away without data in the nodes, because you can, for instance, define and represent a graph with a matrix over numbered nodes. However, if you want the nodes to carry colours, you need a dictionary (an associative array) to map your values to other values.
This graph can be represented by a matrix like this:
5 | Y x x x Y
4 | x x Y Y x
3 | Y x Y Y Y
2 | Y Y x x x
1 | Y Y Y x Y
    - - - - -
    1 2 3 4 5
x -> represents there's no link (edge) between the nodes.
Y -> represents there's a link (edge) between the nodes.
Dictionary (a, b, c, d, e could be colours or other atomic values mapped):
a -> 1
b -> 2
c -> 3
d -> 4
e -> 5
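To make the answer's point concrete, here is a tiny sketch in Python (node names and the colour set are illustrative, not from the question) where the colours are themselves ordinary nodes and an attribute is read off by following edges:

```python
# The attributes 'green' and 'red' become ordinary nodes, and
# "node n is green" becomes just another edge in the graph.
COLOURS = {"green", "red"}

edges = {
    ("n1", "n2"), ("n2", "n3"),        # structural links
    ("n1", "green"), ("n2", "green"),  # attribute links
    ("n3", "red"),
}

def colour_of(node):
    # An attribute lookup is just an edge traversal.
    for a, b in edges:
        if a == node and b in COLOURS:
            return b
    return None

print(colour_of("n2"))  # 'green'
```

Of course, as the question observes, "being a colour node" is here encoded out-of-band (in the COLOURS set), which is exactly the special-node regress being asked about.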

Related

Scikit Learn PolynomialFeatures - what is the use of the include_bias option?

In scikit-learn's PolynomialFeatures preprocessor, there is an include_bias option. This essentially just adds a column of ones to the dataframe. I was wondering what the point of having this is. Of course, you can set it to False. But theoretically, how does having or not having a column of ones alongside the generated polynomial features affect the regression?
This is the explanation in the documentation, but I can't seem to get anything useful out of it in relation to why it should or should not be used:
include_bias : boolean
If True (default), then include a bias column, the feature in which
all polynomial powers are zero (i.e. a column of ones - acts as an
intercept term in a linear model).
Suppose you want to perform the following regression:
y ~ a + b x + c x^2
where x is a generic sample. The best coefficients a, b, c are computed via simple matrix calculus. First, let us denote by X = [1 | x | x^2] a matrix with N rows, where N is the number of samples. The first column is a column of 1s, the second column holds the values x_i for all samples i, and the third column holds the values x_i^2 for all samples i. Let us denote by B the column vector B = [a b c]^T. If Y is the column vector of the N target values, we can write the regression as
y ~ X B
The i-th row of this equation is y_i ~ [1 x_i x_i^2] [a b c]^T = a + b x_i + c x_i^2.
The goal of training the regression is to find B = [a b c]^T such that X B is as close as possible to Y.
If you don't add a column of 1s, you are assuming a priori that a = 0, which might not be correct.
In practice, when you write Python code, and you use PolynomialFeatures together with sklearn.linear_model.LinearRegression, the latter takes care by default of adding a column of 1s (since in LinearRegression the fit_intercept parameter is True by default), so you don't need to add it as well in PolynomialFeatures. Therefore, in PolynomialFeatures one usually keeps include_bias=False.
The situation is different if you use statsmodels.OLS instead of LinearRegression, since statsmodels does not add the intercept column automatically (you add it yourself, e.g. with add_constant).
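To make the intercept effect concrete, here is a small NumPy sketch (least squares directly, not scikit-learn itself) fitting y = a + b x + c x^2 with and without the column of ones; the data and true coefficients are made up for illustration:

```python
import numpy as np

# Synthetic data with a nonzero intercept: true a=2, b=3, c=0.5.
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.5 * x**2

# Design matrix WITH a bias column (what include_bias=True produces):
X_bias = np.column_stack([np.ones_like(x), x, x**2])
a, b, c = np.linalg.lstsq(X_bias, y, rcond=None)[0]
print(a, b, c)       # recovers 2.0, 3.0, 0.5

# WITHOUT the bias column, the fit is forced through a=0, so the
# remaining coefficients absorb the error:
X_nobias = np.column_stack([x, x**2])
b0, c0 = np.linalg.lstsq(X_nobias, y, rcond=None)[0]
print(b0, c0)        # c0 ends up well away from the true 0.5
```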

Transform rhombus defined by corners to another rhombus

If you have a 2D model in the shape of a rhombus, defined by its four corners, and you want to transform it into the shape of another rhombus, given its four corners, how would you do this? Can you do it with a transformation matrix?
My line of thinking is that you can find 4 vectors along the sides such that ||b||/||side_b|| = ||c||/||side_c||, ||a||/||side_a|| = ||b||/||side_b||, such that (a,b) and (b,c) cross at p.
        a
 ------->-------
 |             |
 b ^    p    ^ c
 |             |
 |             |
 ------->-------
        b
However, I would like the most efficient way possible.
So a rhombus can be defined by 2 vectors, one for each pair of parallel sides. A matrix that maps a unit (1x1) square onto a rhombus, with those vectors as its columns, looks like this:
| X1_1  Y1_1  0 |
| X1_2  Y1_2  0 |  = Matrix A
|    0     0  1 |
and we want to go from this toward some new dimensions, represented as the following:
| X2_1  Y2_1  0 |
| X2_2  Y2_2  0 |  = Matrix B
|    0     0  1 |
We want a transform T that carries Matrix A onto Matrix B. With the column-vector convention, T is applied on the left:
Transform T * Matrix A = Matrix B.
Doing some basic matrix shuffling, and...
Transform T = Matrix B * inverse(Matrix A).
So just fill Matrix B with the dimensions you want, and fill Matrix A with the values you start with.
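As a quick sanity check, here is a NumPy sketch of that final formula, with T applied on the left (T · A = B). The two matrices are made-up example rhombi, with the edge vectors stored as columns in homogeneous 3x3 form:

```python
import numpy as np

# Made-up example rhombi; each matrix holds the two edge vectors of a
# rhombus as its first two columns, in homogeneous 3x3 form.
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.5, 0.0],
              [0.0, 0.0, 1.0]])   # starting rhombus
B = np.array([[1.0, -0.5, 0.0],
              [0.5,  2.0, 0.0],
              [0.0,  0.0, 1.0]])  # target rhombus

# Transform T = B * inverse(A) carries A onto B:
T = B @ np.linalg.inv(A)

print(np.allclose(T @ A, B))  # True
```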

masking a double over a string

This is a MATLAB question...
I have two matrices, one being a (5 x 1 double) :
1
2
3
1
3
And the second matrix being a (5 x 3 string), with spaces where no character appears :
a
bc
def
g
hij
I am trying to get an output such that a (5 x 1 string) is created, containing the nth value from each line of matrix two, where n is the value in matrix one. I am unsure how to do this using a mask which would be able to handle much larger matrices. My target matrix would have the following:
a
c
f
g
j
Thank you very much for the help!!!
There are so many ways you can accomplish this task. I'll give you two.
Method #1 - Generate linear indices and access elements
Use sub2ind to generate a set of linear indices that correspond to the row and column locations you want to access in your matrix. You'll note that the column locations are the ones changing, but the row locations are always increasing by 1 as you want to access each row. As such, given your string matrix A, and your columns you want to access stored in ind, just do this:
A = ['a  '; 'bc '; 'def'; 'g  '; 'hij'];  % rows padded with spaces to equal length
ind = [1 2 3 1 3];
out = A(sub2ind(size(A), (1:numel(ind)).', ind(:)))
out =
a
c
f
g
j
Method #2 - Create a sparse matrix, convert to logical and access
Alternatively, you can create a sparse matrix through sparse, where the non-zero entries are at rows varying from 1 up to as many elements as you have in ind, and at the columns given by ind.
S = sparse((1:numel(ind)).',ind(:),true,size(A,1),size(A,2));
A = A.'; out = A(S.');
Be mindful that you are trying to access each element in a row-major fashion, yet MATLAB will do this in a column-major format. As such, we would need to transpose our data matrix, and also take our sparse matrix and transpose that too. The end result should give you the same order as Method #1.
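For readers outside MATLAB, a rough NumPy equivalent of Method #1 (with fancy indexing playing the role of sub2ind, and 0-based indices) might look like this:

```python
import numpy as np

# The same padded character matrix as in the MATLAB example.
A = np.array([list('a  '), list('bc '), list('def'),
              list('g  '), list('hij')])
ind = np.array([1, 2, 3, 1, 3]) - 1   # MATLAB is 1-based, NumPy 0-based

# Pick column ind[i] from row i, for every row.
rows = np.arange(len(ind))
out = ''.join(A[rows, ind])
print(out)  # 'acfgj'
```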

About affine transform (translation )

For a 4x4 affine transformation, I have seen two representations in different texts.
One is
| L  T |
| 0  1 |
Another is
| L  0 |
| T  1 |
L is the linear part, T is the translation part; I am wondering which is correct?
Both forms are correct. The first is used with the column-vector convention, where you left-multiply the vector by the matrix: ResultVector = Matrix * V (for example, in OpenGL). The second is used with the row-vector convention, V * Matrix (for example, in DirectX). The two matrices are simply transposes of each other.
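A small NumPy sketch of the two conventions, using an illustrative 2D case (3x3 homogeneous matrices for brevity; the same block layout applies to the 4x4 case in the question):

```python
import numpy as np

L = np.array([[0.0, -1.0],
              [1.0,  0.0]])       # linear part: 90-degree rotation
T = np.array([2.0, 3.0])          # translation part

# Column-vector convention: translation in the last COLUMN.
M_col = np.eye(3)
M_col[:2, :2] = L
M_col[:2, 2] = T

# Row-vector convention: the transpose, translation in the last ROW.
M_row = M_col.T

p = np.array([1.0, 0.0, 1.0])     # homogeneous point (1, 0)
p_col = (M_col @ p)[:2]           # left-multiply, column vector
p_row = (p @ M_row)[:2]           # right-multiply, row vector
print(p_col, p_row)               # both conventions agree
```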

Best way to match 4 million rows of data against each other and sort results by similarity?

We use libpuzzle ( http://www.pureftpd.org/project/libpuzzle/doc ) to compare 4 million images against each other for similarity.
It works quite well.
But rather than doing an image-vs-image compare using the libpuzzle functions, there is another method of comparing the images.
Here is some quick background:
Libpuzzle creates a rather small (544-byte) hash of any given image. This hash can in turn be compared against other hashes using libpuzzle's functions. There are a few APIs... PHP, C, etc.... We are using the PHP API.
The other method of comparing the images is by creating vectors from the given hash, here is a paste from the docs:
Cut the vector in fixed-length words. For instance, let's consider the
following vector:
[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
With a word length (K) of 10, you can get the following words:
[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1
Then, index your vector with a compound index of (word + position).
Even with millions of images, K = 10 and N = 100 should be enough to
have very little entries sharing the same index.
So, we have the vector method working. It actually works a bit better than the image-vs-image compare, since when we do the image-vs-image compare, we use other data to reduce our sample size. What other data we use is a bit irrelevant and application specific, but with the vector method we would not have to do so; we could do a real test of each of the 4 million hashes against each other.
The issue we have is as follows:
With 4 million images, 100 vectors per image, this becomes 400 million rows. We have found MySQL tends to choke after about 60000 images (60000 x 100 = 6 million rows).
The query we use is as follows:
SELECT isw.itemid, COUNT(isw.word) as strength
FROM vectors isw
JOIN vectors isw_search ON isw.word = isw_search.word
WHERE isw_search.itemid = {ITEM ID TO COMPARE AGAINST ALL OTHER ENTRIES}
GROUP BY isw.itemid;
As mentioned, even with proper indexes, the above is quite slow when it comes to 400 million rows.
So, can anyone suggest any other technologies / algos to test these for similarity?
We are willing to give anything a shot.
Some things worth mentioning:
Hashes are binary.
Hashes are always the same length, 544 bytes.
The best we have been able to come up with is:
Convert image hash from binary to ascii.
Create vectors.
Create a string as follows: VECTOR1 VECTOR2 VECTOR3 etc etc.
Search using sphinx.
We have not yet tried the above, but this should probably yield a bit better results than doing the mysql query.
Any ideas? As mentioned, we are willing to install any new service (postgresql? hadoop?).
Final note, an outline of exactly how this vector + compare method works can be found in question Libpuzzle Indexing millions of pictures?. We are in essence using the exact method provided by Jason (currently the last answer, awarded 200+ so points).
Don't do this in a database, just use a simple file. Below I have shown a file with some of the words from the two vectors [abcdefghijklmnopqrst] (image 1) and [xxcdefghijklxxxxxxxx] (image 2):
<index> <image>
0abcdefghij 1
1bcdefghijk 1
2cdefghijkl 1
3defghijklm 1
4efghijklmn 1
...
...
0xxcdefghij 2
1xcdefghijk 2
2cdefghijkl 2
3defghijklx 2
4efghijklxx 2
...
Now sort the file:
<index> <image>
0abcdefghij 1
0xxcdefghij 2
1bcdefghijk 1
1xcdefghijk 2
2cdefghijkl 1
2cdefghijkl 2 <= the index is repeated, thus we have a match
3defghijklm 1
3defghijklx 2
4efghijklmn 1
4efghijklxx 2
When the file has been sorted, it's easy to find the records that have the same index. Write a small program or something similar that can run through the sorted list and find the duplicates.
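The sort-and-scan idea above can be sketched in a few lines of Python (the two example vectors are the ones from the listing; function and variable names are illustrative):

```python
# Split a vector into fixed-length words prefixed with their position,
# as in the libpuzzle docs quoted earlier (K = 10).
def words(vector, k=10):
    return [f"{i}{vector[i:i + k]}" for i in range(len(vector) - k + 1)]

# Emit (position-prefixed word, image id) pairs for both images.
entries = []
for image_id, vec in [(1, "abcdefghijklmnopqrst"),
                      (2, "xxcdefghijklxxxxxxxx")]:
    for w in words(vec):
        entries.append((w, image_id))

# Sorting brings identical (position + word) entries next to each other.
entries.sort()

# Scan adjacent pairs for repeated indices from different images.
matches = [(w1, img1, img2)
           for (w1, img1), (w2, img2) in zip(entries, entries[1:])
           if w1 == w2 and img1 != img2]

print(matches)  # the shared word '2cdefghijkl' links images 1 and 2
```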
I have opted to answer my own question, as we have found a solution that works quite well.
In the initial question, I mentioned we were thinking of doing this via Sphinx search.
Well, we went ahead and did it, and the results are MUCH better than doing this via MySQL.
So, in essence, the process looks like this:
a) generate hash from image.
b) 'vectorize' this hash into 100 parts.
c) binhex (binary to hex) each of these vectors since they are in binary format.
d) store in sphinx search like so:
itemid | 0_vector0 1_vector1 2_vec... etc
e) search using sphinx search.
Initially, once we had this sphinxbase full of 4 million records, it would still take about 1 second per search.
We then enabled distributed indexing for this sphinxbase, on 8 cores, and can now run about 10+ searches per second. This is good enough for us.
One final step would be to further distribute this sphinxbase over the multiple servers we have, further utilizing the unused CPU cycles available.
But for the time being, it's good enough. We add about 1000-2000 'items' per day, so searching through 'just the new ones' will happen quite quickly, after we do the initial scan.
