Unique pairs of unequal arrays in J

Suppose two arrays of different sizes:
N0 =: i. 50
N1 =: i. 500
There should be a way to get the unique pairs, just combine the two. The "simplest" I found was:
]$R =: |:,"2 |: (,.N0) ,"1 0/ N1
25000 2
Which is frankly a butt ugly, baseball bat solution. Is there a more elegant way to do this?

The pattern of data you're reaching for is a variation on Catalogue. It's the most famous variation, in fact: Cartesian product.
On the Vocabulary listing for Catalogue there's also code for Cartesian product. To get the list you want, just ravel and open the result.
pair=: >@,@{@(,&<)
$ N0 pair N1
25000 2

I'm in search of the same thing.
I've only come up with the following, which are shorter but not prettier:
,/(N0 ,. ])"0 N1
;(N0 ,. ]) &.> N1
or in the form:
;N0&,.&.>N1
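For comparison, the same Cartesian product outside J: a minimal Python sketch using itertools.product (purely illustrative, not part of the original answers):

```python
from itertools import product

# The two ranges from the question: i. 50 and i. 500
N0 = range(50)
N1 = range(500)

# Cartesian product: every (a, b) with a taken from N0 and b from N1,
# equivalent to the 25000 2 array the J expressions produce
pairs = list(product(N0, N1))

print(len(pairs))  # 25000
```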

How to classify my binary classified data in excel in pair and unpair rows
I want my data to be classified so that one class occupies the even rows and the other class the odd rows, alternating through all the data. Here is a sample input and the expected output:
Input          Sorted
Text  Gender   Text  Gender
BB    M        BB    M
AA    F        AA    F
CC    F        DD    M
DD    M        CC    F
AB    F        CD    M
BA    F        AB    F
CD    M        DC    M
DC    M        BA    F
where the last two columns are the expected result, sorted evenly by M, F. A valid solution could also start with F instead of M. The Text column is irrelevant to the sorting algorithm; it is just required as part of the output to show the input sorted.
This adapts the idea from my answer (using Microsoft 365) to the question: Is there a way to sort a list so that rows with the same value in one column are evenly distributed?. In cell E2 enter the following formula:
=LET(groupSize, 2, sorted, SORT(HSTACK(A2:B9,XMATCH(B2:B9,{"M","F"})),3), sInput,
FILTER(sorted, {1,1,0}),sGenderNum, INDEX(sorted,,3),
seq0, SEQUENCE(ROWS(sGenderNum),,0), mapResult,
MAP(sGenderNum, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sGenderNum,b), "SAME", "NEW")))), factor,
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0))),
pos,MAP(sGenderNum, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sInput,pos)
)
Simplification 1: There is no need to add a numeric column on the fly representing the gender, but the formula from the previous question needs to change a little. It is enough to add a SWITCH statement in the last MAP function. If you want to start with F, interchange the letters in the SWITCH statement or the associated numbers.
=LET(groupSize, 2, sorted, SORT(A2:B9,2), sGender,INDEX(sorted,,2),
seq0, SEQUENCE(ROWS(sGender),,0),
mapResult,MAP(sGender, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sGender,b), "SAME", "NEW")))),
factor, SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0))),
pos,MAP(sGender, factor, LAMBDA(m,n, SWITCH(m, "M",1, "F",2) + groupSize*n)),
SORTBY(sorted,pos)
)
Simplification 2: In this particular case, after sorting the gender in ascending order, there is only one change from F to M, so we can remove the first MAP from the previous solution and instead find where that change happens (changeIdx). It can be simplified as follows:
=LET(groupSize, 2, sorted, SORT(A2:B9,2), sGender,INDEX(sorted,,2),
changeIdx, XMATCH("M", sGender), seq, SEQUENCE(ROWS(sGender)),
factor, SCAN(-1,seq, LAMBDA(aa,c, IF(c= changeIdx,0, aa+1))),
pos, MAP(sGender, factor, LAMBDA(m,n, SWITCH(m, "M",1, "F",2) + groupSize*n)),
SORTBY(sorted,pos))
The previous approach works only for the binary case (two values for the gender). See the Disclaimer note at the end.
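To make the interleaving logic easier to follow outside Excel, here is a minimal Python sketch of the same idea (my own illustration, not the formula itself): each row gets a position computed as its class rank plus groupSize times how many rows of that class have been seen so far, mirroring pos = SWITCH(...) + groupSize * factor.

```python
rows = [("BB", "M"), ("AA", "F"), ("CC", "F"), ("DD", "M"),
        ("AB", "F"), ("BA", "F"), ("CD", "M"), ("DC", "M")]

group_size = 2            # number of classes
rank = {"M": 1, "F": 2}   # M first; swap the values to start with F

# position = class rank + group_size * (occurrences of this class so far)
counts = {}
keyed = []
for text, g in rows:
    n = counts.get(g, 0)
    counts[g] = n + 1
    keyed.append((rank[g] + group_size * n, text, g))

# sorting by position interleaves the classes: M, F, M, F, ...
interleaved = [(text, g) for _, text, g in sorted(keyed)]
```

Note this sketch keeps the input order within each class, while the formula sorts by gender first; the alternating pattern is the same.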
Explanation
First Formula
Please check the referred answer to understand the logic. If you want to start with Female (F), interchange the letters M, F in XMATCH. The mentioned solution requires numbers instead of letters for the column to sort, so I adapted the input by adding an additional column on the fly via HSTACK with the equivalent numbers representing the gender (1, 2). The column representing the numbers is the following:
XMATCH(B2:B9,{"M","F"})
Disclaimer: This is a simpler case than the referred question, so maybe there are easier ways to do it. Because this is just a particular case, it is easier to adapt the existing solution than to start from scratch, and we can guarantee it works. If I have time I will try to simplify it, but so far it is good enough.

A better way to rotate columns of a matrix independently

As part of my journey to learn j I implemented a technique for computing the area of a polygon I came across in Futility Closet. I came up with a solution, but it's quite inelegant, so I'm interested in better methods:
polyarea =: -:@((+/@((1&{&|:)*(0{&|:1&|.)))-(+/@((0&{&|:)*(1{&|:1&|.))))
y =: 2 7 9 5 6,.5 7 1 0 4
polyarea y
20
This technique rotates one column and takes the dot product of the columns, then does the same after rotating the other column. The area is half the difference of these two results.
Interested in suggestions!
I think their technique boils down to using the determinant to find the area of the polygon: http://mathworld.wolfram.com/PolygonArea.html
But using the Futility Closet technique I would first close the polygon by adding the first point to the end.
y =: 2 7 9 5 6,.5 7 1 0 4
close=: (, {.)
close is a hook that takes the first pair and appends it to the end
Then take the determinants two points at a time, which is essentially what they are doing with their columns and rotations
dets=: 2 (-/ . *)\ close
dets takes the determinant of each pair of points - result is negative if the points are in clockwise order
Then take those values and process for the answer.
clean=: |@:-:@:(+/)
clean sums up the determinants, divides by 2 and returns the absolute value of the result.
clean@:dets y
20
To see the result in complete tacit form we can lean on the f. adverb (Fix) to flatten our definitions.
clean@:dets f.
|@:-:@:(+/)@:(2 -/ .*\ (, {.))
It is just a different way of looking at what they are doing, but it allows J to use the . conjunction (Dot Product) and \ adverb (Infix) to handle all of those rotations with determinants.
Hope this helps.
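The determinant-per-edge computation above is the shoelace formula. As a cross-check, here is a short Python sketch of the same computation (the function name is my own):

```python
# Shoelace formula: area = |sum(x_i*y_{i+1} - x_{i+1}*y_i)| / 2,
# taken over the closed polygon (indices wrap around).
def polyarea(points):
    n = len(points)
    s = 0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrapping closes the polygon,
                                      # like appending the first point in J
        s += x0 * y1 - x1 * y0        # 2x2 determinant per edge
    return abs(s) / 2                 # sign depends on winding order

print(polyarea([(2, 5), (7, 7), (9, 1), (5, 0), (6, 4)]))  # 20.0
```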

Approximate String Matching Algorithms for names

I'm looking for fuzzy string algorithms for the following example: given a database of existing names, match inputs to either the best-matched name if the match accuracy is higher than the input threshold (say 90%), or NA otherwise
database = [James Bond, Michael Smith]
input
James L Bond->James Bond
JBondL->James Bond
Bond,James->James Bond
BandJamesk->James Bond
Jenny,Bond->N/A
Currently, most algorithms like Levenshtein, and phonetic-based ones like Soundex, can't match inverted names like BondJames. So far cosine and Jaccard similarity yield the best results, but I'm looking for more, so that I can choose the best or possibly combine algorithms.
Given your examples, I would consider:
Separating n1 - the name in the input and n2 - a name in the database into words (by delimiters and capital letters): n1 -> {u1,u2,...}, n2 -> {v1,v2,...}
Finding the permutation of the order of words in n2 that minimizes s = sum(L(u, v)) where L is the Levenshtein distance.
Selecting the database entry that minimizes s.
When the number of words in n1 and the number of words in n2 don't match, you should 'penalize' s.
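The steps above can be sketched in Python as follows. This is a minimal illustration under my own assumptions: the function names and the penalty weight of 3 per extra word are hypothetical choices, not part of the answer.

```python
from itertools import permutations

def levenshtein(a, b):
    # classic row-by-row dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def word_order_distance(ws1, ws2):
    # permute ws2 to find the word alignment minimizing the summed
    # edit distance s; penalize mismatched word counts (weight assumed)
    r = min(len(ws1), len(ws2))
    best = min(sum(levenshtein(u, v) for u, v in zip(ws1, perm))
               for perm in permutations(ws2, r))
    return best + 3 * abs(len(ws1) - len(ws2))
```

With this, word_order_distance(["Bond", "James"], ["James", "Bond"]) is 0, so inverted names match exactly. Note the permutation search is factorial in the word count, which is fine for names but not for long strings.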

Mapping of elements by number of occurrences in J

Using J language, I wish to attain a mapping of the counts of elements of an array.
Specifically, I want to input a lowercase English word with two or more letters and get back each successive pair of letters in the word along with counts of occurrences.
I need a verb that gives something like this, in whatever J structure you think is appropriate:
For 'cocoa':
co 2
oc 1
oa 1
For 'banana':
ba 1
an 2
na 2
For 'milk':
mi 1
il 1
lk 1
For 'to':
to 1
(For single letter words like 'a', the task is undefined and will not be attempted.)
(Order is not important, that's just how I happened to list them.)
I can easily attain successive pairs of letters in a word as a matrix or list of boxes:
2(] ;._3)'cocoa'
co
oc
co
oa
2(< ;._3)'cocoa'
┌──┬──┬──┬──┐
│co│oc│co│oa│
└──┴──┴──┴──┘
But I need help getting from there to a mapping of pairs to counts.
I am aware of ~. and ~: but I don't just want to return the unique elements or indexes of duplicates. I want a mapping of counts.
NuVoc's "Loopless" page indicates that / (or /\. or /\) is where I should be looking for accumulation problems. I am familiar with / for arithmetic operations on numeric arrays, but for u/y I don't know what u would have to be to accumulate the list of pairs of letters that make up y.
(NB. I can already do this in "normal" languages like Java or Python without help. Similar questions on SO are for languages with very different syntax and semantics to J. I am interested in the idiomatic J approach to this sort of problem.)
To get the list of 2-letter combinations I'd use dyadic infix (\):
2 ]\ 'banana'
ba
an
na
an
na
To count occurrences the primitive that immediately comes to mind is key (/.)
#/.~ 2 ]\ 'banana'
1 2 2
If you want to match the counts to the letter combinations you can extend the verb to the following fork:
({. ; #)/.~ 2 ]\ 'banana'
┌──┬─┐
│ba│1│
├──┼─┤
│an│2│
├──┼─┤
│na│2│
└──┴─┘
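For readers coming from other languages, the same pairing-and-counting can be sketched in Python (my own illustration; collections.Counter plays the role of J's key adverb here):

```python
from collections import Counter

def bigram_counts(word):
    # successive letter pairs, like 2 ]\ word in J,
    # then count occurrences of each pair
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

for w in ('cocoa', 'banana', 'milk', 'to'):
    print(w, dict(bigram_counts(w)))
```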
I think that you are looking to map counts of unique items to the items. You can correct me if I am wrong.
Starting with
[t=. 2(< ;._3)'cocoa'
┌──┬──┬──┬──┐
│co│oc│co│oa│
└──┴──┴──┴──┘
You can use ~. (Nub) to return the unique items in the list
~.t
┌──┬──┬──┐
│co│oc│oa│
└──┴──┴──┘
Then if you compare the nub to the boxed list you get a matrix where the 1's are the positions that match the nub to the boxed pairs in your string
t =/ ~.t
1 0 0
0 1 0
1 0 0
0 0 1
Sum the columns of this matrix and you get the number of times each item of the nub shows up
+/ t =/ ~.t
2 1 1
Then box them so that you can combine the integers along side the boxed characters
<"0 +/ t =/ ~.t
┌─┬─┬─┐
│2│1│1│
└─┴─┴─┘
Combine them by stitching together the nub and the count using ,. (Stitch)
(~.t) ,. <"0 +/ t =/ ~.t
┌──┬─┐
│co│2│
├──┼─┤
│oc│1│
├──┼─┤
│oa│1│
└──┴─┘
[t=. 2(< ;._3)'banana'
┌──┬──┬──┬──┬──┐
│ba│an│na│an│na│
└──┴──┴──┴──┴──┘
(~.t) ,. <"0 +/ t =/ ~.t
┌──┬─┐
│ba│1│
├──┼─┤
│an│2│
├──┼─┤
│na│2│
└──┴─┘
[t=. 2(< ;._3)'milk'
┌──┬──┬──┐
│mi│il│lk│
└──┴──┴──┘
(~.t) ,. <"0 +/ t =/ ~.t
┌──┬─┐
│mi│1│
├──┼─┤
│il│1│
├──┼─┤
│lk│1│
└──┴─┘
Hope this helps.

Best way to match 4 million rows of data against each other and sort results by similarity?

We use libpuzzle ( http://www.pureftpd.org/project/libpuzzle/doc ) to compare 4 million images against each other for similarity.
It works quite well.
But rather than doing an image-vs-image compare using the libpuzzle functions, there is another method of comparing the images.
Here is some quick background:
Libpuzzle creates a rather small (544 bytes) hash of any given image. This hash can in turn be used to compare against other hashes using libpuzzles functions. There are a few APIs... PHP, C, etc etc... We are using the PHP API.
The other method of comparing the images is by creating vectors from the given hash, here is a paste from the docs:
Cut the vector in fixed-length words. For instance, let's consider the
following vector:
[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]
With a word length (K) of 10, you can get the following words:
[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1
Then, index your vector with a compound index of (word + position).
Even with millions of images, K = 10 and N = 100 should be enough to
have very little entries sharing the same index.
So, we have the vector method working. It actually works a bit better than the image-vs-image compare, since when we do the image-vs-image compare we use other data to reduce our sample size. What other data we use to reduce the sample size is a bit irrelevant and application-specific, but with the vector method we would not have to do so; we could do a real test of each of the 4 million hashes against each other.
The issue we have is as follows:
With 4 million images, 100 vectors per image, this becomes 400 million rows. We have found MySQL tends to choke after about 60000 images (60000 x 100 = 6 million rows).
The query we use is as follows:
SELECT isw.itemid, COUNT(isw.word) as strength
FROM vectors isw
JOIN vectors isw_search ON isw.word = isw_search.word
WHERE isw_search.itemid = {ITEM ID TO COMPARE AGAINST ALL OTHER ENTRIES}
GROUP BY isw.itemid;
As mentioned, even with proper indexes, the above is quite slow when it comes to 400 million rows.
So, can anyone suggest any other technologies / algos to test these for similarity?
We are willing to give anything a shot.
Some things worth mentioning:
Hashes are binary.
Hashes are always the same length, 544 bytes.
The best we have been able to come up with is:
Convert the image hash from binary to ASCII.
Create vectors.
Create a string as follows: VECTOR1 VECTOR2 VECTOR3 etc etc.
Search using sphinx.
We have not yet tried the above, but it should probably yield somewhat better results than the MySQL query.
Any ideas? As mentioned, we are willing to install any new service (postgresql? hadoop?).
Final note, an outline of exactly how this vector + compare method works can be found in question Libpuzzle Indexing millions of pictures?. We are in essence using the exact method provided by Jason (currently the last answer, awarded 200+ so points).
Don't do this in a database, just use a simple file. Below I have shown a file with some of the words from the two vectors [abcdefghijklmnopqrst] (image 1) and [xxcdefghijklxxxxxxxx] (image 2):
<index> <image>
0abcdefghij 1
1bcdefghijk 1
2cdefghijkl 1
3defghijklm 1
4efghijklmn 1
...
...
0xxcdefghij 2
1xcdefghijk 2
2cdefghijkl 2
3defghijklx 2
4efghijklxx 2
...
Now sort the file:
<index> <image>
0abcdefghij 1
0xxcdefghij 2
1bcdefghijk 1
1xcdefghijk 2
2cdefghijkl 1
2cdefghijkl 2 <= the index is repeated, thus we have a match
3defghijklm 1
3defghijklx 2
4efghijklmn 1
4efghijklxx 2
When the file has been sorted, it's easy to find the records that have the same index. Write a small program or something that can run through the sorted list and find the duplicates.
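A minimal sketch of that sort-and-scan approach (my own illustration, using the two example vectors from above; the separator-free "position + word" key format follows the listing):

```python
# Emit (position-prefixed word, image id) records, sort them,
# then scan adjacent records for equal keys: equal keys sort together.
def words(vector, k=10):
    return [f"{i}{vector[i:i + k]}" for i in range(len(vector) - k + 1)]

records = []
for image_id, vec in [(1, "abcdefghijklmnopqrst"), (2, "xxcdefghijklxxxxxxxx")]:
    records += [(w, image_id) for w in words(vec)]

records.sort()  # in practice this would be an external sort over a file

# adjacent records with the same key are images sharing a (word, position)
matches = [(a, b) for (ka, a), (kb, b) in zip(records, records[1:]) if ka == kb]
print(matches)  # → [(1, 2)]
```

The two example vectors share only the word at position 2 ("cdefghijkl"), so the scan reports one match between images 1 and 2.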
I have opted to 'answer my own' question, as we have found a solution that works quite well.
In the initial question, I mentioned we were thinking of doing this via Sphinx search.
Well, we went ahead and did it, and the results are MUCH better than doing this via MySQL.
so, in essence the process looks like this:
a) generate hash from image.
b) 'vectorize' this hash into 100 parts.
c) binhex (binary to hex) each of these vectors since they are in binary format.
d) store in sphinx search like so:
itemid | 0_vector0 1_vector1 2_vec... etc
e) search using sphinx search.
Initially, once we had this sphinxbase full of 4 million records, it would still take about 1 second per search.
We then enabled distributed indexing for this sphinxbase, on 8 cores, and can now run about 10+ searches per second. This is good enough for us.
One final step would be to further distribute this sphinxbase over the multiple servers we have, further utilizing the unused CPU cycles we have available.
But for the time being, good enough. We add about 1000-2000 'items' per day, so searching through 'just the new ones' will happen quite quickly after we do the initial scan.
