Replace items in an array with the J verb `I.`

Here is a simple replace for a rank-1 list using the I. verb:
y=: _3 _2 _1 1 2 3
0 (I. y<0) } y
The result is
0 0 0 1 2 3
How do I do such a replacement for a rank-2 matrix?
For example,
y2 =: 2 3 $ _3 _2 _1 1 2 3
0 (I. y2<0) } y2
I got (J806):
|index error
|   0 (I.y2<0)}y2
The reason seems to be
(I. y2 < 0)
gives
0 1 2
0 0 0
which isn't taken well by }.

The simplest answer to this problem is to use dyadic >. (Larger of):
0 >. y2
0 0 0
1 2 3
If you want to use a more general conditional replacement criterion, then the following form may be useful:
(0 > y2)} y2 ,: 0
0 0 0
1 2 3
If you want it as a verb, then you can use the gerund form (v1`v2)} y ↔ (v1 y)} (v2 y):
(0 > ])`(0 ,:~ ])} y2
0 0 0
1 2 3
If your question is more about scatter index replacement then that is possible too. You need to get the 2D indices of positions you want to replace, for example:
4 $. $. 0 > y2
0 0
0 1
0 2
Now box those indices and use dyadic }:
0 (<"1 (4 $. $. 0 > y2)) } y2
0 0 0
1 2 3
Again you can turn this into a verb using a gerund left argument to dyadic } (x (v0`v1`v2)} y ↔ (x v0 y) (x v1 y)} (x v2 y)) like this:
0 [`([: (<"1) 4 $. [: $. 0 > ])`]} y2
0 0 0
1 2 3
Or
100 101 102 [`([: (<"1) 4 $. [: $. 0 > ])`]} y2
100 101 102
1 2 3
To tidy this up a bit you could define getIdx as a separate verb:
getIdx=: 4 $. $.
0 [`([: <"1 [: getIdx 0 > ])`]} y2
0 0 0
1 2 3

This is not a good solution. My original approach was to change the rank of the test so that it looks at each row separately, but that does not work in the general case (see comments below).
[y2 =: 2 3 $ _3 _2 _1 1 2 3
_3 _2 _1
1 2 3
I. y2<0
0 1 2
0 0 0
0 (I. y2<0)"1 } y2 NB. Rank of 1 applies to each row of y2
0 0 0
1 2 3
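For comparison outside the APL family, the boolean-mask amend above corresponds to boolean-mask assignment in NumPy; a minimal Python sketch (written for this writeup, not part of the original thread):
import numpy as np

y2 = np.array([[-3, -2, -1], [1, 2, 3]])
print(np.maximum(y2, 0))   # analogue of dyadic 0 >. y2
y2[y2 < 0] = 0             # in-place analogue of (0 > y2)} y2 ,: 0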

How to achieve the following output in a pandas dataframe [duplicate]

This question already has answers here:
Reconstruct a categorical variable from dummies in pandas (7 answers)
Closed 4 years ago.
df:
category A B C D
x 0 1 0 0
y 1 0 0 0
z 1 0 0 0
l 0 0 0 1
m 0 1 0 0
n 0 0 1 0
How do I get a df like the one below?
Category Sub-category
x B
y A
z A
l D
m B
n C
I tried:
df['sector'] = df.apply(lambda x: df.columns[x.argmax()], axis = 1)
but I'm getting: TypeError: ("reduction operation 'argmax' not allowed for this dtype", 'occurred at index 1')
Just do
df['sub_category'] = df[['A', 'B', 'C', 'D']].idxmax(axis=1)
category A B C D sub_category
0 x 0 1 0 0 B
1 y 1 0 0 0 A
2 z 1 0 0 0 A
3 l 0 0 0 1 D
4 m 0 1 0 0 B
5 n 0 0 1 0 C
Of course, you may select only the columns you want:
df[['category', 'sub_category']]
category sub_category
0 x B
1 y A
2 z A
3 l D
4 m B
5 n C
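As an aside on the original error: apply(axis=1) hands the lambda whole rows, and once the non-numeric category column is included the row's dtype is object, which argmax/idxmax rejects. A minimal sketch of a working apply-based variant (assuming the dummy columns are exactly A through D):
df['sub_category'] = df[['A', 'B', 'C', 'D']].apply(lambda r: r.idxmax(), axis=1)
The vectorized idxmax(axis=1) above is still preferable; the apply version is shown only to explain the TypeError.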

How to understand synergy in information theory?

In information theory, multivariate mutual information (MMI) can be negative (synergy) or positive (redundancy). To simulate these two cases, assume three binary variables X, Y and Z, each taking the value 0 or 1, and sample them 12 times.
Case 1:
X = [ 0 0 0 0 0 0 1 1 1 1 1 1 ]
Y = [ 0 0 0 0 1 1 0 0 1 1 1 1 ]
Z = [ 0 0 1 1 1 1 0 0 0 0 1 1 ]
In this case, we assume a mechanism among X, Y and Z such that when Y and Z are both 0 or both 1, X takes 0 or 1 respectively. When Y = 0 and Z = 1, X takes 0; when Y = 1 and Z = 0, X takes 1.
Here mmi(X,Y,Z) = -0.1699, indicating a synergy effect among the three variables.
Case 2:
X = [ 0 0 0 0 0 0 1 1 1 1 1 1 ]
Y = [ 0 0 0 0 0 1 0 1 1 1 1 1 ]
Z = [ 0 1 1 1 1 1 0 0 0 0 0 1 ]
The mechanism in this case is the same as above. The difference is that there are more samples where X and Y take different values and fewer samples where X and Y are both 0 or both 1.
Here mmi(X,Y,Z) = 0.0333, indicating redundancy.
So, in these two cases, can I say that synergy and redundancy arise from a similar mechanism (or relationship) among the three variables? And how should we understand redundancy, and particularly synergy, in realistic data?
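For anyone who wants to reproduce these numbers: the three-way quantity here is the interaction information I(X;Y) - I(X;Y|Z), with the sign convention that positive means redundancy. A minimal Python sketch under that assumption, which reproduces -0.1699 for case 1 and 0.0333 for case 2:
from collections import Counter
from math import log2

def mi(pairs):
    # I(A;B) in bits, estimated from a list of (a, b) samples
    n = len(pairs)
    pab = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(c/n * log2((c/n) / ((pa[a]/n) * (pb[b]/n)))
               for (a, b), c in pab.items())

def cmi(triples):
    # I(A;B|C) = sum over values c of p(c) * I(A;B | C=c)
    n, groups = len(triples), {}
    for a, b, c in triples:
        groups.setdefault(c, []).append((a, b))
    return sum(len(s)/n * mi(s) for s in groups.values())

def mmi(X, Y, Z):
    # interaction information, positive = redundancy
    return mi(list(zip(X, Y))) - cmi(list(zip(X, Y, Z)))

X = [0,0,0,0,0,0,1,1,1,1,1,1]
Y = [0,0,0,0,1,1,0,0,1,1,1,1]
Z = [0,0,1,1,1,1,0,0,0,0,1,1]
print(round(mmi(X, Y, Z), 4))   # -0.1699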

How to iterate through 'nested' dataframes without 'for' loops in pandas (python)?

I'm trying to check the cartesian distance between each pair of points in one dataframe and a set of scattered points in another dataframe, to see if the input points get above a threshold 'distance' from my checking points.
I have this working with nested for loops, but it is painfully slow (~7 min for 40k input rows, each checked against ~180 other rows, plus some overhead operations).
Here is what I'm attempting in vectorized form: 'for every pair of points (a,b) from df1, if the distance to ANY point (d,e) from df2 is > threshold, print "yes" into df1.c, next to the input points.'
...but I'm getting unexpected behaviour from this. With the given data, all but one of the distances are > 1, yet only the first row of df1.c gets 'yes'.
Thanks for any ideas - the problem is probably in the 'df1.loc...' line:
import numpy as np
from pandas import DataFrame
inp1 = [{'a':1, 'b':2, 'c':0}, {'a':1,'b':3,'c':0}, {'a':0,'b':3,'c':0}]
df1 = DataFrame(inp1)
inp2 = [{'d':2, 'e':0}, {'d':0,'e':3}, {'d':0,'e':4}]
df2 = DataFrame(inp2)
threshold = 1
df1.loc[np.sqrt((df1.a - df2.d) ** 2 + (df1.b - df2.e) ** 2) > threshold, 'c'] = "yes"
print(df1)
print(df2)
a b c
0 1 2 yes
1 1 3 0
2 0 3 0
d e
0 2 0
1 0 3
2 0 4
Here is an idea to help you get started...
Source DFs:
In [170]: df1
Out[170]:
c x y
0 0 1 2
1 0 1 3
2 0 0 3
In [171]: df2
Out[171]:
x y
0 2 0
1 0 3
2 0 4
Helper DF with cartesian product:
In [172]: x = df1[['x','y']] \
.reset_index() \
.assign(k=0).merge(df2.assign(k=0).reset_index(),
on='k', suffixes=['1','2']) \
.drop('k',1)
In [173]: x
Out[173]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
4 1 1 3 1 0 3
5 1 1 3 2 0 4
6 2 0 3 0 2 0
7 2 0 3 1 0 3
8 2 0 3 2 0 4
now we can calculate the distance:
In [169]: x.eval("D=sqrt((x1 - x2)**2 + (y1 - y2)**2)", inplace=False)
Out[169]:
index1 x1 y1 index2 x2 y2 D
0 0 1 2 0 2 0 2.236068
1 0 1 2 1 0 3 1.414214
2 0 1 2 2 0 4 2.236068
3 1 1 3 0 2 0 3.162278
4 1 1 3 1 0 3 1.000000
5 1 1 3 2 0 4 1.414214
6 2 0 3 0 2 0 3.605551
7 2 0 3 1 0 3 0.000000
8 2 0 3 2 0 4 1.000000
or filter:
In [175]: x.query("sqrt((x1 - x2)**2 + (y1 - y2)**2) > @threshold")
Out[175]:
index1 x1 y1 index2 x2 y2
0 0 1 2 0 2 0
1 0 1 2 1 0 3
2 0 1 2 2 0 4
3 1 1 3 0 2 0
5 1 1 3 2 0 4
6 2 0 3 0 2 0
Try using the scipy implementations; they are surprisingly fast:
scipy.spatial.distance.pdist
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
or
scipy.spatial.distance_matrix
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.spatial.distance_matrix.html
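For the two-set case in the question, scipy.spatial.distance.cdist (the two-collection sibling of pdist) gives the full distance matrix in one call. A minimal sketch with the question's data and its 'any point beyond threshold' test:
import pandas as pd
from scipy.spatial.distance import cdist

df1 = pd.DataFrame([{'a': 1, 'b': 2, 'c': 0}, {'a': 1, 'b': 3, 'c': 0},
                    {'a': 0, 'b': 3, 'c': 0}])
df2 = pd.DataFrame([{'d': 2, 'e': 0}, {'d': 0, 'e': 3}, {'d': 0, 'e': 4}])
threshold = 1

# D[i, j] = Euclidean distance from df1 row i to df2 row j
D = cdist(df1[['a', 'b']].to_numpy(), df2[['d', 'e']].to_numpy())
df1.loc[(D > threshold).any(axis=1), 'c'] = 'yes'
This also explains the question's bug: df1.a - df2.d aligns the two frames on their index, so it compared row i of df1 with row i of df2 instead of all pairs.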

Combining pairs in a string (Matlab)

I have a string:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'
How can I combine pairs where the last character of one pair is the first character of a following pair into strings? The new strings must contain all of the characters 'A','B','C','D','E','F','G' that appear in the sup_pairs string.
The expected output should be:
S1 = 'BAEFCGD' % because BA will be followed by AE in sup_pairs string, so we combine BAE, and so on...we continue the rule to generate S1
S2 = 'DFCEABG'
If I have AB, BC and BD, the generated strings should be both ABC and ABD.
If there is a repeated character in the pairs, like AB BC CA CE, we skip the second A and get ABCE.
This, like all good things in life, is a graph problem. Each letter is a node, and each pair is an edge.
First we must transform your string of pairs into a numeric format so we can use the letters as subscripts. I will use A=2, B=3, ..., G=8:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG';
p=strsplit(sup_pairs,' ');
m=cell2mat(p(:));
m=m-'?';
A=sparse(m(:,1),m(:,2),1);
The sparse matrix A is now the adjacency matrix (actually, more like an adjacency list) representing our pairs. If you look at the full matrix of A, it looks like this:
>> full(A)
ans =
0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
As you can see, the edge BA, which translates to subscript (3,2), is equal to 1.
Now you can use your favorite implementation of Depth-first Search (DFS) to perform a traversal of the graph from your starting node of choice. Each path from the root to a leaf node represents a valid string. You then transform the path back into your letter sequence:
treepath=[3,2,6,7,4,8,5];
S1=char(treepath+'?');
Output:
S1 = BAEFCGD
Here's a recursive implementation of DFS to get you going. Normally in MATLAB you have to worry about not hitting the default limitation on recursion depth, but you're finding Hamiltonian paths here, which is NP-complete. If you ever get anywhere near the recursion limit, the computation time will be so huge that increasing the depth will be the least of your worries.
function full_paths = dft_all(A, current_path)
% A - adjacency matrix of graph
% current_path - initially just the start node (root)
% full_paths - cell array containing all paths from initial root to a leaf
n = size(A, 1); % number of nodes in graph
full_paths = cell(1,0); % return cell array
unvisited_mask = ones(1, n);
unvisited_mask(current_path) = 0; % mask off already visited nodes (path)
% multiply mask by array of nodes accessible from last node in path
unvisited_nodes = find(A(current_path(end), :) .* unvisited_mask);
% add restriction on length of paths to keep (numel == n)
if isempty(unvisited_nodes) && (numel(current_path) == n)
full_paths = {current_path}; % we've found a leaf node
return;
end
% otherwise, still more nodes to search
for node = unvisited_nodes
new_path = dft_all(A, [current_path node]); % add new node and search
if ~isempty(new_path) % if this produces a new path...
full_paths = {full_paths{1,:}, new_path{1,:}}; % add it to output
end
end
end
This is a normal depth-first traversal except for the added condition on the length of the path:
if isempty(unvisited_nodes) && (numel(current_path) == n)
The first half of the if condition, isempty(unvisited_nodes) is standard. If you only use this part of the condition you'll get all paths from the start node to a leaf, regardless of path length. (Hence the cell array output.) The second half, (numel(current_path) == n) enforces the length of the path.
I took a shortcut here because n is the number of nodes in the adjacency matrix, which in the sample case is 8 rather than 7, the number of characters in your alphabet. But there are no edges into or out of node 1 because I was apparently planning on using a trick that I never got around to telling you about. Rather than run DFS starting from each of the nodes to get all of the paths, you can make a dummy node (in this case node 1) and create an edge from it to all of the other real nodes. Then you just call DFS once on node 1 and you get all the paths. Here's the updated adjacency matrix:
A =
0 1 1 1 1 1 1 1
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
If you don't want to use this trick, you can change the condition to n-1, or change the adjacency matrix not to include node 1. Note that if you do leave node 1 in, you need to remove it from the resulting paths.
Here's the output of the function using the updated matrix:
>> dft_all(A, 1)
ans =
{
[1,1] =
1 2 3 8 5 7 4 6
[1,2] =
1 3 2 6 7 4 8 5
[1,3] =
1 3 8 5 2 6 7 4
[1,4] =
1 3 8 5 7 4 6 2
[1,5] =
1 4 6 2 3 8 5 7
[1,6] =
1 5 7 4 6 2 3 8
[1,7] =
1 6 2 3 8 5 7 4
[1,8] =
1 6 7 4 8 5 2 3
[1,9] =
1 7 4 6 2 3 8 5
[1,10] =
1 8 5 7 4 6 2 3
}
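If it helps to see the same search outside MATLAB, here is a minimal Python sketch of the depth-first enumeration (the helper all_full_paths is hypothetical, written for this answer; it keeps only paths that visit every node, i.e. Hamiltonian paths):
def all_full_paths(edges, nodes, path):
    # Extend `path` along `edges` without revisiting a node; keep only
    # dead-end paths that have visited all of `nodes`.
    nxt = [v for u, v in edges if u == path[-1] and v not in path]
    if not nxt:
        return [path] if len(path) == len(nodes) else []
    paths = []
    for v in nxt:
        paths += all_full_paths(edges, nodes, path + v)
    return paths

pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'.split()
edges = [(p[0], p[1]) for p in pairs]
nodes = set(''.join(pairs))
results = [p for start in nodes for p in all_full_paths(edges, nodes, start)]
# results contains 'BAEFCGD' and 'DFCEABG' among the other valid orderings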

How to count the frequency of an element in APL or J without loops

Assume I have two lists, one is the text t, one is a list of characters c. I want to count how many times each character appears in the text.
This can be done easily with the following APL code.
+⌿t∘.=c
However, it is slow: it takes the outer product, then sums each column.
It is an O(nm) algorithm, where n and m are the sizes of t and c.
Of course I can write a procedural program in APL that reads t character by character and solves this problem in O(n+m) (assuming perfect hashing).
Are there ways to do this faster in APL without loops (or conditionals)? I also accept solutions in J.
Edit:
Practically speaking, I'm doing this where the text is much shorter than the list of characters (the characters are non-ASCII). I'm considering cases where the text has length around 20 and the character list has length in the thousands.
There is a simple optimization given that n is smaller than m.
w ← (∪t)∩c
f ← +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r
w contains only the characters in t, so the table size depends only on t and not on c. This algorithm runs in O(n^2 + m log m), where m log m is the time for the intersection operation.
However, a sub-quadratic algorithm would still be preferred, in case someone feeds it a huge text file.
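For reference, the hash-based O(n+m) procedural approach mentioned above is only a few lines outside APL; a minimal Python sketch (it matches the brute-force J verb further down, e.g. counting 'abc' in 'abdaaa' gives 4 1 0):
from collections import Counter

def freq(c, t):
    # one pass over the text (O(n)), then one lookup per character of c (O(m))
    counts = Counter(t)
    return [counts.get(ch, 0) for ch in c]

print(freq('abc', 'abdaaa'))   # [4, 1, 0]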
NB. Using "key" (/.) adverb w/tally (#) verb counts
#/.~ 'abdaaa'
4 1 1
NB. the items counted are the nub of the string.
~. 'abdaaa'
abd
NB. So, if we count the target along with the string
#/.~ 'abc','abdaaa'
5 2 1 1
NB. We get an extra one for each of the target items.
countKey2=: 4 : '<:(#x){.#/.~ x,y'
NB. This subtracts 1 (<:) from each count of the xs.
6!:2 '''1'' countKey2 10000000$''1234567890'''
0.0451088
6!:2 '''1'' countKey2 1e7$''1234567890'''
0.0441849
6!:2 '''1'' countKey2 1e8$''1234567890'''
0.466857
NB. A tacit version
countKey=. [: <: ([: # [) {. [: #/.~ ,
NB. appears to be a little faster at first
6!:2 '''1'' countKey 1e8$''1234567890'''
0.432938
NB. But repeating the timing 10 times shows they are the same.
(10) 6!:2 '''1'' countKey 1e8$''1234567890'''
0.43914
(10) 6!:2 '''1'' countKey2 1e8$''1234567890'''
0.43964
Dyalog v14 introduced the key operator (⌸):
{⍺,⍴⍵}⌸'abcracadabra'
a 5
b 2
c 2
r 2
d 1
The operand function takes a letter as ⍺ and the occurrences of that letter (vector of indices) as ⍵.
I think this example, written in J, fits your request. The character list is longer than the text (but both are kept short for convenience during development.) I have not examined timing but my intuition is that it will be fast. The tallying is done only with reference to characters that actually occur in the text, and the long character set is looked across only to correlate characters that occur in the text.
c=: 80{.43}.a.
t=: 'some {text} to examine'
RawIndicies=: c i. ~.t
Mask=: RawIndicies ~: #c
Indicies=: Mask # RawIndicies
Tallies=: Mask # #/.~ t
Result=: Tallies Indicies} (#c)$0
4 20 $ Result
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 4 0
0 0 1 0 0 0 2 1 2 0 0 0 1 3 0 0 0 2 0 0
4 20 $ c
+,-./0123456789:;<=>
?#ABCDEFGHIJKLMNOPQR
STUVWXYZ[\]^_`abcdef
ghijklmnopqrstuvwxyz
As noted in other answers, the key operator does this directly. However the classic APL way of solving this problem is still worth knowing.
The classic solution is "sort, shift, and compare":
c←'missippi'
t←'abcdefghijklmnopqrstuvwxyz'
g←⍋c
g
1 4 7 0 5 6 2 3
s←c[g]
s
iiimppss
b←s≠¯1⌽s
b
1 0 0 1 1 0 1 0
n←b/⍳⍴b
n
0 3 4 6
k←(1↓n,⍴b)-n
k
3 1 2 2
u←b/s
u
imps
And for the final answer:
z←(⍴t)⍴0
z
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
z[t⍳u]←k
z
0 0 0 0 0 0 0 0 3 0 0 0 1 0 0 2 0 0 2 0 0 0 0 0 0 0
This code is off the top of my head, not ready for production. Have to look for empty cases - the boolean shift is probably not right for all cases....
"Brute force" in J:
count =: (i.~ ~.) ({ ,&0) ([: +/"1 [: = ])
Usage:
'abc' count 'abdaaa'
4 1 0
Not sure how it's implemented internally, but here are the timings for different input sizes:
6!:2 '''abcdefg'' count 100000$''abdaaaerbfqeiurbouebjkvwek''' NB. run time for #t = 100000
0.00803909
6!:2 '''abcdefg'' count 1000000$''abdaaaerbfqeiurbouebjkvwek'''
0.0845451
6!:2 '''abcdefg'' count 10000000$''abdaaaerbfqeiurbouebjkvwek''' NB. and for #t = 10^7
0.862423
We don't filter the input data prior to 'self-classify', so:
6!:2 '''1'' count 10000000$''1'''
0.244975
6!:2 '''1'' count 10000000$''1234567890'''
0.673034
6!:2 '''1234567890'' count 10000000$''1234567890'''
0.673864
My implementation in APL (NARS2000):
(∪w),[0.5]∪⍦w←t∩c
Example:
c←'abcdefg'
t←'abdaaaerbfqeiurbouebjkvwek'
(∪w),[0.5]∪⍦w←t∩c
a b d e f
4 4 1 4 1
Note: showing only those characters in c that exist in t
My initial thought was that this was a case for the Find operator:
T←'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
C←'MISSISSIPPI'
X←+/¨T⍷¨⊂C
The used characters are:
(×X)/T
IMPS
Their respective frequencies are:
X~0
4 1 2 4
I've only run toy cases, so I have no idea what the performance is, but my intuition tells me it should be cheaper than the outer product.
Any thoughts?
