Combining pairs in a string (MATLAB)

I have a string:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'
How can I combine pairs where the last character of one pair is the first character of the following pair into strings? The new strings must contain all of the characters 'A','B','C','D','E','F','G' that appear in the sup_pairs string.
The expected output should be:
S1 = 'BAEFCGD' % because BA will be followed by AE in the sup_pairs string, so we combine them into BAE, and so on... we continue the rule to generate S1
S2 = 'DFCEABG'
If I have AB, BC and BD, the generated strings should be both ABC and ABD.
If there is any repeated character in the pairs, like AB BC CA CE, we will skip the second A and get ABCE.

This, like all good things in life, is a graph problem. Each letter is a node, and each pair is an edge.
First we must transform your string of pairs into a numeric format so we can use the letters as subscripts. I will use A=2, B=3, ..., G=8:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG';
p=strsplit(sup_pairs,' ');
m=cell2mat(p(:));
m=m-'?';
A=sparse(m(:,1),m(:,2),1);
The sparse matrix A is now the adjacency matrix (actually, more like an adjacency list) representing our pairs. If you look at the full matrix of A, it looks like this:
>> full(A)
ans =
0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
As you can see, the edge BA, which translates to subscript (3,2), is equal to 1.
Now you can use your favorite implementation of Depth-first Search (DFS) to perform a traversal of the graph from your starting node of choice. Each path from the root to a leaf node represents a valid string. You then transform the path back into your letter sequence:
treepath=[3,2,6,7,4,8,5];
S1=char(treepath+'?');
Output:
S1 = BAEFCGD
Here's a recursive implementation of DFS to get you going. Normally in MATLAB you have to worry about not hitting the default limitation on recursion depth, but you're finding Hamiltonian paths here, which is NP-complete. If you ever get anywhere near the recursion limit, the computation time will be so huge that increasing the depth will be the least of your worries.
function full_paths = dft_all(A, current_path)
% A - adjacency matrix of graph
% current_path - initially just the start node (root)
% full_paths - cell array containing all paths from initial root to a leaf
    n = size(A, 1);          % number of nodes in graph
    full_paths = cell(1,0);  % return cell array
    unvisited_mask = ones(1, n);
    unvisited_mask(current_path) = 0; % mask off already visited nodes (path)
    % multiply mask by array of nodes accessible from last node in path
    unvisited_nodes = find(A(current_path(end), :) .* unvisited_mask);
    % add restriction on length of paths to keep (numel == n)
    if isempty(unvisited_nodes) && (numel(current_path) == n)
        full_paths = {current_path}; % we've found a leaf node
        return;
    end
    % otherwise, still more nodes to search
    for node = unvisited_nodes
        new_path = dft_all(A, [current_path node]); % add new node and search
        if ~isempty(new_path) % if this produces a new path...
            full_paths = {full_paths{1,:}, new_path{1,:}}; % add it to output
        end
    end
end
This is a normal depth-first traversal except for the added condition on the length of the path:
if isempty(unvisited_nodes) && (numel(current_path) == n)
The first half of the if condition, isempty(unvisited_nodes) is standard. If you only use this part of the condition you'll get all paths from the start node to a leaf, regardless of path length. (Hence the cell array output.) The second half, (numel(current_path) == n) enforces the length of the path.
I took a shortcut here: n is the number of nodes in the adjacency matrix, which in the sample case is 8 rather than 7, the number of characters in your alphabet. There are no edges into or out of node 1 because of a trick I hadn't gotten around to telling you about. Rather than run DFS starting from each of the nodes to get all of the paths, you can make a dummy node (in this case node 1) and create an edge from it to all of the other real nodes. Then you just call DFS once on node 1 and you get all the paths. Here's the updated adjacency matrix:
A =
0 1 1 1 1 1 1 1
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
If you don't want to use this trick, you can change the condition to n-1, or change the adjacency matrix not to include node 1. Note that if you do leave node 1 in, you need to remove it from the resulting paths.
Here's the output of the function using the updated matrix:
>> dft_all(A, 1)
ans =
{
[1,1] =
1 2 3 8 5 7 4 6
[1,2] =
1 3 2 6 7 4 8 5
[1,3] =
1 3 8 5 2 6 7 4
[1,4] =
1 3 8 5 7 4 6 2
[1,5] =
1 4 6 2 3 8 5 7
[1,6] =
1 5 7 4 6 2 3 8
[1,7] =
1 6 2 3 8 5 7 4
[1,8] =
1 6 7 4 8 5 2 3
[1,9] =
1 7 4 6 2 3 8 5
[1,10] =
1 8 5 7 4 6 2 3
}
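If it helps to see the whole pipeline in one place, here is a rough Python sketch of the same idea (a dummy root plus a DFS that keeps only full-length paths); the names are mine and it is not a line-for-line translation of the MATLAB above:
from collections import defaultdict

def all_full_paths(adj, path, n_real):
    # DFS from the current path; keep only paths that visit all n_real nodes
    nxt = [v for v in adj[path[-1]] if v not in path]
    if not nxt:
        # +1 accounts for the dummy root at the front of the path
        return [path] if len(path) == n_real + 1 else []
    paths = []
    for v in nxt:
        paths += all_full_paths(adj, path + [v], n_real)
    return paths

pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'.split()
adj = defaultdict(list)
for a, b in pairs:
    adj[a].append(b)
letters = sorted(set(''.join(pairs)))
adj['*'] = letters                 # dummy root with an edge to every real node
for p in all_full_paths(adj, ['*'], len(letters)):
    print(''.join(p[1:]))          # drop the dummy root; prints BAEFCGD etc.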

Related

How to recognize a [1,X,X,X,1] repeating pattern in a pandas series

I have a boolean column in a CSV file, for example:
1 1
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 1
11 0
12 0
13 1
14 0
15 1
You can see here that 1 is repeating every 5 lines.
I want to recognize this repeating pattern [1,0,0,0] as soon as the repetition occurs more than 10 times in Python (I have ~20,000 rows/file).
The pattern can start at any position.
How could I manage this in Python, avoiding if statements?
import numpy as np
import pandas as pd

# Generate 20000 0s and 1s
data = pd.Series(np.random.randint(0, 2, 20000))
# Keep indices of 1s
idx = data[data > 0].index
# Check whether the distance from the current index to the next one is 4,
# e.g. if positions 2 and 6 both hold a 1, then 6 - 2 = 4
found = []
for i, v in enumerate(idx):
    if i == len(idx) - 1:
        break
    next_value = idx[i + 1]
    if (next_value - v) == 4:
        found.append(v)
print(found)
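For what it's worth, the same check can be written without the explicit loop; here is a sketch using np.diff on the positions of the 1s (same data series as above):
import numpy as np
import pandas as pd

data = pd.Series(np.random.randint(0, 2, 20000))
idx = data[data > 0].index.to_numpy()
# a 1 whose next 1 comes exactly 4 rows later starts a [1,0,0,0] block
found = idx[:-1][np.diff(idx) == 4]
print(found)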

Generate data following specified pattern in J

I'm getting my feet wet with J and, to get the ball rolling, decided to write a function that:
gets integer N;
spits out a table that follows this pattern:
(example for N = 4)
1
0 1
0 0 1
0 0 0 1
i.e. in each row the number of zeroes increases from 0 up to N - 1.
However, being a newbie, I'm stuck. My current labored (and incorrect) solution for the N = 4 case looks like:
(4 # ,: 0 1) #~/"1 1 (1 ,.~/ i.4)
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
And the problem with it is twofold:
it's not general enough and looks kinda ugly (parens and " usage);
trailing zeroes - as I understand, all arrays in J are homogeneous, so in my case every row should be boxed.
Like that:
┌───────┐
│1 │
├───────┤
│0 1 │
├───────┤
│0 0 1 │
├───────┤
│0 0 0 1│
└───────┘
Or I should use strings (e.g. '0 0 1') which will be padded with spaces instead of zeroes.
So, what I'm kindly asking here is:
please provide an idiomatic J solution for this task with explanation;
criticize my attempt and point out how it could be finished.
Thanks in advance!
Like so many challenges in J, sometimes it is better to keep your focus on your result and find a different way to get there. In this case, what your initial approach is doing is creating an identity matrix. I would use
=/~@:i. 4
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
You have correctly identified the issue with the trailing 0's and the fact that J will pad out with 0's to avoid ragged arrays. Boxing avoids this padding since each row is self contained.
So create your lists first. I would use overtake to get the extra 0's
4{.1
1 0 0 0
The next line uses 1: to return 1 as a verb and boxes the overtakes from 1 to 4
(>:@:i. <@:{."0 1:) 4
+-+---+-----+-------+
|1|1 0|1 0 0|1 0 0 0|
+-+---+-----+-------+
Since we want each list reversed and then made into a string, we add ":@:|.@: to the process.
(>:@:i. <@:":@:|.@:{."0 1:) 4
+-+---+-----+-------+
|1|0 1|0 0 1|0 0 0 1|
+-+---+-----+-------+
Then we unbox
>@:(>:@:i. <@:":@:|.@:{."0 1:) 4
1
0 1
0 0 1
0 0 0 1
I am not sure this is the way everyone would solve the problem, but it works.
An alternative solution does not use boxing; it relies on dyadic j. (complex) and the fact that
1j4 # 1
1 0 0 0 0
(1 j. 4) # 1
1 0 0 0 0
(1 #~ 1 j. ]) 4
1 0 0 0 0
So, I create a list for each integer in i. 4, then reverse them and make them into strings. Since they are now strings, the extra padding is done with blanks.
(1 ":#:|.#:#~ 1 j. ])"0#:i. 4
1
0 1
0 0 1
0 0 0 1
Taking this step by step as to hopefully explain a little better.
i.4
0 1 2 3
Which is then applied to (1 ":@:|.@:#~ 1 j. ]) an atom at a time, hence the use of "0.
Breaking down what is going on within the parentheses: I first take the rightmost three components, which form a fork.
(1 j. ])"0@:i.4
1 1j1 1j2 1j3
Now, effectively that gives me
1 ":#:|.#:#~ 1 1j1 1j2 1j3
The middle tine of the fork becomes the verb acting on the two noun arguments.The ~ swaps the arguments. so it becomes equivalent to
1 1j1 1j2 1j3 ":#:|.#:# 1
which because of the way #: works is the same as
": |. 1 1j1 1j2 1j3 # 1
I haven't shown the results of these components because using the "0 on the fork changes how the arguments are sent to the middle tine and assembled later. I'm hoping that there is enough here that, with some hand waving, the explanation may suffice.
The jump from tacit to explicit can be a big one, so it may be a better exercise to write the same verb explicitly to see if it makes more sense.
lowerTriangle =: 3 : 0
  rightArg=. i. y
  complexCopy=. 1 j. rightArg
  1 (":@:|.@:#~)"0 complexCopy
)
lowerTriangle 4
1
0 1
0 0 1
0 0 0 1
lowerTriangle 5
1
0 1
0 0 1
0 0 0 1
0 0 0 0 1
See what happens when you 'get the ball rolling'? I guess the thing about J is that the ball goes down a pretty steep slope no matter where you begin. Exciting, eh?

Logical not on a scipy sparse matrix

I have a bag-of-words representation of a corpus stored in an D by W sparse matrix word_freqs. Each row is a document and each column is a word. A given element word_freqs[d,w] represents the number of occurrences of word w in document d.
I'm trying to obtain another D by W matrix not_word_occs where, for each element of word_freqs:
If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
Otherwise, not_word_occs[d,w] should be zero.
Eventually, this matrix will need to be multiplied with other matrices which might be dense or sparse.
I've tried a number of methods, including:
not_word_occs = (word_freqs == 0).astype(int)
This works for toy examples, but results in a MemoryError for my actual data (which is approx. 18,000 x 16,000).
I've also tried np.logical_not():
word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)
This seemed promising, but np.logical_not() does not work on sparse matrices, giving the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Any ideas or guidance would be appreciated.
(By the way, word_freqs is generated by sklearn's preprocessing.CountVectorizer(). If there's a solution that involves converting this to another kind of matrix, I'm certainly open to that.)
The complement of the nonzero positions of a sparse matrix is dense. So if you want to achieve your stated goals with standard numpy arrays, you will require quite a bit of RAM. Here's a quick and totally unscientific hack to give you an idea of how many arrays of that sort your computer can handle:
>>> import numpy as np
>>> a = []
>>> for j in range(100):
...     print(j)
...     a.append(np.ones((16000, 18000), dtype=int))
My laptop chokes at j=1. So unless you have a really good computer, memory will be an issue even though you can get the complement easily enough:
>>> compl = np.ones(S.shape, int)
>>> compl[S.nonzero()] = 0
One way out may be to not explicitly compute the complement; let's call it C = B1 - A, where B1 is the same-shape matrix completely filled with ones and A is your original sparse matrix. For example, the matrix product XC can be written as XB1 - XA, so you have one multiplication with the sparse A and one with B1, which is actually cheap because it boils down to computing row sums. The point here is that you can compute that without computing C first.
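Here is a minimal sketch of that idea with scipy; the shapes, names, and the toy A are purely illustrative:
import numpy as np
from scipy import sparse

# A: a small sparse 0/1 occurrence matrix standing in for binarized word_freqs
A = (sparse.random(5, 4, density=0.3, format="csr") != 0).astype(np.int64)
X = np.arange(15, dtype=np.int64).reshape(3, 5)   # some dense matrix

# X @ C with C = B1 - A, without ever forming C:
# every column of X @ B1 equals the vector of row sums of X
XB1 = np.tile(X.sum(axis=1, keepdims=True), (1, A.shape[1]))
XC = XB1 - np.asarray(X @ A)        # one sparse product, no dense complement

# sanity check against the explicit dense complement
C = np.ones(A.shape, dtype=np.int64) - A.toarray()
assert np.array_equal(XC, X @ C)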
A particularly simple example would be multiplication with a one-hot vector: such a multiplication just selects a column (if multiplying from the right) or row (if multiplying from the left) of the other matrix. This means you just need to find that column or row of the sparse matrix and take its complement (no problem for a single slice), and if you do this for a one-hot matrix, as above, you needn't compute the complement explicitly.
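Concretely, with a one-hot selection you only ever need the complement of a single row or column; reusing A from the sketch above:
d = 2                                       # pick document (row) d
row_d = np.asarray(A[d].todense()).ravel()  # dense copy of one sparse row
not_row_d = 1 - row_d                       # complementing one slice is cheap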
Make a small sparse matrix:
In [743]: freq = sparse.random(10,10,.1)
In [744]: freq
Out[744]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in COOrdinate format>
the repr(freq) shows the shape, elements and format.
In [745]: freq==0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
", try using != instead.", SparseEfficiencyWarning)
Out[745]:
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
with 90 stored elements in Compressed Sparse Row format>
If I do your first action, I get a warning and a new array with 90 (out of 100) nonzero terms. That "not" is no longer sparse.
In general, numpy functions do not work when applied to sparse matrices; to work, they have to delegate the task to sparse methods. But even if logical_not worked, it wouldn't solve the memory issue.
Here is an example of using Pandas.SparseDataFrame:
In [42]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)
In [44]: d1 = pd.SparseDataFrame(X.toarray(), default_fill_value=0, dtype=np.int64)
In [45]: d2 = pd.SparseDataFrame(np.ones((10,10)), default_fill_value=1, dtype=np.int64)
In [46]: d1.memory_usage()
Out[46]:
Index 80
0 16
1 0
2 8
3 16
4 0
5 0
6 16
7 16
8 8
9 0
dtype: int64
In [47]: d2.memory_usage()
Out[47]:
Index 80
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
math:
In [48]: d2 - d1
Out[48]:
0 1 2 3 4 5 6 7 8 9
0 1 1 0 0 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 0 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 0 1 1
4 1 1 1 1 1 1 1 1 1 1
5 0 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1
7 0 1 1 0 1 1 1 0 1 1
8 1 1 1 1 1 1 0 1 1 1
9 1 1 1 1 1 1 1 1 1 1
source sparse matrix:
In [49]: d1
Out[49]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 1 0 0 1 0 0 0 1 0 0
8 0 0 0 0 0 0 1 0 0 0
9 0 0 0 0 0 0 0 0 0 0
memory usage:
In [50]: (d2 - d1).memory_usage()
Out[50]:
Index 80
0 16
1 0
2 8
3 16
4 0
5 0
6 16
7 16
8 8
9 0
dtype: int64
PS if you can't build the whole SparseDataFrame at once (because of memory constraints), you can use an approach similar to one used in this answer

Pattern decoding II [duplicate]

Possible Duplicate:
Pattern decoding
I have a new question relating to the previous post about pattern decoding:
I have almost the same data file, BUT there are double empty (blank) lines, which have to be taken into account in the decoding.
So, the double empty lines mean that there was a street/group (for definitions see the previous post: Pattern decoding) in which there were zero (0) houses, but we have to count these kinds of patterns too. (Yes, you may think this is an absolutely wrong statement, because there is no street without at least one house, but this is just an analogy, so please accept it as it is.)
Here is the new data file, with the double lines:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)
0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)
0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2
0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)
0 0 # <--- Group 5
# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7
# <--- Group 8 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 9
0 0 # <--- Group 10
I need to convert this in an elegant way, as has been done before, but in this case we have to take the new groups into account too and indicate them in the same format, following Kent's example: groupIdx houseIdx numberOfRooms, where houseIdx is set to zero (houseIdx = 0) and numberOfRooms is set to zero too (numberOfRooms = 0). So, I need to get this kind of output, for example:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 1
8 0 0
9 0 1
10 0 1
Can we tune the previous code in this way?
UPDATE: the second empty line indicates a new group. If there is an additional empty line after the first empty line, as in this case:
0 0 # <--- Group 5
# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7
# <--- Group 8 << ---- THIS IS THE NEW GROUP
we just treat the extra empty line (the second of the 2 blank lines) as a new group and indicate it as group_index 0 0. See the desired output above!
Try:
$ cat houses.awk
BEGIN { max = 1; group = 1 }
# blank line: the first one in a row starts a new group
NF == 0 {
    empty++
    if (empty == 1) group++
    next
}
# data line: record rooms for house $1 in the current group
{
    max = ($1 > max) ? $1 : max
    if (empty <= 1) {
        a[group, $1]++
    } else {
        a[group, $1] = -1   # more than one blank preceded this record; mark it
    }
    empty = 0
}
END {
    for (i = 1; i <= group; i++) {
        for (j = 0; j <= max; j++) {
            if (a[i, j] >= 1)
                print i, j, a[i, j]
            if (a[i, j] == -1)
                print i, j, 0   # marked entries print as zero rooms
        }
        printf "\n"
    }
}
Command:
awk -f houses.awk houses
Output:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1

How to count the frequency of an element in APL or J without loops

Assume I have two lists, one is the text t, one is a list of characters c. I want to count how many times each character appears in the text.
This can be done easily with the following APL code.
+⌿t∘.=c
However it is slow: it takes the outer product, then sums each column.
It is an O(nm) algorithm, where n and m are the sizes of t and c.
Of course I can write a procedural program in APL that read t character by character and solve this problem in O(n+m) (assume perfect hashing).
Are there ways to do this faster in APL without loops (or conditionals)? I also accept solutions in J.
Edit:
Practically speaking, I'm doing this where the text is much shorter than the list of characters (the characters are non-ASCII). I'm thinking of a text of length 20 and a character list with length in the thousands.
There is a simple optimization, given that n is smaller than m.
w ← (∪t)∩c
f ← +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r
w contains only the characters in t, therefore the table size only depend on t and not c. This algorithm runs in O(n^2+m log m). Where m log m is the time for doing the intersection operation.
However, a sub-quadratic algorithm is still preferred, just in case someone gives it a huge text file.
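For reference, the O(n+m) procedural version I have in mind is trivial in a scalar language; a Python sketch:
from collections import Counter

def frequencies(t, c):
    # one pass over the text (n), one pass over the alphabet (m): O(n + m)
    counts = Counter(t)
    return [counts.get(ch, 0) for ch in c]

print(frequencies('abdaaa', 'abc'))   # [4, 1, 0]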
NB. Using "key" (/.) adverb w/tally (#) verb counts
#/.~ 'abdaaa'
4 1 1
NB. the items counted are the nub of the string.
~. 'abdaaa'
abd
NB. So, if we count the target along with the string
#/.~ 'abc','abdaaa'
5 2 1 1
NB. We get an extra one for each of the target items.
countKey2=: 4 : '<:(#x){.#/.~ x,y'
NB. This subtracts 1 (<:) from each count of the xs.
6!:2 '''1'' countKey2 10000000$''1234567890'''
0.0451088
6!:2 '''1'' countKey2 1e7$''1234567890'''
0.0441849
6!:2 '''1'' countKey2 1e8$''1234567890'''
0.466857
NB. A tacit version
countKey=. [: <: ([: # [) {. [: #/.~ ,
NB. appears to be a little faster at first
6!:2 '''1'' countKey 1e8$''1234567890'''
0.432938
NB. But repeating the timing 10 times shows they are the same.
(10) 6!:2 '''1'' countKey 1e8$''1234567890'''
0.43914
(10) 6!:2 '''1'' countKey2 1e8$''1234567890'''
0.43964
Dyalog v14 introduced the key operator (⌸):
{⍺,⍴⍵}⌸'abcracadabra'
a 5
b 2
c 2
r 2
d 1
The operand function takes a letter as ⍺ and the occurrences of that letter (vector of indices) as ⍵.
I think this example, written in J, fits your request. The character list is longer than the text (but both are kept short for convenience during development.) I have not examined timing but my intuition is that it will be fast. The tallying is done only with reference to characters that actually occur in the text, and the long character set is looked across only to correlate characters that occur in the text.
c=: 80{.43}.a.
t=: 'some {text} to examine'
RawIndicies=: c i. ~.t
Mask=: RawIndicies ~: #c
Indicies=: Mask # RawIndicies
Tallies=: Mask # #/.~ t
Result=: Tallies Indicies} (#c)$0
4 20 $ Result
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 4 0
0 0 1 0 0 0 2 1 2 0 0 0 1 3 0 0 0 2 0 0
4 20 $ c
+,-./0123456789:;<=>
?#ABCDEFGHIJKLMNOPQR
STUVWXYZ[\]^_`abcdef
ghijklmnopqrstuvwxyz
As noted in other answers, the key operator does this directly. However the classic APL way of solving this problem is still worth knowing.
The classic solution is "sort, shift, and compare":
c←'missippi'
t←'abcdefghijklmnopqrstuvwxyz'
g←⍋c
g
1 4 7 0 5 6 2 3
s←c[g]
s
iiimppss
b←s≠¯1⌽s
b
1 0 0 1 1 0 1 0
n←b/⍳⍴b
n
0 3 4 6
k←(1↓n,⍴b)-n
k
3 1 2 2
u←b/s
u
imps
And for the final answer:
z←(⍴t)⍴0
z
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
z[t⍳u]←k
z
0 0 0 0 0 0 0 0 3 0 0 0 1 0 0 2 0 0 2 0 0 0 0 0 0 0
This code is off the top of my head, not ready for production. Have to look for empty cases - the boolean shift is probably not right for all cases....
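If it helps to follow the APL, the same sort, shift, and compare idea transliterates almost line for line into numpy; a sketch, with the first-element edge case patched:
import numpy as np

c = np.frombuffer(b'missippi', dtype='S1')
t = np.frombuffer(b'abcdefghijklmnopqrstuvwxyz', dtype='S1')

s = np.sort(c)                     # sort: g and s in the APL above
b = s != np.roll(s, 1)             # compare with a cyclic shift
b[0] = True                        # a run always starts at the first element
n = np.flatnonzero(b)              # run start positions
k = np.diff(np.append(n, len(s)))  # run lengths: (1 drop n, rho b) - n
u = s[b]                           # unique characters, in sorted order

z = np.zeros(len(t), dtype=int)
z[np.searchsorted(t, u)] = k       # t is sorted here, standing in for t iota u
print(z)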
"Brute force" in J:
count =: (i.~~.) ({,&0) (]+/"1@:=)
Usage:
'abc' count 'abdaaa'
4 1 0
Not sure how it's implemented internally, but here are the timings for different input sizes:
6!:2 '''abcdefg'' count 100000$''abdaaaerbfqeiurbouebjkvwek''' NB. run time for #t = 100000
0.00803909
6!:2 '''abcdefg'' count 1000000$''abdaaaerbfqeiurbouebjkvwek'''
0.0845451
6!:2 '''abcdefg'' count 10000000$''abdaaaerbfqeiurbouebjkvwek''' NB. and for #t = 10^7
0.862423
We don't filter the input data prior to 'self-classify', so:
6!:2 '''1'' count 10000000$''1'''
0.244975
6!:2 '''1'' count 10000000$''1234567890'''
0.673034
6!:2 '''1234567890'' count 10000000$''1234567890'''
0.673864
My implementation in APL (NARS2000):
(∪w),[0.5]∪⍦w←t∩c
Example:
c←'abcdefg'
t←'abdaaaerbfqeiurbouebjkvwek'
(∪w),[0.5]∪⍦w←t∩c
a b d e f
4 4 1 4 1
Note: showing only those characters in c that exist in t
My initial thought was that this was a case for the Find operator:
T←'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
C←'MISSISSIPPI'
X←+/¨T⍷¨⊂C
The used characters are:
(×X)/T
IMPS
Their respective frequencies are:
X~0
4 1 2 4
I've only run toy cases so I have no idea what the performance is, but my intuition tells me it should be cheaper than the outer product.
Any thoughts?
