Chain verbs in J

Suppose a boxed matrix containing various types:
matrix =: ('abc';'defgh';23),:('foo';'bar';45)
matrix
+---+-----+--+
|abc|defgh|23|
+---+-----+--+
|foo|bar  |45|
+---+-----+--+
And a column descriptor:
columnTypes =: 'string';'string';'num'
I want to apply verbs on this matrix by column according to types. I'll be using verbs DoString and DoNum:
chain =: (('string';'num') i. columnTypes) { DoString`DoNum
EDIT: The column descriptors are important: the decision about which verb to use is based on them, not on the underlying type itself. In reality, I could have several types of strings, numerics, and even dates (which would be numeric in J).
How do I apply chain to each row of matrix? The verbs themselves can take care of whether the passed value is boxed or not; that's fine. Also, I'd rather avoid transposing the matrix (|:), as it could be quite large.

The standard method for doing this is:
1) Convert your row (cell)-oriented structure to a column-oriented structure
2) Apply the correct verb to each column (just once)
Step (1) is easy. Step (2) is also easy, but not as obvious. There's a little trick that helps.
The trick is knowing that a number of primitive operators accept a gerund as a left argument and produce a function which cycles over the gerund, applying each verb in turn. IMO, the most useful operator in this category is ;. . Here's an example implementation using it:
Step (0), inputs:
matrix =: ('abc';'defgh';23),:('foo';'bar';45)
columnTypes =: 'string';'string';'num'
DoString =: toupper
DoNum =: 0&j.
matrix
+---+-----+--+
|abc|defgh|23|
+---+-----+--+
|foo|bar  |45|
+---+-----+--+
Step (1), columify data:
columnify =: <@:>"1@:|: :. rowify =: <"_1&>
columnify matrix
+---+-----+-----+
|abc|defgh|23 45|
|foo|bar  |     |
+---+-----+-----+
Note that columnify is provided with an inverse which will re-"rowify" the data, though you shouldn't do that: see below.
Step (2), apply the correct verb to each column (exactly once), using the verb-cycling feature of ;.:
homogenize =: ({. foo&.>@:{.`'') [^:('foo'-:])L:0~ ]
chain =: DoString`DoNum`] homogenize@{~ ('string';'num')&i.
Note that the default transformation for unknown column-types is the identity function, ].
The verb homogenize normalizes the input and output of each column-processor (i.e. it abstracts out the pre- and post-processing so that the user only has to provide the dynamic "core" of the transformation). The verb chain takes a list of column-types as input and derives a gerund appropriate for use as a left-hand argument to ;. (or a similar operator).
Thus:
1 (chain columnTypes);.1 columnify matrix
+---+-----+---------+
|ABC|DEFGH|0j23 0j45|
|FOO|BAR  |         |
+---+-----+---------+
Or, if you really must have an NxM table of boxed cells, apply the cut "under" columnify:
1 (chain columnTypes);.1&.columnify matrix
+-----+-----+
|ABC  |FOO  |
+-----+-----+
|DEFGH|BAR  |
+-----+-----+
|0j23 |0j45 |
+-----+-----+
But note it is much more appropriate, in a J context, to keep the table as a list of homogeneous columns, for both performance and notational reasons.
J works best when processing arrays "in toto"; the rule of thumb is that you should let each primitive or user-defined verb see as much data as possible at each application. That's the major benefit of this "columnification" approach: if you store your data as a list of homogeneous columns, it will be faster and easier to manipulate later.
However, if your use-case really demands that you keep the data as an NxM table of boxed cells, then converting your data to and from column-normal form is an expensive no-op. In that case, you should stick with your original solution,
1 chain\"1 matrix
which (because you asked) actually works on the same premise as the ;. approach. In particular, \ is another of those primitive operators which accept a gerund argument and apply each verb in succession (i.e. to each new window of data, cyclically).
In effect, what 1 chain\"1 matrix does is break the matrix into rows ("1), and for each row, it creates a 1-wide moving window, (1 f\ matrix), applying the verbs of chain to each of those 1-wide windows cyclically (i.e. f changes with every 1-wide data window of each row of the matrix).
Since the moving 1-window of a row (a rank-1 vector) is the atoms of the row, in order, and the verbs of chain are given in the same order, in effect you're applying those verbs to the columns of the matrix, one. atom. at. a. time.
In short: 1 chain\"1 matrix is analogous to foo"0 matrix, except foo changes for each atom. And it should be avoided for the same reason foo"0 matrix should be avoided in general: because applying functions at small rank works against the grain of J, incurring a performance penalty.
In general, it's better to apply functions at higher ranks whenever you can, which in this case calls for converting (and maintaining) matrix in column-normal form.
In other words, here, ;. is to "1 as \ is to "0. If you find the whole columnify/homogenize thing too lengthy or bulky (compared to 1 chain\"1 matrix), you can import the script provided at [1], which packages up those definitions as re-usable utilities, with extensions. See the page for examples and instructions.
[1] Related utility script:
http://www.jsoftware.com/jwiki/DanBron/Snippets/DOOG

If these calculations depend only on the data inside individual boxes (and, perhaps, global values), it is possible to use Agenda (@.) with Under Open (aka Each). An application of this technique is shown below:
doCells =: (doNum`doString @. isLiteral)&.>
isLiteral=: 2 -: 3!:0
doNum =: +: NB. Double
doString =: toupper
doCells matrix
┌───┬─────┬──┐
│ABC│DEFGH│46│
├───┼─────┼──┤
│FOO│BAR  │90│
└───┴─────┴──┘
(In this example I've put in arbitrary meanings for doNum and doString to help make the viability plain.)
The version of isLiteral used here may well suffice, but it will fail if sparse literal or unicode values are involved.
If the calculations need to involve more of the matrix than a single box, this won't be the answer to your question. If calculation needs to occur by line, instead, the solution may involve applying a verb at rank _1 (i.e. to each item of the highest axis.)

Through experimentation, I get this to do what I want:
1 chain\"1 matrix
Now to understand it...

Related

Applying an adverb to a list of gerunds

Consider a list of gerunds and some data we wish to apply them to, cyclically:
ms=.*`+`- NB. list of gerunds
d =.3 4 5 6 NB. some data
We can do:
ms/ d NB. returns 9, ie, the result of 3 * 4 + 5 - 6
Now we pose the question: how does the result change if we change the order in which we apply the verbs? That is, we consider all 6 possible orders:
allms=. (A.~i.@!@#) ms
which looks like:
┌─┬─┬─┐
│*│+│-│
├─┼─┼─┤
│*│-│+│
├─┼─┼─┤
│+│*│-│
├─┼─┼─┤
│+│-│*│
├─┼─┼─┤
│-│*│+│
├─┼─┼─┤
│-│+│*│
└─┴─┴─┘
To answer the question, we can do:
allms (4 : 'x/ y')"1 d
NB. returns 9 _21 _1 _23 _41 _31
But notice that I was forced to use an anonymous, non-tacit verb to accomplish this, because in order to apply the adverb /, I had to have a named verb. What I really wanted to do is treat / like a rank-1 verb and "map" it over my list allms, something in spirit like the illegal formulation:
/&d"1 allms NB. This is invalid J
That is, for each gerund on the list, transform it with the adverb / and apply it to the data d.
J seems to resist this higher-order "treating verbs like data" thinking. So I want to know what the natural J way of approaching this problem would be.
To be specific, you are given the list of gerunds ms and data d, as defined above. The task is to create a verb that returns a list of the results ms/ d, for every possible ordering of ms (ie, a verb that returns 9 _21 _1 _23 _41 _31 in our example). The verb must be a tacit verb.
You don't really want that
There are fundamental syntactic reasons why you can't tacitly slice and dice arguments to operators (adverbs and conjunctions).
Without going into detail, allowing operators to be modified by other operators, like your proposed / modified with "1, would require a fundamental restructuring of J's grammar. And there would be major tradeoffs, in particular to simplicity and expressiveness (i.e. notational elegance)¹,².
So, if you want to distribute operators over gerunds like this, you'll have to write utilities for it, and the most straightforward way by far is by using explicit code. One pre-packaged utility to consider in this domain is the doog script, available in the J Wiki and SVN repo.
Here it is anyway
However, the doog script, like your approach, is fundamentally explicit³.
So if you really want to achieve these ends tacitly:
D =. foo`bar`baz
t =. D / (@:]) NB. Here's our "over" (/)
over =. [^:(D -: ]) L: (L.D) & (5!:1<,'t')
allOver =: (]^:[~ 1:`'' , over f.)~
3 4 5 6 allOver"1~ (A.~i.@!@#) *`+`- NB. Note the "1
9 _21 _1 _23 _41 _31
I warned you
Without getting into too much detail, the trick here is using the verb ]^:[ to allow ^: to execute an arbitrary atomic representation as input.
That is, some_atomic_rep f^:[ data turns into f^:some_atomic_rep data, which, for a suitable atomic rep, can execute anything at all, while using all the argument-processing goodness available to verbs (in particular, rank).
The rest is just an elegant (read: lazy) way to turn your gerundial inputs (whichever parts you make available to the verb with rank or other argument-selection mechanisms) into an atomic rep suitable for a right-hand argument to ^:.
The meat of it is we have the template D / (@:]) and we replace D with the gerund of your choice (the @:] is necessary because by the time the gerund gets executed, it'll have two inputs: your actual input, d, as well as itself, D)⁴.
Lasciate ogne speranza
To visit the Ultima Thule of these wicked follies, check out the discovery of dont in J, which is just like do (".), except ... really, you shouldn't.
¹ As a quick example: puzzle out what this would mean for precedence between wordclasses.
² Having said that, Jose "Pepe" Quintana, the leader of the underground J club F^4 (The Fully Fixable Functional Faction), once found a backdoor that actually did allow operators to take other operators as inputs. See this message in the "J Myths Puzzles" thread from 2008 (scroll past all the spoiler-hiding blank lines). Of course, once he mentioned it, Roger took notice, and immediately plugged the gap.
³ The way I once put it was "Yes, doog is ugly, but I like to think of it as Messiah Code: it's ugly so that other code doesn't have to be. A sponge for sin".
⁴ Take note, the template gerund foo`bar`baz can be anything you like, of any length, using any names. It doesn't matter. What does matter is that the names you use are either proverbs or undefined (which the interpreter treats like proverbs, by design). Using pronouns or pro-operators will break stuff. Alternatively, you could use another kind of noun, like simply __ or something (which I find mnemonic for fill in the ____).
ms=.*`+`- NB. list of gerunds
d =.3 4 5 6 NB. some data
allms=. (A.~i.@!@#) ms
I'd start by "treating my verbs like data" using strings to represent the gerunds
*`+`-
Append the '/' character and then use 128!:2 (Apply), which takes a string describing a verb as its left argument and applies it to the noun that is its right argument. Of course, to do this you need to turn allms into verb strings.
That can be done using:
[ ger=. ,&'/' @ }: @ (1j1 #!.'`' ;)"1 allms
*`+`-/
*`-`+/
+`*`-/
+`-`*/
-`*`+/
-`+`*/
Then using 128!:2 (Apply)
ger 128!:2 d
9 _21 _1 _23 _41 _31
As a one line tacit verb
gvt=. ,&'/'@ }:@(1j1 #!.'`' ;)"1 @: [ 128!: 2 ]
allms gvt d
9 _21 _1 _23 _41 _31
I rarely play these games, so I am not saying this is the best approach, but it works.

build shortest string from a collection of its subsequences

Given a collection of sub-sequences from a string.
For example:
abc
acd
bcd
The problem is: how do we determine the shortest string that contains all of these sequences?
For the above example, the shortest string is abcd.
Here sub-sequence means a part of a string whose characters are not necessarily consecutive; for example, acd is a sub-sequence of the string abcd.
Edit: This problem actually comes from Project Euler problem 79. In that problem, we have 50 subsequences, each of length 3, and all characters are digits.
This problem is well studied and is known as the "shortest common supersequence" problem.
For two strings, it can be solved in O(n), where n is the maximum length of the strings: O(n) to build the suffix array and then O(n) to solve the longest common subsequence problem.
For more than two strings, it is NP-Complete.
Complexity
In the general case, as mentioned above, this problem is NP-hard. There is a way to approach it with suffix structures, but you can also use a directed graph. I cannot say for sure whether one is better than the other in any general sense, but the graph may have some advantages in difficult corner cases.
Graph solution
To see how you can use a graph, you just need to build it properly. Let the letters of the strings be vertices, and let the order of the letters define the edges. That means ab gives vertices a and b and a connection between a and b. Of course, connections are directed (i.e. if a is connected to b, it doesn't mean that b is connected to a, so ab is a --> b).
After these definitions you'll be able to build the whole graph. For your sample it's simple enough. But abcd can also be represented with strings of length two, as ab, ac, ad, bc, bd, cd, so I'll show the graph for that too (just for more clarity).
Then, to find your solution, you need to find a path of maximal length in this graph. Obviously, this is where the NP-hardness is "derived" from. In both of the cases above the maximum length is 4, which is reached by starting from the a vertex and traversing the graph until the path a->b->c->d is found.
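To make the basic idea concrete, here is a minimal Python sketch (mine, not part of the original answer) that builds the precedence graph from adjacent letters and finds the longest path by brute-force DFS. It assumes the simple case above: distinct letters and no cycles.

from collections import defaultdict

def longest_path(subseqs):
    # Build a directed edge for each pair of adjacent letters in each
    # subsequence, then search for the longest simple path by DFS.
    graph = defaultdict(set)
    nodes = set()
    for s in subseqs:
        nodes.update(s)
        for a, b in zip(s, s[1:]):
            graph[a].add(b)
    best = []
    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in graph[node]:
            if nxt not in path:      # simple paths only; also guards against cycles
                dfs(nxt, path + [nxt])
    for start in nodes:
        dfs(start, [start])
    return ''.join(best)

print(longest_path(['abc', 'acd', 'bcd']))   # -> abcd

The brute-force search is exponential in the worst case, which is consistent with the NP-hardness noted above.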
Corner cases
Non-unique solutions: in fact, you may face string sets which don't strictly define a single supersequence. Example: ab, ad, bd, ac, cd. Both abcd and acbd fit those substrings. But actually, this is not too bad a problem. Look at the graph:
(I've chosen that example deliberately: it's like the second graph, but without one connection, which is why the result is ambiguous.) The maximum path length is now 3, but it can be reached by two paths: abd and acd. So how do we restore at least one solution? Simple: since the result strings will have the same length (that follows from the way we found them), we can just walk from the start of the first string and check the symbols in the second string. If they match, then that symbol's position is fixed. But if they don't, we're free to choose either order for the current symbols. So that gives:
[a] matches [a], so the current result is a
[b] mismatches [c], so we can place either bc or cb. Let it be the first. Then the result is abc
[d] matches [d], so the current result is abc+d, i.e. abcd
This is a kind of "merge" in which we are free to choose either result. However, this is a somewhat twisted case, because now we can't use just any path we found: we have to find all paths of maximum length (otherwise we won't be able to restore the full supersequence).
Next, non-existent solutions: there may be cases where the order of symbols in the strings cannot be used to reproduce a supersequence. Obviously, that means one ordering violates another, i.e. there are two strings in which some two symbols appear in different orders. Again, the simplest case: abc, cbd. On the graph:
The immediate consequence is a loop in the graph. It may not be as simple as in the graph above, but there will always be one if an order is broken. Thus, all you need to do to detect this is to find a graph cycle. In fact, this isn't something you must do separately: you just add cycle detection to the longest-path search.
Third, repeated symbols: this is the trickiest case. If a string contains repeated symbols, then graph creation is more complicated, but it can still be done. For example, suppose we have three conditions: aac, abc, bbc. The solution for this is aabbc. How do we build the graph? Now we can't just add links (because of loops, at the very least). I suggest the following way:
Let's process all our strings, assigning indexes to symbols. The index is reset per string. If a symbol appears only once, its index will be 1. For our strings that means a1a2c1, a1b1c1 and b1b2c1. After that, store the maximum index found for each symbol. For the sample above that will be 2 for a, 2 for b and 1 for c.
If two connected indexed symbols have the same original symbol, then the connection is made "as is". For example, a1a2 will produce only one connection, from a1 to a2.
If two connected indexed symbols have different original symbols, then any indexed copy of the first symbol may be connected to any indexed copy of the second symbol. For example, b1c1 will result in a (b1 to c1) connection and a (b2 to c1) connection. How do we know how many indexed copies there may be? We did that in the first step (i.e. we found the maximum indexes).
Thus, graph for sample above will look like:
          +------------+
          |            |
          |            |
 +------>[a2]------+   |
 |        |        |   |
 |        V        V   |
[a1]---->[c1]<----[b2] |
 |        ^        ^   |
 |        |        |   |
 +------>[b1]------+   |
          ^            |
          |            |
          +------------+
So we'll have a maximum path length of 5, for the paths a1a2b2b1c1 and a1a2b1b2c1. We can choose either of them and ignore the indexes in the resulting string, aabbc.
Maybe there is no general efficient algorithm for this kind of problem, but here is a solution for this particular problem (Project Euler 79). You want to observe the input file. You will find that 7 only appears at the beginning of sequences, so 7 should be put at the beginning. Then you will find that the digit 3 is the only remaining character that appears at the beginning, so you put 3 in the second position. So on and so forth, you can get 73162890. The special case is the last two digits: both 0 and 9 appear at the beginning, so you have two choices; you try both and find that 90 gives the optimal solution.
More generally, you can use a depth-first-search algorithm. The trick is that when you find a character which only appears at the beginning, just choose it; it will lead you to the optimal solution.
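As a rough Python sketch of that greedy rule (my own illustration, not the answerer's code): repeatedly emit a character that starts some remaining sequence but never appears after another character, then delete it from every sequence. It assumes the orderings are consistent (no cycles in the precedence graph), which holds for the Project Euler 79 input.

def shortest_keylog(sequences):
    # Greedy / topological-sort sketch: a character with no remaining
    # predecessors is safe to output next.
    seqs = [list(s) for s in sequences]
    result = []
    while any(seqs):
        heads = {s[0] for s in seqs if s}
        tails = {c for s in seqs for c in s[1:]}
        candidates = heads - tails
        if not candidates:
            raise ValueError('inconsistent orderings (cycle in the graph)')
        c = min(candidates)          # any candidate works; pick one deterministically
        result.append(c)
        seqs = [[x for x in s if x != c] for s in seqs]
    return ''.join(result)

print(shortest_keylog(['abc', 'acd', 'bcd']))   # -> abcd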
I think we can start with graphs.
I am not sure of correctness, but say we build a graph with a->b if b comes after a, with all paths of length 1.
Now, we have to find the longest distance (DFS can help).
I will try to provide an example.
Say the strings are:
abc
acd
bcd
bce
We form a graph from these.
Now the main thing left will be to combine nodes e and d, because the required string can be abcde or abced.
This is what I am not sure how to do, so maybe somebody can help!
Sorry for posting this as an answer; comments can't include pictures.
Construct the graph out of the subsequences and then find a topological sort of the graph in O(E) using DFS to get the desired string of shortest length which has every subsequence in it. But as you will have noticed, the topo-sort is invalid if the graph has cycles; in that case the characters in the cycles need to be repeated, which is difficult to solve.
Conclusion: you get lucky if there are no cycles in the graph and solve it in O(E); otherwise you get unlucky and end up doing brute force.

Symbolic representation of patterns in strings, and finding "similar" sub-patterns

A string "abab" could be thought of as a pattern of indexed symbols "0101". And a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart out of perfect cases.
"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the bcbc part), I can only think of doing some sort of normalization process at each index to "re-represent" the substring from n to length symbolically for comparison. And that gets complicated if I'm trying to see if "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!
How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.
Try replacing each character with min(K, distance back to the previous occurrence of that character), where K is a tunable constant, so that babcbc and dababd become something like KK2K22 and KKK225. You could use a suffix tree or suffix array to find repeats in the transformed text.
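To make that concrete, here is a small Python sketch of the transformation (my own illustration, not the answerer's; K = 9 is an arbitrary choice of the cap):

def transform(s, K=9):
    # Replace each character with min(K, distance back to its previous
    # occurrence); K also stands in for "no previous occurrence".
    last, out = {}, []
    for i, ch in enumerate(s):
        out.append(str(min(K, i - last[ch])) if ch in last else str(K))
        last[ch] = i
    return ''.join(out)

print(transform('babcbc'))   # -> 992922  (the KK2K22 above)
print(transform('dababd'))   # -> 999225  (the KKK225 above)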
Your algorithm loses information by compressing the string's original data set, so I'm not sure you can recover the full information set without doing far more work than comparing the original strings. Also, while your representation appears easier for humans to read, it currently takes up as much space as the original string, and a difference map of the string (where the values are the distance between the prior character and the current character) may have a more comparable information set.
However, as to how you can detect all common subsets, you should look at Longest Common Subsequence algorithms to find the largest matching pattern. It is a well-defined algorithm and is efficient -- O(n * m), where n and m are the lengths of the strings. See LCS on SO and Wikipedia. If you also want to see patterns which wrap around a string (as a circular string -- where abeab and eabab should match) then you'll need a circular LCS, which is described in a paper by Andy Nguyen.
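For reference, here is a minimal Python sketch of the standard O(n * m) LCS table mentioned above (not part of the original answer; the circular variant and the extended version described next are not shown):

def lcs_length(a, b):
    # Classic dynamic program: dp[i][j] is the LCS length of a[:i] and b[:j].
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_length('babcbc', 'dababd'))   # the two example strings from the question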
You'll need to change the algorithm slightly to account for the number of variations so far. My advice would be to add two additional dimensions to the LCS table, representing the number of unique numbers encountered in the past k characters of both original strings, along with your compressed version of each string. Then you could do an LCS solve where you are always moving in the direction which matches on your compressed strings AND matches the same number of unique characters in both strings for the past k characters. This should encode all possible unique substring matches.
The tricky part will be always choosing the direction which maximizes the k which contains the same number of unique characters. Thus at each element of the LCS table you'll have an additional string search for the best step of the k value. Since a longer sequence always contains all possible smaller sequences, if you maximize your k choice at each step you know that the best k on the next iteration is at most 1 step away, so once the 4D table is filled out it should be solvable in a similar fashion to the original LCS table. Note that because you have a 4D table the logic does get more complicated, but if you read up on how LCS works you'll be able to see how you can define consistent rules for moving towards the upper left corner at each step. Thus the LCS algorithm stays the same, just scaled to more dimensions.
This solution is quite complicated once it's complete, so you may want to rethink what you're trying to achieve/if this pattern encodes the information you actually want before you start writing such an algorithm.
Here goes a solution that uses Prolog's unification capabilities and attributed variables to match templates:
:- dynamic pattern_i/3.

test :-
    retractall(pattern_i(_,_,_)),
    add_pattern(abab),
    add_pattern(bcbc),
    add_pattern(babcbc),
    add_pattern(dababd),
    show_similarities.

show_similarities :-
    call(pattern_i(Word, Pattern, Maps)),
    match_pattern(Word, Pattern, Maps),
    fail.
show_similarities.

match_pattern(Word, Pattern, Maps) :-
    all_dif(Maps),                          % all variables should be unique
    call(pattern_i(MWord, MPattern, MMaps)),
    Word \= MWord,
    all_dif(MMaps),
    append([_, Pattern, _], MPattern),      % Matches patterns
    writeln(words(Word, MWord)),
    write('mapping: '),
    match_pattern1(Maps, MMaps).            % Prints mappings

match_pattern1([], _) :-
    nl, nl.
match_pattern1([Char-Char|Maps], MMaps) :-
    select(MChar-Char, MMaps, NMMaps),
    write(Char), write('='), write(MChar), write(' '),
    !,
    match_pattern1(Maps, NMMaps).

add_pattern(Word) :-
    word_to_pattern(Word, Pattern, Maps),
    assertz(pattern_i(Word, Pattern, Maps)).

word_to_pattern(Word, Pattern, Maps) :-
    atom_chars(Word, Chars),
    chars_to_pattern(Chars, [], Pattern, Maps).

chars_to_pattern([], Maps, [], RMaps) :-
    reverse(Maps, RMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    member(Char-PChar, Maps),
    !,
    chars_to_pattern(Tail, Maps, Pattern, NMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    chars_to_pattern(Tail, [Char-PChar|Maps], Pattern, NMaps).

all_dif([]).
all_dif([_-Var|Maps]) :-
    all_dif(Var, Maps),
    all_dif(Maps).

all_dif(_, []).
all_dif(Var, [_-MVar|Maps]) :-
    dif(Var, MVar),
    all_dif(Var, Maps).
The idea of the algorithm is:
For each word, generate a list of unbound variables, using the same variable for the same char in the word. E.g., for the word abcbc the list would look something like [X,Y,Z,Y,Z]. This defines the template for this word
Once we have the list of templates we take each one and try to unify the template with a subtemplate of every other word. So for example if we have the words abcbc and zxzx, the templates would be [X,Y,Z,Y,Z] and [H,G,H,G]. Then there is a subtemplate on the first template which unifies with the template of the second word (H=Y, G=Z)
For each template match we show the substitutions needed (variable renamings) to yield that match. So in our example the substitutions would be z=b, x=c
Output for test (words abab, bcbc, babcbc, dababd):
?- test.
words(abab,bcbc)
mapping: a=b b=c
words(abab,babcbc)
mapping: a=b b=c
words(abab,dababd)
mapping: a=a b=b
words(bcbc,abab)
mapping: b=a c=b
words(bcbc,babcbc)
mapping: b=b c=c
words(bcbc,dababd)
mapping: b=a c=b

Looking for an efficient array-like structure that supports "replace-one-member" and "append"

As an exercise I wrote an implementation of the longest increasing subsequence algorithm, initially in Python but I would like to translate this to Haskell. In a nutshell, the algorithm involves a fold over a list of integers, where the result of each iteration is an array of integers that is the result of either changing one element of or appending one element to the previous result.
Of course in Python you can just change one element of the array. In Haskell, you could rebuild the array while replacing one element at each iteration - but that seems wasteful (copying most of the array at each iteration).
In summary what I'm looking for is an efficient Haskell data structure that is an ordered collection of 'n' objects and supports the operations: lookup i, replace i foo, and append foo (where i is in [0..n-1]). Suggestions?
Perhaps the standard Seq type from Data.Sequence. It's not quite O(1), but it's pretty good:
index (your lookup) and adjust (your replace) are O(log(min(index, length - index)))
(><) (your append) is O(log(min(length1, length2)))
It's based on a tree structure (specifically, a 2-3 finger tree), so it should have good sharing properties (meaning that it won't copy the entire sequence for incremental modifications, and will perform them faster too). Note that Seqs are strict, unlike lists.
I would try to just use mutable arrays in this case, preferably in the ST monad.
The main advantages would be making the translation more straightforward and making things simple and efficient.
The disadvantage, of course, is losing on purity and composability. However I think this should not be such a big deal since I don't think there are many cases where you would like to keep intermediate algorithm states around.

Best strategies for reading J code

I've been using J for a few months now, and I find that reading unfamiliar code (e.g. that I didn't write myself) is one of the most challenging aspects of the language, particularly when it's in tacit. After a while, I came up with this strategy:
1) Copy the code segment into a word document
2) Take each operator from (1) and place it on a separate line, so that it reads vertically
3) Replace each operator with its verbal description in the Vocabulary page
4) Do a rough translation from J syntax into English grammar
5) Use the translation to identify conceptually related components and separate them with line breaks
6) Write a description of what each component from (5) is supposed to do, in plain English prose
7) Write a description of what the whole program is supposed to do, based on (6)
8) Write an explanation of why the code from (1) can be said to represent the design concept from (7).
Although I learn a lot from this process, I find it to be rather arduous and time-consuming -- especially if someone designed their program using a concept I never encountered before. So I wonder: do other people in the J community have favorite ways to figure out obscure code? If so, what are the advantages and disadvantages of these methods?
EDIT:
An example of the sort of code I would need to break down is the following:
binconv =: +/@ ((|.@(2^i.@#@])) * ]) @ ((3&#.)^:_1)
I wrote this one myself, so I happen to know that it takes a numerical input, reinterprets it as a ternary array and interprets the result as the representation of a number in base-2 with at most one duplication. (e.g., binconv 5 = (3^1)+2*(3^0) -> 1 2 -> (2^1)+2*(2^0) = 4.) But if I had stumbled upon it without any prior history or documentation, figuring out that this is what it does would be a nontrivial exercise.
Just wanted to add to Jordan's answer: if you don't have box display turned on, you can format things this way explicitly with 5!:2
f =. <.@-:@#{/:~
5!:2 < 'f'
┌───────────────┬─┬──────┐
│┌─────────┬─┬─┐│{│┌──┬─┐│
││┌──┬─┬──┐│@│#││ ││/:│~││
│││<.│@│-:││ │ ││ │└──┴─┘│
││└──┴─┴──┘│ │ ││ │      │
│└─────────┴─┴─┘│ │      │
└───────────────┴─┴──────┘
There's also a tree display:
5!:4 <'f'
              ┌─ <.
        ┌─ @ ─┴─ -:
  ┌─ @ ─┴─ #
──┼─ {
  └─ ~ ─── /:
See the vocabulary page for 5!: Representation and also 9!: Global Parameters for changing the default.
Also, for what it's worth, my own approach to reading J has been to retype the expression by hand, building it up from right to left, and looking up the pieces as I go, and using identity functions to form temporary trains when I need to.
So for example:
/:~ i.5
0 1 2 3 4
NB. That didn't tell me anything
/:~ 'hello'
ehllo
NB. Okay, so it sorts. Let's try it as a train:
[ { /:~ 'hello'
┌─────┐
│ehllo│
└─────┘
NB. Whoops. I meant a train:
([ { /:~) 'hello'
|domain error
| ([{/:~)'hello'
NB. Not helpful, but the dictionary says
NB. "{" ("From") wants a number on the left.
(0: { /:~) 'hello'
e
(1: { /:~) 'hello'
h
NB. Okay, it's selecting an item from the sorted list.
NB. So f is taking the ( <. @ -: @ # )th item, whatever that means...
<. -: # 'hello'
2
NB. ??!?....No idea. Let's look up the words in the dictionary.
NB. Okay, so it's the floor (<.) of half (-:) the length (#)
NB. So the whole phrase selects an item halfway through the list.
NB. Let's test to make sure.
f 'radar' NB. should return 'd'
d
NB. Yay!
addendum:
NB. just to be clear:
f 'drara' NB. should also return 'd' because it sorts first
d
Try breaking the verb up into its components first, and then see what they do. And rather than always referring to the vocab, you could simply try out a component on data to see what it does, and see if you can figure it out. To see the structure of the verb, it helps to know what parts of speech you're looking at, and how to identify basic constructions like forks (and of course, in larger tacit constructions, separate by parentheses). Simply typing the verb into the ijx window and pressing enter will break down the structure too, and probably help.
Consider the following simple example: <.@-:@#{/:~
I know that <. -: # { and /: are all verbs, ~ is an adverb, and @ is a conjunction (see the parts of speech link in the vocab). Therefore I can see that this is a fork structure with left verb <.@-:@# , right verb /:~ , and dyad { . This takes some practice to see, but there is an easier way: let J show you the structure by typing it into the ijx window and pressing enter:
<.@-:@#{/:~
+---------------+-+------+
|+---------+-+-+|{|+--+-+|
||+--+-+--+|@|#|| ||/:|~||
|||<.|@|-:|| | || |+--+-+|
||+--+-+--+| | || |      |
|+---------+-+-+| |      |
+---------------+-+------+
Here you can see the structure of the verb (or, you will be able to after you get used to looking at these). Then, if you can't identify the pieces, play with them to see what they do.
10?20
15 10 18 7 17 12 19 16 4 2
/:~ 10?20
1 4 6 7 8 10 11 15 17 19
<.@-:@# 10?20
5
You can break them down further and experiment as needed to figure them out (this little example is a median verb).
J packs a lot of code into a few characters and big tacit verbs can look very intimidating, even to experienced users. Experimenting will be quicker than your documenting method, and you can really learn a lot about J by trying to break down large complex verbs. I think I'd recommend focusing on trying to see the grammatical structure and then figure out the pieces, building it up step by step (since that's how you'll eventually be writing tacit verbs).
(I'm putting this in the answer section instead of editing the question because the question looks long enough as it is.)
I just found an excellent paper on the jsoftware website that works well in combination with Jordan's answer and the method I described in the question. The author makes some pertinent observations:
1) A verb modified by an adverb is a verb.
2) A train of more than three consecutive verbs is a series of forks, which may have a single verb or a hook at the far left-hand side depending on how many verbs there are.
This speeds up the process of translating a tacit expression into English, since it lets you group verbs and adverbs into conceptual units and then use the nested fork structure to quickly determine whether an instance of an operator is monadic or dyadic. Here's an example of a translation I did using the refined method:
d28=: [:+/\{.@],>:@[#(}.-}:)@]%>:@[
[: +/\
{.@] ,
>:@[ #
(}.-}:)@] %
>:@[
cap (plus infix prefix)
(head atop right argument) ravel
(increment atop left argument) tally
(behead minus curtail) atop right
argument
divided by
increment atop left argument
the partial sums of the sequence
defined by
the first item of the right argument,
raveled together with
(one plus the left argument) copies
of
(all but the first element) minus
(all but the last element)
of the right argument, divided by
(one plus the left argument).
the partial sums of the sequence
defined by
starting with the same initial point,
and appending consecutive copies of
points derived from the right argument by
subtracting each predecessor from its
successor
and dividing the result by the number
of copies to be made
Interpolating x-many values between
the items of y
I just want to talk about how I read:
<.@-:@#{/:~
First off, I knew that if it was a function, from the command line, it had to be entered (for testing) as
(<.@-:@#{/:~)
Now I looked at the stuff in the parentheses. I saw a /:~, which returns a sorted list of its arguments, { which selects an item from a list, # which returns the number of items in a list, -: half, and <., floor... and I started to think that it might be median - half of the number of items in the list, rounded down - but how did # get its arguments? I looked at the @ signs and realized that there were three verbs there - so this is a fork. The list comes in at the right and is sorted; then at the left, the fork gives the list to # to get the number of items, and then we take the floor of half of that. So now we have the execution sequence:
sort, and pass the output to the middle verb as the right argument.
Take the floor of half of the number of elements in the list, and that becomes the left argument of the middle verb.
Do the middle verb.
That is my approach. I agree that sometimes the phrases have too many odd things, and you need to look them up, but I am always figuring this stuff out at the J instant command line.
Personally, I think of J code in terms of what it does -- if I do not have any example arguments, I rapidly get lost. If I do have examples, it's usually easy for me to see what a sub-expression is doing.
And, when it gets hard, that means I need to look up a word in the dictionary, or possibly study its grammar.
Reading through the prescriptions here, I get the idea that this is not too different from how other people work with the language.
Maybe we should call this 'Test Driven Comprehension'?

Resources