loop rolling algorithm - string

I have come up with the term loop rolling myself with the hope that it does
not overlap with an existing term. Basically I'm trying to come up with an
algorithm to find loops in a printed text.
Some examples from simple to complicated
Example1
Given:
a a a a a b c d
I want to say:
5x(a) b c d
or algorithmically:
for 1 .. 5
print a
end
print b
print c
print d
Example2
Given:
a b a b a b a b c d
I want to say:
4x(a b) c d
or algorithmically:
for 1 .. 4
print a
print b
end
print c
print d
Example3
Given:
a b c d b c d b c d b c e
I want to say:
a 3x(b c d) b c e
or algorithmically:
print a
for 1 .. 3
print b
print c
print d
end
print b
print c
print d
It didn't remind me of any algorithm that I know of. I feel like some of the
problems can be ambiguous but finding one of the solutions is enough to me for
now. Efficiency is always welcome but not mandatory. How can I do this?
EDIT
First of all, thanks for all the discussion. I have adapted an LZW algorithm
from rosetta and ran it on my
input:
abcdbcdbcdbcdef
which gave me:
a
b
c
d
8 => bc
10 => db
9 => cd
11 => bcd
e
f
where I have a dictionary of:
a a
c c
b b
e e
d d
f f
8 bc
9 cd
10 db
11 bcd
12 dbc
13 cdb
14 bcde
15 ef
7 ab
It looks good for compression but it's not quite what I wanted. What I need
is more like compression in the algorithmic representation from my examples
which would have:
subsequent sequences (if a sequence is repeating, there would be no other
sequence in between)
no dictionary but only loops
irreducable
with maximum sequence sizes (which would minimize the algorithmic
representation)
and let's say nested loops are allowed (contrary to what I said before in
the comment)

I start with an algorithm, which gives maximum sequence sizes. Though it would not always minimize the algorithmic representation, it may be used as an approximation algorithm. Or it may be extended to optimal algorithm.
Start with constructing Suffix array for your text along with LCP array.
Sort an array of indexes of LCP array, indexes of larger elements of LCP array come first. This groups together repeating sequences of the same length and allows to process sequences in greedy manner, starting from maximum sequence sizes.
Extract suffix array entries, grouped by LCP value (by group I mean all the entries with selected LCP value as well as all entries with larger LCP values), and sort them by position in the text.
Filter out entries with positional difference not equal to LCP. For remaining entries, get prefixes of length, equal to LCP. This gives all possible sequences in the text.
Add sequences, sorted by starting position, to ordered collection (for example, binary search tree). Sequences are added in order of appearance in sorted LCP, so longer sequences are added first. Sequences are added only if they are independent or if one of them is completely nested inside the other one. Intersecting intervals are ignored. For example, in caba caba bab sequence ab intersects with caba and so it is ignored. But in cababa cababa babab one instance of ab is dropped, 2 instances are completely inside larger sequence, and 2 instances are completely outside of it.
At the end, this ordered collection contains all the information, needed to produce the algorithmic representation.
Example:
Text ababcabab
Suffix array ab abab ababcabab abcabab b bab babcabab bcabab cabab
LCP array 2 4 2 0 1 3 1 0
Sorted LCP 4 3 2 2 1 1 0 0
Positional difference 5 5 2 2 2 2 - -
Filtered LCP - - 2 2 - - - -
Filtered prefixes (ab ab) (ab ab)
Sketch of an algorithm, producing the minimal algorithmic representation.
Start with the first 4 steps of previous algorithm. Fifth step should be modified. Now it is not possible to ignore intersecting intervals, so every sequence is added to the collection. Since the collection now contains intersecting intervals, it is better to implement it as some advanced data structure, for example, Interval tree.
Then recursively determine the length of algorithmic representation for all sequences, that contain any nested sequences, starting from the smallest ones. When every sequence is evaluated, compute optimal algorithmic representation for whole text. Algorithm for processing either a sequence or whole text uses dynamic programming: allocate a matrix with number of columns, equal to text/sequence length and number of rows, equal to the length of algorithmic representation; doing in-order traversal of interval tree, update this matrix with all sequences, possible for each text position; when more than one value for some cell is possible, either choose any of them, or give preference to longer or shorter sub-sequences.

Related

TRUE/FALSE ← VLOOKUP ← Identify the ROW! of the first negative value within a column

Firstly, we have an array of predetermined factors, ie. V-Z;
their attributes are 3, the first two (•xM) multiplied giving the 3rd.
f ... factors
• ... cap, the values in the data set may increase max
m ... fixed multiplier
p ... let's call it power
This is a separate, standalone array .. we'd access with eg. VLOOKUP
f • m pwr
V 1 9 9
W 2 8 16
X 3 7 21
Y 4 6 24
Z 5 5 25
—————————————————————————————————————————————
Then we have 6 columns, in which the actual data to be processed is in, & thereof derive the next-level result, based on the interaction of both samples introduced.
In addition, there are added two columns, for balance & profit.
Here's a short, 6-row data sample:
f • m bal profit
V 2 3 377 1
Y 2 3 156 7
Y 1 1 122 0
X 1 2 -27 2
Z 3 3 223 3
—————————————————————————————————————————————
Ultimately, starting at the end, we are comparing IF -27 inverted → so 27 is within the X's power range ie. 21 (as per the first sample) .. which is then fed into a bigger formula, beyond the scope of this post.
This can be done with VLOOKUP, all fine by now.
—————————————————————————————————————————————
To get to that .. for the working example, we are focusing coincidentally on row5, since that's the one with the first negative value in the 'balance' column, so ..
on factorX = which factor exactly is to us unknown &
balance -27 = which we have to locate amongst potentially dozens to hundreds of rows.
Why!?
Once we know that the factor is X, based on the * & multiplier pertaining to it, then we also know which 'power' (top array) to compare -27, as the identified first negative value in the balance column, to.
Is that clear?
I'd like to know the formula on how to achieve that, & (get to) move on with the broader-scope work.
—————————————————————————————————————————————
The main issue for me is not knowing how to identify the first negative or row -27 pertains to, then having that piece of information how to leverage it to get the X or identify the factor type, especially since its positioned left of the latter & to the best of my knowledge I cannot use negative column index number (so, latter even if possible is out of the question anyway).
To recap;
IF(21>27) = IF(-21<-27)
27 → LOCATE ROW with the first negative number (-27)
21 → IDENTIFY the FACTOR TYPE, same row as (-27)
→ VLOOKUP pwr, based on factor type identified (top array, 4th column right)
→ invert either 21 to a negative number or (-27) to the positive number
= TRUE/FALSE
Guessing your columns I'll say your first chart is in columns A to D, and the second in columns G to K
You could find the letter of that factor with something like this:
=INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0)))
INDEX(J:J<0) converts that column to TRUE and FALSE depending on being negative or not and with XMATCH you find the first TRUE. You could then use that in VLOOKUP:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0)
That would return the 21. You can use the first concept too to find the the -27 and with ABS have its "positive value"
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0) > INDEX(J:J,XMATCH(TRUE,INDEX(J:J<0)))
That should return true or false in the comparison

Shuffle a matrix using specified indexes in Excel?

My goal is to shuffle a matrix using a specified matrix of indexes.
For example, let's say this is my input matrix:
A B C
D E F
G H I
Now I want to shuffle my input matrix. But not in a random way, I want to use a custom set of indexes (0, 1, 2 etc.) that represent the final order of my matrix. Let's say this is the matrix of indexes:
5 3 8
1 0 4
2 7 6
The resulting matrix should be:
F D I
B A E
C H G
since A was in position 0, B was in position 1 etc.
I'm looking for the formula that I should insert in each cell of the resulting matrix. I tried INDEX and MATCH functions but I'm not sure this is the right way.
I would just use quotient/mod to get row and column
=INDEX($A$1:$C$3,QUOTIENT(E1,3)+1,MOD(E1,3)+1)
You could make it more general if you wanted to.

Find the maximum benifit value from a sub-set of non-prefix neighbours?

1 < n <= 4 x 10^5
Length of each string can be up to 11
Each string contains only uppercase letters
Example - If there are 3 strings, A, B and AE, output is 200.
Explanation - S = {"A", "B", "AE"}
Strings A and AE are prefix neighbors, so they cannot both be in Mark's subset of S. String B has no prefix neighbor, so we include it in Mark's subset.
To maximize the benefit value, we choose AE and B for our subset. We then calculate the following benefit values for the chosen subset:
Benefit value of AE = 65+69 = 134
Benefit value of B = 66
Total benefit value = 134 + 66 = 200.
Insert the input words into a radix tree and splice out the non key words. Compute the maximum-weight independent set of the tree; the link goes to an unweighted algorithm, so you'll need to replace 1 by the weight of the node as defined by this question. All of this is linear-time.

column and row join a second box data

I want to column join
┌─┬─┬─┐
│1│1│2│
│2│4│4│
│3│9│6│
└─┴─┴─┘
and I'd like to put a=.1 2 3 as the fourth row, and then put b=.1 1 1 1 as the first column to the new boxed data. How can I do this easily? Do I have to ravel the whole thing and compute the dimention on my own in order to box it again?
Also, if I want the data i.8 to be 2 rows, do I have to calculate the other dimension 4(=8/2) in order to form a matrix 2 4$i.8? And then box it ;/2 4$i.8? Can I just specify one dimension, either the number of row or columns and ask automatic boxing or forming the matrix?
The answer to your question will involve learning about &. , the 'Under' conjunction, which is tremendously useful in J.
m
┌─┬─┬─┐
│1│1│2│
│2│2│4│
│3│9│6│
└─┴─┴─┘
a=. 1 2 3
b=. 1 1 1 1
So we want to add each item of a to each boxed column of m . It would be perfect if we could unbox the column using unbox(>), append the item of a to the column using append (,) and then rebox the column using box (<). This undo, act, redo cycle is exactly what Under (&.) does. It undoes both its right and left arguments ( m and a ) using the verb to its right, then applies the verb to its left, then uses the reverse of the verb to its right on the result. In practice,
m , &. > a
┌─┬─┬─┐
│1│1│2│
│2│2│4│
│3│9│6│
│1│2│3│
└─┴─┴─┘
The fact that a is unboxed when it was never boxed to begin with means that it is not changed, while m is unboxed before (,) is applied to each a . In fact this is used so often in J that &. > is assigned the name 'each'.
m , each a
┌─┬─┬─┐
│1│1│2│
│2│2│4│
│3│9│6│
│1│2│3│
└─┴─┴─┘
Prepending a boxed version of b requires first giving it an extra dimension with laminate (,:) then transposing (|:) b and finally boxing (<) the result. The step of adding the extra dimension is required because transposing swaps the indices and b start as a one-dimensional list.
(<#|:#,:b)
┌─┐
│1│
│1│
│1│
│1│
└─┘
The rest is easy as we just use append (,) to join the boxed b with (m, each a)
(<#|:#,: b) , m , each a
┌─┬─┬─┬─┐
│1│1│1│2│
│1│2│2│4│
│1│3│9│6│
│1│1│2│3│
└─┴─┴─┴─┘
Brackets around (<#|:#,: b) are necessary to force the correct order of execution.
For the second question, you can use i. n m to create a n X m array, which may help.
i. 4 2
0 1
2 3
4 5
6 7
i. 2 4
0 1 2 3
4 5 6 7
but perhaps I am misunderstanding your intentions here.
Hope this helps, bob
append a (with rank): ,"x a
You can simply append (,) a to your unboxed (>) input but you have to be careful with the append rank. You want to append each "item" of a, so you have right rank of "0". You want to apend to a 2-cell so you have a left rank of "2". Therefore, the , you need has rank "2 0. After the append, you rebox your data to a 2-cell with <"2.
<"2(>in)(,"2 0) a
┌─┬─┬─┐
│1│1│2│
│2│4│4│
│3│9│6│
│1│2│3│
└─┴─┴─┘
prepend b: b,
If your b has the right shape you prepend it with b,. The shape you seem to use is (boxed) 4 1:
b =: < 4 1$ 1
┌─┐
│1│
│1│
│1│
│1│
└─┘
b,in
┌─┬─┬─┬─┐
│1│1│1│2│
│1│2│4│4│
│1│3│9│6│
│1│1│2│3│
└─┴─┴─┴─┘

Generating Unique Combinations from a list of possible repeated characters

I am looking to generate combinations from a list of elements. Right now i am using a approach of generating power set. For example to generate combinations from {a,b,c}, i will enumerate 001,010,100 ,101 etc...and take the element for which the corresponding binary index is set to 1.
But the problem comes when there are repeated characters in the list Say {a,a,b}. the above approach would give a,a,b,ab,ba,aab. where as i would like to see only a,b,ab,aa,aab.
I was thinking of writing some binary mask to eliminate repeated strings but was not succesfull.
Any thoughts on how to generate unique combinations ?
Rather than generate bit vectors, you can generate vectors of positive integers of length equal to the number of distinct elements, subject to the restriction that each component can range from 0 up to the multiplicity of the corresponding element. In your example above, there are two distinct elements (a and b) with multilpicities 2 and 1, respectively. Therefore, you would get
a b
-------
0 1 --> b
1 0 --> a
1 1 --> ab
2 0 --> aa
2 1 --> aab

Resources