Generating Unique Combinations from a list of possible repeated characters - combinatorics

I am looking to generate combinations from a list of elements. Right now I am using the approach of generating the power set. For example, to generate combinations from {a,b,c}, I enumerate the bit patterns 001, 010, 100, 101, etc. and take each element whose corresponding bit is set to 1.
The problem comes when there are repeated characters in the list, say {a,a,b}. The above approach gives a,a,b,ab,ba,aab, whereas I would like to see only a,b,ab,aa,aab.
I was thinking of writing some binary mask to eliminate repeated strings, but was not successful.
Any thoughts on how to generate unique combinations?

Rather than generate bit vectors, you can generate vectors of non-negative integers of length equal to the number of distinct elements, subject to the restriction that each component can range from 0 up to the multiplicity of the corresponding element. In your example above, there are two distinct elements (a and b) with multiplicities 2 and 1, respectively. Therefore, leaving out the all-zero vector (the empty combination), you would get
a b
-------
0 1 --> b
1 0 --> a
1 1 --> ab
2 0 --> aa
2 1 --> aab
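The counting-vector idea can be sketched in Python as follows. This is only a minimal illustration, not code from the answer: the function name unique_combinations is made up here, and it assumes the elements are single characters so that a combination can be written as a string.

```python
from collections import Counter
from itertools import product

def unique_combinations(items):
    """Yield each distinct non-empty sub-multiset of items as a string."""
    counts = Counter(items)                  # e.g. {'a': 2, 'b': 1}
    symbols = sorted(counts)                 # distinct elements, fixed order
    # One range per distinct element: 0 .. multiplicity (inclusive).
    ranges = [range(counts[s] + 1) for s in symbols]
    for vector in product(*ranges):
        if any(vector):                      # skip the all-zero (empty) vector
            yield "".join(s * k for s, k in zip(symbols, vector))

result = list(unique_combinations("aab"))
print(result)  # ['b', 'a', 'ab', 'aa', 'aab']
```

The number of vectors is the product of (multiplicity + 1) over the distinct elements, so no duplicates are ever generated and nothing has to be filtered afterwards.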


Extract subsequences from main dataframe based on the locations in another dataframe

I want to extract the subsequences indicated by the first and last locations in data frame 'B'.
The algorithm that I came up with is:
Identify the rows of B that fall in the locations of A
Find the relative position of the locations (i.e. shift the locations to make them start from 0)
Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is runtime. I need an alternative approach that runs faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})
One solution could be as follows:
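# pair each row of B with the row of A whose 'last' is the nearest value >= B's 'last'
out = pd.merge_asof(B, A, on=['last'], direction='forward',
                    suffixes=('', '_y'))
# shift the locations so that they are relative to the start of the matched sequence
out.loc[:, ['first', 'last']] = \
    out.loc[:, ['first', 'last']].sub(out.first_y, axis=0)
# slice each sequence row by row, then drop the helper columns
out = out.assign(
    sequences=out.apply(
        lambda row: row['first.sequence'][row['first']:row['last'] + 1],
        axis=1)
).drop(['first.sequence', 'first_y'], axis=1)
# restore the original 'first' and 'last' values from B
out.update(B)
print(out)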
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
First, use pd.merge_asof to match each row of B with a row of A: with on=['last'] and direction='forward', each last value in B is paired with the nearest last value in A that is greater than or equal to it. I.e. the rows starting at 3 and 8 will match the sequence starting at 1, and the rows starting at 105 and 115 will match the sequence starting at 100. Now we know which string (sequence) needs slicing, and we also know where the string starts, e.g. at index 1 or 100 instead of the usual 0.
We use this last bit of information to find out where the string slice should start and end. So, we do out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
Now, we can use df.apply to get the string slice for each sequence, essentially looping over each row. (If your slices all started and ended at the same indices, we could have used Series.str.slice instead.)
Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.
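For comparison, the same matching rule can be sketched without pandas, using a binary search from the standard library. This is a different technique from the merge_asof solution, not a restatement of it: the list names and the extract helper are hypothetical, and it assumes the sequences are stored in ascending order of their last location and that every query falls inside one sequence.

```python
from bisect import bisect_left

# Same data as the A and B frames, as plain Python lists (assumed layout).
sequences = ["AAACACCCGGAG", "ACCACACCCCAAATGTGT"]
seq_first = [1, 100]      # location of the first character of each sequence
seq_last = [12, 117]      # location of the last character, sorted ascending
queries = [(3, 5), (8, 12), (105, 111), (115, 117)]

def extract(first, last):
    # Index of the first sequence whose last location is >= the query's last,
    # i.e. the same pairing that direction='forward' produces.
    i = bisect_left(seq_last, last)
    offset = seq_first[i]
    # Shift to 0-based positions; the end location is inclusive.
    return sequences[i][first - offset:last - offset + 1]

subsequences = [extract(f, l) for f, l in queries]
print(subsequences)  # ['ACA', 'CGGAG', 'ACCCCAA', 'TGT']
```

Each lookup is a binary search plus one slice, so there is no row-wise apply involved.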

Excel: Count until, then repeat?

I have a list of numbers which are either 1's or 2's. What I'd like to do is count how many 1's there are before a 2 appears, and then keep repeating this down the list (I'm trying to find the average number of 1's between each 2).
What would be the best way of doing this considering I've got over 10,000 rows? (i.e. too many to do manually)
The average number of 1's between each 2 is the same as the ratio of the number of 1's to the number of 2's.
Example:
1
1
2
1
1
1
1
2
1
1
2
1
1
2
Contains 10 ones and 4 twos.
Or there are four groups of ones, with the following counts: 2, 4, 2, 2
Either way, it will give you an average of 2.5 (10/4 = 2.5)
Note: You have to make a design choice, regarding how to handle beginnings and ends. If you had another one, after the last two, how should it be handled?
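To make the equivalence concrete, here is a small Python sketch that computes both the ratio and the per-group counts for the example list. It is only an illustration; the choice to ignore any 1's that follow the last 2 is an assumption made here, which is exactly the design choice mentioned above.

```python
values = [1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2]

# Ratio approach: number of 1's divided by number of 2's.
ratio = values.count(1) / values.count(2)

# Group approach: count the 1's seen before each 2.
groups = []
run = 0
for v in values:
    if v == 1:
        run += 1
    else:
        groups.append(run)   # a 2 closes the current group
        run = 0
# Assumption: 1's after the last 2 (a non-zero `run` here) are ignored.

average = sum(groups) / len(groups)
print(ratio, groups, average)  # 2.5 [2, 4, 2, 2] 2.5
```

If the list ended with extra 1's, the ratio would count them while the group average would not, so the two numbers agree only when the list ends with a 2.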
You can use the formulas shown below:
Note that the formula in the first row is different.
B                       C
=IF(A2=1,B1,B1+1)       =COUNTIF(B:B,B2)
=IF(A3=1,B2,B2+1)       =IFERROR(IF(A4=2,COUNTIF(B:B,B4),"")-1,"")
Then to get the average use:
=AVERAGEIF(C:C,"<>"&0)
Noceo's solution as a formula:
=COUNTIF(A:A,1)/COUNTIF(A:A,2)

How do you group data in columns?

I have numeric data under fifty samples that are mostly similar. I want to count identical columns and give statistics on the same. There are too many rows to select them (37,888). Data looks like:
Sample 1 Sample 2 Sample 3 ........ Sample 50
4 4 0
4 4 0
4 4 ...
0 0
0 0
0 0
0 0
... ...
up to thousands of rows for each sample.
There is a column for date/time as well, would be nice if I could include that in the grouping.
In this snippet, there are many rows. Sample 1 and 2 are identical hence should be grouped together. Sample three would form another group and so on.
While I'm not sure what "There are too many rows to select them" means in this context (there is no limit on the number of rows or items that can be selected and included in a formula), this looks like a job for array formulas.
If you want to determine (for instance) whether columns C and D are equal, from rows 1 through 37888, you can use this formula:
=AND(C1:C37888=D1:D37888)
To make Excel treat this as an array formula, you need to press CTRL-SHIFT-ENTER (Windows) or CMD-ENTER (Mac) after typing the formula. The "AND" function will return TRUE if and only if all corresponding entries are equal: C1=D1, C2=D2, C3=D3, ..., C37888=D37888. It returns FALSE if any corresponding entries disagree.
Exactly what you do next will depend on the nature of the statistics that you want to compute for each group, but this formula will at least help you figure out which columns belong in the same group together.
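Outside Excel, the same "are these two columns identical?" test extends naturally to grouping all columns at once. The sketch below is in Python and uses a made-up miniature of the sheet (the sample values are hypothetical); it assumes the data has already been read into one list per sample.

```python
from collections import defaultdict

# Hypothetical miniature of the sheet: sample name -> column values.
columns = {
    "Sample 1": [4, 4, 4, 0, 0],
    "Sample 2": [4, 4, 4, 0, 0],
    "Sample 3": [0, 0, 0, 0, 0],
}

groups = defaultdict(list)
for name, values in columns.items():
    # Columns with exactly the same values, row by row, share the same key.
    groups[tuple(values)].append(name)

grouped = list(groups.values())
print(grouped)  # [['Sample 1', 'Sample 2'], ['Sample 3']]
```

Each column is compared once against a dictionary key rather than pairwise against every other column, which matters more with fifty samples than with three.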

Haskell multivariable Lambda function for lists

I am confused on how this is computed.
Input: groupBy (\x y -> (x*y `mod` 3) == 0) [1,2,3,4,5,6,7,8,9]
Output: [[1],[2,3],[4],[5,6],[7],[8,9]]
First, does x and y refer to the current and the next element?
Second, is this saying that it will group the elements that equal 0 when it is modded by 3? If so, how come there are elements that are not equal to 0 when modded by 3 in the output?
Found here: http://zvon.org/other/haskell/Outputlist/groupBy_f.html
To answer your second question: We compare two elements by multiplying them and seeing if the result is divisible by 3. "So why are there elements in the output not divisible by 3?" If they aren't divisible, that doesn't filter them out (that's what filter does); rather, when the predicate fails, the element goes into a separate group. When it succeeds, the element goes into the current group.
As to your first question, this took me a little while to figure out... x and y aren't two consecutive elements; rather, y is the current element and x is the first element in the current group. (!)
1 * 2 = 2; 2 `mod` 3 = 2; 1 and 2 go in separate groups.
2 * 3 = 6; 6 `mod` 3 = 0; 2 and 3 go in the same group.
2 * 4 = 8; 8 `mod` 3 = 2; 4 gets put in a different group.
...
Notice, on that last line, we're looking at 2 and 4 — not 3 and 4, as you might reasonably expect.
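The behaviour traced above can be reproduced with a small Python re-implementation. This is a sketch that mimics the "compare with the first element of the current group" rule described here; the name group_by is made up, and it is not the Haskell library source.

```python
def group_by(eq, xs):
    """Group xs, comparing each element with the FIRST element of the current group."""
    groups = []
    for y in xs:
        if groups and eq(groups[-1][0], y):
            groups[-1].append(y)     # predicate holds: y joins the current group
        else:
            groups.append([y])       # predicate fails: y starts a new group
    return groups

result = group_by(lambda x, y: (x * y) % 3 == 0, range(1, 10))
print(result)  # [[1], [2, 3], [4], [5, 6], [7], [8, 9]]
```

Note that 4 is compared with 2 (the head of [2, 3]), not with 3, which is why it ends up in its own group.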
First, does x and y refer to the current and the next element?
Roughly, yes.
Second, is this saying that it will group the elements that equal 0 when it is modded by 3? If so, how come there are elements that are not equal to 0 when modded by 3 in the output?
The lambda defines a relation between two integers x and y, which holds whenever the product x*y is a multiple of 3. Since 3 is prime, x must be a multiple of 3 or y must be such.
For the input [1,2,3,4,5,6,7,8,9], it is first checked whether 1 is in relation with 2. This is false, so 1 gets its own singleton group [1]. Then, we proceed with 2 and 3: now the relation holds, so 2,3 will share their group. Next, we check whether 2 and 4 are in relation: this is false. So, the group is [2,3] and not any larger. Then we proceed with 4 and 5 ...
I must confess that I do not like this example very much, since the relation is not an equivalence relation (because it is not transitive). Because of this, the exact result of groupBy is not guaranteed: the implementation might test the relation on 3,4 (true) instead of 2,4 (false), and build a group [2,3,4] instead.
Quoting from the docs:
The predicate is assumed to define an equivalence.
So, once this contract is violated, there are no guarantees on what the output of groupBy might be.
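To see how much the result depends on which pairs get tested, here is a Python sketch of a hypothetical variant that compares each element with its immediate predecessor instead. The function group_by_adjacent is an invented illustration of "another possible implementation", not an actual library function.

```python
def group_by_adjacent(eq, xs):
    """Hypothetical variant: compare each element with the LAST element of the current group."""
    groups = []
    for y in xs:
        if groups and eq(groups[-1][-1], y):
            groups[-1].append(y)
        else:
            groups.append([y])
    return groups

rel = lambda x, y: (x * y) % 3 == 0
variant = group_by_adjacent(rel, range(1, 10))
print(variant)  # [[1], [2, 3, 4], [5, 6, 7], [8, 9]]
```

With a genuine equivalence relation both strategies would produce the same groups; here they differ because the relation is not transitive (2~3 and 3~4 hold, but 2~4 does not).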
The groupBy function takes a list and returns a list of lists such that each sublist in the result contains only equal elements, based on the equality function you provide.
In this case, you are trying to find all sublists where, for all sublist elements x and y, mod (x*y) 3 == 0 (and the ones where it doesn't == 0). Slightly weird, but there you go. groupBy only looks at adjacent elements. Sort the list first to reduce the number of duplicate groups.

loop rolling algorithm

I have come up with the term loop rolling myself with the hope that it does
not overlap with an existing term. Basically I'm trying to come up with an
algorithm to find loops in a printed text.
Some examples from simple to complicated
Example1
Given:
a a a a a b c d
I want to say:
5x(a) b c d
or algorithmically:
for 1 .. 5
    print a
end
print b
print c
print d
Example2
Given:
a b a b a b a b c d
I want to say:
4x(a b) c d
or algorithmically:
for 1 .. 4
    print a
    print b
end
print c
print d
Example3
Given:
a b c d b c d b c d b c e
I want to say:
a 3x(b c d) b c e
or algorithmically:
print a
for 1 .. 3
    print b
    print c
    print d
end
print b
print c
print e
It didn't remind me of any algorithm that I know of. I feel like some of the
problems can be ambiguous but finding one of the solutions is enough to me for
now. Efficiency is always welcome but not mandatory. How can I do this?
EDIT
First of all, thanks for all the discussion. I have adapted an LZW algorithm
from rosetta and ran it on my
input:
abcdbcdbcdbcdef
which gave me:
a
b
c
d
8 => bc
10 => db
9 => cd
11 => bcd
e
f
where I have a dictionary of:
a a
c c
b b
e e
d d
f f
8 bc
9 cd
10 db
11 bcd
12 dbc
13 cdb
14 bcde
15 ef
7 ab
It looks good for compression but it's not quite what I wanted. What I need
is more like compression in the algorithmic representation from my examples
which would have:
subsequent sequences (if a sequence is repeating, there would be no other sequence in between)
no dictionary but only loops
irreducible
with maximum sequence sizes (which would minimize the algorithmic representation)
and let's say nested loops are allowed (contrary to what I said before in the comment)
I start with an algorithm that gives maximum sequence sizes. Though it will not always minimize the algorithmic representation, it may be used as an approximation algorithm, or it may be extended to an optimal algorithm.
Start by constructing the suffix array for your text, along with the LCP array.
Sort an array of indexes into the LCP array so that indexes of larger LCP elements come first. This groups together repeating sequences of the same length and allows processing sequences greedily, starting from the maximum sequence sizes.
Extract suffix array entries, grouped by LCP value (by group I mean all the entries with the selected LCP value as well as all entries with larger LCP values), and sort them by position in the text.
Filter out entries whose positional difference is not equal to the LCP. For the remaining entries, take the prefixes of length equal to the LCP. This gives all possible sequences in the text.
Add sequences, sorted by starting position, to ordered collection (for example, binary search tree). Sequences are added in order of appearance in sorted LCP, so longer sequences are added first. Sequences are added only if they are independent or if one of them is completely nested inside the other one. Intersecting intervals are ignored. For example, in caba caba bab sequence ab intersects with caba and so it is ignored. But in cababa cababa babab one instance of ab is dropped, 2 instances are completely inside larger sequence, and 2 instances are completely outside of it.
At the end, this ordered collection contains all the information, needed to produce the algorithmic representation.
Example:
Text:                   ababcabab
Suffix array:           ab abab ababcabab abcabab b bab babcabab bcabab cabab
LCP array:              2 4 2 0 1 3 1 0
Sorted LCP:             4 3 2 2 1 1 0 0
Positional difference:  5 5 2 2 2 2 - -
Filtered LCP:           - - 2 2 - - - -
Filtered prefixes:      (ab ab) (ab ab)
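The numbers in this example can be reproduced with a short Python sketch of the first four steps. It is deliberately simplified and is not the full algorithm: the suffix array is built by naive sorting, and only neighbouring suffix-array entries are compared rather than whole groups of entries sharing an LCP value. The function name tandem_candidates is made up for this illustration.

```python
def tandem_candidates(text):
    # Naive suffix array: sort suffix start positions by the suffix text.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix(i, j):
        # Length of the longest common prefix of the suffixes starting at i and j.
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n

    # LCP array: common prefix length of each pair of neighbouring suffixes.
    lcp = [common_prefix(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]

    found = []
    # Visit LCP entries from largest to smallest (the greedy order).
    for k in sorted(range(len(lcp)), key=lambda k: -lcp[k]):
        a, b = sorted((sa[k], sa[k + 1]))
        # Keep a pair only if the positional difference equals the LCP,
        # i.e. the two occurrences sit directly next to each other.
        if lcp[k] > 0 and b - a == lcp[k]:
            found.append((a, text[a:a + lcp[k]]))
    return sa, lcp, found

sa, lcp, found = tandem_candidates("ababcabab")
print(lcp)    # [2, 4, 2, 0, 1, 3, 1, 0]
print(found)  # [(5, 'ab'), (0, 'ab')]
```

Each entry in found is a start position and the repeated unit, so (0, 'ab') stands for "ab ab" at the start of the text and (5, 'ab') for "ab ab" at the end. For long texts the naive sort would need to be replaced by a proper suffix array construction.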
Sketch of an algorithm, producing the minimal algorithmic representation.
Start with the first 4 steps of previous algorithm. Fifth step should be modified. Now it is not possible to ignore intersecting intervals, so every sequence is added to the collection. Since the collection now contains intersecting intervals, it is better to implement it as some advanced data structure, for example, Interval tree.
Then recursively determine the length of algorithmic representation for all sequences, that contain any nested sequences, starting from the smallest ones. When every sequence is evaluated, compute optimal algorithmic representation for whole text. Algorithm for processing either a sequence or whole text uses dynamic programming: allocate a matrix with number of columns, equal to text/sequence length and number of rows, equal to the length of algorithmic representation; doing in-order traversal of interval tree, update this matrix with all sequences, possible for each text position; when more than one value for some cell is possible, either choose any of them, or give preference to longer or shorter sub-sequences.
