Difference between subarray, subset & subsequence

I'm a bit confused between subarray, subsequence & subset
if I have {1,2,3,4}
then
subsequence can be {1,2,4} OR {2,4} etc. So basically I can omit some elements but keep the order.
A subarray would be (say, a subarray of size 3)
{1,2,3}
{2,3,4}
Then what would be the subset?
I'm a bit confused between these 3.

Consider an array:
{1,2,3,4}
Subarray: contiguous sequence in an array i.e.
{1,2},{1,2,3}
Subsequence: Need not be contiguous, but maintains order, i.e.
{1,2,4}
Subset: Like a subsequence, except that order does not matter and the empty set is included, i.e.
{1,3},{}
Given an array/sequence of size n, the possible counts are:
Subarray = n*(n+1)/2 (non-empty subarrays)
Subsequence = (2^n) - 1 (non-empty subsequences)
Subset = 2^n
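To sanity-check those counts, here is a minimal brute-force sketch. The helper names (subarrays, subsequences, subsets) are made up for illustration, and it assumes the elements are distinct, as in the {1,2,3,4} example.

```python
from itertools import combinations

def subarrays(a):
    """All non-empty contiguous slices of the list a."""
    n = len(a)
    return [a[i:j] for i in range(n) for j in range(i + 1, n + 1)]

def subsequences(a):
    """All non-empty subsequences: original order kept, gaps allowed."""
    # combinations() emits elements in their original order.
    return [list(c) for k in range(1, len(a) + 1) for c in combinations(a, k)]

def subsets(a):
    """All subsets of the distinct elements of a, including the empty set."""
    s = sorted(set(a))
    return [frozenset(c) for k in range(len(s) + 1) for c in combinations(s, k)]

a = [1, 2, 3, 4]
n = len(a)
print(len(subarrays(a)), n * (n + 1) // 2)   # 10 10
print(len(subsequences(a)), 2 ** n - 1)      # 15 15
print(len(subsets(a)), 2 ** n)               # 16 16
```

This enumerates everything, so it is only meant for small n.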

In my opinion, if the given pattern is an array, a so-called subarray means a contiguous subsequence.
For example, if given {1, 2, 3, 4}, subarray can be
{1, 2, 3}
{2, 3, 4}
etc.
If the given pattern is a sequence, a subsequence contains elements whose subscripts are increasing in the original sequence.
For example, also {1, 2, 3, 4}, subsequence can be
{1, 3}
{1,4}
etc.
If the given pattern is a set, a subset contains any possible combination of elements of the original set.
For example, {1, 2, 3, 4}, subset can be
{1}
{2}
{3}
{4}
{1, 2}
{1, 3}
{1, 4}
{2, 3}
etc.

Consider these two properties in collection (array, sequence, set, etc) of elements: Order and Continuity.
Order is when you cannot switch the indices or locations of two or more elements (a collection with a single element has an irrelevant order).
Continuity is when every element keeps its original neighbors next to it (or has no neighbor on that side).
A subarray has Order and Continuity.
A subsequence has Order but not Continuity.
A subset has neither Order nor Continuity.
A collection with Continuity but not Order does not exist (to my knowledge)

In the context of an array, a subsequence need not be contiguous but needs to maintain the order, whereas a subarray is contiguous and inherently maintains the order.
If you have {1,2,3,4}, then {1,3,4} is a valid subsequence but it's not a subarray.
A subset needs neither order nor contiguity, so {1,3,2} is a valid subset but not a subsequence or subarray.
{1,2} is a valid subarray, subset and subsequence.
All subarrays are subsequences and all subsequences are subsets.
But sometimes subset, subarray and subsequence are used interchangeably, and the word contiguous is prefixed to make the meaning clearer.
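The three definitions can be written as small membership tests. This is only a sketch; the function names are hypothetical and both arguments are assumed to be Python lists.

```python
def is_subarray(cand, arr):
    """True if cand appears as a contiguous run inside arr."""
    k = len(cand)
    return any(arr[i:i + k] == cand for i in range(len(arr) - k + 1))

def is_subsequence(cand, arr):
    """True if cand can be obtained from arr by deleting zero or more elements."""
    it = iter(arr)
    # 'x in it' advances the iterator, so matches must occur in order.
    return all(x in it for x in cand)

def is_subset(cand, arr):
    """True if every element of cand occurs in arr; order is ignored."""
    return set(cand) <= set(arr)

arr = [1, 2, 3, 4]
for cand in ([1, 2], [1, 3, 4], [1, 3, 2]):
    print(cand, is_subarray(cand, arr), is_subsequence(cand, arr), is_subset(cand, arr))
# [1, 2] True True True
# [1, 3, 4] False True True
# [1, 3, 2] False False True
```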

Per my understanding, for example, we have a list say [3,5,7,8,9]. here
subset doesn’t need to maintain order and has non-contiguous behavior. For example, [9,3] is a subset
subsequence maintain order and has non-contiguous behavior. For example, [5,8,9] is a subsequence
subarray maintains order and has contiguous behavior. For example, [8,9] is a subarray

subarray: some continuous elements in the array
subset: some elements in the collection
subsequence: in most cases, some elements in the array maintaining relative order (not necessarily continuous)

A simple and straightforward explanation:
Subarray: it must always be contiguous.
For example, let's take an array arr=[10,20,30,40,50];
--> Now let's look at some candidates:
subarr=[10,20] //true
subarr=[10,30] //false, because it's not contiguous
subarr=[40,50] //true
Subsequence: it need not be contiguous, but it must keep the same order.
For example, let's take an array arr=[10,20,30,40,50];
--> Now let's look at some candidates:
subseq=[10,20]; //true
subseq=[10,30]; //true
subseq=[30,20]; //false, because order isn't maintained
Subset: any possible combination of elements, in any order.
For example, let's take an array arr=[10,20,30,40,50];
--> Now let's look at some candidates:
subset={10,20}; //true
subset={10,30}; //true
subset={30,20}; //true

The following are examples for an array:
Array : 1,2,3,4,5,6,7,8,9
Sub Array : 2,3,4,5,6 >> Contiguous elements in order
Sub Sequence : 2,4,7,8 >> Elements in order, skipping zero or more elements
Subset : 9,5,2,1 >> Elements chosen by skipping zero or more elements, not necessarily in order

Suppose we have an array [3,4,6,7,9]
A subarray is a continuous and ordered part of that array
Examples are [3,4,6],[7,9],[4]
A subsequence does not need to be continuous, but its elements must be in order
Examples are [3,4,9],[3,7],[6]
A subset needs neither to be continuous nor to be in order
Examples are [9,4,7],[3,4],[7]

A subarray is a contiguous part of an array and maintains a relative ordering of elements. For an array/string of size n, there are n*(n+1)/2 non-empty subarrays/substrings.
A subsequence maintains a relative ordering of elements but may or may not be a contiguous part of an array. For a sequence of size n, we can have 2^n-1 non-empty sub-sequences in total.
A subset need not maintain the relative ordering of elements and need not be a contiguous part of an array. For a set of size n, we can have (2^n) subsets in total.
Let us understand it with an example.
Consider an array:
array = [1,2,3,4]
Subarray : [1,2],[1,2,3] — is continuous and maintains relative order of elements
Subsequence: [1,2,4] — is not continuous but maintains relative order of elements
Subset: [1,3,2] — is not continuous and does not maintain the relative order of elements
Some interesting observations:
Every Subarray is a Subsequence.
Every Subsequence is a Subset.

Related

Comparing the word-counts of two files, accounting for the number of occurrences

I'm currently working on a program which is supposed to find exploits for vulnerabilities in web-applications by looking at the "Document Object Model" (DOM) of the application.
One approach for narrowing down the number of possible DB-entries follows the strategy of further filtering the entries by comparing the word-count of the DOM and the database entry.
I already have two dicts (actually DataFrames, but shown as dicts here for readability), each containing the top 10 words in descending order of their number of occurrences in the text.
word_count_dom = {"Peter": 10, "is": 6, "eating": 2, ...}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1, ...}
Now I would like to calculate some kind of value that represents how similar the two dicts are while accounting for the number of occurrences.
Currently I'm using:
sum(word in word_count_db for word in word_count_dom)
>>> 3
but this only counts the shared words and does not account for the number of occurrences at all.
Looking at the example, I would like the program to give more weight to the "is" match, because of the generally better "ranking position to number of occurrences" value.
Just an idea:
Compute, for each dict, the relative frequency of each entry (e.g. among all the top counts "Peter" occurs 20% of the time). Do this for each word occurring in either dict, and then use something like:
https://en.wikipedia.org/wiki/Bhattacharyya_distance
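A rough sketch of that idea, assuming plain dicts of word -> count and at least one word in each dict. The function names are hypothetical, and the example dicts contain only the three or four entries shown in the question, not the elided ones. The Bhattacharyya coefficient is the sum over words of sqrt(p(w) * q(w)); the distance is its negative logarithm.

```python
import math

def bhattacharyya_coefficient(counts_a, counts_b):
    """Overlap in [0, 1] between two word-count dicts, weighted by frequency."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    words = set(counts_a) | set(counts_b)
    # Words missing from one dict get probability 0 there and add nothing.
    return sum(
        math.sqrt((counts_a.get(w, 0) / total_a) * (counts_b.get(w, 0) / total_b))
        for w in words
    )

def bhattacharyya_distance(counts_a, counts_b):
    """0 for identical distributions, infinity when no word is shared."""
    bc = bhattacharyya_coefficient(counts_a, counts_b)
    return math.inf if bc == 0 else -math.log(bc)

word_count_dom = {"Peter": 10, "is": 6, "eating": 2}
word_count_db = {"eating": 6, "is": 6, "Peter": 1, "breakfast": 1}
print(bhattacharyya_coefficient(word_count_dom, word_count_db))
```

A coefficient close to 1 means the two texts use the same words in similar proportions, so it could serve directly as the similarity score.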

Astropy get table length

How can I get the length (i.e. number of rows) of an astropy Table? From the documentation, there are several ways of having the table length printed out, such as t.info(). However, I can't use this information in a script.
How do I assign the length of a table to a variable?
In Python the len() built-in function typically gives the length/size of some collection-like object. For example the length of a 1-D array is given like:
>>> a = [1, 2, 3]
>>> len(a)
3
For a table you could ask what the "size" of a table means--the number of rows? The number of columns? The total number of items in the table? But it sounds like you want the number of rows. In Python, this will almost always be given by len() on table-like objects as well (arguably anything that does otherwise is a mistake). You can consider this by analogy to how you might construct a table-like data structure with simple Python lists, by nesting them:
>>> t = [
... [1, 2, 3],
... [4, 5, 6],
... [7, 8, 9]
... ]
Here each "row" is represented by a single list nested in the outer list, so len(t) gives the number of rows. In fact this is just a convention and can be broken if need be. For example, you could also treat the above t as a list of columns for some column-oriented data.
But in Python we typically assume 2-dimensional arrays to be row-oriented unless otherwise stated. To remember this, note that the syntax for a nested list as written above looks row-oriented.
The same logic extends to NumPy arrays and to more complicated data structures built on them, such as Astropy's Table or pandas DataFrames: for a Table t, n_rows = len(t) stores the number of rows in a variable.
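To make the row-oriented convention concrete with plain nested lists (no Astropy needed for this sketch; the variable names are just for illustration):

```python
# A small row-oriented "table": each inner list is one row.
t = [
    [1, 2, 3],
    [4, 5, 6],
]

n_rows = len(t)                   # length of the outer list = number of rows
n_cols = len(t[0]) if t else 0    # length of one row = number of columns

# Transposing gives the column-oriented view of the same data.
columns = [list(col) for col in zip(*t)]

print(n_rows, n_cols)   # 2 3
print(len(columns))     # 3
```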

Generate all "without-replacement" subsets series

I'm looking for a way to generate all possible subcombinations of a set, where each element can be used at most one time.
For example, the set {1,2,3} would yield
{{1},{2},{3}}
{{1},{2,3}}
{{1,2},{3}}
{{2},{1,3}}
{{1,2,3}}
A pseudocode hint would be great. Also, if there is a term for this, or a terminology that applies, I would love to learn it.
First, a few pointers.
The separation of a set into disjoint subsets is called a set partition (Wikipedia, MathWorld).
A common way to encode a set partition is a restricted growth string.
The number of set partitions is a Bell number, and they grow fast: for a set of 20 elements, there are 51,724,158,235,372 set partitions.
Here is how encoding works.
Look at the elements in increasing order: 1, 2, 3, 4, ... .
Let c be the current number of subsets, initially 0.
Whenever the current element is the lowest element of its subset, we assign this set the number c, and then increase c by 1.
Regardless, we write down the number of the subset which contains the current element.
It follows from the procedure that the first element of the string will be 0, and each next element is no greater than the maximum so far plus one. Hence the name, "restricted growth strings".
For example, consider the partition {1,3},{2,5},{4}.
Element 1 is the lowest in its subset, so subset {1,3} is labeled by 0.
Element 2 is the lowest in its subset, so subset {2,5} is labeled by 1.
Element 3 is in the subset already labeled by 0.
Element 4 is the lowest in its subset, so subset {4} is labeled by 2.
Element 5 is in the subset already labeled by 1.
Thus we get the string 01021.
The string tells us:
Element 1 is in subset 0.
Element 2 is in subset 1.
Element 3 is in subset 0.
Element 4 is in subset 2.
Element 5 is in subset 1.
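The encoding procedure above can be sketched in a few lines. The function names are hypothetical, and the partition is assumed to be a list of sets covering exactly 1..n.

```python
def partition_to_rgs(blocks, n):
    """Encode a partition of {1..n} as a restricted growth string (list of ints)."""
    block_of = {x: frozenset(b) for b in blocks for x in b}
    label = {}   # block -> number assigned when its lowest element was seen
    rgs = []
    for x in range(1, n + 1):
        b = block_of[x]
        if b not in label:
            label[b] = len(label)   # len(label) plays the role of c
        rgs.append(label[b])
    return rgs

def rgs_to_partition(rgs):
    """Decode a restricted growth string back into a list of blocks."""
    blocks = {}
    for x, lab in enumerate(rgs, start=1):
        blocks.setdefault(lab, []).append(x)
    return [blocks[lab] for lab in sorted(blocks)]

print(partition_to_rgs([{1, 3}, {2, 5}, {4}], 5))   # [0, 1, 0, 2, 1]
print(rgs_to_partition([0, 1, 0, 2, 1]))            # [[1, 3], [2, 5], [4]]
```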
To get a feel for it from a different angle, here are all partitions of a four-element set, along with the respective restricted growth strings:
0000 {1,2,3,4}
0001 {1,2,3},{4}
0010 {1,2,4},{3}
0011 {1,2},{3,4}
0012 {1,2},{3},{4}
0100 {1,3,4},{2}
0101 {1,3},{2,4}
0102 {1,3},{2},{4}
0110 {1,4},{2,3}
0111 {1},{2,3,4}
0112 {1},{2,3},{4}
0120 {1,4},{2},{3}
0121 {1},{2,4},{3}
0122 {1},{2},{3,4}
0123 {1},{2},{3},{4}
As for pseudocode, it's relatively straightforward to generate all such strings.
We do it recursively.
Maintain the value c, assign every number from 0 to c inclusive to the current position, and for each such choice, recursively construct the rest of the string.
Also it is possible to generate them lazily, starting with a string with all zeroes and repeatedly finding the lexicographically next such string, akin to how next_permutation is used to list all permutations.
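A minimal sketch of the recursive variant, written as a Python generator (the function name is made up here). It yields the strings in lexicographic order, so for n = 4 it reproduces the 15 strings listed above.

```python
def restricted_growth_strings(n):
    """Yield every restricted growth string of length n as a tuple of ints."""
    def extend(prefix, c):
        # c = number of subsets used so far, so labels 0..c are allowed next.
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(c + 1):
            prefix.append(label)
            yield from extend(prefix, max(c, label + 1))
            prefix.pop()
    yield from extend([], 0)

for rgs in restricted_growth_strings(3):
    print("".join(map(str, rgs)))
# 000
# 001
# 010
# 011
# 012
```

The five strings for n = 3 correspond to the five partitions of {1,2,3} listed in the question.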
Lastly, if you'd like to see more than that (along with the mentioned next function), here's a bit of self-promotion.
Recently, we did a learning project at my university, which required the students to implement various functions for combinatorial objects with reasonable efficiency.
Here is the part we got for restricted growth strings; I linked the header part which describes the functions in English.

What is the fastest way to sort n strings of length n each?

I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of is n^2 log n, which is quick sort. (Comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted, as you do not know the number of letters in the alphabet beforehand.
Assume any letter is a to z.
Since in-place sorting is not required, create an array of 26 linked lists:
List[] sorted = new List[26]; // here each element is a list that you can append to
For each letter x in a string, its sorted position is its ASCII offset from 'a': x-'a'.
For example, the position for 'c' is 2, so it is stored with
sorted[2].add('c')
That way, sorting one string takes only O(n) time.
So doing this for all n strings takes O(n^2).
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sort, one possible result looks like
0  1  2  3  ...  25
a  b  c  d  ...  z
a     c  d
      c
Note: the assumed alphabet determines the length of the sorted array.
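In Python, the bucket idea above might look like the sketch below. It assumes lowercase a to z only, and the function name is hypothetical. Note that it orders the letters inside one string; it does not by itself order the n strings relative to each other.

```python
def bucket_sort_letters(s):
    """Sort the letters of s (lowercase a-z assumed) using 26 buckets."""
    buckets = [[] for _ in range(26)]
    for ch in s:
        buckets[ord(ch) - ord("a")].append(ch)   # position is x - 'a'
    # Reading the buckets left to right gives the letters in sorted order.
    return "".join("".join(b) for b in buckets)

print(bucket_sort_letters("zdcbacdca"))   # aabcccddz
```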
For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.
You can build a trie, which will cost O(s*n).
Details:
https://stackoverflow.com/a/13109908
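A rough sketch of the trie approach, not taken from the linked answer. It uses nested dicts as trie nodes and visits children in sorted key order, so the alphabet does not need to be known in advance. The name trie_sort and the END marker are choices made for this sketch.

```python
def trie_sort(strings):
    """Return the strings in ascending order by inserting them into a trie."""
    END = ""   # key holding how many strings end at a node; never a 1-char key
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1

    out = []

    def walk(node, prefix):
        # A string ending here sorts before every longer string sharing its prefix.
        out.extend(["".join(prefix)] * node.get(END, 0))
        for ch in sorted(k for k in node if k != END):
            prefix.append(ch)
            walk(node[ch], prefix)
            prefix.pop()

    walk(root, [])
    return out

print(trie_sort(["banana", "app", "apple", "b", "banana"]))
# ['app', 'apple', 'b', 'banana', 'banana']
```

Insertion touches each character once. The traversal is recursive, so very long strings would need an iterative walk or a higher recursion limit.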
Solving it for all cases should not be possible in better than O(N^2 log N).
However, if there are constraints that relax the string comparison, it can be optimised.
- If the strings have a high repetition rate and come from a finite ordered set, you can use ideas from counting sort and use a map to store their counts; afterwards, sorting just the map keys should suffice. This is O(NM log M), where M is the number of unique strings. You can even use a TreeMap directly for this purpose.
- If the strings are not random but are the suffixes of some super string, this can be done in
O(N log^2 N). http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
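The first bullet (count the duplicates, then sort only the distinct strings) could be sketched as follows. collections.Counter stands in for the map here, and the function name is hypothetical.

```python
from collections import Counter

def sort_with_counts(strings):
    """Sort strings by counting duplicates and sorting only the unique keys."""
    counts = Counter(strings)        # string -> number of occurrences
    out = []
    for s in sorted(counts):         # only the M unique strings are compared
        out.extend([s] * counts[s])
    return out

print(sort_with_counts(["b", "a", "b", "a", "a"]))   # ['a', 'a', 'a', 'b', 'b']
```

This only pays off when the number of distinct strings is much smaller than the number of strings.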

Generating all n-bit strings whose hamming distance is n/2

I'm playing with some variant of Hadamard matrices. I want to generate all n-bit binary strings which satisfy these requirements:
You can assume that n is a multiple of 4.
The first string is 0^n → a string of all 0s.
The remaining strings are sorted in alphabetic order → 0 comes before 1.
Every two distinct n-bit strings have Hamming distance n/2 → two distinct n-bit strings agree in exactly n/2 positions and disagree in exactly n/2 positions.
Due to the above condition, every string except for the first must have the same number of 0s and 1s → every string other than the first must have n/2 ones and n/2 zeros.
(Updated) All the n-bit strings begin with 0.
For example, this is the list that I want for when n=4.
0000
0011
0101
0110
You can easily see that every two distinct rows have hamming distance n/2 = 4/2 = 2 and the list satisfies all the other requirements as well.
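The pairwise condition is easy to check mechanically. Here is a small sketch (the function names are made up) that verifies the n=4 list above.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def is_valid_family(strings):
    """True if every pair of distinct strings has Hamming distance n/2."""
    n = len(strings[0])
    return all(hamming(a, b) == n // 2 for a, b in combinations(strings, 2))

rows = ["0000", "0011", "0101", "0110"]
print(is_valid_family(rows))                # True
print(is_valid_family(rows + ["0111"]))     # False
```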
Note that I want to generate all such strings. My algorithm may just output three strings 0000, 0011, and 0101 before terminating. This list satisfies all the requirements above but it misses 0110.
What would be a good way to generate such sets? A python pseudo-code is preferred but any high-level description will do.
What is the maximum number of such strings for a given n? For example, when n=4, the maximum number of such strings happens to be 4. I'm wondering whether there is a closed-form solution for this upper bound.
Thanks.
To answer question 1,
Starting with a string of n zeros (let's call it s0) and a string of n/2 zeros followed by n/2 1's (call it s1), generate the next permutation (call it p):
scan string from right to left
replace first occurrence of "01" with "10"
(unless the first occurrence is at the string start)
move all "1"'s that are on the right of the "01" to the string end
return replaced string
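The permutation step above could look like this in Python. It is only a sketch: the function name is hypothetical, and returning None is this sketch's way of signalling that no further permutation starting with 0 exists.

```python
def next_binary_permutation(s):
    """Next string in alphabetic order with the same number of 1s, or None."""
    i = s.rfind("01")          # rightmost "01", i.e. the first one seen from the right
    if i <= 0:                 # no "01" at all, or it sits at the string start
        return None
    tail = s[i + 2:]
    ones = tail.count("1")
    # Swap "01" -> "10", then push every 1 to the right of it to the string end.
    return s[:i] + "10" + "0" * (len(tail) - ones) + "1" * ones

p = "0011"   # s1 for n = 4
while p is not None:
    print(p)
    p = next_binary_permutation(p)
# 0011
# 0101
# 0110
```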
Use the permutation generation order to keep a record of the permutations already added to sets. If xoring p with each string currently in the set gives exactly n/2 set bits, add p to the set. Otherwise, if xoring p with s1 gives n/2 set bits and p has not been recorded, start a new set search seeded with s0, s1 and p, using p only as an additional condition for the xor test (since the primary search will review all permutations, this set need not generate additional sets). Then use p to generate the next permutation.
