Related
Faced this question in an interview, which basically stated
Find if the given two strings are anagrams of each other in O(N) time without any extra space
I tried the usual solutions:
Using a character frequency count (O(N) time, O(26) space) (as a variation, iterating 26 times to calculate frequency of each character as well)
Sorting both strings and comparing (O(NlogN) time, constant space)
But the interviewer wanted a "better" approach. At the extreme end of the interview, he hinted at "XOR" for the question. Not sure how that works, since "aa" XOR "bb" should also be zero without being anagrams.
Long story short, are the given constraints possible? If so, what would be the algorithm?
Given word_a and word_b in the same length, I would try the following:
Define a variable counter and initialise the value to 0.
For each letter ii in the alphabet do the following:
2.1. for jj in length(word_a):
2.1.1. if word_a[jj] == ii increase counter by 1: counter += 1
2.1.2. if word_b[jj] == ii decrease the counter by 1: counter -= 1
2.2. if after passing all the characters in the words, counter is different than 0, you have a different number of ii characters in each word and in particular they are not anagrams, break out of the loop and return False
Return True
Explanation
In case the words are anagrams, you have the same number of each of the characters, therefore the use of the histogram makes sense, but histograms require space. Using this method, you run over the n characters of the words exactly 26 times in the case of the English alphabet or any other constant c representing the number of letters in the alphabet. Therefor, the runtime of the process is O(c*n) = O(n) since c is constant and you do not use any other space besides the one variable
I haven't proven to myself that this is infallible yet, but it's a possible solution.
Go through both strings and calculate 3 values: the sum, the accumulated xor, and the count. If all 3 are equal then the strings should be anagrams.
Starting with a list of integers the task is to convert each integer into a string such that the resulting list of strings will be in numeric order when sorted lexicographically.
This is needed so that a particular system that is only capable of sorting strings will produce an output that is in numeric order.
Example:
Given the integers
1, 23, 3
we could convert the to strings like this:
"01", "23", "03"
so that when sorted they become:
"01", "03", "23"
which is correct. A wrong result would be:
"1", "23", "3"
because that list is sorted in "string order", not in numeric order.
I'm looking for something more efficient than the simple zero-padding scheme. In order to cover all possible 32 bit integers we'd need to pad to 10 digits which is inefficient.
For integers, prefix each number with the length. To make it more readable, use 'a' for length 1, and 'b' for length 2. Example:
non-encoded encoded
1 "a1"
3 "a3"
23 "b23"
This scheme is a bit simpler than prefixing each digit, but only works with numbers, not numbers mixed with text. It can be made to work for negative numbers as well, and even BigDecimal numbers, using some tricks. I wrote an implementation in Apache Jackrabbit 2.x, to make BigDecimal indexable (sortable) as text. For that, I used a format that only uses the characters '0' to '9' and consists of:
one character for: signum(value) + 2
one character for: signum(exponent) + 2
one character for: length(exponent) - 1
multiple characters for: exponent
multiple characters for: value (-1 if inverted)
Only the signum is encoded if the value is zero. The exponent is not encoded if zero. Negative values are "inverted" character by character (0 => 9, 1 => 8, and so on). The same applies to the exponent.
Examples:
non-encoded encoded
0 "2"
2 "322" (signum 1; exponent 0; value 2)
120 "330212" (signum 1; exponent signum 1, length 1, value 2; value 12)
-1 "179" (signum -1, rest inverted; exponent 0; value 1 (-1, inverted))
Values between BigDecimal(BigInteger.ONE, Integer.MIN_VALUE) and BigDecimal(BigInteger.ONE, Integer.MAX_VALUE) are supported.
TL;DR
Encode digits according to their order of magnitude (OM) and other characters so they sort as desired, relative to numbers: jj-a123 would be encoded zjzjz-zaC1B2A3
Longer explanation
This would depend somewhat upon the sorting algorithm that will finally be used to sort and how one would want any given punctuation characters to be sorted in relation to letters and numbers, but if it's "ascii-betical" or similar, you could encode each digit of a number to represent its order of magnitude (OM) in the number, while encoding other characters such that they would sort according to your desired sort order.
For simplicity, I would suggest beginning with encoding every non-numeric character with a "high" value (e.g. lower case z or even ~ if final value is ASCII), so that it sorts after encoded digits. Then cache each digit encountered until another non-numeric is encountered, then encode each cached digit with a value representing its OM. If the number 12945 was encountered in between non-numerics, you would output an E to encode an OM of 5, then the digit that is that order of magnitude, 1, followed by the next OM of 4 (D) and its associated digit, 2. Continue until all numeric digits have been flushed, then continue with non-numerics.
Non-numerics would be treated individually and ranked relative to the OM of digits. If it is desired for them to sort "above" numbers (perhaps the space character or certain others deemed special) they would be encoded by prepending a low-value character (like the space character, if final value will be treated and sorted as ASCII). When/if another numeric is encountered, begin caching and encode according to OM once all consecutive numerics are cached.
Alternately, processing the string in reverse order would preclude the need to cache numbers except for a single "is it a digit?" test and "is the last character a digit?" test. If the first is not true, then use (one of?) the "non-digit" OM character(s). If the first test is true then use the lowest-OM "digit" character (A in my examples). If both tests are true, then increment your OM character (A -> B or E -> F) before use.
Certain levels of additional filtering - or even translation - could be applied. If one wanted to allow accurate sorting based upon Roman numerals, one could encode them as decimal (or even hexadecimal) numbers with an appropriate OM.
Treating decimal points (either periods or commas, depending) as actual decimal separators, and distinct from other punctuation would probably be beyond the true utility of this encoding scheme, as alphanumeric fields seldom use a period or comma as a decimal separator. If it is desired to use them that way, the algorithm would simply detect a decimal separator (either period or comma as appropriate, in between digits) and not encode the numeric portion after that separator as anything but normal text. Fractional portions are actually sorted correctly during a normal ASCII based sort, because more digits represents greater precision - not greater magnitude.
Examples
non-encoded encoded
----------- -------
12345 E1D2C3B4A5
a100 zaC1B0A0
a20 zaB2A0
a2000 zaD2C0B0A0
x100.5 zxC1B0A0z.A5
x100.23 zxC1B0A0z.B2A3
1, 23, 3 A1z,z B2A1z,z A3
1, 2, 3 A1z,z A2z,z A3
1,2,3 A1z,A2z,A3
Potential advantages
Going somewhat beyond simple numeric sorting, some advantages to this encoding method would be several aspects of flexibility with final effective sort order - you are essentially encoding a category for each character - digits get a category based upon their position within the greater string of digits known as a number, while other characters are simply told to sort in their normal way (e.g. ASCII), but after numbers. Any exceptions that should sort before numbers or in other orders would be in one or more additional categories. ASCII can effectively be re-encoded to sort in a non-ASCII way:
You could encode lower case letters to sort before or along with upper case letters. To switch the lower and upper cases, you encode lower case letters with a y and upper case letters with a z. For a pseudo-case-insensitive sort, categorizing both A and a with the same encoding character would sort both of them before B and b, though A would nonetheless always sort before a
If you want Extended ASCII characters (e.g. with diacritics) to sort along with their ASCII cousins, you encode À, Á, Â, Ã, Ä, Å, and Æ along with A by using an a as the OM character, encode B, C, and Ç with a b, and E, È, É, Ê, and Ë with a c, etc. The same intra-category sort order caveat still applies, and some decisions need to be made on characters like capital Eth, and to a certain extent others like Thorn, and Sharp S (Ð, Þ, and ß respectively) as to whether they will sort based on similarities in appearance or pronunciation, or instead more properly perhaps, alphabetical order.
Small advantage of being basically human-readable, with effort
Caveats
Though this allows many 'categories' of characters to be defined, be sure to remember that each order of magnitude for digits is its own category - you need to know that the data will not contain numbers that are greater in OM than approximately 250, depending upon how many other categories you wish to define (ASCII 0 is reserved for storing strings, and there needs to be at least one other character to indicate "not a digit" - at least for alphanumeric data - making the maximum perhaps 254 orders of magnitude), but that should be plenty for any situation I can imagine. I'm not sure what other issues quantum computing will bring about, but there's probably a quantum solution to it, whatever it is.
Finally, if the hyphen is encoded as a non-numeric character, and all non-numerics are encoded with a higher OM than digits, negative numbers would be encoded as greater than any positive number. The hyphen should be encoded as a lower-than-digit-OM (perhaps only when preceding a digit) if negative numbers need to be sorted correctly according to magnitude.
Since the ASCII code of A is greater than 9, you could encode them as hexadecimal strings.
The integers
1, 23, 3
can be encoded as
00000001, 00000017, 00000003
and 32-bit integers can always be encoded as 8-character strings. (assume unsigned)
I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of is n^2 log n, which is quick sort. (Comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted as you do not know the number of letters in the alphabet before hand.
Assume any letter is a to z.
Since no requirement for in-place sorting, create an array of linked list with length 26:
List[] sorted= new List[26]; // here each element is a list, where you can append
For a letter in that string, its sorted position is the difference of ascii: x-'a'.
For example, position for 'c' is 2, which will be put to position as
sorted[2].add('c')
That way, sort one string only take n.
So sort all strings takes n^2.
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sort, one possible result looks like
0 1 2 3 ... 25 <br/>
a b c d ... z <br/>
a b c <br/>
c
Note: the assumption of letter collection decides the length of sorted array.
For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.
You can build a Trie, which will cost O(s*n),
Details:
https://stackoverflow.com/a/13109908
Solving it for all cases should not be possible in better that O(N^2 Log N).
However if there are constraints that can relax the string comparison, it can be optimised.
-If the strings have high repetition rate and are from a finite ordered set. You can use ideas from count sort and use a map to store their count. later, sorting just the map keys should suffice. O(NMLogM) where M is the number of unique strings. You can even directly use TreeMap for this purpose.
-If the strings are not random but the suffixes of some super string this can well be done
O(N Log^2N). http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
I'm playing with some variant of Hadamard matrices. I want to generate all n-bit binary strings which satisfy these requirements:
You can assume that n is a multiple of 4.
The first string is 0n.→ a string of all 0s.
The remaining strings are sorted in alphabetic order.→ 0 comes before 1.
Every two distinct n-bit strings have Hamming distance n/2.→ Two distinct n-bit strings agree in exactly n/2 positions and disagree in exactly n/2 positions.
Due to the above condition, every string except for the first string must have the same number of 0s and 1s. → Every string other than the first string must have n/2 ones and n/2 zeros.
(Updated) All the n-bit strings begin with 0.
For example, this is the list that I want for when n=4.
0000
0011
0101
0110
You can easily see that every two distinct rows have hamming distance n/2 = 4/2 = 2 and the list satisfies all the other requirements as well.
Note that I want to generate all such strings. My algorithm may just output three strings 0000, 0011, and 0101 before terminating. This list satisfies all the requirements above but it misses 0110.
What would be a good way to generate such sets? A python pseudo-code is preferred but any high-level description will do.
What is the maximum number of such strings for a given n?For example, when n=4, the max number of such strings happen to be 4. I'm wondering whether there can be any closed form solution for this upper bound.
Thanks.
To answer question 1,
Starting with a string of n zeros (let's call it s0) and a string of n/2 zeros followed by n/2 1's (call it s1), generate the next permutation (call it p):
scan string from right to left
replace first occurrence of "01" with "10"
(unless the first occurrence is at the string start)
move all "1"'s that are on the right of the "01" to the string end
return replaced string
Use the permutation generation order to keep a record of permutations added to sets. If the number of bits set in xoring p with each number currently in the set is n/2, add p to the list; otherwise, if the number of bits set in xoring p with s1 is n/2 and p has not been recorded, start a new set search with s0, s1; and p only as an additional condition for the xor test (since the primary search will review all permutations, this set need not generate additional sets). Use p to generate the next permutation.
I know that most programming languages have functions built in for doing that for you, but how do those functions work?
The javadoc about the Double toString() method is quite comprehensive:
Creates a string representation of the double argument. All characters mentioned below are ASCII characters.
If the argument is NaN, the result is the string "NaN".
Otherwise, the result is a string that represents the sign and magnitude (absolute value) of the argument. If the sign is negative, the first character of the result is '-' ('-'); if the sign is positive, no sign character appears in the result. As for the magnitude m:
If m is infinity, it is represented by the characters "Infinity"; thus, positive infinity produces the result "Infinity" and negative infinity produces the result "-Infinity".
If m is zero, it is represented by the characters "0.0"; thus, negative zero produces the result "-0.0" and positive zero produces the result "0.0".
If m is greater than or equal to 10^-3 but less than 10^7, then it is represented as the integer part of m, in decimal form with no leading zeroes, followed by '.' (.), followed by one or more decimal digits representing the fractional part of m.
If m is less than 10^-3 or not less than 10^7, then it is represented in so-called "computerized scientific notation." Let n be the unique integer such that 10^n<=m<10^(n+1); then let a be the mathematically exact quotient of m and 10^n so that 1<=a<10. The magnitude is then represented as the integer part of a, as a single decimal digit, followed by '.' (.), followed by decimal digits representing the fractional part of a, followed by the letter 'E' (E), followed by a representation of n as a decimal integer, as produced by the method Integer.toString(int).
How many digits must be printed for the fractional part of m or a? There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double. That is, suppose that x is the exact mathematical value represented by the decimal representation produced by this method for a finite nonzero argument d. Then d must be the double value nearest to x; or if two double values are equally close to x, then d must be one of them and the least significant bit of the significand of d must be 0.
Is that enough? Otherwise you might like to look up the implementation too...
A simple (but non-generic, naïve and slow way):
convert the number to an integer, then divide this value by 10 stepwise to find out its digits in reverse order. Concatenate them together and you have the integer representation.
substract the integer from the original number, now multiply by 10 stepwise and find the digits after the decimal point. Concatenate the first string with a point and this second string.
This has a few problems, of course:
slow as hell;
doesn't work for negative numbers;
won't give you exponential notation for very small or large numbers.
All in all, it's an idea, but not a very good one; I suspect there are no programming languages that do this.
This paper by Guy Steele provides details on how to do this correctly. It's much more subtle than you might think.
http://portal.acm.org/citation.cfm?id=93559
"Printing Floating-Point Numbers Quickly and Accurately" - Robert G. Burger
Scheme and C code for above.
As Oded mentioned in a comment, different languages will do this in different ways. As an example, here's how Ruby 1.9 does it (in C). Your best bet, just as a research exercise, will be to look into open-source languages and see how they do it.