Find if two strings are anagrams - string

Faced this question in an interview, which basically stated
Find if the given two strings are anagrams of each other in O(N) time without any extra space
I tried the usual solutions:
Using a character frequency count (O(N) time, O(26) space) (as a variation, iterating 26 times to calculate frequency of each character as well)
Sorting both strings and comparing (O(NlogN) time, constant space)
But the interviewer wanted a "better" approach. At the very end of the interview, he hinted at "XOR" for the question. I'm not sure how that works, since "aa" XOR "bb" is also zero even though they are not anagrams.
Long story short, are the given constraints possible? If so, what would be the algorithm?

Given word_a and word_b of the same length, I would try the following:
Define a variable counter and initialise the value to 0.
For each letter i in the alphabet, do the following:
2.1. for j in range(length(word_a)):
2.1.1. if word_a[j] == i, increase the counter by 1: counter += 1
2.1.2. if word_b[j] == i, decrease the counter by 1: counter -= 1
2.2. If, after passing over all the characters of the words, counter is different from 0, you have a different number of i characters in each word; in particular the words are not anagrams, so break out of the loop and return False
Return True
Explanation
If the words are anagrams, they contain the same number of each character, so a histogram is the natural tool, but a histogram requires space. With this method you instead run over the n characters of the words exactly 26 times for the English alphabet, or c times for any other constant c representing the number of letters in the alphabet. Therefore the runtime of the process is O(c*n) = O(n), since c is constant, and you do not use any space besides the one counter variable.
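A minimal Python sketch of this approach, assuming lowercase English letters and equal-length inputs:

def are_anagrams(word_a, word_b):
    # The problem guarantees equal lengths; check anyway.
    if len(word_a) != len(word_b):
        return False
    for letter in "abcdefghijklmnopqrstuvwxyz":
        counter = 0
        for j in range(len(word_a)):
            if word_a[j] == letter:
                counter += 1
            if word_b[j] == letter:
                counter -= 1
        if counter != 0:
            # The two words contain a different number of this letter.
            return False
    return True

print(are_anagrams("listen", "silent"))  # True
print(are_anagrams("aa", "bb"))          # False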

I haven't proven to myself that this is infallible yet, but it's a possible solution.
Go through both strings and calculate 3 values for each: the sum of the character codes, the accumulated XOR of the character codes, and the character count. If all 3 values match between the two strings, then the strings should be anagrams.
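A short Python sketch of that idea (as noted above, this is an unproven heuristic; the function names are just illustrative):

from functools import reduce

def fingerprint(s):
    codes = [ord(c) for c in s]
    # (sum, accumulated XOR, count) of the character codes
    return (sum(codes), reduce(lambda a, b: a ^ b, codes, 0), len(codes))

def maybe_anagrams(a, b):
    # Equal fingerprints are necessary for anagrams, but have not been
    # shown to be sufficient.
    return fingerprint(a) == fingerprint(b)

print(maybe_anagrams("listen", "silent"))  # True
print(maybe_anagrams("aa", "bb"))          # False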

Related

Lookup table for counting number of set bits in an Integer

Was trying to solve this popular interview question - http://www.careercup.com/question?id=3406682
There are 2 approaches to this that I was able to grasp:
Brian Kernighan's algo -
Bits counting algorithm (Brian Kernighan) in an integer time complexity
Lookup table.
I assume when people say use a lookup table, they mean a Hashmap with the Integer as key, and the count of number of set bits as value.
How does one construct this lookup table? Do we use Brian's algo to count the number of bits the first time we encounter an integer, put it in the hashtable, and the next time we encounter that integer, retrieve the value from the hashtable?
PS: I am aware of the hardware and software api's available to perform popcount (Integer.bitCount()), but in context of this interview question, we are not allowed to use those methods.
I was looking for an answer everywhere but could not find a satisfactory explanation.
Let's start by understanding the concept of left shifting. When we shift a number left by one we multiply it by 2, and shifting right by one divides it by 2.
For example, if we want to generate the number 20 (binary 10100) from the number 10 (binary 01010), we have to shift 10 to the left by one. The number of set bits in 10 and 20 is the same, except that the bits of 20 are shifted one position to the left compared to 10. From this we can conclude that the number of set bits in a number n is the same as the number of set bits in n/2 (if n is even).
In the case of odd numbers, like 21 (10101), all bits are the same as in 20 except for the last bit, which is set to 1 in 21, resulting in one extra set bit for the odd number.
Let's generalize this formula:
the number of set bits in n is the number of set bits in n/2 if n is even
the number of set bits in n is the number of set bits in n/2, plus 1, if n is odd (since for an odd number the last bit is set)
A more generic formula would be:
BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
where BitsSetTable256 is the table we are building for the bit counts. For the base case we can set BitsSetTable256[0] = 0; the rest of the table can be computed using the above formula in a bottom-up approach.
Integers can directly be used to index arrays;
e.g. you have just a simple array of unsigned 8-bit integers containing the set-bit count for 0x0001, 0x0002, 0x0003, ... and do a lookup with array[number_to_test].
You don't need to implement a hash function to map a 16-bit integer to something you can order just to have a lookup function!
To answer your question about how to compute this table:
int table[256];   /* For 8-bit lookup */
table[0] = 0;     /* Base case: 0 has no set bits */
for (int i = 1; i < 256; i++) {
    table[i] = table[i/2] + (i & 1);   /* bits in i = bits in i/2 plus the low bit */
}
Look up this table on every byte of the given integer and sum the values obtained.
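A small Python sketch of that per-byte lookup, building the same 256-entry table and applying it to a 32-bit integer (the function name is just illustrative):

# Build the 8-bit lookup table bottom-up: bits(i) = bits(i // 2) + (i & 1)
table = [0] * 256
for i in range(1, 256):
    table[i] = table[i // 2] + (i & 1)

def popcount32(x):
    # Look up each of the 4 bytes of a 32-bit integer and sum the counts
    return (table[x & 0xFF]
            + table[(x >> 8) & 0xFF]
            + table[(x >> 16) & 0xFF]
            + table[(x >> 24) & 0xFF])

print(popcount32(0b10110100))  # 4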

Binary search - worst/avg case

I'm finding it difficult to understand why/how the worst and average case for searching for a key in an array/list using binary search is O(log(n)).
log(1,000,000) is only 6. log(1,000,000,000) is only 9 - I get that, but I don't understand the explanation. If one did not test it, how do we know that the avg/worst case is actually log(n)?
I hope you guys understand what I'm trying to say. If not, please let me know and I'll try to explain it differently.
Worst case
Every time the binary search code makes a decision, it eliminates half of the remaining elements from consideration. So you're dividing the number of elements by 2 with each decision.
How many times can you divide by 2 before you are down to only a single element? If n is the starting number of elements and x is the number of times you divide by 2, we can write this as:
n / (2 * 2 * 2 * ... * 2) = 1 [the '2' is repeated x times]
or, equivalently,
n / 2^x = 1
or, equivalently,
n = 2^x
So log base 2 of n gives you x, which is the number of decisions being made.
Finally, you might ask, if I used log base 2, why is it also OK to write it as log base 10, as you have done? The base does not matter because the difference is only a constant factor which is "ignored" by Big O notation.
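A tiny Python check of that counting argument, halving 1,000,000 elements until only one remains:

import math

n = 1_000_000
steps = 0
while n > 1:
    n //= 2        # each binary-search decision halves the remaining elements
    steps += 1

print(steps)                  # 19
print(math.log2(1_000_000))   # ~19.93, so about 20 decisions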
Average case
I see that you also asked about the average case. Consider:
There is only one element in the array that can be found on the first try.
There are only two elements that can be found on the second try. (Because after the first try, we chose either the right half or the left half.)
There are only four elements that can be found on the third try.
You can see the pattern: 1, 2, 4, 8, ... , n/2. To express the same pattern going in the other direction:
Half the elements take the maximum number of decisions to find.
A quarter of the elements take one fewer decision to find.
etc.
Since half of the elements take the maximum amount of time, it doesn't matter how much less time the other elements take. We could assume that all elements take the maximum amount of time, and even if half of them actually take 0 time, our assumption would not be more than double whatever the true average is. We can ignore "double" since it is a constant factor. So the average case is the same as the worst case, as far as Big O notation is concerned.
For binary search, the array should be arranged in ascending or descending order.
In each step, the algorithm compares the search key value with the key value of the middle element of the array.
If the keys match, then a matching element has been found and its index, or position, is returned.
Otherwise, if the search key is less than the middle element's key, then the algorithm repeats its action on the sub-array to the left of the middle element.
Or, if the search key is greater, then the algorithm repeats its action on the sub-array to the right.
If the remaining array to be searched is empty, then the key cannot be found in the array and a special "not found" indication is returned.
So, a binary search is a dichotomic divide-and-conquer search algorithm. It therefore takes logarithmic time to perform the search, since the number of candidate elements is halved in each iteration.
For sorted lists on which we can do a binary search, each "decision" compares your key to the middle element: if the key is greater, the search takes the right half of the list; if it is less, it takes the left half; and if it is a match, it returns the element at that position. You effectively reduce your list by half with every decision, yielding O(log n).
Binary search, however, only works for sorted lists. For unsorted lists you can do a straight search starting with the first element, yielding a complexity of O(n).
O(log n) < O(n)
Which approach is best, though, depends entirely on how many searches you'll be doing, your inputs, and so on.
For Binary search the prerequisite is a sorted array as input.
• As the list is sorted:
• Certainly we don't have to check every word in the dictionary to look up a word.
• A basic strategy is to repeatedly halve our search range until we find the value.
• For example, look for 5 in the list of 9 numbers below: v = 1 1 3 5 8 10 18 33 42
• We would first start in the middle: 8
• Since 5<8, we know we can look at just the first half: 1 1 3 5
• Looking at the middle # again, narrow down to 3 5
• Then we stop when we're down to one #: 5
How many comparisons are needed? 4 = floor(log2(9)) + 1 = O(log2 n)
#include <vector>
using std::vector;

// Returns the index of val in the sorted vector v, or -1 if not found.
int binary_search (const vector<int>& v, int val) {
    int from = 0;
    int to = (int)v.size() - 1;
    while (from <= to) {
        int mid = from + (to - from) / 2;   // middle of the remaining range
        if (val == v[mid])
            return mid;                     // found: return its position
        else if (val > v[mid])
            from = mid + 1;                 // search the right half
        else
            to = mid - 1;                   // search the left half
    }
    return -1;                              // "not found" indication
}

String pre-processing step, to answer further queries in O(1) time

A string is given to you, and it consists of only 3 distinct characters, say x, y, z.
There will be million queries given to you.
Query format: x z i j
Now we need to find all possible different substrings which begin with x and end in z. i and j denote the lower and upper bounds of the range within which the substring must lie; it should not cross this range.
My Logic:-
Read the string. Have 3 arrays which store the counts of x, y, and z respectively, for i = 0 up to strlen.
Store the indexes of each characters separately in 3 more arrays. xlocation[], ylocation[], zlocation[]
Now, according to the query (a b i j), find all the indices of b within the range i to j.
Calculate the answer for each index of b and sum these to get the result.
Is it possible to pre-process this string before the queries, so that each query takes O(1) time to answer?
As the others suggested, you can do this with a divide and conquer algorithm.
Optimal substructure:
If we are given a left half of the string and a right half and we know how many substrings there are in the left half and how many there are in the right half then we can add the two numbers together. We will be undercounting by all the strings that begin in the left and end in the right. This is simply the number of x's in the left substring multiplied by the number of z's in the right substring.
Therefore we can use a recursive algorithm.
This would be a problem, however, if we tried to solve for every single i and j combination naively, as the bottom-level subproblems would be solved many, many times.
You should look into implementing this with a dynamic programming algorithm keeping track of substrings in range i,j, x's in range i,j, and z's in range i,j.
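As a concrete illustration, here is a hypothetical Python sketch using prefix sums rather than the full range DP described above; it counts the (x, z) pairs, i.e. the substrings that start at an x and end at a z inside [i, j], in O(1) per query after O(n) preprocessing (all names are illustrative):

def preprocess(s):
    n = len(s)
    # X[p] / Z[p]: number of 'x' / 'z' in s[0:p].
    # S[p]: for each 'z' before position p, accumulate the number of 'x'
    #       strictly before that 'z'.
    X = [0] * (n + 1)
    Z = [0] * (n + 1)
    S = [0] * (n + 1)
    for p, ch in enumerate(s):
        X[p + 1] = X[p] + (ch == 'x')
        Z[p + 1] = Z[p] + (ch == 'z')
        S[p + 1] = S[p] + (X[p] if ch == 'z' else 0)
    return X, Z, S

def query(X, Z, S, i, j):
    # Pairs (a, b) with s[a] == 'x', s[b] == 'z' and i <= a < b <= j.
    return (S[j + 1] - S[i]) - (Z[j + 1] - Z[i]) * X[i]

X, Z, S = preprocess("xyzxzz")
print(query(X, Z, S, 0, 5))  # 5
print(query(X, Z, S, 2, 5))  # 2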

What is the fastest way to sort n strings of length n each?

I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of is O(n^2 log n), using quicksort (comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted, as you do not know the number of letters in the alphabet beforehand.
Assume any letter is a to z.
Since there is no requirement for in-place sorting, create an array of 26 linked lists:
List[] sorted = new List[26]; // each element is a list that you can append to
For a letter x in the string, its bucket position is the ASCII difference: x - 'a'.
For example, the position for 'c' is 2, so it is appended as
sorted[2].add('c')
That way, sorting one string takes only O(n).
So sorting all the strings takes O(n^2).
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sorting, the buckets for "zdcbacdca" look like:
sorted[0] = [a, a], sorted[1] = [b], sorted[2] = [c, c, c], sorted[3] = [d, d], ..., sorted[25] = [z]
Note: the assumed letter collection determines the length of the sorted array.
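A minimal Python sketch of this per-string bucketing, assuming lowercase a-z (note that, as described above, it sorts the characters within each string):

def sort_chars(word):
    # 26 buckets, one per letter; appending is O(1) per character
    buckets = [[] for _ in range(26)]
    for ch in word:
        buckets[ord(ch) - ord('a')].append(ch)
    # Concatenate the buckets in order: O(n) per string
    return ''.join(ch for bucket in buckets for ch in bucket)

print(sort_chars("zdcbacdca"))  # aabcccddz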
For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.
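As a small illustration of the core trick, here is a hypothetical Python helper that compares two strings starting from a known common-prefix length and also reports the new LCP length; an LCP-augmented merge would build on something like this:

def compare_from(a, b, skip):
    # Both strings are known to agree on their first `skip` characters,
    # so start comparing at position `skip` instead of at 0.
    i = skip
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    if i == len(a) or i == len(b):
        order = (len(a) > len(b)) - (len(a) < len(b))
    else:
        order = (a[i] > b[i]) - (a[i] < b[i])
    return order, i   # (-1/0/+1 comparison result, common-prefix length)

print(compare_from("prefix_apple", "prefix_apply", 7))  # (-1, 11)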
You can build a Trie, which will cost O(s*n),
Details:
https://stackoverflow.com/a/13109908
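A rough Python sketch of the trie idea (dict-based nodes; traversing each node's children in sorted key order yields the strings in ascending order, and sorting the handful of children per node is cheap for a small alphabet):

def trie_sort(strings):
    END = object()                 # marker: "a string ends at this node"
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1   # count duplicates

    out = []
    def walk(node, prefix):
        for _ in range(node.get(END, 0)):
            out.append(prefix)
        for ch in sorted(k for k in node if k is not END):
            walk(node[ch], prefix + ch)
    walk(root, "")
    return out

print(trie_sort(["cab", "abc", "abc", "ba"]))  # ['abc', 'abc', 'ba', 'cab']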
Solving it for all cases should not be possible in better than O(N^2 log N).
However if there are constraints that can relax the string comparison, it can be optimised.
- If the strings have a high repetition rate and are from a finite ordered set, you can use ideas from counting sort and use a map to store their counts. Later, sorting just the map keys should suffice: O(N*M*log M), where M is the number of unique strings. You can even directly use a TreeMap for this purpose (see the sketch after this list).
- If the strings are not random but are the suffixes of some super-string, this can be done in O(N log^2 N). http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
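A quick Python sketch of the counting-map idea from the first bullet, with sorted() standing in for the TreeMap's ordered keys:

from collections import Counter

def sort_with_duplicates(strings):
    counts = Counter(strings)        # map each unique string to its count
    result = []
    for s in sorted(counts):         # sort only the M unique strings
        result.extend([s] * counts[s])
    return result

print(sort_with_duplicates(["bb", "aa", "bb", "aa", "cc", "aa"]))
# ['aa', 'aa', 'aa', 'bb', 'bb', 'cc']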

Generating all n-bit strings whose hamming distance is n/2

I'm playing with some variant of Hadamard matrices. I want to generate all n-bit binary strings which satisfy these requirements:
You can assume that n is a multiple of 4.
The first string is 0^n.→ a string of all 0s.
The remaining strings are sorted in alphabetic order.→ 0 comes before 1.
Every two distinct n-bit strings have Hamming distance n/2.→ Two distinct n-bit strings agree in exactly n/2 positions and disagree in exactly n/2 positions.
Due to the above condition, every string except for the first string must have the same number of 0s and 1s. → Every string other than the first string must have n/2 ones and n/2 zeros.
(Updated) All the n-bit strings begin with 0.
For example, this is the list that I want for when n=4.
0000
0011
0101
0110
You can easily see that every two distinct rows have Hamming distance n/2 = 4/2 = 2, and the list satisfies all the other requirements as well.
Note that I want to generate all such strings. An algorithm might just output the three strings 0000, 0011, and 0101 before terminating; that list satisfies all the requirements above, but it misses 0110.
What would be a good way to generate such sets? A python pseudo-code is preferred but any high-level description will do.
What is the maximum number of such strings for a given n? For example, when n=4, the maximum number of such strings happens to be 4. I'm wondering whether there can be any closed-form solution for this upper bound.
Thanks.
To answer question 1,
Starting with a string of n zeros (let's call it s0) and a string of n/2 zeros followed by n/2 1's (call it s1), generate the next permutation (call it p):
scan string from right to left
replace first occurrence of "01" with "10"
(unless the first occurrence is at the string start)
move all "1"'s that are on the right of the "01" to the string end
return replaced string
Use the permutation generation order to keep a record of the permutations already added to sets. If the number of bits set in XOR-ing p with each number currently in the set is n/2, add p to the list; otherwise, if the number of bits set in XOR-ing p with s1 is n/2 and p has not been recorded, start a new set search with s0, s1, and p only as an additional condition for the XOR test (since the primary search will review all permutations, this set need not generate additional sets). Then use p to generate the next permutation.
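As a simplified, hypothetical Python sketch of the core idea (it enumerates the candidate strings in ascending order and greedily builds one such set via the XOR/Hamming-distance test, rather than performing the full branching search described above):

def greedy_set(n):
    # Scan, in lexicographic order, every n-bit string that begins with 0
    # and has exactly n/2 ones; keep those whose XOR with every string kept
    # so far has exactly n/2 bits set (i.e. Hamming distance n/2).
    chosen = [0]                                   # the all-zeros string 0^n
    for candidate in range(1, 1 << (n - 1)):       # leading bit is always 0
        if bin(candidate).count('1') != n // 2:
            continue
        if all(bin(candidate ^ kept).count('1') == n // 2 for kept in chosen):
            chosen.append(candidate)
    return [format(v, '0{}b'.format(n)) for v in chosen]

print(greedy_set(4))  # ['0000', '0011', '0101', '0110']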

Resources