Searching for a "needle" in a two-dimensional "haystack" [closed] - string

Closed 10 years ago.
I guess this is one of the most commonly asked interview questions, yet I am unable to solve it in an efficient way (efficient meaning lower time complexity and use of a suitable data structure).
The problem is as follows:
Given an m x n matrix of chars (the haystack) and a char string of length k (the needle), write a program to check whether the haystack contains the needle. Please note that we need to search the haystack only top to bottom or left to right.
For example
Haystack
ahydsfd
sdflddl
dfdfd
dfdl
uifddffdhc
Needle:
hdffi
Output:
Yes Found!!

The naive brute-force approach is O(m*n*k). Here are some ideas for optimization.
Single Search
Instead of doing a search for horizontals and then another for verticals, do both simultaneously. Every time you find an occurrence of the first letter of the needle look for a horizontal and a vertical match starting at that letter. This won't improve the time complexity, but could halve the time since you'll only look at bad starts once.
Rare Letters
Instead of looking for the first letter of the needle, look for the rarest letter which occurs in the needle. This could rule out a lot of the possible matches quickly (though it won't improve the worst-case time complexity). To determine which letters are rarest either scan through the entire board or use random sampling.
Efficient String Searching Algorithm
String searching is a well-studied problem and there are several linear-time algorithms such as Knuth–Morris–Pratt and Boyer–Moore. Using a linear-time string searching algorithm to search each row and each column reduces the time complexity to O(m*n). This is probably what the interviewers are after.
Exploit Short Rows
I notice that not all rows have the same length. When you look for vertical matches, you can stop searching on that row as soon as the needle 'pops out' of the sack, since all needles further along the row will also exit the sack and therefore cannot match.
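As a rough sketch of the linear-time idea above, the following Python runs Knuth–Morris–Pratt over every row and every column. The function names are made up for illustration, and the sketch assumes the character "\0" never appears in the needle: it is used to pad short rows so that a vertical match cannot span a gap.

```python
def prefix_function(p):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of p[:i+1] that is also a suffix of it."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(text, p, fail):
    """Return True if p occurs in text, scanning text once."""
    if not p:
        return True
    k = 0
    for ch in text:
        while k and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


def columns(rows):
    """Yield each column as a string; rows may have different lengths.
    Assumption: "\0" marks a missing cell and does not occur in the needle."""
    width = max((len(r) for r in rows), default=0)
    for j in range(width):
        yield "".join(r[j] if j < len(r) else "\0" for r in rows)


def haystack_contains(rows, needle):
    """Search left to right in every row and top to bottom in every column."""
    fail = prefix_function(needle)
    lines = list(rows) + list(columns(rows))
    return any(kmp_contains(line, needle, fail) for line in lines)


haystack = ["ahydsfd", "sdflddl", "dfdfd", "dfdl", "uifddffdhc"]
print(haystack_contains(haystack, "hdffi"))
```

On the haystack from the question, the needle hdffi is found reading down the second column.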

The brute-force method will have a worst-case time complexity of m*n. That is the case if the needle is a single character and we scan the matrix row-wise or column-wise.

You can restrict the search for the first char to the first n-k+1 columns (for horizontal matches) and the first m-k+1 rows (for vertical matches). Once it is found, at most 2(k-1) further comparisons are required for the answer.

Related

Generate a crossword from given words maximizing intersections

Is there some feasible (i.e. polynomial-time) algorithm which builds, starting from a small (~20) set of words, a crossword which maximizes the number of intersections (or at least makes it "big")? Or, if the intersection criterion is impractical, is it possible to maximize the density (in some sense) of the crossword?
I have already written an exhaustive search in Python, but it takes too long for more than six words...
See also:
Algorithm to generate a crossword (but the answers there, although good, do not really tackle my issue).
Is there some polynomial time algorithm ?
Answer: No.
For a simple version: if the last letter of one word is the same as the first letter of another word, we can concatenate them. For example:
cat+tree+element -> Valid
aaa+aaa -> Valid
cab+aboard -> Invalid ('a' != 'b')
The question is: concatenate as many words as possible.
But this is equivalent to the Hamiltonian path problem, which is NP-complete, so no polynomial-time algorithm is known for it.
See this for details: Hamiltonian path problem
PS:
For a small (~20) set, you can try heuristic search or a dynamic programming method to get a feasible solution.
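To make the dynamic programming suggestion concrete for the simplified chaining version only (not the full crossword problem), here is a minimal sketch that memoises on the set of used words and the last word placed. The function name is hypothetical, and the running time is still exponential in the number of words, so this is only practical for small sets.

```python
from functools import lru_cache


def longest_chain(words):
    """Length of the longest chain in which each word starts with the
    last letter of the previous one. Each word is used at most once."""
    n = len(words)

    @lru_cache(maxsize=None)
    def best(used, last):
        # Most words that can still be appended after `last`,
        # given the bitmask `used` of words already in the chain.
        result = 0
        for j in range(n):
            if not used & (1 << j) and words[last][-1] == words[j][0]:
                result = max(result, 1 + best(used | (1 << j), j))
        return result

    return max((1 + best(1 << i, i) for i in range(n)), default=0)


print(longest_chain(["cat", "tree", "element"]))
```

The state space is at most 2^n * n entries, which is the usual trade-off for Hamiltonian-path-style subset DP.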

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespeare's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem - because, in this example "complete works" - there is very little repetition.
Hopefully, this example segmentation helps clarify the 2**(N-1) substring searches in the complete works. If we consider the segmentation, each of the (N-1) gaps between the N characters in the essay may either be a gap between segments, or not... resulting in ~ 2**(N-1) segmentation hypotheses, each of which needs substring searches of the complete works to test.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solved 'substring matching' in this way - but not scoring. The state space for scoring, on the surface, at least, seems wildly too large (for any substantial complete works of Shakespeare.) I'd welcome any explanation that undermines my assumptions that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the student's work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct the first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible. So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
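The greedy scoring argued for above can be sketched in a few lines of Python. This is an illustration only: the function name is made up, and Python's built-in substring test stands in for the suffix array or suffix tree lookup, so it is far slower than the approach described but produces the same score.

```python
def score(essay, works):
    """Minimum number of substrings of `works` needed to build `essay`,
    found by always taking the longest match from the current position."""
    segments = 0
    i = 0
    while i < len(essay):
        if essay[i] not in works:
            raise ValueError("character %r never occurs in the works" % essay[i])
        j = i + 1
        # Extend the current segment while it is still a substring of the works.
        while j < len(essay) and essay[i:j + 1] in works:
            j += 1
        segments += 1
        i = j
    return segments


works = "The quick brown fox jumps over the lazy dog."
print(score("jogging", works))
```

With the abridged works from the question, "own fox jump" scores 1 and "jogging" scores 6, matching the segmentation [j][og][g][i][n][g].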
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).
The depth of the trie is limited by the available space. With a gigabyte of RAM you could reasonably expect to exhaustively cover Shakespearean English strings of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
This feels like a problem of partial matching a very large regular expression.
If so, it can be solved by a very large nondeterministic finite state automaton or, maybe more broadly put, by a graph representing, for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons the NDFA is guaranteed to be convertible to a DFA. But then this construction can give rise to 2^n states, maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states; one state for each character and C states where C = 26*2 + #punctuation to connect to each of the M states to allow the algorithm to (re)start when there are 0 matched characters. The question is would the corresponding DFA have O(2^M) states and if so is it necessary to make that DFA, theoretically it's not necessary. However, consider that in the construction, each state will have one and only one transition to exactly one other state (the next state corresponding to the next character in that work). We would expect that each one of the start states will be connected to on average M/C states, but in the worst case M meaning the NDFA will have to track at most M simultaneous states. That's a large number but not an impossibly large number for computers these days.
The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached.
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of the string search algorithms, it looks like a small modification on failure to match (increment counter) and success (keep going) can serve as a general strategy.

How to find the period of a string

I take an input string from the user, and it consists of a certain substring which repeats itself all through the string. I need to output the substring or its length, AKA the period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a single char and check if it is the same as its next character. If it is not, I could do it with two characters, then with three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
Let me assume that the length of the string n is at least twice the period p.
Algorithm
1. Let m = 1, and let S be the whole string.
2. Take m = m*2.
3. Find the next occurrence of the substring S[:m].
4. Let k be the start of the next occurrence.
5. Check if S[:k] is the period.
6. If not, go to step 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each m that is a power of 2, we find repetitions of the first m characters. Then we extend this sequence to its second occurrence. Let's start with 2^1, so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD CDCD CDCD CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurrences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating substring that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z array, which can be built in O(n) time and O(n) space. Let's say we have the string
S1 = abababab
For this the Z array would look like
z[]={8,0,6,0,4,0,2,0};
In order to calculate the period we can iterate over the Z array and
look for the smallest i > 0 satisfying i+z[i]=S1.length. That i is the period.
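A minimal sketch of the Z-array approach, with hypothetical function names, following the convention used above that z[0] is the length of the whole string. If no i > 0 satisfies the condition, the string has no shorter period and its own length is returned.

```python
def z_array(s):
    """z[i] is the length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost match window found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def period(s):
    """Smallest i > 0 with i + z[i] == len(s), or len(s) if there is none."""
    z = z_array(s)
    for i in range(1, len(s)):
        if i + z[i] == len(s):
            return i
    return len(s)


print(z_array("abababab"))
print(period("ABCAB"))
```

Note that this notion of period allows a final partial repetition, so ABCAB has period 3 as in the question's example.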
Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. If you find a match, the string up to the matched character is your repeating string.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints on matching past its end. This means that you can find the period by taking any left-to-right string matching algorithm, and applying it to itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
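For the KMP option with O(n) working space, one way to sketch it is through the KMP failure table: the longest proper prefix of the string that is also a suffix gives the period as the string length minus that border length. The function names are illustrative, not from any of the cited sources.

```python
def prefix_function(s):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of s[:i+1] that is also a suffix of it."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail


def period_kmp(s):
    """Smallest period of s, computed as len(s) minus its longest border."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]


print(period_kmp("EFIEFI"))
```

This matches the "find the pattern within itself" view: the failure table is exactly what KMP builds when the pattern is matched against shifted copies of itself.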

Algorithm to Find if M28K is unique

Today my younger brother asked me a question, the question is as follows:
Given a list of strings & string M28K, where M28K represents a string which starts
with M, ends with K and has 28 chars in between. Find if M28K is unique in the
list of strings or not?
I came up with the following algorithm to solve the problem:
For each string:
    find string length (L)
    if (L == 30) then
        if (str[0] == 'M' && str[L-1] == 'K') then
            verify whether the remaining 28 characters match
This solution doesn't seem to be efficient in terms of time complexity. Can anyone give a better algorithm to solve this problem?
I would go with hashing. That said, since this sounds like an algorithms homework problem: in my experience, we were not allowed to answer with hashing because it really depends on your hash function. If it is not good enough, then you won't get unique values for each string.
I would build the list of strings into a binary search tree keyed on the characters in the string, following the rule that a string which comes before the head node in alphabetical order is placed to the left, and one which comes after the head node is placed to the right. Recursively, of course. Now we have a tree. Granted, in the worst case a lookup takes O(n) time, because the tree would effectively be a linked list, but with a good head node, somewhere in the m or n area, a lookup can be completed in O(log n). So the whole operation would take O(n log n) time.
Your provided algorithm, worst case, would take O(n^2). Let's say every string was 30 characters, began with M and ended with K, and the strings differed only in the 2nd to last character. Effectively we would search 28 characters of all the provided strings. The n^2 comes into play with finding the size of all the strings. Each string would take O(n) time, making it an n^2 algorithm. In my algorithm, we are halving the problem each time, which provides a much quicker search.
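As a small illustration of the hashing suggestion, the sketch below counts every string in a hash table and reports whether the target occurs exactly once. It assumes "unique" means "appears exactly once in the list", which the question does not state explicitly, and the function name is made up.

```python
from collections import Counter


def is_unique(strings, target):
    """True if `target` occurs exactly once in `strings`.
    Counter is a hash table, so building it takes one pass over the list."""
    return Counter(strings)[target] == 1


m28k = "M" + "x" * 28 + "K"
print(is_unique(["abc", m28k, "M" + "y" * 28 + "K"], m28k))
```

Python compares full strings on a hash collision, so a weak hash function affects speed here but not correctness.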

Matching a line to a rule in c++ [closed]

Closed 11 years ago.
I have a line and a set of rules (other lines). Line to match (line = rule) may be very long; set of rules may be large and each rule may be long. Shorter rules may be part of longer (need to choose longer).
Currently, I have about 70 rules, <30 characters long, organized in a long if-else-if chain.
Is there any way to predict at what point there will be decrease in performance?
Is there a faster way of matching than comparing the line with each of the rules?
Edit: There are no text files. I have an encoded sequence of characters, I go through if-elses comparing to "rules", and then act accordingly.
If you just want to check whether the input line is equal to any of the rules, then use a std::set (or std::map, if you want different behavior per rule) to store them. That takes the matching complexity down to O(lg N) comparisons, where N is the number of rules.
Better yet, use an unordered_set (C++11) for average-case O(1) performance.
If the behavior does not depend on which rule matched, then you can also compile a regular expression from the rules (e.g. (niVVVd__xniVVd__)|(niVVVdxniVVd)) with a tool like RE2 to get worst-case O(n) behavior, where n is the length of the input string.
Since you're comparing for equality, you don't need to match the longest rule first.
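The lookup-table idea is shown here in Python for brevity; a dict plays the role that std::unordered_map would play in C++. The two rule strings are taken from the regular expression example above, while the handler functions and their return values are purely hypothetical placeholders.

```python
def handle_padded(line):
    # Hypothetical action for the first rule.
    return "padded rule matched"


def handle_plain(line):
    # Hypothetical action for the second rule.
    return "plain rule matched"


# Each rule maps to the action to take when the whole line equals that rule.
RULES = {
    "niVVVd__xniVVd__": handle_padded,
    "niVVVdxniVVd": handle_plain,
}


def dispatch(line):
    """Run the handler for `line`, or return None if no rule matches.
    The dict lookup replaces the long if-else-if chain."""
    handler = RULES.get(line)
    return handler(line) if handler is not None else None


print(dispatch("niVVVdxniVVd"))
```

Adding a rule then means adding one table entry rather than another branch.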

Resources