Heuristic function for A* search for sorting

My problem is sorting strings given in three lists using an implementation of A* search, so I need to develop a heuristic function that makes solving various instances of this problem efficient.
The state can be represented as a list of 3 lists, for example: [ [C B], [D], [H G F A E] ]
I pick up the top portion of any stack and move it to any other. For example, the H above could be moved on top of the C or the D, and the [H G] could be moved onto the second stack to create [H G D], etc. In this domain, there are unit operator costs, so each move costs 1, regardless of how many words are being moved.
The goal is to take an initial configuration of blocks and move them all to the left stack in sorted order from the top down. For example, given the 8 blocks in the above example, the goal would be [[A B C D E F G H] [] []].
I need to develop a heuristic that can be used with A* search to make the search efficient.
I should try to keep the heuristics I develop admissible. For that, I may try to have the heuristics approximate the number of steps remaining to reach the goal state.
I am thinking about a heuristic that depends on the sorted words in each stack, where the sum of the points of each stack is the heuristic value for the state, but this is not efficient. I think I need to include the ASCII code of each letter in my calculations. Any ideas?

At the beginning, try a simple or naive heuristic. For example, let "h" be the number of adjacent pairs of letters that are not in ascending order. Let's say when you develop a new child you get something like [ACBED]: the first pair [AC] is in ascending order, so do nothing, but the next pair [CB] is not in ascending order, so add one to "h". [BE] is fine. For [ED], add one to "h". So "h == 2". This heuristic looks simple, but I believe it is better than blind search (i.e. breadth-first/depth-first search). From this idea you can add more rules to enhance the heuristic based on your analysis of the outcomes. I hope that is useful.
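A minimal Python sketch of this naive heuristic (the state layout is the list-of-three-stacks from the question, each stack top-first; whether h stays admissible depends on how many adjacent pairs a single move can fix, so treat it as a starting point, not a guarantee):

```python
def h(state):
    """Count adjacent pairs of blocks, reading the stacks left to
    right and top to bottom, that are not in ascending order.
    For the flattened sequence [A C B E D] this gives 2."""
    flat = [block for stack in state for block in stack]
    return sum(1 for a, b in zip(flat, flat[1:]) if a > b)
```

Since one move can repair more than one out-of-order pair at once, dividing the count by a small constant is one way to be safer about admissibility.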


Calculating the highest 2-side average match

"First part of the question is dedicated towards explaining the concept better, so we know, what we're calculating with. Feel free to
skip below to the latter parts, if you find it unnecessary"
1. Basic overview of the question:
Hello, I've got an Excel application, something akin to a dating site. You can open various user profiles and even scan through the data to find potential matches, based on hobbies, cities and other criteria.
How it's calculated is not relevant to the question, but the result of a "Find Match" calculation looks something like this: a sorted list of users, depending on how fitting they are (last column).
Relevant to the question are mainly:
the first column (ID) - the ID of the user
the last column (Zhoda) - the Match% of other users against the one currently selected
2. What I need to do - how it's currently done
I need to find the highest match on average out of all users. If I were to write this algorithmically:
1. Loop through all users
2. For each user in our database calculate the potential matches
3. Store the score of selected user ID, against all the found user IDs
4. Once it's all calculated, pit all users against each other and find the highest match on average
Obviously that sounds pretty complicated / vague, so here's a simplified example. Let's say I have completed the first 3 steps and have gotten the following result:
Here, the desired result would be:
User1 <- 46% -> User2
as they have the highest combined percentage average:
User1 vs User2: 30%
User2 vs User1: 62%
User1 <- (30+62)/2 -> User2
And no other possible combination of users has higher match% average
3. The purpose behind the question:
Now obviously you may ask: if I get the calculation behind it, then why ask the question in the first place? Well, the reason is that combining everything against everything is extremely inefficient.
As soon as there are, let's say, 100 users instead of 3 in my database, I would have to do 100*100 calculations on match% alone, let alone afterwards check the average Match% of each individual user against every other.
Is there perhaps some better way to approach this, in a way that I could either:
minimize the data I have to calculate with
use some sorting algorithm where I could skip certain calculations in order to be quicker
or find an overall better approach towards calculating the highest average match%
So to recapitulate:
I've got a database of users.
Each individual user has a certain Match% against every other user.
I need to find the two users who, one against the other (on both sides), have the highest Match% average out of all possible combinations.
If you feel like you need any additional info, please let me know.
I'll try to keep the question updated as much as possible.
As you've presented the problem -- no, you cannot speed this up significantly. Since you've presented match% as an arbitrary function, constrained only by implied range, there are no mathematical properties you can harness to reduce the worst-case search scenario.
Under the given circumstances, the best you can do is to leverage the range. First, don't bother with "average": since these are strictly binary matches, dividing by 2 is simply a waste of time; keep the total.
Start by picking a pair; do the two-way match. Once you find a total of more than 100, store that value and use it to prune any sub-standard searches. For instance, if your best match so far totals 120, then if you find a couple where match(A, B) < 20, you don't bother with computing match(B, A).
In between, you can maintain a sorted list (O(n log n)) of first matches; don't do the second match unless you have reason to believe that this one might exceed your best match.
The rest of your optimization consists of gathering statistics about your matching, so that you can balance when to do first-only versus two-way matches. For instance, you might defer the second match for any first match that is below the 70th percentile of those already deferred, in the hope of finding a far better match that would entirely eliminate this one.
If you gather statistics on the distribution of your match function, then you can tune this back-and-forth process better.
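The basic prune-by-best-total idea can be sketched as follows (hypothetical names; `match(a, b)` stands in for the asymmetric match% in the 0-100 range, and nothing here comes from the asker's workbook):

```python
def best_pair(users, match):
    """Find the two users with the highest two-way total, skipping the
    reverse match whenever the first direction already rules a pair out.
    `users` and `match` are illustrative stand-ins for the real data."""
    best_total, best = -1, None
    for i in range(len(users)):
        for j in range(i + 1, len(users)):
            first = match(users[i], users[j])
            # The reverse match is at most 100, so if even a perfect
            # reverse cannot beat the best total, skip computing it.
            if first + 100 <= best_total:
                continue
            total = first + match(users[j], users[i])
            if total > best_total:
                best_total, best = total, (users[i], users[j])
    return best, best_total / 2  # the pair and its average match%
```

The worst case is still quadratic, but the better the early pairs, the more second matches get skipped.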
If you can derive mathematical properties about your match function, then there may be ways to leverage those properties for greater efficiency. However, since it's already short of being a formal topological "distance" metric d (see below), I don't hold out much hope for that.
Basic metric properties:
d(A, B) exists for all pairs (A, B)
d(A, B) = d(B, A)
d(A, A) = 0 // does not apply to a bipartite graph
d obeys the triangle inequality -- which doesn't apply directly, but has some indirect consequences for a bipartite graph.

Filtering letter combinations

Hi – I’m looking for help for the following problem.
I have a utility that gives me all the combinations for a set of letters (or values). This is in the form of 8 choose n, i.e. there are 8 letters and I can produce all the combinations for sequences where I want no more than 4 letters, so n can be 2, 3, or 4.
Now here is where it gets a bit more complex: the 8 letters are made up of three lists or groups: A,B,C,D; E1,E2; F1,F2.
As I say, I can get all the 2-, 3- and 4-sequences without a problem. But I need to filter them so that I only keep combinations containing (in the n=2 condition) at least one of A,B,C,D and one from either the E set or the F set.
So, as a few examples, where n=2
AE1 or DF2… is ok but AB or E1E2 or E1F1… is not ok
Where n=3 the rules alter slightly but it’s the same principle
ABE1, ABF1, BDF2 or BE2F1… is ok but ABC, ABD, AE1E2, DF1F2 or E1E2F1… is not ok.
Similarly, where n=4
ABE1F1, ABE1F2… is ok but ABCD, ABE1E2, CDF1F2 or E1E2F1F2… is not ok.
I’ve tried a few things using different formulas, such as MATCH and COUNTIF, but can’t quite figure it out, so I would be very grateful for any help.
Jon
I've been trying to find an approach to this problem that takes some of the messiness out of it. There are two factors that make this a bit awkward to deal with:
(a) Combination of single letters and bigrams (digrams?)
(b) Possibility of several different letters / bigrams at each position in the string.
It's possible to deal with both of these issues by classifying the letters or bigrams into three groups or classes
(1) Letters A-D - let's call this group L
(2) First pair of bigrams E1 & E2 - let's call this group M
(3) Second pair of bigrams F1 & F2 - let's call this group N.
Then we can make a list of the allowed combinations of groups, which as far as I can work out is something like this:
For N=2
LM
LN
For N=3
LLM
LLN
LMN
For N=4
LLMN
(I don't know if LLLM etc. is allowed but these can be added)
I'm going to make a big assumption that the utility mentioned in the OP doesn't generate strings like AAAA or E1E1E1E1; otherwise it would be pretty useless and you would be better off starting from scratch.
So you just need a substitution that looks like this:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"A","L"),"B","L"),"C","L"),"D","L"),"E1","M"),"E2","M"),"F1","N"),"F2","N")
And a lookup in the list of allowed patterns
=ISNUMBER(MATCH(B2,$D$2:$D$10,0))
and filter on the lookup value being TRUE.
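For what it's worth, the same classify-and-look-up filter is straightforward outside Excel too. Here is a Python sketch under the same assumptions (the allowed-pattern list mirrors the one above; extend it if patterns like LLLM turn out to be legal):

```python
# Classification of each token into its group, as worked out above.
GROUP = {"A": "L", "B": "L", "C": "L", "D": "L",
         "E1": "M", "E2": "M", "F1": "N", "F2": "N"}
# The allowed group patterns listed above; add LLLM etc. here if legal.
ALLOWED = {"LM", "LN", "LLM", "LLN", "LMN", "LLMN"}

def keep(tokens):
    """Keep a combination such as ['A', 'E1'] only if its group
    pattern is on the allowed list; sorting normalizes token order."""
    return "".join(sorted(GROUP[t] for t in tokens)) in ALLOWED
```

This is the direct analogue of the SUBSTITUTE-then-MATCH pair of formulas: classify first, then look the pattern up.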

Algorithm to find all substrings of a string within a given edit distance of another string

I know the title is a bit messy, so let me explain in detail:
I have two strings, T and P. T represents the text to be searched, and P represents the pattern to be searched for. I want to find ALL substrings of T which are within a given edit distance of P.
Example:
T = "cdddx"
P = "mdddt"
Say I wanted all substrings within edit distance 2; the answers would be:
cdddx // rewrite c and x
dddx // insert m, rewrite x
cddd // insert t, rewrite c
ddd // insert m and t
I don't know if that's all of them, but you get the point.
I know the Wagner–Fischer algorithm can be employed to solve this problem: I check the numbers in the last row of the Wagner–Fischer matrix to see if they fulfill this condition and find the substrings that way, then run the algorithm again for T', where T' is T with the first letter removed, and so on. The problem is that the time complexity of this shoots up to a staggering O(T^2*P). I'm looking for a solution close to the original time complexity of the Wagner–Fischer algorithm, i.e. O(T*P).
Is there a way to get this done in such time or something better than what I have right now? Note that I am not necessarily looking for a Wagner-Fischer solution, but anything is ok. Thanks!
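One standard way to get down to O(T*P) is a textbook variant of Wagner–Fischer for approximate string matching (often attributed to Sellers; this is not from the question itself): fix the first row of the matrix at 0, so that a match may begin at any position of T. Then every column whose last entry is <= k marks the end of at least one qualifying substring:

```python
def ends_within_k(T, P, k):
    """Column-by-column Wagner-Fischer with the first row fixed at 0,
    so a match may start anywhere in T. Whenever the last cell of a
    column is <= k, some substring of T ending there matches P within
    edit distance k. O(len(T) * len(P)) time, O(len(P)) space."""
    m = len(P)
    prev = list(range(m + 1))          # column for the empty prefix of T
    ends = []
    for j, tc in enumerate(T, 1):
        cur = [0] * (m + 1)            # row 0 stays 0: free start
        for i in range(1, m + 1):
            cost = 0 if P[i - 1] == tc else 1
            cur[i] = min(prev[i - 1] + cost,  # match / substitute
                         prev[i] + 1,         # skip a text character
                         cur[i - 1] + 1)      # skip a pattern character
        if cur[m] <= k:
            ends.append(j)             # 1-based end position in T
        prev = cur
    return ends
```

Recovering the actual substrings (their start positions) takes a traceback from each reported end, but the end positions themselves come out in a single O(T*P) pass.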

Transform string from a1b2c3d4 to abcd1234

I am given a string which has numbers and letters. Numbers occupy all the odd positions and letters the even positions. I need to transform this string such that all the letters move to the front of the array and all the numbers to the end.
The relative order of the letters and numbers needs to be preserved.
I need to do this in O(n) time and O(1) space.
e.g. a1b2c3d4 -> abcd1234, x3y4z6 -> xyz346
This previous question has an explanation of the algorithm, but no matter how hard I try, I can't get a hold of it.
I hope someone can explain this to me with an example test case.
The key is to think of the input array as a matrix like this:
a 1
b 2
c 3
d 4
and realize that you want the transpose of this matrix:
a b c d
1 2 3 4
Remember, multi-dimensional arrays are really just single-dimensional arrays in disguise, so you can do this.
But you need to do this in-place to satisfy the O(1) space requirement. Fortunately, this is a well-known problem complete with several possible approaches.
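To make the index arithmetic concrete: with n letters, the element at flat index i must move to (i*n) mod (2n-1), with indices 0 and 2n-1 as fixed points. The sketch below follows the cycles of that permutation; it uses a visited array for clarity, so it is O(n) extra space rather than O(1) -- the genuinely in-place versions replace that bookkeeping with cycle-leader selection, which is the hard part the approaches above address.

```python
def transpose_interleaved(s):
    """Transpose an n x 2 row-major array (a1b2...) into 2 x n
    (ab...12...) by following the cycles of i -> (i * n) % (2n - 1)."""
    a = list(s)
    N = len(a)
    if N < 2:
        return s
    n = N // 2
    M = N - 1                      # indices 0 and N-1 never move
    visited = [False] * N
    for start in range(1, M):
        if visited[start]:
            continue
        i, carry = start, a[start]
        while True:
            j = (i * n) % M        # destination of the carried element
            a[j], carry = carry, a[j]
            visited[j] = True
            i = j
            if i == start:
                break
    return "".join(a)
```

Tracing a1b2c3d4 (n=4, mod 7): index 1 -> 4, 2 -> 1, 3 -> 5, 4 -> 2, 5 -> 6, 6 -> 3, which is exactly the transpose.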

Count no. of words in O(n)

I am on an interview ride here. One more interview question I had difficulties with.
“A rose is a rose is a rose.” Write an algorithm that prints the number of times each character/word occurs, e.g. A – 3, Rose – 3, Is – 2.
Also ensure that when you are printing the results, they are in the order in which they were present in the original sentence. All this in order n.
I did get a solution that counts the number of occurrences of each word in the sentence, in the order present in the original sentence; I used a Dictionary<string,int> to do it. However, I did not understand what is meant by "order of n". That is something I need you guys to explain.
There are 26 characters, so you can use counting sort to sort them; in your counting sort you can keep an index that records when each character was first visited, to preserve the order of occurrence. (They can also be sorted by their count and their first occurrence with a sort like radix sort.)
Edit: for words, the first thing everyone thinks of is using a hash table: insert the words into the hash and count them that way. The counts can then be sorted in O(n), because all the values are within 1..n, so you can still sort them by counting sort in O(n); for the order of occurrence, you can traverse the string once and record where each word first appears.
Order of n means you traverse the string only once, or some constant multiple of n, where n is the number of characters in the string.
So your solution, storing the string and the number of its occurrences, is O(n) -- order of n -- as you loop through the complete string only once.
However, it uses extra space in the form of the list you created.
Order n refers to Big O computational complexity analysis, where you get a good upper bound on algorithms. It is a theory we cover early in a data structures class, so we can torment -- I mean, help -- the student gain facility with it as we traverse, in a balanced way, heaps of different trees of knowledge, all different. In your case, they want your algorithm's compute time to grow proportionally to the size of the text as it grows.
It's a reference to Big O notation. Basically the interviewer means that you have to complete the task with an O(N) algorithm.
"Order n" is referring to Big O notation. Big O is a way for mathematicians and computer scientists to describe the behavior of a function. When someone specifies searching a string "in order n", that means that the time it takes for the function to execute grows linearly as the length of that string increases. In other words, if you plotted time of execution vs length of input, you would see a straight line.
Saying that your function must be of order n does not mean that your function must equal O(n); a function with a Big O less than O(n) would also be considered acceptable. In your problem's case, this would not be possible (because in order to count a letter, you must "touch" that letter, so there must be some operation dependent on the input size).
One possible method is to traverse the string linearly, maintaining a hash and a list. The idea is to use the word as the hash key and increment the value for each occurrence; if the word is not yet in the hash, add it to the end of the list. After traversing the string, go through the list in order, using the hash values as the counts.
The order of the algorithm is O(n). The hash lookup and list-append operations are O(1) (or very close to it).
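A Python sketch of that hash-plus-list pass (the function name and the lower()/split() tokenization are my own assumptions; modern Python dicts already preserve insertion order, but the explicit list keeps the technique visible):

```python
def word_counts(sentence):
    """One linear pass: count each word in a hash, appending each
    newly seen word to a list so first-seen order is preserved."""
    counts = {}
    order = []
    for word in sentence.lower().split():
        if word not in counts:
            counts[word] = 0
            order.append(word)
        counts[word] += 1
    return [(w, counts[w]) for w in order]
```

For the interview sentence this yields a 3, rose 3, is 2, in original order, with a single traversal of the input.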
