Permutations of a string of non-unique characters - string

While there are a lot of solutions for how to find all the (unique) permutations of a string of unique characters, I haven't found solutions that work when the characters are non-unique. I have listed out my idea below and would appreciate feedback, but also feel free to provide your own ideas.
My idea:
To illustrate my algorithm, I'm using the example of the string ABBC, which I want to find all permutations of. Since there are two B's I will be labelling them B1 and B2.
Create a new string by removing all duplicate characters from the original string (e.g. turn AB1B2C into AB1C).
Find all possible permutations of the new string (e.g. AB1C, ACB1, B1AC, etc.). There are many algorithms to do this, since the string's characters are all unique.
Choose one duplicate character. For each permutation, insert the chosen duplicate characters at every "position" of the permutation, except when the character just before the duplicate character has the same value as the duplicate character (e.g. For the permutation AB1C, since the duplicate character is B2, insert it to get B2AB1C, AB2B1C, AB1CB2. Exception: Don't do AB1B2C, since that's just a duplicate of AB2B1C).
Continue to do step 3 but now choose a different duplicate character. (Do this until all duplicate characters have been chosen exactly once.)
Prior research: The answer by Prakhar on this SO question claims to work for duplicates: Generate list of all possible permutations of a string. It might, but I suspect there's a bug in the code.

How about this: suppose that the string with duplicates is of length N. Now consider the sequence 0,1,...N-1. Find all its permutations using one of the known algorithms. For each permutation in this list, generate a corresponding string by using the number in the permutation as an index into the original string. For example, if the string is ABBC, then the sequences will be 0,1,2,3; 0,1,3,2; etc. The sequence 3,0,1,2, as an example, is one of the permutations, and it yields the string CABB

Related

Extract Number from string into a list in Scala

I have the following string :
var myStr = "abc12ef4567gh90ijkl789"
The size of the list is not fixed and it contains number in between. I want to extract the numbers and store them in the form of a list in this manner:
List(12,4567,90,789)
I tried the solution mentioned here but cannot extend it to my case. I just want to know if there is any faster or efficient solution instead of just traversing the string and extracting the numbers one by one using brute force ? Also, the string can be arbitrary length.
It seems you may just collect the numbers using
("""\d+""".r findAllIn myStr).toList
See the Scala demo. \d+ matches one or more digits, findAllIn searches for multiple occurrences of the pattern inside a string (and also un-anchors the pattern so that partial matches could be found).
If you prefer a splitting approach, you might use
myStr.split("\\D+").filter(_.nonEmpty).toList
See another demo. Here, \D+ matches one or more non-digit chars, and these chunks are used to split on (texts between these chunks land in the result). .filter(_.nonEmpty) will remove empty items that usually appear due to matches at the start/end of the string.

How should I remove all the repeated words and letters of a String?

I am trying to remove every character repeated over 2 times from an extremely long string. So, for example, the word Terrrrrrific becomes Terrific.
Now my question is, how do I filter out repeats that include more than a single character the same way, i.e. if I have Words words words words words I want to filter it down to words words, however, it might be something less sensible, such as abcdabcdabcdabcdabcd which should become abcdabcd.
I do suspect that I should use a suffix tree, but I'm not sure how to go at the algorithm exactly.
I don't know, Is this efficient algorithm for you but you can do this:
Choose length for finding repeats
Then for every start point from 0 to length-1 go through string
Maintain stack (you use disjoint substrings and push on stack if top two from stack is different from them)

Finding similar strings in large datasets

I'm using levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.
I'm trying to make this more efficient and one technique I came up with was to calculate the levenshtein distance only on strings that are of similar length. I though about also filtering on the initial character i.e. if the string to search starts with b then I'll run the calculation only on the strings that start with b. But I'm not sure if I could assume this to work all the time.
I was wondering if you all have a better way of getting this done?
Thanks
One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
As an alternative to building a database of substrings, you could build a http://en.wikipedia.org/wiki/Suffix_array and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for strings in the suffix array starting ABCDEF, BCDEF, CDEF, and DEF.

compare a string to a cell array of srings in matlab and find the most similar

I have a list of images stored in a directory. They are all named. My GUI reads all the images and saves their names in a cell array. Now I have added a editable box that the user can type in a name and the program will show that image. The problem is I want the program to take into account typos and misspellings by the user and find the most similar file name to the user typed word. Can you please help me?
Many Thanks,
Hamid
You should read this WP article: Approximate string matching and look at "Calculation of distance between strings" on FEx.
I think you should use the longest common subsequence algorithm to approximately compare strings.
Here is a matlab implementation:
http://www.mathworks.com/matlabcentral/fileexchange/24559-longest-common-subsequence
After, just do something like that:
[~,ind]=min(cellarray( #(x) LCS(lower(userInput),lower(x)), allFileNames));
chosenFile=allFileName{ind};
(the function LCS is the longest common subsequence algorithm, and the functionlower converts to lower case)
Not exactly what you are looking for, but you can compare the first few characters of the strings ignoring case to find a close match. See the command strncmpi:
strncmpi Compare first N characters of strings ignoring case.
TF = strncmpi(S,C,N) performs a case-insensitive comparison between the
first N characters of string S and the first N characters in each element
of cell array C. Input S is a character vector (or 1-by-1 cell array), and
input C is a cell array of strings. The function returns TF, a logical
array that is the same size as C and contains logical 1 (true) for those
elements of C that are a match, except for letter case, and logical 0
(false) for those elements that are not. The order of the two input
arguments is not important.

How will you sort strings in the following example?

so i have a list of string
{test,testertest,testing,tester,testingtest}
I want to sort it in descending order .. how do u sort strings in general ? Is it based on the length or is it character by character ??
how would it be in the example above ?? I want to sort them in a descending way.
No matter what language you’re in, there’s a built-in sort function that performs a lexicographical order, which returns
['test','tester','testertest','testing','testingtest']
for your example. If I wanted this reversed, I would just say reversed(sorted(myList)) in Python and be done with it. If you look to your right you can see plenty of related questions that require a more specialized ordering method (for numbers, dates, etc.), but lexicographic order works on strings containing any kind of data.
Here’s how it works:
compare(string A, string B):
if A and B are both non-empty:
if A[0] == B[0]:
// First letters are the same; compare by the rest
return compare(A[1:], B[1:])
else:
// Compare the first letters by Unicode code point
return compare(A[0], B[0])
else:
// They were equal till now; the shorter one shall be sorted first
return compare(length of A, length of B)
I would sort it like this:
testingtest
testing
testertest
tester
test
Assuming C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
Array.Reverse(myStrings);
foreach(string s in myStrings)
{
Console.WriteLine(s);
}
Not always an ideal way to do it - you could implement a custom comparer instead - but for the trivial example you asked about this is probably the most logical approach.
In computer science strings are usually sorted character by character, with the preferred sort order being (for a standard english character set):
Null characters first
Followed by whitepsace
Followed by symbols
Followed by numeric characters in obvious numerical order
Followed by alphabetic characters in obvious alphabetical order
When sorting characters generally lowercase characters come before uppercase characters.
So for example if we were to sort / compare:
test i ng
test e r
Then "tester" would come before "testing" - the first different character in the string is the 5th one, and "e" comes before "i".
Similarily if we were to compare:
test
testing
Then in this case "test" would come first - once again the strings are identical until the 5th character, where the string "test" ends (i.e. no character) which becomes before any alphanumerical character.
Note that this can produce some counter-intutive results when dealing with numbers - for example try sorting the strings "50" and "100" - you will find that "100" comes before "50". Why? because the strings differ at character 1 and "5" comes after "1".
In nearly all languages there is a function which will do all of the above for you!
You should use that function instead of trying to sort strings yourself! For example:
// C#
string[] myStrings = {"test","testertest","testing","tester","testingtest"};
Array.Sort(myStrings);
in Java you can use natural ordering with
java.util.Collections.sort(list);
the make it descending
java.util.Collections.reverse(list);
or create your own Comparator to do the reverse sorting.
When comparing two strings to see which sorts first, the comparison is typically done on a character by character basis. If the characters in the first position (e.g., t in your example) are identical, you move to the next character. When two characters differ, that "may" define which string is considered "greater".
However, depending on the locale used and a number of other factors, it is possible for later characters in the two strings being compared to override a difference in an earlier character. For example, in some collations, the diacritics on letters are considered to be of secondary weight. So a primary difference in a later character can override the secondary difference.
When two strings are otherwise identical but one is longer, the longer one is typically considered to be "greater". When sorting in descending order, the "greater" of two strings is sorted first.
Do you want to know if test should appear after tester in a descending order? Or are you particularly interested in sorting strings with similar prefixes?
If it's the later, I'd suggest a Trie if the input tends to grow big time.

Resources