String matching algorithm when matching words inside two Strings? - textmatching

For example, suppose String A has a total of 10 words and String B has a total of 100 words. If all words in String A are found in String B, the result would be a 100% match. If half are found, it is a 50% match. What algorithm produces results like this?

I will try to write it in PHP-like code:
$wordsA = explode(' ', $A);
$wordsB = explode(' ', $B);
$match = 0;
foreach ($wordsA as $word) {
    if (in_array($word, $wordsB)) {
        $match++;
    }
}
// share of the words of A that were found in B
echo ($match / count($wordsA) * 100) . '%';
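The percentage described in the question can be sketched in Python as follows. This is only a minimal illustration: the name `match_percentage`, splitting on whitespace, and treating an empty A as a 0% match are assumptions, not part of the original question.

```python
def match_percentage(a: str, b: str) -> float:
    """Percentage of the words in `a` that also occur in `b`."""
    words_a = a.split()
    words_b = set(b.split())  # set membership is O(1) on average
    if not words_a:
        return 0.0  # assumption: an empty A counts as a 0% match
    found = sum(1 for word in words_a if word in words_b)
    return 100.0 * found / len(words_a)


# Two of the four words of A ("green" and "blue") occur in B.
print(match_percentage("red green blue cyan",
                       "the sky is blue and the grass is green"))
```

Using a set for the words of B keeps the whole computation roughly linear in the total number of words, instead of scanning B once per word of A.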

Related

Replacing the number in a string

If my string is, let's say, "Alfa1234Beta",
how can I convert all the digits into "_"?
For example, "Alfa1234Beta"
would become "Alfa____Beta".
Going with the Regex approach pointed out by others is probably fine for your scenario. Mind you, however, that Regex sometimes tends to be overused. A hand-rolled approach could look like this:
static string ReplaceDigits(string str)
{
StringBuilder sb = null;
for (int i = 0; i < str.Length; i++)
{
if (Char.IsDigit(str[i]))
{
if (sb == null)
{
// Seen a digit, allocate StringBuilder, copy non-digits we might have skipped over so far.
sb = new StringBuilder();
if (i > 0)
{
sb.Append(str, 0, i);
}
}
// Replace current character (a digit)
sb.Append('_');
}
else
{
if (sb != null)
{
// Seen some digits (being replaced) already. Collect non-digits as well.
sb.Append(str[i]);
}
}
}
if (sb != null)
{
return sb.ToString();
}
return str;
}
It is more lightweight than Regex and only allocates when there is actually something to replace. So go ahead and use the Regex version if you like. If profiling shows that it is too heavyweight, you can use something like the above. YMMV.
You can run a for loop over the string and use the following check to decide whether the current character is a digit that should be replaced with _:
if (System.Text.RegularExpressions.Regex.IsMatch(i.ToString(), "^[0-9]$"))
Here the variable i is the current character in the for loop.
You can use this:
var s = "Alfa1234Beta";
var s2 = System.Text.RegularExpressions.Regex.Replace(s, "[0-9]", "_");
s2 now contains "Alfa____Beta".
Explanation: the regex [0-9] matches any digit from 0 to 9 (inclusive). The Regex.Replace then replaces all matched characters with an "_".
EDIT
And if you want it a bit shorter AND also match non-latin digits, use \d as a regex:
var s = "Alfa1234Beta๓"; // ๓ is "Thai digit three"
var s2 = System.Text.RegularExpressions.Regex.Replace(s, @"\d", "_");
s2 now contains "Alfa____Beta_".
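For comparison, the difference between [0-9] and \d can be sketched in Python, where \d on a str pattern also matches non-Latin decimal digits. This is a Python counterpart for illustration, not the C# API used in the answers; the variable names are made up for the example.

```python
import re

s = "Alfa1234Beta๓"  # ๓ is the Thai digit three

# [0-9] only matches the ASCII digits 0 to 9.
ascii_only = re.sub(r"[0-9]", "_", s)

# \d on a str pattern matches any Unicode decimal digit.
any_digit = re.sub(r"\d", "_", s)

# The same replacement without a regex, using str.isdecimal().
no_regex = "".join("_" if ch.isdecimal() else ch for ch in s)

print(ascii_only)
print(any_digit)
print(no_regex)
```

The last variant mirrors the hand-rolled loop: it inspects each character once and needs no regular expression engine.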

Time complexity of a string compression algorithm

I tried to do an exercise from the book Cracking the Coding Interview about compressing a string:
Implement a method to perform basic string compression using the counts of
repeated characters. For example, the string aabcccccaaa would become
a2b1c5a3. If the "compressed" string would not become smaller than the original
string, your method should return the original string.
The first solution given by the author is the following:
public static String compressBad(String str) {
int size = countCompression(str);
if (size >= str.length()) {
return str;
}
String mystr = "";
char last = str.charAt(0);
int count = 1;
for (int i = 1; i < str.length(); i++) {
if (str.charAt(i) == last) {
count++;
} else {
mystr += last + "" + count;
last = str.charAt(i);
count = 1;
}
}
return mystr + last + count;
}
About the time complexity of this algorithm, she says:
The runtime is O(p + k^2), where p is the size of the original string and k is the number of character sequences. For example, if the string is aabccdeeaa, then there are six character sequences. It's slow because string concatenation operates in O(n^2) time.
I have two main questions:
1) Why is the time complexity O(p + k^2)? Shouldn't it be only O(p) (where p is the size of the string)? We only do p iterations, not p + k^2.
2) Why is the time complexity of string concatenation O(n^2)?
I thought that if we have string3 = string1 + string2, the complexity is size(string1) + size(string2), because we have to copy the two strings into a new string (creating a string is O(1)). So we would have an addition, not a multiplication, no? Isn't it the same for an array (if we used a char array, for example)?
Can you clarify these points, please? I don't understand how the complexity is calculated.
String concatenation is O(n) but in this case it's being concatenated K times.
Every time you find a sequence, you have to copy the entire string built so far plus what you just found. For example, if there are four sequences in the original string (k = 4) and each encoded sequence takes roughly c characters, the cost to build the final string is:
c + // first sequence
2c + // copying the previous string and adding the new sequence
3c + // copying the previous string and adding the new sequence
4c // copying the previous string and adding the new sequence
In general this is c(1 + 2 + ... + k) = c*k(k+1)/2, which is O(k^2). Together with the O(p) scan over the original string, the complexity is O(p + k^2).
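To make the cost argument concrete, here is a small Python sketch. `compress` collects the pieces in a list and joins them once at the end, which avoids re-copying the partial result on every append. `copied_chars` is a hypothetical helper that just counts how many characters get copied if every append copies the whole string built so far, assuming each encoded sequence is c characters long.

```python
def compress(s: str) -> str:
    """Run-length encode s; return s unchanged if that is not shorter."""
    if not s:
        return s
    parts = []  # collected pieces, joined once at the end
    last, count = s[0], 1
    for ch in s[1:]:
        if ch == last:
            count += 1
        else:
            parts.append(f"{last}{count}")
            last, count = ch, 1
    parts.append(f"{last}{count}")
    out = "".join(parts)
    return out if len(out) < len(s) else s


def copied_chars(k: int, c: int = 2) -> int:
    """Characters copied when k pieces of length c are appended one by one
    and every append copies the whole string built so far."""
    return sum(i * c for i in range(1, k + 1))


print(compress("aabcccccaaa"))
print(copied_chars(4), copied_chars(8))
```

Doubling k from 4 to 8 more than triples the number of copied characters, which is the quadratic growth the answer refers to.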

Dynamic character generator; Generate all possible strings from a character set

I want to make a dynamic string generator that will generate all possible unique strings from a character set with a dynamic length.
I can do this very easily using for loops, but then it's static and not of dynamic length.
// Prints all possible strings with the length of 3
for a in allowedCharacters {
for b in allowedCharacters {
for c in allowedCharacters {
println(a+b+c)
}
}
}
But when I want to make the length dynamic, so that I can just call generate(length: 5), I get confused.
I found this Stack Overflow question, but the accepted answer generates strings of length 1 to maxLength, and I want maxLength on every string.
As noted above, use recursion. Here is how it can be done with C#:
static IEnumerable<string> Generate(int length, char[] allowed_chars)
{
if (length == 1)
{
foreach (char c in allowed_chars)
yield return c.ToString();
}
else
{
var sub_strings = Generate(length - 1, allowed_chars);
foreach (char c in allowed_chars)
{
foreach (string sub in sub_strings)
{
yield return c + sub;
}
}
}
}
private static void Main(string[] args)
{
string chars = "abc";
List<string> result = Generate(3, chars.ToCharArray()).ToList();
}
Please note that the run time of this algorithm and the amount of data it returns grow exponentially with the length, which means that for large lengths you should expect the code to take a long time and to return a huge amount of data.
Translation of @YacoubMassad's C# code to Swift:
func generate(length: Int, allowedChars: [String]) -> [String] {
if length == 1 {
return allowedChars
}
else {
let subStrings = generate(length - 1, allowedChars: allowedChars)
var arr = [String]()
for c in allowedChars {
for sub in subStrings {
arr.append(c + sub)
}
}
return arr
}
}
println(generate(3, allowedChars: ["a", "b", "c"]))
Prints:
aaa, aab, aac, aba, abb, abc, aca, acb, acc, baa, bab, bac, bba, bbb, bbc, bca, bcb, bcc, caa, cab, cac, cba, cbb, cbc, cca, ccb, ccc
While you can (obviously enough) use recursion to solve this problem, it is quite an inefficient way to do the job.
What you're really doing is just counting. In your example, with "a", "b" and "c" as the allowed characters, you're counting in base 3, and since you're allowing three character strings, they're three digit numbers.
An N-digit number in base M can represent M^N different possible values, going from 0 through M^N - 1. So, for your case, that's limit=pow(3, 3)-1;. To generate all those values, you just count from 0 through the limit, and convert each number to base M, using the specified characters as the "digits". For example, in C++ the code can look like this:
#include <cmath>
#include <string>
#include <iostream>
int main() {
std::string letters = "abc";
std::size_t base = letters.length();
std::size_t digits = 3;
int limit = pow(base, digits);
for (int i = 0; i < limit; i++) {
int in = i;
for (int j = 0; j < digits; j++) {
std::cout << letters[in%base];
in /= base;
}
std::cout << "\t";
}
}
One minor note: as I've written it here, this produces the output in basically a little-endian format. That is, the "digit" that varies the fastest is on the left, and the one that changes the slowest is on the right.
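The same counting idea can be sketched in Python. The function name `generate` and its parameters are chosen for this illustration; unlike the C++ version, this sketch puts the slowest-changing "digit" on the left, so the output is in the usual dictionary order.

```python
from itertools import product


def generate(length: int, alphabet: str) -> list:
    """All strings of exactly `length` characters over `alphabet`,
    produced by counting from 0 to base**length - 1 in base len(alphabet)."""
    base = len(alphabet)
    result = []
    for n in range(base ** length):
        digits = []
        for _ in range(length):
            n, r = divmod(n, base)  # peel off the least significant digit
            digits.append(alphabet[r])
        # digits were collected least significant first, so reverse them
        result.append("".join(reversed(digits)))
    return result


strings = generate(3, "abc")
print(len(strings), strings[:4])

# The standard library yields the same sequence lazily.
assert strings == ["".join(p) for p in product("abc", repeat=3)]
```

For large lengths, iterating over `itertools.product` directly avoids holding all base**length strings in memory at once.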

Remove all the occurrences of substrings from a string

Given a string S and a set of n substrings. Remove every instance of those n substrings from S so that S is of the minimum length and output this minimum length.
Example 1
S = ccdaabcdbb
n = 2
substrings = ab, cd
Output
2
Explanation:
ccdaabcdbb -> ccdacdbb -> cabb -> cb (length=2)
Example 2
S = abcd
n = 2
substrings = ab,bcd
Output
1
How do I solve this problem ?
A simple Brute-force search algorithm is:
For each substring, try all possible ways to remove it from the string, then recurse.
In Pseudocode:
def min_final_length(s, substrings):
    best = len(s)
    for substr in substrings:
        # find all occurrences of substr in s and recurse on each removal
        found = s.find(substr)
        while found != -1:
            s_without_substr = s[:found] + s[found + len(substr):]
            best = min(best, min_final_length(s_without_substr, substrings))
            found = s.find(substr, found + 1)
    return best
Let complexity be F(S,n,l) where S is the length of the input string, n is the cardinality of the set substrings and l is the "characteristic length" of substrings. Then
F(S,n,l) ~ n * ( S * l + F(S-l,n,l) )
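Unrolling this recurrence gives a term of order n^(S/l), so for n > 1 the running time grows exponentially in S/l, not polynomially as a bound like O(S^2*n*l) would suggest.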
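One common way to tame such a brute-force search is to cache the result for every intermediate string, so that a string reached through different removal orders is only solved once. The sketch below is an illustration under that assumption, not a complexity guarantee: the number of distinct intermediate strings can still be large. The inner name `solve` is made up for the example.

```python
from functools import lru_cache


def min_final_length(s: str, substrings) -> int:
    """Minimum length reachable by repeatedly deleting any of `substrings`."""
    subs = tuple(x for x in substrings if x)  # ignore empty patterns

    @lru_cache(maxsize=None)
    def solve(cur: str) -> int:
        best = len(cur)
        for sub in subs:
            pos = cur.find(sub)
            while pos != -1:
                # delete this occurrence and solve the smaller problem
                best = min(best, solve(cur[:pos] + cur[pos + len(sub):]))
                pos = cur.find(sub, pos + 1)
        return best

    return solve(s)


print(min_final_length("ccdaabcdbb", ["ab", "cd"]))  # Example 1
print(min_final_length("abcd", ["ab", "bcd"]))       # Example 2
```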
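The following solution needs roughly O(m * n) prefix checks, where m = len(S) and n is the number of substrings (ignoring the cost of slicing). Note that it is greedy: it removes the first match it finds, so it does not always reach the minimum length. For Example 2 it removes "ab" first and ends with "cd" (length 2) instead of "a" (length 1).
def foo(S, sub):
    longest = max(len(e) for e in sub)
    i = 0
    while i < len(S):
        for e in sub:
            if S.startswith(e, i):
                S = S[:i] + S[i + len(e):]
                # step back so that matches formed across the cut are seen
                i = max(i - longest + 1, 0)
                break
        else:
            i += 1
    return S, i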
If you are after raw performance and your string is very large, you can do better than brute force. Use a suffix tree (e.g., built with Ukkonen's algorithm) to store your string. Then find each substring (which is done in O(m) time, m being the substring length), and store the offsets and lengths of the matches in an array.
Then use the offset and length information to actually remove the substrings by filling these areas with \0 (in C) or another placeholder character. By counting all non-null characters you will get the minimal length of the string.
This will also handle overlapping substrings, e.g. say your string is "abcd", and you have two substrings "ab" and "abcd".
I solved it using trie + DP.
First insert your substrings into a trie. Then define the DP state as a string: walk through that string and consider each i (for i = 0 .. s.length() - 1) as the start of some substring. Let j = i and increment j as long as you are still on a path in the trie (which may lead you to the end of one or more substrings if some of them share a common prefix, for example "abce" and "abdd"). Whenever you reach the end of some substring, solve the new sub-problem with that substring removed, and take the minimum over all such reductions.
Here is my code for it. Don't worry about its length: just read the solve function and ignore path, which I only included to print the resulting string.
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;
struct node{
node* c[26];
bool str_end;
node(){
for(int i= 0;i<26;i++){
c[i]=NULL;
}
str_end= false;
}
};
class Trie{
public:
node* root;
Trie(){
root = new node();
}
~Trie(){
delete root;
}
};
class Solution{
public:
typedef pair<int,int>ii;
string get_str(string& s,map<string,ii>&path){
if(!path.count(s)){
return s;
}
int i= path[s].first;
int j= path[s].second;
string new_str =(s.substr(0,i)+s.substr(j+1));
return get_str(new_str,path);
}
int solve(string& s,Trie* &t, map<string,int>&dp,map<string,ii>&path){
if(dp.count(s)){
return dp[s];
}
int mn= (int)s.length();
for(int i =0;i<s.length();i++){
string left = s.substr(0,i);
node* cur = t->root->c[s[i]-97];
int j=i;
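while(j<s.length()&&cur!=NULL){
if(cur->str_end){
string new_str =left+s.substr(j+1);
int ret= solve(new_str,t,dp,path);
if(ret<mn){
mn=ret; // keep the improved minimum (it was never updated before)
path[s]={i,j};
}
}
// advance in the trie only while j stays inside the string
++j;
cur = (j<s.length()) ? cur->c[s[j]-97] : NULL;
}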
}
return dp[s]=mn;
}
string removeSubstrings(vector<string>& substrs, string s){
map<string,ii>path;
map<string,int>dp;
Trie*t = new Trie();
for(int i =0;i<substrs.size();i++){
node* cur = t->root;
for(int j=0;j<substrs[i].length();j++){
if(cur->c[substrs[i][j]-97]==NULL){
cur->c[substrs[i][j]-97]= new node();
}
cur = cur->c[substrs[i][j]-97];
if(j==substrs[i].length()-1){
cur->str_end= true;
}
}
}
solve(s,t,dp,path);
return get_str(s, path);
}
};
int main(){
vector<string>substrs;
substrs.push_back("ab");
substrs.push_back("cd");
Solution s;
cout << s.removeSubstrings(substrs,"ccdaabcdbb")<<endl;
return 0;
}

String Matching: Matching words with or without spaces

I want to find a way by which I can map "b m w" to "bmw" and "ali baba" to "alibaba" in both the following examples.
"b m w shops" and "bmw"
I need to determine whether I can write "b m w" as "bmw"
I thought of this approach:
Remove the spaces from the original string. This gives "bmwshops". Then find the longest common substring of "bmwshops" and "bmw".
Second example:
"ali baba and 40 thieves" and "alibaba and 40 thieves"
The above approach does not work in this case.
Is there any standard algorithm that could be used?
It sounds like you're asking this question: "How do I determine if string A can be made equal to string B by removing (some) spaces?".
What you can do is iterate over both strings, advancing within both whenever they have the same character, otherwise advancing along the first when it has a space, and returning false otherwise. Like this:
static bool IsEqualToAfterRemovingSpacesFromOne(this string a, string b) {
return a.IsEqualToAfterRemovingSpacesFromFirst(b)
|| b.IsEqualToAfterRemovingSpacesFromFirst(a);
}
static bool IsEqualToAfterRemovingSpacesFromFirst(this string a, string b) {
var i = 0;
var j = 0;
while (i < a.Length && j < b.Length) {
if (a[i] == b[j]) {
i += 1;
j += 1;
} else if (a[i] == ' ') {
i += 1;
} else {
return false;
}
}
return i == a.Length && j == b.Length;
}
The above is just an ever-so-slightly modified string comparison. If you want to extend this to 'largest common substring', then take a largest common substring algorithm and do the same sort of thing: whenever you would have failed due to a space in the first string, just skip past it.
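A Python sketch of the same two-pointer comparison is shown below. The function names are chosen for this illustration, and the loop is a slight variation of the C# version: it keeps running until the first string is exhausted, so trailing spaces in the first string are skipped as well.

```python
def equal_after_removing_spaces_from_first(a: str, b: str) -> bool:
    """True if deleting some spaces from `a` can make it equal to `b`."""
    i = j = 0
    while i < len(a):
        if j < len(b) and a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] == " ":
            i += 1  # drop this space from a
        else:
            return False
    return j == len(b)


def equal_after_removing_spaces_from_one(a: str, b: str) -> bool:
    """Try removing spaces from either of the two strings."""
    return (equal_after_removing_spaces_from_first(a, b)
            or equal_after_removing_spaces_from_first(b, a))


print(equal_after_removing_spaces_from_one("b m w", "bmw"))
print(equal_after_removing_spaces_from_one("ali baba and 40 thieves",
                                           "alibaba and 40 thieves"))
```

Both calls run in time linear in the combined length of the two strings, since each index only moves forward.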
Did you look at suffix arrays (http://en.wikipedia.org/wiki/Suffix_array),
or at Jon Bentley's Programming Pearls?
Note: you will have to write the code that handles spaces yourself.
