I need to know if two strings "match" where "matching" basically means that there is significant overlap between the two strings. For example, if string1 is "foo" and string2 is "foobar", this should be a match. If string2 was "barfoo", that should also be a match with string1. However, if string2 was "fobar", this should not be a match. I'm struggling to find a clever way to do this. Do I need to split the strings into lists of characters first or is there a way to do this kind of comparison already in Groovy? Thanks!
Using Apache commons StringUtils:
#Grab( 'org.apache.commons:commons-lang3:3.1' )
import static org.apache.commons.lang3.StringUtils.getLevenshteinDistance
int a = getLevenshteinDistance( 'The quick fox jumped', 'The fox jumped' )
int b = getLevenshteinDistance( 'The fox jumped', 'The fox' )
// Assert a is more similar than b
assert a < b
Levenshtein Distance tells you the number of characters that have to change for one string to become another
So to get from 'The quick fox jumped' to 'The fox jumped', you need to delete 6 chars (so it has a score of 6)
And to get from 'The fox jumped' to 'The fox', you need to delete 7 chars.
As per your examples, plain old String.contains may suffice:
assert 'foobar'.contains('foo')
assert 'barfoo'.contains('foo')
assert !'fobar'.contains('foo')
Related
I am searching for a way to use a formatter to put a space between two characters. i thought it would be easy with a string formatter.
here is what i am trying to accomplish:
given: "AB" it will produce "A B"
Here is what i have tried so far:
"AB".format("%#s")
but this keep returning "AB" i want "A B". i thought the number sign could be used for space.
i also tried this:
"26".format("%#d") but its still prints "26"
is there anyway to do this with string.formatter.
It is kind of possible with the string formatter although not directly with a pattern.
jshell> String.format("%1$c %2$c", "AB".chars().boxed().toArray())
$10 ==> "A B"
We need to turn the string into an object array so it can be passed in as varargs and the formatter pattern can extract characters based on index (1$ and 2$) and format them as characters (c).
A much simpler regex solution is the following which scales to any number of characters:
jshell> "ABC^&*123".replaceAll(".", "$0 ").trim()
$3 ==> "A B C ^ & * 1 2 3"
All single characters are replaced with them-self ($0) followed by a space. Then the last extra space is removed with the trim() call.
I could not find way to do this using String#format. But here is a way to accomplish this using regex replacement:
String input = "AB";
String output = input.replaceAll("(?<=[A-Z])(?=[A-Z])", " ");
System.out.println(output);
The regex pattern (?<=[A-Z])(?=[A-Z]) will match every position in between two capital letters, and interpolate a space at that point. The above script prints:
A B
I am building a puzzle word game in Python. I have the correct puzzle word, and the guessed puzzle word. I want to build a third string which shows the correct letters in the guessed puzzle in the correct puzzle word, and _ at the position of the incorrect letters.
For example, say the correct word is APPLE and the guessed word is APTLE
then i want to have a third string: AP_L_
The guessed word and correct word are guaranteed to be 3 to 5 characters long, but the guessed word is not guaranteed to be the same length as the correct word
For example, correct word is TEA and the guessed word is TEAKO, then the third string should be TEA__ because the players guessed the last two letters incorrectly.
Another example, correct word is APPLE and guessed word is POP, the third string should be:
_ _ P_ _ (without space separation)
I can successfully get the matched indexes of the correct and guessed word; however, I am having problems building the third string. I just learned that strings in Python are immutable and that i cannot assign something like str1[index] = str2[index]
I have tried many things, including using lists, but i am not getting the correct answer. The attached code is my most recent attempt, would you please help me solve this?
Thank you
find the match between puzzle_word and guess
def matcher(str_a, str_b):
#find indexes where letters overlap
matched_indexes = [i for i, (a, b) in enumerate(zip(str_a, str_b)) if a == b]
result = []
for i in str_a:
result.append('_')
for value in matched_indexes:
result[value].replace('_', str_a[value])
print(result)
matcher("apple", "allke")
the output result right now is list of five "_"
cases:
correct word is APPLE and the guessed word is APTLE third
string: AP_L_
correct word is TEA and the guessed word is TEAKO,
third string should be TEA__
correct word is APPLE and guessed
word is POP, third string should be _ _ P_ _
You can use itertools.zip_longest here to always make sure you pad out to the longest word provided and then create a new string by joining the matching characters or otherwise a _. eg:
from itertools import zip_longest
correct_and_guess = [
('APPLE', 'APTLE'),
('TEA', 'TEAKO'),
('APPLE', 'POP')
]
for correct, guess in correct_and_guess:
# If characters in same positions match - show character otherwise `_`
new_word = ''.join(c if c == g else '_' for c, g in zip_longest(correct, guess, fillvalue='_'))
print(correct, guess, new_word)
Will print the following:
APPLE APTLE AP_LE
TEA TEAKO TEA__
APPLE POP __P__
Couple of things here.
str.replace() does not replace inline; as you noted strings are immutable, so you have to assign the result of replace:
result[value] = result[value].replace('_', str_a[value])
However, there's no point doing this since you can just assign to the list element:
result[value] = str_a[value]
And finally you can assign a list of the length of str_a without the for loop, which might be more readable:
result = ['_'] * len(str_a)
I've try to break down a sentence to letters and rearrange them alphabetically.
Please see if i could improve this code in someway.
Kind regards
sen = "the quick brown fox jumps over the lazy dog"
smallest=[]
re=''
while len(sen) >0:
smallest.append( min(sen))
print(ord(min(sen)))
re=re+min(sen)
sen = sen[:sen.index(min(sen))]+sen[sen.index(min(sen))+1:]
counter+=1
print(smallest) #list
print(re) #string
What you're doing is exactly like sorting an array of number. You got a value for each char. There is many way to sort an array of number some are pretty fast but depends on what you're seeking. I think the fastest are insertion sort, bubble sort, or selection sort.
You can code them or find them already done in many languages.
There is other way to sort an array you can fin all of them here :
https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms
In MATLAB how can we compare 2 strings and print the common word out. For Example string1 = "hello my name is bob"; and string2 = "today bob went to the park"; the word bob is common in both. What is the structure to follow.
Use intersect with strsplit for a one-liner -
common_word = intersect(strsplit(string1),strsplit(string2))
strsplit splits each string to cells of words and then intersect finds the common one out.
If you would like to avoid strsplit, you can use regexp instead -
common_word =intersect(regexp(string1,'\s','Split'),regexp(string2,'\s','Split'))
Bonus: Removing stop-words from the common words
Let's add some stop-words that are common to these two strings -
string1 = 'hello my name is bob and I am going to the zoo'
string2 = 'today bob went to the park'
Using the solution presented earlier, you would get -
common_word =
'bob' 'the' 'to'
Now, these words - 'the' and 'to' are part of the stop-words. If you would like to have them removed, let me suggest this - Removing stop words from single string
and it's accepted solution.
The final output would be 'bob', whom you were looking for!
If you are looking for matching words only, that are separated by spaces, you can use strsplit to change each string into cell arrays of words, then loop through and search for each one.
str1 = 'test if this works';
str2 = 'does this work?';
cell1 = strsplit(str1);
cell2 = strsplit(str2);
for n = 1:length(cell1)
for m = 1:length(cell2)
if strcmp(cell1{n},cell2{m})
disp(cell1{n});
end
end
end
Notice that in my example the last member of cell2 is 'work?' so if you have punctuation in your strings, you'll have to do a check for that (isletter might help).
Question:
Write a program to remove fragment that occur in "all" strings,where a fragment
is 3 or more consecutive word.
Example:
Input::
s1 = "It is raining and I want to drive home.";
s2 = "It is raining and I want to go skiing.";
s3 = "It is hot and I want to go swimming.";
Output::
s1 = "It is raining drive home.";
s2 = "It is raining go skiing.";
s3 = "It is hot go swimming.";
Removed fragment = "and i want to"
The program will be tested again large files.
Efficiency will be taken into consideration.
Assumptions: Ignore capitalization ,punctuation. but preserve in output.
Note: Take care of cases like
a a a a a b c b c b c b c where removing would create more fragments.
My Solution: (which i think is not the most efficient)
Hash three word phrases into an int and store them in an array, for all strings.
reduces to array of numbers like
1 2 3 4 5
3 5 7 9 8
9 3 1 7 9
Problem reduces to intersection of arrays.
sort the arrays. (k * nlogn)
keep k pointers. if all equal match found. else increment the pointer pointing to least value.
To solve for the Note above. I was thinking of doing a lazy delete, i.e mark phrases for deletion and delete at the end.
Are there cases where my solution might not work? Can we optimize my solution/ find the best solution ?
First observation: replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.
Now you have the problem reduced to remove the longest letter sequence that appears in all words of a given list.
So you have to compute the longest common substring for a set of "words". You find it using a generalized suffix tree as this is the most efficient algorithm. This should do the trick and I believe has the best possible complexity.
The first step is as already suggested by izomorphius:
Replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.
For the second you don't need to know the longest common substring - you just want to erase it from all the strings.
Note that this is equivalent to erasing all common substrings of length exactly 3, because if you have a longer commmon substring, then its substrings with length 3 are also common.
To do that you can use a hash table (storing key value pairs).
Just iterate over the first string and put all it's 3-substrings into the hash table as keys with values equal to 1.
Then iterate over the second string and for each 3-substring x if x is in the hash table and its value is 1, then set the value to 2.
Then iterate over the third string and for each 3-substring x, if x is in the hash table and its value is 2, then set the value to 3.
...and so on.
At the end the keys that have the value of k are the common 3-substrings.
Now just iterate once more over all the strings and remove those 3-substrings that are common.
import java.io.*;
import java.util.*;
public class remove_unique{
public static void main(String args[]){
String s1 = "Everyday I do exercise if";
String s2 = "Sometimes I do exercise if i feel stressed";
String s3 = "Mostly I do exercise on morning";
String[] words1=s1.split("\\s");
String[] words2=s2.split("\\s");
String[] words3=s3.split("\\s");
StringBuilder sb = new StringBuilder();
for(int i=0;i<words1.length;i++){
for(int j=0;j<words2.length;j++){
for(int k=0;k<words3.length;k++){
if(words1[i].equals(words2[j]) && words2[j].equals(words3[k])
&&words3[k].equals(words1[i])){
//Concatenating the returned Strings
sb.append(words1[i]+" ");
}
}
}
}
System.out.println(s1.replaceAll(sb.toString(), ""));
System.out.println(s2.replaceAll(sb.toString(), ""));
System.out.println(s3.replaceAll(sb.toString(), ""));
}
}
//LAKSHMI ARJUNA
My solution would be something like,
F = all fragments with length > 3 shared by the first 2 lines, avoid overlaps
for each line from the 3rd line and up
remove fragments in F which do not exist in line, or cause overlaps
return sentences with fragments in F removed
I assume finding/matching fragments in sentences can be done with some known algo. but in terms of the time complexity for n lines this is O(n)