Compare two strings and extract common word out

In MATLAB, how can we compare two strings and print the words they have in common? For example, with string1 = "hello my name is bob"; and string2 = "today bob went to the park"; the word bob is common to both. What approach should I follow?

Use intersect with strsplit for a one-liner -
common_word = intersect(strsplit(string1),strsplit(string2))
strsplit splits each string into a cell array of words, and intersect then finds the words common to both.
If you would like to avoid strsplit, you can use regexp instead -
common_word = intersect(regexp(string1,'\s','Split'),regexp(string2,'\s','Split'))
Bonus: Removing stop-words from the common words
Let's add some stop-words that are common to these two strings -
string1 = 'hello my name is bob and I am going to the zoo'
string2 = 'today bob went to the park'
Using the solution presented earlier, you would get -
common_word =
'bob' 'the' 'to'
Now, the words 'the' and 'to' are stop-words. If you would like to have them removed, see the question "Removing stop words from single string" and its accepted solution.
The final output would be 'bob', whom you were looking for!
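For readers who want to see the whole idea in one place, here is a minimal Python sketch of the same two steps (intersect the word sets, then drop stop-words). It is not the MATLAB solution referred to above; the STOP_WORDS set and the common_words helper are hypothetical names chosen for illustration, and a real stop-word list would be much longer.

```python
# Hypothetical stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"the", "to", "is", "and", "a", "i", "am"}

def common_words(s1, s2, stop_words=STOP_WORDS):
    """Return the sorted words shared by s1 and s2, excluding stop-words."""
    # Split on whitespace and intersect the two word sets.
    shared = set(s1.split()) & set(s2.split())
    # Compare in lower case so that 'The' and 'the' are both filtered.
    return sorted(w for w in shared if w.lower() not in stop_words)

string1 = "hello my name is bob and I am going to the zoo"
string2 = "today bob went to the park"

print(common_words(string1, string2))                    # ['bob']
print(common_words(string1, string2, stop_words=set()))  # ['bob', 'the', 'to']
```

Passing an empty stop-word set reproduces the unfiltered result shown earlier.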

If you are looking for matching words only, that are separated by spaces, you can use strsplit to change each string into cell arrays of words, then loop through and search for each one.
str1 = 'test if this works';
str2 = 'does this work?';
cell1 = strsplit(str1);
cell2 = strsplit(str2);
for n = 1:length(cell1)
    for m = 1:length(cell2)
        if strcmp(cell1{n},cell2{m})
            disp(cell1{n});
        end
    end
end
Notice that in my example the last member of cell2 is 'work?' so if you have punctuation in your strings, you'll have to do a check for that (isletter might help).

Related

Extract different numbers from multiple strings

In a .csv spreadsheet, I have multiple strings with incrementing numerical values contained in each, and I need to extract the numbers from each string. For example, here are several strings:
DEVICE1.CM1 - 4.1.1.C1.CA_VALUE (A)
DEVICE1.CM2 - 6.7.1.C2.CA_VALUE (A)
DEVICE1.CM1 - 4.1.2.C1.CA_VALUE (A)
DEVICE1.CM1 - 4.1.2.C2.CA_VALUE (A)
DEVICE1.CM1 - 4.1.2.C3.CA_VALUE (A)
DEVICE1.CM1 - 5.1.1.C1.CA_VALUE (A)
DEVICE1.CM1 - 5.1.1.C2.CA_VALUE (A)
DEVICE1.CM1 - 5.10.1.C3.CA_VALUE (A)
DEVICE1.CM1 - 6.13.1.C10.CA_VALUE (A)
And I am looking to extract "4.1.1.C1" from the first string, and "6.7.1.C2" from the second string.
I have over 1000 strings, each with a different incremental value of the form "#.#.#.C#". All of the options I have tried so far involve searching for a specific value to extract, rather than extracting all values of that general form. Is there any reasonable way to accomplish this?

I am not a big fan of regular expressions because they are often hard to read, but this is a typical example where you should use them. Read carefully the Q&A BigBen linked to in the comments.
Function extractCode(s As String) As String
    Static rx As RegExp
    If rx Is Nothing Then Set rx = New RegExp
    rx.Pattern = "\d+\.\d+\.\d+\.C\d+"   ' \d+ after C so that codes such as C10 match in full
    If rx.Test(s) Then
        extractCode = rx.Execute(s)(0)
    End If
End Function
(You will need to add a reference to the Microsoft VBScript Regular Expressions library.)
Note that the dots must be escaped; otherwise each dot is a placeholder for any character, and the pattern would also match something like "4x1y2zC3".
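To make the effect of escaping the dot concrete, here is a small Python sketch using the standard re module on two of the sample strings from the question. It is an illustration rather than part of the VBA answer: extract_code and CODE_RE are hypothetical names, and the \d+ after C is an assumption made so that multi-digit suffixes such as C10 are captured in full.

```python
import re

# Escaped dots match literal periods only; \d+ after C allows suffixes like C10.
CODE_RE = re.compile(r"\d+\.\d+\.\d+\.C\d+")

def extract_code(s):
    """Return the first '#.#.#.C#' code found in s, or '' if there is none."""
    m = CODE_RE.search(s)
    return m.group(0) if m else ""

print(extract_code("DEVICE1.CM1 - 4.1.1.C1.CA_VALUE (A)"))    # 4.1.1.C1
print(extract_code("DEVICE1.CM1 - 6.13.1.C10.CA_VALUE (A)"))  # 6.13.1.C10

# With unescaped dots, each '.' matches any character, so junk also matches.
unescaped = re.compile(r"\d+.\d+.\d+.C\d+")
print(bool(unescaped.search("4x1y2zC3")))  # True
print(extract_code("4x1y2zC3") == "")      # True: the escaped pattern rejects it
```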

A worksheet formula also works, assuming the string is in cell A1:
=MID(A1,FIND("-",A1,1)+2,(FIND("_",A1,1)-FIND("-",A1,1))-5)

The fixed structure (codes are always preceded by " - " and followed by ".CA_VALUE (A)") allows you to isolate the code string via Split as follows:
treat ".CA_VALUE (A)" as the closing delimiter, but replace its occurrence(s) with "- "
execute Split on the resulting string using only the first delimiter (StartDelim "- ")
isolate the second token (index 1, as Split results are zero-based)
Function ExtractCode(ByVal s As String) As String
Const StartDelim As String = "- "
Const ClosingDelim As String = ".CA_VALUE (A)"
ExtractCode = Split(Replace(s, ClosingDelim, StartDelim), StartDelim)(1)
End Function
Another approach, with a focus on splitting via point delimiters (edit 2021-11-20)
If you want to experiment with a fixed start position of your four-item code in a split array (based on point delimiters "."), you might also consider the following approach:
split via point delimiters "."
filter only the 3rd, 4th, 5th and 6th items via WorksheetFunction.Index (by its columns argument)
join the resulting items again via connecting points "."
a) Using Excel for Microsoft 365
Function ExtractCode(ByVal s As String, Optional startPos As Long = 3) As Variant
Const delim As String = "."
Dim tmp
tmp = Split(Replace(s, "- ", delim), delim) ' normalize hyphen to point delimiter
With Application.WorksheetFunction
ExtractCode = Join(.Index(tmp, 0, .Sequence(1, 4, startPos)), ".")
End With
End Function
b) Make it backwards compatible
Just change the function result assignment to
ExtractCode = Join(.Index(tmp, 0, Evaluate("{1,2,3,4}-1+" & startPos)), ".")
In both cases this sets the Index column argument to the 1-based column numbers Array(3,4,5,6).

Overlapping values of strings in Python

I am building a puzzle word game in Python. I have the correct puzzle word, and the guessed puzzle word. I want to build a third string which shows the correct letters in the guessed puzzle in the correct puzzle word, and _ at the position of the incorrect letters.
For example, say the correct word is APPLE and the guessed word is APTLE
then I want to have a third string: AP_L_
The guessed word and correct word are guaranteed to be 3 to 5 characters long, but the guessed word is not guaranteed to be the same length as the correct word
For example, correct word is TEA and the guessed word is TEAKO, then the third string should be TEA__ because the players guessed the last two letters incorrectly.
Another example, correct word is APPLE and guessed word is POP, the third string should be:
_ _ P_ _ (without space separation)
I can successfully get the matched indexes of the correct and guessed word; however, I am having problems building the third string. I just learned that strings in Python are immutable and that I cannot assign something like str1[index] = str2[index].
I have tried many things, including using lists, but I am not getting the correct answer. The attached code is my most recent attempt; would you please help me solve this?
Thank you
# find the match between puzzle_word and guess
def matcher(str_a, str_b):
    # find indexes where letters overlap
    matched_indexes = [i for i, (a, b) in enumerate(zip(str_a, str_b)) if a == b]
    result = []
    for i in str_a:
        result.append('_')
    for value in matched_indexes:
        result[value].replace('_', str_a[value])
    print(result)

matcher("apple", "allke")
The output right now is a list of five "_".
Cases:
correct word is APPLE and the guessed word is APTLE: third string should be AP_L_
correct word is TEA and the guessed word is TEAKO: third string should be TEA__
correct word is APPLE and the guessed word is POP: third string should be _ _ P_ _

You can use itertools.zip_longest here to pad out to the longer of the two words, and then build a new string by joining each matching character, or a _ otherwise. For example:
from itertools import zip_longest

correct_and_guess = [
    ('APPLE', 'APTLE'),
    ('TEA', 'TEAKO'),
    ('APPLE', 'POP')
]

for correct, guess in correct_and_guess:
    # If characters in same positions match - show character otherwise `_`
    new_word = ''.join(c if c == g else '_' for c, g in zip_longest(correct, guess, fillvalue='_'))
    print(correct, guess, new_word)
Will print the following:
APPLE APTLE AP_LE
TEA TEAKO TEA__
APPLE POP __P__

Couple of things here.
str.replace() does not modify the string in place; as you noted, strings are immutable, so you have to assign the result of replace:
result[value] = result[value].replace('_', str_a[value])
However, there's no point doing this since you can just assign to the list element:
result[value] = str_a[value]
And finally you can assign a list of the length of str_a without the for loop, which might be more readable:
result = ['_'] * len(str_a)
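Putting these three points together, a corrected version of the function might look like the sketch below. Two things here are assumptions drawn from the question's examples rather than from the original code: the result is padded to the longer of the two words (so TEA vs TEAKO gives TEA__), and the function returns a joined string instead of printing a list.

```python
def matcher(correct, guess):
    """Show letters of `correct` that `guess` has in the same position; '_' elsewhere."""
    # Assumption: pad to the longer word so extra guessed letters show as '_'.
    result = ["_"] * max(len(correct), len(guess))
    # zip stops at the shorter word, so only overlapping positions are compared.
    for i, (c, g) in enumerate(zip(correct, guess)):
        if c == g:
            result[i] = c  # lists are mutable, so direct assignment works
    return "".join(result)

print(matcher("APPLE", "APTLE"))  # AP_LE
print(matcher("TEA", "TEAKO"))    # TEA__
print(matcher("APPLE", "POP"))    # __P__
```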

Convert underscores to spaces in Matlab string?

So say I have a string with some underscores like hi_there.
Is there a way to auto-convert that string into "hi there"?
(the original string, by the way, is a variable name that I'm converting into a plot title).

Surprising that no-one has yet mentioned strrep:
>> strrep('string_with_underscores', '_', ' ')
ans =
string with underscores
which should be the official way to do simple string replacements. For such a simple case, regexprep is overkill: yes, regular expressions are Swiss Army knives that can do almost anything, but they come with a long manual. The string indexing shown by AndreasH only works for replacing single characters; it cannot do this:
>> s = 'string*-*with*-*funny*-*separators';
>> strrep(s, '*-*', ' ')
ans =
string with funny separators
>> s(s=='*-*') = ' '
Error using ==
Matrix dimensions must agree.
As a bonus, it also works for cell-arrays with strings:
>> strrep({'This_is_a','cell_array_with','strings_with','underscores'},'_',' ')
ans =
'This is a' 'cell array with' 'strings with' 'underscores'

Try this Matlab code for a string variable 's'
s(s=='_') = ' ';

If you ever have to do anything more complicated, say doing a replacement of multiple variable length strings,
s(s == '_') = ' ' will be a huge pain. If your replacement needs ever get more complicated consider using regexprep:
>> regexprep({'hi_there', 'hey_there'}, '_', ' ')
ans =
'hi there' 'hey there'
That being said, in your case @AndreasH.'s solution is the most appropriate and regexprep is overkill.
A more interesting question is why you are passing variables around as strings?

regexprep() may be what you're looking for and is a handy function in general.
regexprep('hi_there','_',' ')
Will take the first argument string, and replace instances of the second argument with the third. In this case it replaces all underscores with a space.

In Matlab strings are vectors, so performing simple string manipulations can be achieved using standard operators e.g. replacing _ with whitespace.
text = 'variable_name';
text(text=='_') = ' '; % replace all occurrences of underscore with whitespace
=> text = variable name

I know this was already answered; however, in my case I was looking for a way to correct plot titles so that I could include a filename (which could have underscores). I wanted to print them with the underscores NOT displayed as subscripts. So, using the great info above, rather than substituting a space, I escaped the underscore in the substitution.
For example:
% Have the user select a file:
[infile inpath]=uigetfile('*.txt','Get some text file');
figure
% this is a problem for filenames with underscores
title(infile)
% this correctly displays filenames with underscores
title(strrep(infile,'_','\_'))

Lua frontier pattern match (whole word search)

Can someone help me with this, please?
s_test = "this is a test string this is a test string "
function String.Wholefind(Search_string, Word)
_, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]',"")
return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds the whole word no problem and gsub counts the whole words no problem but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work if the numbers aren't at the start or end of the search string.

Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
If you need to search for fragments that include letters and numbers, then your pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'.
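For comparison, here is how the same "word delimited by whitespace" idea can be sketched in Python. Python's re module has no %f frontier pattern, so this sketch swaps in negative lookarounds, which play a similar role: (?<!\S) requires that no non-space character precedes the word, and (?!\S) requires that none follows. The function name whole_find is hypothetical.

```python
import re

def whole_find(text, word):
    """Count occurrences of `word` bounded by whitespace or the ends of `text`."""
    # re.escape keeps any regex metacharacters in `word` literal.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    return len(re.findall(pattern, text))

print(whole_find("this is a test string this is a test string ", "string"))  # 2
print(whole_find(" 123test 123test 123", "123test"))                         # 2
print(whole_find(" 123test 123test 123", "123"))                             # 1
```

As with the %f[%S] and %f[%s] version, a word followed directly by punctuation (for example "string.") would not count as a match under this definition.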

MATLAB search cell array for string subset

I'm trying to find the locations where a substring occurs in a cell array in MATLAB. The code below works, but is rather ugly. It seems to me there should be an easier solution.
cellArray = [{'these'} 'are' 'some' 'nicewords' 'and' 'some' 'morewords'];
wordPlaces = cellfun(@length,strfind(cellArray,'words'));
wordPlaces = find(wordPlaces); % Word places is the locations.
cellArray(wordPlaces);
This is similar to, but not the same as this and this.

The thing to do is to encapsulate this idea as a function. Either inline:
substrmatch = @(x,y) ~cellfun(@isempty,strfind(y,x))
findmatching = @(x,y) y(substrmatch(x,y))
Or contained in two m-files:
function idx = substrmatch(word,cellarray)
idx = ~cellfun(@isempty,strfind(cellarray,word));
and
function newcell = findmatching(word,oldcell)
newcell = oldcell(substrmatch(word,oldcell));
So now you can just type
>> findmatching('words',cellArray)
ans =
'nicewords' 'morewords'
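The same encapsulation can be sketched in Python for comparison, where a list of strings stands in for the cell array. The names substr_match, find_matching and match_positions are hypothetical, and positions here are 0-based, unlike MATLAB's 1-based find.

```python
def substr_match(word, items):
    """Return a list of booleans: True where `word` occurs inside the item."""
    return [word in item for item in items]

def find_matching(word, items):
    """Return only the items that contain `word` as a substring."""
    return [item for item in items if word in item]

def match_positions(word, items):
    """Return the 0-based positions of the items that contain `word`."""
    return [i for i, hit in enumerate(substr_match(word, items)) if hit]

cell_array = ["these", "are", "some", "nicewords", "and", "some", "morewords"]
print(find_matching("words", cell_array))    # ['nicewords', 'morewords']
print(match_positions("words", cell_array))  # [3, 6]
```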

I don't know if you would consider it a simpler solution than yours, but regular expressions are a very good general-purpose utility that I often use for searching strings. One way to extract the cells from cellArray that contain 'words' is as follows:
>> matches = regexp(cellArray,'^.*words.*$','match'); %# Extract the matches
>> matches = [matches{:}] %# Remove empty cells
matches =
'nicewords' 'morewords'
