MATLAB: Find string pattern with a list of words and replace in text with one word of the list

In MATLAB, consider the string:
str = 'text text text [[word1,word2,word3]] text text'
I want to isolate randomly one word of the list ('word1','word2','word3'), say 'word2', and then write, in a possibly new file, the string:
strnew = 'text text text word2 text text'
My approach is as follows (certainly pretty bad):
Isolating the string '[[word1,word2,word3]]' can be achieved via
str2=regexp(str,'\[\[(.*?)\]\]','match','once') % 'once' returns a char vector instead of a cell
Removing the opening and closing square brackets in the string is achieved via
str3=str2(3:end-2)
Finally we can split str3 into a list of words (stored in a cell)
ListOfWords = split(str3,',')
which outputs {'word1'}{'word2'}{'word3'} and I am stuck there. How can I pick one of the entries and plug it back into the initial string (or a copy of it...)? Note that the delimiters [[ and ]] could both be changed to || if it can help.

You can do it as follows:
Use regexp with the 'split' option;
Split the middle part into words;
Select a random word;
Concatenate back.
str = 'text text text [[word1,word2,word3]] text text'; % input
str_split = regexp(str, '\[\[|\]\]', 'split'); % step 1
list_of_words = split(str_split{2}, ','); % step 2
chosen_word = list_of_words{randi(numel(list_of_words))}; % step 3
strnew = [str_split{1} chosen_word str_split{3}]; % step 4

I have a horrible solution. I was trying to see if I could do it in one function call. You can... but at what cost! Abusing dynamic regular expressions like this barely counts as one function call.
I use a dynamic expression to process the comma separated list. The tricky part is selecting a random element. This is made exceedingly difficult because MATLAB's syntax doesn't support paren indexing off the result of a function call. To get around this, I stick it in a struct so I can dot index. This is terrible.
>> regexprep(str,'\[\[(.*)\]\]',"${struct('tmp',split(string($1),',')).tmp(randi(count($1,',')+1))}")
ans =
'text text text word3 text text'
Luis definitely has the best answer, but I think it could be simplified a smidge by not using regular expressions.
str = 'text text text [[word1,word2,word3]] text text'; % input
tmp = extractBetween(str,"[[","]]"); % step 1
tmp = split(tmp, ','); % step 2
chosen_word = tmp(randi(numel(tmp))) ; % step 3
strnew = replaceBetween(str,"[[","]]",chosen_word,"Boundaries","Inclusive") % step 4
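For readers working in Python rather than MATLAB, the same four steps (find the bracketed list, split it, pick a random word, substitute it back) can be sketched with re.sub and random.choice. This is only a rough equivalent and not part of the answers above. The name pick_from_lists is made up for illustration, and the sketch assumes the [[...]] lists are comma-separated and not nested.

```python
import random
import re

def pick_from_lists(text, rng=random):
    """Replace every [[a,b,c]] group in text with one randomly chosen item."""
    def choose(match):
        # match.group(1) is the comma-separated list between [[ and ]]
        words = [w.strip() for w in match.group(1).split(',')]
        return rng.choice(words)
    # The non-greedy (.*?) keeps each [[...]] group separate
    return re.sub(r'\[\[(.*?)\]\]', choose, text)

s = 'text text text [[word1,word2,word3]] text text'
print(pick_from_lists(s))
```

Because the replacement is a function, a text containing several [[...]] groups gets an independent random choice for each one.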

Related

how to extract a substring in a text file, when the substring is between two parentheses?

I have a text file that contains sections as shown below
V1('ww', '6deg')
V2('bb', '15meter')
V3('cc','25yards')
.
.
V4('dd', '72cm')
these sections are randomly distributed inside the text file.
Using MATLAB, I need to find all the occurrences of VariableProp(VarName, VarValue) in the file, and change the VarValue.
Any ideas?
Thank you
You can do this with textscan. (You could also probably do it with regexp). Here's a textscan approach:
str = "V4('dd', '72cm')"; % a line from the file
% Call textscan on a single line of text
x = textscan(str, "%[^(](%[^']%[^'])", ...
MultipleDelimsAsOne=true, Delimiter=[","," ", "'"]);
% x is a 3-element cell array. If we got a match, each element in the
% outer cell is a scalar. Use vertcat to unwrap a layer of cell-ness:
x = vertcat(x{:});
% If we're left with 3 elements, it was a match
isMatch = numel(x) == 3;

Move a character or word to a new line

Given a string, how do I move part of the string to a new line without moving the rest of the line or its characters?
'This' and 'this' word should go in the next line
Output:
> and word should go in the next line
This this
This is just an example of the output I want, assuming the words can differ. To be clearer: say I have some string elements in an array, and I have to move every second and third word of each element to a new line while printing the rest of the line as is. I've tried using \n and a for loop, but that also moves the rest of the string to a new line.
['This and this', 'word should go', 'in the next']
Output:
> This word in
and this should go the next
So the 2nd and 3rd words of each element are moved without affecting the rest of the line. Is it possible to do this without much complication? I'm aware of the format method but I don't know how to use it in this situation.
For your first example, in case you don't know the order of the target words in advance, I would use a dictionary to store the indices of the found words. Then you can sort those to put the found words in the second line in the same order as they appeared in the text:
targets = ['this', 'This']
source = 'This and this word should go in the next line.'
target_ixs = {source.find(target): target for target in targets}
line2 = ' '.join([target_ixs[i] for i in sorted(target_ixs)])
line1 = source
for target in targets:
    line1 = line1.replace(target, '')
# collapse the double spaces left behind by the removals
line1 = line1.replace('  ', ' ').lstrip()
result = line1 + '\n' + line2
print(result)
and word should go in the next line.
This this
Your second example is easier, because you already know which parts of the strings to put in the second line, so you just need to split each string into a list of words and select from those:
source = ['This and this', 'word should go', 'in the next']
source_lists = [s.split() for s in source]
line1 = ' '.join([source_list[0] for source_list in source_lists])
line2 = ' '.join([' '.join(source_list[1:]) for source_list in source_lists])
result = line1 + '\n' + line2
print(result)
This word in
and this should go the next
You can probably do quite a bit without much complication using the regular expression library and some python language features. That being said, it depends on how complex the rules are for determining what words go where. Typically, you want to start with a string and "tokenize" it into the constituent words. See the code example below:
import re
sentence = "This and this word should go in the next line"
all_words = re.split(r'\W+', sentence)
matched_words = " ".join(re.findall(r"this", sentence, re.IGNORECASE))
unmatched_words = " ".join([word for word in all_words if word not in matched_words])
print(f"{unmatched_words}\n{matched_words}")
> and word should go in the next line
This this
Final Thoughts:
I am by no means a regex ninja so, there may be even more clever things that can be done with just regex patterns and functions. Hopefully, this gives you some food for thought at least.
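One thing to watch in the snippet above: `word not in matched_words` is a substring test against the joined string "This this", so a word such as "his" or "is" would also be dropped. A sketch that compares whole words instead, assuming a case-insensitive match against a single target word (split_out is a hypothetical helper name):

```python
import re

def split_out(sentence, target):
    """Return (remaining words, matched words) as two space-joined strings."""
    words = re.findall(r"\w+", sentence)
    # Compare whole words, ignoring case, rather than testing substrings
    matched = [w for w in words if w.lower() == target.lower()]
    rest = [w for w in words if w.lower() != target.lower()]
    return " ".join(rest), " ".join(matched)

rest, matched = split_out("This and this word should go in the next line", "this")
print(rest + "\n" + matched)
```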
Got it:
data = ['This and this', 'word should go', 'in the next']
first_line = []
second_line = []
for item in data:
    item = item.split(' ')
    first_word = item[0]
    item.remove(first_word)
    others = " ".join(item)
    first_line.append(first_word)
    second_line.append(others)
print(" ".join(first_line) + "\n" + " ".join(second_line))
My Solution:
input_data = ['This and this', 'word should go ok', 'this next']
I've slightly altered your test string to better test the code.
# Example 1
# Print all words in input_data, moving any word matching the
# string "this" (match is case insensitive) to the next line.
print('Example 1')
lines = ([], [])
for words in input_data:
    for word in words.split():
        lines[word.lower() == 'this'].append(word)
result = ' '.join(lines[0]) + '\n' + ' '.join(lines[1])
print(result)
The code in example 1 sorts each word into the 2-element tuple, lines. The key part is the boolean expression that performs the string comparison: False indexes lines[0] and True indexes lines[1].
# Example 2
# Print all words in input_data, moving the second and third
# word in any string to the next line.
from itertools import count
print('\nExample 2')
lines = ([], [])
for words in input_data:
    for q in zip(count(), words.split()):
        lines[q[0] in (1, 2)].append(q[1])
result = ' '.join(lines[0]) + '\n' + ' '.join(lines[1])
print(result)
The next solution is basically the same as the first. I zip each word to an integer so you know the word's position when you get to the boolean expression which, again, sorts the words into their appropriate list in lines.
As you can see, this solution is fairly flexible and can be adjusted to fit a number of scenarios.
Good luck, and I hope this helped!
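The zip(count(), ...) pairing in Example 2 can also be written with the built-in enumerate, which yields the same (position, word) pairs. A minimal sketch of that variant, with the positions to move passed as a parameter; move_positions is a hypothetical name, not something from the answer above:

```python
def move_positions(items, positions=(1, 2)):
    """Split the words of each string into two lines by word position."""
    keep, moved = [], []
    for item in items:
        for index, word in enumerate(item.split()):
            # Words whose zero-based position is in `positions` go to line 2
            (moved if index in positions else keep).append(word)
    return ' '.join(keep) + '\n' + ' '.join(moved)

print(move_positions(['This and this', 'word should go', 'in the next']))
```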

How to break a text block up so that it will display only One Word on each line

I am importing longer form text into a Unity program. I need one word of the longer text to be displayed on each line...
Thanks
The problem with working with large blocks of text in Word is that operations like Find and Replace can only be performed with Find text strings of 255 characters or less without causing an error. Once you import your text and assign it to a string variable, you can use Len() to determine the length of the string and then use Left() Mid() and Right() to breakup the larger string into shorter chunks of 250 characters each. Here's some code I wrote for just a find and replace situation:
With Selection.Find
    y = Len(Selection.Text)
    Select Case y
        Case Is <= 250
            x = 1
            .Text = stFound
            .Execute Replace:=wdReplaceAll
        Case Is <= 500
            Dim stFound2 As String
            x = 2
            z = Len(stFound) - 250
            stFound1 = Left(stFound, 250)
            stFound2 = Right(stFound, z)
        Case Is <= 750
            ' stFound2 is already declared above; VBA rejects a second Dim of the same name
            Dim stFound3 As String
            x = 3
            stFound1 = Left(stFound, 250)
            ' take 250 characters (251-500) so that character 500 is not dropped
            stFound2 = Mid(stFound, 251, 250)
            stFound3 = Right(stFound, Len(stFound) - 500)
    End Select
End With
I then used a For Next loop to run a Find and Replace on each string.
In your situation, it's going to be important to not break up the strings in the middle of a word. To do this you can use the InStr() function to find the position of spaces within your string and then break up the text according to where the spaces are. I wouldn't try using the Split() function on the raw text as depending on the size of the string you could run into a Subscript Out of Range error.
Once the text is chunked down into usable pieces, use the Split() function to send each word to an array and then run the following code to put each word on its own line or paragraph:
Dim stTxt As String
Dim stWord As String
Dim stArr() As String
Dim x As Long
stTxt = "" ' assign one of your text strings here
stArr() = Split(stTxt)
For x = LBound(stArr()) To UBound(stArr())
    ' "^p" is a Find/Replace code and would be typed literally; vbCr ends the paragraph
    stWord = stArr(x) & vbCr
    Selection.TypeText stWord
Next
After a little more research, I determined that the 255 character limit to text strings only affects some functions, not all. So I took a 17,335 character (including spaces) Word document and ran Split() on it to create an Array. There were no errors and the resulting array had a UBound of 2690.
So the next question is what kind of text is being imported into Word and what size is it. Is it just a list of words separated by spaces, or another delimiter? Does it contain any punctuation? If it's just a list of words separated by spaces or another delimiter such as a comma or semicolon, the Split() function will sort the words into an Array, at least up to 17,000 characters. More testing would be required for a larger text block. If the text contains punctuation, you would have to process the text to remove the unwanted punctuation which can be done with a Wildcard Find and Replace as long as the Find string is <= 255 characters. But if all you have are words and spaces or some other delimiter, using Split() to separate each word into an array element would work and then just run code as in the second half of my previous example:
For x = LBound(stArr()) To UBound(stArr())
    ' "^p" is a Find/Replace code and would be typed literally; vbCr ends the paragraph
    stWord = stArr(x) & vbCr
    Selection.TypeText stWord
Next
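If the text is available as a plain string outside Word, the split-then-join idea described above is short in most languages. A minimal Python sketch of it, assuming words are separated by whitespace and that punctuation should stay attached to its word; one_word_per_line is a hypothetical name:

```python
def one_word_per_line(text):
    """Put each whitespace-separated word of text on its own line."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace
    return "\n".join(text.split())

print(one_word_per_line("I need one word  per line"))
```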

Is there a way to substring, which is between two words in the string in Python?

My question is more or less similar to:
Is there a way to substring a string in Python?
but it's more specifically oriented.
How can I get the part of a string that is located between two known words in the initial string?
Example:
myString = "this is the initial string"
Substring = "initial"
knowing that "the" and "string" are the two known words in the string that can be used to get the substring.
Thank you!
You can start with simple string manipulation here. str.index is your best friend there, as it will tell you the position of a substring within a string; and you can also start searching somewhere later in the string:
>>> myString = "this is the initial string"
>>> myString.index('the')
8
>>> myString.index('string', 8)
20
Looking at the slice [8:20], we already get close to what we want:
>>> myString[8:20]
'the initial '
Of course, since we found the beginning position of 'the', we need to account for its length. And finally, we might want to strip whitespace:
>>> myString[8 + 3:20]
' initial '
>>> myString[8 + 3:20].strip()
'initial'
Combined, you would do this:
startIndex = myString.index('the')
substring = myString[startIndex + 3 : myString.index('string', startIndex)].strip()
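The combined version hard-codes the length of 'the' as 3 and raises ValueError when either word is missing. A small sketch that wraps the same str.index approach in a function, using len() for the offset and returning None when there is no match; the name between and the None convention are assumptions made for illustration:

```python
def between(text, start_word, end_word):
    """Return the text between the first start_word and the next end_word.

    Returns None when either word is missing.
    """
    try:
        # Position just past the end of start_word
        start = text.index(start_word) + len(start_word)
        # Look for end_word only after that position
        end = text.index(end_word, start)
    except ValueError:
        return None
    return text[start:end].strip()

print(between("this is the initial string", "the", "string"))
```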
If you want to look for matches multiple times, then you just need to repeat doing this while looking only at the rest of the string. Since str.index will only ever find the first match, you can use this to scan the string very efficiently:
searchString = 'this is the initial string but I added the relevant string pair a few more times into the search string.'
startWord = 'the'
endWord = 'string'
results = []
index = 0
while True:
    try:
        startIndex = searchString.index(startWord, index)
        endIndex = searchString.index(endWord, startIndex)
        results.append(searchString[startIndex + len(startWord):endIndex].strip())
        # move the index to the end
        index = endIndex + len(endWord)
    except ValueError:
        # str.index raises a ValueError if there is no match; in that
        # case we know that we’re done looking at the string, so we can
        # break out of the loop
        break
print(results)
# ['initial', 'relevant', 'search']
You can also try something like this:
mystring = "this is the initial string"
mystring = mystring.strip().split(" ")
for i in range(1, len(mystring) - 1):
    if mystring[i-1] == "the" and mystring[i+1] == "string":
        print(mystring[i])
I suggest using a combination of list, split and join methods.
This should help if you are looking for more than 1 word in the substring.
Turn the string into array:
words = string.split()
Get the index of your opening and closing markers, then return the substring:
start = words.index('the')
end = words.index('string')
substring = ' '.join(words[start+1:end])
You may want to check that both markers are present before proceeding, since list.index raises a ValueError when a word is missing.
If your problem gets more complex, e.g. multiple occurrences of the pair of words, I suggest using a regular expression.
import re
substrings = re.findall(r'the (.+?) string', string)
re.findall returns the matches as a list, so each substring is stored separately.
I include the spaces around the group in the pattern so that they are left out of the match; you can modify this to fit your needs as well.
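A slightly more defensive sketch of the regular expression approach, assuming the two marker words should match only as whole words. re.escape guards against markers containing regex metacharacters and \b marks word boundaries; all_between is a hypothetical name:

```python
import re

def all_between(text, start_word, end_word):
    """Return every shortest substring found between start_word and end_word."""
    pattern = r"\b{}\b\s+(.+?)\s+\b{}\b".format(
        re.escape(start_word), re.escape(end_word))
    return re.findall(pattern, text)

text = ("this is the initial string but I added the relevant string "
        "pair a few more times into the search string.")
print(all_between(text, "the", "string"))
```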

Python 2.7 - remove special characters from a string and camelCasing it

Input:
to-camel-case
to_camel_case
Desired output:
toCamelCase
My code:
def to_camel_case(text):
    lst =['_', '-']
    if text is None:
        return ''
    else:
        for char in text:
            if text in lst:
                text = text.replace(char, '').title()
                return text
Issues:
1) The input could be an empty string - the above code does not return '' but None;
2) I am not sure that the title() method can help me obtain the desired output (only the first letter of each word in caps, except for the first word).
I prefer not to use regex if possible.
A better way to do this would be using a list comprehension. The problem with a for loop is that when you remove characters from text, the loop changes (since you're supposed to iterate over every item originally in the loop). It's also hard to capitalize the next letter after replacing a _ or - because you don't have any context about what came before or after.
def to_camel_case(text):
    # Split also removes the characters
    # Start by converting - to _, then splitting on _
    l = text.replace('-','_').split('_')
    # No text left after splitting
    if not len(l):
        return ""
    # Break the list into two parts
    first = l[0]
    rest = l[1:]
    return first + ''.join(word.capitalize() for word in rest)
And our result:
print to_camel_case("hello-world")
Gives helloWorld
This method is quite flexible, and can even handle cases like "hello_world-how_are--you--", which could be difficult using regex if you're new to it.
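The question also asks for '' when the input is None or empty. A sketch of the same split-and-capitalize approach with that guard added, written so it runs on both Python 2.7 and Python 3; dropping the empty pieces left by repeated delimiters is an extra assumption about the desired behavior:

```python
def to_camel_case(text):
    """Convert 'to-camel-case' or 'to_camel_case' to 'toCamelCase'."""
    if not text:
        # Covers both None and the empty string
        return ''
    # Normalize both delimiters to '_' and drop the empty pieces that
    # repeated or trailing delimiters produce
    parts = [p for p in text.replace('-', '_').split('_') if p]
    if not parts:
        return ''
    return parts[0] + ''.join(word.capitalize() for word in parts[1:])

print(to_camel_case("to-camel-case"))
```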
