Matlab: How to delete prefix from strings - string

Problem: From TrajCompact, i find all the prefix and the value after prefix, using regexp, with this code:
[digits{1:2}] = ndgrid(0:4);
for k=1:25
matches(:,k)=regexp(TrajCompact(:,1),sprintf('%d%d.*',digits{1}(k),digits{2}(k)),'match','once');
end
I want only the postfix of matches, how can delete the prefix from matches?

Method using regular expressions
You can put the .* section in a group by enclosing it in parenthesis (i.e. (.*)). Matlab has some peculiar 'token' nomenclature for this. In any case, an example of how it works:
[match, group] = regexp('25blah',sprintf('%d%d(.*)',2,5),'match','once','tokens');
Then:
match would be a char array containing '25blah'
group would be a 1x1 cell array containing the string 'blah'.
That is, the variable group would hold what you're looking for.
Hack method
Since your prefix is always two digits, you could also just take everything from the 3rd character of the match onwards:
my_string = match(3:end);
other comments
You may want to require the prefix to occur at the beginning of the string by adding ^ to the beginning of your regular expression. Eg., make the line:
[match, group] = regexp('25blah',sprintf('^%d%d(.*)',2,5),'match','once','tokens');
As it is, your current regular expression would match strings like zzzzzzzzz25stuff. I'm not sure if you want that (assuming it can occur in your data).

Related

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable length lookbacks in Python on there, and I'm using the regex library to allow for this in my actual implementation.
Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)
The pattern you tried does not match because this part (\d)(?<!\1) can not match.
It reads as:
Capture a digit in group 1. Then, on the position after that captured
digit, assert what is captured should not be on the left.
You could make the pattern work by adding for example a dot after the backreference (?<!\1.) to assert that the value before what you have just matched is not the same as group 1
Pattern
(\d)(?<!\1.)\1(?!\1)
Regex demo | Python demo
Note that you have selected ECMAscript on regex101.
Python re does not support variable width lookbehind.
To make this work in Python, you need the PyPi regex module.
Example code
import regex
pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
"1151223\n\n"
"112223\n"
"123544")
matches = regex.finditer(pattern, test_str)
for matchNum, match in enumerate(matches, start=1):
print(match.group())
Output
22
11
22
11
44
#Theforthbird has provided a good explanation for why your regular explanation does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Demo
Python's re regex engine performs the following operations.
^.$ match the first char if it is the only char in the line
| or
^ match beginning of line
(.) match a char in capture group 1...
(?!\1) ...that is not followed by the same character
| or
(?<=(.)) save the previous char in capture group 2...
(?!\2) ...that is not equal to the next char
(.) match a character and save to capture group 3...
(?!\3) ...that is not equal to the following char
Suppose the string were "cat".
The internal string pointer is initially at the beginning of the line.
"c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
"c" is matched and saved to capture group 1.
The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the assertion so the third part is considered.
The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
The next character ("a") is matched and saved in capture group 3.
The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".

How to match a part of string before a character into one variable and all after it into another

I have a problem with splitting string into two parts on special character.
For example:
12345#data
or
1234567#data
I have 5-7 characters in first part separated with "#" from second part, where are another data (characters,numbers, doesn't matter what)
I need to store two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.
Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)
See this documentation:
First of all, although Lua does not have a split function is its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
print(i)
end
See IDEONE demo
The [^#]+ pattern matches one or more characters other than # (so, it "splits" a string with 1 character).

Efficient way to insert characters between other characters in a string

What is an efficient way in MATLAB to replace/insert one symbol (in series of symbols) with several others that correspond to the one that is being replaced?
For example, consider having a string Eq: Eq = 'A*exp(-((x-xc)/w)^2)'. Is there a way to replace * with .*, / with ./,\ with .\, and ^ with .^ without writing four separate strrep() lines?
Regular expressions will do the job nicely. Regular expressions simply find patterns in text. You specify what kind of pattern you are looking for by a regular expression, and the output gives you the locations of where the pattern occurred.
For our particular case, not only do we want to find where patterns occur, we also want to replace those patterns with something else. Specifically, use the function regexprep from MATLAB to replace matches in a string with something else. What you want to do is replace all *, /, \ and ^ symbols by adding a . in front of each.
How regexprep works is that the first input is the string you're looking at, the second input is a pattern that you're trying to find. In our case, we want to find any of *, /, \ and ^. To specify this pattern, you put those desired symbols in [] brackets. Regular expressions reserve \ as a special symbol to delineate characters that can be parsed as a regular expression but actually aren't. As such, you need to use \\ for the \ character and \^ for the ^ character. The third input is what you want to replace each match with. In our case, we simply want to reuse each matched character, but we add a . at the beginning of the match. This is done by doing \.$0 in the regular expression syntax. $0 means to grab the first token produced by a match... which is essentially the matched symbol from the pattern. . is also a reserved keyword using regular expressions, so we must prepend this symbol with a \ character.
Without further ado:
>> Eq = 'A*exp(-((x-xc)/w)^2)';
>> out = regexprep(Eq, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2)
The pattern we are looking for is [*/\\\^], which means that we want to find any of *, /, \ - denoted as \\ in regex, and \^ - denoted as ^ in regex. We want to find any of these symbols and replace them with the same symbol by adding a . character in front - \.$0.
As a more complicated example, let's make sure that we include all of the symbols you're looking for in a sample equation:
>> A = 'A*exp(-((x-xc)/w)^2) \ b^2';
>> out = regexprep(A, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2) .\ b.^2
I'd go with regexp as in rayryeng's answer. But here's another approach, just to provide an alternative.
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
[~, jj] = sort([1:numel(Eq) ii-.5]); %// will be used to properly order the result
result = [Eq repmat('.',1,numel(ii))]; %// insert dots at the end
result = result(jj); %// properly order the result
And a variant:
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
jj = sort([1:numel(Eq) ii-.5]); %// dot locations are marked with fractional part
result = Eq(ceil(jj)); %// repeat characters where the dots will be placed
result(mod(jj,1)>0) = '.'; %// place dots at indices with fractional part
The vectorize function already does almost all of what you want except that it does not convert mldivide (\) to ldivide (.\).
By "efficient," do you mean fewer lines of code or faster? Regular expressions are almost always slower than other approaches and less readable. I don't think they're necessary or a good choice in this case. If you only need to convert your string once, then speed is less of a concern than readability (strrep will still be faster). If you need to do it many times, this simple code that you alluded to is 4–5 times faster than regexrep for short strings like your example (and much faster for longer strings):
out = strrep(Eq,'*','.*');
out = strrep(out,'/','./');
out = strrep(out,'\','.\');
out = strrep(out,'^','.^');
If you want one line, use:
out = strrep(strrep(strrep(strrep(Eq,'*','.*'),'/','./'),'\','.\'),'^','.^');
which will also be slightly faster still. Or create your own version of vectorize and call that.
Where regular expressions shine is in more complex cases, e.g., if your string is already partially vectorized: Eq = 'A.*exp(-((x-xc)/w)^2)'. Even still, the vectorize function just uses strrep and then calls strfind to "remove any possible '..*', '../', etc." and replace them with the proper element-wise operators because it's faster (symbolic math strings can get very large, for example).

Algorithms for "shortening" strings?

I am looking for elegant ways to "shorten" the (user provided) names of object. More precisely:
my users can enter free text (used as "name" of some object), they can use up to 64 chars (including whitespaces, punctuation marks, ...)
in addition to that "long" name; we also have a "reduced" name (exactly 8 characters); required for some legacy interface
Now I am looking for thoughts on how to generate these "reduced" names, based on the 64-char name.
With "elegant" I am wondering about any useful ideas that "might" allow the user to recognize something with value within the shortened string.
Like, if the name is "Production Test Item A5"; then maybe "PTIA5" might (or might not) tell the user something useful.
Apply a substring method to the long version, trim it, in case there are any whitespace characters at the end, optionally remove any special characters from the very end (such as dashes) and finally add a dot, in case you want to indicate your abbreviation that way.
Just a quick hack to get you started:
String longVersion = "Aswaghtde-5d";
// Get substring 0..8 characters
String shortVersion = longVersion.substring(0, (longVersion.length() < 8 ? longVersion.length() : 8));
// Remove whitespace characters from end of String
shortVersion = shortVersion.trim();
// Remove any non-characters from end of String
shortVersion = shortVersion.replaceAll("[^a-zA-Z0-9\\s]+$", "");
// Add dot to end
shortVersion = shortVersion.substring(0, (shortVersion.length() < 8 ? shortVersion.length() : shortVersion.length() - 1)) + ".";
System.out.println(shortVersion);
I needed to shorten names to function as column names in a database. Ideally, the names should be recognizable for users. I set up a dictionary of patterns for commonly occuring words with corresponding "abbreviations". This was applied ONLY to those names which were over the limit of 30 characters.

Lua pattern for parsing strings with optional part

I have to parse a string in the form value, value, value, value, value. The two last values are optional. This is my code, but it works only for the required arguments:
Regex = "([^,])+, ([^,])+, ([^,])+"
I'm using string.match to get the value into variables.
Since you're splitting the string by a comma, use gmatch:
local tParts = {}
for sMatch in str:gmatch "([^,]+)" do
table.insert( tParts, sMatch )
end
Now, once the parts are stored inside the table; you can check if the table contains matched groups at indexes 4 and 5 by:
if tParts[4] and tParts[5] then
-- do your job
elseif tParts[3] then
-- only first three matches were there
end
In Lua you can't make a capturing group optional, and also you are not able to use a logical OR operator. So the answer is: It isn't possible.

Resources