Replace multiple substrings using strrep in Matlab - string

I have a big string (around 25M characters) where I need to replace multiple substrings of a specific pattern in it.
Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
The substring I need to remove is the 'Frame #' and it occurs around 7670 times. I can give multiple search strings in strrep, using a cell array
strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')
However that returns a cell array, where in each cell, I have the original string with the corresponding substring of one of my input cell changed.
Is there a way to replace multiple substrings from a string, other than using regexprep? I noticed that it is considerably slower than strrep, that's why I am trying to avoid it.
With regexprep it would be:
regexprep(text,'Frame \d*',';')
and for a string of 25MB it takes around 47 seconds to replace all the instances.
EDIT 1: added the equivalent regexprep command
EDIT 2: added size of the string for reference, number of occurences for the substring and timing of execution for the regexprep

Ok, in the end I found a way to go around the problem. Instead of using regexprep to change the substring, I remove the 'Frame ' substring (including whitespace, but not the number)
rawData = strrep(text,'Frame ','');
This results in something like this:
1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Then, I change all the commas (,) and newline characters (\n) into a semicolon (;), using again strrep, and I create a big vector with all the numbers
rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');
then I remove the unnecessary numbers (1,2,...,7670), since they are located at a specific point in the array (each frame contains a specific amount of numbers).
rawData{1}(firstInstance:spacing:lastInstance)=[];
And then I go on with my manipulations. It seems that the additional strrep and removal of the values from the array is much much faster than the equivalent regexprep. With a string of 25M chars with regexprep I can do the whole operation in about 47", while with this workaround it takes only 5"!
Hope this helps somehow.

I think that this can be done using only textscan, which is known to be very fast. Be specifying a 'CommentStyle' the 'Frame #' lines are stripped out. This may only work because these 'Frame #' lines are on their own lines. This code returns the raw data as one big vector:
s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}
You may want to know how many elements are in each frame or even reshape the data into a matrix. You can use textscan again (or before the above) to get just the data for the first frame:
f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = s{:}
In fact, if you just want the elements from the first line, you can use this:
l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}
However, the other nice thing about textscan is that you can use it to read in the file directly (it looks like you may be using some other means currently) using just fopen to get an FID. Thus the string data text doesn't have to be in memory.

Using regular expressions:
result = regexprep(text,'Frame [0-9]+','');
It's possible to avoid regular expressions as follows. I use strrep with suitable replacement strings that act as masks. The obtained strings are equal-length and are assured to be aligned, and can thus be combined into the final result using the masks. I've also included the ; you want. I don't know if it will be faster than regexprep or not, but it's definitely more fun :-)
% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths
% Computations
rep_dest = cellfun(#(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); %//keep characters according to mask
ind_semicolon = diff(ind_keep)==1; %//where to insert ';'
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters
With these example data:
>> text
text =
Hello Frame 1 test string Frame 22 end of Frame 2 this
>> result
result =
Hello ; test string ; end of ; this

Related

How to assign multiple lines to a string variable in Matlab

I have a few lines of text like this:
abc
def
ghi
and I want to assign these multiple lines to a Matlab variable for further processing.
I am copying these from very large text file and want to process it in Matlab Instead of saving the text into a file and then reading line by line for processing.
I tried to handle the above text lines as single string but am getting an error whilst trying to assign to a variable:
x = 'abc
def
ghi'
Error:
x = 'abc
|
Error: String is not terminated properly.
Any suggestions which could help me understand and solve the issue will be highly appreciated.
I frequently do this, namely copy text from elsewhere which I want to hard-code into a MATLAB script (in my case it's generally SQL code I want to manipulate and call from MATLAB).
To achieve this I have a helper function in clipboard2cellstr.m defined as follows:
function clipboard2cellstr
str = clipboard('paste');
str = regexprep(str, '''', ''''''); % Double any single quotes
strs = regexp(str, '\s*\r?\n\r?', 'split');
cs = sprintf('{\n''%s''\n}', strjoin(strs, sprintf('''\n''')));
clipboard('copy', cs);
disp(cs)
disp('(Copied to Clipboard)')
end
I then copy the text using Ctrl-c (or however) and run clipboard2cellstr. This changes the contents of the clipboard to something I can paste into the MATLAB editor using Ctrl-v (or however).
For example, copying this line
and this line
and this one, and then running the function generates this:
{
'For example, copying this line'
'and this line'
'and this one, and then running the function generates this:'
}
which is valid MATLAB which can be pasted directly in.
Your error is because you ended the line when MATLAB was expecting a closing quote character. You must use array notation to have multi-line or multi-element arrays.
You can assign like this if you use array notation
x = ['abc'
'def'
'hij']
>> x = 3×3 char array
Note: with this method, your rows must have the same number of characters, as you are really dealing with a character array. You can think of a character array like a numeric matrix, hence why it must be "rectangular".
If you have MATLAB R2016b or newer, you can use the string data type. This uses double quotes "..." rather than single quotes '...', and can be multi-line. You must still use array notation:
x = ["abc"
"def"
"hijk"]
>> x = 3×1 string array
We can have different numbers of characters in each line, as this is simply a 3 element string array, not a character array.
Alternatively, use a cell array of character arrays (or strings)
x = {'abc'
'def'
'hijk'}
>> x = 3×1 cell array
Again, you can have character arrays or strings of different lengths within a cell array.
In all of the above examples, a newline is simply for readability and can be replaced by a semi-colon ; to denote the next line of the array.
The option you choose will depend on what you want to do with the text. If you're reading from a file, I would suggest the string array or the cell array, as they can deal with different length lines. For backwards compatibility, use a cell array. You may find cellfun relevant for operating on cell arrays. For native string operations, use a string array.

Sort letters in string alphabetically- SAS

I would like to sort the letters in a string alphabetically.
E.g.
'apple' = 'aelpp'
The only function I have seen that is somewhat similar is SORTC, but I would like to avoid splitting each word into an array of letters if possible.
Joe's right - there is no built-in function that does this. You have two options here that I can see:
Split your string into an array and sort the array using call sortc. You can do this fairly painlessly using call pokelong provided that you have first defined an array of sufficient length.
Implement a sorting algorithm of your choice. If you choose to go down this route, I would suggest using substr on the left of the = sign to change individual characters without rewriting the whole string.
Here's an example of how you might do #1. #2 would be much more work.
data _null_;
myword = 'apple';
array letters[5] $1;
call pokelong(myword,addrlong(letters1),5); /*Limit # of chars to copy to the length of array*/
call sortc(of letters[*]);
myword = cat(of letters[*]);
putlog _all_;
run;
N.B. for an array of length 5 as used here, make sure you only write the first 5 characters of the string into memory at the start of the array when using call pokelong in order to avoid overflowing past the end of the array - otherwise you could overwrite some other arbitrary section of memory when processing longer values of myword. This could cause undesirable side effects, e.g. application / system crashes. Also, this technique for populating the array will not work in SAS University Edition - if you're using that, you'll need to use a do-loop instead.
I did a little test of this - sorting 2m random words of length 100 consisting of characters chosen from the whole ASCII printable range took about 15 seconds using a single CPU of a several-years-old PC - slightly less time than it took to create the test dataset.
data have;
length myword $100;
do i = 1 to 2000000;
do j = 1 to 100;
substr(myword,j,1) = byte(32 + int(ranuni(1) * (126 - 32)));
end;
output;
end;
drop i j;
run;
data want;
set have;
array letters[100] $1;
call pokelong(myword,addrlong(letters1),100); /*Limit # of chars to copy to the length of array*/
call sortc(of letters[*]);
myword = cat(of letters[*]);
drop letters:;
run;

What is the difference between strings and characters in Matlab?

What is the difference between string and character class in MATLAB?
a = 'AX'; % This is a character.
b = string(a) % This is a string.
The documentation suggests:
There are two ways to represent text in MATLAB®. You can store text in character arrays. A typical use is to store short pieces of text as character vectors. And starting in Release 2016b, you can also store multiple pieces of text in string arrays. String arrays provide a set of functions for working with text as data.
This is how the two representations differ:
Element access. To represent char vectors of different length, one had to use cell arrays, e.g. ch = {'a', 'ab', 'abc'}. With strings, they can be created in actual arrays: str = [string('a'), string('ab'), string('abc')].
However, to index characters in a string array directly, the curly bracket notation has to be used:
str{3}(2) % == 'b'
Memory use. Chars use exactly two bytes per character. strings have overhead:
a = 'abc'
b = string('abc')
whos a b
returns
Name Size Bytes Class Attributes
a 1x3 6 char
b 1x1 132 string
The best place to start for understanding the difference is the documentation. The key difference, as stated there:
A character array is a sequence of characters, just as a numeric array is a sequence of numbers. A typical use is to store short pieces of text as character vectors, such as c = 'Hello World';.
A string array is a container for pieces of text. String arrays provide a set of functions for working with text as data. To convert text to string arrays, use the string function.
Here are a few more key points about their differences:
They are different classes (i.e. types): char versus string. As such they will have different sets of methods defined for each. Think about what sort of operations you want to do on your text, then choose the one that best supports those.
Since a string is a container class, be mindful of how its size differs from an equivalent character array representation. Using your example:
>> a = 'AX'; % This is a character.
>> b = string(a) % This is a string.
>> whos
Name Size Bytes Class Attributes
a 1x2 4 char
b 1x1 134 string
Notice that the string container lists its size as 1x1 (and takes up more bytes in memory) while the character array is, as its name implies, a 1x2 array of characters.
They can't always be used interchangeably, and you may need to convert between the two for certain operations. For example, string objects can't be used as dynamic field names for structure indexing:
>> s = struct('a', 1);
>> name = string('a');
>> s.(name)
Argument to dynamic structure reference must evaluate to a valid field name.
>> s.(char(name))
ans =
1
Strings do have a bit of overhead, but still increase by 2 bytes per character. After every 8 characters it increases the size of the variable. The red line is y=2x+127.
figure is created using:
v=[];N=100;
for ct = 1:N
s=char(randi([0 255],[1,ct]));
s=string(s);
a=whos('s');v(ct)=a.bytes;
end
figure(1);clf
plot(v)
xlabel('# characters')
ylabel('# bytes')
p=polyfit(1:N,v,1);
hold on
plot([0,N],[127,2*N+127],'r')
hold off
One important practical thing to note is, that strings and chars behave differently when interacting with square brackets. This can be especially confusing when coming from python. consider following example:
>>['asdf' '123']
ans =
'asdf123'
>> ["asdf" "123"]
ans =
1×2 string array
"asdf" "123"

Extracting a specific word and a number of tokens on each side of it from each string in a column in SAS?

Extracting a specific word and a number of tokens on each side of it from each string in a column in SAS EG ?
For example,
row1: the sun is nice
row2: the sun looks great
row3: the sun left me
Is there a code that would produce the following result column (2 words where sun is the first):
SUN IS
SUN LOOKS
SUN LEFT
and possibly a second column with COUNT in case of duplicate matches.
So if there was 20 SUN LOOKS then it they would be grouped and have a count of 20.
Thanks
I think you can use functions findw() and scan() to do want you want. Both of those functions operate on the concept of word boundaries. findw() returns the position of the word in the string. Once you know the position, you can use scan() in a loop to get the next word or words following it.
Here is a simple example to show you the concept. It is by no means a finished or polished solution, but intended you point you in the right direction. The input data set (text) contains the sentences you provided in your question with slight modifications. The data step finds the word "sun" in the sentence and creates a variable named fragment that contains 3 words ("sun" + the next 2 words).
data text2;
set text;
length fragment $15;
word = 'sun'; * search term;
fragment_len = 3; * number of words in target output;
word_pos = findw(sentence, word, ' ', 'e');
if word_pos then do;
do i = 0 to fragmen_len-1;
fragment = catx(' ', fragment, scan(sentence, word_pos+i));
end;
end;
run;
Here is a partial print of the output data set.
You can use a combination of the INDEX, SUBSTR and SCAN functions to achieve this functionality.
INDEX - takes two arguments and returns the position at which a given substring appears in a string. You might use:
INDEX(str,'sun')
SUBSTR - simply returns a substring of the provided string, taking a second numeric argument referring to the starting position of the substring. Combine this with your INDEX function:
SUBSTR(str,INDEX(str,'sun'))
This returns the substring of str from the point where the word 'sun' first appears.
SCAN - returns the 'words' from a string, taking the string as the first argument, followed by a number referring to the 'word'. There is also a third argument that specifies the delimiter, but this defaults to space, so you wouldn't need it in your example.
To pick out the word after 'sun' you might do this:
SCAN(SUBSTR(str,INDEX(str,'sun')),2)
Now all that's left to do is build a new string containing the words of interest. That can be achieved with concatenation operators. To see how to concatenate two strings, run this illustrative example:
data _NULL_;
a = 'Hello';
b = 'World';
c = a||' - '||b;
put c;
run;
The log should contain this line:
Hello - World
As a result of displaying the value of the c variable using the put statement. There are a number of functions that can be used to concatenate strings, look in the documentation at CAT,CATX,CATS for some examples.
Hopefully there is enough here to help you.

Finding mean of ascii values in a string MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are 2 blank spaces indicating a space (ASCII 32) in each row. I need to find the mean ASCII value in each column without taking into account the spaces (32). So first I would convert to with double(scrap1) but then how do I find the mean without taking into account the spaces?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
You can remove the intermediate spaces in the string with scrap1(scrap1 == ' ') = ''; This replaces any space in the input with an empty string. Then you can do the conversion to double and average the result. See here for other methods.
Probably, you can use regex to find the space and ignore it. "\s"
findSpace = regexp(scrap1, '\s', 'ignore')
% I am not sure about the ignore case, this what comes to my mind. but u can read more about regexp by typying doc regexp.

Resources