Finding mean of ascii values in a string MATLAB - string

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are 2 blank spaces indicating a space (ASCII 32) in each row. I need to find the mean ASCII value in each column without taking into account the spaces (32). So first I would convert to with double(scrap1) but then how do I find the mean without taking into account the spaces?

If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean

You can remove the intermediate spaces in the string with scrap1(scrap1 == ' ') = ''; This replaces any space in the input with an empty string. Then you can do the conversion to double and average the result. See here for other methods.

Probably, you can use regex to find the space and ignore it. "\s"
findSpace = regexp(scrap1, '\s', 'ignore')
% I am not sure about the ignore case, this what comes to my mind. but u can read more about regexp by typying doc regexp.

Related

re.sub replacing string using original sub-string

I have a text file. I would like to remove all decimal points and their trailing numbers, unless text is preceding.
e.g 12.29,14.6,8967.334 should be replaced with 12,14,8967
e.g happypants2.3#email.com should not be modified.
My code is:
import re
txt1 = "9.9,8.8,22.2,88.7,morris1.43#email.com,chat22.3#email.com,123.6,6.54"
txt1 = re.sub(r',\d+[.]\d+', r'\d+',txt1)
print(txt1)
unless there is an easier way of completing this, how do I modify r'\d+' so it just returns the number without a decimal place?
You need to make use of groups in your regex. You put the digits before the '.' into parentheses, and then you can use '\1' to refer to them later:
txt1 = re.sub(r',(\d+)[.]\d+', r',\1',txt1)
Note that in your attempted replacement code you forgot to replace the comma, so your numbers would have been glommed together. This still isn't perfect though; the first number, since it doesn't begin with a comma, isn't processed.
Instead of checking for a comma, the better way is to check word boundaries, which can be done using \b. So the solution is:
import re
txt1 = "9.9,8.8,22.2,88.7,morris1.43#email.com,chat22.3#email.com,123.6,6.54"
txt1 = re.sub(r'\b(\d+)[.]\d+\b', r'\1',txt1)
print(txt1)
Considering these are the only two types of string that is present in your file, you can explicitly check for these conditions.
This may not be an efficient way, but what I have done is split the str and check if the string contains #email.com. If thats true, I am just appending to a new list. For your 1st condition to satisfy, we can convert the str to int which will eliminate the decimal points.
If you want everything back to a str variable, you can use .join().
Code:
txt1 = "9.9,8.8,22.2,88.7,morris1.43#email.com,chat22.3#email.com,123.6,6.54"
txt_list = []
for i in (txt1.split(',')):
if '#email.com' in i:
txt_list.append(i)
else:
txt_list.append(str(int(float(i))))
txt_new = ",".join(txt_list)
txt_new
Output:
'9,8,22,88,morris1.43#email.com,chat22.3#email.com,123,6'

SAS finding an uppercase word within a string

I have a string which contains one word in uppercase somewhere within it. I want to extract that one word into a new variable using SAS.
I think I need to find a way to code up finding a word which contains two or more uppercase letters (as the start of a sentence would begin with an uppercase letter).
i.e. How do I create the variable 'word':
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
Hope someone can help and the question is clear
Thanks in advance
Consider a regex match/replace with a negative lookbehind to include two types of matches:
consecutive upper case words followed by a space with at least two characters (to avoid title cases at beginning of sentence): (([A-Z ]){2,})
consecutive upper case words followed by a period with at least two characters: (to avoid title cases at beginning of sentence): (([A-Z.]){2,})
CAVEAT: This solution works except the I article is also matched which technically is a valid match as it is also an all uppercase one-word. Being the only type in English language, consider a tranwrd() replace for such a special case. In fact, relatedly, this solution matches ALL uppercase words.
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString word wordextract
This is one EXAMPLE. Of what I need. EXAMPLE EXAMPLE
THIS is another. THIS THIS
etc ETC ETC ETC
SAS has a prxsubstr() function call that finds the starting position and length of a substring that matches a given regex pattern within a given string. Here's a sample solution using the prxsubstr() function call:
data solution;
set example;
/* Build a regex pattern of the word to search for, and hang on to it */
/* (The regex below means: word boundary, then two or more capital letters,
then word boundary. Word boundary here means the start or the end of a string
of letters, digits and/or underscores.) */
if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
retain pattern_num;
/* Get the starting position and the length of the word to extract */
call prxsubstr(pattern_num, txtString, mypos, mylength);
/* If a word matching the regex pattern is found, extract it */
if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary info: http://www.regular-expressions.info/wordboundaries.html

How to match a part of string before a character into one variable and all after it into another

I have a problem with splitting string into two parts on special character.
For example:
12345#data
or
1234567#data
I have 5-7 characters in first part separated with "#" from second part, where are another data (characters,numbers, doesn't matter what)
I need to store two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.
Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)
See this documentation:
First of all, although Lua does not have a split function is its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
print(i)
end
See IDEONE demo
The [^#]+ pattern matches one or more characters other than # (so, it "splits" a string with 1 character).

Removing first character from a string in octave

I wanted to know how to remove first character of a string in octave. I am manipulating the string in a loop and after every loop, I want to remove the first character of the remaining string.
Thanks in advance.
If it's just a one-line string then:
short_string = long_string(2:end)
But if you have a cell array of strings then either do it as above if you have a loop already, otherwise you can use this shorthand to do it in one line:
short_strings = cellfun(#(x)(x(2:end)), long_strings, 'uni', false)
Or else if you have a matrix of strings (i.e. all the same length), then you can vectorize it as:
short_strings = long_strings(:, 2:end)

Replace multiple substrings using strrep in Matlab

I have a big string (around 25M characters) where I need to replace multiple substrings of a specific pattern in it.
Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
The substring I need to remove is the 'Frame #' and it occurs around 7670 times. I can give multiple search strings in strrep, using a cell array
strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')
However that returns a cell array, where in each cell, I have the original string with the corresponding substring of one of my input cell changed.
Is there a way to replace multiple substrings from a string, other than using regexprep? I noticed that it is considerably slower than strrep, that's why I am trying to avoid it.
With regexprep it would be:
regexprep(text,'Frame \d*',';')
and for a string of 25MB it takes around 47 seconds to replace all the instances.
EDIT 1: added the equivalent regexprep command
EDIT 2: added size of the string for reference, number of occurences for the substring and timing of execution for the regexprep
Ok, in the end I found a way to go around the problem. Instead of using regexprep to change the substring, I remove the 'Frame ' substring (including whitespace, but not the number)
rawData = strrep(text,'Frame ','');
This results in something like this:
1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Then, I change all the commas (,) and newline characters (\n) into a semicolon (;), using again strrep, and I create a big vector with all the numbers
rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');
then I remove the unnecessary numbers (1,2,...,7670), since they are located at a specific point in the array (each frame contains a specific amount of numbers).
rawData{1}(firstInstance:spacing:lastInstance)=[];
And then I go on with my manipulations. It seems that the additional strrep and removal of the values from the array is much much faster than the equivalent regexprep. With a string of 25M chars with regexprep I can do the whole operation in about 47", while with this workaround it takes only 5"!
Hope this helps somehow.
I think that this can be done using only textscan, which is known to be very fast. Be specifying a 'CommentStyle' the 'Frame #' lines are stripped out. This may only work because these 'Frame #' lines are on their own lines. This code returns the raw data as one big vector:
s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}
You may want to know how many elements are in each frame or even reshape the data into a matrix. You can use textscan again (or before the above) to get just the data for the first frame:
f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = s{:}
In fact, if you just want the elements from the first line, you can use this:
l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}
However, the other nice thing about textscan is that you can use it to read in the file directly (it looks like you may be using some other means currently) using just fopen to get an FID. Thus the string data text doesn't have to be in memory.
Using regular expressions:
result = regexprep(text,'Frame [0-9]+','');
It's possible to avoid regular expressions as follows. I use strrep with suitable replacement strings that act as masks. The obtained strings are equal-length and are assured to be aligned, and can thus be combined into the final result using the masks. I've also included the ; you want. I don't know if it will be faster than regexprep or not, but it's definitely more fun :-)
% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths
% Computations
rep_dest = cellfun(#(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); %//keep characters according to mask
ind_semicolon = diff(ind_keep)==1; %//where to insert ';'
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters
With these example data:
>> text
text =
Hello Frame 1 test string Frame 22 end of Frame 2 this
>> result
result =
Hello ; test string ; end of ; this

Resources