What is the difference between strings and characters in MATLAB?

What is the difference between string and character class in MATLAB?
a = 'AX'; % This is a character array.
b = string(a) % This is a string.

The documentation suggests:
There are two ways to represent text in MATLAB®. You can store text in character arrays. A typical use is to store short pieces of text as character vectors. And starting in Release 2016b, you can also store multiple pieces of text in string arrays. String arrays provide a set of functions for working with text as data.
This is how the two representations differ:
Element access. To represent char vectors of different lengths, one had to use cell arrays, e.g. ch = {'a', 'ab', 'abc'}. With strings, they can be stored in regular arrays: str = [string('a'), string('ab'), string('abc')].
However, to index characters in a string array directly, the curly bracket notation has to be used:
str{3}(2) % == 'b'
Memory use. Chars use exactly two bytes per character. strings have overhead:
a = 'abc'
b = string('abc')
whos a b
returns
  Name      Size            Bytes  Class     Attributes

  a         1x3                 6  char
  b         1x1               132  string

The best place to start for understanding the difference is the documentation. The key difference, as stated there:
A character array is a sequence of characters, just as a numeric array is a sequence of numbers. A typical use is to store short pieces of text as character vectors, such as c = 'Hello World';.
A string array is a container for pieces of text. String arrays provide a set of functions for working with text as data. To convert text to string arrays, use the string function.
Here are a few more key points about their differences:
They are different classes (i.e. types): char versus string. As such they will have different sets of methods defined for each. Think about what sort of operations you want to do on your text, then choose the one that best supports those.
Since a string is a container class, be mindful of how its size differs from an equivalent character array representation. Using your example:
>> a = 'AX'; % This is a character array.
>> b = string(a) % This is a string.
>> whos
  Name      Size            Bytes  Class     Attributes

  a         1x2                 4  char
  b         1x1               134  string
Notice that the string container lists its size as 1x1 (and takes up more bytes in memory) while the character array is, as its name implies, a 1x2 array of characters.
They can't always be used interchangeably, and you may need to convert between the two for certain operations. For example, string objects can't be used as dynamic field names for structure indexing:
>> s = struct('a', 1);
>> name = string('a');
>> s.(name)
Argument to dynamic structure reference must evaluate to a valid field name.
>> s.(char(name))
ans =
1

Strings do have a bit of overhead, but they still grow by 2 bytes per character; the reported variable size increases in steps, once every 8 characters. The red line in the figure is y = 2x + 127.
The figure is created using:
v = [];
N = 100;
for ct = 1:N
    s = char(randi([0 255], [1, ct]));  % random char vector of length ct
    s = string(s);                      % convert to a string scalar
    a = whos('s');
    v(ct) = a.bytes;                    % memory used by the string
end
figure(1); clf
plot(v)
xlabel('# characters')
ylabel('# bytes')
p = polyfit(1:N, v, 1);                 % linear fit of bytes vs. characters
hold on
plot([0, N], [127, 2*N + 127], 'r')     % reference line y = 2x + 127
hold off

One important practical thing to note is that strings and chars behave differently when combined with square brackets. This can be especially confusing when coming from Python. Consider the following example:
>>['asdf' '123']
ans =
'asdf123'
>> ["asdf" "123"]
ans =
1×2 string array
"asdf" "123"

Related

String decompression: Reduce time and space complexity

N-rounds of compression are run on a string, where each round replaces some character pattern with one special character (using a dictionary).
Given this compressed string and the dictionary used for compression, we need to find the original string.
For ex:
Dictionary used for compression:
b12k -> ?
a?l -> #
#mn -> !
So, the string ab12klmn is compressed as !
What data structure suits best to store this dictionary such that the decompression is O(n) operation with least possible extra space used?
What I've tried:
This was an interview question. I stored the dictionary in a map, with the special characters of the compression dictionary as keys and the fully decompressed strings as values.
Then I traversed the given string, replacing each special character with its respective expansion.
For ex:
! -> ab12klmn
# -> ab12k
? -> b12k
Then, to reduce the duplication of string patterns, I restructured this dictionary as a tree, but the interviewer wasn't satisfied.
Where can I improve this solution?
I understand that we need to get back the original string from the given compressed string.
The best data structure you can use here is a 2-dimensional vector (a dynamic array of strings). I will try to explain why it suits this problem.
When we use a map, we introduce a log n factor for every key lookup. With a vector, if you know the location of your search query, the lookup is O(1).
When we use a vector we do not waste any extra memory blocks; the same is true of maps. But if you use a fixed-size 2-d array, unnecessary memory is wasted.
Since there are only 256 possible characters, we can store the dictionary as follows. Let's have a 2-d vector of strings with at most 256 rows. For this example:
b12k -> ?
a?l -> #
#mn -> !
So we will store "b12k" at v[63], as the ASCII value of '?' is 63. Similarly, we will store "a?l" at v[35], as the ASCII value of '#' is 35, and so on.
How to find the original string:
1. Start from the compressed string and initialize a string that will hold the final answer; let's call it origString = "".
2. Traverse the string. If the current character is a non-special character, append it to origString.
3. If it is a special character, go to that character's ASCII value and the corresponding entry in the 2-d vector.
4. Go to step 2 and traverse that entry in the same way.
The pseudo-code for this is
origString = "";
func getOriginalFromCompressed(string s)
    for i = [0 : s.length()-1]
        if (v[s[i]].length())                            // special character: expand its entry recursively
            getOriginalFromCompressed(v[s[i]]);
        else
            origString = stringConcat(origString, s[i]); // ordinary character: append it to the final answer
    end for
end func
After the call, origString holds the original string.
So the time and space complexity of this solution is O(n), where n is the sum of the lengths of all the strings in the given dictionary.
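For concreteness, here is a minimal runnable sketch of this lookup-table approach in Python (the table layout and the helper names build_table/decompress are illustrative, not from the original answer; it assumes 8-bit character codes):

def build_table(dictionary):
    # dictionary maps each special character to the pattern it replaced,
    # e.g. {'?': 'b12k', '#': 'a?l', '!': '#mn'}
    table = [''] * 256                  # indexed by the character's ASCII value
    for special, pattern in dictionary.items():
        table[ord(special)] = pattern
    return table

def decompress(s, table):
    out = []
    def expand(chunk):
        for ch in chunk:
            expansion = table[ord(ch)]
            if expansion:               # special character: expand its entry recursively
                expand(expansion)
            else:                       # ordinary character: emit it as-is
                out.append(ch)
    expand(s)
    return ''.join(out)

# example from the question
table = build_table({'?': 'b12k', '#': 'a?l', '!': '#mn'})
print(decompress('!', table))           # -> ab12klmn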

Python: why does the length of the list change after turning it from a list of ints into a string?

I have a bunch of users in a list called UserList.
And I do not want the output to have the square brackets, so I run this line:
UserList = [1,2,3,4...]
UserListNoBrackets = str(UserList).strip('[]')
But if I run:
len(UserList) # prints 22 (which is correct).
However:
len(UserListNoBrackets) # prints 170 (whaaat?!)
Anyway, the output is actually correct (I'm pretty sure). Just wondering why that happens.
Here:
UserListNoBrackets = str(UserList).strip('[]')
UserListNoBrackets is a string. A string is a sequence of characters, and len(str) returns the number of characters in the string. A comma is a character, a whitespace is a character, and the string representation of an integer has as many characters as there are digits in the integer. So obviously, the length of your UserListNoBrackets string is much greater than the length of your UserList list.
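A quick illustration with a shorter, hypothetical list makes the counting explicit:

user_list = [1, 2, 3, 4]
as_string = str(user_list).strip('[]')
print(as_string)        # 1, 2, 3, 4
print(len(user_list))   # 4  -> number of list elements
print(len(as_string))   # 10 -> four digits plus three ', ' separators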
You probably need str.join
Ex:
user_list = [1,2,3,4...]
print(",".join(map(str, user_list)))
Note:
map is used to convert all the int elements in the list to strings before joining.

Padding a hexadecimal string with zeros to a 6 character length

I have this:
function dec2hex(IN)
    local OUT
    OUT = string.format("%x", IN)
    return OUT
end
and I need IN to be padded with zeros to a string length of 6.
I can't use String.Utils or PadLeft. This is within an app called Watchmaker, which uses a cut-down version of Lua.
String formats in Lua work mostly just like in C. So to pad a number with zeros, just use %0n where n is the number of places. For example
print(string.format("%06x", 16^4-1))
will print 00ffff.
See Chapter 20 (The String Library) of "Programming in Lua", the reference for string.format, and the C reference for the printf family of functions for details.
If you store your format string, you can call the format method on the format string itself; Henri's example then becomes ("%06x"):format(0xffff):
print(("%06x"):format(0xffff)) -- Prints `00ffff`
You can write numbers in hex format. It is the same as C.

Replace multiple substrings using strrep in Matlab

I have a big string (around 25M characters) in which I need to replace multiple substrings matching a specific pattern.
Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
The substring I need to remove is 'Frame #', and it occurs around 7670 times. I can give multiple search strings to strrep using a cell array:
strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')
However, that returns a cell array in which each cell contains the original string with only the corresponding search string replaced.
Is there a way to replace multiple substrings in a string other than using regexprep? I noticed that it is considerably slower than strrep, which is why I am trying to avoid it.
With regexprep it would be:
regexprep(text,'Frame \d*',';')
and for a string of 25MB it takes around 47 seconds to replace all the instances.
EDIT 1: added the equivalent regexprep command
EDIT 2: added the size of the string for reference, the number of occurrences of the substring, and the timing of the regexprep execution
OK, in the end I found a way to work around the problem. Instead of using regexprep to change the substring, I remove the 'Frame ' substring (including the whitespace, but not the number):
rawData = strrep(text,'Frame ','');
This results in something like this:
1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0
...........
Then I change all the commas (,) and newline characters (\n) into semicolons (;), again using strrep, and create one big vector with all the numbers:
rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');
Then I remove the unnecessary numbers (1, 2, ..., 7670), since they are located at specific points in the array (each frame contains a fixed number of values):
rawData{1}(firstInstance:spacing:lastInstance)=[];
And then I go on with my manipulations. It seems that the additional strrep calls and the removal of the values from the array are much faster than the equivalent regexprep. With a string of 25M chars, the whole operation takes about 47 seconds with regexprep, while with this workaround it takes only 5 seconds!
Hope this helps somehow.
I think that this can be done using only textscan, which is known to be very fast. By specifying a 'CommentStyle', the 'Frame #' lines are stripped out. This may only work because the 'Frame #' text is on its own lines. This code returns the raw data as one big vector:
s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:}
You may want to know how many elements are in each frame or even reshape the data into a matrix. You can use textscan again (or before the above) to get just the data for the first frame:
f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = f1{:}
In fact, if you just want the elements from the first line, you can use this:
l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}
However, the other nice thing about textscan is that you can use it to read in the file directly (it looks like you may be using some other means currently) using just fopen to get an FID. Thus the string data text doesn't have to be in memory.
Using regular expressions:
result = regexprep(text,'Frame [0-9]+','');
It's possible to avoid regular expressions as follows. I use strrep with suitable replacement strings that act as masks. The obtained strings are of equal length and guaranteed to be aligned, and can thus be combined into the final result using the masks. I've also included the ';' you want. I don't know whether it will be faster than regexprep, but it's definitely more fun :-)
% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; % example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; % strings to be replaced.
% May be of different lengths
% Computations
rep_dest = cellfun(@(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
% series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux)); % keep characters according to mask
ind_semicolon = diff(ind_keep)==1; % where to insert ';'
ind_keep = ind_keep | [ind_semicolon 0]; % semicolons will also be kept
result = aux(1,:); % for now
result(ind_semicolon) = ';'; % include ';'
result = result(ind_keep); % remove unwanted characters
With these example data:
>> text
text =
Hello Frame 1 test string Frame 22 end of Frame 2 this
>> result
result =
Hello ; test string ; end of ; this

Given a set of strings, find an optimal lexicon which can be used to build those strings

Imagine I have a set of strings, for example:
"entrance",
"scent",
"transcend".
I would like to find an optimal "lexicon" of sub-strings which can be used to build the initial strings. The criterion is the smallest amount of memory needed to store both the lexicon and the strings using that lexicon.
For instance, for the given set of strings, the optimal lexicon of sub-strings may be:
"scen" = 1,
"tran" = 2,
"en" = 3,
"ce" = 4,
"t" = 5,
"d" = 6
with the initial set of strings encoded the following way (each \N represents a reference to the string N from the lexicon):
\3\2\4
\1\5
\2\1\6
for the total of 8 references used to build the strings + 14 chars needed to store the lexicon, versus the 22 chars in the original strings + 8 chars comprising the original alphabet. If you need an exact formula for the footprint, assume that sizeof( reference ) == sizeof( char ), and the footprint of a single string (both encoded and in lexicon) is length of string * sizeof( char or reference ), with no additional overhead.
What is the best algorithm to solve this problem? Does this algorithm have an established name? Is it NP-hard? If so, is there a sub-optimal, but polynomial solution?
EDIT: The best solution I could come up with is the following: find the longest common sub-string in the initial set of strings. Let the score of that sub-string be (substring_length - 1) * total_occurrences_of_it_in_the_set - substring_length, which accounts for the number of chars/refs saved by that replacement. Now find all the shorter sub-strings (down to length 2) and calculate their scores the same way. Among all the sub-strings found this way, the one with the largest score wins and goes into the lexicon. That sub-string is then replaced in the set of strings by references to it, and the process repeats until the set of strings contains only single chars and lexicon references. After that, introduce all the remaining single chars into the lexicon and replace them with their references in the set. The reasoning behind the scoring: we remove substring_length chars from each occurrence and add a reference instead (hence the -1), and we need substring_length chars to store the sub-string itself (hence the -substring_length).
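For illustration, a small brute-force Python sketch of that greedy procedure (the token representation and the function names build_lexicon/occurrences/replace are my own; it only admits plain character runs as lexicon entries, as in the example above, and makes no claim of optimality):

def build_lexicon(strings):
    # Each string becomes a list of tokens; a token is a 1-char string or an int lexicon id.
    seqs = [list(s) for s in strings]
    lexicon = {}   # lexicon id -> tuple of characters it stands for
    next_id = 1

    def occurrences(seq, pat):
        # count non-overlapping occurrences of pat (a tuple of tokens) in seq
        count, i, n = 0, 0, len(pat)
        while i + n <= len(seq):
            if tuple(seq[i:i + n]) == pat:
                count += 1
                i += n
            else:
                i += 1
        return count

    def replace(seq, pat, ref):
        # replace non-overlapping occurrences of pat in seq with the reference token ref
        out, i, n = [], 0, len(pat)
        while i < len(seq):
            if tuple(seq[i:i + n]) == pat:
                out.append(ref)
                i += n
            else:
                out.append(seq[i])
                i += 1
        return out

    while True:
        # enumerate every candidate sub-sequence of length >= 2 made of plain characters
        candidates = set()
        for seq in seqs:
            for n in range(2, len(seq) + 1):
                for i in range(len(seq) - n + 1):
                    pat = tuple(seq[i:i + n])
                    if all(isinstance(t, str) for t in pat):
                        candidates.add(pat)
        # score = (length - 1) * occurrences - length, i.e. chars/refs saved by the replacement
        best, best_score = None, 0
        for pat in candidates:
            occ = sum(occurrences(seq, pat) for seq in seqs)
            score = (len(pat) - 1) * occ - len(pat)
            if score > best_score:
                best, best_score = pat, score
        if best is None:            # no replacement saves anything: stop
            break
        lexicon[next_id] = best
        seqs = [replace(seq, best, next_id) for seq in seqs]
        next_id += 1

    # finally, every remaining single character also becomes a lexicon entry
    char_ids = {}
    for seq in seqs:
        for j, tok in enumerate(seq):
            if isinstance(tok, str):
                if tok not in char_ids:
                    char_ids[tok] = next_id
                    lexicon[next_id] = (tok,)
                    next_id += 1
                seq[j] = char_ids[tok]
    return lexicon, seqs

lexicon, encoded = build_lexicon(["entrance", "scent", "transcend"])
print(lexicon)   # the greedy choice may differ from the hand-picked lexicon above
print(encoded)   # each original string as a list of lexicon references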
Any better approach you can think of?
