Since a string scalar is a sequence of characters, why do we still need character vectors?

A string scalar like "abc" is an array of the characters 'a', 'b', 'c'.
Is a character vector like 'abc' also an array of characters?
Why do we need two data types to hold the same text?

The single quote version is the historical method, and is a rectangular array of characters. If all you want to store is a single string, this works fine. But if you want to store multiple strings in the same variable, the rectangular array becomes less useful because you have to pad the shorter strings with blanks to make every row the same length. Also, because MATLAB stores arrays in column-major order, each individual string held as a row of the array is not contiguous in memory.
This led to using cell arrays for holding multiple strings of different lengths in the same variable. However, that also has drawbacks: each string in a cell array needs its own variable header (> 100 bytes), so there are memory and performance costs.
The double quote string is a relatively recent class introduced by MATLAB for holding multiple strings in a single variable. The individual strings are held in memory in contiguous chunks without the need for individual variable headers, and the operations on them are more optimized as a result.
MATLAB will no doubt continue to support all three methods in the future for backward compatibility.

Related

Most efficient way to store a string in bytes?

Say I have a simple bytecode-like file format for saving data.
If I want to store a string, should I do it like in source files, where everything between two delimiter bytes is the string,
or should I first store the length of the string then the string bytes?
Or are both solutions horrible and if so which one can I use?
It depends on whether you want to store:
a single string
a number of strings
different length strings
all the same length
For all of the above, it may also matter if your strings contain:
any characters
only certain characters
formatting
In general, you should use a Unicode encoding such as UTF-8.
For a single string, you simply can use an entire file to contain the string, the end-of-file will be the same as the end of string. No need to store the length of the string.
If the strings aren't all (around) the same length you can use an inline separator to separate the strings. Often the newline character is useful for this (especially since a lot of programming languages support this way of reading in a file line-by-line), but other markers such as tab are common.
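The other option raised in the question, writing the length first and then the string bytes, can be sketched as follows. This is only an illustration under assumed choices: UTF-8 text, a 4-byte little-endian length prefix, and the hypothetical helper names pack_strings and unpack_strings. None of these come from the original answer.

```python
import struct

def pack_strings(strings):
    """Encode each string as a 4-byte little-endian length plus UTF-8 bytes."""
    out = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        out += struct.pack("<I", len(data))  # length counts bytes, not characters
        out += data
    return bytes(out)

def unpack_strings(blob):
    """Decode a blob produced by pack_strings back into a list of strings."""
    strings = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        strings.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return strings

blob = pack_strings(["abc", "line\nbreak", ""])
print(unpack_strings(blob))
```

With a length prefix the strings may contain any byte, including newlines, at the cost of a few extra bytes per string and a file that is no longer readable in a text editor.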
CSV text files often use double quotes to enclose strings that contain the column separator (usually a comma), which would otherwise start the next column value, or line breaks, which would otherwise start the next row.
Of course, now you have the problem of how to store a double quote in your string.
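One common convention for this is to double any quote character inside a quoted field. A minimal sketch using Python's standard csv module, which applies that convention by default (to_csv_line is a hypothetical helper name used only for this illustration):

```python
import csv
import io

def to_csv_line(fields):
    """Write one CSV row to a string, quoting fields only where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

line = to_csv_line(['say "hi"', "a,b", "plain"])
print(line, end="")  # "say ""hi""","a,b",plain

# Reading the line back recovers the original fields.
print(next(csv.reader(io.StringIO(line))))
```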
If you want to store formatting, you can use a markup language such as HTML, or it may be enough to allow line breaks and/or some Markdown.

Why strings cannot be indexed by integer values

I learned that Swift strings cannot be indexed by integer values. I remembered it and I use the rule. But I've never fully understood the mechanic behind it.
The explanation from the official documentation is as follows:
"Different characters can require different amounts of memory to store, so in order to determine which Character is at a particular position, you must iterate over each Unicode scalar from the start or end of that String. For this reason, Swift strings cannot be indexed by integer values"
I've read it several times and I still don't quite get the point. Can someone explain to me in a bit more detail why a Swift String cannot be indexed by integer values?
Many Thanks
A string is stored in memory as an array of bytes.
In UTF-8, a given character can require 1 to 4 bytes for the base code point, plus any number of combining diacritical marks.
For example, é (as a single precomposed code point) requires 2 bytes.
Now take the strings efgh and éfgh and try to access the second character (f). In the first string it sits at index 1 of the byte array; in the second string it sits at index 2.
In order to know that, you need to inspect the first character. For accessing any character based on its index, you need to go through all the previous characters to know how many bytes each takes.
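The walk described above can be sketched in Python (used here only for illustration; it is not Swift's implementation). The sketch assumes UTF-8 storage and counts code points rather than Swift's full Characters, and byte_offset_of is a hypothetical helper name.

```python
def byte_offset_of(text, char_index):
    """Return the UTF-8 byte offset of the code point at char_index.

    Every earlier code point has to be inspected, because each one may
    occupy a different number of bytes.
    """
    offset = 0
    for ch in text[:char_index]:
        offset += len(ch.encode("utf-8"))
    return offset

print(byte_offset_of("efgh", 1))        # 1
print(byte_offset_of("\u00e9fgh", 1))   # 2, because é takes two bytes in UTF-8
```

The cost of finding the n-th character therefore grows with n, which is why an integer subscript that looks like constant-time array access would be misleading.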

Loading dataset containing both strings and number

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
The problem is that it contains both integers and strings.
I found some information on how to extract only the integers,
but I haven't been able to find a way to load all of the data.
Is that possible?
If it is not possible, is there any way to find the numbers on each line and throw everything else away without having to choose the columns?
I need this specifically because it seems I cannot use str2num on a whole line at a time.
Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'Uniform', false);
The result here is a cell array of cell arrays of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
Good luck!
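For comparison, here is a rough sketch of the same idea in Python rather than MATLAB, extended to keep the non-numerical fields as the last paragraph of the answer suggests. The helper names parse_line and numbers_only are hypothetical, and the sample line is shortened for illustration rather than taken from the dataset.

```python
import re

def parse_line(line):
    """Split a comma-separated line, converting all-digit fields to int."""
    fields = []
    for field in line.strip().split(","):
        if re.fullmatch(r"\d+", field):
            fields.append(int(field))
        else:
            fields.append(field)  # keep text fields such as country or colour
    return fields

def numbers_only(line):
    """Keep just the numerical fields, discarding everything else."""
    return [f for f in parse_line(line) if isinstance(f, int)]

sample = "Albania,3,1,29,red,0"
print(parse_line(sample))    # ['Albania', 3, 1, 29, 'red', 0]
print(numbers_only(sample))  # [3, 1, 29, 0]
```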

Comparing strings in MIPS assembly

I have a bunch of strings in an array that I have defined in the data segment. If I were to take 2 of the strings from the array, is it possible to compare them to see which has a greater value in MIPS? How would I do this? Basically, I'm looking to rearrange the strings into alphabetical order.
EDIT: This is less of me trying to get help with a specific problem, and more of just a general question that will help me with my approach to the code. Thanks!
If it were me, I'd create a list of pointers to the strings. That is, a list of the addresses of each string. Then you'd write a subroutine that compares two strings given their pointers. Then, when you need to swap two strings, you simply swap their pointers.
You want to avoid swapping the strings themselves, since they may well be tightly packed, and you'd have to do a lot of shifting to move the holes in memory around. Pointers are simple to swap. You could swap the strings themselves more easily if each one occupied a fixed-length slot (padding the shorter ones), because then you wouldn't have to worry about moving the memory holes around.
But sorting the pointer list is really the hot tip.
To compare strings, the simplest way is to walk both strings one character at a time and subtract the second string's character from the first string's. If the result is 0, the characters are equal and you move on to the next pair. If the result is negative, the first string comes before the second and nothing needs to move; if it is positive, the second string comes first and you would swap them. If you run out of either string before the other, and they're equal all the way to that point, the shorter string is less than the longer one.
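The logic is easier to check in a high-level language before translating it to MIPS. The Python sketch below is an illustration only: list indices stand in for the pointers, the sort is a plain insertion sort chosen for brevity, and the names compare_bytes and sort_pointers are hypothetical. The convention used is first minus second, so a negative result means the first string sorts first.

```python
def compare_bytes(a, b):
    """Return <0 if a sorts before b, 0 if equal, >0 if a sorts after b."""
    i = 0
    while i < len(a) and i < len(b):
        diff = a[i] - b[i]      # subtract the character codes
        if diff != 0:
            return diff         # first differing character decides
        i += 1
    return len(a) - len(b)      # equal so far: the shorter string is smaller

def sort_pointers(strings):
    """Sort a list of indices ("pointers") without moving the strings."""
    order = list(range(len(strings)))
    for i in range(1, len(order)):
        j = i
        while j > 0 and compare_bytes(strings[order[j - 1]], strings[order[j]]) > 0:
            order[j - 1], order[j] = order[j], order[j - 1]  # swap pointers only
            j -= 1
    return order

words = [b"pear", b"apple", b"app", b"banana"]
order = sort_pointers(words)
print(order)                      # [2, 1, 3, 0]
print([words[k] for k in order])
```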

How can using strings instead of simple types like integers alter the O-notation of operations?

Proposed answer:
Strings are simply arrays of characters so the O-notation will be dependent on the number of characters in the string (if the loop depends on the length of the string). In this case the O-notation wouldn't be affected because the length of the string is a constant.
Any other ideas? Am I reading this question correctly?
This is not quite true, since integers represented as character arrays cannot grow without bound.
In other words, a string that represents a 32-bit integer holds at most 32 bits of information, thus at most 10 digits in base 10, and O(10) is a negligible constant that doesn't change the O-notation.
So, in summary, while general strings are O(n), basic integer types represented as strings are O(at most 10) = O(1).
I think you need to specify your problem better.
Try thinking about something that operates on an array of integers or an array of strings. Clearly, in the latter case you have an array of arrays of a primitive type rather than an array of a primitive type. How does this change things?
That depends entirely on what you are doing with the strings.
If you, for example, copy items from one array to another, the result depends on the implementation. It's still an O(n) operation, but the meaning of n changes. If copying a string causes a new copy to be created, n means the total number of characters in all the strings. If copying a string only copies the reference to it, n means the total number of strings.
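The two meanings of n can be made concrete by counting the basic steps each kind of copy performs. The Python sketch below is an illustration only; the function names and the explicit step counter are hypothetical devices, not part of any answer above.

```python
def copy_by_reference(strings):
    """Copy only the references: one step per string."""
    copied, steps = [], 0
    for s in strings:
        copied.append(s)        # the new list shares the same string objects
        steps += 1
    return copied, steps

def copy_by_value(strings):
    """Rebuild every string character by character: one step per character."""
    copied, steps = [], 0
    for s in strings:
        chars = []
        for ch in s:
            chars.append(ch)
            steps += 1
        copied.append("".join(chars))
    return copied, steps

words = ["alpha", "be", "gamma"]
print(copy_by_reference(words)[1])  # 3 steps: n is the number of strings
print(copy_by_value(words)[1])      # 12 steps: n is the total number of characters
```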
