MATLAB: Quickest Way To Convert Characters to a Custom Set of Numbers and Back

I'm looking for a quick way to convert a large character array of lowercase letters, spaces and periods into a set of integers and vice-versa in MATLAB.
Usually I would use the double and char functions, but I would like to use a special set of integers to represent each letter (so that 'a' matches with '1', 'b' matches with '2'.... 'z' matches with 26, ' ' matches with 27, and '.' matches with 28)
The current method that I have is:
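text = 'quick brown fox jumps over dirty dog';
alphabet = 'abcdefghijklmnopqrstuvwxyz .';
converted_text = double(text);
converted_alphabet = double(alphabet);
% One output slot per input character, not one per alphabet entry
numbers = nan(size(converted_text));
for i = 1:28
    % Compare the whole text against the i-th alphabet character
    numbers(converted_text == converted_alphabet(i)) = i;
end
% Map the custom integers back to character codes
newtext = nan(size(numbers));
for i = 1:28
    newtext(numbers == i) = alphabet(i);
end
newtext = char(newtext);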
Unfortunately this takes quite a bit of time for large arrays, and I'm wondering if there is a quicker way to do this in MATLAB?

An easy way is to use ismember():
[~,pos] = ismember(text,alphabet)
Or use the implicit conversion carried out by -:
out = text - 'a' + 1;
Note that blanks will map to -64 and full stops to -50, which means that you will need:
out(out == -64) = 27;
out(out == -50) = 28;
Speed considerations:
For small arrays, the latter solution is slightly faster IF you are happy to leave blanks and full stops with their negative values.
For big arrays (1e4 times longer, on my machine), the latter solution is about twice as fast as ismember().
Going back (once out holds only the indices 1 to 28):
alphabet(out)

Related

Dealing with problems where memory isn't sufficient. Dynamic programming

I was solving a problem in Python in which the string "abc" is repeated, with the run length of each character doubling every time: "abcaabbccaaaabbbbcccc...". I had to find the nth character, with the constraint n <= 10^9. When I tried to store the string, I got a memory error because it was too large (I tried to store everything up to the round in which each character is repeated 2^30 times). Can somebody help me with an approach to tackle this situation?
t = ''
for i in range(0, 30):
    t = t + 'a' * (2**i)
    t = t + 'b' * (2**i)
    t = t + 'c' * (2**i)
Obviously, you can't do this the straightforward, brute-force way. Instead, you need to count along a virtual string to find where your given index appears. I'll lay this out in too much detail so you can see the logic:
n = 314159265  # Pick a large value for illustration
rem = n
for i in range(0, 30):
    size = 2**i
    # print(size, rem)
    rem -= size
    if rem <= 0:
        char = 'a'
        break
    rem -= size
    if rem <= 0:
        char = 'b'
        break
    rem -= size
    if rem <= 0:
        char = 'c'
        break
print("Character", n, "is", char)
Output:
Character 314159265 is b
You can shorten this with a better loop body; I'll leave that as a further exercise. If you get insightful with your arithmetic, you can simply compute the appropriate letter from the chunk sizes you generate.
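One way the arithmetic shortcut could look, as a sketch rather than the answerer's own solution. The function name nth_char is made up for illustration, and n is assumed to be 1-indexed as in the loop above. Round i contributes 3 * 2**i characters and all earlier rounds contribute 3 * (2**i - 1), so the round can be read off the bit length of ceil(n / 3).

```python
def nth_char(n: int) -> str:
    """Return the n-th character (1-indexed) of 'abc' 'aabbcc' 'aaaabbbbcccc' ..."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # n lies in round i exactly when 2**i <= ceil(n / 3) < 2**(i + 1).
    i = ((n + 2) // 3).bit_length() - 1
    # 0-indexed position of n inside round i.
    offset = n - 3 * (2**i - 1) - 1
    # Each letter occupies a run of 2**i characters within the round.
    return "abc"[offset // 2**i]


print("Character", 314159265, "is", nth_char(314159265))  # Character 314159265 is b
```

This uses constant memory and no loop, so it also works for values of n far beyond 10^9.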

Combining words in a dictionary to match a single word

I'm working on a problem where I need to check how many words in a dictionary can be combined to match a single word.
For example:
Given the string "hellogoodsir", and the dictionary: {hello, good, sir, go, od, e, l}, the goal is to find all the possible combinations to form the string.
In this case, the result would be hello + good + sir, and hello + go + od + sir, resulting in 3 + 4 = 7 words used, or 1 + 1 = 2 combinations.
What I've come up with is simply to put all the words starting with the first character ("h" in this instance) in one hashmap (startH), and the rest in another hashmap (endH). I then go through every single word in the startH hashmap, and check if "hellogoodsir" contains the new word (start + end), where end is every word in the endH hashmap. If it does, I check if it equals the word to match, and if so I increment the counter by the number of words used. If it contains it, but doesn't equal it, I call the same method (recursion) using the new word (i.e. start + end), and proceed to try to append any word in the end hashmap to the new word to get a match.
This is obviously very slow for large number of words (and a long string to match). Is there a more efficient way to solve this problem?
As far as I know, this is an O(n^2) algorithm, but I'm sure this can be done faster.
Let's start with your solution. It is neither linear nor quadratic time; it's actually exponential time. A counterexample that shows this is:
word = "aaa...a"
dictionary = {"a", "aa", "aaa", ..., "aa...a"}
Since your solution goes through every possible matching, and there is an exponential number of them in this example, the solution takes exponential time.
However, this can be done more efficiently (quadratic time in the worst case) with dynamic programming, by following the recursive formula:
D[0] = 1
D[i] = sum { D[j] | 0 <= j < i and word.Substring(j,i) is in the dictionary }
Here D[i] is the number of ways to build the first i characters of the word, and word.Substring(j,i) denotes the characters from position j up to (but not including) position i.
Calculating each D[i] (given that the previous ones are already known) takes O(i) dictionary lookups.
This sums to O(n^2) time in total, with O(n) extra space.
Quick note: By iterating the dictionary instead of all (i,j) pairs for each D[i], you can achieve O(k) time for each D[i], which ends up as O(n*k), where k is the dictionary size. This can be optimized for some cases by traversing only potentially valid strings - but for the same counter example as above, it will result in O(n*k).
Example:
dictionary = {hello, good, sir, go, od, e, l}
string = "hellogoodsir"
D[0] = 1
D[1] = 0 ("h" is not in the dictionary)
D[2] = 0 ("he" is not in the dictionary; "e" is, but D[1] = 0)
...
D[5] = 1 ("hello" is the only match, via D[0])
D[6] = 0 (no dictionary string ends with "g")
D[7] = D[5] = 1, because string.substring(5,7) = "go" is in the dictionary
D[8] = 0 (neither "o" nor any string ending with "oo" is in the dictionary)
D[9] = 2: D[7] for "od", plus D[5] for "good"
D[10] = D[11] = 0 (no dictionary string ends with "s" or "si")
D[12] = D[9] = 2, for substring "sir"
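The recurrence above can be sketched in Python as follows. The function name count_segmentations is made up for illustration, and the dictionary is assumed to fit in a set. Slicing a substring costs time proportional to its length, so this sketch only matches the O(n^2) bound if each lookup is counted as one step.

```python
def count_segmentations(word, dictionary):
    """Count the ways to write `word` as a concatenation of dictionary words."""
    words = set(dictionary)
    n = len(word)
    d = [0] * (n + 1)
    d[0] = 1  # the empty prefix can be built in exactly one way
    for i in range(1, n + 1):
        for j in range(i):
            # word[j:i] is the candidate last word of the prefix word[:i]
            if d[j] and word[j:i] in words:
                d[i] += d[j]
    return d[n]


print(count_segmentations("hellogoodsir",
                          ["hello", "good", "sir", "go", "od", "e", "l"]))  # 2
```

On the counterexample ("aaa...a" with every run of a's in the dictionary) the count itself grows exponentially, but the table is still filled in a quadratic number of steps.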
My suggestion would be to use a prefix tree. The nodes beneath the root would be h, g, s, o, e, and l. You will need nodes for terminating characters as well, to differentiate between go and good.
To find all matches, use a Breadth-first-search approach. The state you will want to keep track of is a composition of: the current index in the search-string, the current node in the tree, and the list of words used so far.
The initial state should be 0, root, []
While the list of states is not empty, dequeue the next state, and see if the index matches any of the keys of the children of the node. If so, modify a copy of the state and enqueue it. Also, if any of the children are the terminating character, do the same, adding the word to the list in the state.
I'm not sure about the time complexity of this algorithm, but it should be much faster.
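A minimal Python sketch of the prefix tree plus breadth-first search described above, under a few assumptions of mine: the trie is a nested dict, the empty string serves as the terminating key (it can never clash with a single-character key), and the state also carries the start index of the word currently being matched. The names build_trie and all_segmentations are hypothetical.

```python
from collections import deque

END = ""  # terminator key marking "a dictionary word ends at this node"


def build_trie(words):
    root = {}
    for w in words:
        if not w:
            continue  # an empty word would never advance the index
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def all_segmentations(text, words):
    root = build_trie(words)
    results = []
    # State: (index in text, current trie node, start of current word, words so far)
    queue = deque([(0, root, 0, [])])
    while queue:
        index, node, start, used = queue.popleft()
        if END in node:
            # A word ends here: record it and restart matching from the root.
            finished = used + [text[start:index]]
            if index == len(text):
                results.append(finished)
            else:
                queue.append((index, root, index, finished))
        if index < len(text) and text[index] in node:
            # Keep extending the current word by one character.
            queue.append((index + 1, node[text[index]], start, used))
    return results


for combo in all_segmentations("hellogoodsir",
                               ["hello", "good", "sir", "go", "od", "e", "l"]):
    print(" + ".join(combo))
```

Because this lists every combination instead of counting them, it is still exponential on inputs like the "aaa...a" counterexample; the trie only prunes branches that cannot match.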

How to tangle/scramble/rearrange a string in MATLAB?

For an example exam question, I've been asked to "tangle" a string as shown:
tangledWord('today')='otady'
tangledWord('12345678')='21436587'
I understand this is an extremely simple problem but it's got me stumped.
I can make it produce a tangled word when the length is even, but I'm having trouble when it's odd, here's my function:
function tangledWord(s)
    n = length(s);
    a = s(1:2:n);
    b = s(2:2:n);
    s(1:2:n) = b;
    s(2:2:n) = a;
    disp(s);
end
For odd word length, you need to reduce n by 1 to leave the last char untouched. Use mod to detect odd word length.
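The same idea sketched in Python rather than MATLAB, purely as an illustration: round the length down to the nearest even number so that a trailing odd character is left alone. The function name tangled_word is made up here.

```python
def tangled_word(s: str) -> str:
    """Swap each adjacent pair of characters; an odd trailing character stays put."""
    n = len(s) - len(s) % 2  # largest even length not exceeding len(s)
    chars = list(s)
    # Exchange the characters at even and odd positions within the first n.
    chars[0:n:2], chars[1:n:2] = chars[1:n:2], chars[0:n:2]
    return "".join(chars)


print(tangled_word("today"))     # otady
print(tangled_word("12345678"))  # 21436587
```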
If you want to scramble every char randomly, you can try:
str = '1234567';
shuffled = str(randperm(numel(str)))
shuffled = 5741326 (one possible result)
If you want to swap the first two chars:
tangled = [str(2) str(1) str(3:end)]
tangled = 2134567
If you want to swap every pair of chars:
n = numel(str) - mod(numel(str), 2);
pairs = flipud(reshape(str(1:n), 2, []));  % one pair per column, rows swapped
tangled2 = [pairs(:).' str(n+1:end)]
tangled2 = 2143657
function tangledWord(s)
    n = length(s);
    if mod(n,2) == 0
        a = s(1:2:n);
        b = s(2:2:n);
        s(1:2:n) = b;
        s(2:2:n) = a;
        disp(s)
    elseif mod(n,2) ~= 0
        a = s(1:2:end-1);
        b = s(2:2:end-1);
        s(1:2:end-1) = b;
        s(2:2:end-1) = a;
        disp(s)
    end
end

How does a compiler convert an integer to a string and vice versa?

Many languages have functions for converting string to integer and vice versa. So what happens there? What algorithm is being executed during conversion?
I'm not asking about a specific language because I think it should be similar in all of them.
To convert a string to an integer, take each character in turn and if it's in the range '0' through '9', convert it to its decimal equivalent. Usually that's simply subtracting the character value of '0'. Now multiply any previous results by 10 and add the new value. Repeat until there are no digits left. If there was a leading '-' minus sign, negate the result.
To convert an integer to a string, start by negating the number if it is negative. Divide the integer by 10 and save the remainder. Convert the remainder to a character by adding the character value of '0'. Push this to the beginning of the string; now repeat with the value that you obtained from the division. Repeat until the divided value is zero. Put out a leading '-' minus sign if the number started out negative.
Here are concrete implementations in Python, which in my opinion is the language closest to pseudo-code.
def string_to_int(s):
    i = 0
    sign = 1
    if s[0] == '-':
        sign = -1
        s = s[1:]
    for c in s:
        if not ('0' <= c <= '9'):
            raise ValueError
        # Shift the previous digits one decimal place and add the new one
        i = 10 * i + ord(c) - ord('0')
    return sign * i

def int_to_string(i):
    s = ''
    sign = ''
    if i < 0:
        sign = '-'
        i = -i
    while True:
        remainder = i % 10
        i = i // 10  # integer division; '/' would produce a float in Python 3
        s = chr(ord('0') + remainder) + s
        if i == 0:
            break
    return sign + s
I wouldn't call it an algorithm per se, but depending on the language it will involve the conversion of characters into their integral equivalent. Many languages will either stop on the first character that cannot be represented as an integer (e.g. the letter a), will blindly convert all characters into their ASCII value (e.g. the letter a becomes 97), or will ignore characters that cannot be represented as integers and only convert the ones that can - or return 0 / empty. You have to get more specific on the framework/language to provide more information.
String to integer:
Many (most) languages represent strings, on some level or another, as an array (or list) of characters, which are also short integers. Map the ones corresponding to number characters to their number value. For example, '0' in ascii is represented by 48. So you map 48 to 0, 49 to 1, and so on to 9.
Starting from the left, you multiply your current total by 10, add the next character's value, and move on. (You can make a larger or smaller map, change the number you multiply by at each step, and convert strings of any base you like.)
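To illustrate the remark about other bases, here is a small Python sketch that swaps in a larger digit map and multiplies by the base at each step. The name parse_int and the digit alphabet (0-9 followed by a-z, case-insensitive, so bases 2 to 36) are assumptions for this example, not anything the answer prescribes.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def parse_int(s, base=10):
    """Convert a string of digits in the given base to an integer."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    # Map each allowed character to its numeric value, e.g. 'a' -> 10.
    value_of = {ch: v for v, ch in enumerate(DIGITS[:base])}
    sign = 1
    if s[:1] == "-":
        sign, s = -1, s[1:]
    if not s:
        raise ValueError("no digits to convert")
    total = 0
    for ch in s.lower():
        if ch not in value_of:
            raise ValueError(f"invalid digit {ch!r} for base {base}")
        # Multiply the running total by the base, then add the next digit.
        total = total * base + value_of[ch]
    return sign * total


print(parse_int("12345"))     # 12345
print(parse_int("ff", 16))    # 255
print(parse_int("-1010", 2))  # -10
```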
Integer to string is a longer process involving base conversion to 10. I suppose that since most integers have limited bits (32 or 64, usually), you know that it will come to a certain number of characters at most in a string (20?). So you can set up your own adder and iterate through each place for each bit after calculating its value (2^place).
