Split string into cell array by positions - string

I have a file with strings of a known length, but no separator.
% What should be the result
vals = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);
% what the file looks like when read in
strs = cell2mat(vals);
strlens = cellfun(@length, vals);
The most straightforward approach is quite slow:
out = cell(1, length(strlens));
for i=1:length(strlens)
out{i} = fread(f, strlens(i), '*char');
end % 5.7s
Reading everything in and splitting it up afterwards is a lot faster:
strs = fread(f, sum(strlens), '*char');
out = cell(1, length(strlens));
slices = [0, cumsum(strlens)];
for i=1:length(strlens)
out{i} = strs(slices(i)+1:slices(i+1));
end % 1.6s
With a mex function I can get down to 0.6s, so there's still a lot of room for improvement. Can I get comparable performance with pure Matlab (R2016a)?
Edit: the seemingly perfect mat2cell function doesn't help:
out = mat2cell(strs, 1, strlens); % 2.49s

Your last approach – reading everything at once and splitting it up afterwards – looks pretty optimal to me, and is how I do stuff like this.
For me, it runs in about 80 ms when the file is on a local SSD, in both R2016b and R2019a on a Mac.
function out = scratch_split_strings(strlens)
%
% Example:
% in_strs = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);
% strlens = cellfun(@length, in_strs);
% big_str = cat(2, in_strs{:});
% fid = fopen('text.txt', 'w'); fprintf(fid, '%s', big_str); fclose(fid);
% scratch_split_strings(strlens);
t0 = tic;
fid = fopen('text.txt');
txt = fread(fid, sum(strlens), '*char');
fclose(fid);
fprintf('Read time: %0.3f s\n', toc(t0));
str = txt;
t0 = tic;
out = cell(1, length(strlens));
slices = [0, cumsum(strlens)];
for i = 1:length(strlens)
out{i} = str(slices(i)+1:slices(i+1))';
end
fprintf('Munge time: %0.3f s\n', toc(t0));
end
>> scratch_split_strings(strlens);
Read time: 0.002 s
Munge time: 0.075 s
Have you stuck it in the profiler to see what's taking up your time here?
As far as I know, there is no faster way to split up a single primitive array into variable-length subarrays with native M-code. You're doing it right.
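To sanity-check the slicing logic without touching a file, here is a minimal sketch on a tiny in-memory input. The three sample strings are made up for illustration; in the real case strs would come from fread.

```matlab
% Hypothetical small input; the real data would come from fread.
vals = {'Foobar 1', 'Foobar 22', 'Foobar 333'};
strlens = cellfun(@length, vals);
strs = [vals{:}];                 % concatenated text without separators

slices = [0, cumsum(strlens)];    % slices(i)+1 is where piece i starts
out = cell(1, numel(strlens));
for i = 1:numel(strlens)
    out{i} = strs(slices(i)+1:slices(i+1));
end
roundTripOk = isequal(out, vals); % true when every piece was recovered
```

If roundTripOk is true, the offsets computed by cumsum line up with the original string boundaries.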

Related

How to extract substrings with different lengths?

I have an n by 2 matrix that contains start and end indices of substrings of a specified string. How can I extract the n by 1 cell array of substrings without a for-loop?
string = 'Hello World!';
ranges = [1 1;
2 3;
4 5;
3 7];
substrings = cell(size(ranges, 1), 1);
for i=1:size(ranges, 1)
substrings{i} = string(ranges(i, 1):ranges(i, 2));
end
The expected result:
substrings =
'H'
'el'
'lo'
'llo W'
You can use cellfun to make it a single-line operation:
str = 'Hello World!';
ranges = [ 1 1;
2 3;
4 5;
3 7];
% first convert "ranges" to a cell object
Cranges = mat2cell(ranges,ones(size(ranges,1),1),2);
% call "cellfun" on every row/entry of "Cranges"
cellfun(@(x)str(x(1):x(2)),Cranges, 'UniformOutput',false)
ans =
4×1 cell array
{'H' }
{'el' }
{'lo' }
{'llo W'}
I have changed the variable string to str because string is a native function in MATLAB (converting the input to the type string).
Although this is a single-line operation, that does not make it more efficient:
Num = 1000000;
substrings = cell(size(ranges, 1), 1);
% time for-loop
tic
for j = 1:Num
for i = 1:size(ranges, 1)
substrings{i} = str(ranges(i, 1):ranges(i, 2));
end
end
toc;
Cranges = mat2cell(ranges,ones(size(ranges,1),1),2);
% time function-call
tic
for j = 1:Num
substrings = cellfun(@(x)str(x(1):x(2)),Cranges, 'UniformOutput',false);
end
toc;
Elapsed time is 3.929622 seconds.
Elapsed time is 50.319609 seconds.
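For completeness, a loop-free variant that indexes the rows of ranges directly and skips the mat2cell step might look like the sketch below. It is only an illustration of the same idea with arrayfun; no timing claim is made for it, and it would need to be measured against the plain for-loop.

```matlab
str = 'Hello World!';
ranges = [1 1;
          2 3;
          4 5;
          3 7];

% One anonymous-function call per row index; returns an n-by-1 cell array.
rowIdx = (1:size(ranges, 1))';
substrings = arrayfun(@(i) str(ranges(i, 1):ranges(i, 2)), rowIdx, ...
                      'UniformOutput', false);
```

The result has the same shape and content as the cellfun version above.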

Does xlswrite have limitations?

I'm running MATLAB R2017a. I am trying to execute a simple program that writes 3 characters to an Excel file. When I run the program with a small number of values it is fine but when I increase it to the millions, the program pauses.
Does anyone know why the program is pausing like this?
X = []
filename = 'PopltnFL.xlsx';
NumTrump = 4617886;
NumClinton = 4504975;
NumOther = 297025;
% Values for which the program runs without pausing
% NumTrump = 4;
% NumClinton = 4;
% NumOther = 2;
%
for ii = 1:NumTrump
X = [X,'T'];
end
for jj = 1:NumClinton
X = [X,'C'];
end
for kk = 1:NumOther
X = [X,'O'];
end
X = X';
xlswrite(filename,X)
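Two things may be worth checking independently of xlswrite itself. Growing X one character per iteration copies the array on every append, which gets very slow for millions of elements. Also, as far as I know, an .xlsx worksheet holds at most 1,048,576 rows, so a single column of roughly 9.4 million values would not fit. A minimal sketch, using the small counts from the question and an assumed maxRows constant:

```matlab
% Small counts from the question so the sketch runs quickly.
NumTrump = 4; NumClinton = 4; NumOther = 2;

% Build the column in one step instead of appending one character per iteration.
X = [repmat('T', NumTrump, 1); repmat('C', NumClinton, 1); repmat('O', NumOther, 1)];

% Assumed .xlsx worksheet row limit; one column cannot hold more rows than this.
maxRows = 1048576;
fitsInOneColumn = numel(X) <= maxRows;
```

With the full counts, fitsInOneColumn would be false, so the data would have to be split across columns or written to a plain text file instead.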

MATLAB: How to load a list of filenames from a .txt file

filelist.txt contains a list of files:
/path/file1.json
/path/file2.json
/path/fileN.json
Is there a (simple) MATLAB command that will accept filelist.txt and read each file as a string and store each string into a cell array?
Just use readtable, asking it to read each line in full.
>> tbl = readtable('filelist.txt','ReadVariableNames',false,'Delimiter','\n');
>> tbl.Properties.VariableNames = {'filenames'}
tbl =
3×1 table
filenames
__________________
'/path/file1.json'
'/path/file2.json'
'/path/fileN.json'
Then access the elements in a loop
for idx = 1:height(tbl)
this_filename = tbl.filenames{idx};
end
This problem is a bit too specific for a standard function. However, it is easily doable by combining two functions:
First, you have to open the file:
fid = fopen('filelist.txt');
Next you can read line by line with:
line_ex = fgetl(fid)
Each call advances the position in the file, so the next call returns the second line, and so on. You can find more information in the fgetl documentation.
The whole code might look like this:
% Open file
fid = fopen('testabc');
numberOfLines = 3;
% Preallocate cell array
line = cell(numberOfLines, 1);
% Read one line after the other and save it in a cell array
for i = 1:numberOfLines
line{i} = fgetl(fid);
end
% Close file
fclose(fid);
If the number of lines is not known in advance, replace the for loop with a while loop:
i = 0;
while ~feof(fid)
i = i + 1;
line{i} = fgetl(fid);
end
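Put together, reading a list file of unknown length might look like the sketch below. It writes a small temporary file first so that it runs on its own; the file contents and the variable name names are just placeholders. It relies on fgetl returning -1 (not a char array) at end of file.

```matlab
% Create a small hypothetical list file so the sketch is self-contained.
listfile = [tempname, '.txt'];
fid = fopen(listfile, 'w');
fprintf(fid, '/path/file1.json\n/path/file2.json\n/path/fileN.json\n');
fclose(fid);

% Read every line into a cell array without knowing the line count up front.
fid = fopen(listfile, 'r');
names = {};
tline = fgetl(fid);
while ischar(tline)          % fgetl returns -1 once the end of file is reached
    names{end+1, 1} = tline;
    tline = fgetl(fid);
end
fclose(fid);
delete(listfile);
```

Growing the cell array with end+1 is fine for short lists; for very long files, preallocating as in the for-loop version avoids repeated reallocation.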
Alternative to while loop: Retrieve the number of lines and use in Caduceus' for-loop:
% Open file
fid = fopen('testabc');
numberOfLines = numlinestextfile('testabc'); % function defined below
% Preallocate cell array
line = cell(numberOfLines, 1);
% Read one line after the other and save it in a cell array
for i = 1:numberOfLines
line{i} = fgetl(fid);
end
% Close file
fclose(fid);
Custom function:
function lineCount = numlinestextfile(filename)
%numlinestextfile: returns the line count of filename, or -1 on failure
if ~ispc % Unix-like systems; tested on OSX
    [status, cmdout] = system(['wc -l ', filename]);
    if status == 0 % system returns 0 on success
        scanCell = textscan(cmdout, '%u %s');
        lineCount = scanCell{1};
    else
        fprintf(1, 'Failed to find line count of %s\n', filename);
        lineCount = -1;
    end
else % Windows-based systems
    [status, cmdout] = system(['find /c /v "" ', filename]);
    if status == 0
        scanCell = textscan(cmdout, '%s %s %u');
        lineCount = scanCell{3};
        disp(['Found ', num2str(lineCount), ' lines in the file']);
    else
        disp('Unable to determine number of lines in the file');
        lineCount = -1;
    end
end

Writing strings to binary in Lua

I'm having issues writing strings to binary in Lua. There is an existing example and I tried modifying it. Take a look:
function StringToBinary()
local file = io.open("file.bin", "wb")
local t = {}
local u = {}
local str = "Hello World"
file:write("string len = " ..#str ..'\n')
math.randomseed(os.time())
for i=1, #str do
t[i] = string.byte(str[i])
file:write(t[i].." ");
end
file:write("\n")
for i=1, #str do
u[i] = math.random(0,255)
file:write(u[i].." ");
end
file:write("\n"..string.char(unpack(t)))
file:write("\n"..string.char(unpack(u)))
file:close()
end
file:write(t[i].." ") and file:write(u[i].." ") both write their tables as integer values. However, with my last two writes, unpack(t) displays the original text, while unpack(u) displays the binaries.
It's probably string.byte(str[i]) that is mistaken. What should I replace it with? Am I missing something?
t[i] = string.byte(str[i])
is wrong, because str[i] does not index into the string in Lua (it evaluates to nil). It should be:
t[i] = string.byte(str, i)

How do you sort and efficiently find elements in a cell array (of strings) in Octave?

Is there built-in functionality for this?
Searching a cell array of strings in GNU Octave in linear time, O(n):
(The 15 year old code in this answer was tested and correct on GNU Octave 3.8.2, 5.2.0 and 7.1.0)
The other answer uses cellidx, which has been deprecated in Octave. It still runs, but Octave recommends using ismember instead, like this:
%linear time string index search.
a = ["hello"; "unsorted"; "world"; "moobar"]
b = cellstr(a)
%b =
%{
% [1,1] = hello
% [2,1] = unsorted
% [3,1] = world
% [4,1] = moobar
%}
find(ismember(b, 'world')) %returns 3
ismember finds 'world' in index slot 3. This is an expensive linear-time O(n) operation because it has to iterate through all elements whether or not the value is found.
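An equivalent linear scan can also be written with strcmp, which compares every cell against the search string and returns a logical mask. This is only a sketch of the same O(n) idea, not a faster alternative; the cell array b is written out literally here so the snippet runs on its own.

```matlab
b = {'hello'; 'unsorted'; 'world'; 'moobar'};

mask = strcmp(b, 'world');        % logical mask, one entry per cell
idx = find(mask);                 % index of the match
notFound = isempty(find(strcmp(b, 'zzz')));   % true when nothing matches
```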
To achieve a logarithmic-time O(log n) worst case, the list must already be sorted so that you can use binary search:
function i = binsearch(array, val, low, high)
%binary search algorithm for numerics, Usage:
%myarray = [ 30, 40, 50.15 ]; %already sorted list
%binsearch(myarray, 30, 1, 3) %item 30 is in slot 1
if ( high < low )
i = 0;
else
mid = floor((low + high) / 2);
if ( array(mid) > val )
i = binsearch(array, val, low, mid-1);
elseif ( array(mid) < val )
i = binsearch(array, val, mid+1, high);
else
i = mid;
endif
endif
endfunction
function i = binsearch_str(array, val, low, high)
% binary search for strings, usage:
%myarray2 = [ "abc"; "def"; "ghi"]; #already sorted list
%binsearch_str(myarray2, "abc", 1, 3) #item abc is in slot 1
if ( high < low )
i = 0;
else
mid = floor((low + high) / 2);
if ( mystrcmp(array(mid, [1:end]), val) == 1 )
i = binsearch_str(array, val, low, mid-1);
elseif ( mystrcmp(array(mid, [1:end]), val) == -1 )
i = binsearch_str(array, val, mid+1, high);
else
i = mid;
endif
endif
endfunction
function ret = mystrcmp(a, b)
%this function is just an octave string compare, its behavior follows the
%strcmp(str1,str2)'s in C and java.lang.String.compareTo(...)'s in Java,
%that is:
% -returns 1 if string a > b
% -returns 0 if string a == b
% -return -1 if string a < b
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
a_as_string = a;
if iscellstr( a )
a_as_string = a{1}; %a was passed as a single-element cell array.
endif
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
b_as_string = b;
if iscellstr( b )
b_as_string = b{1}; %b was passed as a single-element cell array.
endif
% Space-pad the shortest word so as they can be used with gt() and lt() operators.
if length(a_as_string) > length( b_as_string )
b_as_string( length( b_as_string ) + 1 : length( a_as_string ) ) = " ";
elseif length(a_as_string) < length( b_as_string )
a_as_string( length( a_as_string ) + 1 : length( b_as_string ) ) = " ";
endif
letters_gt = gt(a_as_string, b_as_string); %list of boolean a > b
letters_lt = lt(a_as_string, b_as_string); %list of boolean a < b
ret = 0;
%octave makes us roll our own string compare because
%strings are arrays of numerics
len = length(letters_gt);
for i = 1:len
if letters_gt(i) > letters_lt(i)
ret = 1;
return
elseif letters_gt(i) < letters_lt(i)
ret = -1;
return
endif
end;
endfunction
%Assuming that myarray is already sorted (it must be for binary
%search to finish in logarithmic time, O(log n) worst case), then do
myarray = [ 30, 40, 50.15 ]; %already sorted list
binsearch(myarray, 30, 1, 3) %item 30 is in slot 1
binsearch(myarray, 40, 1, 3) %item 40 is in slot 2
binsearch(myarray, 50, 1, 3) %50 does not exist so return 0
binsearch(myarray, 50.15, 1, 3) %50.15 is in slot 3
%same but for strings:
myarray2 = [ "abc"; "def"; "ghi"]; %already sorted list
binsearch_str(myarray2, "abc", 1, 3) %item abc is in slot 1
binsearch_str(myarray2, "def", 1, 3) %item def is in slot 2
binsearch_str(myarray2, "zzz", 1, 3) %zzz does not exist so return 0
binsearch_str(myarray2, "ghi", 1, 3) %item ghi is in slot 3
To sort your array if it isn't already:
The complexity of sorting depends on the kind of data you have and on whichever sorting algorithm the GNU Octave developers selected; it is somewhere between O(n*log(n)) and O(n*n).
myarray = [ 9, 40, -3, 3.14, 20 ]; %not sorted list
myarray = sort(myarray)
myarray2 = [ "the"; "cat"; "sat"; "on"; "the"; "mat"]; %not sorted list
myarray2 = sortrows(myarray2)
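Since the question is about cell arrays of strings, here is a rough sketch of sorting a cell array with sort and then running an iterative binary search over it. The ordering test reuses sort on a two-element cell array, which is an illustrative shortcut rather than the approach from the answer above; the word list is made up.

```matlab
% Sort a hypothetical cell array of strings.
words = sort({'the', 'cat', 'sat', 'on', 'mat'});
val = 'sat';

% Iterative binary search; idx stays 0 when val is not present.
low = 1; high = numel(words); idx = 0;
while low <= high
    mid = floor((low + high) / 2);
    if strcmp(words{mid}, val)
        idx = mid;
        break
    end
    % sort puts the smaller string first; order(1) == 1 means val < words{mid}
    [~, order] = sort({val, words{mid}});
    if order(1) == 1
        high = mid - 1;
    else
        low = mid + 1;
    end
end
```

Calling sort inside the loop costs more per comparison than a dedicated compare function, but it keeps the sketch short and uses the same ordering that produced the sorted list.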
Credit for the code changes that make this backward compatible with GNU Octave 3, 5, and 7 goes to @Paulo Carvalho in the other answer here.
Yes check this: http://www.obihiro.ac.jp/~suzukim/masuda/octave/html3/octave_36.html#SEC75
a = ["hello"; "world"];
c = cellstr (a)
⇒ c =
{
[1,1] = hello
[2,1] = world
}
>>> cellidx(c, 'hello')
ans = 1
>>> cellidx(c, 'world')
ans = 2
The cellidx solution does not meet the OP's efficiency requirement, and is deprecated (as noted by help cellidx).
Håvard Geithus in a comment suggested using the lookup() function on a sorted cell array of strings, which is significantly more efficient than cellidx. It's still a binary search though, whereas most modern languages (and even many 20 year old ones) give us easy access to associative arrays, which would be a much better approach.
While Octave doesn't obviously have associative arrays, that's effectively what the interpreter is using for Octave's variables, including structs, so you can make use of that, as described here:
http://math-blog.com/2011/05/09/associative-arrays-and-cellular-automata-in-octave/
Built-in Function: struct ("field", value, "field", value,...)
Built-in Function: isstruct (expr)
Built-in Function: rmfield (s, f)
Function File: [k1,..., v1] = setfield (s, k1, v1,...)
Function File: [t, p] = orderfields (s1, s2)
Built-in Function: fieldnames (struct)
Built-in Function: isfield (expr, name)
Function File: [v1,...] = getfield (s, key,...)
Function File: substruct (type, subs,...)
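As a rough sketch of the struct-as-associative-array idea, using only isfield and dynamic field names: each string becomes a field name and its position becomes the value. The word list and the variable name table are made up for illustration, and the approach assumes every key is a valid field name (no spaces, not starting with a digit). I have not benchmarked it.

```matlab
% Hypothetical word list; each word becomes a field that maps to its index.
words = {'hello', 'unsorted', 'world', 'moobar'};
table = struct();
for k = 1:numel(words)
    table.(words{k}) = k;         % dynamic field name acts as the key
end

% Lookup by key, with 0 meaning "not found".
if isfield(table, 'world')
    pos = table.('world');
else
    pos = 0;
end
hasZzz = isfield(table, 'zzz');   % false: key not present
```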
The question "Converting Matlab to Octave is there a containers.Map equivalent?" suggests using javaObject("java.util.Hashtable"). That would come with some setup overhead, but would be a performance win if you're using it a lot. It may even be viable to link in a library written in C or C++. Do think about whether this is a maintainable option, though.
Caveat: I'm relatively new to Octave, and writing this up as I research it myself (which is how I wound up here). I haven't yet run tests on the efficiency of these techniques, and while I've got a fair knowledge of the underlying algorithms, I may be making unreasonable assumptions about what's actually efficient in Octave.
This is a version of mystrcmp() that works in recent versions of Octave (7.1.0):
function ret = mystrcmp(a, b)
%this function is just an octave string compare, its behavior follows the
%strcmp(str1,str2)'s in C and java.lang.String.compareTo(...)'s in Java,
%that is:
% -returns 1 if string a > b
% -returns 0 if string a == b
% -return -1 if string a < b
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
a_as_string = a;
if iscellstr( a )
a_as_string = a{1}; %a was passed as a single-element cell array.
endif
% The gt() operator does not support cell array. If the single word
% is passed as an one-element cell array, converts it to a string.
b_as_string = b;
if iscellstr( b )
b_as_string = b{1}; %b was passed as a single-element cell array.
endif
% Space-pad the shortest word so as they can be used with gt() and lt() operators.
if length(a_as_string) > length( b_as_string )
b_as_string( length( b_as_string ) + 1 : length( a_as_string ) ) = " ";
elseif length(a_as_string) < length( b_as_string )
a_as_string( length( a_as_string ) + 1 : length( b_as_string ) ) = " ";
endif
letters_gt = gt(a_as_string, b_as_string); %list of boolean a > b
letters_lt = lt(a_as_string, b_as_string); %list of boolean a < b
ret = 0;
%octave makes us roll our own string compare because
%strings are arrays of numerics
len = length(letters_gt);
for i = 1:len
if letters_gt(i) > letters_lt(i)
ret = 1;
return
elseif letters_gt(i) < letters_lt(i)
ret = -1;
return
endif
end;
endfunction
