How to efficiently compute the string lengths of a cell array of strings

I have a cell array in Matlab:
strings = {'one', 'two', 'three'};
How can I efficiently calculate the length of all three strings? Right now I use a for loop:
lengths = zeros(3,1);
for i = 1:3
    lengths(i) = length(strings{i});
end
This is, however, unusably slow when you have a large number of strings (I've got 480,863 of them). Any suggestions?

You can also use:
cellfun(@length, strings)
It will not be faster, but makes the code clearer.
Regarding the slowness, you should first run the profiler to check where the bottleneck is. Only then should you optimize.
Edit: I just recalled that 'length' can be passed to cellfun as a string, a built-in option from older Matlab versions, so it might actually be faster! Try
cellfun('length',strings)
Edit (2): I have to admit that my first answer was a wild guess. Following @Rodin's comment, I decided to check out the speedup.
Here is the code of the benchmark:
First, the code that generates a lot of strings and saves to disk:
function GenerateCellStrings()
    strs = cell(1, 10000);
    for i = 1:10000
        strs{i} = GenerateRandomString();
    end
    save strs;
end

function st = GenerateRandomString()
    MAX_STR_LENGTH = 1000;
    n = randi(MAX_STR_LENGTH);
    st = char(randi([97 122], 1, n));
end
Then, the benchmark itself:
function CheckRunTime()
    load strs;
    tic;
    disp('Loop:');
    for i = 1:numel(strs)
        n = length(strs{i});
    end
    toc;

    disp('cellfun (String):');
    tic;
    cellfun('length', strs);
    toc;

    disp('cellfun (function handle):');
    tic;
    cellfun(@length, strs);
    toc;
end
And the results are:
Loop:
Elapsed time is 0.010663 seconds.
cellfun (String):
Elapsed time is 0.000313 seconds.
cellfun (function handle):
Elapsed time is 0.006280 seconds.
Wow! The 'length' syntax is about 30 times faster than the loop! I can only guess why it is so fast; perhaps cellfun recognizes 'length' specifically, or it might be JIT optimization.
Edit (3): I found out the reason for the speedup. It is indeed that cellfun recognizes 'length' specifically. Thanks to @reve_etrange for the info.

Keep an array of the lengths of said strings, and update that array when you update the strings. This will allow you O(1) time access to string lengths. Since you are updating it at the same time you generate or load strings, it shouldn't slow things down much, since integer array operations are (generally) faster than string operations.
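To illustrate the bookkeeping pattern (a minimal sketch in Python rather than Matlab, since the idea is language-agnostic; add_string is a made-up helper, not part of any answer above):
strings = []
lengths = []

def add_string(s):
    # update the parallel array at the moment the string is created/loaded,
    # so a later length query is an O(1) list lookup
    strings.append(s)
    lengths.append(len(s))

add_string("one")
add_string("three")
print(lengths)  # [3, 5]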

Related

Why is string creation so slow in Julia?

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.
This function becomes extremely slow (seconds or more) when the string grows longer than 1 million characters. The time seems to increase more than linearly, maybe quadratically. I don't understand why. Can someone explain?
This is some reproducible code:
function chop(s; nc=80)
    nr = ceil(Int64, length(s)/nc)
    l(i) = 1 + (nc*(i-1))
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows, '\n')
end
s = "A"^500000
chop(s)
It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]
Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.
I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl
Can someone help me understand why this code above is so slow nonetheless?
Weirdly, the below is much faster, just by adding s = collect(s):
function chop(s; nc=80)
    s = collect(s) # this line is new
    nr = ceil(Int64, length(s)/nc)
    l(i) = 1 + (nc*(i-1))
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows, '\n')
end
My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):
chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')
The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.
How the solution works:
findall(Regex(".{1,$nc}"), s) returns a vector of ranges, each eagerly matching up to nc characters;
next, I create SubString(s, r) for each returned range r, which avoids allocating a copy;
finally, everything is joined with '\n' as the separator.
What is wrong in the OP's solutions
First attempt:
the function name you chose, chop, is not recommended, as it shadows the Base Julia function of the same name;
length(s) is called many times and it is an expensive function; it should be called only once and its result stored in a variable;
in general, using length is incorrect, as Julia uses byte indexing, not character indexing (see here for an explanation);
String(s[l(i):r(i)]) is inefficient, as it allocates a String twice (the outer String is actually not needed).
Second attempt:
doing s = collect(s) resolves the issues of calling length many times and of incorrect byte indexing, but it is inefficient, as it unnecessarily allocates a Vector{Char}, and it also makes your code type-unstable (you assign to the variable s a value of a different type than it originally stored);
doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and then allocates a String.
What would be a fast solution
If you want something faster than regex and correct you can use this code:
function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)
    cu = codeunits(s)
    buf_sz = sz + div(sz, nc)
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1
    buf_loc = 1
    while true
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end
String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.
You could operate on bytes
function chop2(s; nc=80)
    b = transcode(UInt8, s)
    nr = ceil(Int64, length(b)/nc)
    l(i) = 1 + (nc*(i-1))
    r(i) = min(nc*i, length(b))
    dat = UInt8[]
    for i in 1:nr
        append!(dat, @view(b[l(i):r(i)]))
        i < nr && push!(dat, UInt8('\n'))
    end
    String(dat)
end
and the benchmarks (around 5000x faster):
julia> @btime chop($s);
  1.531 s (6267 allocations: 1.28 MiB)

julia> @btime chop2($s);
  334.100 μs (13 allocations: 1.57 MiB)
Notes:
this code could still be made slightly faster by pre-allocating dat, but I tried to stay similar to the original.
with Unicode characters, neither your approach nor this one will work, as you cannot cut a multi-byte Unicode character in the middle.
With the help of a colleague, we figured out the main reason that makes the provided implementation so slow.
It turns out length(::String) has O(n) time complexity in Julia, and its result is not cached. The comprehension calls length once per chunk, and each call walks the whole string, so the total work grows quadratically with the string length. See this Reddit post for a good discussion of the phenomenon.
Collecting the string into a vector resolves the bottleneck, because length of a vector is O(1) instead of O(n).
This is of course by no means the best way to solve the general problem, but it's a one line change that speeds up the code as provided.
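To make the effect concrete, here is a rough illustration in Python rather than Julia (Python's len() is O(1), so slow_len below is a stand-in for Julia's O(n) length(::String); chop_slow and the test sizes are made up for the sketch):
import time

def slow_len(s):
    # walks the whole string, like length(::String) on UTF-8 data: O(n)
    return sum(1 for _ in s)

def chop_slow(s, nc=80):
    rows = []
    for i in range(0, len(s), nc):
        # one slow_len call per chunk: ~(n/nc) calls of O(n) each, i.e. O(n^2/nc) total
        rows.append(s[i:min(i + nc, slow_len(s))])
    return "\n".join(rows)

for n in (20_000, 40_000):
    s = "A" * n
    t = time.perf_counter()
    chop_slow(s)
    print(n, round(time.perf_counter() - t, 3))  # doubling n roughly quadruples the time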
This has similar performance to the version by @PrzemyslawSzufel, but is much simpler.
function chop3(s; nc=80)
    L = length(s)
    join((@view(s[i:min(i+nc-1, L)]) for i = 1:nc:L), '\n')
end
I didn't use firstindex(s) and lastindex(s), as String indices always start at 1, so it makes no difference anyway.
@btime chop3(s) setup=(s=randstring(10^6)) # 1.625 ms (18 allocations: 1.13 MiB)
@btime chop2(s) setup=(s=randstring(10^6)) # 1.599 ms (14 allocations: 3.19 MiB)
Update: Based on suggestions by @BogumiłKamiński, when working with ASCII strings, this version with sizeof is even 60% faster.
function chop3(s; nc=80)
    L = sizeof(s)
    join((@view(s[i:min(i+nc-1, L)]) for i = 1:nc:L), '\n')
end

TypeError when applying sum to a list of strings [duplicate]

Python has a built-in function sum, which is effectively equivalent to:
import operator
from functools import reduce  # reduce is a builtin in Python 2 but lives in functools in Python 3

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, while there is no such ban for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
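(The benchmark above is Python 2, where reduce was still a builtin. A rough Python 3 equivalent would be the following; the exact timings will of course differ by machine and version:)
$ python3 -m timeit -s 'import operator; from functools import reduce; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
$ python3 -m timeit -s 'strings = ["a"]*10000' 'r = "".join(strings)'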
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
    /* reject string values for 'start' parameter */
    if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
        PyErr_SetString(PyExc_TypeError,
            "sum() can't sum strings [use ''.join(seq) instead]");
        Py_DECREF(iter);
        return NULL;
    }
    Py_INCREF(result);
}
So, that's your answer: summing strings is explicitly checked for in the code and rejected.
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
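A small timing sketch (not from the original answer; the helper timed and the exact numbers are illustrative only) that shows the quadratic growth of repeated concatenation versus the roughly linear ''.join:
import operator
import time
from functools import reduce

def timed(f, *args):
    start = time.perf_counter()
    f(*args)
    return time.perf_counter() - start

for n in (10_000, 20_000, 40_000):
    strings = ["a"] * n
    t_add = timed(reduce, operator.add, strings)  # builds a new string at every step
    t_join = timed("".join, strings)              # measures once, copies once
    print(f"n={n:6d}  reduce+add: {t_add:.4f}s  join: {t_join:.5f}s")
Doubling n roughly quadruples the reduce time, while the join time only doubles.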
The Reason Why
@dan04 has an excellent explanation of the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed in sum is that many, many people were trying to use sum for strings, while not many use sum for lists, tuples, and other types for which repeated concatenation is O(n**2). The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance and not allow strings to be used with sum.
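For illustration (a quick interactive example, not part of the original answer): sum happily concatenates lists and tuples, even though that is just as quadratic, while str alone is rejected:
>>> sum([[1], [2], [3]], [])
[1, 2, 3]
>>> sum([(1,), (2,)], ())
(1, 2)
>>> sum(["foo", "bar"], "")
Traceback (most recent call last):
  ...
TypeError: sum() can't sum strings [use ''.join(seq) instead]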
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similarly to a reduce statement, the code generated looks something like this:
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.
A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.
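A conceptual sketch of that strategy in Python (CPython's str.join is implemented in C; join_sketch and its ASCII-only assumption are made up purely to illustrate the "measure first, allocate once, then copy" idea):
def join_sketch(sep, parts):
    # Pass 1: measure everything, so the buffer can be sized exactly once.
    data = [p.encode("ascii") for p in parts]
    sep_b = sep.encode("ascii")
    total = sum(len(d) for d in data) + len(sep_b) * max(len(data) - 1, 0)
    buf = bytearray(total)  # the single allocation
    # Pass 2: copy each piece into place; no intermediate strings are built.
    pos = 0
    for i, d in enumerate(data):
        if i:
            buf[pos:pos + len(sep_b)] = sep_b
            pos += len(sep_b)
        buf[pos:pos + len(d)] = d
        pos += len(d)
    return buf.decode("ascii")

print(join_sketch(", ", ["spam", "eggs", "ham"]))  # spam, eggs, ham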
I don't know why, but this works!
import operator
from functools import reduce  # needed in Python 3, where reduce is no longer a builtin

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Why is the time complexity of this algorithm exponential?

During an interview, I was asked the time complexity of the following algorithm:
static bool SetContainsString(string searchString, HashSet<string> setOfStrings)
{
    for (int i = 0; i < searchString.Length; i++)
    {
        var segment = searchString.Substring(0, i + 1);
        if (setOfStrings.Contains(segment))
        {
            var remainingSegment = searchString.Substring(segment.Length);
            if (remainingSegment == "") return true;
            return SetContainsString(remainingSegment, setOfStrings);
        }
    }
    return false;
}
I answered "linear" because it appears to me to loop only through the length of searchString. Yes, it is recursive, but the recursive call is only on the portion of the string that has not yet been iterated over, so the end result number of iterations is the length of the string.
I was told by my interviewer that the time complexity in the worst case is exponential.
Can anyone help me clarify this? If I am wrong, I need to understand why.
I believe that your interviewer was incorrect here. Here’s how I’d argue why the time complexity isn’t exponential:
Each call to the function either makes zero or one recursive call.
Each recursive call reduces the length of the string by at least one.
This bounds the total number of recursive calls at O(n), where n is the length of the input string. Each individual recursive call does a polynomial amount of work, so the total work done is some polynomial.
I think the reason your interviewer was confused here is that the code given above - which I think is supposed to check whether a string can be decomposed into one or more words - doesn't work correctly in all cases. In particular, notice that the recursion works by always optimistically grabbing the first prefix it finds that's a word and assuming that what it grabbed is the right way to split the word apart. But imagine you have a word like "applesauce." If you pull off "a" and try to recursively form "pplesauce," you'll incorrectly report that the word isn't a compound, because there's no way to decompose "pplesauce." To fix this, you'd need to change the recursive call to something like this:
if (SetContainsString(...)) return true;
This way, if you pick the wrong split, you’ll go on to check the other possible splits you can make. That variant on the code does take exponential time in the worst case because it can potentially revisit the same substrings multiple times, and I think that’s what the interviewer incorrectly thought you were doing.
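For concreteness, here is a sketch of that corrected, backtracking version (in Python rather than the interviewer's C#; can_segment and the example inputs are made up for illustration). Without memoization its worst case is exponential:
def can_segment(s, words):
    # Corrected variant: try every matching prefix instead of committing to the first one.
    if s == "":
        return True
    for i in range(1, len(s) + 1):
        if s[:i] in words and can_segment(s[i:], words):
            return True
    return False

# Worst case: every way of splitting the run of "a"s is explored and all of them
# fail at the final "b"; the number of explored splits grows like the Fibonacci numbers.
print(can_segment("a" * 30 + "b", {"a", "aa"}))  # False, and the time grows exponentially with the run length
Memoizing on the remaining suffix brings this back down to polynomial time.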

Julia: Using split to construct string arrays with multiple columns

I have a tuple whose elements each contain two pieces of information separated by #, like x = ("aa#b", "a#c", "a#d"). I can use a comprehension to transform this data into an array in the following way: [split(x[i], "#")[j] for i in 1:length(x), j in 1:2].
However, this seems inefficient since I am effectively running the split command twice. Is there a preferred way of handling this case?
Thank you
function hashsplit(x)
    out = Array{SubString{String},2}(undef, 2, length(x))
    for (ind, j) in enumerate(x)
        out[:, ind] = split(j, "#")
    end
    return out
end
This should be faster. Otherwise, a simple way with a list comprehension would be
[(split(x[i], "#")...,) for i in eachindex(x)] (for a vector of tuples)
and cat(2, ans...) or reduce(hcat, ans) if you want a matrix.
Despite this being an old question, here is my proposed solution:
String.(reduce(hcat, split.(x, '#')))
I checked with the @time macro. Compared with the list comprehension proposed by Alexander, which takes 0.042378 seconds (90.93 k allocations: 4.330 MiB),
the time required for the same operation with the proposed solution is 0.000028 seconds (34 allocations: 1.500 KiB).

Lua: Fastest Way to Read Data

Here's my program.
local t = {}
local match = string.gmatch
local insert = table.insert
val = io.read("*a")
for num in match(val, "%d+") do
    insert(t, num)
end
I'm wondering if there is a faster way to load a large (16MB+) array of integers than this. Considering the data is composed of line after line of a single number can this be made faster? Should I be looking at io.read("*n") instead?
Given that your file size is 16MB, your loading routine's performance will be dominated by file IO. How long it takes you to process the loaded data will generally be irrelevant next to that.
Just try it; profile how long it takes to just load the file (stopping the script after io.read), then profile how long the whole script takes. The latter will be longer, but it's only going to be by some relatively small percentage, not vast amounts.
Loading the whole file at once the way you're doing will almost certainly be faster than doing it piecemeal. Filesystems like reading entire blocks of data all at once, rather than bits at a time. Beyond that, how to process the text is relatively irrelevant.
I'm not sure if it's faster, but read("*n") is much simpler...
local t = { }
while true do
    local n = io.stdin:read("*n")
    if n == nil then break end
    table.insert(t, n)
end
Probably, this would be faster:
local t = {}
local match = string.match
for line in io.lines() do
    t[#t+1] = match(line, '%d+')
end
Don't forget to convert the matched strings to numbers (e.g., with tonumber).

Resources