Julia: Using split to construct string arrays with multiple columns

I have a tuple whose elements each contain two pieces of information separated by #, e.g. x = ("aa#b", "a#c", "a#d"). I can use a comprehension to transform this data into an array in the following way: [split(x[i], "#")[j] for i in 1:length(x), j in 1:2].
However, this seems inefficient since I am effectively running the split command twice. Is there a preferred way of handling this case?
Thank you

function hashsplit(x)
    out = Matrix{SubString{String}}(undef, 2, length(x))
    for (ind, j) in enumerate(x)
        out[:, ind] = split(j, "#")
    end
    return out
end
Should be faster. Otherwise, a simple way with a comprehension would be
[(split(x[i], "#")...,) for i in eachindex(x)] (for a vector of tuples)
and cat(ans...; dims=2) or reduce(hcat, ans) if you want a matrix.
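Putting those pieces together on the example data, here is a minimal sketch (using vectors rather than tuples so that reduce(hcat, ...) yields a string matrix directly; the names parts and mat are just illustrative):
x = ("aa#b", "a#c", "a#d")
parts = [split(xi, "#") for xi in x]   # one split per element, each a 2-element vector
mat = reduce(hcat, parts)              # 2×3 Matrix{SubString{String}}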

Despite this being an old question, here is my proposed solution
String.(reduce(hcat, split.(x, '#')))
I checked with the @time macro. Compared with the comprehension proposed by Alexander, which takes 0.042378 seconds (90.93 k allocations: 4.330 MiB),
the amount of time required for the same operation with the proposed solution is 0.000028 seconds (34 allocations: 1.500 KiB).
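Note that a first-call @time measurement includes compilation time; for a steadier comparison you could use BenchmarkTools (a sketch, assuming the package is installed):
using BenchmarkTools
x = ("aa#b", "a#c", "a#d")
@btime String.(reduce(hcat, split.($x, '#')));
@btime [split($x[i], "#")[j] for i in 1:length($x), j in 1:2];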

Related

Why is string creation so slow in Julia?

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.
This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?
This is some reproducible code:
function chop(s; nc=80)
    nr = ceil(Int64, length(s)/nc)
    l(i) = 1 + (nc*(i-1))
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows, '\n')
end
s = "A"^500000
chop(s)
It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]
Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.
I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl
Can someone help me understand why this code above is so slow nonetheless?
Weirdly, the below is much faster, just by adding s = collect(s):
function chop(s; nc=80)
    s = collect(s) # this line is new
    nr = ceil(Int64, length(s)/nc)
    l(i) = 1 + (nc*(i-1))
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows, '\n')
end
My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity, not speed):
chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')
The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.
How the solution works
findall(Regex(".{1,$nc}"), s) returns a vector of ranges, each eagerly matching up to nc characters;
next I create a SubString(s, r) for each returned range r, which avoids allocating new strings;
finally everything is joined with \n as the separator.
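For example, on a small ASCII input (REPL output shown as on a recent Julia version):
julia> t = "abcdefgh"; nc = 3;

julia> findall(Regex(".{1,$nc}"), t)
3-element Vector{UnitRange{Int64}}:
 1:3
 4:6
 7:8

julia> chop_and_join(t; nc=3)
"abc\ndef\ngh"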
What is wrong in the OP's solutions
First attempt:
the chosen function name chop is not recommended, as it shadows the Base Julia function of the same name;
length(s) is called many times and it is an expensive function; it should be called only once and stored in a variable;
in general, using length is incorrect, as Julia uses byte indexing, not character indexing (see here for an explanation);
String(s[l(i):r(i)]) is inefficient, as it allocates a String twice (the outer String is actually not needed).
Second attempt:
doing s = collect(s) resolves the issue of calling length many times and of incorrect byte indexing, but it is inefficient, as it unnecessarily allocates a Vector{Char}, and it also makes your code type-unstable (the variable s is assigned a value of a different type than it originally held);
doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and then allocates a String.
What would be a fast solution
If you want something faster than regex and correct you can use this code:
function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)
    cu = codeunits(s)
    buf_sz = sz + div(sz, nc)
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1
    buf_loc = 1
    while true
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end
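A quick sanity check of chop4 on small inputs (expected REPL output, assuming the definition above):
julia> chop4("abcdefgh"; nc=3)
"abc\ndef\ngh"

julia> chop4("αβγδ"; nc=3)   # multi-byte characters are not cut in the middle
"αβγ\nδ"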
String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.
You could operate on bytes
function chop2(s; nc=80)
    b = transcode(UInt8, s)
    nr = ceil(Int64, length(b)/nc)
    l(i) = 1 + (nc*(i-1))
    r(i) = min(nc*i, length(b))
    dat = UInt8[]
    for i in 1:nr
        append!(dat, @view(b[l(i):r(i)]))
        i < nr && push!(dat, UInt8('\n'))
    end
    String(dat)
end
and the benchmarks (around 5000x faster):
julia> @btime chop($s);
  1.531 s (6267 allocations: 1.28 MiB)
julia> @btime chop2($s);
  334.100 μs (13 allocations: 1.57 MiB)
Notes:
this code could still be made slightly faster by pre-allocating dat, but I tried to stay similar to the original;
with Unicode characters, neither your approach nor this one will work, as you cannot cut a Unicode character in the middle.
With the help of a colleague we figured out the main reason the provided implementation is so slow.
It turns out length(::String) has time complexity O(n) in Julia, and the result is not cached, so the longer the string, the more calls to length, each of which itself takes longer the longer the input. See this Reddit post for a good discussion of the phenomenon.
Collecting the string into a vector resolves the bottleneck, because the length of a vector is O(1) instead of O(n).
This is of course by no means the best way to solve the general problem, but it is a one-line change that speeds up the code as provided.
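A minimal sketch of the difference (the timings themselves are machine-dependent and only illustrative):
big = "A"^10_000_000
vec = collect(big)
@time length(big)   # scans the string's bytes: O(n)
@time length(vec)   # reads a stored length field: O(1)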
This has similar performance to the version by @PrzemyslawSzufel, but is much simpler.
function chop3(s; nc=80)
    L = length(s)
    join((@view s[i:min(i+nc-1,L)] for i=1:nc:L), '\n')
end
I didn't choose firstindex(s), lastindex(s) as strings may not have arbitrary indices, but it makes no difference anyway.
@btime chop3(s) setup=(s=randstring(10^6))  # 1.625 ms (18 allocations: 1.13 MiB)
@btime chop2(s) setup=(s=randstring(10^6))  # 1.599 ms (14 allocations: 3.19 MiB)
Update: Based on suggestions by @BogumiłKamiński, working with ASCII strings, this version with sizeof is even 60% faster.
function chop3(s; nc=80)
    L = sizeof(s)
    join((@view s[i:min(i+nc-1,L)] for i=1:nc:L), '\n')
end

TypeError when applying sum to a list of strings [duplicate]

Python has a built-in function sum, which is effectively equivalent to:
import operator
from functools import reduce  # reduce is a builtin in Python 2; the import is needed in Python 3
def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) == sum2([1,2,3], 0) == 6    # Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) == sum2({888:1}, 0) == 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, while there is no such ban for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway.
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
    /* reject string values for 'start' parameter */
    if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
        PyErr_SetString(PyExc_TypeError,
            "sum() can't sum strings [use ''.join(seq) instead]");
        Py_DECREF(iter);
        return NULL;
    }
    Py_INCREF(result);
}
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
@dan04 has an excellent explanation of the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, while far fewer use sum for lists and tuples and other containers where repeated concatenation is likewise O(n**2). The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which adds some copying overhead as the strings get longer and longer. But that is perhaps not the main point here. What matters more is that every new string on each line must be allocated at its specific size (I don't know whether it must allocate in every iteration of the reduce statement; there might be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse, but at several points the new string will be large enough that this won't help anymore and Python must allocate again, which is rather expensive).
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts, and would therefore in theory only allocate once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
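A rough Python sketch of that two-pass idea (not CPython's actual implementation of str.join, just an illustration of sizing once and then filling):
def join_once(parts):
    # First pass: compute the total size so only one buffer allocation is needed.
    encoded = [p.encode("utf-8") for p in parts]
    buf = bytearray(sum(len(b) for b in encoded))
    # Second pass: copy each piece into the preallocated buffer.
    pos = 0
    for b in encoded:
        buf[pos:pos + len(b)] = b
        pos += len(b)
    return buf.decode("utf-8")
print(join_once(["foo", "bar", "baz"]))  # foobarbaz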
I don't know why, but this works!
import operator
from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2
def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Finding item at index i of an iterable (syntactic sugar)?

I know that map, range, filter, etc. in Python 3 return lazy iterables, and only compute values when required. Suppose that there is a map M. I want to print the i^th element of M.
One way would be to iterate till i^th value, and print it:
for _ in range(i):
    next(M)
print(next(M))
The above takes O(i) time, where I have to find the i^th value.
Another way is to convert to a list, and print the i^th value:
print(list(M)[i])
This however, takes O(n) time and O(n) space (where n is the size of the list from which the map M is created). However, this suits the so-called "Pythonic way of writing one-liners."
I was wondering if there is a syntactic sugar to minimise writing in the first way? (i.e., if there is a way which takes O(i) time, no extra space, and is more suited to the "Pythonic way of writing".)
You can use islice:
from itertools import islice
i = 3
print(next(islice(iterable, i, i + 1)))
This prints the element at index 3 of the iterable.
It actually doesn't matter what you use as the stop argument, as long as you call next once.
Thanks to @DeepSpace for the reference to the official docs, I found the following:
from more_itertools import nth
print(nth(M, i))
It prints the element at i^th index of the iterable.
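For reference, nth is essentially the classic itertools recipe, roughly equivalent to:
from itertools import islice
def nth(iterable, n, default=None):
    # Return the item at index n, or default if the iterable is too short.
    return next(islice(iterable, n, None), default)
print(nth(map(lambda v: v * v, range(10)), 3))  # 9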

Generate ngrams with Julia

To generate word bigrams in Julia, I could simply zip through the original list and a list that drops the first element, e.g.:
julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
"the"
"lazy"
"fox"
"jumps"
"over"
"the"
"brown"
"dog"
julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
To generate a trigram I could use the same collect(zip(...)) idiom to get:
julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox")
("lazy","fox","jumps")
("fox","jumps","over")
("jumps","over","the")
("over","the","brown")
("the","brown","dog")
But I have to manually add in the 3rd list to zip through. Is there an idiomatic way to do this for any order of n-gram?
e.g. I'd like to avoid doing this to extract 5-grams:
julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
By changing the output slightly and using SubArrays instead of Tuples, little is lost, but it is possible to avoid allocations and memory copying. If the underlying word list is static, this is OK and faster (in my benchmarks too). The code:
ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]
and the output:
julia> ngram(s,5)
SubString{String}["the","lazy","fox","jumps","over"]
SubString{String}["lazy","fox","jumps","over","the"]
SubString{String}["fox","jumps","over","the","brown"]
SubString{String}["jumps","over","the","brown","dog"]
julia> ngram(s,5)[1][3]
"fox"
For larger word lists the memory requirements are substantially smaller also.
Also note that using a generator allows processing the n-grams one by one, faster and with less memory, which might be enough for the desired processing (counting something or passing through some hash). For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1).
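For instance, a lazy bigram count could look like this (a sketch assuming IterTools.jl, whose partition accepts a step argument and yields tuples):
using IterTools
s = split("the lazy fox jumps over the brown dog")
counts = Dict{NTuple{2,eltype(s)},Int}()
for g in partition(s, 2, 1)          # n-grams are produced lazily, one at a time
    counts[g] = get(counts, g, 0) + 1
end
counts   # e.g. ("the", "lazy") => 1, ("lazy", "fox") => 1, ...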
Here's a clean one-liner for n-grams of any length.
ngram(s, n) = collect(zip((drop(s, k) for k = 0:n-1)...))
It uses a generator comprehension to iterate over the number of elements, k, to drop. Then, using the splat (...) operator, it unpacks the Drops into zip, and finally collects the Zip into an Array.
julia> ngram(s, 2)
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")
julia> ngram(s, 5)
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")
As you can see, this is very similar to your solution - only a simple comprehension was added to iterate over the number of elements to drop, so that the length could be dynamic.
Another way is to use Iterators.jl's partition():
ngram(s,n) = collect(partition(s, n, 1))

How to compute effectively string length of cell array of strings

I have a cell array in Matlab:
strings = {'one', 'two', 'three'};
How can I efficiently calculate the length of all three strings? Right now I use a for loop:
lengths = zeros(3,1);
for i = 1:3
    lengths(i) = length(strings{i});
end
This is, however, unusably slow when you have a large number of strings (I've got 480,863 of them). Any suggestions?
You can also use:
cellfun(@length, strings)
It will not be faster, but makes the code clearer.
Regarding the slowness, you should first run the profiler to check where the bottleneck is. Only then should you optimize.
Edit: I just recalled that 'length' used to be a built-in function in cellfun in older Matlab versions. So it might actually be faster! Try
cellfun('length',strings)
Edit(2): I have to admit that my first answer was a wild guess. Following @Rodin's comment, I decided to check out the speedup.
Here is the code of the benchmark:
First, the code that generates a lot of strings and saves to disk:
function GenerateCellStrings()
    strs = cell(1,10000);
    for i=1:10000
        strs{i} = GenerateRandomString();
    end
    save strs;
end
function st = GenerateRandomString()
    MAX_STR_LENGTH = 1000;
    n = randi(MAX_STR_LENGTH);
    st = char(randi([97 122], 1, n));
end
Then, the benchmark itself:
function CheckRunTime()
    load strs;
    tic;
    disp('Loop:');
    for i=1:numel(strs)
        n = length(strs{i});
    end
    toc;
    disp('cellfun (String):');
    tic;
    cellfun('length',strs);
    toc;
    disp('cellfun (function handle):');
    tic;
    cellfun(@length,strs);
    toc;
end
And the results are:
Loop:
Elapsed time is 0.010663 seconds.
cellfun (String):
Elapsed time is 0.000313 seconds.
cellfun (function handle):
Elapsed time is 0.006280 seconds.
Wow!! The 'length' syntax is about 30 times faster than a loop! I can only guess why it becomes so fast. Maybe the fact that it recognizes length specifically. Might be JIT optimization.
Edit(3) - I found out the reason for the speedup. It is indeed recognition of length specifically. Thanks to @reve_etrange for the info.
Keep an array of the lengths of said strings, and update that array when you update the strings. This will allow you O(1) time access to string lengths. Since you are updating it at the same time you generate or load strings, it shouldn't slow things down much, since integer array operations are (generally) faster than string operations.
