Here's my program.
local t = {}
local match = string.gmatch
local insert = table.insert
val = io.read("*a")
for num in match(val, "%d+") do
insert(t, num)
end
I'm wondering if there is a faster way to load a large (16MB+) array of integers than this. Considering the data is composed of line after line of a single number can this be made faster? Should I be looking at io.read("*n") instead?
Given that your file size is 16MB, your loading routine's performance will be dominated by file IO. How long it takes you to process the loaded data will generally be irrelevant next to that.
Just try it; profile how long it takes to just load the file (stopping the script after io.read), then profile how long the whole script takes. The latter will be longer, but it's only going to be by some relatively small percentage, not vast amounts.
Loading the whole file at once the way you're doing will almost certainly be faster than doing it piecemeal. Filesystems like reading entire blocks of data all at once, rather than bits at a time. Beyond that, how to process the text is relatively irrelevant.
I'm not sure if its faster, but read("*n") is much simpler...
local t = { }
while true do
local n = io.stdin:read("*n")
if n == nil then break end
table.insert ( t , n )
end
Probably, this would be faster:
local t = {}
local match = string.match
for line in io.lines() do
t[#t+1] = match(line, '%d+')
end
Don't forget to convert strings to numbers.
Related
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Python has a built in function sum, which is effectively equivalent to:
def sum2(iterable, start=0):
return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since the sum was added as a built-in function (rev. 32347)
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
... def __add__(self, other):
... return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */
if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
PyErr_SetString(PyExc_TypeError,
"sum() can't sum strings [use ''.join(seq) instead]");
Py_DECREF(iter);
return NULL;
}
Py_INCREF(result);
}
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a
sequence of strings is by calling
''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
#dan04 has an excellent explanation for the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, its a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.
A dedicated method like join, however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.
I dont know why, but this works!
import operator
def sum_of_strings(list_of_strings):
return reduce(operator.add, list_of_strings)
import itertools
printable = 'abcdefghijklmnopqrstuvwxz'
all_possibilites = ([''.join(i) for i in itertools.product(printable, repeat = 3)])
comparison = ['zd']
if comparison in all_possibilities:
print("match")
This is a snippet of my code. my intention is to generate every single combination of the alphabet. The snippet here has a limit of three characters. With the limit too large python returns memory error. My question is:
Is there a way to remove from memory the combinations that did not match in order for the only limitation to be time, instead of memory? say if the character limit was 5? Any further reading on this would be helpful too.
The main fault is that you're first creating the full list and then trying to filter it out. That's an issue because you bring the full list in memory.
You'd be better adding your condition inside the list-comprehension, keeping any elements you'll actually need:
all_posibilities = ["".join(i) for i in itertools.product(printable, repeat = 5) if 'a' or 'b' in i]
## ^^ place your condition here
print(len(all_posibilities))
Alternatively, if you're looking to iterate through all_posibilities in the end, it makes sense to create a generator to further limit the memory footprint:
all_posibilities = ("".join(i) for i in itertools.product(printable, repeat = 5) if 'a' or 'b' in i)
for i in all_posibilities:
# do things
I am trying to write a program to analyze data from a simulation. Since the simulation software I am using is what is running the Lua program, I am not sure if this is the right place to ask this question, but I am probably making a programming error.
I am struggling with the difference between using the simple and complete I/O models. I have a block of code, which works, and looks like this:
io.output([[filename_and_location]])
function segment.other_actions
if ion_splat ~= 0 then io.write(ion_px_mm, "\n") end
io.close()
end
Note: ion_splat and ion_px_mm are pre-determined variables that take on number values. This code is run over and over again throughout the simulation.
Then I decided to try achieving the same thing using the complete I/O model like this:
f = io.open([[file_name_and_location]],"w")
function segment.other_actions ()
if ion_splat ~= 0 then f:write(ion_py_mm, "\n") end
f:close()
end
end
This runs, but takes a lot longer than the other way. Why is that?
Example 1:
for i = 1, 1000 do
io.output("test.txt")
io.write("some data to be written\n")
io.close()
end
Example 2:
for i = 1, 1000 do
local f = io.open("test.txt", "w")
f:write("some data to be written\n")
f:close()
end
There is no measurable difference in the execution time.
The latter approach is usually preferable because the used file is identified explicitly.
I have code that is structurally similar to the following in Matlab:
bestConfiguration = 0;
bestConfAwesomeness = 0;
for i=1:X
% note that providing bestConfAwesomeness to the function helps it stop if it sees the current configuration is getting hopeless anyway
[configuration, awesomeness] = expensive_function(i, bestConfAwesomeness);
if awesomeness > bestConfAwesomeness
bestConfAwesomeness = awesomeness;
bestConfiguration = configuration;
end
end
There is a bit more to it but the basic structure is the above. X can get very large. I am trying to make this code run in parallel, since expensive_function() takes a long time to run.
The problem is that Matlab won't let me just change for to parfor because it doesn't like that I'm updating the best configuration in the loop.
So far what I've done is:
[allConfigurations, allAwesomeness] = deal(cell(1, X));
parfor i=1:X
% note that this is not ideal because I am forced to use 0 as the best awesomeness in all cases
[allConfigurations{i}, allAwesomeness{i}] = expensive_function(i, 0);
end
for i=1:X
configuration = allConfigurations{i};
awesomeness = allAwesomeness{i};
if awesomeness > bestConfAwesomeness
bestConfAwesomeness = awesomeness;
bestConfiguration = configuration;
end
endfor
This is better in terms of time it takes to run; however, for large inputs it takes huge amounts of memory because all the configurations are always saved. Another problem is that using parfor forces me to always provide 0 as the best configuration even though better ones might be known.
Does Matlab provide a better way of doing this?
Basically, if I didn't have to use Matlab and could manage the threads myself, I'd have one central thread which gives jobs to workers (i.e. make them run expensive_function(i)) and once a worker returns, look at the data it produced and compare it to the best found so far and update it accordingly. There would be no need to save all the configurations which seems to be the only way to make parfor work.
Is there a way to do the above in Matlab?
Using the bestConfAwesomeness each time round the loop means that the iterations of your loop are not order-independent, hence why PARFOR is unhappy. One approach you could take is to use SPMD and have each worker perform expensiveFunction in parallel, and then communicate to update bestConfAwesomeness. Something like this:
bestConfiguration = 0;
bestConfAwesomeness = 0;
spmd
for idx = 1:ceil(X/numlabs)
myIdx = labindex + ((idx-1) * numlabs);
% should really guard against myIdx > X here.
[thisConf, thisAwesome] = expensiveFunction(myIdx, bestConfAwesomeness);
% Now, we must communicate to see if who is best
[bestConfiguration, bestAwesomeness] = reduceAwesomeness(...
bestConfiguration, bestConfAwesomeness, thisConf, thisAwesome);
end
end
function [bestConf, bestConfAwesome] = reduceAwesomeness(...
bestConf, bestConfAwesome, thisConf, thisAwesome)
% slightly lazy way of doing this, could be optimized
% but probably not worth it if conf & awesome both scalars.
allConfs = gcat(bestConf);
allAwesome = gcat(thisAwesome);
[maxThisTime, maxLoc] = max(allAwesome);
if maxThisTime > bestConfAwesome
bestConfAwesome = maxThisTime;
bestConf = allConfs(maxLoc);
end
end
I'm not sure that the kind of control over your threads is possible with Matlab. However, since X is very large, it may be worth doing the following, which costs you one more iteration of expensiveFunction:
%# calculate awesomeness
parfor i=1:X
[~,awesomeness(i)] = expensiveFunction(i);
end
%# find the most awesome i
[mostAwesome,mostAwesomeIdx] = min(awesomeness);
%# get the corresponding configuration
bestConfiguration = expensiveFunction(mostAwesomeIdx);
I have a cell array in Matlab:
strings = {'one', 'two', 'three'};
How can I efficiently calculate the length of all three strings? Right now I use a for loop:
lengths = zeros(3,1);
for i = 1:3
lengths(i) = length(strings{i});
end
This is however unusable slow when you have a large amount of strings (I've got 480,863 of them). Any suggestions?
You can also use:
cellfun(#length, strings)
It will not be faster, but makes the code clearer.
Regarding the slowness, you should first run the profiler to check where the bottleneck is. Only then should you optimize.
Edit: I just recalled that 'length' used to be a built-in function in cellfun in older Matlab versions. So it might actually be faster! Try
cellfun('length',strings)
Edit(2) : I have to admit that my first answer was a wild guess. Following #Rodin s comment, I decided to check out the speedup.
Here is the code of the benchmark:
First, the code that generates a lot of strings and saves to disk:
function GenerateCellStrings()
strs = cell(1,10000);
for i=1:10000
strs{i} = GenerateRandomString();
end
save strs;
end
function st = GenerateRandomString()
MAX_STR_LENGTH = 1000;
n = randi(MAX_STR_LENGTH);
st = char(randi([97 122], 1,n ));
end
Then, the benchmark itself:
function CheckRunTime()
load strs;
tic;
disp('Loop:');
for i=1:numel(strs)
n = length(strs{i});
end
toc;
disp('cellfun (String):');
tic;
cellfun('length',strs);
toc;
disp('cellfun (function handle):');
tic;
cellfun(#length,strs);
toc;
end
And the results are:
Loop:
Elapsed time is 0.010663 seconds.
cellfun (String):
Elapsed time is 0.000313 seconds.
cellfun (function handle):
Elapsed time is 0.006280 seconds.
Wow!! The 'length' syntax is about 30 times faster than a loop! I can only guess why it becomes so fast. Maybe the fact that it recognizes length specifically. Might be JIT optimization.
Edit(3) - I found out the reason for the speedup. It is indeed recognition of length specifically. Thanks to #reve_etrange for the info.
Keep an array of the lengths of said strings, and update that array when you update the strings. This will allow you O(1) time access to string lengths. Since you are updating it at the same time you generate or load strings, it shouldn't slow things down much, since integer array operations are (generally) faster than string operations.