Readable, controllable iterators? - python-3.x

I'm trying to craft an LL(1) parser for a deterministic context-free grammar. One thing I'd like to be able to use is k tokens of lookahead instead of just 1, because it would enable much simpler, less greedy and more maintainable parsing of literal records like numbers, strings, comments and quotations.
Currently, my solution (which works but which I feel is suboptimal) looks something like (but is not exactly) the following:
for idx, tok in enumerate(toklist):
    if tok == "blah":
        do(stuff)
    elif tok == "notblah":
        try:
            toklist[idx + 1]
        except IndexError:
            whatever()
        else:
            something(else)
(You can see my actual, much larger implementation at the link above.)
Sometimes, like if the parser finds the beginning of a string or block comment, it would be nice to "jump" the iterator's current counter, such that many indices in the iterator would be skipped.
This can in theory be done with (for example) idx += idx - toklist[idx+1:].index(COMMENT); however, in practice, each time the loop repeats, idx and tok are reinitialised by the enumerate iterator's next(), overwriting any changes to the variables.
The obvious solution is a while True: or while i < len(toklist): ... i += 1, but there are a few glaring problems with those:
Using while on an iterable like a list is really C-like and really not Pythonic, besides the fact that it's horrendously unreadable and unclear compared to enumerate on the iterable. (Also, with while True:, which may sometimes be desirable, you have to deal with list index out of range.)
For each cycle of the while, there are two ways to get the current token:
using toklist[i] everywhere (ugly, when you could just iterate), or
assigning toklist[i] to a shorter, more readable, less typo-prone name each cycle; this has the disadvantage of hogging memory and being slow and inefficient.
Perhaps it can be argued that a while loop is what I should use, but I think while loops are for doing things until a condition is no longer true, and for loops are for iterating and looping finitely over an iterator, and a(n iterative LL) parser should clearly implement the latter.
Is there a clean, Pythonic, efficient way to control and change arbitrarily the iterator's current index?
This is not a dupe of this because all those answers use complicated, unreadable while loops, which is what I don't want.

Is there a clean, Pythonic, efficient way to control and change arbitrarily the iterator's current index?
No, there isn't. You could implement your own iterator type though; it wouldn't operate at the same speed (being implemented in Python), but it's doable. For example:
from collections.abc import Iterator

class SequenceIterator(Iterator):
    def __init__(self, seq):
        self.seq = seq
        self.idx = 0

    def __next__(self):
        try:
            ret = self.seq[self.idx]
        except IndexError:
            raise StopIteration
        else:
            self.idx += 1
            return ret

    def seek(self, offset):
        self.idx += offset
To use it, you'd do something like:
# Created outside the for loop so you have a name to call seek on
myseqiter = SequenceIterator(myseq)

for x in myseqiter:
    if test(x):
        ...  # do stuff with x
    else:
        # Seek somehow, e.g.:
        myseqiter.seek(1)  # skips the next value
Adding behaviors like providing the index as well as value is left as an exercise.
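For example, a minimal sketch of that exercise (a hypothetical subclass building on SequenceIterator above, returning (index, value) pairs like enumerate does):

class IndexedSequenceIterator(SequenceIterator):
    def __next__(self):
        idx = self.idx                  # remember the current position
        return idx, super().__next__()  # then yield an (index, value) pair

for idx, tok in IndexedSequenceIterator(myseq):
    print(idx, tok)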

Related

What's wrong with my python recursive code?

Sorry for my ugly English.
This is one of my homework.
I'm making a function that finds the max integer in any list, tuple, or integer,
like "max_val((5, (1,2), [[1],[2]])) returns 5".
When I ran my code, there was no syntax error, and I tried as many different cases as I could.
But the homework system told me this code was incorrect.
Anyone give me hint?
numList = []

def max_val(t):
    if type(t) is int:
        numList.append(t)
    else:
        for i in range(len(t)):
            if t[i] is int:
                numList.append(t[i])
            else:
                max_val(t[i])
    return max(numList)
Your code gives wrong results when called several times:
>>> max_val((5,4,3))
5
>>> max_val((2, 1))
5
That's because numList is a global variable that you don't "reset" between calls of your function.
You can simplify your code quite a bit, without needing that global variable:
def max_val(t):
    if isinstance(t, int):
        return t  # t is the only element, so it's by definition the biggest
    else:
        # Assuming max_val works correctly for each element of t,
        # return the largest result
        return max(max_val(element) for element in t)
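Called on the example from the question, and called more than once, this version behaves as expected:
>>> max_val((5, (1, 2), [[1], [2]]))
5
>>> max_val((2, 1))
2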
As explained in L3viathan's answer, the main issue with your code is that numList is a global variable. Here is a simple way to fix it without changing the logic of your code:
def max_val(t):
    numList = []  # local variable
    max_val_helper(t, numList)  # fill numList with elements from t
    return max(numList)

def max_val_helper(t, numList):
    # This function modifies its second argument and doesn't return a value.
    if type(t) is int:
        numList.append(t)
    else:
        for i in range(len(t)):
            max_val_helper(t[i], numList)
The function max_val_helper is recursive and appends all numbers in the nested iterables to its argument numList. This function doesn't have a return value; the effect of calling it is that it modifies its argument. This kind of function is sometimes called a "procedure".
The function max_val, on the other hand, is a "pure" function: it returns a value without any side effect, like modifying its argument or a global variable. It creates a local variable numList, and passes this local variable to max_val_helper, which fills it with the numbers from the nested iterables.
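With this fix, the repeated calls from above now give the right answers:
>>> max_val((5, 4, 3))
5
>>> max_val((2, 1))
2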
The code suggested in L3viathan's answer is arguably more elegant than this one, but I think it's important to understand why your code didn't work properly and how to fix it.
It's also good practice to differentiate between functions with side-effects (like modifying an argument, modifying a global variable, or calls to print) and functions without side-effects.
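As a tiny illustration of that distinction (hypothetical helpers, not part of the fix above):

def append_to(lst, x):
    # procedure: modifies its argument in place, returns nothing
    lst.append(x)

def appended(lst, x):
    # pure function: leaves its argument alone and returns a new list
    return lst + [x]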

TypeError when applying sum to a list of strings [duplicate]

Python has a built-in function sum, which is effectively equivalent to:
from functools import reduce  # reduce is a builtin in Python 2, in functools since Python 3
import operator

def sum2(iterable, start=0):
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, while no such ban exists for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
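(For Python 3, where reduce lives in functools, roughly the same comparison can be run with the timeit module; absolute numbers will vary by machine:)

from functools import reduce
import operator
import timeit

strings = ["a"] * 10000

# repeated concatenation allocates a new string for every partial sum
print(timeit.timeit(lambda: reduce(operator.add, strings), number=100))
# join sizes the result once up front, then copies the pieces in
print(timeit.timeit(lambda: "".join(strings), number=100))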
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway..
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */
if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
    PyErr_SetString(PyExc_TypeError,
        "sum() can't sum strings [use ''.join(seq) instead]");
    Py_DECREF(iter);
    return NULL;
}
Py_INCREF(result);
So.. that's your answer.
It's explicitly checked in the code and rejected.
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
The Reason Why
#dan04 has an excellent explanation for the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similarly to a reduce statement, the code generated looks something like
v1 = "" + "a"        # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b"        # must allocate v2 and set its size to len(v1) + len("b")
...
res = v10000 + "$"   # must allocate res and set its size to len(v10000) + len("$")
In each of these steps a new string is created, which might incur some copying overhead as the strings get longer and longer. But that's maybe not the point here. What's more important is that every new string on each line must be allocated at its specific size. (I don't know if it must allocate on every iteration of the reduce statement; there might be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse. But at several points the new string will be large enough that this won't help anymore, and Python must allocate again, which is rather expensive.)
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts. It can therefore in theory allocate only once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
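A rough sketch of that two-pass idea in pure Python (illustrative only; CPython's str.join is implemented in C, and this sketch assumes ASCII strings):

def manual_join(parts):
    total = sum(len(p) for p in parts)   # pass 1: compute the final size
    buf = bytearray(total)               # a single allocation for the result
    pos = 0
    for p in parts:                      # pass 2: copy each piece into place
        data = p.encode("ascii")
        buf[pos:pos + len(data)] = data
        pos += len(data)
    return buf.decode("ascii")

assert manual_join(["foo", "bar"]) == "foobar"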
I don't know why, but this works!
from functools import reduce  # needed in Python 3; reduce is a builtin in Python 2
import operator

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)
(It works because the type check shown above lives in sum itself; reduce performs the same repeated addition without that guard.)

Why is my merge sort algorithm not working?

I am implementing the merge sort algorithm in Python. I have previously implemented the same algorithm in C, and it works fine there, but when I implement it in Python, it outputs an unsorted array.
I've already rechecked the algorithm and code, but to my knowledge the code seems to be correct.
I think the issue is related to the scope of variables in Python, but I don't have any clue for how to solve it.
from random import shuffle

# Function to merge the arrays
def merge(a, beg, mid, end):
    i = beg
    j = mid + 1
    temp = []
    while i <= mid and j <= end:
        if a[i] < a[j]:
            temp.append(a[i])
            i += 1
        else:
            temp.append(a[j])
            j += 1
    if i > mid:
        while j <= end:
            temp.append(a[j])
            j += 1
    elif j > end:
        while i <= mid:
            temp.append(a[i])
            i += 1
    return temp

# Function to divide the arrays recursively
def merge_sort(a, beg, end):
    if beg < end:
        mid = int((beg + end) / 2)
        merge_sort(a, beg, mid)
        merge_sort(a, mid + 1, end)
        a = merge(a, beg, mid, end)
    return a

a = [i for i in range(10)]
shuffle(a)
n = len(a)
a = merge_sort(a, 0, n - 1)
print(a)
To make it work you need to change the merge_sort declaration slightly:
def merge_sort(a, beg, end):
    if beg < end:
        mid = int((beg + end) / 2)
        merge_sort(a, beg, mid)
        merge_sort(a, mid + 1, end)
        a[beg:end+1] = merge(a, beg, mid, end)  # < this line changed
    return a
Why:
temp is constructed to be no longer than end - beg + 1 elements, but a is the full initial array; if you managed to replace all of it, things would get borked quickly. Therefore we take a "slice" of a and replace only the values in that slice.
Why not:
Your a luckily was not getting replaced, because of Python's inner workings. That is a bit tricky to explain, but I'll try.
Every variable in Python is a reference. a is a reference to a list of variables a[i], which are in turn references to constant values in memory.
When you pass a to a function, it makes a new local variable a that points to the same list of variables. That means that when you reassign it with a = ..., you only change where the local a points. You can only pass changes outside either via "slices" or via a return statement.
Why "slices" work:
Slices are tricky. As I said, a points to an array of other variables (basically the a[i]), which in turn are references to constant data in memory. When you reassign a slice, Python goes through the slice element by element and changes where those individual variables point; since a inside and outside the function still point to the same old elements, the changes go through.
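A minimal demonstration of the difference (hypothetical helper names):

def rebind(a):
    a = [0, 0]     # rebinds the local name only; the caller's list is untouched

def assign_slice(a):
    a[:] = [0, 0]  # mutates the list object itself, visible to the caller

nums = [1, 2, 3]
rebind(nums)
print(nums)        # [1, 2, 3]
assign_slice(nums)
print(nums)        # [0, 0]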
Hope it makes sense.
You don't use the results of the recursive merges, so you essentially report the result of the merge of the two unsorted halves.

Python: NameError: name "string" is not defined, not via input()

The function below checks to see if the first 9 digits of string (n) equate to the 10th character (an integer from 1-9 or X for 10).
def isISBN(n):
    checkSum = 0
    for i in range(9):
        checkSum = checkSum + (eval(n[i]) * (i + 1))
    if checkSum % 11 == eval(n[9]) or (checkSum % 11 == 10 and n[9] == 'X'):
        return True
    else:
        return False
When I run the function for n='020103803X' I get an error:
NameError: name 'X' is not defined
I've searched for this problem and found that most people's issues were with input() or raw_input(), but as I am not using input(), I'm confused as to why I can't test if a character is a specific string. This is my first post as Python beginner, please tell if I'm breaking rules or what extra info I should include.
The problem is with your use of eval: eval('X') is the same as writing X (without the quotes). Python sees that as a variable reference, and you have no variable named X.
There is no reason to use eval here. What are you hoping to accomplish? Perhaps you should be checking whether the character is a digit before converting it:
if (n[9].isdigit() and checkSum % 11 == int(n[9])) or (checkSum % 11 == 10 and n[9] == 'X'): return True
You're trying to get a response from
eval('X')
This is illegal, as you have no symbol 'X' defined.
If you switch the order of your if check, you can pass legal ISBNs. However, it still fails on invalid codes with an X at the end.
def isISBN(n):
    checkSum = 0
    for i in range(9):
        checkSum = checkSum + (eval(n[i]) * (i + 1))
    if (checkSum % 11 == 10 and n[9] == 'X') or \
            checkSum % 11 == eval(n[9]):
        return True
    else:
        return False
Note also that you can short-cut that return logic by simply returning the expression value:
return (checkSum % 11 == 10 and n[9] == 'X') or \
       checkSum % 11 == eval(n[9])
eval is not the proper tool here, nor is the way you use it correct. For example, see Wikipedia, which shows its typical uses. You probably want a try/except pair:
try:
    int(n[i])
except ValueError:
    print("this character is not a digit")
A call to eval is sometimes used by inexperienced programmers for all
sorts of things. In most cases, there are alternatives which are more
flexible and do not require the speed penalty of parsing code.
For instance, eval is sometimes used for a simple mail merge facility,
as in this PHP example:
$name = 'John Doe';
$greeting = 'Hello';
$template = '"$greeting, $name! How can I help you today?"';
print eval("return $template;");
Although this works, it can cause some security problems (see §
Security risks), and will be much slower than other possible
solutions. A faster and more secure solution would be changing the
last line to echo $template; and removing the single quotes from the
previous line, or using printf.
eval is also sometimes used in applications needing to evaluate math
expressions, such as spreadsheets. This is much easier than writing an
expression parser, but finding or writing one would often be a wiser
choice. Besides the fixable security risks, using the language's
evaluation features would most likely be slower, and wouldn't be as
customizable.
Perhaps the best use of eval is in bootstrapping a new language (as
with Lisp), and in tutoring programs for languages[clarification
needed] which allow users to run their own programs in a controlled
environment.
For the purpose of expression evaluation, the major advantage of eval
over expression parsers is that, in most programming environments
where eval is supported, the expression may be arbitrarily complex,
and may include calls to functions written by the user that could not
have possibly been known in advance by the parser's creator. This
capability allows you to effectively augment the eval() engine with a
library of functions that you can enhance as needed, without having to
continually maintain an expression parser. If, however, you do not
need this ultimate level of flexibility, expression parsers are far
more efficient and lightweight.
Thanks everyone. I don't know how I didn't think of using int(). The reason I used eval() was because the past few programs I wrote required something like
x = eval(input("Input your equation: "))
Anyways the function works now.
def isISBN(n):
    checkSum = 0
    for i in range(9):
        checkSum = checkSum + (int(n[i]) * (i + 1))
    if n[9] == 'X':
        if checkSum % 11 == 10:
            return True
        else:
            return False
    elif checkSum % 11 == int(n[9]):
        return True
    else:
        return False
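With the ISBN from the question:
>>> isISBN('020103803X')
True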

Python 3.x: Test if generator has elements remaining

When I use a generator in a for loop, it seems to "know" when there are no more elements to yield. Now I have to use a generator WITHOUT a for loop, and use next() by hand to get the next element. My problem is: how do I know if there are no more elements?
I only know that next() raises an exception (StopIteration) if there is nothing left, BUT isn't an exception a little bit too "heavy" for such a simple problem? Isn't there a method like has_next() or so?
The following lines should make clear, what I mean:
#!/usr/bin/python3

# define a list of some objects
bar = ['abc', 123, None, True, 456.789]

# our primitive generator
def foo(bar):
    for b in bar:
        yield b

# iterate, using the generator above
print('--- TEST A (for loop) ---')
for baz in foo(bar):
    print(baz)
print()

# assign a new iterator to a variable
foobar = foo(bar)

print('--- TEST B (try-except) ---')
while True:
    try:
        print(foobar.__next__())
    except StopIteration:
        break
print()

# assign a new iterator to a variable
foobar = foo(bar)

# display generator members
print('--- GENERATOR MEMBERS ---')
print(', '.join(dir(foobar)))
The output is as follows:
--- TEST A (for loop) ---
abc
123
None
True
456.789
--- TEST B (try-except) ---
abc
123
None
True
456.789
--- GENERATOR MEMBERS ---
__class__, __delattr__, __doc__, __eq__, __format__, __ge__, __getattribute__, __gt__, __hash__, __init__, __iter__, __le__, __lt__, __name__, __ne__, __new__, __next__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__, close, gi_code, gi_frame, gi_running, send, throw
Thanks to everybody, and have a nice day! :)
This is a great question. I'll try to show you how we can use Python's introspective abilities and open source to get an answer. We can use the dis module to peek behind the curtain and see how the CPython interpreter implements a for loop over an iterator.
>>> def for_loop(iterable):
...     for item in iterable:
...         pass  # do nothing
...
>>> import dis
>>> dis.dis(for_loop)
  2           0 SETUP_LOOP              14 (to 17)
              3 LOAD_FAST                0 (iterable)
              6 GET_ITER
        >>    7 FOR_ITER                 6 (to 16)
             10 STORE_FAST               1 (item)
  3          13 JUMP_ABSOLUTE            7
        >>   16 POP_BLOCK
        >>   17 LOAD_CONST               0 (None)
             20 RETURN_VALUE
The juicy bit appears to be the FOR_ITER opcode. We can't dive any deeper using dis, so let's look up FOR_ITER in the CPython interpreter's source code. If you poke around, you'll find it in Python/ceval.c; you can view it here. Here's the whole thing:
TARGET(FOR_ITER)
    /* before: [iter]; after: [iter, iter()] *or* [] */
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v);
    if (x != NULL) {
        PUSH(x);
        PREDICT(STORE_FAST);
        PREDICT(UNPACK_SEQUENCE);
        DISPATCH();
    }
    if (PyErr_Occurred()) {
        if (!PyErr_ExceptionMatches(PyExc_StopIteration))
            break;
        PyErr_Clear();
    }
    /* iterator ended normally */
    x = v = POP();
    Py_DECREF(v);
    JUMPBY(oparg);
    DISPATCH();
Do you see how this works? We try to grab an item from the iterator; if we fail, we check what exception was raised. If it's StopIteration, we clear it and consider the iterator exhausted.
So how does a for loop "just know" when an iterator has been exhausted? Answer: it doesn't -- it has to try and grab an element. But why?
Part of the answer is simplicity. Part of the beauty of implementing iterators is that you only have to define one operation: grab the next element. But more importantly, it makes iterators lazy: they'll only produce the values that they absolutely have to.
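For example, a generator can describe an unbounded sequence, and values only exist once something asks for them (a small illustration, not from the answer):

def squares():
    n = 0
    while True:          # conceptually infinite
        yield n * n
        n += 1

s = squares()
print(next(s), next(s), next(s))  # 0 1 4; no further values are ever computed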
Finally, if you are really missing this feature, it's trivial to implement it yourself. Here's an example:
class LookaheadIterator:
    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.buffer = []

    def __iter__(self):
        return self

    def __next__(self):
        if self.buffer:
            return self.buffer.pop()
        else:
            return next(self.iterator)

    def has_next(self):
        if self.buffer:
            return True
        try:
            self.buffer = [next(self.iterator)]
        except StopIteration:
            return False
        else:
            return True

x = LookaheadIterator(range(2))
print(x.has_next())  # True
print(next(x))       # 0
print(x.has_next())  # True
print(next(x))       # 1
print(x.has_next())  # False
print(next(x))       # raises StopIteration
The two approaches you wrote deal with finding the end of the generator in exactly the same way. The for loop simply calls __next__() until the StopIteration exception is raised, and then it terminates.
http://docs.python.org/tutorial/classes.html#iterators
As such I don't think waiting for the StopIteration exception is a 'heavy' way to deal with the problem, it's the way that generators are designed to be used.
It is not possible to know beforehand about end-of-iterator in the general case, because arbitrary code may have to run to decide about the end. Buffering elements could help reveal things, at a cost, but this is rarely useful.
In practice the question arises when one wants to take only one or a few elements from an iterator for now, but does not want to write that ugly exception-handling code (as indicated in the question). Indeed it is non-Pythonic to put the concept "StopIteration" into normal application code. And exception handling at the Python level is rather time-consuming, particularly when it's just about taking one element.
The most Pythonic way to handle those situations is either using for .. break [.. else], like:
for x in iterator:
    do_something(x)
    break
else:
    it_was_exhausted()
or using the builtin next() function with default like
x = next(iterator, default_value)
or using iterator helpers e.g. from itertools module for rewiring things like:
max_3_elements = list(itertools.islice(iterator, 3))
Some iterators however expose a "length hint" (PEP424) :
>>> gen = iter(range(3))
>>> gen.__length_hint__()
3
>>> next(gen)
0
>>> gen.__length_hint__()
2
Note: iterator.__next__() should not be used by normal app code; that's why it was renamed from iterator.next() in Python 2. And using next() without a default is not much better ...
This may not precisely answer your question, but I found my way here looking to elegantly grab a result from a generator without having to write a try: block. A little googling later I figured this out:
def g():
    yield 5

result = next(g(), None)
Now result is either 5 or None, depending on how many times you've called next on the iterator, or depending on whether the generator function returned early instead of yielding.
I strongly prefer handling None as an output over raising for "normal" conditions, so dodging the try/catch here is a big win. If the situation calls for it, there's also an easy place to add a default other than None.
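If None could itself be a legitimate value from the generator, a private sentinel object removes the ambiguity (a small sketch along the same lines):

_sentinel = object()  # unique marker that no generator will ever yield

gen = g()
value = next(gen, _sentinel)
if value is _sentinel:
    print("generator exhausted")
else:
    print("got", value)  # got 5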
