StopIteration in generators - python-3.x

I'm learning Python's generators, iterators, and iterables, and I can't explain why the following is not working. I want to create, as an exercise, a simple version of the function zip. Here's what I did:
def myzip(*collections):
    iterables = tuple(iter(collection) for collection in collections)
    yield tuple(next(iterable) for iterable in iterables)

test = myzip([1,2,3],(4,5,6),{7,8,9})
print(next(test))
print(next(test))
print(next(test))
What I do is:
I have collections which is a tuple of some collections
I create a new tuple iterables where, for each collection (which is iterable), I get the iterator using iter
Then, I create a new tuple where, on each iterable, I call next. This tuple is then yielded.
So I expect that on the first execution the object iterables is created (and stored). Then, on each iteration (including the first one), next is called on every iterable stored earlier and the resulting tuple is returned.
However this is what I get:
(1, 4, 8)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-108-424963a58e58> in <module>()
8
9 print(next(test))
---> 10 print(next(test))
StopIteration:
So I see that the first iteration is fine and the result is correct. However, the second iteration raises a StopIteration exception and I don't understand why: each iterable still has some values left, so none of the next calls should raise StopIteration. In fact, this works:
def myziptest(*collections):
    iterables = tuple(iter(collection) for collection in collections)
    for _ in range(3):
        print(tuple(next(iterable) for iterable in iterables))

test = myziptest([1,2,3],(4,5,6),{7,8,9})
Output:
(1, 4, 8)
(2, 5, 9)
(3, 6, 7)
So what is going on?
Thanks a lot

Here's a working solution
def myzip(*collections):
    iterables = tuple(iter(collection) for collection in collections)
    while True:
        try:
            yield tuple([next(iterable) for iterable in iterables])
        except StopIteration:
            # one of the iterables has no more left.
            break

test = myzip([1,2,3],(4,5,6),{7,8,9})
print(next(test))
print(next(test))
print(next(test))
The difference between this code and yours is that your code only yields one result. That means calling next more than once will give you a StopIteration.
Think of yield x as putting x into a queue, and next as popping from that queue. When you try to pop from an empty queue, you get StopIteration. You can pop only as many items as you put in.
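For illustration (this toy function is not from the original code), a generator whose body contains a single yield and no loop produces exactly one value and is then exhausted:

def one_shot():
    yield "only value"   # the function body ends right after this line

g = one_shot()
print(next(g))   # "only value"
print(next(g))   # raises StopIteration: the generator body has finished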

Related

cython implementation of groupby failing with NameError

I am attempting to speed up dozens of calls I make to pandas groupby using Cython-optimised functions. These include a straight groupby, a groupby with ranking, and others. One of them does a groupby that compiles in my notebook, but when it is called I get a NameError.
Here is the test code from my notebook (spread over three cells there):
%%cython
def _set_indices(keys_as_int, n_keys):
    import numpy
    cdef int i, j, k
    cdef object[:, :] indices = [[i for i in range(0)] for _ in range(n_keys)]
    for j, k in enumerate(keys_as_int):
        indices[k].append(j)
    return [([numpy.array(elt) for elt in indices])]

def group_by(keys):
    _, first_occurrences, keys_as_int = np.unique(keys, return_index=True, return_inverse=True)
    n_keys = max(keys_as_int) + 1
    indices = [[] for _ in range(max(keys_as_int) + 1)]
    print(str(keys_as_int) + str(n_keys) + str(indices))
    indices = _set_indices(keys_as_int, n_keys)
    return indices

%%timeit
result = group_by(['1', '2', '3', '1', '3'])
print(str(result))
The error I get is:
<ipython-input-20-3f8635aec47f> in group_by(keys)
4 indices = [[] for _ in range(max(keys_as_int) + 1)]
5 print(str(keys_as_int) + str(n_keys) + str(indices))
----> 6 indices = _set_indices(keys_as_int, n_keys)
7 return indices
NameError: name '_set_indices' is not defined
Can someone explain whether this is due to the notebook, or whether I have done something wrong with the way Cython is used? I am new to it.
Also, any hints towards a strongly typed, cache-friendly solution are most welcome.
You need to put your _set_indices function in the same cell, or you need to explicitly import it. From the Compiling with a Jupyter Notebook documentation:
Note that each cell will be compiled into a separate extension module.
After compilation, you do have a global name _set_indices, but that doesn't make it available as a global in the separate extension module for the group_by() function.
You'll need to put the two function definitions into the same cell, or create a separate module for the utility functions.
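As a rough sketch of the single-cell layout (reusing the question's names; the typed memory view is dropped here because of the separate issue described below):

%%cython
import numpy as np

def _set_indices(keys_as_int, n_keys):
    # plain Python lists of positions, converted to arrays at the end
    indices = [[] for _ in range(n_keys)]
    for j, k in enumerate(keys_as_int):
        indices[k].append(j)
    return [np.array(elt) for elt in indices]

def group_by(keys):
    _, first_occurrences, keys_as_int = np.unique(keys, return_index=True, return_inverse=True)
    n_keys = max(keys_as_int) + 1
    # _set_indices is now defined in the same extension module
    return _set_indices(keys_as_int, n_keys)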
Note that there is also another issue with the code; you can't just create a typed memory view from a list of integers:
Traceback (most recent call last):
File "so58378716.pyx", line 22, in init so58378716
result = group_by(['1', '2', '3', '1', '3'])
File "so58378716.pyx", line 19, in so58378716.group_by
indices = _set_indices(keys_as_int, n_keys)
File "so58378716.pyx", line 6, in so58378716._set_indices
cdef object[:, :] indices = [[i for i in range(0)] for _ in range(n_keys)]
File "stringsource", line 654, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'list'
You'd have to create an actual numpy array, or use a cython.view.array object, or an array.array.
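For illustration only (the function name and shape below are made up), a typed memory view has to wrap something that actually owns a buffer, such as a freshly allocated NumPy array:

%%cython
import numpy as np

def make_indices(int n_keys, int max_len):
    # allocate a real 2-D array first; the memory view just wraps its buffer
    cdef int[:, :] indices = np.zeros((n_keys, max_len), dtype=np.intc)
    indices[0, 0] = 42
    return np.asarray(indices)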

Why reduce function asking for arguments

I have written these lines of code with the reduce built-in function, but it shows an error for the given arguments.
Error:
TypeError Traceback (most recent call last)
in
4
5 lst = [1,2,3]
----> 6 reduce(d_n, lst)
TypeError: d_n() takes 1 positional argument but 2 were given
from functools import reduce

def d_n(digit):
    return(digit)

lst = [1,2,3]
reduce(d_n, lst)
reduce(...)
reduce(function, sequence[, initial]) -> value
Apply a function of two arguments cumulatively to the items of a sequence,
from left to right, so as to reduce the sequence to a single value.
For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5). If initial is present, it is placed before the items
of the sequence in the calculation, and serves as a default when the
sequence is empty.
Key point: a function of two arguments.
Your d_n() function takes only one argument, which makes it incompatible with reduce.
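A minimal fix is to give the function two parameters, for example (the names here are illustrative):

from functools import reduce

def add_digits(accumulated, digit):
    # reduce passes the running result and the next item on every call
    return accumulated + digit

lst = [1, 2, 3]
print(reduce(add_digits, lst))   # ((1 + 2) + 3) -> 6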

Perform a frequency distribution count on a generator, and return values that are greater than n

Is there a way to perform a count on a generator object that is pointing to a list of lists? If so, can I make the count operation output a generator object (of counted items) from the previous generator object? I then would like to get a frequency count. I am using generators to conserve memory and prevent crashes. My real data set/list is enormous!
I have a generator object, test2, created from a list of lists. I'll just show you what the underlying list looks like (this is what you would see if the generator's contents were printed out):
In [1]: ll = [(('color'), ('blue')), (('food'), ('grapes')), (('color'), ('blue'))]
# create generator object 'test2'
In [2]: test2 = (each for each in ll)
# create a generator object with counted items
In [3]: count = (test2.count((i), i) for i in test2)
# list count
In [4]: list(count)
This creates the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-72-83b1c94e3edd> in <module>()
----> 1 list(count)
<ipython-input-70-829ea68a1314> in <genexpr>(.0)
----> 1 count = (test2.count((i), i) for i in test2)
AttributeError: 'generator' object has no attribute 'count'
So I am stuck here. If I can resolve this, I can move onto getting a frequency count (in the form of a generator object) which would look something like:
[(2, ('color', 'blue')), (1, ('food', 'grapes')), (2, ('color', 'blue'))]
Then I would only want to save items with counts greater than 2, for visual analysis.
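One way to get such a frequency count without materialising the input as a list is collections.Counter, which consumes the generator item by item and only stores the distinct keys; this is a sketch, and the threshold n below is illustrative:

from collections import Counter

ll = [('color', 'blue'), ('food', 'grapes'), ('color', 'blue')]
test2 = (each for each in ll)

counts = Counter(test2)                      # consumes the generator item by item
n = 1                                        # illustrative threshold
frequent = ((c, item) for item, c in counts.items() if c > n)
print(list(frequent))                        # [(2, ('color', 'blue'))]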

How to make a tuple including a numpy array hashable?

One way to make a numpy array hashable is setting it to read-only. This has worked for me in the past. But when I use such a numpy array in a tuple, the whole tuple is no longer hashable, which I do not understand. Here is the sample code I put together to illustrate the problem:
import numpy as np
npArray = np.ones((1,1))
npArray.flags.writeable = False
print(npArray.flags.writeable)
keySet = (0, npArray)
print(keySet[1].flags.writeable)
myDict = {keySet : 1}
First I create a simple numpy array and set it to read-only. Then I add it to a tuple and check if it is still read-only (which it is).
When I want to use the tuple as key in a dictionary, I get the error TypeError: unhashable type: 'numpy.ndarray'.
Here is the output of my sample code:
False
False
Traceback (most recent call last):
File "test.py", line 10, in <module>
myDict = {keySet : 1}
TypeError: unhashable type: 'numpy.ndarray'
What can I do to make my tuple hashable and why does Python show this behavior in the first place?
You claim that
One way to make a numpy array hashable is setting it to read-only
but that's not actually true. Setting an array to read-only just makes it read-only. It doesn't make the array hashable, for multiple reasons.
The first reason is that an array with the writeable flag set to False is still mutable. First, you can always set writeable=True again and resume writing to it, or do more exotic things like reassign its shape even while writeable is False. Second, even without touching the array itself, you could mutate its data through another view that has writeable=True.
>>> x = numpy.arange(5)
>>> y = x[:]
>>> x.flags.writeable = False
>>> x
array([0, 1, 2, 3, 4])
>>> y[0] = 5
>>> x
array([5, 1, 2, 3, 4])
Second, for hashability to be meaningful, objects must first be equatable - == must return a boolean, and must be an equivalence relation. NumPy arrays don't do that. The purpose of hash values is to quickly locate equal objects, but when your objects don't even have a built-in notion of equality, there's not much point to providing hashes.
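For example, comparing two arrays does not even produce a single boolean:

>>> numpy.arange(3) == numpy.arange(3)
array([ True,  True,  True])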
You're not going to get hashable tuples with arrays inside. You're not even going to get hashable arrays. The closest you can get is to put some other representation of the array's data in the tuple.
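As a sketch of that last idea, a bytes snapshot of the array's data (plus its shape, which tobytes() does not record) is itself hashable and can go in the tuple:

>>> npArray = numpy.ones((1, 1))
>>> keySet = (0, npArray.shape, npArray.tobytes())
>>> myDict = {keySet: 1}
>>> myDict[keySet]
1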
The fastest way to hash a numpy array is likely tostring.
In [11]: %timeit hash(y.tostring())
What you could do, rather than use a tuple, is define a class:
class KeySet(object):
    def __init__(self, i, arr):
        self.i = i
        self.arr = arr

    def __hash__(self):
        return hash((self.i, hash(self.arr.tostring())))
Now you can use it in a dict:
In [21]: ks = KeySet(0, npArray)
In [22]: myDict = {ks: 1}
In [23]: myDict[ks]
Out[23]: 1

Difference between map and list iterators in python3

I ran into unexpected behaviour when working with map and list iterators in python3. In this MWE I first generate a map of maps. Then, I want the first element of each map in one list, and the remaining parts in the original map:
# s will be a map of maps
s=[[1,2,3],[4,5,6]]
s=map(lambda l: map(lambda t:t,l),s)
# uncomment to obtain desired output
# s = list(s) # s is now a list of maps
s1 = map(next,s)
print(list(s1))
print(list(map(list,s)))
Running the MWE as is in Python 3.4.2 yields the expected output for s1, namely [1, 4], but the empty list [] for s. Uncommenting the marked line yields the correct output: s1 as above, and now the expected output for s as well, [[2, 3], [5, 6]].
The docs say that map expects an iterable. Until now, I had seen no difference between map objects and list iterators. Could someone explain this behaviour?
PS: Curiously enough, if I uncomment the first print statement, the initial state of s is printed. So it could also be that this behaviour has something to do with a kind of lazy(?) evaluation of maps?
A map() is an iterator; you can only iterate over it once. You could get individual elements with next() for example, but once you run out of items you cannot get any more values.
I've given your objects a few easier-to-remember names:
>>> s = [[1, 2, 3], [4, 5, 6]]
>>> map_of_maps = map(lambda l: map(lambda t: t, l), s)
>>> first_elements = map(next, map_of_maps)
Iterating over first_elements here will in turn iterate over map_of_maps. You can only do so once, so once we run out of elements any further iteration will fail:
>>> next(first_elements)
1
>>> next(first_elements)
4
>>> next(first_elements)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
list() does exactly the same thing; it takes an iterable argument, and will iterate over that object to create a new list object from the results. But if you give it a map() that is already exhausted, there is nothing to copy into the new list anymore. As such, you get an empty result:
>>> list(first_elements)
[]
You need to recreate the map() from scratch:
>>> map_of_maps = map(lambda l: map(lambda t: t, l), s)
>>> first_elements = map(next, map_of_maps)
>>> list(first_elements)
[1, 4]
>>> list(first_elements)
[]
Note that a second list() call on the map() object resulted in an empty list object, once again.
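If the data needs to be walked more than once, the outer map can be materialised first, which is what the commented-out line in the question does; a short sketch:

s = [[1, 2, 3], [4, 5, 6]]
list_of_maps = list(map(lambda l: map(lambda t: t, l), s))   # a list can be iterated repeatedly
first_elements = list(map(next, list_of_maps))               # [1, 4]; each inner map has advanced once
rest = [list(m) for m in list_of_maps]                       # [[2, 3], [5, 6]]
print(first_elements, rest)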
