Why is chaining iterables this complicated? Simplify this code - python-3.x

I want to chain multiple iterables, everything with lazy evaluation (speed is crucial), to do the following:
read many integers from a single huge line of stdin
split() that line
convert the resulting strings to int
compute the diff between successive ints
... and some further things not shown here
The real example is more complex, here's a simplified example:
Here's a sample line of stdin:
2 13 4 16 16 15 22 17 8 8 7 6
(For debugging purposes, instream below might point to sys.stdin, or an opened filehandle)
You can't simply chain generators since map() returns a (lazily-evaluated) list:
import itertools
gen1 = map(int, (map(str.split, instream))) # CAN'T CHAIN DIRECTLY
The least complicated working solution I found is this, can it surely not be simplified?
gen1 = map(int, itertools.chain.from_iterable(itertools.chain(map(str.split, instream))))
Why the hell do I need to chain itertools.chain.from_iterable(itertools.chain just to process the result from map(str.split, instream) - it sort of defeats the purpose?
Is manually defining my generators faster?

An explicit ("manual") generator expression should be preferred over using map and filter. It is more readable to most people, and more flexible.
If I understand your question, this generator expression does what you need:
gen1 = ( int(x) for line in instream for x in line.split() )

You could build your generator by hand:
import string
def gen1(stream):
# presuming that stream is of type io.TextIOBase
s = ""
c = stream.read(1)
while len(c)>0:
if (c not in string.digits):
if len(s) > 0:
i = int(s)
yield i
s = ""
else:
s += c
c = stream.read(1)
if len(s) > 0:
i = int(s)
yield i
import io
g = gen1(io.StringIO("12 45 6 7 88"))
for x in g: # dangerous if stream is unlimited
print(x)
Which is certainly not the most beautiful code, but it does what you want.
Explanations:
If your input is indefinitely long you have to read it in chunks (or character wise).
Whenever you encounter a non-digit (whitespace), you convert the characters you have read until that point into an integer and yield it.
You also have to consider what happens when you reach the EOF.
My implementation is probably not very well performed, due to the fact that I'm reading char-wise. Using chunks one could speed it up significantly.
EDIT as to why your approach will never work:
map(str.split, instream)
does simply not do what you appear to think it does. map applies the given function str.split to each element of the iterator given as the second parameter. In your case that is a stream, i.e. a file object, in the case of sys.stdin specifically a io.TextIOBase object. Which indeed can be iterated over. Line by line, which emphatically is NOT what you want! In effect you iterate over your input line by line and split each line into words. The map generator iterates over (many) lists of words NOT over A list of words. Which is why you have to chain them together to get a single list to iterate on.
Also, the itertools.chain() in itertools.chain.from_iterable(itertools.chain(map(...))) is redundant. itertools.chain chains its arguments (each an inalterable object) together into one iterator. You only give it one argument so there is nothing to chain together, it basically returns the map object unchanged.
itertools.chain.from_iterable() on the other hand takes one argument, which is expected to be an iterator of iterators (e.g. a list of lists) and flattens it into one iterator (list).
EDIT2
import io, itertools
instream = io.StringIO("12 45 \n 66 7 88")
gen1 = itertools.chain.from_iterable(map(str.split, instream))
gen2 = map(int, gen1)
list(gen2)
returns
[12, 45, 66, 7, 88]

Related

list of one string element getting converted to list of characters

I have a program which receives input from another program and use it for further operations. The input can be a list, set, tuple but for further operations a list is needed. So I am converting input to list.
The problem arises when input my program receives is a list/set/tuple with just one element like below. The
import itertools
def not_mine(c):
d = {'John':['mid', 'forward'],
'Lana':['mid'],
'Jacob':['defence', 'mid'],
'Ian':['goal', 'mid']}
n = itemgetter(*c)(d)
n = list(set(itertools.chain.from_iterable(n)))
return n
def mine(c):
name = not_mine(c)
name_1 = list(name)
print(name_1)
mine(['Jacob', 'Ian'])
['defence', 'goal', 'mid']
mine(['Lana'])
['i', 'm', 'd']
Is there any way to prevent the second case? It should be a list of one element ['mid'].
Iterators
The function set uses the first argument as an iterator to create a sequence of items. str is natively an iterator. In other words, you can loop over a str and you'll assign to the for variable each character in the string per iteration.
for whatami in "hi!":
print(whatami)
h
i
!
If you want to treat a single string input as a single item, explicitly pass an iterator argument to set (list works the same way, BTW) with a single item in it. Tuple is, also, an iterator. Let's try to use it to prove our theory
t1 = ('ourstring', )
print(f"t1 is of type {type(t1)}")
s1 = set(t1)
print(s1)
t1 is of type <class 'tuple'>
{'ourstring'}
It works!
What we've done with ('ourstring', ) is explicitly define a tuple with one item. There's a familiar delimiter, ,, used to say "this tuple is instantiated with only one item".
Input
To separate situations between ingesting a list of items and one string item, you can consider two approaches.
The most straight-forward way is to agree on a delimiter in the input such as comma separated values. firstvalue,secondvalue,etc. The down side of this is that you'll quickly run into limitations of what kind of data you can receive.
To ease your development, argparse is strongly recommended command line arguments. It is a built-in, battle-hardened package made for this type of task. The docs's first example even shows a multi-value field.

Python, perplexity about "is" operator on integers [duplicate]

Why does the following behave unexpectedly in Python?
>>> a = 256
>>> b = 256
>>> a is b
True # This is an expected result
>>> a = 257
>>> b = 257
>>> a is b
False # What happened here? Why is this False?
>>> 257 is 257
True # Yet the literal numbers compare properly
I am using Python 2.5.2. Trying some different versions of Python, it appears that Python 2.3.3 shows the above behaviour between 99 and 100.
Based on the above, I can hypothesize that Python is internally implemented such that "small" integers are stored in a different way than larger integers and the is operator can tell the difference. Why the leaky abstraction? What is a better way of comparing two arbitrary objects to see whether they are the same when I don't know in advance whether they are numbers or not?
Take a look at this:
>>> a = 256
>>> b = 256
>>> id(a) == id(b)
True
>>> a = 257
>>> b = 257
>>> id(a) == id(b)
False
Here's what I found in the documentation for "Plain Integer Objects":
The current implementation keeps an array of integer objects for all integers between -5 and 256. When you create an int in that range you actually just get back a reference to the existing object.
So, integers 256 are identical, but 257 are not. This is a CPython implementation detail, and not guaranteed for other Python implementations.
Python's “is” operator behaves unexpectedly with integers?
In summary - let me emphasize: Do not use is to compare integers.
This isn't behavior you should have any expectations about.
Instead, use == and != to compare for equality and inequality, respectively. For example:
>>> a = 1000
>>> a == 1000 # Test integers like this,
True
>>> a != 5000 # or this!
True
>>> a is 1000 # Don't do this! - Don't use `is` to test integers!!
False
Explanation
To know this, you need to know the following.
First, what does is do? It is a comparison operator. From the documentation:
The operators is and is not test for object identity: x is y is true
if and only if x and y are the same object. x is not y yields the
inverse truth value.
And so the following are equivalent.
>>> a is b
>>> id(a) == id(b)
From the documentation:
id
Return the “identity” of an object. This is an integer (or long
integer) which is guaranteed to be unique and constant for this object
during its lifetime. Two objects with non-overlapping lifetimes may
have the same id() value.
Note that the fact that the id of an object in CPython (the reference implementation of Python) is the location in memory is an implementation detail. Other implementations of Python (such as Jython or IronPython) could easily have a different implementation for id.
So what is the use-case for is? PEP8 describes:
Comparisons to singletons like None should always be done with is or
is not, never the equality operators.
The Question
You ask, and state, the following question (with code):
Why does the following behave unexpectedly in Python?
>>> a = 256
>>> b = 256
>>> a is b
True # This is an expected result
It is not an expected result. Why is it expected? It only means that the integers valued at 256 referenced by both a and b are the same instance of integer. Integers are immutable in Python, thus they cannot change. This should have no impact on any code. It should not be expected. It is merely an implementation detail.
But perhaps we should be glad that there is not a new separate instance in memory every time we state a value equals 256.
>>> a = 257
>>> b = 257
>>> a is b
False # What happened here? Why is this False?
Looks like we now have two separate instances of integers with the value of 257 in memory. Since integers are immutable, this wastes memory. Let's hope we're not wasting a lot of it. We're probably not. But this behavior is not guaranteed.
>>> 257 is 257
True # Yet the literal numbers compare properly
Well, this looks like your particular implementation of Python is trying to be smart and not creating redundantly valued integers in memory unless it has to. You seem to indicate you are using the referent implementation of Python, which is CPython. Good for CPython.
It might be even better if CPython could do this globally, if it could do so cheaply (as there would a cost in the lookup), perhaps another implementation might.
But as for impact on code, you should not care if an integer is a particular instance of an integer. You should only care what the value of that instance is, and you would use the normal comparison operators for that, i.e. ==.
What is does
is checks that the id of two objects are the same. In CPython, the id is the location in memory, but it could be some other uniquely identifying number in another implementation. To restate this with code:
>>> a is b
is the same as
>>> id(a) == id(b)
Why would we want to use is then?
This can be a very fast check relative to say, checking if two very long strings are equal in value. But since it applies to the uniqueness of the object, we thus have limited use-cases for it. In fact, we mostly want to use it to check for None, which is a singleton (a sole instance existing in one place in memory). We might create other singletons if there is potential to conflate them, which we might check with is, but these are relatively rare. Here's an example (will work in Python 2 and 3) e.g.
SENTINEL_SINGLETON = object() # this will only be created one time.
def foo(keyword_argument=None):
if keyword_argument is None:
print('no argument given to foo')
bar()
bar(keyword_argument)
bar('baz')
def bar(keyword_argument=SENTINEL_SINGLETON):
# SENTINEL_SINGLETON tells us if we were not passed anything
# as None is a legitimate potential argument we could get.
if keyword_argument is SENTINEL_SINGLETON:
print('no argument given to bar')
else:
print('argument to bar: {0}'.format(keyword_argument))
foo()
Which prints:
no argument given to foo
no argument given to bar
argument to bar: None
argument to bar: baz
And so we see, with is and a sentinel, we are able to differentiate between when bar is called with no arguments and when it is called with None. These are the primary use-cases for is - do not use it to test for equality of integers, strings, tuples, or other things like these.
I'm late but, you want some source with your answer? I'll try and word this in an introductory manner so more folks can follow along.
A good thing about CPython is that you can actually see the source for this. I'm going to use links for the 3.5 release, but finding the corresponding 2.x ones is trivial.
In CPython, the C-API function that handles creating a new int object is PyLong_FromLong(long v). The description for this function is:
The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object. So it should be possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)
(My italics)
Don't know about you but I see this and think: Let's find that array!
If you haven't fiddled with the C code implementing CPython you should; everything is pretty organized and readable. For our case, we need to look in the Objects subdirectory of the main source code directory tree.
PyLong_FromLong deals with long objects so it shouldn't be hard to deduce that we need to peek inside longobject.c. After looking inside you might think things are chaotic; they are, but fear not, the function we're looking for is chilling at line 230 waiting for us to check it out. It's a smallish function so the main body (excluding declarations) is easily pasted here:
PyObject *
PyLong_FromLong(long ival)
{
// omitting declarations
CHECK_SMALL_INT(ival);
if (ival < 0) {
/* negate: cant write this as abs_ival = -ival since that
invokes undefined behaviour when ival is LONG_MIN */
abs_ival = 0U-(unsigned long)ival;
sign = -1;
}
else {
abs_ival = (unsigned long)ival;
}
/* Fast path for single-digit ints */
if (!(abs_ival >> PyLong_SHIFT)) {
v = _PyLong_New(1);
if (v) {
Py_SIZE(v) = sign;
v->ob_digit[0] = Py_SAFE_DOWNCAST(
abs_ival, unsigned long, digit);
}
return (PyObject*)v;
}
Now, we're no C master-code-haxxorz but we're also not dumb, we can see that CHECK_SMALL_INT(ival); peeking at us all seductively; we can understand it has something to do with this. Let's check it out:
#define CHECK_SMALL_INT(ival) \
do if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) { \
return get_small_int((sdigit)ival); \
} while(0)
So it's a macro that calls function get_small_int if the value ival satisfies the condition:
if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS)
So what are NSMALLNEGINTS and NSMALLPOSINTS? Macros! Here they are:
#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS 257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS 5
#endif
So our condition is if (-5 <= ival && ival < 257) call get_small_int.
Next let's look at get_small_int in all its glory (well, we'll just look at its body because that's where the interesting things are):
PyObject *v;
assert(-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS);
v = (PyObject *)&small_ints[ival + NSMALLNEGINTS];
Py_INCREF(v);
Okay, declare a PyObject, assert that the previous condition holds and execute the assignment:
v = (PyObject *)&small_ints[ival + NSMALLNEGINTS];
small_ints looks a lot like that array we've been searching for, and it is! We could've just read the damn documentation and we would've know all along!:
/* Small integers are preallocated in this array so that they
can be shared.
The integers that are preallocated are those in the range
-NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyLongObject small_ints[NSMALLNEGINTS + NSMALLPOSINTS];
So yup, this is our guy. When you want to create a new int in the range [NSMALLNEGINTS, NSMALLPOSINTS) you'll just get back a reference to an already existing object that has been preallocated.
Since the reference refers to the same object, issuing id() directly or checking for identity with is on it will return exactly the same thing.
But, when are they allocated??
During initialization in _PyLong_Init Python will gladly enter in a for loop to do this for you:
for (ival = -NSMALLNEGINTS; ival < NSMALLPOSINTS; ival++, v++) {
Check out the source to read the loop body!
I hope my explanation has made you C things clearly now (pun obviously intented).
But, 257 is 257? What's up?
This is actually easier to explain, and I have attempted to do so already; it's due to the fact that Python will execute this interactive statement as a single block:
>>> 257 is 257
During complilation of this statement, CPython will see that you have two matching literals and will use the same PyLongObject representing 257. You can see this if you do the compilation yourself and examine its contents:
>>> codeObj = compile("257 is 257", "blah!", "exec")
>>> codeObj.co_consts
(257, None)
When CPython does the operation, it's now just going to load the exact same object:
>>> import dis
>>> dis.dis(codeObj)
1 0 LOAD_CONST 0 (257) # dis
3 LOAD_CONST 0 (257) # dis again
6 COMPARE_OP 8 (is)
So is will return True.
It depends on whether you're looking to see if 2 things are equal, or the same object.
is checks to see if they are the same object, not just equal. The small ints are probably pointing to the same memory location for space efficiency
In [29]: a = 3
In [30]: b = 3
In [31]: id(a)
Out[31]: 500729144
In [32]: id(b)
Out[32]: 500729144
You should use == to compare equality of arbitrary objects. You can specify the behavior with the __eq__, and __ne__ attributes.
As you can check in source file intobject.c, Python caches small integers for efficiency. Every time you create a reference to a small integer, you are referring the cached small integer, not a new object. 257 is not an small integer, so it is calculated as a different object.
It is better to use == for that purpose.
I think your hypotheses is correct. Experiment with id (identity of object):
In [1]: id(255)
Out[1]: 146349024
In [2]: id(255)
Out[2]: 146349024
In [3]: id(257)
Out[3]: 146802752
In [4]: id(257)
Out[4]: 148993740
In [5]: a=255
In [6]: b=255
In [7]: c=257
In [8]: d=257
In [9]: id(a), id(b), id(c), id(d)
Out[9]: (146349024, 146349024, 146783024, 146804020)
It appears that numbers <= 255 are treated as literals and anything above is treated differently!
There's another issue that isn't pointed out in any of the existing answers. Python is allowed to merge any two immutable values, and pre-created small int values are not the only way this can happen. A Python implementation is never guaranteed to do this, but they all do it for more than just small ints.
For one thing, there are some other pre-created values, such as the empty tuple, str, and bytes, and some short strings (in CPython 3.6, it's the 256 single-character Latin-1 strings). For example:
>>> a = ()
>>> b = ()
>>> a is b
True
But also, even non-pre-created values can be identical. Consider these examples:
>>> c = 257
>>> d = 257
>>> c is d
False
>>> e, f = 258, 258
>>> e is f
True
And this isn't limited to int values:
>>> g, h = 42.23e100, 42.23e100
>>> g is h
True
Obviously, CPython doesn't come with a pre-created float value for 42.23e100. So, what's going on here?
The CPython compiler will merge constant values of some known-immutable types like int, float, str, bytes, in the same compilation unit. For a module, the whole module is a compilation unit, but at the interactive interpreter, each statement is a separate compilation unit. Since c and d are defined in separate statements, their values aren't merged. Since e and f are defined in the same statement, their values are merged.
You can see what's going on by disassembling the bytecode. Try defining a function that does e, f = 128, 128 and then calling dis.dis on it, and you'll see that there's a single constant value (128, 128)
>>> def f(): i, j = 258, 258
>>> dis.dis(f)
1 0 LOAD_CONST 2 ((128, 128))
2 UNPACK_SEQUENCE 2
4 STORE_FAST 0 (i)
6 STORE_FAST 1 (j)
8 LOAD_CONST 0 (None)
10 RETURN_VALUE
>>> f.__code__.co_consts
(None, 128, (128, 128))
>>> id(f.__code__.co_consts[1], f.__code__.co_consts[2][0], f.__code__.co_consts[2][1])
4305296480, 4305296480, 4305296480
You may notice that the compiler has stored 128 as a constant even though it's not actually used by the bytecode, which gives you an idea of how little optimization CPython's compiler does. Which means that (non-empty) tuples actually don't end up merged:
>>> k, l = (1, 2), (1, 2)
>>> k is l
False
Put that in a function, dis it, and look at the co_consts—there's a 1 and a 2, two (1, 2) tuples that share the same 1 and 2 but are not identical, and a ((1, 2), (1, 2)) tuple that has the two distinct equal tuples.
There's one more optimization that CPython does: string interning. Unlike compiler constant folding, this isn't restricted to source code literals:
>>> m = 'abc'
>>> n = 'abc'
>>> m is n
True
On the other hand, it is limited to the str type, and to strings of internal storage kind "ascii compact", "compact", or "legacy ready", and in many cases only "ascii compact" will get interned.
At any rate, the rules for what values must be, might be, or cannot be distinct vary from implementation to implementation, and between versions of the same implementation, and maybe even between runs of the same code on the same copy of the same implementation.
It can be worth learning the rules for one specific Python for the fun of it. But it's not worth relying on them in your code. The only safe rule is:
Do not write code that assumes two equal but separately-created immutable values are identical (don't use x is y, use x == y)
Do not write code that assumes two equal but separately-created immutable values are distinct (don't use x is not y, use x != y)
Or, in other words, only use is to test for the documented singletons (like None) or that are only created in one place in the code (like the _sentinel = object() idiom).
For immutable value objects, like ints, strings or datetimes, object identity is not especially useful. It's better to think about equality. Identity is essentially an implementation detail for value objects - since they're immutable, there's no effective difference between having multiple refs to the same object or multiple objects.
is is the identity equality operator (functioning like id(a) == id(b)); it's just that two equal numbers aren't necessarily the same object. For performance reasons some small integers happen to be memoized so they will tend to be the same (this can be done since they are immutable).
PHP's === operator, on the other hand, is described as checking equality and type: x == y and type(x) == type(y) as per Paulo Freitas' comment. This will suffice for common numbers, but differ from is for classes that define __eq__ in an absurd manner:
class Unequal:
def __eq__(self, other):
return False
PHP apparently allows the same thing for "built-in" classes (which I take to mean implemented at C level, not in PHP). A slightly less absurd use might be a timer object, which has a different value every time it's used as a number. Quite why you'd want to emulate Visual Basic's Now instead of showing that it is an evaluation with time.time() I don't know.
Greg Hewgill (OP) made one clarifying comment "My goal is to compare object identity, rather than equality of value. Except for numbers, where I want to treat object identity the same as equality of value."
This would have yet another answer, as we have to categorize things as numbers or not, to select whether we compare with == or is. CPython defines the number protocol, including PyNumber_Check, but this is not accessible from Python itself.
We could try to use isinstance with all the number types we know of, but this would inevitably be incomplete. The types module contains a StringTypes list but no NumberTypes. Since Python 2.6, the built in number classes have a base class numbers.Number, but it has the same problem:
import numpy, numbers
assert not issubclass(numpy.int16,numbers.Number)
assert issubclass(int,numbers.Number)
By the way, NumPy will produce separate instances of low numbers.
I don't actually know an answer to this variant of the question. I suppose one could theoretically use ctypes to call PyNumber_Check, but even that function has been debated, and it's certainly not portable. We'll just have to be less particular about what we test for now.
In the end, this issue stems from Python not originally having a type tree with predicates like Scheme's number?, or Haskell's type class Num. is checks object identity, not value equality. PHP has a colorful history as well, where === apparently behaves as is only on objects in PHP5, but not PHP4. Such are the growing pains of moving across languages (including versions of one).
It also happens with strings:
>>> s = b = 'somestr'
>>> s == b, s is b, id(s), id(b)
(True, True, 4555519392, 4555519392)
Now everything seems fine.
>>> s = 'somestr'
>>> b = 'somestr'
>>> s == b, s is b, id(s), id(b)
(True, True, 4555519392, 4555519392)
That's expected too.
>>> s1 = b1 = 'somestrdaasd ad ad asd as dasddsg,dlfg ,;dflg, dfg a'
>>> s1 == b1, s1 is b1, id(s1), id(b1)
(True, True, 4555308080, 4555308080)
>>> s1 = 'somestrdaasd ad ad asd as dasddsg,dlfg ,;dflg, dfg a'
>>> b1 = 'somestrdaasd ad ad asd as dasddsg,dlfg ,;dflg, dfg a'
>>> s1 == b1, s1 is b1, id(s1), id(b1)
(True, False, 4555308176, 4555308272)
Now that's unexpected.
What’s New In Python 3.8: Changes in Python behavior:
The compiler now produces a SyntaxWarning when identity checks (is and
is not) are used with certain types of literals (e.g. strings, ints).
These can often work by accident in CPython, but are not guaranteed by
the language spec. The warning advises users to use equality tests (==
and !=) instead.

Python For Loop - Switching from PHP

I am new to python and usually use PHP. Can someone please clear me structure of For loop for python.
numbers = int(x) for x in numbers
In PHP everything realted to For loop used to inside the body. I can't understand why methods are before for loop in python.
First of all the statement is missing brackets:
numbers = [int(x) for x in numbers]
This is called a list comprehension and it is the equivalent of:
numbers = []
for x in numbers:
numbers.append(int(x))
Note that you can also use comprehensions as generator expressions, in which case the [] become ():
numbers = (int(x) for x in numbers)
which is the equivalent of:
def numbers(N):
for x in N:
yield int(x)
This means that the for loop will only execute one yield at the time as it is being processed. In other words, while the first example builds a list in memory, a generator returns one element at a time when executed. This is great to process large lists where you can generate one element a time without getting everything into memory (e.g. processing a file line-by-line).
So as you can see comprehensions and generator expressions are a great way to reduce the amount of code required to process lists, tuples and any other iterable.

pyspark fold method output

I'm surprised at this output from fold, I can't imagine what it's doing.
I would expect that something.fold(0, lambda a,b: a+1) would return the number of elements in something, since the fold starts at 0 and adds 1 for each element.
sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
8
I'm coming from Scala, where fold works as the way I've described. So how is fold supposed to work in pyspark? Thanks for your thoughts.
To understand what's going on here, let's look at the definition of Spark's fold operation. Since you're using PySpark, I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also browse the source on GitHub):
def fold(self, zeroValue, op):
"""
Aggregate the elements of each partition, and then the results for all
the partitions, using a given associative function and a neutral "zero
value."
The function C{op(t1, t2)} is allowed to modify C{t1} and return it
as its result value to avoid object allocation; however, it should not
modify C{t2}.
>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
15
"""
def func(iterator):
acc = zeroValue
for obj in iterator:
acc = op(obj, acc)
yield acc
vals = self.mapPartitions(func).collect()
return reduce(op, vals, zeroValue)
(For comparison, see the Scala implementation of RDD.fold).
Spark's fold operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for every partition rather than one value for each non-empty partition. This means that the result of fold is sensitive to the number of partitions:
>>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
100
>>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
50
>>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
1
In this last case, what's happening is that the single partition is being folded down to the correct value, then that value is folded with the zero-value at the driver to yield 1.
It seems that Spark's fold() operation actually requires the fold function to be commutative in addition to associative. There are actually other places in Spark that impose this requirement, such as the fact that the ordering of elements within a shuffled partition can be non-deterministic across runs (see SPARK-5750).
I've opened a Spark JIRA ticket to investigate this issue: https://issues.apache.org/jira/browse/SPARK-6416.
Lets me try to give simple examples to explain fold method of spark. I will be using pyspark here.
rdd1 = sc.parallelize(list([]),1)
Above line is going to create an empty rdd with one partition
rdd1.fold(10, lambda x,y:x+y)
This yield output as 20
rdd2 = sc.parallelize(list([1,2,3,4,5]),2)
Above line is going to create rdd with values 1 to 5 and will be having a total of 2 partitions
rdd2.fold(10, lambda x,y:x+y)
This yields output as 45
So in above case for sake of simplicity what is happening here is you are having zeroth element as 10. So the sum that you would otherwise get of all numbers in the RDD, is now added by 10(i.e. zeroth element+all other elements => 10+1+2+3+4+5 = 25). Also now we have two partitions(i.e. number of partitions*zeroth element => 2*10 = 20)
Final output that fold emits is 25+20 = 45
Using similar process its clear why the fold operation on rdd1 yielded 20 as output.
Reduce fails when we have empty list something like rdd1.reduce(lambda x,y:x+y)
ValueError: Can not reduce() empty RDD
Fold can be used if we think we can have empty list in the rdd
rdd1.fold(0, lambda x,y:x+y)
As expected this will yield output as 0.

whats another way to write python3 zip [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
Ive been working on a code that reads lines in a file document and then the code organizes them. However, i got stuck at one point and my friend told me what i could use. the code works but it seems that i dont know what he is doing at line 7 and 8 FROM THE BOTTOM. I used #### so you guys know which lines it is.
So, essentially how can you re-write those 2 lines of codes and why do they work? I seem to not understand dictionaries
from sys import argv
filename = input("Please enter the name of a file: ")
file_in=(open(filename, "r"))
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
animaldictionary = dict()
for line in file_in:
if '\n' == line[-1]:
line = line[:-1]
(a, b, c) = line.split(':')
ac = (a,c)
if ac not in animaldictionary:
animaldictionary[ac] = 0
animaldictionary[ac] += 1
alla = []
for key, value in animaldictionary:
if key not in alla:
alla.append(key)
print ("alla:",alla)
allc = []
for key, value in animaldictionary:
if value not in allc:
allc.append(value)
print("allc", allc)
for a in sorted(alla):
print('%9s'%a,end=' '*13)
for c in sorted(allc):
ac = (a,c)
valc = 0
if ac in animaldictionary:
valc = animaldictionary[ac]
print('%4d'%valc,end=' '*19)
print()
print("="*60)
print("Animals that visited both stations at least 3 times: ")
for a in sorted(alla):
x = 'false'
for c in sorted(allc):
ac = (a,c)
count = 0
if ac in animaldictionary:
count = animaldictionary[ac]
if count >= 3:
x = 'true'
if x is 'true':
print('%6s'%a, end=' ')
print("")
print("="*60)
print("Average of the number visits in each month for each station:")
#(alla, allc) =
#for s in zip(*animaldictionary.keys()):
# (alla,allc).append(s)
#print(alla, allc)
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys())) ##### how else can you write this
##### how else can you rewrite the next code
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0) for a in alla for ac in ((a,c,),))//12)))for c in sorted(allc)]))
print("="*60)
print("Month with the maximum number of visits for each station:")
print("Station Month Number")
print("1")
print("2")
The two lines you indicated are indeed rather confusing. I'll try to explain them as best I can, and suggest alternative implementations.
The first one computes values for alla and allc:
(alla,allc,) = (set(s) for s in zip(*animaldictionary.keys()))
This is nearly equivalent to the loops you've already done above to build your alla and allc lists. You can skip it completely if you want. However, lets unpack what it's doing, so you can actually understand it.
The innermost part is animaldictionary.keys(). This returns an iterable object that contains all the keys of your dictionary. Since the keys in animaldictionary are two-valued tuples, that's what you'll get from the iterable. It's actually not necessary to call keys when dealing with a dictionary in most cases, since operations on the keys view are usually identical to doing the same operation on the dictionary directly.
Moving on, the keys gets wrapped up by a call to the zip function using zip(*keys). There's two things happening here. First, the * syntax unpacks the iterable from above into separate arguments. So if animaldictionary's keys were ("a1", "c1), ("a2", "c2"), ("a3", "c3") this would call zip with those three tuples as separate arguments. Now, what zip does is turn several iterable arguments into a single iterable, yielding a tuple with the first value from each, then a tuple with the second value from each, and so on. So zip(("a1", "c1"), ("a2", "c2"), ("a3", "c3")) would return a generator yielding ("a1", "a2", "a3") followed by ("c1", "c2", "c3").
The next part is a generator expression that passes each value from the zip expression into the set constructor. This serves to eliminate any duplicates. set instances can also be useful in other ways (e.g. finding intersections) but that's not needed here.
Finally, the two sets of a and c values get assigned to variables alla and allc. They replace the lists you already had with those names (and the same contents!).
You've already got an alternative to this, where you calculate alla and allc as lists. Using sets may be slightly more efficient, but it probably doesn't matter too much for small amounts of data. Another, more clear, way to do it would be:
alla = set()
allc = set()
for key in animaldict: # note, iterating over a dict yields the keys!
a, c = key # unpack the tuple key
alla.add(a)
allc.add(c)
The second line you were asking about does some averaging and combines the results into a giant string which it prints out. It is really bad programming style to cram so much into one line. And in fact, it does some needless stuff which makes it even more confusing. Here it is, with a couple of line breaks added to make it all fit on the screen at once.
print('\n'.join(['\t'.join((c,str(sum(animaldictionary.get(ac,0)
for a in alla for ac in ((a,c,),))//12)
)) for c in sorted(allc)]))
The innermost piece of this is for ac in ((a,c,),). This is silly, since it's a loop over a 1-element tuple. It's a way of renaming the tuple (a,c) to ac, but it is very confusing and unnecessary.
If we replace the one use of ac with the tuple explicitly written out, the new innermost piece is animaldictionary.get((a,c),0). This is a special way of writing animaldictionary[(a, c)] but without running the risk of causing a KeyError to be raised if (a, c) is not in the dictionary. Instead, the default value of 0 (passed in to get) will be returned for non-existant keys.
That get call is wrapped up in this: (getcall for a in alla). This is a generator expression that gets all the values from the dictionary with a given c value in the key
(with a default of zero if the value is not present).
The next step is taking the average of the values in the previous generator expression: sum(genexp)//12. This is pretty straightforward, though you should note that using // for division always rounds down to the next integer. If you want a more precise floating point value, use just /.
The next part is a call to '\t'.join, with an argument that is a single (c, avg) tuple. This is an awkward construction that could be more clearly written as c+"\t"+str(avg) or "{}\t{}".format(c, avg). All of these result in a string containing the c value, a tab character and the string form of the average calcualted above.
The next step is a list comprehension, [joinedstr for c in sorted(allc)] (where joinedstr is the join call in the previous step). Using a list comprehension here is a bit odd, since there's no need for a list (a generator expression would do just as well).
Finally, the list comprehension is joined with newlines and printed: print("\n".join(listcomp)). This is straightforward.
Anyway, this whole mess can be rewritten in a much clearer way, by using a few variables and printing each line separately in a loop:
for c in sorted(allc):
total_values = sum(animaldictionary.get((a,c),0) for a in alla)
average = total_values // 12
print("{}\t{}".format(c, average))
To finish, I have some general suggestions.
First, your data structure may not be optimal for the uses you are making of you data. Rather than having animaldict be a dictionary with (a,c) keys, it might make more sense to have a nested structure, where you index each level separately. That is, animaldict[a][c]. It might even make sense to have a second dictionaries containing the same values indexed in the reverse order (e.g. one is indexed [a][c] while another is indexed [c][a]). With this approach you might not need the alla and allc lists for iterating (you'd just loop over the contents of the main dictionary directly).
My second suggestion is about code style. Many of your variables are named poorly, either because their names don't have any meaning (e.g. c) or where the names imply a meaning that is incorrect. The most glaring issue is your key and value variables, which in fact unpack two pieces of the key (AKA a and c). In other situations you can get keys and values together, but only when you are iterating over a dictionary's items() view rather than on the dictionary directly.

Resources