Why does list insert, append and extend work like this? [duplicate] - python-3.x

This question already has answers here:
How do I clone a list so that it doesn't change unexpectedly after assignment?
(24 answers)
Closed 4 months ago.
def append_arr(arr):
    t_arr = arr
    print('arr before', arr)
    t_arr.extend(arr)
    print('arr after', arr)

arr = ['a', 'b', 'c']
append_arr(arr)
I had a list a, assigned b = a, and then changed b with list methods (append, insert, extend).
I never touched a again, but whenever b changed, a changed along with it.
How can I change b with (append, insert, extend) without a changing too?
def test():
    arr_m = ['a', 'b', 'c']
    print('arr_m before', arr_m)
    append_arr(arr_m)
    print('arr_m after', arr_m)

test()
arr_m before ['a', 'b', 'c']
arr before ['a', 'b', 'c']
arr after ['a', 'b', 'c', 'a', 'b', 'c']
arr_m after ['a', 'b', 'c', 'a', 'b', 'c']
I don't know why arr_m changed too.

You are expecting t_arr = arr to make a new list which has the same contents, but all it does is make a second name which refers to the same list.
It works like this because:
A lot of programming languages were designed based on how computers work. It is normal to think of lists as things, but at the low level in a computer a list is more like a place - the spot in memory where all of the list's contents are stored. To this day many programmers are used to "place-oriented programming" (though almost nobody calls it that), and Python was written way back in 1991 or so.
Until recently, languages where everything was immutable, or where assignment did a copy by default, were often far less efficient - it takes some tricky optimizations to turn code that semantically always makes a copy into machine operations that only copy the minimum of data only when they need to. So back when Python was new, it would've been a lot slower and would've used a lot more memory if it had been designed to make copies all the time.
Programmer culture has not yet finished understanding "the value of values" (a great video by Rich Hickey, the creator of the language Clojure which unlike Python doesn't behave this way) - so we've still got a lot of place-oriented shared-mutable-state-by-default languages in widespread use and most people don't see it as a big deal or major problem.
But now you might be thinking "but why doesn't that happen with integers and strings!?" And you'd be right, if you did t_string = string and then t_string += string, it would just work exactly like you expected the list to work. And that's because:
Integers were historically a special case: they're so small in a computer, and so fundamental to computer operations, that it was often more efficient, or in any case efficient enough, to implement them as immutable values. People also had an easier time understanding intuitively that it would be insane if x = 0; y = x; x += 1 turned the zero in y into a one as well.
Strings were deliberately made immutable, even though they're basically lists of characters. By the time Python was being made, the world of programming had seen so many pitfalls of mutable strings (and reactions like yours from new programmers) that "strings should be immutable" managed to overcome the tradition of mutable state, especially in languages aiming to be more human-friendly like Python, despite the memory and speed costs.
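To see the contrast concretely, here is a minimal sketch: rebinding an int or str name leaves the original alone, while mutating a list through a second name does not.

```python
x = 0
y = x
x += 1          # ints are immutable: += rebinds x to a brand-new object
assert y == 0   # y still refers to the original 0

s = "ab"
t = s
t += s            # strings are immutable too: t now names a new string
assert s == "ab"  # s is untouched

a = ["ab"]
b = a
b += a          # lists are mutable: += extends the shared list in place
assert a == ["ab", "ab"]  # a sees the change, because a and b are one list
```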
The linked duplicate question will show you how to make it instead do what you want, and has some other decent elaborations on why it works this way.
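For completeness, a minimal sketch of the fix: make an actual copy at the top of the function instead of a second name (any of list(arr), arr[:], or arr.copy() works).

```python
def append_arr(arr):
    t_arr = list(arr)   # a real copy, not just a second name for the same list
    t_arr.extend(arr)
    return t_arr

arr = ['a', 'b', 'c']
result = append_arr(arr)
assert arr == ['a', 'b', 'c']                     # original is unchanged
assert result == ['a', 'b', 'c', 'a', 'b', 'c']   # the copy got extended
```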

How would I know if Python creates a new sublist in memory for: `for item in nums[1:]`

I'm not asking for an answer to the question, but rather how I, on my own, could have gotten the answer.
Original Question:
Does the following code cause Python to make a new list of size (len(nums) - 1) in memory that then gets iterated over?
for item in nums[1:]:
    # do stuff with item
Original Answer:
A similarish question is asked here and there is a subcomment by Srinivas Reddy Thatiparthy saying that a new sublist is created.
But, there is no detail given about how he arrived at this answer, which I think makes it very different from what I'm looking for.
Question:
How could I have figured out on my own what the answer to my question is?
I've had similar questions before. For instance, I learned that if I do my_function(nums[1:]), I don't pass in a "slice" but rather a completely new, different sublist! I found this out by just testing whether the original list passed into my_function was modified post-function (it wasn't).
But I don't see an immediate way to figure out if Python is making a new sublist for the for loop example. Please help me to know how to do this.
side note
By the way, this is the current solution I'm using from the original stackoverflow post solutions:
for indx, item in enumerate(nums):
    if indx == 0:
        continue
    # do stuff w items
In general, the easy way to learn whether you have a new chunk of data or just a new reference to an existing chunk is to modify the data through one reference, and then see if it is also modified through the other. (It sounds like that's the test you already did with my_function, and I would recommend it as a general technique.) Some pseudocode would look like:
function areSameRef(thing1, thing2) {
    thing1.modify()
    return thing1.equals(thing2) // make sure this is not just a referential equality check
}
It is very rare that this will fail, and essentially requires behind-the-scenes optimizations where data isn't cloned immediately but only when modified. In this case the fact that the underlying data is the same is being hidden from you, and in most cases, you should just trust that whoever did the hiding knows what they're doing. Exceptions are if they did it wrong, or if you have some complex performance issues. For that you may need to turn to more language-specific debugging or profiling tools. (See below for more)
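Translated into Python, a sketch of that technique for plain lists (the sentinel object is just an arbitrary marker that can't already be in the data):

```python
def are_same_ref(list1, list2):
    """Mutate the data through one reference, then look through the other."""
    sentinel = object()        # a unique marker that can't already be present
    list1.append(sentinel)
    try:
        return len(list2) > 0 and list2[-1] is sentinel
    finally:
        list1.pop()            # undo the mutation, leaving the data as it was

a = [1, 2, 3]
b = a          # a second name for the same list
c = a[:]       # a slice: a genuinely new list
assert are_same_ref(a, b) is True
assert are_same_ref(a, c) is False
assert a == [1, 2, 3]          # the probe left the data untouched
```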
Do also be careful about cases where part of the data may be shared - for instance, look up cons lists and tail sharing. In those cases, something like:
function foo(list1, list2) {
    list1.append(someElement)
    return list1.length == list2.length
}
will return false - the element is only added to the first list - but something like:
function bar(list1, list2) {
    list1.set(someIndex, someElement)
    return list1.get(someIndex) == list2.get(someIndex)
}
will return true (though in practice, lists created that way usually don't have an interface that allows mutability.)
I don't see a question in part 2, but yes, your conclusion looks valid to me.
EDIT: More on actual memory usage
As you pointed out, there are situations where that sort of test won't work because you don't actually have two references, as in the for item in nums[1:] case. In that case I would say turn to a profiler, but even then you can't fully trust the results.
The reason for that comes down to how compilers/interpreters work, and the contract they fulfill in the language specification. The general rule is that the interpreter is allowed to re-arrange and modify the execution of your code in any way that does not change the results, but may change the memory or time performance. So, if the state of your code and all the I/O are the same, it should not be possible for foo(5) to return 6 in one interpreter implementation/execution and 7 in another, but it is valid for them to take very different amounts of time and memory.
This matters because a lot of what interpreters and compilers do is behind-the-scenes optimizations; they will try to make your code run as fast as possible and with as small a memory footprint as possible, so long as the results are the same. However, it can only do so when it can prove that the changes will not modify the results.
This means that if you write a simple test case, the interpreter may optimize it behind the scenes to minimize the memory usage and give you one result - "no new list is created." But, if you try to trust that result in real code, the real code may be too complex for the compiler to tell if the optimization is safe, and it may fail. It can also depend upon the specific interpreter version, environmental variables or available hardware resources.
Here's an example:
def foo(x: int):
    l = range(9999)
    return 5

def bar(x: int):
    l = range(9999)
    if x + 1 != (x * 2 + 2) / 2:
        return l[x]
    else:
        return 5
I can't promise this for any particular language, but I would usually expect foo and bar to have very different memory usage. In foo, any moderately well-built interpreter should be able to tell that l is never referenced before it goes out of scope, and can therefore safely skip allocating any memory at all. In bar (unless I failed at arithmetic), l will never be used either - but knowing that requires some reasoning about the condition of the if statement. It takes a much smarter interpreter to recognize that, so even though these two snippets might look logically the same, they can have very different behind-the-scenes performance.
EDIT: As has been pointed out to me, Python specifically may not be able to optimize either of these, given the dynamic nature of the language; the range function and the list type may both have been re-assigned or altered from elsewhere in the code. Without specific expertise in the Python optimization world I can't say what they do or don't do. Anyway, I'm leaving this here for edification on the general concept of optimizations, but take my error as an object lesson in "reasoning about optimization is hard".
All of that being said: FWIW, I strongly suspect that the python interpreter is smart enough to recognize that for i in nums[1:] should not actually allocate new memory, but just iterate over a slice. That looks to my eyes to be a relatively simple, safe and valuable transformation on a very common use case, so I would expect the (highly optimized) python interpreter to handle it.
EDIT2: As a final (opinionated) note, I'm less confident about that in Python than I am in almost any other language, because Python syntax is so flexible and allows so many strange things. This makes it much more difficult for the python interpreter (or a human, for that matter) to say anything with confidence, because the space of "legal python code" is so large. This is a big part of why I prefer much stricter languages like Rust, which force the programmer to color inside the lines but result in much more predictable behaviors.
EDIT3: As a post-final note, usually for things like this it's best to trust that the execution environment is handling these sorts of low-level optimizations. Nine times out of ten, don't try to solve this kind of performance problem until something actually breaks.
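That said, if you do want a direct measurement without a second reference, the standard library's tracemalloc module can show whether the slice allocates. A minimal sketch (CPython; exact numbers will vary by version and platform):

```python
import tracemalloc

nums = list(range(100_000))   # allocate the big list before tracing starts

tracemalloc.start()
total = 0
for item in nums[1:]:         # if this materializes a new list, tracemalloc sees it
    total += item
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# A list of ~100k elements needs roughly 8 bytes per pointer, so a peak
# well above that suggests the slice really was allocated as a new list.
print(f"peak traced allocation: {peak} bytes")
```

On the CPython builds I'd expect, the peak is on the order of the slice's size, consistent with the experiments in the next answer: the slice is a real new list, not a lazy view.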
As for knowing how list slice works, from the language reference Sequence Types — list, tuple, range, we know that
s[i:j] - The slice of s from i to j is defined as the sequence of
items with index k such that i <= k < j.
So, the slice creates a new sequence but we don't know whether that sequence is a list or whether there is some clever way that the same list object somehow represents both of these sequences. That's not too surprising with the python language spec where lists are described as part of the general discussion of sequences and the spec never really tries to cover all of the details for object implementation.
That's because in the end, something like nums[1:] is really just syntactic sugar for nums.__getitem__(slice(1, None)), meaning that lists get to decide for themselves what slicing means. And you need to go to the source for the implementation. See the list_subscript function in listobject.c.
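You can check that desugaring directly; a quick sketch:

```python
nums = [10, 20, 30, 40]

# nums[1:] is sugar for calling __getitem__ with a slice object
assert nums[1:] == nums.__getitem__(slice(1, None)) == [20, 30, 40]

# the two results are equal, but each call builds a distinct list object
assert nums[1:] is not nums.__getitem__(slice(1, None))
```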
But we can experiment. Looking at the documentation for The for statement,
for_stmt ::= "for" target_list "in" starred_list ":" suite
             ["else" ":" suite]
The starred_list expression is evaluated once; it should yield an iterable object.
So, nums[1:] is an expression that must yield an iterable object and we can assign that object to an intermediate variable.
nums = [1, 2, 3]
tmp = nums[1:]
for item in tmp:
    pass
tmp[0] = 999
assert id(nums) != id(tmp), "List slice creates a new object"
assert type(tmp) == type(nums), "List slice creates a new list"
assert 999 not in nums, "List slice doesn't affect original"
Run that, and if none of the assertions raise, you know that a new list was created.
Other sequence-like objects may work radically differently. With a numpy array, for instance, two array objects may indeed reference the same memory. In the example below, the final assert fails because the slice is another view into the same array. Yes, this can keep you up all night.
import numpy as np
nums = np.array([1,2,3])
tmp = nums[1:]
for item in tmp:
pass
tmp[0] = 999
assert id(nums) != id(tmp), "array slice creates a new object"
assert type(tmp) == type(nums), "array slice creates a new list"
assert 999 not in nums, "array slice doesn't affect original"
You can use the new Walrus operator := to capture the temporary object created by Python for the slice. A little investigation demonstrates that they aren't the same object.
import sys
print(sys.version)
a = list(range(1000))
for i in (b := a[1:]):
    b[0] = 906
print(b is a)
print(a[:10])
print(b[:10])
print(sys.getsizeof(a))
print(sys.getsizeof(b))
Generates the following output:
3.11.0 (main, Nov 4 2022, 00:14:47) [GCC 7.5.0]
False
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[906, 2, 3, 4, 5, 6, 7, 8, 9, 10]
8056
8048
See for yourself on the Godbolt Compiler Explorer, where you can also inspect the bytecode CPython generates.

Assignment expressions without parentheses in set comprehension [duplicate]

I was surprised to discover recently that while dicts are guaranteed to preserve insertion order in Python 3.7+, sets are not:
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> d
{'a': 1, 'b': 2, 'c': 3}
>>> d['d'] = 4
>>> d
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> s = {'a', 'b', 'c'}
>>> s
{'b', 'a', 'c'}
>>> s.add('d')
>>> s
{'d', 'b', 'a', 'c'}
What is the rationale for this difference? Do the same efficiency improvements that led the Python team to change the dict implementation not apply to sets as well?
I'm not looking for pointers to ordered-set implementations or ways to use dicts as stand-ins for sets. I'm just wondering why the Python team didn't make built-in sets preserve order at the same time they did so for dicts.
Sets and dicts are optimized for different use-cases. The primary use of a set is fast membership testing, which is order agnostic. For dicts, cost of the lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, and so the set implementation needs to optimize for both the found and not-found case. Also, some optimizations for common set operations such as union and intersection make it difficult to retain set ordering without degrading performance.
While both data structures are hash based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, dicts use randomized probing, while sets combine linear probing with open addressing to improve cache locality. The initial linear probe (by default 9 steps in CPython) checks a series of adjacent key/hash pairs, reducing the cost of hash collision handling - consecutive memory access is cheaper than scattered probes.
dictobject.c - master, v3.5.9
setobject.c - master, v3.5.9
issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.
It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.
Sets remain unordered. (Why? The usage patterns are different. Also, different implementation.)
– Guido van Rossum
Sets use a different algorithm that isn't as amenable to retaining insertion order.
Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.
– Raymond Hettinger
A detailed discussion about whether to compactify sets for 3.7, and why it was decided against, can be found in the python-dev mailing lists.
In summary, the main points are: different usage patterns (insertion ordering is useful for dicts such as **kwargs, less so for sets), the space savings from compacting sets are less significant (there are only key + hash arrays to densify, as opposed to key + hash + value arrays), and the aforementioned linear probing optimization which sets currently use is incompatible with a compact implementation.
I will reproduce Raymond's post below which covers the most important points.
On Sep 14, 2016, at 3:50 PM, Eric Snow wrote:
Then, I'll do same to sets.
Unless I've misunderstood, Raymond was opposed to making a similar change to set.
That's right. Here are a few thoughts on the subject before people start running wild.
For the compact dict, the space savings was a net win, with the additional space consumed by the indices and the overallocation for the key/value/hash arrays being more than offset by the improved density of those arrays. However, for sets the net was much less favorable, because we still need the indices and overallocation but can only offset the space cost by densifying two of the three arrays. In other words, compacting makes more sense when you have wasted space for keys, values, and hashes. If you lose one of those three, it stops being compelling.
The use pattern for sets is different from dicts. The former has more hit-or-miss lookups. The latter tends to have fewer missing-key lookups. Also, some of the optimizations for set-to-set operations make it difficult to retain set ordering without impacting performance.
I pursued an alternative path to improve set performance. Instead of compacting (which wasn't much of a space win and incurred the cost of an additional indirection), I added linear probing to reduce the cost of collisions and improve cache performance. This improvement is incompatible with the compacting approach I advocated for dictionaries.
For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting that sets become ordered as well. The docs already link to a recipe for creating an OrderedSet ( https://code.activestate.com/recipes/576694/ ), but it seems like the uptake has been nearly zero. Also, now that Eric Snow has given us a fast OrderedDict, it is easier than ever to build an OrderedSet from MutableSet and OrderedDict, but again I haven't observed any real interest, because typical set-to-set data analytics don't really need or care about ordering. Likewise, the primary use of fast membership testing is order agnostic.
That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting special cases for orderable data where set-to-set operations can be sped up by comparing entire ranges of keys (see https://code.activestate.com/recipes/230113-implementation-of-sets-using-sorted-lists for a starting point). IIRC, PyPI already has code for set-like bloom filters and cuckoo hashing.
I understand that it is exciting to have a major block of code accepted into the Python core, but that shouldn't open the floodgates to engaging in more major rewrites of other datatypes unless we're sure it is warranted.
– Raymond Hettinger
From [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered, Sept 2016.
Discussions
Your question is germane and has already been heavily discussed on python-dev. R. Hettinger shared a list of rationales in that thread. The state of the issue appeared open-ended, shortly after this detailed reply from T. Peters. Some time later (c. 2022), the discussion reignited elsewhere on python-ideas.
In short, the implementation of modern dicts that preserves insertion order is unique and not considered appropriate for sets. In particular, dicts are used everywhere to run Python (e.g. __dict__ in namespaces of objects). A major motivation behind the modern dict was to reduce size, making Python more memory-efficient overall. In contrast, sets are less prevalent than dicts within Python's core, which weighs against such a refactoring. See also R. Hettinger's talk on the modern dict implementation.
Perspectives
The unordered nature of sets in Python parallels the behavior of mathematical sets. Order is not guaranteed.
The corresponding mathematical concept is unordered and it would be weird to impose
such an order - R. Hettinger
If order of any kind were introduced to sets in Python, then this behavior would correspond to a completely separate mathematical structure, namely an ordered set (or Oset). Osets play a separate role in mathematics, particularly in combinatorics. One practical application of Osets is observed in change ringing (the changing order of bells).
Having unordered sets is consistent with a very generic and ubiquitous data structure that underpins most of modern mathematics, i.e. Set Theory. I submit, unordered sets in Python are good to have.
See also related posts that expand on this topic:
Converting a list to a set changes element order
Get unique values from a list in python
Does Python have an ordered set

Haskell Parsec: primitive for greedy many?

I'm only just starting to learn the Parsec library, and I was wondering if there is any primitive in the library that can do the following: given a parser let a = char 'a', and a string aaab, would return Right ['a', 'a', 'a'], with "b" remaining, i.e., would parse as much as it can, but no more. I feel this is so necessary that it has to exist in some form or another in the library.
You want to use many a, which will parse as many a's as it can.
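A minimal sketch (assuming the parsec package is installed; parse returns an Either ParseError value, and since String is [Char], Right "aaa" is the same value as the Right ['a','a','a'] you asked for):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

a :: Parser Char
a = char 'a'

main :: IO ()
main = do
  -- many is greedy: it consumes as many 'a's as it can, then stops,
  -- leaving the rest of the input ("b") unconsumed for later parsers.
  print (parse (many a) "" "aaab")   -- Right "aaa"
```

If you need at least one match, many1 is the greedy one-or-more variant.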

Python3: Is there a memory efficient way working with itertools.product

I am looking for a way to overcome a huge memory issue.
I have about 100 lists, each containing an average of 10 elements.
And I need to create all possible combinations and work with them one by one (I don't need all at once).
Currently my code looks like this:
import itertools

l1 = ['a', 'b', 'c']
l2 = ['a', 'b', 'c']
l3 = ['a', 'b', 'c']
all_lists = [l1, l2, l3]

combination_list = [item for item in itertools.product(*all_lists)]
for c in combination_list:
    print(c)  # do something with c
Sadly if I try to use more than 10 lists I get a memory error.
Any idea how I can overcome that memory issue?
Is there a way to save the combinations one by one in a file and access them that way, too?
Edit: I should have said that I need to access those combinations again later, storing them in a dict as keys and assigning them values.
Yep, use the iterator directly instead of putting it in a list.
for c in itertools.product(*all_lists):
    print(c)  # do something with c
Looking at the doc for product you see it's just making an iterator.
EDIT
If you want to reuse the combinations later, you are better off just enumerating them again (so you don't have to store them).
combination_list = lambda: itertools.product(*all_lists)
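For the edit about dict keys: product yields tuples, which are hashable and can be used directly as keys, and the zero-argument callable gives you a fresh iterator on each call, so nothing is ever materialized as one giant list. A sketch (the scoring function here is just an arbitrary placeholder):

```python
import itertools

all_lists = [['a', 'b', 'c']] * 3

# Calling combinations() yields a brand-new iterator each time.
combinations = lambda: itertools.product(*all_lists)

# First pass: compute a value for each combination, keyed by the tuple.
scores = {combo: len(set(combo)) for combo in combinations()}

# Second pass: re-enumerate the same combinations and look values up.
for combo in combinations():
    assert combo in scores

assert len(scores) == 3 ** 3          # 27 combinations of 3 lists of 3
assert scores[('a', 'b', 'c')] == 3   # three distinct elements
```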

How do I get Mathematica to thread a 2-variable function over two lists, using functional programming techniques?

Let's say I have a function f[x_, y_] and two lists l1, l2. I'd like to evaluate f[x, y] for each pair x, y with x in l1 and y in l2, and I'd like to do it without having to build all pairs of the form {l1[[i]], l2[[j]]}.
Essentially, what I want is something like Map[Map[f[#1, #2]&, l1],l2] where #1 takes values from l1 and #2 takes values from l2, but this doesn't work.
(Motivation: I'm trying to implement some basic Haskell programs in Mathematica. In particular, I'd like to be able to code the Haskell program
isMatroid :: [[Int]] -> Bool
isMatroid b = and [or [sort (union (xs \\ [x]) [y]) `elem` b | y <- ys] | xs <- b, ys <- b, x <- xs]
I think I can do the rest of it, if I can figure out the original question, but I'd like the code to be Haskell-like. Any suggestions for implementing Haskell-like code in Mathematica would be appreciated.)
To evaluate a function f over all pairs from two lists l1 and l2, use Outer:
In[1]:= Outer[f, {a,b}, {x,y,z}]
Out[1]= {{f[a,x],f[a,y],f[a,z]}, {f[b,x],f[b,y],f[b,z]}}
Outer by default works at the lowest level of the provided lists; you can also specify a level with an additional argument:
In[2]:= Outer[f, {{1, 2}, {3, 4}}, {{a, b}, {c, d}}, 1]
Out[2]= {{f[{1,2},{a,b}], f[{1,2},{c,d}]}, {f[{3,4},{a,b}], f[{3,4},{c,d}]}}
Note that this produces a nested list; you can Flatten it if you like.
My original answer pointed to Thread and MapThread, which are two ways to apply a function to corresponding pairs from lists, e.g. MapThread[f, {{a, b}, {1, 2}}] == {f[a, 1], f[b, 2]}.
P.S. I think as you're learning these things, you'll find the documentation very helpful. There are a lot of general topic pages, for example, applying functions to lists and list manipulation. These are generally linked to in the "more about" section at the bottom of specific documentation. This makes it a lot easier to find things when you don't know what they'll be called.
To pick up on OP's request for suggestions about implementing Haskell-like code in Mathematica. A couple of things you'll have to deal with are:
Haskell evaluates lazily; by default Mathematica does not - it's very eager. You'll need to wrestle with Hold[] and its relatives to write lazily evaluating functions, but it can be done. You can also subvert Mathematica's evaluation process and tinker with Prolog, Epilog, and the like.
Haskell's type system and type checking are probably more rigorous than Mathematica's defaults, but Mathematica does have the features to implement strict type checking.
I'm sure there's a lot more but I'm not terribly familiar with Haskell.
In[1]:= list1 = Range[1, 5];
In[2]:= list2 = Range[6, 10];
In[3]:= (f @@ #) & /@ Transpose[{list1, list2}]
Out[3]= {f[1, 6], f[2, 7], f[3, 8], f[4, 9], f[5, 10]}
