Related
I've noticed that many operations on lists that modify the list's contents will return None, rather than returning the list itself. Examples:
>>> mylist = ['a', 'b', 'c']
>>> empty = mylist.clear()
>>> restored = mylist.extend(range(3))
>>> backwards = mylist.reverse()
>>> with_four = mylist.append(4)
>>> in_order = mylist.sort()
>>> without_one = mylist.remove(1)
>>> mylist
[0, 2, 4]
>>> [empty, restored, backwards, with_four, in_order, without_one]
[None, None, None, None, None, None]
What is the thought process behind this decision?
To me, it seems hampering, since it prevents "chaining" of list processing (e.g. mylist.reverse().append('a string')[:someLimit]). I imagine it might be that "The Powers That Be" decided that list comprehension is a better paradigm (a valid opinion), and so didn't want to encourage other methods - but it seems perverse to prevent an intuitive method, even if better alternatives exist.
This question is specifically about Python's design decision to return None from mutating list methods like .append. Novices often write incorrect code that expects .append (in particular) to return the same list that was just modified.
For the simple question of "how do I append to a list?" (or debugging questions that boil down to that problem), see Why does "x = x.append([i])" not work in a for loop?.
To get modified versions of the list, see:
For .sort: How can I get a sorted copy of a list?
For .reverse: How can I get a reversed copy of a list (avoid a separate statement when chaining a method after .reverse)?
The same issue applies to some methods of other built-in data types, e.g. set.discard (see How to remove specific element from sets inside a list using list comprehension) and dict.update (see Why doesn't a python dict.update() return the object?).
The same reasoning applies to designing your own APIs. See Is making in-place operations return the object a bad idea?.
The general design principle in Python is for functions that mutate an object in-place to return None. I'm not sure it would have been the design choice I'd have chosen, but it's basically to emphasise that a new object is not returned.
Guido van Rossum (our Python BDFL) states the design choice on the Python-Dev mailing list:
I'd like to explain once more why I'm so adamant that sort() shouldn't
return 'self'.
This comes from a coding style (popular in various other languages, I
believe especially Lisp revels in it) where a series of side effects
on a single object can be chained like this:
x.compress().chop(y).sort(z)
which would be the same as
x.compress()
x.chop(y)
x.sort(z)
I find the chaining form a threat to readability; it requires that the
reader must be intimately familiar with each of the methods. The
second form makes it clear that each of these calls acts on the same
object, and so even if you don't know the class and its methods very
well, you can understand that the second and third call are applied to
x (and that all calls are made for their side-effects), and not to
something else.
I'd like to reserve chaining for operations that return new values,
like string processing operations:
y = x.rstrip("\n").split(":").lower()
There are a few standard library modules that encourage chaining of
side-effect calls (pstat comes to mind). There shouldn't be any new
ones; pstat slipped through my filter when it was weak.
I can't speak for the developers, but I find this behavior very intuitive.
If a method works on the original object and modifies it in-place, it doesn't return anything, because there is no new information - you obviously already have a reference to the (now mutated) object, so why return it again?
If, however, a method or function creates a new object, then of course it has to return it.
So l.reverse() returns nothing (because now the list has been reversed, but the identifier l still points to that list), but reversed(l) has to return the newly generated list because l still points to the old, unmodified list.
EDIT: I just learned from another answer that this principle is called Command-Query separation.
One could argue that the signature itself makes it clear that the function mutates the list rather than returning a new one: if the function returned a list, its behavior would have been much less obvious.
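A quick illustration of the command/query split:
>>> l = [3, 1, 2]
>>> print(l.reverse())  # command: mutates l in place, returns None
None
>>> l
[2, 1, 3]
>>> sorted(l)           # query: returns a brand-new list, l is untouched
[1, 2, 3]
>>> list(reversed(l))   # likewise, reversed() leaves l alone
[3, 1, 2]
>>> l
[2, 1, 3]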
If you were sent here after asking for help fixing your code:
In the future, please try to look for problems in the code yourself, by carefully studying what happens when the code runs. Rather than giving up because there is an error message, check the result of each calculation, and see where the code starts working differently from what you expect.
If you had code calling a method like .append or .sort on a list, you will notice that the return value is None, while the list is modified in place. Study the example carefully:
>>> x = ['e', 'x', 'a', 'm', 'p', 'l', 'e']
>>> y = x.sort()
>>> print(y)
None
>>> print(x)
['a', 'e', 'e', 'l', 'm', 'p', 'x']
y got the special None value, because that is what was returned. x changed, because the sort happened in place.
It works this way on purpose, so that code like x.sort().reverse() breaks. See the other answers to understand why the Python developers wanted it that way.
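For example, the chained call fails with an AttributeError, because the first call returns None:
>>> x = [3, 1, 2]
>>> x.sort().reverse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'reverse'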
To fix the problem
First, think carefully about the intent of the code. Should x change? Do we actually need a separate y?
Let's consider .sort first. If x should change, then call x.sort() by itself, without assigning the result anywhere.
If a sorted copy is needed instead, use y = sorted(x) (note that lists have no .sorted() method; sorted is a built-in function). See How can I get a sorted copy of a list? for details.
For other methods, we can get modified copies like so:
.clear -> there is no point to this; a "cleared copy" of the list is just an empty list. Just use y = [].
.append and .extend -> probably the simplest way is to use the + operator. To add multiple elements from a list l, use y = x + l rather than .extend. To add a single element e, wrap it in a list first: y = x + [e]. Another way in 3.5 and up is to use unpacking: y = [*x, *l] for .extend, y = [*x, e] for .append. See also How to allow list append() method to return the new list for .append and How do I concatenate two lists in Python? for .extend.
.reverse -> First, consider whether an actual copy is needed. The built-in reversed gives you an iterator that can be used to loop over the elements in reverse order. To make an actual copy, simply pass that iterator to list: y = list(reversed(x)). See How can I get a reversed copy of a list (avoid a separate statement when chaining a method after .reverse)? for details.
.remove -> Figure out the index of the element that will be removed (using .index), then use slicing to find the elements before and after that point and put them together. As a function:
def without(a_list, value):
    index = a_list.index(value)
    return a_list[:index] + a_list[index+1:]
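For example, note that (just like .remove) this drops only the first occurrence:
>>> without(['a', 'b', 'c', 'b'], 'b')
['a', 'c', 'b']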
(We can translate .pop similarly to make a modified copy, though of course .pop actually returns an element from the list.)
See also A quick way to return list without a specific element in Python.
(If you plan to remove multiple elements, strongly consider using a list comprehension (or filter) instead. It will be much simpler than any of the workarounds needed for removing items from the list while iterating over it. This way also naturally gives a modified copy.)
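For instance, a one-line sketch of removing every occurrence of a value (here value is whatever element you want gone):
y = [item for item in x if item != value]  # modified copy without any occurrence of value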
For any of the above, of course, we can also make a modified copy by explicitly making a copy and then using the in-place method on the copy. The most elegant approach will depend on the context and on personal taste.
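A minimal copy-then-mutate sketch, using .sort as the example:
>>> x = [3, 1, 2]
>>> y = x.copy()  # or x[:], or list(x)
>>> y.sort()
>>> y
[1, 2, 3]
>>> x
[3, 1, 2]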
As we know, a list in Python is a mutable object, and one of the characteristics of a mutable object is the ability to modify its state without assigning the new state to a new variable. We should explore this topic a little more to understand the root of the issue.
An object whose internal state can be changed is mutable. An immutable object, on the other hand, does not allow any change once it has been created. The distinction between mutable and immutable types is fundamental to how Python manages objects in memory.
Every object in Python has three properties:
Identity – the address of the object in the computer's memory.
Type – the kind of object that was created, e.g. integer, list, string.
Value – the value stored by the object, e.g. the string "a".
While the identity and type cannot be changed once the object is created, the value can be changed for mutable objects.
Let us walk through the code below step by step to see what this means in Python.
Creating a list containing the names of cities:
cities = ['London', 'New York', 'Chicago']
Printing the memory address of the object in hexadecimal format:
print(hex(id(cities)))
Output [1]: 0x1691d7de8c8
Adding a new city to the list cities:
cities.append('Delhi')
Printing the elements of the list cities, separated by commas:
for city in cities:
    print(city, end=', ')
Output [2]: London, New York, Chicago, Delhi
Printing the memory address of the object again:
print(hex(id(cities)))
Output [3]: 0x1691d7de8c8
The above example shows us that we were able to change the internal state of the object cities by adding one more city 'Delhi' to it, yet, the memory address of the object did not change. This confirms that we did not create a new object, rather, the same object was changed or mutated. Hence, we can say that the object which is a type of list with reference variable name cities is a MUTABLE OBJECT.
An immutable object's internal state, by contrast, cannot be changed. For instance, consider the code below and the error raised when trying to change the value of a tuple at index 0.
Creating a tuple with the variable name foo:
foo = (1, 2)
Attempting to change the value at index 0 from 1 to 3:
foo[0] = 3
TypeError: 'tuple' object does not support item assignment
From these examples we can see why operations on a mutable object shouldn't return anything: they modify the object's internal state directly, so there is no point in returning a new modified object. An immutable object, by contrast, must return a new object holding the modified state.
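A quick contrast with immutable types: operations on a string or tuple leave the original untouched and hand back a new object:
>>> s = "abc"
>>> s.upper()  # returns a new string
'ABC'
>>> s          # the original is unchanged
'abc'
>>> t = (1, 2)
>>> t + (3,)   # + builds a new tuple
(1, 2, 3)
>>> t
(1, 2)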
First of all, I should say that what I am suggesting is, without a doubt, bad programming practice, but if you want to use append in a lambda function and you don't care about code readability, there is a way to do just that.
Imagine you have a list of lists and you want to append an element to each inner list using map and lambda. Here is how you can do that:
my_list = [[1, 2, 3, 4],
[3, 2, 1],
[1, 1, 1]]
my_new_element = 10
new_list = list(map(lambda x: [x.append(my_new_element), x][1], my_list))
print(new_list)
How it works:
When the lambda computes its output, it first has to evaluate the expression [x.append(my_new_element), x]. Evaluating this expression runs the append call, so the expression becomes [None, x]; indexing it with [1] then yields x, which now contains the new element.
Using a custom function is more readable and is the better option:
def append_my_list(input_list, new_element):
    input_list.append(new_element)
    return input_list
my_list = [[1, 2, 3, 4],
[3, 2, 1],
[1, 1, 1]]
my_new_element = 10
new_list = list(map(lambda x: append_my_list(x, my_new_element), my_list))
print(new_list)
I am trying to find the mean of a DataFrame's values from a particular column when either of two conditions is true. For example:
Using Statistics
df = DataFrame(value, xi, xj)
resulted_mean = []
for i in range(ncol(df))
push!(resulted_mean, mean(df[:value], (:xi == i | :xj == i)))
Here, I am checking whether either xi or xj is equal to i; if so, I take the mean of all the corresponding values stored in the [:value] column. This mean is later pushed to the array resulted_mean.
However, this code is not producing the desired output.
Please suggest the optimal approach to fix this code snippet.
Thanks in advance.
I agree with Bogumił's comment, you should really consult the Julia documentation to get a basic understanding of the language, and then run through the DataFrames tutorials. I will however annotate your code to point out some of the issues so you might be able to target your learning a bit better:
Using Statistics
Julia (like most other languages) is case sensitive, so writing Using is not the same as the reserved keyword using, which is used to bring package definitions into your namespace. The relevant docs entry is here.
Note also that you are using the DataFrames package, so to make your code reproducible you would have had to do using DataFrames, Statistics.
df = DataFrame(value, xi, xj)
It's unclear what this line is supposed to do as the arguments passed to the constructor are undefined, but assuming value, xi and xj are vectors of numbers, this isn't a correct way to construct a DataFrame:
julia> value = rand(10); xi = repeat(1:2, 5); xj = rand(1:2, 10);
julia> df = DataFrame(value, xi, xj)
ERROR: MethodError: no method matching DataFrame(::Vector{Float64}, ::Vector{Int64}, ::Vector{Int64})
You can read about constructors in the docs here, the most common approach for a DataFrame with only few columns like here would probably be:
julia> df = DataFrame(value = value, xi = xi, xj = xj)
10×3 DataFrame
Row │ value xi xj
│ Float64 Int64 Int64
─────┼────────────────────────
1 │ 0.539533 1 2
2 │ 0.652752 2 1
3 │ 0.481461 1 2
...
Then you have
resulted_mean = []
I would say in this case the overall approach of preallocating a vector and pushing to it in a loop isn't ideal as it adds a lot of verbosity for no reason (see below), but as a general remark you should avoid untyped arrays in Julia:
julia> resulted_mean = []
Any[]
Here the Any means that the array can hold values of any type (floating point numbers, integers, strings, probability distributions...), which means the compiler cannot anticipate what the actual content will be from looking at the code, leading to suboptimal machine code being generated. In doing so, you negate the main advantage that Julia has over e.g. base Python: the rich type system combined with a lot of compiler optimizations allow generation of highly efficient machine code while keeping the language dynamic. In this case, you know that you want to push the results of the mean function to the results vector, which will be a floating point number, so you should use:
julia> resulted_mean = Float64[]
Float64[]
That said, I wouldn't recommend pushing in a loop here at all (see below).
Your loop is:
for i in range(ncol(df))
...
A few issues with this:
Loops in Julia require an end, unlike in Python where their end is determined based on code indentation
range is a different function in Julia than in Python:
julia> range(5)
ERROR: ArgumentError: At least one of `length` or `stop` must be specified
You can learn about functions using the REPL help mode (type ? at the REPL prompt to access it):
help?> range
search: range LinRange UnitRange StepRange StepRangeLen trailing_zeros AbstractRange trailing_ones OrdinalRange AbstractUnitRange AbstractString
range(start[, stop]; length, stop, step=1)
Given a starting value, construct a range either by length or from start to stop, optionally with a given step (defaults to 1, a UnitRange). One of length or stop is required. If length, stop, and step are all specified, they must
agree.
...
So you'd need to do something like
julia> range(1, 5, step = 1)
1:1:5
That said, for simple ranges like this you can use the colon operator: 1:5 is the same as range(1, 5, step = 1).
You then iterate over integers from 1 to ncol(df) - you might want to check whether this is what you're actually after, as it seems unusual to me that the values in the xi and xj columns (on which you filter in the loop) would be related to the number of columns in your DataFrame (which is 3).
In the loop, you do
push!(resulted_mean, mean(df[:value], (:xi == i | :xj == i)))
which again has a few problems: first of all you are passing the subsetting condition for your DataFrame to the mean function, which doesn't work:
julia> mean(rand(10), rand(Bool, 10))
ERROR: MethodError: objects of type Vector{Float64} are not callable
The subsetting condition itself has two issues as well: when you write :xi, there is no way for Julia to know that you are referring to the DataFrame column xi, so all you're doing is comparing the Symbol :xi to the value of i, which will always return false:
julia> :xi == 2
false
Furthermore, note that | has a higher precedence than ==, so if you want to combine two equality checks with or you need brackets:
julia> 1 == 1 | 2 == 2
false
julia> (1 == 1) | (2 == 2)
true
More things could be said about your code snippet, but I hope this gives you an idea of where your gaps in understanding are and how you might go about closing them.
For completeness, here's how I would approach your problem - I'm interpreting your code to mean "calculate the mean of the value column, grouped by each value of xi and xj, but only where xi equals xj":
julia> combine(groupby(df[df.xi .== df.xj, :], [:xi, :xj], sort = true), :value => mean => :resulted_mean)
2×3 DataFrame
Row │ xi xj resulted_mean
│ Int64 Int64 Float64
─────┼─────────────────────────────
1 │ 1 1 0.356811
2 │ 2 2 0.977041
This is probably the most common analysis pattern for DataFrames, and is explained in the tutorial that Bogumił mentioned as well as in the DataFrames docs here.
As I said up front, if you want to use Julia productively, I recommend that you spend some time reading the documentation both for the language itself as well as for any of the key packages you're using. While Julia has some similarities to Python, and some bits in the DataFrames package have an API that resemble things you might have seen in R, it is a language in its own right that is fundamentally different from both Python and R (or any other language for that matter), and there's no way around familiarizing yourself with how it actually works.
The best way to build efficiently is to understand the toolkit one is building with. However, while trying to understand the core functions of Python, it occurred to me that the map function gives similar, if not the same, results as a generic generator expression.
Take the next bit of code as a simplified example.
These two objects, mapped and generated, behave astonishingly similarly in whatever situation you throw at them.
def concatenate(string1 = "", string2 = ""):
    return " ".join([string1, string2])
foo = ["One", "Two"]
bar = ["Blue", "Green"]
mapped = map(concatenate, foo, bar)
generated = (concatenate(string1 = a, string2 = b) for a, b in zip(foo, bar))
Okay, I know the generator expression is a longer line of code, but I find it hard to believe that's all there is to map's reason for existence - hence my quest to understand Python.
What does map still do in python? Is it really just a relic of olden times, and if not, where can I best put this tool to use?
The reason both exist is that a list comprehension returns a new list, whereas map (in Python 3) returns a lazy iterator, so map is more memory efficient if you don't need the resulting list right away.
This can be considered a technicality, though, since often when you use a list comprehension to do something map can do, you overwrite the original variable:
a = [1, 2, 3]
# The following line creates a new list but
# since we assign that list to `a` we give the old list to
# the garbage collector
a = [x**2 for x in a]
# or
a = list(map(lambda x: x**2, a))
# both of which are basically the same.
The power of map can come into play if you aren't working with the same variable:
a = [1, 2, 3] # If we want to save this then we don't want to overwrite it
b = [x**2 for x in a] # A full new list is now in b
c = map(lambda x: x**2, a) # c is just a lazy map object (an iterator).
print(a) # [1, 2, 3]
print(b) # [1, 4, 9]
print(c) # <map object ...>
for x in c:
    print(x)
    # We never create a full list from c; we just use each value as we go, saving memory.
In the previous example, b is a whole new list in memory, whereas c is just an iterator object. With such a small a, b and c are probably close in memory use, but if a were large, c would take significantly less memory than b.
Most common use cases of map involve the first case, where map has no real benefit, but in the second case it is much more beneficial to use map.
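One rough way to see the difference is sys.getsizeof. Note it measures only the container object itself, not the elements it references, but the contrast is still telling:
import sys

a = list(range(1_000_000))
b = [x**2 for x in a]        # a second full list in memory
c = map(lambda x: x**2, a)   # a tiny iterator; values are computed on demand

print(sys.getsizeof(b))  # on CPython, on the order of 8 MB of pointers
print(sys.getsizeof(c))  # a few dozen bytes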
For the record, "one way to do something" from the Zen is sorta vague. I could implement a merge sort the right way (i.e. with ideal speed and space), but my code may look completely different from someone else's merge sort. "One way" doesn't really mean only one way to code it; it implies there is only one methodology or process to be used.
This might be a simple question. However, I wanted to get some clarifications of how the following code works.
import numpy as np
a = np.arange(1, 8)
a
array([1, 2, 3, 4, 5, 6, 7])
Example expression: a[0:-1] + a[1:] / 2.0
In this expression, I want to draw your attention to the plus sign between the arrays a[0:-1] and a[1:]. How does that work? What does that look like?
For instance, is the plus sign (addition) adding the first index of each array (e.g. 1+2), or adding everything together (e.g. 1+2+2+3+3+4+4+5+5+6+6+7)?
Then, I assume /2.0 is just dividing it by 2...
A NumPy array uses vector algebra: you can only add two arrays if they have matching shapes, because the addition happens element by element.
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 1, 1])
a + b  # will throw an error (shapes (5,) and (3,) don't match)
whilst
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 1, 1, 1, 1])
a + b  # is ok: array([2, 3, 4, 5, 6])
The division is also element by element.
Now to your question about the indexing
a = [1, 2, 3, 4, 5]
a[0:-1]  # [1, 2, 3, 4]
a[1:]    # [2, 3, 4, 5]
More generally, a[index_start:index_end] is inclusive at index_start but exclusive at index_end, and a[start_index:] includes everything from start_index up to and including the last element.
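Putting it together for the expression in the question (note that without parentheses, / binds tighter than +, so a[0:-1] + a[1:] / 2.0 adds half of a[1:] to a[0:-1]; the parenthesized version below computes the pairwise midpoints you probably intend):
import numpy as np

a = np.arange(1, 8)             # array([1, 2, 3, 4, 5, 6, 7])
print(a[0:-1] + a[1:])          # array([ 3,  5,  7,  9, 11, 13]) - pairwise sums
print((a[0:-1] + a[1:]) / 2.0)  # array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5]) - midpoints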
My final tip is just to try and play around with the structures - there is no harm in trying different things; the computer will not explode from a wrong value here or there. Unless you are trying to make it do so, of course.
If arrays have identical shapes, they can be added:
new_array = first_array.__add__(second_array)
This simple operation adds each value from first_array to the corresponding value in second_array and puts the result into new_array.
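In everyday code you would write this with the + operator, which calls __add__ under the hood:
import numpy as np

first_array = np.array([1, 2, 3])
second_array = np.array([10, 20, 30])
new_array = first_array + second_array  # equivalent to first_array.__add__(second_array)
print(new_array)  # [11 22 33]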
I want to chain multiple iterables, everything with lazy evaluation (speed is crucial), to do the following:
read many integers from a single huge line of stdin
split() that line
convert the resulting strings to int
compute the diff between successive ints
... and some further things not shown here
The real example is more complex, here's a simplified example:
Here's a sample line of stdin:
2 13 4 16 16 15 22 17 8 8 7 6
(For debugging purposes, instream below might point to sys.stdin, or an opened filehandle)
You can't simply nest the maps, since the inner map(str.split, instream) yields one list of strings per line rather than a flat stream of tokens:
import itertools
gen1 = map(int, (map(str.split, instream))) # CAN'T CHAIN DIRECTLY
The least complicated working solution I found is this, can it surely not be simplified?
gen1 = map(int, itertools.chain.from_iterable(itertools.chain(map(str.split, instream))))
Why the hell do I need to chain itertools.chain.from_iterable(itertools.chain just to process the result from map(str.split, instream) - it sort of defeats the purpose?
Is manually defining my generators faster?
An explicit ("manual") generator expression should be preferred over using map and filter. It is more readable to most people, and more flexible.
If I understand your question, this generator expression does what you need:
gen1 = ( int(x) for line in instream for x in line.split() )
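For instance, with the sample line from the question, and adding the diff step with itertools.pairwise (available in Python 3.10+; used here as a stand-in for whatever "further things" follow in the real pipeline):
import io
from itertools import pairwise  # Python 3.10+

instream = io.StringIO("2 13 4 16 16 15 22 17 8 8 7 6")
gen1 = (int(x) for line in instream for x in line.split())
diffs = (b - a for a, b in pairwise(gen1))  # successive differences, still lazy
print(list(diffs))  # [11, -9, 12, 0, -1, 7, -5, -9, 0, -1, -1]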
You could build your generator by hand:
import string

def gen1(stream):
    # presuming that stream is of type io.TextIOBase
    s = ""
    c = stream.read(1)
    while len(c) > 0:
        if c not in string.digits:
            # a non-digit ends the current token; emit it if non-empty
            if len(s) > 0:
                i = int(s)
                yield i
                s = ""
        else:
            s += c
        c = stream.read(1)
    # flush a trailing token at EOF
    if len(s) > 0:
        i = int(s)
        yield i
import io

g = gen1(io.StringIO("12 45 6 7 88"))
for x in g:  # dangerous if stream is unlimited
    print(x)
Which is certainly not the most beautiful code, but it does what you want.
Explanations:
If your input is indefinitely long, you have to read it in chunks (or character by character).
Whenever you encounter a non-digit (whitespace), you convert the characters you have read until that point into an integer and yield it.
You also have to consider what happens when you reach the EOF.
My implementation is probably not very performant, because I'm reading character by character. Using chunks, one could speed it up significantly.
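A rough sketch of the chunked variant (assuming, as above, that the stream contains only digits and whitespace; chunk_size is an arbitrary choice):
def gen_chunked(stream, chunk_size=8192):
    tail = ""  # holds a token that may be cut off at a chunk boundary
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split()
        if parts and not chunk[-1].isspace():
            # the chunk may have ended mid-token; keep the last piece for later
            tail = parts.pop()
        else:
            tail = ""
        yield from map(int, parts)
    if tail:
        yield int(tail)

import io
print(list(gen_chunked(io.StringIO("12 45 6 7 88"), chunk_size=4)))  # [12, 45, 6, 7, 88]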
EDIT as to why your approach will never work:
map(str.split, instream)
simply does not do what you appear to think it does. map applies the given function str.split to each element of the iterable given as the second parameter. In your case that is a stream, i.e. a file object - in the case of sys.stdin, specifically an io.TextIOBase object. That can indeed be iterated over, but line by line, which emphatically is NOT what you want! In effect you iterate over your input line by line and split each line into words. The map generator therefore iterates over (many) lists of words, NOT over A single list of words - which is why you have to chain them together to get a single stream to iterate over.
Also, the itertools.chain() in itertools.chain.from_iterable(itertools.chain(map(...))) is redundant. itertools.chain chains its arguments (each an iterable object) together into one iterator. You only give it one argument, so there is nothing to chain together; it basically returns the map object unchanged.
itertools.chain.from_iterable() on the other hand takes one argument, which is expected to be an iterator of iterators (e.g. a list of lists) and flattens it into one iterator (list).
EDIT2
import io, itertools
instream = io.StringIO("12 45 \n 66 7 88")
gen1 = itertools.chain.from_iterable(map(str.split, instream))
gen2 = map(int, gen1)
list(gen2)
returns
[12, 45, 66, 7, 88]