copy in numpy (Python 3) - python-3.x

I've just learnt about copies, shallow copies, and deep copies in Python. I created a list b, then made c equal to b, and found, as expected, that the same element shares the identical id. I then expected a similar result in numpy when following nearly the same steps; however, the same element shows different ids, and I can't figure out how that happens in numpy.

You don't even need a second reference to reproduce the result:
import numpy as np

a = np.array([[10, 10], [2, 3], [4, 5]])
for x, y in zip(a, a):
    print(id(x), ',', id(y))
# 52949424 , 52949464
# 52949624 , 52951424
# 52949464 , 52949424
My guess is that when zip iterates over the arrays, it triggers indexing in numpy, which returns a fresh view object for each row, each with its own id. Remember that indexing with [] in numpy does not behave like it does for lists.
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
You may try this to see why relying on id is not a good idea for numpy:
a[0] is a[0] # False
a[0] is a[[0]] # False
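To confirm that these are distinct view objects over the same memory, rather than independent copies, here is a small sketch using np.shares_memory and the .base attribute (standard numpy, but not part of the original answer):
import numpy as np

a = np.array([[10, 10], [2, 3], [4, 5]])
row1 = a[0]
row2 = a[0]
print(row1 is row2)                  # False: each indexing call creates a new view object
print(np.shares_memory(row1, row2))  # True: both views share the parent array's memory
print(row1.base is a)                # True: row1 is a view onto a
row1[0] = 99
print(a[0])                          # [99 10]: mutating the view changes a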

Related

Why is taking a slice of a list which is assigned to another list not changing the original?

I have a class that is a representation of a mathematical tensor. The tensor in the class is stored as a single flat list, not as lists nested inside another list. That means [[1, 2, 3], [4, 5, 6]] would be stored as [1, 2, 3, 4, 5, 6].
I've made a __setitem__() function and a function to handle taking slices of this tensor while it's in single-list format. For example, slice(1, None, None) would become slice(3, None, None) for the list mentioned above. However, when I assign this slice a new value, the original tensor isn't updated.
Here is what the simplified code looks like:
class Tensor:
    def __init__(self, tensor):
        self.tensor = tensor  # Here I would flatten it, but for now imagine it's already flattened.

    def __setitem__(self, slices, value):
        slices = [slices]
        temp_tensor = self.tensor  # any changes to temp_tensor should also change self.tensor
        for s in slices:  # Here I would call self.slices_to_index(), but this is to keep the code simple.
            temp_tensor = temp_tensor[s]
        temp_tensor = value  # In my mind, this should have also changed self.tensor, but it hasn't.
Maybe I'm just being stupid and can't see why this isn't working. Maybe my actual question isn't just 'why doesn't this work?' but also 'is there a better way to do this?'. Thanks for any help you can give me.
NOTES:
Each 'dimension' of the list must have the same shape, so [[1, 2, 3], [4, 5]] isn't allowed.
This code is massively simplified, as there are many other helper functions and things like that.
In __init__() I would flatten the list, but as I just said, to keep things simple I left that out, along with self.slices_to_index().
You should not think of Python variables as you would in C++ or Java. Think of them as labels you place on values. Check this example:
>>> l = []
>>> l.append
<built-in method append of list object at 0x7fbb0d40cf88>
>>> l.append(10)
>>> l
[10]
>>> ll = l
>>> ll.append(10)
>>> l
[10, 10]
>>> ll
[10, 10]
>>> ll = ["foo"]
>>> l
[10, 10]
As you can see, the ll variable first points to the same list as l, but later we make it point to another list. Rebinding ll this way won't modify the original list that l points to.
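A quick sketch (not part of the original example) that makes the rebinding visible with id():
l = []
ll = l
print(id(l) == id(ll))  # True: two labels on the same list
ll = ["foo"]
print(id(l) == id(ll))  # False: ll now labels a different list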
So, in your case if you want self.tensor to point to a new value, just do it:
class Tensor:
    def __init__(self, tensor):
        self.tensor = tensor  # Here I would flatten it, but for now imagine it's already flattened.

    def __setitem__(self, slices, value):
        slices = [slices]
        temp_tensor = self.tensor  # any changes to the list pointed to by temp_tensor will be reflected in self.tensor, since it is the same list
        for s in slices:
            temp_tensor = temp_tensor[s]
        self.tensor = value
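If the intent is instead to update only the elements selected by the slice, while keeping the same underlying list, slice assignment mutates the list in place. A minimal sketch, assuming a single slice and an already-flattened list:
class Tensor:
    def __init__(self, tensor):
        self.tensor = tensor  # assume already flattened

    def __setitem__(self, s, value):
        # Slice assignment mutates the existing list instead of rebinding a name.
        self.tensor[s] = value

t = Tensor([1, 2, 3, 4, 5, 6])
t[3:] = [40, 50, 60]
print(t.tensor)  # [1, 2, 3, 40, 50, 60]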

Why is a copy of a pandas object altering one column on the original object? (Slice copy)

As I understand it, a copy made by slicing copies the upper levels of a structure, but not the lower ones (I'm not sure exactly when).
However, in this case I make a copy by slicing and, when editing two columns of the copy, one column of the original is altered, but the other is not.
How is this possible? Why one column, and not both or neither?
Here is the code:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)
# Copy data
processed_data = data[:]
print(data[:10])
# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0
# gpa column has changed
print(data[:10])
On the other hand, if I change processed_data = data[:] to processed_data = data.copy() it works fine.
[Screenshot: the original data after editing]
As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones.
This is valid for Python lists. Slicing creates shallow copies.
In [44]: lst = [[1, 2], 3, 4]
In [45]: lst2 = lst[:]
In [46]: lst2[1] = 100
In [47]: lst # unchanged
Out[47]: [[1, 2], 3, 4]
In [48]: lst2[0].append(3)
In [49]: lst # changed
Out[49]: [[1, 2, 3], 3, 4]
However, this is not the case for numpy/pandas. numpy, for the most part, returns a view when you slice an array.
In [50]: arr = np.array([1, 2, 3])
In [51]: arr2 = arr[:]
In [52]: arr2[0] = 100
In [53]: arr
Out[53]: array([100, 2, 3])
If you have a DataFrame with a single dtype, the behaviour you see is the same:
In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
In [63]: df
Out[63]:
   0  1  2
0  1  2  3
1  4  5  6
In [64]: df2 = df[:]
In [65]: df2.iloc[0, 0] = 100
In [66]: df
Out[66]:
     0  1  2
0  100  2  3
1    4  5  6
But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning. Quoting the pandas docs:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard
to predict whether it will return a view or a copy (it depends on the
memory layout of the array, about which pandas makes no guarantees),
and therefore whether the __setitem__ will modify dfmi or a temporary
object that gets thrown out immediately afterward. That’s what
SettingWithCopy is warning you about!
In your case, my guess is that this is the result of how different dtypes are handled in pandas. Each dtype is stored in its own block, and in the case of the gpa column (the only float column) the block is the column itself. This is not the case for gre -- there are other integer columns sharing its block. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns in data to 2, changing gpa in processed_data no longer affects the original data.
In a nutshell, the behavior is the result of an implementation detail that you shouldn't rely on. If you want to copy DataFrames, use .copy() explicitly, and if you want to modify parts of a DataFrame, don't assign those parts to other variables; modify them directly with .loc or .iloc.
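A minimal sketch of that recommended pattern, using a small made-up frame in place of the CSV (the column names mirror the question, but the values are invented):
import pandas as pd

data = pd.DataFrame({'gre': [800, 640], 'gpa': [4.0, 3.2], 'admit': [1, 0]})

# Explicit copy: edits to processed_data can never leak back into data.
processed_data = data.copy()
processed_data['gre'] = processed_data['gre'] / 800.0
processed_data['gpa'] = processed_data['gpa'] / 4.0

# To modify part of the original, index into it directly with .loc.
data.loc[data['admit'] == 1, 'gre'] = 0
print(data)            # only the .loc edit is visible here
print(processed_data)  # scaled columns, independent of data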

How to randomly select from list of lists

I have a list of lists in python as follows:
a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]]
For example, how would I select 3 of these lists at random?
I tried numpy's random.choice but it does not work for lists.
Any suggestion?
numpy's random.choice doesn't work on a 2-d array, so one alternative is to use the length of the array to get random indices into the 2-d array and then fetch the elements at those indices. See the example below.
import numpy as np

random_count = 3  # number of random elements to pick
a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]]  # 2-d data list
alist = np.array(a)  # convert the 2-d data list to a numpy array
# fetch random indices based on the array's length;
# pass replace=False if the same row must not be picked twice
random_numbers = np.random.choice(len(alist), random_count)
for item in random_numbers:  # iterate over the random indices
    print(alist[item])  # print each randomly chosen row by index
You can use the random library like this (note that random.choices samples with replacement, so the same list may be picked more than once):
a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]]
import random
random.choices(a, k=3)
>>> [[1, 5, 6], [2, 3, 4], [7, 6, 5]]
You can read more about the random library on the official page: https://docs.python.org/3/library/random.html.
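If the three picks must be distinct, random.sample is the standard-library alternative to random.choices; a short sketch contrasting the two:
import random

a = [[1,1,2], [2,3,4], [5,5,5], [7,6,5], [1,5,6]]
with_replacement = random.choices(a, k=3)  # may repeat the same sublist
without_replacement = random.sample(a, 3)  # three distinct sublists
print(with_replacement)
print(without_replacement)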

How to filter this type of data?

If I have some numpy arrays like
a = np.array([1,2,3,4,5])
b = np.array([4,5,7,8])
c = np.array([4,5])
I need to combine these arrays without repeating a number. My expected output is [1,2,3,4,5,7,8].
How do I combine them? Which function should I use?
One more approach you can try is reduce from functools combined with union1d from numpy.
For example:
import numpy as np
from functools import reduce

reduce(np.union1d, (a, b, c))
Output:
array([1, 2, 3, 4, 5, 7, 8])
You can use numpy.concatenate with numpy.unique:
d = np.unique(np.concatenate((a,b,c)))
print(d)
Output:
[1 2 3 4 5 7 8]
Python has a datatype called set:
A set is an unordered collection with no duplicate elements.
The easiest way to create a set out of your arrays is to unpack them into a set literal:
>>> import numpy as np
>>> a=np.array([1,2,3,4,5])
>>> b=np.array([4,5,7,8])
>>> c=np.array([4,5])
>>> {*a, *b, *c}
{1, 2, 3, 4, 5, 7, 8}
Please note that a set is unordered. This is not the right answer for you if the order of the elements in your array is important.
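If order does matter, one option (a small sketch, not from the original answer) is to sort the merged set; .tolist() is used so the result holds plain Python ints:
import numpy as np

a = np.array([1,2,3,4,5])
b = np.array([4,5,7,8])
c = np.array([4,5])
merged = sorted({*a.tolist(), *b.tolist(), *c.tolist()})
print(merged)  # [1, 2, 3, 4, 5, 7, 8]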

How can I fill an array with sets

Say I have an image of 2x2 pixels named image_array, where each pixel color is identified by a tuple of 3 entries (RGB), so the shape of image_array is 2x2x3.
I want to create an np.array c which has the shape 2x2 and whose entries are each an empty set.
I tried this:
import numpy as np
image = (((1,2,3), (1,0,0)), ((1,1,1), (2,1,2)))
image_array = np.array(image)
c = np.empty(image_array.shape[:2], dtype=set)
c.fill(set())
c[0][1].add(124)
print(c)
I get:
[[{124} {124}]
[{124} {124}]]
And instead I would like the return:
[[{} {124}]
[{} {}]]
Any idea?
The object array has to be filled with separate set() objects. That means creating them individually, as I do here with a list comprehension:
In [279]: arr = np.array([set() for _ in range(4)]).reshape(2,2)
In [280]: arr
Out[280]:
array([[set(), set()],
       [set(), set()]], dtype=object)
That construction should highlight the fact that this array is closely related to a list, or a list of lists.
Now we can do a set operation on one of those elements:
In [281]: arr[0,1].add(124) # more idiomatic than arr[0][1]
In [282]: arr
Out[282]:
array([[set(), {124}],
       [set(), set()]], dtype=object)
Note that we cannot operate on more than one set at a time. The object array offers few advantages compared to a list.
This is a 2d array; the sets don't form a dimension. Contrast that with
In [283]: image = (((1,2,3), (1,0,0)), ((1,1,1), (2,1,2)))
...: image_array = np.array(image)
...:
In [284]: image_array
Out[284]:
array([[[1, 2, 3],
        [1, 0, 0]],

       [[1, 1, 1],
        [2, 1, 2]]])
While it started with tuples, it made a 3d array of integers.
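For completeness, here is a sketch of filling an object array with distinct sets without building the intermediate list, using np.ndindex to visit every cell:
import numpy as np

c = np.empty((2, 2), dtype=object)
for idx in np.ndindex(c.shape):
    c[idx] = set()  # a distinct set per cell
c[0, 1].add(124)
print(c)
# [[set() {124}]
#  [set() set()]]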
Try this:
import numpy as np

x = np.empty((2, 2), dtype=object)
x[0, 0] = {1, 2, 3}
print(x)
[[{1, 2, 3} None]
 [None None]]
For non-number types in numpy you should use dtype=object (np.object is deprecated).
Whenever you do fill(set()), it fills the array with exactly the same set, since every cell refers to that one set. To fix this, just make a new set if there isn't one every time you need to add to it:
c = np.empty(image_array.shape[:2], dtype=set)
if not c[0][1]:
    c[0, 1] = set([124])
else:
    c[0, 1].add(124)
print(c)
# [[None {124}]
#  [None None]]
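A quick check (a sketch) confirming that fill() reuses a single set object:
import numpy as np

c = np.empty((2, 2), dtype=object)
c.fill(set())
print(c[0, 0] is c[1, 1])  # True: every cell holds the very same set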
Try changing your line c[0][1].add(124) to this:
c[0][1] = 124
print(c)
