Here is my code:
my_array_1 = np.arange(25).reshape(5, 5)
print(my_array_1)
my_array_red = my_array_1[:, 1::2]
print(my_array_red)
my_array_blue = my_array_1[1::2, 0:3:2]
print(my_array_blue)
my_array_yellow = my_array_1[-1, :]
print(my_array_yellow)
print(id(my_array_1))
print(id(my_array_red))
print(id(my_array_yellow))
print(id(my_array_blue))
print(my_array_1.data)
print(my_array_red.data)
print(my_array_blue.data)
print(my_array_yellow.data)
Here is the output:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
[[ 1 3]
[ 6 8]
[11 13]
[16 18]
[21 23]]
[[ 5 7]
[15 17]]
[20 21 22 23 24]
2606769150592
2606769282544
2607017647120
2606769282624
<memory at 0x0000025EFE56CA68>
<memory at 0x0000025EFE56CA68>
<memory at 0x0000025EFE56CA68>
<memory at 0x0000025EFE5A8F48>
Question:
Just look at the last 4 lines of my output. Why do my_array_1.data, my_array_red.data, and my_array_blue.data show the same value, whereas my_array_yellow.data shows a different one?
I find the data value of the __array_interface__ to be more informative:
In [2]: my_array_1.__array_interface__['data']
Out[2]: (33691856, False)
In [3]: my_array_red.__array_interface__['data']
Out[3]: (33691864, False)
In [4]: my_array_blue.__array_interface__['data']
Out[4]: (33691896, False)
In [5]: my_array_yellow.__array_interface__['data']
Out[5]: (33692016, False)
Out[2] is the start of the data buffer.
red is 8 bytes larger - that is one element from the start.
blue is 40 bytes in - the next row
In [8]: my_array_1.strides
Out[8]: (40, 8)
yellow is 160 bytes in - that's the start of the last row (40 from the end)
In [9]: 2016-1856
Out[9]: 160
In [10]: my_array_1.nbytes
Out[10]: 200
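To verify those offsets programmatically, here is a small sketch (it recreates the arrays from the question and assumes the default integer dtype is 8 bytes per element, consistent with the output above):
import numpy as np
my_array_1 = np.arange(25).reshape(5, 5)
my_array_red = my_array_1[:, 1::2]
my_array_blue = my_array_1[1::2, 0:3:2]
my_array_yellow = my_array_1[-1, :]
# Byte offset of each view relative to the start of the parent buffer.
base = my_array_1.__array_interface__['data'][0]
for name, view in [('red', my_array_red),
                   ('blue', my_array_blue),
                   ('yellow', my_array_yellow)]:
    offset = view.__array_interface__['data'][0] - base
    print(name, offset)   # expected: red 8, blue 40, yellow 160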
The .data addresses all differ and are in the same ballpark, but they are harder to interpret.
In [11]: my_array_1.data
Out[11]: <memory at 0x7fa975369a68>
In [12]: my_array_red.data
Out[12]: <memory at 0x7fa975369b40>
In [13]: my_array_blue.data
Out[13]: <memory at 0x7fa975369c18>
In [14]: my_array_yellow.data
Out[14]: <memory at 0x7fa9710f11c8>
The data attribute can be used in an ndarray constructor:
Two elements from yellow:
In [17]: np.ndarray(2,dtype=my_array_1.dtype,buffer=my_array_yellow.data)
Out[17]: array([20, 21])
Same 2 elements, but with the original address, and an offset (as deduced above):
In [18]: np.ndarray(2,dtype=my_array_1.dtype,buffer=my_array_1.data, offset=160)
Out[18]: array([20, 21])
Actually the data display doesn't tell us anything about where the data buffer is located. It's the address of the memoryview object that references the buffer, not the address of the buffer itself. Call data again, and get a different memoryview object:
In [19]: my_array_1.data
Out[19]: <memory at 0x7fa975369cf0>
If I print these memoryview objects, I get the same pattern as you do:
In [22]: print(my_array_1.data)
<memory at 0x7fa970e37120>
In [23]: print(my_array_red.data)
<memory at 0x7fa970e37120>
In [24]: print(my_array_blue.data)
<memory at 0x7fa970e37120>
In [25]: print(my_array_yellow.data)
<memory at 0x7fa9710f17c8>
For In [23] and In [24], it's just reusing a memory slot, because with print there's no persistence. I'm not sure why yellow doesn't reuse it, except maybe the object is sufficiently different that it doesn't fit in the same slot. In the Out[11] etc. cases, IPython's output cache hangs onto those objects, and thus there's no reuse.
It just reinforces the idea that there's nothing significant about the printed display of these memoryview objects. It has nothing to do with the location of the data buffer. It's more like an id: just an arbitrary place in memory.
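If the goal is simply to test whether two arrays share a buffer, np.shares_memory is the reliable check; a minimal sketch:
import numpy as np
a = np.arange(25).reshape(5, 5)
yellow = a[-1, :]                # a view into a
yellow_copy = a[-1, :].copy()    # has its own buffer
print(np.shares_memory(a, yellow))       # True
print(np.shares_memory(a, yellow_copy))  # False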
I am indexing and slicing my data with Pandas in Python 3 to calculate spatial statistics.
When I run a for loop over the range of latitude and longitude using .loc, it raises KeyError: (slice(None, None, None), ) for any combination of latitude and longitude for which no values are available in the input file. Instead of skipping those combinations, it raises the error and stops the code. Following is my code.
import numpy as np
import pandas as pd
from scipy import stats

filename = 'input.txt'
df = pd.read_csv(filename, delim_whitespace=True, header=None,
                 names=['year', 'month', 'lat', 'lon', 'aod'],
                 index_col=['year', 'month', 'lat', 'lon'])
idx = pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:, i, lat0, lon0], :]
            if len(tmp) <= 0:
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and produces the following output, which I use for further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code for tmp = df.loc[idx[:,1,0.0,32.75],:], where no values are available in the input file for that latitude and longitude, it does not skip them; instead it gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that gave a too many indexers error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month', 'lat', 'lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
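As a concrete sketch applied to the dataframe from the question (file name and column names are taken from the question; the statistics computed inside the loop are only placeholders):
import pandas as pd
df = pd.read_csv('input.txt', delim_whitespace=True, header=None,
                 names=['year', 'month', 'lat', 'lon', 'aod'],
                 index_col=['year', 'month', 'lat', 'lon'])
# Group on the index levels; (month, lat, lon) combinations that do not occur
# in the file never show up, so there is nothing to skip.
for (month, lat0, lon0), tmp in df.groupby(level=['month', 'lat', 'lon']):
    years = tmp.index.get_level_values('year').tolist()
    mean_aod = tmp['aod'].mean()
    print(month, lat0, lon0, len(years), mean_aod)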
This approach has three benefits.
Speed. A groupby is faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching again for the second group, and so on.
Simplicity. The nested loops and the explicit emptiness check disappear.
Robustness. If the dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then groupby simply never creates that group, so there is nothing to skip and no KeyError.
More information: User guide on grouping
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
     a  c
b
1.0  2  3
2.0  2  5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
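For instance, taking the same small dataframe, the built-in mean aggregation is a one-liner (a quick sketch; the exact float formatting may vary between pandas versions):
>>> df.groupby(by=["b"]).mean()
       a    c
b
1.0  2.0  3.0
2.0  1.0  2.5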
I have been learning about the getsizeof() function, and cannot understand why:
import sys
A=[(1,2,3,4)]
B=[()]
print(sys.getsizeof(A))
print(sys.getsizeof(B))
both print 64. This is the size in bytes, but why isn't it changing?
sys.getsizeof() reports how much memory the object itself takes up.
For example, an empty string (like "") occupies 49 bytes:
import sys
en = "a" * 27  # hypothetical 27-character string, chosen so the second result below matches
print(sys.getsizeof(''))
print(sys.getsizeof(en))
# result
# 49
# 76
Yes! That's because in both cases you are passing a list that contains a single tuple. If you want to see the sizes differ, pass the values of A as a flat list and keep B as a list containing an empty tuple, like this:
import sys
A = [1, 2, 3, 4]
B = [()]
print(sys.getsizeof(A))  # 88
print(sys.getsizeof(B))  # 64
Here are some more examples:
import sys
a = [1, 2]
b = [1, 2, 3, 4]
c = [1, 2, 3, 4]
d = [2, 3, 1, 4, 66, 54, 45, 89]
print(sys.getsizeof(a))  # 72 on a typical 64-bit CPython
print(sys.getsizeof(b))  # 88
print(sys.getsizeof(c))  # 88
print(sys.getsizeof(d))  # 120
64 is just the size of the list object itself (its header plus one pointer slot per element); sys.getsizeof() does not include the sizes of the objects the list refers to.
import sys
A = (1, 2, 3, 4)
A_LIST = [(1, 2, 3, 4)]
B = [()]
B_2 = [(), ()]
print(sys.getsizeof(A))       # 72
print(sys.getsizeof(A_LIST))  # 64
print(sys.getsizeof(B))       # 64
print(sys.getsizeof(B_2))     # 72
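A quick way to see that the container size excludes its contents is to add the element sizes yourself; a minimal sketch (the exact numbers vary by platform and Python version):
import sys
A = [(1, 2, 3, 4)]
B = [()]
# The list objects themselves are the same size: a header plus one pointer slot.
print(sys.getsizeof(A), sys.getsizeof(B))   # e.g. 64 64
# Adding the sizes of the referenced objects reveals the difference.
deep_A = sys.getsizeof(A) + sum(sys.getsizeof(x) for x in A)
deep_B = sys.getsizeof(B) + sum(sys.getsizeof(x) for x in B)
print(deep_A, deep_B)   # e.g. 136 104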
Here you can find out more about sizes
I've been working on my data in Python. The data is imported as a numpy array, and when I use numpy.diff on it, it returns a wrong set of values.
import numpy as np
mydata = np.array([1285, 1328, 1277, 1293, 200, 1284, 1266, 1273, 1252, 1233,
                   1208, 1166, 1200, 1173, 1179])
print(np.diff(mydata))
And it shows:
[ 43 65485 16 64443 1084 65518 7 65515 65517 65511 65494 34
65509 6]
which is absolutely wrong!
Can anyone help me deal with this problem?
The dtype of your array is likely uint16. Indeed:
>>> my_data =np.array([25,14], dtype=np.uint16)
>>> np.diff(my_data)
array([65525], dtype=uint16)
This happens because unsigned integers cannot represent negative numbers, so the result wraps around: 14 - 25 is -11, which wraps to 65536 - 11 = 65525.
You can change the type of your array, for example to int32:
>>> np.diff(my_data.astype(np.int32))
array([-11], dtype=int32)
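Applied to the data from the question, a short sketch (assuming the original array really is uint16, as deduced above):
import numpy as np
mydata = np.array([1285, 1328, 1277, 1293, 200, 1284, 1266, 1273, 1252, 1233,
                   1208, 1166, 1200, 1173, 1179], dtype=np.uint16)
# Cast to a signed type first so negative differences are representable.
print(np.diff(mydata.astype(np.int32)))
# -> 43, -51, 16, -1093, 1084, -18, 7, -21, -19, -25, -42, 34, -27, 6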
As I understand it, a copy made by slicing copies the upper levels of a structure, but not the lower ones (I'm not sure exactly when this applies).
However, in this case I make a copy by slicing and, when I edit two columns of the copy, one column of the original is altered, but the other is not.
How is this possible? Why one column, and not both or neither?
Here is the code:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)
# Copy data
processed_data = data[:]
print(data[:10])
# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0
# gpa column has changed
print(data[:10])
On the other hand, if I change processed_data = data[:] to processed_data = data.copy() it works fine.
Here is the original data after editing: the gpa column has changed, but gre has not.
As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones.
This is valid for Python lists. Slicing creates shallow copies.
In [44]: lst = [[1, 2], 3, 4]
In [45]: lst2 = lst[:]
In [46]: lst2[1] = 100
In [47]: lst # unchanged
Out[47]: [[1, 2], 3, 4]
In [48]: lst2[0].append(3)
In [49]: lst # changed
Out[49]: [[1, 2, 3], 3, 4]
However, this is not the case for numpy/pandas. numpy, for the most part, returns a view when you slice an array.
In [50]: arr = np.array([1, 2, 3])
In [51]: arr2 = arr[:]
In [52]: arr2[0] = 100
In [53]: arr
Out[53]: array([100, 2, 3])
If you have a DataFrame with a single dtype, the behaviour you see is the same:
In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
In [63]: df
Out[63]:
0 1 2
0 1 2 3
1 4 5 6
In [64]: df2 = df[:]
In [65]: df2.iloc[0, 0] = 100
In [66]: df
Out[66]:
0 1 2
0 100 2 3
1 4 5 6
But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard
to predict whether it will return a view or a copy (it depends on the
memory layout of the array, about which pandas makes no guarantees),
and therefore whether the __setitem__ will modify dfmi or a temporary
object that gets thrown out immediately afterward. That’s what
SettingWithCopy is warning you about!
In your case, my guess is that this is the result of how different dtypes are handled in pandas. Each dtype has its own block, and in the case of the gpa column the block is the column itself. This is not the case for gre -- you have other integer columns. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns in data to two, changing gre in processed_data no longer affects the original data.
In a nutshell, the behavior is the result of an implementation detail which you shouldn't rely on. If you want to copy DataFrames, explicitly use .copy(); and if you want to modify parts of a DataFrame, don't assign those parts to other variables -- modify them directly with .loc or .iloc.
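A minimal sketch of the recommended pattern (the column names are borrowed from the question's dataset; the values are made up for illustration):
import pandas as pd
data = pd.DataFrame({'admit': [0, 1], 'gre': [380.0, 660.0], 'gpa': [3.61, 3.67]})
# Explicit copy: edits to processed_data can never leak back into data.
processed_data = data.copy()
processed_data['gre'] = processed_data['gre'] / 800.0
processed_data['gpa'] = processed_data['gpa'] / 4.0
# To modify part of the original instead, do it directly with .loc:
data.loc[:, 'gre'] = data['gre'] / 800.0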
Having an instance of the beta object, how do I get back the parameters a and b?
There are properties a and b, but it seems they mean something other than what I expected:
>>> import scipy
>>> scipy.__version__
'0.19.1'
>>> from scipy import stats
>>> my_beta = stats.beta(a=1, b=5)
>>> my_beta.a, my_beta.b
(0.0, 1.0)
Is there a way to get the parameters of the distribution? I could always fit a huge rvs sample but that seems silly :)
When you create a "frozen" distribution with a call such as my_beta = stats.beta(a=1, b=5), the positional and keyword arguments are saved as the attributes args and kwds, respectively, on the returned object. So in your case, you can access those values in the dictionary my_beta.kwds:
In [10]: from scipy import stats
In [11]: my_beta = stats.beta(a=1, b=5)
In [12]: my_beta.kwds
Out[12]: {'a': 1, 'b': 5}
The attributes my_beta.a and my_beta.b are, as you guessed, something different. They define the end points of the support of the probability distribution:
In [13]: my_beta.a
Out[13]: 0.0
In [14]: my_beta.b
Out[14]: 1.0
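For completeness, if the distribution is frozen with positional arguments instead of keywords, the shape parameters end up in args rather than kwds; a short sketch:
In [15]: my_beta_pos = stats.beta(1, 5)
In [16]: my_beta_pos.args
Out[16]: (1, 5)
In [17]: my_beta_pos.kwds
Out[17]: {}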