Difference between aliasing, deep copy, and shallow copy in NumPy - python-3.x

import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = arr1         # aliasing
arr3 = arr1.view()  # shallow copy
arr4 = arr1.copy()  # deep copy
id(arr1)  # 120638624
id(arr2)  # 120638624
id(arr3)  # 120639004
id(arr4)  # 123894390
I know about shallow and deep copies from C and C++, but what is actually happening in Python?
Look at the C++ code below. Is the same thing happening?
int main()
{
    int arr[] = {1, 2, 3};
    int (&a)[3] = arr;  // aliasing
    int* b = arr;       // shallow copy
    int c[3];           // deep copy
    int i;
    for (i = 0; i < 3; i++)
        c[i] = arr[i];
}

You have aliasing and deep copy right (though copying array values in a for-loop is not usually considered a good way to do it).
On the other hand, a NumPy view is not a pointer. It's a much heavier-duty thing, and a proper object instance in its own right. Conceptually, it's the closest thing to an actual pointer-to-array that exists in Python (though the semantics are of course different), and it can fulfill some of the same roles in your code. A view will never be as performant as a raw pointer, since the view needs to carry around a set of data, such as shape and strides, that may be different from that of its "parent" array.
On the other-other hand, both Numpy arrays and views wrap the __array_interface__, which in turn wraps a pointer to the underlying buffer that holds the actual data. So when you make a new view of an array, you do end up making a proper shallow copy of the underlying data, since you make a copy of the pointer to that data (albeit through several layers of wrapping and indirection).
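The distinction can be checked directly from Python. The following is a minimal sketch, assuming only standard NumPy calls (np.shares_memory and the .base attribute); the variable names on the left of each check are made up for illustration.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = arr1          # alias: a second name for the same array object
arr3 = arr1.view()   # view: a new array object over the same data buffer
arr4 = arr1.copy()   # copy: a new array object with its own data buffer

# Object identity: only the alias is the very same Python object.
alias_is_same_object = arr2 is arr1
view_is_same_object = arr3 is arr1

# Buffer sharing: the view shares the data, the copy does not.
view_shares_memory = np.shares_memory(arr1, arr3)
copy_shares_memory = np.shares_memory(arr1, arr4)

# A view remembers the array whose buffer it borrows.
view_base_is_arr1 = arr3.base is arr1

# Writing through the original is visible via the alias and the view only.
arr1[0] = 99
print(arr2[0], arr3[0], arr4[0])
```

After the write, the alias and the view both report the new value, while the copy keeps the old one. That is why id() differs for the view even though no element data was duplicated.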

Related

Extra [ ] Operator Added When Using 2 overloaded [ ] Operators

As a side project I'm writing a couple of classes to do matrix operations and operations on linear systems. The class LinearSystem holds pointers to Matrix class objects in a std::map. The Matrix class itself holds the 2D array ("matrix") in a pointer to pointer to float (float **). I wrote two overloaded [ ] operators, one to return a pointer to a matrix object directly from the LinearSystem object, and another to return a row (float *) from a matrix object.
Both of these operators work perfectly on their own. I'm able to use LinearSystem["keyString"] to get a matrix object pointer from the map. And I'm able to use Matrix[row] to get a row (float *) and Matrix[row][col] to get a specific float element from a matrix objects' 2d array.
The trouble comes when I put them together. My limited understanding (rising senior CompSci major) tells me that I should have no problem using LinearSystem["keyString"][row][col] to get an element from a specific array within the Linear system object. The return types should look like LinearSystem->Matrix->float *->float. But for some reason it only works when I place an extra [0] after the overloaded key operator so the call looks like this: LinearSystem["keyString"][0][row][col]. And it HAS to be 0, anything else and I segfault.
Another interesting thing to note is that CLion sees ["keyString"] as overloaded and [row] as overloaded, but not [0], as if it's calling the standard index operator. What it is calling that operator on is the question that has me puzzled. LinearSystem["keyString"] is for sure returning a Matrix *, and Matrix only has an overloaded [ ] operator. See the attached screenshot.
screenshot
Here's the code, let me know if more is needed.
LinearSystem [ ] and map declaration:
Matrix *myNameSpace::LinearSystem::operator[](const std::string &name) {
    return matrices[name];
}
std::map<std::string, Matrix *> matrices;
Matrix [ ] and array declaration:
inline float *myNameSpace::Matrix::operator[](const int row) {
    return elements[row];
}
float **elements;
Note, the above function is inline'd because I'm challenging myself to make the code as fast as possible and even with compiler optimizations, the overloaded [ ] was 15% to 30% slower than using Matrix.elements[row].
Please let me know if any more info is needed. This is my first post, so I'm sure it's not perfect.
Thank you!
You're writing C in C++. You need to not do that. It adds complexity. Those stars you keep putting after your types. Those are raw pointers, and they should almost always be avoided. linearSystem["foo"] is a Matrix*, i.e. a pointer to Matrix. It is not a Matrix. And pointers index as arrays, so linearSystem["foo"][0] gets the first element of the Matrix*, treating it as an array. Since there's actually only one element, this works out and seems to do what you want. If that sounds confusing, that's because it is.
Make sure you understand ownership semantics. Raw pointers are frowned upon because they don't convey any information. If you want a value to be owned by the data structure, it should be a T directly (or a std::unique_ptr<T> if you need virtual inheritance), not a T*. If you're taking a function argument or returning a value that you don't want to pass ownership of, you should take/return a const T&, or T& if you need to modify the referent.
I'm not sure exactly what your data looks like, but a reasonable representation of a Matrix class (if you're going for speed) is as an instance variable of type std::vector<double>. std::vector is a managed array of double. You can preallocate the array to whatever size you want and change it whenever you want. It also plays nice with Rule of Three/Five.
I'm not sure what LinearSystem is meant to represent, but I'm going to assume the matrices are meant to be owned by the linear system (i.e. when the system is freed, the matrices should have their memory freed as well), so I'm looking at something like
std::map<std::string, Matrix> matrices;
Again, you can wrap Matrix in a std::unique_ptr if you plan to inherit from it and do dynamic dispatch. Though I might question your design choices if your Matrix class is intended to be subclassed.
There's no reason for Matrix::operator[] to return a raw pointer to float (in fact, "raw pointer to float" is a pretty pointless type to begin with). I might suggest having two overloads.
float myNameSpace::Matrix::operator[](int row) const;
float& myNameSpace::Matrix::operator[](int row);
Likewise, LinearSystem::operator[] can have two overloads: one constant and one mutable.
const Matrix& myNameSpace::LinearSystem::operator[](const std::string& name) const;
Matrix& myNameSpace::LinearSystem::operator[](const std::string& name);
References (T& as opposed to T*) are smart and will effectively dereference when needed, so you can call Matrix::operator[] on a Matrix&, whereas you can't call that on a Matrix* without acknowledging the layer of indirection.
There's a lot of bad C++ advice out there. If the book / video / teacher you're learning from is telling you to allocate float** everywhere, then it's a bad book / video / teacher. C++-managed data is going to be far less error-prone and will perform comparably to raw pointers (the C++ compiler is smarter than either you or I when it comes to optimization, so let it do its thing).
If you do find yourself really feeling the need to go low-level and allocate raw pointers everywhere, then switch to C. C is a language designed for low-level pointer manipulation. C++ is a higher-level managed-memory language; it just so happens that, for historical reasons, C++ has several hundred footguns in the form of C-style allocations placed sporadically throughout the standard.
In summary: Modern C++ almost never uses T*, new, or delete. Start getting comfortable with smart pointers (std::unique_ptr and std::shared_ptr) and references. You'll thank yourself later.

Which is more efficient: a reference to an ArrayBase or an ArrayView?

I'm working on improving a Rust codebase that uses the ndarray crate to manipulate arrays. I have one question for which I could not find an explicit answer in the documentation.
Is it more efficient to pass an instance of ArrayView as an argument to a function, or should I use a reference to an Array instead? My intuition is that since ArrayView is a view of an array, passing it only hands the function a view and does not grant it ownership of the underlying data (and hence does not copy that data).
In short, is there any speed gain to expect from switching from passing instances of ArrayView to passing references of Array?
My goal is to avoid useless memory allocation/duplication which can be very costly when dealing with large arrays.
ArrayBase is a generic struct that can act as both an ArrayView and an Array, so I assume you mean a reference to the owned data, i.e. an Array.
Neither version will clone the array, so they should be approximately equally efficient. You can always benchmark to verify this.
As I see it, the difference is mostly that ArrayView will make the function more flexible – you can pass in parts of larger arrays, or an ArrayView created from a slice, whereas the variant that takes a reference to Array can only be called when you really have an Array of the desired size.

Fastest way to serialize a NumPy ndarray?

I have a constant flow of medium-sized ndarrays (each around 10-15 MB in memory) on which I call ndarray.tobytes() before sending them to the next part of the pipeline.
Currently it takes about 70-100ms per array serialization.
I was wondering, is this the fastest that this could be done or is there a faster (maybe not as pretty) way to accomplish that?
Clarification: the arrays are images, the next step in the pipeline is a C++ function, and I don't want to save them as files.
There is no need to serialize them at all! You can let C++ read the memory directly. One way is to invoke a C++ function with the PyObject which is your NumPy array. Another is to let C++ allocate the NumPy array in the first place and populate the elements in Python before returning control to C++, for which I have some open source code built atop Boost Python that you can use: https://github.com/jzwinck/pccl/blob/master/NumPyArray.hpp
Your goal should be "zero copy" meaning you never copy the bytes of the array, you only copy references to the array or data within it plus the dimensions.
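To make the difference concrete, here is a small sketch contrasting tobytes(), which copies, with ways of exposing the same memory without copying. The image shape is a made-up stand-in. How the address or buffer is actually handed to the C++ side depends on the binding layer, which is not shown here.

```python
import numpy as np

# Hypothetical stand-in for one image in the pipeline.
image = np.zeros((480, 640, 3), dtype=np.uint8)

# tobytes() builds a new bytes object, so every byte is copied.
serialized = image.tobytes()

# The buffer protocol exposes the existing memory without copying it.
buffer_view = memoryview(image)
alias = np.asarray(buffer_view)  # wraps the same memory, no copy

# Raw address of the first element, the kind of value a C++ binding
# would ultimately receive together with shape and strides.
address = image.ctypes.data

# A write to the original shows up in the alias but not in the copy.
image[0, 0, 0] = 255
copy_saw_write = serialized[0] == 255
alias_saw_write = alias[0, 0, 0] == 255
alias_shares_memory = np.shares_memory(image, alias)
```

Only the tobytes() path scales with the size of the array. The other objects are small wrappers around the same data, which is what "zero copy" means in the answer above.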

How to map a structure from a buffer like in C with a pointer and cast

In C, I can define many structures and structure of structures.
From a buffer, I can just set the pointer at the beginning of this structure to say this buffer represents this structure.
Of course, I do not want to copy anything, just map it; otherwise I lose the benefit of the speed.
Is this possible in Node.js? How can I do it? How can I be sure it's a mapping, and not a new object with the information copied into it?
Example:
struct House = {
    uint8 door,
    uint16BE kitchen,
    etc...
}
var mybuff = Buffer.alloc(10, 0)
var MyHouse = new House(mybuff) // same as `House* MyHouse = (House*) mybuff`
console.log(MyHouse.door) // will display the value of door
console.log(MyHouse.kitchen) // will display the value of kitchen, read as big-endian
This code is wrong, but it explains what I am looking for: all of this without copying anything.
And if I do MyHouse.door=56, mybuff now contains the 56. I consider mybuff as a pointer.
Edit after question update below
Unlike C/C++, javascript uses pointers by default, so you don't have to do anything. It's the other way around, actually: you have to put in some effort if you want a copy of the current object.
In C, a struct is nothing more than a compile-time reference to different parts of data in the struct. So:
struct X {
    int foo;
    int bar;
};
is nothing more than saying: if you want bar from a variable with type X, just add the length of foo (length of int) to the base pointer.
In Javascript, we do not even have such a type. We can just say:
var x = {
    foo: 1,
    bar: 2
}
The lookup of bar will automatically be a pointer (we call them references in javascript) lookup. Because javascript does not have types, you can view an object as a map/dictionary with pointers to mixed types.
If you, for any reason, want to create a copy of a datastructure, you would have to iterate through the entire datastructure (recursively) and create a copy of the datastructure manually. The basic types are not pointer based. These include number (Javascript automatically differentiates between int and float under the hood), string and boolean.
Edit after question update
Although I am not an expert on this area, I do not think it is possible. The problem is, the underlying data representation (as in how the data is represented as bytes in memory) is different, because javascript does not have compile-time information about data structures. As I said before, javascript doesn't have classes/structs, just objects with fields, which basically behave (and may be implemented as) maps/dictionaries.
There are, however, some third party libraries to cope with these problems. There are two general approaches:
1. Unpack everything to javascript objects. The data will be copied, but you can work with it as normal javascript objects. You should use this if you read/write the data intensively, because the performance increase you get when working with normal javascript objects outweighs the advantage of not having to unpack the data. Link to example library
2. Leave all data in the buffer. When you need some of the data, compute the location of the data in the buffer at runtime, and read/write at this location accordingly. Because the struct data location computations are done at runtime, you should use this only when you have loads of data and only a few reads/writes to it. In this case the cost of unpacking all the data outweighs the cost of the few runtime computations that have to be done. Link to example library
As a side-note, if the amount of data you have to process isn't that much, I'd recommend to just unpack the data. It saves you the headache of having to use the library as interface to your data. Computers are fast enough nowadays to copy/process some amount of data in memory. Also, these third party libraries are just some examples. I recommend you do a little more research for libraries to decide which one suits your needs.
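The "leave all data in the buffer" approach can be sketched as follows. This is a Python analogue using the standard struct module, not Node.js code and not one of the libraries mentioned above. The class name HouseView and the field offsets are assumptions based on the layout in the question (a uint8 door followed by a big-endian uint16 kitchen).

```python
import struct


class HouseView:
    """Hypothetical accessor that reads and writes fields in place in a buffer."""

    _DOOR = struct.Struct(">B")     # uint8, assumed at offset 0
    _KITCHEN = struct.Struct(">H")  # big-endian uint16, assumed at offset 1

    def __init__(self, buffer):
        # The buffer is stored by reference; nothing is copied here.
        self._buffer = buffer

    @property
    def door(self):
        return self._DOOR.unpack_from(self._buffer, 0)[0]

    @door.setter
    def door(self, value):
        self._DOOR.pack_into(self._buffer, 0, value)

    @property
    def kitchen(self):
        return self._KITCHEN.unpack_from(self._buffer, 1)[0]

    @kitchen.setter
    def kitchen(self, value):
        self._KITCHEN.pack_into(self._buffer, 1, value)


mybuff = bytearray(10)
my_house = HouseView(mybuff)

my_house.door = 56         # the write lands directly in mybuff
my_house.kitchen = 0x0102  # stored as bytes 0x01, 0x02 (big-endian)

mybuff[0] = 7              # a write to the buffer is visible through the view
```

Every field access recomputes the location and decodes the bytes. That is the runtime cost the answer refers to, and it is why this approach suits large buffers with few accesses.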

Linux kernel function _copy_to_user: want a clear understanding of it

I am using this function to copy some structures from the kernel to user space.
But, the problem is that I have to copy three data structures which are part of a bigger data structure. NOTE: the 3 data structures are contiguous in the bigger data structure.
So, in my copy_to_user call I pass the pointer to the 1st data structure and give the combined length of all 3 data structures. But when I go to user space and print the 1st element of the 2nd data structure, it shows some other value.
So, what am I doing wrong?
As a workaround I made 3 separate copy_to_user calls, and to my surprise that works fine. The problem only appears when I make a single copy_to_user call.
Please let me know what the reason could be.
Hey guys, thanks for the answers. It was an alignment issue. Going further, if I want to pad an internal structure, how do I do it?
Example-
structure d{
    struct b;
    struct c; //I want to make this structure a padded one, how to go about it?
    struct d;
}
As mentioned in the comments, this really seems to be an alignment problem. GCC will probably add some padding between the structures a, b and c in struct d. Depending on how you instantiated the one in userland, it could be a problem. You can force GCC to not generate padding by using __attribute__((packed)) on your structure, but unless this structure maps to hardware registers, it's usually a bad idea, as it will lead to worse performance when accessing fields of that structure.
Another possible problem would be if your kernel is 64-bit and your userland program is 32-bit. In this case you need to use fixed-size types to be sure of having the same layout on both sides.
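The effect of padding on layout can be inspected from user space without writing any kernel code. The sketch below uses Python's ctypes, which follows the platform's native C alignment rules. The structure and field names are made up for illustration, and the sizes in the comments are what is typical on platforms where a 32-bit integer is 4-byte aligned.

```python
import ctypes


class B(ctypes.Structure):
    # Hypothetical inner structure holding a single byte.
    _fields_ = [("flag", ctypes.c_uint8)]


class C(ctypes.Structure):
    # Hypothetical inner structure holding a fixed-size 32-bit field.
    _fields_ = [("value", ctypes.c_uint32)]


class Padded(ctypes.Structure):
    # Native layout: the compiler-style rules insert padding before c.
    _fields_ = [("b", B), ("c", C)]


class Packed(ctypes.Structure):
    # Same members with 1-byte packing, similar in spirit to
    # __attribute__((packed)) in GCC.
    _pack_ = 1
    _fields_ = [("b", B), ("c", C)]


sum_of_members = ctypes.sizeof(B) + ctypes.sizeof(C)  # typically 5
padded_size = ctypes.sizeof(Padded)                   # typically 8
padded_offset_of_c = Padded.c.offset                  # typically 4
packed_size = ctypes.sizeof(Packed)                   # 5
packed_offset_of_c = Packed.c.offset                  # 1

print(sum_of_members, padded_size, padded_offset_of_c)
print(packed_size, packed_offset_of_c)
```

If the user-space code assumes one of these layouts and the kernel uses the other, a single bulk copy puts the second structure's first field at the wrong offset. That matches the symptom described in the question.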

Resources