Python 3.x: which is more efficient: a list of lists or a dict?

Imagine you are doing a BFS over a grid (e.g. finding the shortest distance between two cells). Two data structures can be used to store the visited info:
1) List of lists, i.e. data = [[False for _ in range(cols)] for _ in range(rows)]. Later we can access the data in a certain cell by data[r][c].
2) Dict, i.e. data = dict(). Later we can access the data in a certain cell by data[(r, c)].
My question is: which is computationally more efficient in such BFS scenario?
Coding-wise, the dict approach saves a few characters/lines. Memory-wise, the dict can save space for untouched cells, but can also waste space due to the hash table's overhead.
EDIT
@Peteris mentioned numpy arrays. The advantage over a list of lists is obvious: numpy arrays operate on contiguous blocks of memory, which allows faster addressing and more cache hits. However, I'm not sure how they compare to hash tables (i.e. dict). If the algorithm touches a relatively small number of elements, a hash table might provide more cache hits given its potentially smaller memory footprint.
Also, the truth is that numpy arrays are unavailable to me. So I really need to compare list of lists against dict.
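A quick way to settle this is to measure it; here's a minimal timeit sketch (grid size and access pattern are arbitrary choices on my part):
import timeit

rows, cols = 1000, 1000

setup_list = "data = [[False] * {c} for _ in range({r})]".format(r=rows, c=cols)
setup_dict = "data = {}"

# One write plus one read of a single cell, as the BFS inner loop would do
stmt_list = "data[500][500] = True; x = data[500][500]"
stmt_dict = "data[(500, 500)] = True; x = data.get((500, 500), False)"

print("list of lists:", timeit.timeit(stmt_list, setup=setup_list, number=1_000_000))
print("dict:         ", timeit.timeit(stmt_dict, setup=setup_dict, number=1_000_000))
On CPython, two list indexings typically beat building and hashing a tuple key, but the gap is small enough that it's worth measuring on the real workload.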

A 2D array
The efficient way to store 2D data is a 2D array/matrix allocated in a contiguous area of memory (unlike a list of lists). This avoids the chain of memory lookups a list of lists requires, and the hash computation a dict needs on every access.
The standard way to do this in Python is with the numpy library; here's a simple example:
import numpy as np
data = np.zeros( (100, 200) ) # create a 100x200 array and initialize it with zeroes
data[10,20] = 1 # set element at coordinates 10,20 to 1

Related

View on portion of tensor

I have a multi-dimensional tensor; let's take this simple one as an example:
out = torch.Tensor(3, 4, 5)
I have to get a portion/subpart of this tensor out[:,0,:] and then apply the method view(-1), but it's not possible:
out[:,0,:].view(-1)
RuntimeError: invalid argument 2: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Call .contiguous() before .view(). at ../aten/src/TH/generic/THTensor.cpp:203
A solution is to clone the subpart:
out[:,0,:].clone().view(-1)
Is there a better/faster solution than cloning?
What you did will work fine. That said, a more portable approach is to use reshape, which returns a view when possible but creates a contiguous copy when necessary, so it always does the fastest thing possible. In your case the data must be copied, but by always using reshape you avoid copies in the cases where none is needed.
So you could use
out[:,0,:].reshape(-1)
Gotcha
There's one important gotcha here. If you perform in-place operations on the output of reshape then that may or may not affect the original tensor, depending on whether or not a view or copy was returned.
For example, assuming out is already contiguous, then in this case
>>> x = out[:,0,:].reshape(-1) # returns a copy
>>> x[0] = 10
>>> print(out[0,0,0].item() == 10)
False
x is a copy so changes to it don't affect out. But in this case
>>> x = out[:,:,0].reshape(-1) # returns a view
>>> x[0] = 10
>>> print(out[0,0,0].item() == 10)
True
x is a view, so in-place changes to x will change out as well.
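If you don't want to reason about strides every time, you can also check directly whether reshape handed back a view or a copy by comparing underlying storage, for example (on newer PyTorch versions .storage() may warn, with .untyped_storage() as the replacement):
import torch

out = torch.zeros(3, 4, 5)  # contiguous, as assumed above

def shares_storage(a, b):
    # Views share their base tensor's storage; copies get fresh storage
    return a.storage().data_ptr() == b.storage().data_ptr()

print(shares_storage(out[:, 0, :].reshape(-1), out))  # False: reshape had to copy
print(shares_storage(out[:, :, 0].reshape(-1), out))  # True: reshape returned a view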
Alternatives
A couple of alternatives are
out[:,0,:].flatten() # .flatten is just a special case of .reshape
and
out[:,0,:].contiguous().view(-1)
Though if you want the fastest approach, I recommend against the latter method using contiguous().view, since in general it is more likely than reshape or flatten to return a copy. This is because contiguous creates a copy for any non-contiguous tensor, even one whose underlying data is uniformly spaced, which is exactly the case where reshape and flatten can still return a view. Therefore, there's a difference between
out[:,:,0].contiguous().view(-1) # creates a copy
and
out[:,:,0].flatten() # creates a non-contiguous view (b/c underlying data has uniform spacing of out.shape[2] values between entries)
where the contiguous().view approach forces a copy since out[:,:,0] is not contiguous, but flatten/reshape would create a view since the underlying data is uniformly spaced.
Sometimes contiguous() won't create a copy, for example compare
out[0,:,:].contiguous().view(-1) # creates a view b/c out[0,:,:] already is contiguous
and
out[0,:,:].flatten() # creates a view
which both produce a view of the original data without copying since out[0,:,:] is already contiguous.
If you want to ensure that out is completely decoupled from its flattened counterpart, the original approach using .clone() is the way to go.

Random Index from a Tensor (Sampling with Replacement from a Tensor)

I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
 [*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and using numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to map the flat index back into the original shape. Further, flattening millions of weights would be somewhat slow.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about reservoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this pseudocode (in Python-like syntax):
def sample_tensor(tensor, chosen_index: int) -> Tuple[int]:
    """Maps a chosen random number to its index in the given tensor.

    Args:
        tensor: A ragged-array n-tensor.
        chosen_index: An integer in [0, num_scalar_elements_in_tensor).

    Returns:
        The index that accesses this element in the tensor.

    NOTE: Entirely untested, expect it to be fundamentally flawed.
    """
    remaining = chosen_index
    for (i, sub_list) in enumerate(tensor):
        if type(sub_list) is an iterable:
            if |sub_list| > remaining:
                remaining -= |sub_list|
            else:
                return i joined with sample_tensor(sub_list, remaining)
        else:
            if len(sub_list) <= remaining:
                return tuple(remaining)
First of all, I'm aware this isn't a sound algorithm. The idea is to count down until you reach your element, with bookkeeping for indices.
We need to make two crucial assumptions here: 1) all lists eventually contain only scalars, and 2) as a direct consequence, if a list contains lists, it contains no scalars at the same level. (Stop and convince yourself of (2).)
A critical note as well: we cannot cheaply measure the number of scalars in a given list unless the list consists entirely of scalars. To avoid measuring this magnitude at every step, my algorithm above should be refactored to descend first and subtract later.
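For concreteness, here is a minimal runnable version of that descend-first refactor, assuming the "tensor" is a ragged nested Python list (the helper names are mine):
from typing import Tuple, Union

Ragged = Union[int, float, list]

def count_scalars(tensor: Ragged) -> int:
    """Number of scalar leaves in a ragged nested list."""
    if isinstance(tensor, list):
        return sum(count_scalars(sub) for sub in tensor)
    return 1

def sample_tensor(tensor: Ragged, chosen_index: int) -> Tuple[int, ...]:
    """Map an integer in [0, count_scalars(tensor)) to the index tuple of
    the corresponding scalar: descend first, subtract later."""
    if not isinstance(tensor, list):
        return ()  # reached the scalar itself
    for i, sub in enumerate(tensor):
        n = count_scalars(sub)   # descend first: measure this subtree
        if chosen_index < n:
            return (i,) + sample_tensor(sub, chosen_index)
        chosen_index -= n        # subtract later: skip past this subtree
    raise IndexError("chosen_index out of range")
Drawing chosen_index uniformly from range(count_scalars(tensor)) then makes every scalar equally likely, which is exactly the fairness property we were after.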
This style of algorithm has some consequences:
It's the fastest in its entire style of approaching the problem. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bound at Theta(l) where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's slow: much slower than sampling from tensors that have a regular, defined shape.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write full pseudocode for this, but I'll lay out some objectives (a rough sketch follows the list):
This can be called only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
Sampling is also faster, at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
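A rough sketch of these objectives, again assuming a ragged nested list: build_counts plays the role of the distribution-builder (called once), and sample_index descends it with a weighted numpy.random choice (the names are mine):
import numpy as np

def build_counts(tensor):
    """Precompute, for every sub-tensor, its total scalar count; the result
    mirrors the nesting: (total, [child summaries]), or (1, None) for a scalar."""
    if not isinstance(tensor, list):
        return (1, None)
    children = [build_counts(sub) for sub in tensor]
    return (sum(n for n, _ in children), children)

def sample_index(counts, rng):
    """Descend the precomputed summary, choosing each child with probability
    proportional to its scalar count; every scalar ends up equally likely."""
    total, children = counts
    if children is None:
        return ()
    p = np.array([n for n, _ in children], dtype=float)
    i = rng.choice(len(children), p=p / total)
    return (i,) + sample_index(children[i], rng)

# counts = build_counts(weights)  # built once, per objective one
# index = sample_index(counts, np.random.default_rng())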
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
I hope this helps.

python incorrect size with getsizeof() and .nbytes with nested lists

I apologise if this is a duplicate issue, but I've been having some issues with .nbytes and sys.getsizeof().
In particular, I have a list which contains numpy arrays, each array is a 3D representation of an image (row, column, RGB) and each of these images have different dimensions.
There are over 4000 images, and this may increase in the future, as I plan to use them for machine learning.
When I use .nbytes with one image, I get the correct size, but when I try to evaluate the whole lot, I get an incorrect size:
# size of image 1 in bytes
print("size of first image: %d bytes" % images[0].nbytes)
# size of all images in bytes
print("total size of all images: %d bytes" % images.nbytes)
Result:
size of first image: 60066 bytes
total size of all images: 36600 bytes
Are the only ways around this to either loop through all the images or change to a monstrous 4D array instead of a list of 3D arrays? Is there another function which better evaluates size for this kind of nested setup?
I'm running Python 3.6.7.
Try running images.dtype. What does it return? If it's dtype('O'), that explains your problem: images is not a list, but is instead a Numpy array of type object, which is generally a Bad Idea™️. Technically, it'll be a 1D array holding a bunch of 3D arrays, and its nbytes counts only the 8-byte pointers to those arrays, not the image data they contain.
Numpy arrays are best suited to use with numerical data. They're flexible enough to hold arbitrary Python objects, but it greatly impairs both their functionality and their efficiency. Unless you have a clear reason why in mind, you should generally just use a plain Python list [] in these situations.
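To see this concretely, here's a small reproduction with made-up image shapes (the numbers are illustrative, assuming a 64-bit Python where object pointers are 8 bytes):
import numpy as np

# Hypothetical stand-in for your data: three "images" of different shapes
imgs = [np.zeros((10, 20, 3), dtype=np.uint8),
        np.zeros((5, 5, 3), dtype=np.uint8),
        np.zeros((8, 16, 3), dtype=np.uint8)]
images = np.empty(len(imgs), dtype=object)  # 1D object array
images[:] = imgs

print(images.dtype)      # object
print(images[0].nbytes)  # 600 -- the actual bytes of the first image
print(images.nbytes)     # 24  -- just 3 pointers x 8 bytes, not the image data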
You may actually be best off converting images to a 4D array, as this is the only way images.nbytes will work correctly. You can't do this if your images have different sizes, but if they all share the same shape (x, y, z), it's actually pretty straightforward:
images = np.array([a for a in images])
Now images.shape will be (n, x, y, z), where n is the total number of images. You can access the 3D array that represents the ith image by just indexing images:
image_i = images[i]
Alternatively, you can convert images to a normal Python list:
images = images.tolist()
If you don't want to bother with any of those conversions, you can always get the size of all the subarrays via iteration:
totalsize = sum(arr.nbytes for arr in images)

What is the best way to permanently store an array with 512 floats and 1 million records to facilitate fast search?

I have millions of images and for each image, I have converted them
into 512 numbers to represent what is in that image at a higher level
of abstraction than pixels. The dataset is like table with 512 fields
and a million rows, filled with floats.
When given a new image, I would like to be able to query through the 1
million records and return the records in order of "similarity".
Similarity can be defined as lowest sum of difference between the two
arrays of 512 elements.
What is the best way of permanently storing this data and performing the numerical calculations so that the "image search" is fast?
Just for background info: the 512 elements is the intermediate output features of a convolutional neural network used in image classification. I'm trying to return the most similar images when given a new image.
I'm pretty new to this - I hope the question makes sense.
I can store the database in many different ways... serialized in sql database, csv file... but what I'm not sure of is what is the best format for fast search later on.
My suggestion would be vectorization, possible in Python's NumPy, MATLAB, Octave, etc. Basically, this means taking the difference between two matrices in a single operation. For instance, in Python 3:
import numpy as np
pic1 = np.array([[1, 2], [3, 4]])
pic2 = np.array([[4, 3], [2, 1]])
diff = pic1 - pic2
dist = np.sum(diff * diff)   # squared Euclidean distance between the two
similarity = 1 / dist        # note: undefined for identical pictures (dist == 0)
print(similarity)
This is fast because the elementwise arithmetic runs inside NumPy's compiled loops rather than in Python-level loops over every entry, and the same broadcasting trick lets you compare a query against the entire database in a single matrix operation.
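For your actual setup, the same trick scales to scoring all million records at once. A rough sketch, with shapes and the top-k selection chosen by me for illustration (np.save/np.load is one simple way to persist the array; np.load can also memory-map the file):
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: the whole table as one (1_000_000, 512) float32 array
db = rng.random((1_000_000, 512), dtype=np.float32)
query = rng.random(512, dtype=np.float32)

d = db - query                 # broadcasts the query against every row
dists = (d * d).sum(axis=1)    # squared Euclidean distance per record

k = 10
top_k = np.argpartition(dists, k)[:k]    # the k most similar, unordered
top_k = top_k[np.argsort(dists[top_k])]  # ...now ordered by similarity
print(top_k)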

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog, where you have a set of drawers (bins): one might hold SAM-SOLD and the next SOLE-STE, while all of Y-ZZZ fits in a single drawer. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If you can't make any assumptions about the data, you're going to have to make a pass to determine the bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n, so that even the largest possible population fills no more than n bins.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and dumping it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take every (size / n / m)-th element of that array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size from a population of unknown size; see Wikipedia for the details. It provides a uniform sample regardless of whether the stream is ordered.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
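Here's a compact Python sketch of that pipeline (my actual implementation used Java's Guava RangeMap; the names and parameters here are illustrative):
import bisect
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)
        else:
            j = rng.randrange(n + 1)
            if j < k:
                sample[j] = item
    return sample

def bin_endpoints(sample, num_bins):
    """Equi-depth cutoffs: every (len(sample)//num_bins)-th element of the sorted sample."""
    s = sorted(sample)
    step = len(s) // num_bins
    return [s[i * step] for i in range(1, num_bins)]

def bin_of(endpoints, value):
    """Which bin a value falls in; works for strings via lexicographic comparison."""
    return bisect.bisect_right(endpoints, value)
With 100 bins, bin_of(endpoints, x) approximates the percentile of x in the population, which gives the between-two-values estimate described above.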
