img[::a, ::b] can reduce the resolution of an image in PIL, but why?
If the image's resolution (size) is x by y, then img[::a, ::b] gives you an image that is roughly x/a by y/b.
Who knows why or how this works?
That's the syntax for a multi-dimensional slice. The result includes every ath pixel along the x-axis and every bth pixel along the y-axis.
Slice notation is not too difficult to understand. It looks like start:stop:step, where any of the values can be omitted to get a default. If you're omitting step at the end, the second colon is not required (you can just write start:stop). Slice notation is only allowed when indexing (e.g. foo[start:stop:step]). To make a slice outside that context, you can call the slice constructor (though you may need to pass None for omitted values rather than just skipping them).
The default step value is 1. The default start and stop depend on the sign of step. If step is positive, then start's default is 0 and stop's is the size of the object being sliced. If step is negative, start will default to one less than the size of the object (the largest valid index) and stop will default to an index "just before the first value" (this is not -1 as you might expect, because negative indexes wrap around to the end; the only way you could specify the index normally is -len(...)-1).
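For instance, a few illustrative lines with an ordinary Python list:

xs = list(range(10))
xs[::2]                    # [0, 2, 4, 6, 8]; start=0, stop=len(xs), step=2
xs[slice(None, None, 2)]   # the same slice built with the slice constructor
xs[::-1]                   # reversed copy; start=len(xs)-1, stop "just before the first value"
xs[8:None:-1]              # [8, 7, ..., 0]; None is the only way to spell that default stop here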
Not all Python objects allow multi-dimensional indexing (where the index is a tuple of indexes or slices). Normal lists don't support it (not even if they're nested). The PIL image does, however, probably because it is replicating the behavior of a numpy multi-dimensional array.
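As a concrete illustration (a sketch only; the file names are placeholders, and the image is run through numpy rather than relying on the PIL object's own indexing):

import numpy as np
from PIL import Image

img = Image.open("photo.jpg")       # placeholder file name
arr = np.asarray(img)               # shape (height, width, channels)
small = arr[::4, ::2]               # keep every 4th row and every 2nd column
print(arr.shape, small.shape)       # e.g. (400, 600, 3) -> (100, 300, 3)
Image.fromarray(small).save("small.jpg")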
I'm using the ppm function in spatstat and looking for documentation on the default dummy points used in the quadrature scheme. The default.dummy help page says "If nd is missing, a default value (depending on the data pattern X) is computed by default.ngrid." I am now looking for information on default.ngrid but cannot find anything. For context, I am comparing null random models to models with covariates, where the points are coordinates of class ppp and the covariates are pixel images in a window. To generate the models, I'm using the ppm function with default parameters, as in:
ppm0 <- ppm(myPoints_ppp ~ 1)
ppm1 <- ppm(myPoints_ppp ~ covariate_A)
I'm trying to understand how the dummy points are being generated since I did not specify the nd argument. Thanks!
That reference is out of date. The default number of dummy points is determined by default.n.tiling. This is an internal, undocumented function.
The source code for default.n.tiling can be consulted to find out the exact rules, but here is a sketch:
Currently the default minimum number of grid points in each dimension is the greater of
10 * ceiling(2 * sqrt(npoints(X))/10)
and
spatstat.options('ndummy.min')
This determines a minimum acceptable number of dummy points. With the default value of spatstat.options('ndummy.min') = 32 this means that any point pattern with no more than 225 points will be given a minimum 32 x 32 grid of dummy points. A pattern with 226 to 400 points will have 40 x 40 dummy points. A pattern with 401 to 625 points will have 50 x 50 dummy points, and so on.
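As a rough sanity check of that arithmetic (an illustration only; the real computation lives inside spatstat's internal default.n.tiling, which is written in R):

import math

def min_grid_side(n_points, ndummy_min=32):
    # greater of 10 * ceiling(2 * sqrt(n) / 10) and spatstat.options('ndummy.min')
    return max(10 * math.ceil(2 * math.sqrt(n_points) / 10), ndummy_min)

for n in (225, 226, 400, 401, 625):
    print(n, min_grid_side(n))
# 225 -> 32, 226 -> 40, 400 -> 40, 401 -> 50, 625 -> 50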
The final number of dummy points is determined by applying some other constraints (which depend on the context and user-specified parameters) and may be greater than the minimum number specified above.
This default can always be overruled by simply specifying a non-NULL value for the parameter nd.
I should also point out that the default settings are chosen to produce an acceptable result quickly, rather than to produce a highly accurate result. This is necessary because of CRAN's limits on the total time taken to check the package. If you're doing any serious methodological research, you should consider increasing spatstat.options('ndummy.min') or creating your own quadrature schemes. This will also improve the reproducibility of your results (since the default in spatstat could change).
I am running into problems when computing the relative risk estimation (relrisk.ppp) of two point patterns: one with four marks in a rectangular region and the other with two marks in a circular region.
For the first pattern with four marks, I am able to get the relative risk, and the resulting object is a large imlist with 4 elements, one for each mark.
However, for the second pattern, it gives a list of 10 elements, of which the first element, the matrix v, is empty (all NA entries). I am racking my brain over what could be wrong, since the point pattern objects I created seem to be structured identically. Any help will be appreciated. Thanks.
For your first dataset, the result is a list of image objects (a list of four objects of class im). For your second dataset, the result of relrisk.ppp is a single image (object of class im). This is the default behaviour when there are only two possible types of points (two possible mark values). See help(relrisk.ppp).
In all cases, you should just be able to plot and print the resulting object. You don't need to examine the internal data of the image.
More explanation: when there are only two possible types of points, the default behaviour of relrisk.ppp is to treat them as case-control data, where the points belonging to the first type are treated as controls (e.g. non-infected people), and the points of the second type are treated as cases (e.g. infected people). The ratio of intensities (cases divided by controls) is estimated as an image.
If you don't want this to happen, set the argument casecontrol=FALSE and then relrisk.ppp will always return a list of images, with one image for each possible mark. Each image gives the spatially-varying probability of that type of point.
It's all explained in help(relrisk.ppp) or in the book.
I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and using numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to map the flattened position back to the tensor's original shape. Further, flattening millions of weights would be somewhat slow.
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about reservoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on the scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this sketch (in Python):
from typing import Tuple

def count_scalars(tensor) -> int:
    """Counts the scalar elements in a (possibly ragged) nested list."""
    if not isinstance(tensor, (list, tuple)):
        return 1
    return sum(count_scalars(sub) for sub in tensor)

def sample_tensor(tensor, chosen_index: int) -> Tuple[int, ...]:
    """Maps a chosen random number to its index in the given tensor.

    Args:
        tensor: A ragged-array n-tensor (nested lists bottoming out in scalars).
        chosen_index: An integer in [0, num_scalar_elements_in_tensor).

    Returns:
        The tuple of indices that accesses this element in the tensor.
    """
    remaining = chosen_index
    for i, sub in enumerate(tensor):
        if isinstance(sub, (list, tuple)):
            size = count_scalars(sub)
            if remaining < size:
                # The chosen element lives somewhere inside this sub-list.
                return (i,) + sample_tensor(sub, remaining)
            remaining -= size
        else:
            if remaining == 0:
                return (i,)
            remaining -= 1
    raise IndexError("chosen_index is out of range for this tensor")
Treat this as a sketch rather than production code. The idea is to count down until you reach your element, keeping track of the indices as you descend.
We need to make two crucial assumptions here. 1) All lists eventually contain only scalars. 2) If a list contains lists, assume it doesn't also contain scalars at the same level. (Stop and convince yourself of (2).)
We also need to make a critical note: we can't cheaply measure the number of scalars in a given list unless that list consists purely of scalars, so the version above recounts sub-lists as it descends. To avoid measuring this magnitude at every step, the algorithm could be refactored to descend first and subtract later.
This algorithm has some consequences:
It's the fastest approach within this general style. If you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of scalar elements preceding your target in the total ordering of the tensor. This is effectively bounded at Theta(l), where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's still slow, too slow compared to sampling from nicer tensors that have a regular, defined shape.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write this out in full, but will lay out some objectives (a rough sketch follows this list):
The distribution only needs to be built once, up front.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution over sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
This is also faster at O(n) where n is the rank of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
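As a rough illustration of those objectives (a sketch only, assuming the same ragged nested-list tensors as above; the names build_counts and sample_index are mine, not an established API):

import numpy as np

def build_counts(tensor):
    """Precompute, for every sub-list, how many scalars it contains.

    Returns (total, children), where children is None for a scalar leaf.
    """
    if not isinstance(tensor, (list, tuple)):
        return (1, None)
    children = [build_counts(sub) for sub in tensor]
    return (sum(c[0] for c in children), children)

def sample_index(tensor, counts, rng=None):
    """Draws one index tuple uniformly over the scalars of `tensor`."""
    rng = rng or np.random.default_rng()
    total, children = counts
    if children is None:
        return ()
    sizes = np.array([c[0] for c in children], dtype=float)
    i = rng.choice(len(children), p=sizes / sizes.sum())
    return (int(i),) + sample_index(tensor[i], children[i], rng)

weights = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6, 0.7]]
counts = build_counts(weights)        # built once
print(sample_index(weights, counts))  # e.g. (1, 2)

Building the counts is one pass over the tensor; each draw afterwards only touches a single path from the root down to a leaf.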
I hope this helps.
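As an aside on the flatten-and-choice idea mentioned in the question: for a regular (non-ragged) numpy array there is no need to physically flatten anything, because numpy.unravel_index maps a flat position back to a multi-dimensional index. A minimal sketch, assuming an ordinary ndarray:

import numpy as np

w = np.random.rand(64, 32)                 # e.g. one layer's weight matrix
flat_pos = np.random.choice(w.size)        # uniform over all scalar entries
idx = np.unravel_index(flat_pos, w.shape)  # e.g. (17, 5)
print(idx, w[idx])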
I'm running some classification experiments using sklearn. During the experiments, I build csr_matrix objects to store my data, use a LogisticRegression classifier on these objects, and get some results.
I dump the data using dump_svmlight_file and the model using joblib.
But when I then load the data using load_svmlight_file and reload the model, I obtain (very) different results.
I realized that if I dump the data with the zero_based parameter set to False, I recover my original results. What exactly is the effect of this parameter? Is it usual to get different results when changing its value?
The docs are pretty explicit:
zero_based : boolean or “auto”, optional, default “auto”
Whether column indices in f are zero-based (True) or one-based (False). If column indices are one-based, they are transformed to zero-based to match Python/NumPy conventions. If set to “auto”, a heuristic check is applied to determine this from the file contents. Both kinds of files occur “in the wild”, but they are unfortunately not self-identifying. Using “auto” or True should always be safe.
Your observation seems odd, though. If you dump with zero_based=False and load with zero_based='auto' the heuristic should be able to detect the right format.
Also, if the wrong format had been detected, the number of features would have changed, so your classifier would have raised an error.
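A quick round-trip check along those lines (a sketch; the toy matrix and file names are just for illustration):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = csr_matrix(np.array([[0.0, 1.0, 2.0], [3.0, 0.0, 4.0]]))
y = np.array([0, 1])

for zero_based in (True, False):
    path = f"data_zb_{zero_based}.svmlight"
    dump_svmlight_file(X, y, path, zero_based=zero_based)
    X2, _ = load_svmlight_file(path, zero_based="auto")
    print(zero_based, X2.shape, np.allclose(X.toarray(), X2.toarray()))

If the heuristic detects the format correctly, both dumps should reload to the same matrix.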
I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD and the next SOLE - STE, while all of Y - ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the value will fall in some interval [a, b]. If you want at most n bins, make the bin size == a/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element as you go and dumping those into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take the elements at multiples of size/(n*m) in that array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an equal number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall between two values: if there are 100 equal-sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
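The same idea in outline (a sketch in Python rather than the Guava-based implementation described above; the function names and input file are placeholders):

import random

def reservoir_sample(stream, k, rng=random):
    """Keeps a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            sample.append(item)
        else:
            j = rng.randrange(n)      # 0 <= j < n
            if j < k:
                sample[j] = item
    return sample

def bin_endpoints(sample, number_of_bins):
    """Picks cutoff strings so each bin holds roughly the same share of the sample."""
    s = sorted(sample)
    step = len(s) / number_of_bins
    return [s[int(i * step)] for i in range(1, number_of_bins)]

sample = reservoir_sample(open("strings.txt"), 10_000)   # placeholder input, one string per line
cutoffs = bin_endpoints(sample, 100)

Estimating where a new value falls is then just a binary search over the sorted cutoffs, which corresponds to looking up the bin's order in the RangeMap described above.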