Using dataloader to sample with replacement in pytorch


I have a dataset defined as follows:
class MyDataset(Dataset):
    def __init__(self, N):
        self.N = N
        self.x = torch.rand(self.N, 10)
        self.y = torch.randint(0, 3, (self.N,))
    def __len__(self):
        return self.N
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]
During training, I would like to sample batches of m training samples with replacement; e.g. the first iteration uses data indices [1, 5, 6], the second uses [12, 3, 5], and so on. The total number of iterations is therefore an input, rather than N/m.
Is there a way to use a DataLoader to handle this? If not, is there any other method than something of the form
for i in range(iter):
    x = np.random.choice(range(N), m, replace=True)
to implement this?

You can use a RandomSampler, a utility that sits between the dataset and the dataloader:
>>> ds = MyDataset(N)
>>> sampler = RandomSampler(ds, replacement=True, num_samples=M)
Above, sampler will draw a total of M indices (replacement is of course necessary if num_samples > len(ds)). In your example, M = iter*m.
You can then initialize a DataLoader with this sampler:
>>> dl = DataLoader(ds, sampler=sampler, batch_size=2)
Here is a possible result with N = 2, M = 2*len(ds) = 4, and batch_size = 2:
>>> for x, y in dl:
...     print(x, y)
tensor([[0.5541, 0.3596, 0.5180, 0.1511, 0.3523, 0.4001, 0.6977, 0.1218, 0.2458, 0.8735],
[0.0407, 0.2081, 0.5510, 0.2063, 0.1499, 0.1266, 0.1928, 0.0589, 0.2789, 0.3531]])
tensor([1, 0])
tensor([[0.5541, 0.3596, 0.5180, 0.1511, 0.3523, 0.4001, 0.6977, 0.1218, 0.2458, 0.8735],
[0.0431, 0.0452, 0.3286, 0.5139, 0.4620, 0.4468, 0.3490, 0.4226, 0.3930, 0.2227]])
tensor([1, 0])
tensor([[0.5541, 0.3596, 0.5180, 0.1511, 0.3523, 0.4001, 0.6977, 0.1218, 0.2458, 0.8735],
[0.5541, 0.3596, 0.5180, 0.1511, 0.3523, 0.4001, 0.6977, 0.1218, 0.2458, 0.8735]])
tensor([1, 1])
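To tie this back to the question, here is a minimal sketch with a fixed number of iterations. The values N = 20, m = 3 and num_iters = 7 are made up for illustration; setting num_samples to num_iters * m and batch_size to m should yield exactly num_iters batches of m samples each:

```python
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler


class MyDataset(Dataset):
    def __init__(self, N):
        self.N = N
        self.x = torch.rand(N, 10)
        self.y = torch.randint(0, 3, (N,))

    def __len__(self):
        return self.N

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# Illustrative values only (not from the original post).
N, m, num_iters = 20, 3, 7

ds = MyDataset(N)
# Draw num_iters * m indices with replacement, then group them into batches of m.
sampler = RandomSampler(ds, replacement=True, num_samples=num_iters * m)
dl = DataLoader(ds, sampler=sampler, batch_size=m)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in dl]
print(len(batch_shapes), batch_shapes[0])
```

Because sampler and batch_size are independent arguments, the number of iterations is controlled purely through num_samples.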

Related

How do i define a setter for a list with an index or slicing?

With the property and setter decorator I can define getter and setter functions. This is fine for primitives but how do I index a collection or a numpy array?
Setting values seems to work with an index, but the setter function doesn't get called. Otherwise the print function in the minimal example would be executed.
class Data:
    def __init__(self):
        self._arr = [0, 1, 2]
    @property
    def arr(self):
        return self._arr
    @arr.setter
    def arr(self, value):
        print("new value set")  # I want this to be executed
        self._arr = value
data = Data()
print(data.arr)  # prints [0, 1, 2]
data.arr[2] = 5
print(data.arr)  # prints [0, 1, 5]

If you only need this for one list on your class instance, you can implement the __setitem__ and __getitem__ dunder methods on the class:
class Data:
    def __init__(self):
        self._arr = [0, 1, 2]
    @property
    def arr(self):
        return self._arr
    @arr.setter
    def arr(self, value):
        print("new inner list set")
        self._arr = value
    def __setitem__(self, key, value):
        print("new value set")
        self._arr[key] = value
    def __getitem__(self, key):
        return self._arr[key]
data = Data()
print(data.arr)
data[2] = 5
print(data.arr)
data.arr = [42, 43]
print(data.arr)
Output:
[0, 1, 2]
new value set # by data[2] = 5 using __setitem__
[0, 1, 5]
new inner list set # by data.arr = [42, 43] using @arr.setter
[42, 43]
This only works for one list member, though, because __setitem__ operates on the class instance itself, not on the list that is a member of the instance.
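If data.arr[2] = 5 itself should trigger the hook, one possible alternative (not part of the answer above) is to store a list subclass that reports item and slice assignments. NotifyingList, on_change and the log attribute below are hypothetical names chosen for this sketch:

```python
class NotifyingList(list):
    """Hypothetical helper: a list that calls a callback on item/slice assignment."""

    def __init__(self, iterable=(), on_change=None):
        super().__init__(iterable)
        self._on_change = on_change

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if self._on_change is not None:
            self._on_change(key, value)


class Data:
    def __init__(self):
        self.log = []  # records every change, for demonstration only
        self._arr = NotifyingList([0, 1, 2], on_change=self._record)

    def _record(self, key, value):
        self.log.append((key, value))

    @property
    def arr(self):
        return self._arr

    @arr.setter
    def arr(self, value):
        # Re-wrap the new list so later item assignments are still reported.
        self._arr = NotifyingList(value, on_change=self._record)
        self.log.append(("replaced", list(value)))


data = Data()
data.arr[2] = 5         # index assignment goes through NotifyingList.__setitem__
data.arr[0:2] = [7, 8]  # slice assignment does too (key is a slice object)
data.arr = [42, 43]     # whole-list assignment goes through the property setter
print(data.log)
```

This sketch only intercepts item and slice assignment; methods such as append or extend would need to be overridden in the same way if they should be reported as well.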

Generate a list with two unique elements with specific length [duplicate]

Simple question here:
I'm trying to get an array that alternates values (1, -1, 1, -1.....) for a given length. np.repeat just gives me (1, 1, 1, 1,-1, -1,-1, -1). Thoughts?

I like @Benjamin's solution. An alternative though is:
import numpy as np
a = np.empty((15,))
a[::2] = 1
a[1::2] = -1
This also allows for odd-length lists.
EDIT: Also, just to note the speeds, for an array of 10000 elements:
import numpy as np
from timeit import Timer
if __name__ == '__main__':
    # itertools is imported in the setup because method4 uses it
    setupstr = """
import itertools
import numpy as np
N = 10000
"""
    method1 = """
a = np.empty((N,),int)
a[::2] = 1
a[1::2] = -1
"""
    # note: method2 and method3 as written build 2*N elements
    method2 = """
a = np.tile([1,-1],N)
"""
    method3 = """
a = np.array([1,-1]*N)
"""
    method4 = """
a = np.array(list(itertools.islice(itertools.cycle((1,-1)), N)))
"""
    nl = 1000
    t1 = Timer(method1, setupstr).timeit(nl)
    t2 = Timer(method2, setupstr).timeit(nl)
    t3 = Timer(method3, setupstr).timeit(nl)
    t4 = Timer(method4, setupstr).timeit(nl)
    print('method1', t1)
    print('method2', t2)
    print('method3', t3)
    print('method4', t4)
Results in timings of:
method1 0.0130500793457
method2 0.114426136017
method3 4.30518102646
method4 2.84446692467
If N = 100, things start to even out, but starting with an empty NumPy array is still significantly faster (nl changed to 10000):
method1 0.05735206604
method2 0.323992013931
method3 0.556654930115
method4 0.46702003479
Numpy arrays are special awesome objects and should not be treated like python lists.

Use np.resize():
In [38]: np.resize([1,-1], 10) # 10 is the length of result array
Out[38]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1])
It can also produce an odd-length array:
In [39]: np.resize([1,-1], 11)
Out[39]: array([ 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1])

Use numpy.tile!
import numpy
a = numpy.tile([1,-1], 15)

Use list multiplication (note that this yields a list of 2*n elements):
[1,-1] * n

If you want a memory-efficient solution, try this:
def alternator(n):
    for i in range(n):  # xrange in Python 2
        if i % 2 == 0:
            yield 1
        else:
            yield -1
Then you can iterate over the answers like so:
for i in alternator(n):
    # do something with i

Maybe you're looking for itertools.cycle?
import itertools
list_ = (1,-1,2,-2) # ,3,-3, ...
for n, item in enumerate(itertools.cycle(list_)):
    if n==30:
        break
    print(item)

I'll just throw these out there because they could be more useful in some circumstances.
If you just want to alternate between positive and negative:
[(-1)**i for i in range(n)]
or for a more general solution
nums = [1, -1, 2]
[nums[i % len(nums)] for i in range(n)]
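As a quick sanity check, the following sketch builds the same odd-length alternating array with several of the approaches above and confirms that they agree. The length n = 7 is arbitrary; np.tile needs trimming because it repeats whole [1, -1] pairs:

```python
import itertools
import numpy as np

n = 7  # arbitrary odd length for illustration

# Slice assignment into a preallocated array
a = np.empty(n, dtype=int)
a[::2] = 1
a[1::2] = -1

# np.resize repeats the pattern until the requested length is reached
b = np.resize([1, -1], n)

# np.tile repeats whole pairs, so build enough pairs and trim to n
c = np.tile([1, -1], (n + 1) // 2)[:n]

# itertools.cycle, cut off after n items
d = np.array(list(itertools.islice(itertools.cycle((1, -1)), n)))

# Powers of -1
e = np.array([(-1) ** i for i in range(n)])

print(a)
```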

Roll of different amount along a single axis in a 3D matrix [duplicate]

I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row,x in zip(A, r)]))
[[0 0 4]
 [1 2 3]
 [0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?

Sure, you can do it using advanced indexing. Whether it is the fastest way probably depends on your array size (if your rows are large it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Always use a negative shift, so that column_indices are valid.
# (could also use the modulo operation; note that this modifies r in place)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
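The comment above mentions that a modulo operation could be used instead. A minimal sketch of that variant follows; the function name roll_rows is made up for this example. It leaves r unmodified and is checked against the row-by-row np.roll loop from the question:

```python
import numpy as np


def roll_rows(A, r):
    """Roll each row of a 2D array A by the matching entry of r (like np.roll per row)."""
    n_rows, n_cols = A.shape
    rows = np.arange(n_rows)[:, None]
    # Element j of the rolled row comes from column (j - shift) mod n_cols.
    cols = (np.arange(n_cols)[None, :] - np.asarray(r)[:, None]) % n_cols
    return A[rows, cols]


A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])

result = roll_rows(A, r)
expected = np.array([np.roll(row, x) for row, x in zip(A, r)])
print(result)
```

Taking the modulo also makes shifts larger than the row length behave the way np.roll does.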

numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous - np.lib.stride_tricks.as_strided. The idea/trick would be to get a sliced portion starting from the first column until the second last one and concatenate at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided and thus avoid the need of actually rolling back. That's the whole idea!
Now, in terms of the actual implementation, we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hood. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
    # Concatenate with sliced to cover all rolls
    a_ext = np.concatenate((a,a[:,:-1]),axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext,(1,n))[np.arange(len(r)), (n-r)%n,0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
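If scikit-image is not available, the same windowing idea can be sketched with NumPy's own sliding_window_view (available in NumPy 1.20 and later). This is a variation on the approach above, not the answer's code, and the function name sliding_window_roll is invented for the example:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sliding_window_roll(a, r):
    """Per-row roll of a 2D array using NumPy's sliding_window_view."""
    a = np.asarray(a)
    r = np.asarray(r)
    n = a.shape[1]
    # Append the first n-1 columns so every cyclic rotation is a contiguous window.
    a_ext = np.concatenate((a, a[:, :-1]), axis=1)
    # windows[i, s] is the length-n window of row i that starts at column s.
    windows = sliding_window_view(a_ext, n, axis=1)
    # A roll by r[i] corresponds to the window starting at (n - r[i]) mod n.
    return windows[np.arange(a.shape[0]), (n - r) % n]


A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])

rolled = sliding_window_roll(A, r)
print(rolled)
```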
Benchmarking
# @seberg's solution
def advindexing_roll(A, r):
    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
    r[r < 0] += A.shape[1]
    column_indices = column_indices - r[:,np.newaxis]
    return A[rows, column_indices]
Let's do some benchmarking on an array with large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# @seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop

In case you want a more general solution (dealing with any shape and any axis), I modified @seberg's solution:
def indep_roll(arr, shifts, axis=1):
    """Apply an independent roll for each dimensions of a single axis.
    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray
        How many shifting to use for each dimension. Shape: `(arr.shape[axis],)`.
    axis : int
        Axis along which elements are shifted.
    """
    arr = np.swapaxes(arr,axis,-1)
    # list(...) so the last index array can be replaced below even if ogrid returns a tuple
    all_idcs = list(np.ogrid[tuple(slice(0,n) for n in arr.shape)])
    # Convert to a positive shift (note: modifies `shifts` in place)
    shifts[shifts < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]
    result = arr[tuple(all_idcs)]
    arr = np.swapaxes(result,-1,axis)
    return arr

I implemented a pure numpy.lib.stride_tricks.as_strided solution as follows:
from numpy.lib.stride_tricks import as_strided
def custom_roll(arr, r_tup):
    m = np.asarray(r_tup)
    arr_roll = arr[:, [*range(arr.shape[1]),*range(arr.shape[1]-1)]].copy() #need `copy`
    strd_0, strd_1 = arr_roll.strides
    n = arr.shape[1]
    result = as_strided(arr_roll, (*arr.shape, n), (strd_0 ,strd_1, strd_1))
    return result[np.arange(arr.shape[0]), (n-m)%n]
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
r = np.array([2, 0, -1])
out = custom_roll(A, r)
Out[789]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])

By using a fast Fourier transform, we can apply a transformation in the frequency domain and then use the inverse fast Fourier transform to obtain the row shift.
So this is a pure NumPy solution that takes only one line:
import numpy as np
from numpy.fft import fft, ifft
# The row shift function using the fast Fourier transform
# rshift(A,r) where A is a 2D array, r the row shift vector
def rshift(A,r):
    return np.real(ifft(fft(A,axis=1)*np.exp(2*1j*np.pi/A.shape[1]*r[:,None]*np.r_[0:A.shape[1]][None,:]),axis=1).round())
This will apply a left shift, but we can simply negate the exponent of the exponential to turn the function into a right-shift function:
ifft(fft(...)*np.exp(-2*1j...)
It can be used like this:
# Example:
A = np.array([[1,2,3,4],
[1,2,3,4],
[1,2,3,4]])
r = np.array([1,-1,3])
print(rshift(A,r))
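To make the shift direction concrete, the sketch below restates rshift in a few more lines and compares it with np.roll using negated shifts, which is what a left shift corresponds to. The multi-line layout and the names k and phase are mine; note also that the final rounding only makes sense for integer-valued data:

```python
import numpy as np
from numpy.fft import fft, ifft


def rshift(A, r):
    """Left-shift each row of A by r[i] positions via the Fourier shift theorem."""
    n = A.shape[1]
    k = np.arange(n)[None, :]  # frequency index along each row
    phase = np.exp(2j * np.pi / n * r[:, None] * k)
    shifted = ifft(fft(A, axis=1) * phase, axis=1)
    # Drop the numerically tiny imaginary part and round back to integer values.
    return np.real(shifted).round()


A = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [1, 2, 3, 4]])
r = np.array([1, -1, 3])

shifted = rshift(A, r)
# A left shift by r is the same as np.roll with shift -r.
expected = np.array([np.roll(row, -x) for row, x in zip(A, r)])
print(shifted)
```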

Building on Divakar's excellent answer, you can easily apply this logic to a 3D array (which was the problem that brought me here in the first place). Here's an example: basically flatten your data, roll it, and reshape it afterwards:
import numpy as np
from skimage.util.shape import view_as_windows as viewW
def applyroll_30(cube, threshold=25, offset=500):
    flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])
    roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)
    rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
    rolled_cube = rolled_cube.reshape(cube.shape[0], cube.shape[1], cube.shape[2])
    return rolled_cube
def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
    """ Calculates the number of positions along the time axis we need to shift
    elements in order to trig the data.
    We return a 1D numpy array with one shift per flattened row (X*Y elements)
    """
    # argmax(...) finds the first position along the time axis where we are above threshold
    roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
    # ensure we don't have index out of bound
    roll_matrix[roll_matrix>cube_flattened.shape[1]] = cube_flattened.shape[1]
    return roll_matrix
def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
    # Negate the shifts,
    # otherwise we shift in the wrong direction for my application
    roll_matrix_flattened = -1 * roll_matrix_flattened
    # Concatenate with sliced to cover all rolls
    a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = cube_flattened.shape[1]
    result = viewW(a_ext,(1,n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
    result = result.reshape(cube_shape)
    return result
Divakar's answer doesn't do justice to how much more efficient this is on a large cube of data. I timed it on 400x400x2000 data stored as int8: an equivalent for-loop takes ~5.5 seconds, Seberg's answer ~3.0 seconds and strided_indexing.... ~0.5 seconds.

How to Get a List of Class Attribute

There is a list of instances of the same class, and I want to extract a certain attribute from every instance and build a new list:
class Test:
    def __init__(self, x):
        self.x = x
l = [Test(1), Test(2), Test(3), Test(4)]
Something like that, and I want to get the list [1, 2, 3, 4].

The best way to do it would probably be like this:
class Test:
    def __init__(self, x):
        self.x = x
l = [Test(1), Test(2), Test(3), Test(4)]
res = [inst.x for inst in l] # [1, 2, 3, 4]
or just do it from the start:
l = [Test(1).x, Test(2).x, Test(3).x, Test(4).x]
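As a small variation on the list comprehension (not part of the answer above), the standard library's operator.attrgetter can do the same job and extends naturally to several attributes. The second attribute y below is invented purely to show that:

```python
from operator import attrgetter


class Test:
    def __init__(self, x, y=None):
        self.x = x
        self.y = y  # hypothetical second attribute, only for illustration


l = [Test(1, "a"), Test(2, "b"), Test(3, "c"), Test(4, "d")]

# One attribute: equivalent to [inst.x for inst in l]
xs = list(map(attrgetter("x"), l))

# Several attributes at once: each element becomes a tuple
pairs = list(map(attrgetter("x", "y"), l))

print(xs)
print(pairs)
```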

Splitting a matrix using an array of indices

I have a matrix that I want to split into two. The rows of the two new matrices are interleaved, but I do have "start" and "stop" arrays indicating which rows belong to each new matrix.
I have given a small example below, including my own solution, which I do not find satisfying.
Is there a smarter way of splitting the matrix?
Note that there is a certain periodicity in this example, which is not the case in the real matrix.
import numpy as np
np.random.seed(1)
a = np.random.normal(size=[20,2])
print(a)
b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])
b = a[b_start[0]:b_stop[0], :]
c = a[c_start[0]:c_stop[0], :]
for i in range(1, len(b_start)):
    b = np.append(b, a[b_start[i]:b_stop[i], :], axis=0)
    c = np.append(c, a[c_start[i]:c_stop[i], :], axis=0)
print(b)
print(c)

You can use NumPy's fancy indexing (as written, this requires all segments of a matrix to have the same length, as they do here):
index_b = np.array([np.arange(b_start[i], b_stop[i]) for i in range(b_start.size)])
index_c = np.array([np.arange(c_start[i], c_stop[i]) for i in range(c_start.size)])
b = a[index_b].reshape(-1, a.shape[1])
c = a[index_c].reshape(-1, a.shape[1])
This will give you the same output.
Test run:
import numpy as np
np.random.seed(1)
a = np.random.normal(size=[20,2])
print(a)
b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])
index_b = np.array([np.arange(b_start[i], b_stop[i]) for i in range(b_start.size)])
index_c = np.array([np.arange(c_start[i], c_stop[i]) for i in range(c_start.size)])
b = a[index_b].reshape(-1, a.shape[1])
c = a[index_c].reshape(-1, a.shape[1])
print(b)
print(c)
Output:
[[ 1.62434536 -0.61175641]
[-0.52817175 -1.07296862]
[ 1.46210794 -2.06014071]
[-0.3224172 -0.38405435]
[-1.10061918 1.14472371]
[ 0.90159072 0.50249434]
[-0.69166075 -0.39675353]
[-0.6871727 -0.84520564]]
[[ 0.86540763 -2.3015387 ]
[ 1.74481176 -0.7612069 ]
[ 0.3190391 -0.24937038]
[ 1.13376944 -1.09989127]
[-0.17242821 -0.87785842]
[ 0.04221375 0.58281521]
[ 0.90085595 -0.68372786]
[-0.12289023 -0.93576943]
[-0.26788808 0.53035547]
[-0.67124613 -0.0126646 ]
[-1.11731035 0.2344157 ]
[ 1.65980218 0.74204416]]
I did 100 runs of the two approaches; the running times are:
0.008551359176635742  # python for loop
0.0034341812133789062  # fancy indexing
And for 10000 runs:
0.18994426727294922  # python for loop
0.26583170890808105  # fancy indexing

Congratulations on using np.append correctly. A lot of posters have problems with it.
But it is faster to collect the values in a list and do one concatenate. np.append makes a whole new array each time; list append just adds a pointer to the list in place.
b = []
c = []
for i in range(len(b_start)):  # start at 0: the lists begin empty, so the first segment must be included
    b.append(a[b_start[i]:b_stop[i], :])
    c.append(a[c_start[i]:c_stop[i], :])
b = np.concatenate(b, axis=0)
c = np.concatenate(c, axis=0)
or even
b = np.concatenate([a[i:j,:] for i,j in zip(b_start, b_stop)], axis=0)
The other answer effectively does
idx = np.hstack([np.arange(i,j) for i,j in zip(b_start, b_stop)])
a[idx,:]
Based on previous SO questions, I expect the two approaches to have about the same speed.
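The two variants can be compared directly with a short sketch. It reuses the start/stop arrays from the question but draws its own random data. The boolean-mask step at the end is an extra idea of mine and relies on the assumption, true in this example, that c is exactly the set of rows not in b:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(20, 2))

b_start = np.array([0, 5, 10, 15])
b_stop = np.array([2, 7, 12, 17])
c_start = np.array([2, 7, 12, 17])
c_stop = np.array([5, 10, 15, 20])

# Variant 1: slice each segment and concatenate once
b_concat = np.concatenate([a[i:j] for i, j in zip(b_start, b_stop)], axis=0)
c_concat = np.concatenate([a[i:j] for i, j in zip(c_start, c_stop)], axis=0)

# Variant 2: build one flat index array (works for segments of unequal length)
idx_b = np.hstack([np.arange(i, j) for i, j in zip(b_start, b_stop)])
b_index = a[idx_b]

# Assumption: c is the complement of b, so a boolean mask yields it directly
mask = np.zeros(len(a), dtype=bool)
mask[idx_b] = True
c_mask = a[~mask]

print(b_concat.shape, c_concat.shape)
```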
