How to shuffle data in python keeping some n number of rows intact - python-3.x

I want to shuffle my data in such a manner that each group of 4 rows remains intact. For example, if I have 16 rows, the first 4 rows can go to the end, the second 4 rows may move to third place, and so on, in any particular order. I am trying to do this in Python.

Reshape, splitting the first axis into two with the latter of length equal to the group length (= 4), giving us a 3D array, and then use np.random.shuffle, which shuffles along the first axis. The reshaped version, being a view into the original array, assigns the results back into it directly. Being in-situ, this should be pretty efficient (both memory-wise and performance-wise).
Hence, the implementation would be as simple as this -
import numpy as np

def array_shuffle(a, n=4):
    a3D = a.reshape(a.shape[0]//n, n, -1)  # a is the input 2D array; a3D is a view into it
    np.random.shuffle(a3D)                 # shuffles whole groups of n rows, in place
Another variant would be to generate a random permutation covering the length of the 3D array, index into it with that, and finally reshape back to 2D. This makes a copy, but seems more performant than the in-situ edits shown in the previous method.
The implementation would be -
def array_permuted_indexing(a, n=4):
    m = a.shape[0]//n
    a3D = a.reshape(m, n, -1)
    return a3D[np.random.permutation(m)].reshape(-1, a3D.shape[-1])
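A quick usage sketch of both helpers (the toy array here is just for illustration, assuming numpy imported as np as above):
a = np.arange(48).reshape(16, 3)     # 16 rows -> four groups of 4
array_shuffle(a, n=4)                # shuffles a in place, one group of 4 rows at a time
b = array_permuted_indexing(a, n=4)  # returns a new, group-shuffled copy; a stays as-is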
Step-by-step run of the shuffling method -
1] Setup random input array and split into a 3D version :
In [2]: np.random.seed(0)
In [3]: a = np.random.randint(11,99,(16,3))
In [4]: a3D = a.reshape(a.shape[0]//4,4,-1)
In [5]: a
Out[5]:
array([[55, 58, 75],
       [78, 78, 20],
       [94, 32, 47],
       [98, 81, 23],
       [69, 76, 50],
       [98, 57, 92],
       [48, 36, 88],
       [83, 20, 31],
       [91, 80, 90],
       [58, 75, 93],
       [60, 40, 30],
       [30, 25, 50],
       [43, 76, 20],
       [68, 43, 42],
       [85, 34, 46],
       [86, 66, 39]])
2] Check the 3D array :
In [6]: a3D
Out[6]:
array([[[55, 58, 75],
        [78, 78, 20],
        [94, 32, 47],
        [98, 81, 23]],
       [[69, 76, 50],
        [98, 57, 92],
        [48, 36, 88],
        [83, 20, 31]],
       [[91, 80, 90],
        [58, 75, 93],
        [60, 40, 30],
        [30, 25, 50]],
       [[43, 76, 20],
        [68, 43, 42],
        [85, 34, 46],
        [86, 66, 39]]])
3] Shuffle it along the first axis (in-situ) :
In [7]: np.random.shuffle(a3D)
In [8]: a3D
Out[8]:
array([[[69, 76, 50],
        [98, 57, 92],
        [48, 36, 88],
        [83, 20, 31]],
       [[43, 76, 20],
        [68, 43, 42],
        [85, 34, 46],
        [86, 66, 39]],
       [[55, 58, 75],
        [78, 78, 20],
        [94, 32, 47],
        [98, 81, 23]],
       [[91, 80, 90],
        [58, 75, 93],
        [60, 40, 30],
        [30, 25, 50]]])
4] Verify the changes back in the original array :
In [9]: a
Out[9]:
array([[69, 76, 50],
       [98, 57, 92],
       [48, 36, 88],
       [83, 20, 31],
       [43, 76, 20],
       [68, 43, 42],
       [85, 34, 46],
       [86, 66, 39],
       [55, 58, 75],
       [78, 78, 20],
       [94, 32, 47],
       [98, 81, 23],
       [91, 80, 90],
       [58, 75, 93],
       [60, 40, 30],
       [30, 25, 50]])
Runtime test
In [102]: a = np.random.randint(11,99,(16000,3))
In [103]: df = pd.DataFrame(a)
# @piRSquared's soln1
In [106]: %timeit df.iloc[np.random.permutation(np.arange(df.shape[0]).reshape(-1, 4)).ravel()]
100 loops, best of 3: 2.88 ms per loop
# @piRSquared's soln2
In [107]: %%timeit
...: d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
...: pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
100 loops, best of 3: 3.48 ms per loop
# Array based soln-1
In [108]: %timeit array_shuffle(a, n=4)
100 loops, best of 3: 3.38 ms per loop
# Array based soln-2
In [109]: %timeit array_permuted_indexing(a, n=4)
10000 loops, best of 3: 125 µs per loop

Setup
Consider the dataframe df
df = pd.DataFrame(np.random.randint(10, size=(16, 4)), columns=list('WXYZ'))
df
W X Y Z
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
Option 1
Inspired by @B.M. and @Divakar
I'm using np.random.permutation because it returns a permuted copy of what was passed. Here I pass it the row positions arranged into groups of 4 (np.arange(16).reshape(-1, 4)); permutation shuffles those groups as whole rows, and ravel flattens the result back into a row order I can pass directly to iloc to return what I need.
df.iloc[np.random.permutation(np.arange(16).reshape(-1, 4)).ravel()]
W X Y Z
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
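To see what iloc receives: the intermediate index array that would have produced the output above looks like this (assuming numpy imported as np):
idx = np.random.permutation(np.arange(16).reshape(-1, 4))
# permutation shuffles whole rows of the reshaped index array, i.e. whole groups of 4
idx
# array([[12, 13, 14, 15],
#        [ 0,  1,  2,  3],
#        [ 8,  9, 10, 11],
#        [ 4,  5,  6,  7]])
idx.ravel()
# array([12, 13, 14, 15,  0,  1,  2,  3,  8,  9, 10, 11,  4,  5,  6,  7])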
Option 2
I'd add a level to the index that we can call on when shuffling
d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
d
W X Y Z
0 0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
1 4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
2 8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
3 12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
Then we can shuffle
pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
W X Y Z
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
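For reference, a single cross-section looks like this; xs(i) pulls out group i and drops the added level, which is why concatenating the cross-sections in a permuted group order gives the grouped shuffle:
d.xs(2)
    W  X  Y  Z
8   6  6  2  8
9   0  7  0  8
10  7  5  5  2
11  6  0  9  5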

The code below does the magic in plain Python:
from random import shuffle
import numpy as np
from math import ceil

# creating sample dataset
d = [[i*4 + j for i in range(5)] for j in range(25)]
a = np.array(d, int)
print('--------------Input--------------')
print(a)

gl = 4  # group length, i.e. the number of rows that must stay intact
parts = ceil(1.0*len(a)/gl)  # number of partitions of the dataset, based on the group length

# creating partition list and shuffling it to use later
x = [i for i in range(int(parts))]
shuffle(x)

# create the new dataset based on the shuffled partition list
fg = x.pop(0)
f = a[gl*fg:gl*(fg+1)]
for i in x:
    t = a[gl*i:(i+1)*gl]
    f = np.concatenate((f, t), axis=0)

print('--------------Output--------------')
print(f)
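A small variant of the same idea, for reference: collect the slices first and concatenate once, since np.concatenate inside a loop copies the growing array on every pass (a, gl and parts are the names defined above):
order = list(range(int(parts)))
shuffle(order)
f = np.concatenate([a[gl*i:gl*(i+1)] for i in order], axis=0)
print(f)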

Related

pandas dataframe slicing to a subset from row #y1 to row #y2

I can't see the forest for the trees right now:
I have a Pandas dataframe:
import pandas as pd
df = pd.DataFrame({'UTCs': [32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783],
                   'Temperature': [5, 7, 7, 9, 12, 9, 9, 4],
                   'Humidity': [50, 50, 48, 47, 46, 47, 48, 52],
                   'pressure': [998, 998, 999, 999, 999, 999, 1000, 1000]})
print(df)
UTCs Temperature Humidity pressure
0 32776 5 50 998
1 32777 7 50 998
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
7 32783 4 52 1000
Now I want to create a subset of all dataset columns for UTCs between 32778 and 32782
I can choose a subset with:
df_sub=df.iloc[2:7,:]
print(df_sub)
UTCs Temperature Humidity pressure
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
But how can I do that with a condition like 'choose rows between UTCs=32778 and UTCs=32782'?
Something like
df_sub = df.iloc[df[df.UTCs == 32778] : df[df.UTCs == 32783], : ]
does not work.
Any hint for me?
Use between for boolean indexing:
df_sub = df[df['UTCs'].between(32778, 32783, inclusive='left')]
output:
UTCs Temperature Humidity pressure
2 32778 7 48 999
3 32779 9 47 999
4 32780 12 46 999
5 32781 9 47 999
6 32782 9 48 1000
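The same selection written as an explicit boolean mask, equivalent to between(..., inclusive='left'):
df_sub = df[(df['UTCs'] >= 32778) & (df['UTCs'] < 32783)]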

Reconcile with np.fromiter and multidimensional arrays in Python

I am working on building a multi-dimensional array in order to produce the following result in a Jupyter notebook.
I have tried several pieces of code, but I cannot seem to produce the fourth column with the number range 30 - 35. The closest I have gotten is using this code:
import numpy as np
from itertools import chain
def fun(i):
    return tuple(4*i + j for j in range(4))
a = np.fromiter(chain.from_iterable(fun(i) for i in range(6)), 'i', 6 * 4)
a.shape = 6, 4
print(repr(a))
I am expecting the following results:
array([[ 1,  2,  3, 30],
       [ 4,  5,  6, 31],
       [ 7,  8,  9, 32],
       [10, 11, 12, 33],
       [13, 14, 15, 34],
       [20, 21, 22, 35]])
You can create a flat array with all your consecutive numbers like this:
import numpy as np
a = np.arange(1, 16)
print(a)
# output:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Then you reshape it:
a = np.reshape(a, (5, 3))
print(a)
# output
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]
[13 14 15]]
Then you add a new row:
a = np.vstack([a, np.arange(20, 23)])
print(a)
# output:
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]
[13 14 15]
[20 21 22]]
You create the column to add:
col = np.arange(30, 36).reshape(-1, 1)
print(col)
# output:
[[30]
[31]
[32]
[33]
[34]
[35]]
You add it:
a = np.concatenate((a, col), axis=1)
print(a)
# output:
[[ 1 2 3 30]
[ 4 5 6 31]
[ 7 8 9 32]
[10 11 12 33]
[13 14 15 34]
[20 21 22 35]]
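If you specifically want to keep np.fromiter from your original attempt, a minimal sketch along the same lines can build the expected array as well; the row() helper and its offsets below are just one illustrative way to encode the jumps in your target data:
import numpy as np
from itertools import chain

def row(i):
    # rows 0-4 start at 1, 4, 7, ... in steps of 3; the last row jumps to 20
    base = 3 * i + 1 if i < 5 else 20
    return (base, base + 1, base + 2, 30 + i)  # fourth column runs 30..35

a = np.fromiter(chain.from_iterable(row(i) for i in range(6)), dtype=int, count=6 * 4)
a.shape = 6, 4
print(repr(a))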

Creating a TXT file and seeking a position in Python

I have given the following variables:
signal1 = 'speed'
bins1 = [0, 10, 20, 30, 40]
signal2 = 'rpm'
bins2 = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
hist_result = [ [1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
]
I want to create a .TXT file which would look like this with tab separated values:
speed>= 0 10 20 30
speed< 10 20 30 40
rpm>= rpm<
0 500 1 4 5 12
500 1000 5 8 9 0
1000 1500 6 7 11 19
1500 2000 1 4 5 12
2000 2500 -5 8 9 0
2500 3000 -6 7 11 19
3000 3500 1 4 5 12
3500 4000 -5 8 9 0
4000 4500 -6 7 11 19
I have written the following code:
#!/usr/bin/env python3
import os
from datetime import datetime
import time
signal1 = 'speed'
bins1 = [0, 10, 20, 30, 40]
signal2 = 'rpm'
bins2 = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
hist_result = [ [1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
]
filename = f"{datetime.now().strftime('%Y%m%d_%H%M%S')}_{signal1}_results.TXT"
with open(filename, 'w') as f:
    # write the bin1 range
    f.write('\n\n\n')
    f.write('\t\t\t\t')
    f.write(signal1 + '>=')
    for bin in bins1[:-1]:
        f.write('\t' + str(bin))
    f.write('\n')
    f.write('\t\t\t\t')
    f.write(signal1 + '<')
    for bin in bins1[1:]:
        f.write('\t' + str(bin))
    f.write('\n')
    # write the bin2 range
    f.write('\t\t')
    f.write(signal2 + '>=' + '\t' + signal2 + '<' + '\n')
    f.write('\t\t')
    # store the cursor position from where hist result will be written line by line
    track_cursor_pos = []
    curr = bins2[0]
    for next in bins2[1:]:
        f.write(str(curr) + '\t' + str(next))
        track_cursor_pos.append(f.tell())
        f.write('\n\t\t')
        curr = next
    f.write('\n')
    print(track_cursor_pos)
    i = 0
    # Everything is fine until here
    # Code below doesn't work as expected!?
    for result in hist_result:
        f.seek(track_cursor_pos[i], os.SEEK_SET)
        for r in result:
            f.write('\t' + str(r))
        f.write('\n')
        i += 1
But, this is producing the TXT file whose contents look like this:
speed>= 0 10 20 30
speed< 10 20 30 40
rpm>= rpm<
0 500 1 4 5 12
0 -5 8 9 0
00 -6 7 11 19
1 4 5 12
00 -5 8 9 0
00 -6 7 11 19
1 4 5 12
00 -5 8 9 0
00 -6 7 11 19
I think I am not using f.seek() properly. Any suggestion would be appreciated. Thanks in advance.
You don't have to seek inside the file to write your data. seek() followed by write() overwrites characters at that position rather than inserting them, which is why the later rows clobber the bin labels written earlier; simply write each line once, in order:
signal1 = 'speed'
bins1 = [0, 10, 20, 30, 40]
signal2 = 'rpm'
bins2 = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
hist_result = [ [1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
]
with open('data.txt', 'w') as f_out:
    print('\t{signal1}>=\t{bins}'.format(signal1=signal1, bins='\t'.join(map(str, bins1[:-1]))), file=f_out)
    print('\t{signal1}<\t{bins}'.format(signal1=signal1, bins='\t'.join(map(str, bins1[1:]))), file=f_out)
    print('{signal2}>=\t{signal2}<'.format(signal2=signal2), file=f_out)
    for a, b, data in zip(bins2[:-1], bins2[1:], hist_result):
        print(a, b, *data, sep='\t', file=f_out)
Creates data.txt:
speed>= 0 10 20 30
speed< 10 20 30 40
rpm>= rpm<
0 500 1 4 5 12
500 1000 -5 8 9 0
1000 1500 -6 7 11 19
1500 2000 1 4 5 12
2000 2500 -5 8 9 0
2500 3000 -6 7 11 19
3000 3500 1 4 5 12
3500 4000 -5 8 9 0
4000 4500 -6 7 11 19

How exactly 'abs()' and 'argsort()' works together

# Creating DataFrame
import pandas as pd
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]}); df
output:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
output:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
The output is confusing; can you please explain in detail how the line below works?
df.loc[(df.CCC-aValue).abs().argsort()]
With abs flipping negative values and the subtraction shifting values around, it's hard to visualize what's going on just by looking. Instead, it helps to calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
or
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is determined by the Python syntax; that part is straightforward.
(df.CCC-aValue).abs() takes the absolute value of df.CCC-aValue, argsort returns the indexes that would sort those values, and df.loc then shows the rows in that sorted-index order.
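Putting the same steps back into pandas terms with the frame from the question (argsort here returns the positions 1, 0, 2, 3, which .loc can then use as labels because the index is the default 0..3 range):
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7], 'BBB': [10, 20, 30, 40], 'CCC': [100, 50, -30, -50]})
aValue = 43.0

dist = (df.CCC - aValue).abs()   # distances from aValue: 57.0, 7.0, 73.0, 93.0
order = dist.argsort()           # positions that would sort those distances: 1, 0, 2, 3
print(df.loc[order])             # rows reordered by closeness to aValue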

Pandas: Random integer between values in two columns

How can I create a new column that holds a random integer between the values of two columns in each particular row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Result:
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But it doesn't work, of course, because data.start is a Series, not a number.
How can I use numpy.random with data from columns as a vectorized operation?
You are close; you need to specify axis=1 to process the data by rows, and change data.start/end to x.start/end so you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
If you want to truly vectorize this, you can generate a random number between 0 and 1 per row and scale it into each row's start/end range:
(
data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
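On NumPy 1.17+ the Generator API also broadcasts array-valued bounds, which gives a fully vectorized per-row draw without the scaling trick (a sketch, assuming the data frame from the question):
import numpy as np
import pandas as pd

data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

rng = np.random.default_rng()
# low/high broadcast element-wise, so each row draws from its own [start, end) range
data['rand_between'] = rng.integers(data['start'].to_numpy(), data['end'].to_numpy())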
