Convert large arrays in pandas dataframe to large numpy array

Convert large arrays in pandas dataframe to large numpy array - python-3.x

I have a dateframe that has a column that looks like this:
0 [ [ 1051, 0, 10181, 62, 17, ...
1 [ [ 882, 0, 9909, 59, 23, 9...
2 [ [ 1061, 0, 10192, 60, 17, ...
3 [ [ 122, 4, 501, 2, 8, 3, ...
4 [ [ 397, 1, 859, 9, 8, 5, ...
5 [ [ 1213, 1, 10791, 23, 17, ...
6 [ [ 1395, 3, 11147, 0, 17, ...
7 [ [ 757, 3, 1900, 34, 23, 8...
8 [ [ 129, 0, 507, 10, 8, 3, ...
9 [ [ 1438, 0, 11177, 26, 2, ...
10 [ [ 1272, 1, 10901, 7, 17, ...
An example row with fewer features would be something like this:
[[1,2,3,4],[2,3,4,5],[3,4,5,6]]
The datatype is a string so json.loads has to be used to convert them to arrays that are [N_TIMESTAMPS, N_FEATURES] where each feature is a numerical value.
In order to use this data as input for a neural network I have to convert this column into a numpy array of shape: [N_SAMPLES, N_TIMESTAMPS, N_FEATURES]. So, like this:
[[[1,2,3,4],[2,3,4,5],[3,4,5,6]],[[1,2,3,4],[2,3,4,5],[3,4,5,6]]]
This is how I am doing it now:
train_x = np.array(
df.time_stream.apply(json.loads).apply(np.array).apply(
lambda x: x.reshape(N_TIMESTAMPS,N_FEATURES).tolist()).values.tolist()
)
For a dataset that has 268,521 rows, this computation takes 12.5 mins. Not ideal, but it was working; however, it's not scalable. For the new dataset with 756,961 rows it never finishes (N_TIMESTAMPS = 100; N_FEATURES = 54) because it uses up all of the RAM and the computer crashes.
I'm looking for recommendations for how to make this faster, and perhaps more memory efficient. One issue is that a lot of swap is being used.

Related

Get the length of every sentence before padding in torchtext bucketiterator

Is it possible to get the length of every sentence before padding in torchtext bucketiterator :
train_loader = torchtext.legacy.data.BucketIterator(train_data, batch_size = 64, repeat=True, shuffle=True, sort_key = lambda x: len(x.text), sort=False, sort_within_batch=True, device = device)
bucketiterator dataloader :
inputs: tensor([[ 34, 87, 2, ..., 227, 239, 263],
[ 138, 7, 1006, ..., 840, 142, 665],
[ 549, 4, 1028, ..., 11, 14, 4],
...,
[ 1, 1, 5, ..., 66, 23, 13],
[ 1, 1, 1062, ..., 177, 252, 1587],
[ 1, 1, 66, ..., 553, 52, 73]]), shape: torch.Size([64, 91])
Like when using pytorch dataloader:
train_loader = data.DataLoader(train_data, batch_size = 64, shuffle=True, collate_fn=padding)
def padding(batch):
doc = [doc['input'] for doc in batch]
len_doc = [len(doc['input']) for doc in batch]
doc_pad = pad_sequence(doc, batch_first=True, padding_value=0)
return doc_pad, len_doc
pytorch dataloader :
inputs: tensor([[ 2, 1396, 2686, ..., 0, 0, 0],
[ 2, 1391, 1396, ..., 0, 0, 0],
[ 2, 2018, 2597, ..., 0, 0, 0],
...,
[ 2, 1546, 1623, ..., 0, 0, 0],
[ 2, 1435, 1396, ..., 0, 0, 0],
[ 2, 1391, 1396, ..., 0, 0, 0]]), shape: torch.Size([64, 40])
inputs_len_before_padding: tensor([18, 8, 21, 16, 16, 12, 40, 12, 9, 12, 17, 12, 17, 15, 16, 12, 8, 24,
25, 10, 22, 8, 8, 13, 12, 22, 17, 14, 21, 14, 19, 13, 21, 8, 28, 16,
31, 24, 23, 19, 10, 7, 16, 12, 16, 12, 17, 12, 18, 11, 8, 13, 17, 14,
11, 13, 13, 20, 8, 12, 22, 7, 9, 11]), shape: torch.Size([64])

Here is a minimal example that uses torchtext.data.Field and torchtext.data.BucketIterator:
import torchtext.data as data
# sample data
text = [
'This is sentence 1.',
'This sentence is a bit longer than the previous sentence.'
]
# define field -- notice include_lengths is set to True
text_field = data.Field(include_lengths=True, tokenize=lambda x: x.split())
fields = [('text', text_field)]
# create dataset and build vocabulary
examples = [data.Example.fromlist([t], fields) for t in text]
dataset = data.Dataset(examples, fields)
text_field.build_vocab(dataset)
# create iterator
data_iter = data.BucketIterator(dataset, batch_size=2, shuffle=False)
# the text field will now return both the data tensor and the length of the input text
for x in data_iter:
print('Data:', x.text[0])
print('Lengths:', x.text[1])
This should print (data tensor shortened for brevity):
Data: tensor([[ 2, 2],
...
[ 1, 10]])
Lengths: tensor([ 4, 10])

Pytorch how to reshape/reduce the number of filters without altering the shape of the individual filters

With a 3D tensor of shape (number of filters, height, width), how can one reduce the number of filters with a reshape which keeps the original filters together as whole blocks?
Assume the new size has dimensions chosen such that a whole number of the original filters can fit side by side in one of the new filters. So an original size of (4, 2, 2) can be reshaped to (2, 2, 4).
A visual explanation of the side by side reshape where you see the standard reshape will alter the individual filter shapes:
I have tried various pytorch functions such as gather and select_index but not found a way to get to the end result in a general manner (i.e. works for different numbers of filters and different filter sizes).
I think it would be easier to rearrange the tensor values after performing the reshape but could not get a tensor of the pytorch reshaped form:
[[[1,2,3,4],
[5,6,7,8]],
[[9,10,11,12],
[13,14,15,16]]]
to:
[[[1,2,5,6],
[3,4,7,8]],
[[9,10,13,14],
[11,12,15,16]]]
for completeness, the original tensor before reshaping:
[[[1,2],
[3,4]],
[[5,6],
[7,8]],
[[9,10],
[11,12]],
[[13,14],
[15,16]]]

Another option is to construct a list of parts and concatenate them
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
y = torch.cat([x[i::2] for i in range(2)], dim=2)
print('Before\n', x)
print('After\n', y)
which gives
Before
tensor([[[0, 0],
[0, 0]],
[[1, 1],
[1, 1]],
[[2, 2],
[2, 2]],
[[3, 3],
[3, 3]]])
After
tensor([[[0, 0, 1, 1],
[0, 0, 1, 1]],
[[2, 2, 3, 3],
[2, 2, 3, 3]]])
Or a little more generally we could write a function that takes groups of neighbors along a source dimension and concatenates them along a destination dimension
def group_neighbors(x, group_size, src_dim, dst_dim):
assert x.shape[src_dim] % group_size == 0
return torch.cat([x[[slice(None)] * (src_dim) + [slice(i, None, group_size)] + [slice(None)] * (len(x.shape) - (src_dim + 2))] for i in range(group_size)], dim=dst_dim)
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
# read as "take neighbors in groups of 2 from dimension 0 and concatenate them in dimension 2"
y = group_neighbors(x, group_size=2, src_dim=0, dst_dim=2)
print('Before\n', x)
print('After\n', y)

You could do it by chunking tensor and then recombining.
def side_by_side_reshape(x):
n_pairs = x.shape[0] // 2
filter_size = x.shape[-1]
x = x.reshape((n_pairs, 2, filter_size, filter_size))
return torch.stack(list(map(lambda x: torch.hstack(x.unbind()), k)))
>> p = torch.arange(1, 91).reshape((10, 3, 3))
>> side_by_side_reshape(p)
tensor([[[ 1, 2, 3, 10, 11, 12],
[ 4, 5, 6, 13, 14, 15],
[ 7, 8, 9, 16, 17, 18]],
[[19, 20, 21, 28, 29, 30],
[22, 23, 24, 31, 32, 33],
[25, 26, 27, 34, 35, 36]],
[[37, 38, 39, 46, 47, 48],
[40, 41, 42, 49, 50, 51],
[43, 44, 45, 52, 53, 54]],
[[55, 56, 57, 64, 65, 66],
[58, 59, 60, 67, 68, 69],
[61, 62, 63, 70, 71, 72]],
[[73, 74, 75, 82, 83, 84],
[76, 77, 78, 85, 86, 87],
[79, 80, 81, 88, 89, 90]]])
but I know it's not ideal since there is map, list and unbind which disrupts memory. This is what I offer till I figure out how to do it via view only (so a real reshape)

Merging pickled .npz files in a desired format

I have multiple npz files which i want to merge into one npz.file with the format similar to "mnist.npz"
the format of mnist.npz is:
((array([[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]],
[0, 0, 0, ..., 0, 0, 0]]], dtype=uint8),
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8))
Here two arrays are merged into one big npz file.
My two npz arrays are:
x_array:
[[[252, 251, 253],
[151, 150, 152],
[ 28, 25, 27],
...,
[ 30, 25, 27],
[ 30, 25, 27],
[ 32, 27, 29]],
[ 23, 18, 20]],
[[ 50, 92, 163],
[ 55, 90, 163],
[ 75, 105, 176],
...,
[148, 197, 242],
[109, 157, 208],
[109, 165, 222]],
[[ 87, 104, 155],
[ 82, 112, 168],
...,
[ 29, 52, 105],
[ 30, 55, 111],
[ 36, 55, 106]]]
y_array:
[1, 1, 1, 1, 1, 1]
When i tried to merge my files, the output i got is:
(array([[[252, 251, 253],
[151, 150, 152],
[ 28, 25, 27],
...,
[ 30, 25, 27],
[ 30, 25, 27],
[ 32, 27, 29]],
[ 23, 18, 20]]], dtype=uint8), array([[[ 50, 92, 163],
[ 55, 90, 163],
[ 75, 105, 176],
...,
[148, 197, 242],
[109, 157, 208],
[109, 165, 222]],
[ 87, 104, 155],
[ 82, 112, 168],
...,
[ 29, 52, 105],
[ 30, 55, 111],
[ 36, 55, 106]]], dtype=uint8),1, 1, 1, 1, 1, 1)
So in the last line, my array is formated as
1, 1, 1, 1, 1, 1
instead of something like:
array([1, 1, 1, 1, 1, 1], dtype=uint8)
My code for merging two npz files is:
data = load('x_array.npz',allow_pickle=True)
lst = data.files
for item in lst:
x_train = data[item]
#print((x_item,x_train))
data1 = load('y_array.npz',allow_pickle=True)
lst1 = data1.files
for item in lst1:
y_train = data1[item]
out1 = (*x_train,*y_train)
np.savez('out1.npz',out1)
print(out1)
Can anyone please suggest how i can convert my second array of (1, 1, 1, 1, 1, 1) to array([1, 1, 1, 1, 1, 1], dtype=uint8)? Any suggestions are helpful

After going through my code i found out that by changing the line
out1 = (*x_train,*y_train)
to
out1 = (*x_train,y_train)

Method of vectors in various vector length to fixed length (NLP)

Recently I have been looking around about Natural Language Processing and its vectorization method and advantages of each vectorizer.
I am into character to vectorize, but it seems like the most concerns about the character vectorizer for each word is the embedding to have fixed length.
I do not want to just embed them with 0, which is well known as 0 padding, for instance, the target fixed length is 100 and 72 characters only exists then all 28 of 0 will be padded at the end.
"The example of paragraphs and phrases.... ... in vectorizer form" < with length 72
becomes
[0, 25, 60, 12, 24, 0, 19, 99, 7, 32, 47, 11, 19, 43, 18, 19, 6, 25,
43, 99, 0, 32, 40, 14, 20, 5, 37, 47, 99, 11, 29, 7, 19, 47, 18, 20,
60, 18, 19, 2, 19, 11, 31, 130, 130, 76, 0, 32, 40, 14, 20, 7, 19, 47,
18, 20, 60, 11, 37, 43, 99, 11, 29, 99, 17, 39, 47, 11, 31, 18, 19,
43, 0, 19, 77, 0, 0, 0, 0, 0, 0, 0, 0, ...., 0, 0, 0, 0, 0, 0]
.
.
I want to make the vectors be in a fair distribution form in N fixed dimensions, not like the one above
If you know any papers or algorithms preferring consider this matter, or common way to produce a fixed length vectors from various length of vectors please share .
.
.
Further information added as gojomo requested;
I am trying to get the character level vectors for words in corpus.
Let say, in above example, "The example of paragraphs...." starts with
T [40]
h [17]
e [3]
e [3]
x [53]
a [1]
m [21]
p [25]
l [14]
e [3]
Notice that each character has its own number (etc, could be ascii) and word represents the vectors of character vectors combination, for example,
The [40, 17, 3]
example [3, 53, 1, 21, 25, 14, 3]
which the vectors are not in same dimension. With the case mention above, many people are padding 0 at the end to make it in uniform size
For example, if someone wants to make the dimension of each word to be 300, then 297 of 0s will be padded to letter "The" and 293 of 0s will be padded to "example"., like
The [40, 17, 3, 0, 0, 0, 0, 0, ...., 0]
example [3, 53, 1, 21, 25, 14, 3, 0, 0, 0, 0, 0, ...., 0]
Now I do not think this padding method is appropriate to my experiments so I want to know if there are any methods to convert its vectors to in uniform form with not sparsed form(if this term is allowed).
Even with the phrase with two words, "The example" only takes 11 characters long , still not long enough either.
Whatever the case is that, I would like to know if there are some well known techniques to convert the informal length of vectors to some fixed length.
Thank you !

Load data from file and normalize

How to normalize data loaded from file? Here what I have. Data looks kind of like this:
65535, 3670, 65535, 3885, -0.73, 1
65535, 3962, 65535, 3556, -0.72, 1
Last value in each line is a target. I want to have the same structure of the data but with normalized values.
import numpy as np
dataset = np.loadtxt('infrared_data.txt', delimiter=',')
# select first 5 columns as the data
X = dataset[:, 0:5]
# is that correct? Should I normalize along 0 axis?
normalized_X = preprocessing.normalize(X, axis=0)
y = dataset[:, 5]
Now the question is, how to pack correctly normalized_X and y back, that it has the structure:
dataset = [[normalized_X[0], y[0]],[normalized_X[1], y[1]],...]

It sounds like you're asking for np.column_stack. For example, let's set up some dummy data:
import numpy as np
x = np.arange(25).reshape(5, 5)
y = np.arange(5) + 1000
Which gives us:
X:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
Y:
array([1000, 1001, 1002, 1003, 1004])
And we want:
new = np.column_stack([x, y])
Which gives us:
New:
array([[ 0, 1, 2, 3, 4, 1000],
[ 5, 6, 7, 8, 9, 1001],
[ 10, 11, 12, 13, 14, 1002],
[ 15, 16, 17, 18, 19, 1003],
[ 20, 21, 22, 23, 24, 1004]])
If you'd prefer less typing, you can also use:
In [4]: np.c_[x, y]
Out[4]:
array([[ 0, 1, 2, 3, 4, 1000],
[ 5, 6, 7, 8, 9, 1001],
[ 10, 11, 12, 13, 14, 1002],
[ 15, 16, 17, 18, 19, 1003],
[ 20, 21, 22, 23, 24, 1004]])
However, I'd discourage using np.c_ for anything other than interactive use, simply due to readability concerns.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Convert large arrays in pandas dataframe to large numpy array - python-3.x

Related

Get the length of every sentence before padding in torchtext bucketiterator

Pytorch how to reshape/reduce the number of filters without altering the shape of the individual filters

Merging pickled .npz files in a desired format

Method of vectors in various vector length to fixed length (NLP)

Load data from file and normalize

Categories

Resources