How to convert the following processing using numpy - python-3.x

I am trying to improve a part of code that is slowing down the whole script significantly, right to the point of making it unfeasible. In particular the piece of code is:
for vectors1 in EC1:
    for vectors2 in EC2:
        r = np.add(vectors1, vectors2)
        for vectors3 in CDC:
            result = np.add(r, vectors3).tolist()
            if result not in states:  # This is what makes it very slow
                states.append(result)
EC1, EC2 and CDC are lists whose elements are lists of lists. As an example, one iteration gives:
vectors1: [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]]
vectors2: [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
vectors3: [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]]
result: [[2, 0, 0], [2, 0, 0], [2, 1, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], [2, 0, 0], [4, 1, 0], [2, 1, 0]]
Notice that vectors1, vectors2 and vectors3 are each one element of EC1, EC2 and CDC respectively, and that 'result' is the element-wise sum of vectors1, vectors2 and vectors3. The vectors therefore cannot be altered or sorted in any way, otherwise 'result' would no longer have the expected value.
The first two loops sum each item in EC1 with each item in EC2; the innermost loop then adds each item in CDC to that intermediate sum ('r'). I use numpy.add() for both sums and finally convert 'result' back to a list. So basically I am handling lists of lists as the elements of EC1, EC2 and CDC.
The problem is that I must deal with hundreds of thousands (close to 1M) of results, and checking whether a result already exists in the states list slows things down drastically, especially since states grows as more results are processed.
I've tried to stay inside the numpy world by managing everything as numpy arrays. First I declare states as:
states = np.empty([9, 3], int)
Then I concatenate the result array to the states array, after first checking whether it already exists in states:
for vectors1 in EC1:
    for vectors2 in EC2:
        r = np.add(vectors1, vectors2)
        for vectors3 in CDC:
            result = np.add(r, vectors3)
            if not np.isin(states, result).any():
                np.concatenate(states, result, axis=0)
But I am definitely doing something wrong, because result is not being concatenated to states. I've also tried, without success:
np.append(states, result, axis=0)
Could this be parallelized in some way?

You can do the sums solely in numpy by using broadcasting
res = ((EC1[:,None,:] + EC2).reshape(-1, 1, 3) + CDC).reshape(-1, 3)
given that EC1, EC2 and CDC are arrays.
Afterwards you can filter out the duplicates with
np.unique(res, axis=0)
But like Lucas, I would strongly advise you to filter the arrays beforehand. For your example arrays that would shrink the number of rows in res from 729 to 8.
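A minimal sketch of the broadcasting approach above, assuming EC1, EC2 and CDC are each a single (9, 3) integer array built from the question's example rows (that shape is an assumption; a list of 9x3 blocks would need an extra axis). The helper name `all_sums` is hypothetical. It also shows how de-duplicating the inputs first shrinks the intermediate array.

```python
import numpy as np

# Example rows from the question, treated here as (9, 3) arrays (an assumption).
EC1 = np.array([[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0],
                [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]])
EC2 = np.array([[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0],
                [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]])
CDC = np.array([[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0],
                [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]])


def all_sums(a, b, c):
    """Sum every row of a with every row of b and every row of c."""
    return ((a[:, None, :] + b).reshape(-1, 1, 3) + c).reshape(-1, 3)


res = all_sums(EC1, EC2, CDC)        # one row per (EC1, EC2, CDC) triple
states = np.unique(res, axis=0)      # distinct sums, sorted row-wise

# De-duplicating the inputs first keeps the intermediate array small.
res_small = all_sums(*(np.unique(x, axis=0) for x in (EC1, EC2, CDC)))

print(res.shape, res_small.shape, states.shape)  # (729, 3) (8, 3) (6, 3)
```

Note that np.unique returns the rows in sorted order, so the order in which the sums were first produced is not preserved.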

I'm not sure how large the data are that you are working with but this may speed things up somewhat:
import numpy as np
EC1 = [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0]]
EC2 = [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
CDC = [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [2, 1, 0], [2, 1, 0]]
EC1.sort()
EC2.sort()
CDC.sort()
unique_triples = dict()
for v1 in EC1:
    for v2 in EC2:
        for v3 in CDC:
            key = str(v1) + str(v2) + str(v3)  # lists are not hashable but strings are
            if key not in unique_triples:
                # .tolist() stores plain Python ints rather than NumPy scalars
                unique_triples[key] = np.add(np.add(v1, v2), v3).tolist()
The basic idea is to remove duplicate (EC1, EC2, CDC) triples and do the additions only on the unique ones. The lists are sorted first so that they are in lexicographic order.
A dictionary has O(1) average-time lookups, so these lookups are likely faster than searching a list.
Whether this is faster overall may depend on how large the data are and how many unique triples they contain.
The 3-vector sums are the values of the dictionary, e.g.
list(unique_triples.values()) for me gives:
>>> list(unique_triples.values())
[[0, 0, 0], [2, 1, 0], [2, 0, 0], [4, 1, 0], [2, 0, 0], [4, 1, 0], [4, 0, 0], [6, 1, 0]]
I did not remove the duplicates in the original lists of lists here. If the application you are looking at allows, it is also likely beneficial to remove these duplicates in EC1, EC2, and CDC before iterating over the values.
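The closing suggestion can be sketched as follows, using shortened versions of the example lists. This follows the answer's reading of EC1, EC2 and CDC as flat lists of 3-vectors (for the 9x3 blocks in the question, each result would need to be converted to nested tuples before hashing). The helper name `dedupe` is hypothetical. A set of tuples replaces the `result not in states` list search, and a separate list keeps the distinct sums in first-seen order.

```python
import numpy as np


def dedupe(rows):
    """Return the distinct rows in first-seen order (hypothetical helper)."""
    # Lists are not hashable, so each row is converted to a tuple first.
    return list(dict.fromkeys(map(tuple, rows)))


EC1 = [[2, 0, 0], [0, 0, 0], [0, 0, 0], [2, 0, 0]]
EC2 = [[0, 0, 0], [2, 0, 0], [0, 0, 0], [0, 0, 0]]
CDC = [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0]]

seen = set()   # average O(1) membership test, unlike `in` on a list
states = []    # distinct sums in the order they were first produced
for v1 in dedupe(EC1):
    for v2 in dedupe(EC2):
        r = np.add(v1, v2)
        for v3 in dedupe(CDC):
            key = tuple(np.add(r, v3).tolist())
            if key not in seen:
                seen.add(key)
                states.append(list(key))

print(states)
```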

Related

Upsampling xarray DataArray similar to np.repeat()?

I'm hoping to upsample values in a large 2-dimensional DataArray (below). Is there an xarray tool similar to np.repeat() which can be applied in each dimension (x and y)? In the example below, I would like to duplicate each array entry in both x and y.
import xarray as xr
import numpy as np
x = np.arange(3)
y = np.arange(3)
x_mesh,y_mesh = np.meshgrid(x, y)
arr = x_mesh*y_mesh
df = xr.DataArray(arr, coords={'x':x, 'y':y}, dims=['x','y'])
Input:
array([[0, 0, 0],
       [0, 1, 2],
       [0, 2, 4]])
Desired output:
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 2, 2],
       [0, 0, 1, 1, 2, 2],
       [0, 0, 2, 2, 4, 4],
       [0, 0, 2, 2, 4, 4]])
I am aware of the xesmf regridding tools, but they seem more complicated than necessary for the application I have in mind.
There is a simple solution for this with np.kron.
>>> arr
array([[0, 0, 0],
       [0, 1, 2],
       [0, 2, 4]])
>>> np.int_(np.kron(arr, np.ones((2,2))))
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 2, 2],
       [0, 0, 1, 1, 2, 2],
       [0, 0, 2, 2, 4, 4],
       [0, 0, 2, 2, 4, 4]])
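As an alternative sketch, np.repeat itself can be applied once per axis. It gives the same values as the np.kron approach and keeps the integer dtype without a cast. The example works on the plain NumPy array only; wrapping the result back into a DataArray is left out, and the name `factor` is a hypothetical parameter.

```python
import numpy as np

x = np.arange(3)
y = np.arange(3)
arr = np.outer(y, x)   # same values as x_mesh * y_mesh in the question

factor = 2  # hypothetical upsampling factor
# Repeat every row, then every column, `factor` times.
upsampled = np.repeat(np.repeat(arr, factor, axis=0), factor, axis=1)

# The coordinate vectors can be repeated the same way if they are needed.
x_up = np.repeat(x, factor)
y_up = np.repeat(y, factor)

print(upsampled)
```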

How to get padding mask from input ids?

Consider a batch of 3 pre-processed sentences (tokenized, numericalized and padded), shown below:
batch = torch.tensor([
    [1, 2, 0, 0],
    [4, 0, 0, 0],
    [3, 5, 6, 7]
])
where 0 stands for the [PAD] token.
Thus, what would be an efficient approach to generate a padding masking tensor of the same shape as the batch assigning zero at [PAD] positions and assigning one to other input data (sentence tokens)?
In the example above it would be something like:
padding_masking=
tensor([
[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 1, 1, 1]
])
The following is tested on pytorch 1.3.1.
import torch
pad_token_id = 0
batch = torch.tensor([
    [1, 2, 0, 0],
    [4, 0, 0, 0],
    [3, 5, 6, 7]
])
pad_mask = ~(batch == pad_token_id)  # True/1 wherever the token is not [PAD]
print(pad_mask)
Output
tensor([[1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1]], dtype=torch.uint8)
You can get your desired result with
padding_masking = batch > 0
If you want ints instead of booleans, use
padding_masking.type(torch.int)

Split list into sublists by delimiter

I have a list of lists:
[[0, 0], [0, 0], [0, 0], [0, 1, 0], [0, 0]]
I want to split it into what comes before the list [0,1,0] and what comes after like so:
[[0, 0], [0, 0], [0, 0]], [[0, 0]]
If I had a list:
[[0, 0], [0, 0], [0, 0], [0, 1, 0], [0, 0], [0, 1, 0], [0, 0]]
I would want to split it into a list like this:
[[0, 0], [0, 0], [0, 0]], [[0, 0]], [[0, 0]]
I am really stuck with this while loop, which does not seem to reset the temporary list at the right place:
def count_normal_jumps(jumps):
    _temp1 = []
    normal_jumps = []
    jump_index = 0
    while jump_index <= len(jumps) - 1:
        if jumps[jump_index] == [0,0]:
            _temp1.append(jumps[jump_index])
        else:
            normal_jumps.append(_temp1)
            _temp1[:] = []
        jump_index += 1
    return normal_jumps
Why does this not work and is there a better approach?
You can use a for loop to append the sublists in the list to the last sublist in a list of lists, and append a new sublist to the list of lists when the input sublist is equal to [0, 1, 0]:
def split(lst):
    output = [[]]
    for l in lst:
        if l == [0, 1, 0]:
            output.append([])
        else:
            output[-1].append(l)
    return output
or you can use itertools.groupby:
from itertools import groupby
def split(lst):
    return [list(g) for k, g in groupby(lst, key=[0, 1, 0].__ne__) if k]
so that:
print(split([[0, 0], [0, 0], [0, 0], [0, 1, 0], [0, 0]]))
print(split([[0, 0], [0, 0], [0, 0], [0, 1, 0], [0, 0], [0, 1, 0], [0, 0]]))
outputs:
[[[0, 0], [0, 0], [0, 0]], [[0, 0]]]
[[[0, 0], [0, 0], [0, 0]], [[0, 0]], [[0, 0]]]
You can do something like this:
myList = [[0, 0], [0, 0], [0, 0], [0, 1, 0], [0, 0]]
toMatch = [0, 1, 0]
allMatches = []
currentMatches = []
for lst in myList:
    if lst == toMatch:
        allMatches.append(currentMatches)
        currentMatches = []
    else:
        currentMatches.append(lst)
# push leftovers when end is reached
if currentMatches:
    allMatches.append(currentMatches)
print(allMatches)
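None of the answers says why the original while loop fails. A likely explanation, sketched below with a condensed version of that loop: `normal_jumps.append(_temp1)` stores a reference to the same list object rather than a copy, and `_temp1[:] = []` then empties that object in place, so every stored entry is the one list that keeps being cleared and refilled. The group after the last delimiter is also never appended. The function names `broken` and `split_on` are hypothetical.

```python
def broken(jumps):
    """Condensed version of the question's loop, bug included."""
    temp, groups = [], []
    for jump in jumps:
        if jump == [0, 0]:
            temp.append(jump)
        else:
            groups.append(temp)  # stores a reference to temp, not a copy
            temp[:] = []         # empties that same object in place
    return groups


def split_on(jumps, delimiter):
    """Same loop with the two problems fixed."""
    temp, groups = [], []
    for jump in jumps:
        if jump == delimiter:
            groups.append(temp)
            temp = []            # rebind to a fresh list instead of clearing
        else:
            temp.append(jump)
    groups.append(temp)          # keep the group after the last delimiter
    return groups


data = [[0, 0], [0, 0], [0, 0], [0, 1, 0], [0, 0]]
print(broken(data))               # [[[0, 0]]]
print(split_on(data, [0, 1, 0]))  # [[[0, 0], [0, 0], [0, 0]], [[0, 0]]]
```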

Batch multiplication/division with scalar in tensorflow

I'm struggling to find a simple way to multiply a batch of tensors with a batch of scalars.
I have a tensor with dimensions N, 4, 4. What I want is to divide each 4x4 tensor in the batch by its own value at position [3, 3].
For example, let's say I have:
A = [[[1, 1, 1, 0],
      [1, 1, 1, 0],
      [1, 1, 1, 0],
      [0, 0, 0, a]],
     [[1, 1, 1, 0],
      [1, 1, 1, 0],
      [1, 1, 1, 0],
      [0, 0, 0, b]]]
What I want is to obtain the following:
B = [[[1/a, 1/a, 1/a, 0],
      [1/a, 1/a, 1/a, 0],
      [1/a, 1/a, 1/a, 0],
      [0, 0, 0, 1]],
     [[1/b, 1/b, 1/b, 0],
      [1/b, 1/b, 1/b, 0],
      [1/b, 1/b, 1/b, 0],
      [0, 0, 0, 1]]]
You should just do:
B = A / A[:, 3:, 3:]
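The one-liner works because the range slices `3:` keep both trailing axes, so the divisor has shape (N, 1, 1) and broadcasts over each 4x4 matrix. A minimal sketch of this, using NumPy as a stand-in (an assumption made so the example has no TensorFlow dependency; the same slicing and broadcasting rules apply to this expression in TensorFlow), with the concrete values 2 and 4 chosen here in place of a and b:

```python
import numpy as np

# Two 4x4 matrices whose [3, 3] entries play the role of a and b.
A = np.array([[[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 2]],
              [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 4]]],
             dtype=float)

scale = A[:, 3:, 3:]   # shape (N, 1, 1): range slices keep both axes
B = A / scale          # each scalar is broadcast over its own 4x4 matrix

print(scale.shape)     # (2, 1, 1)
print(B[0])
```

Indexing with plain integers instead, as in `A[:, 3, 3]`, would give shape (N,), which does not line up with the batch axis when broadcast against (N, 4, 4).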

How to rotate the results in a board made of lists?

I'm trying to learn how to code with Python, and I have tried this exercise in which I have to rotate this board by 90°, but I don't get how. Thanks for the help.
numlist = [1,3,0,2]
board = [[0, 0, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 2],
         [0, 3, 0, 0]]
I use this to print a table when it is given a numlist:
def ctcb(numlist):  # Create The Chess Board
    n = 0
    board = []
    the_len = len(numlist)
    for i in range(the_len):  # create a list with nested lists
        board.append([])
        for n in range(the_len):
            board[i].append(0)  # fills nested lists with data
    while n < len(board):
        for x, y in enumerate(numlist):
            board[y][x] = y
        n += 1
    # print(board)
    for e in board:
        print(e)
The result should be this one:
board = [[0, 0, 2, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 3],
         [0, 1, 0, 0]]
We can use zip(*board) to transpose the board, and then use reversed to get the reverse of that transpose.
board = [[0, 0, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 2],
         [0, 3, 0, 0]]
print([list(x) for x in reversed(list(zip(*board)))])
# [[0, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 3], [0, 1, 0, 0]]
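Transposing and then reversing the rows, as above, rotates the board counterclockwise. Reversing the rows first and then transposing rotates it clockwise. A small sketch of both directions; the function names `rotate_ccw` and `rotate_cw` are hypothetical:

```python
def rotate_ccw(board):
    """Rotate 90 degrees counterclockwise: transpose, then reverse the rows."""
    return [list(row) for row in reversed(list(zip(*board)))]


def rotate_cw(board):
    """Rotate 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*reversed(board))]


board = [[0, 0, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 2],
         [0, 3, 0, 0]]

for row in rotate_ccw(board):
    print(row)
```

Both functions return a new list of lists and leave the input board unchanged.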
