Calculate cumulative sum of pyspark array column - apache-spark

I have a spark dataframe with an array column that looks like this:
+--------------+
| x |
+--------------+
| [1, 1, 0, 1] |
| [0, 0, 0, 0] |
| [0, 0, 1, 1] |
| [0, 0, 0, 1] |
| [1, 0, 1] |
+--------------+
I want to add a new column with another array that contains the cumulative sum of x at each index. The result should look like this:
+--------------+---------------+
| x | x_running_sum |
+--------------+---------------+
| [1, 1, 0, 1] | [1, 2, 2, 3] |
| [0, 0, 0, 0] | [0, 0, 0, 0] |
| [0, 0, 1, 1] | [0, 0, 1, 2] |
| [0, 0, 0, 1] | [0, 0, 0, 1] |
| [1, 0, 1] | [1, 1, 2] |
+--------------+---------------+
How can I create the x_running_sum column? I've tried using some of the higher order functions like transform, aggregate, and zip_with, but I haven't found a solution yet.

To compute a cumulative sum, I slice the array up to each index position and reduce the sliced values:
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(x=[1, 1, 0, 1]),
    Row(x=[0, 0, 0, 0]),
    Row(x=[0, 0, 1, 1]),
    Row(x=[0, 0, 0, 1]),
    Row(x=[1, 0, 1])
])
(df
 .selectExpr('x', "TRANSFORM(sequence(1, size(x)), index -> REDUCE(slice(x, 1, index), CAST(0 as BIGINT), (acc, el) -> acc + el)) AS x_running_sum")
 .show(truncate=False))
Output
+------------+-------------+
|x |x_running_sum|
+------------+-------------+
|[1, 1, 0, 1]|[1, 2, 2, 3] |
|[0, 0, 0, 0]|[0, 0, 0, 0] |
|[0, 0, 1, 1]|[0, 0, 1, 2] |
|[0, 0, 0, 1]|[0, 0, 0, 1] |
|[1, 0, 1] |[1, 1, 2] |
+------------+-------------+
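For readers unfamiliar with the SQL higher-order functions, the logic of the expression above can be mirrored in plain Python. This is only a sketch of what is being computed, not Spark code, and the names running_sum_by_slices and running_sum are made up for illustration:

```python
from itertools import accumulate

def running_sum_by_slices(xs):
    # Mirrors TRANSFORM(sequence(1, size(x)), index -> REDUCE(slice(x, 1, index), ...)):
    # for each 1-based index, sum the first `index` elements.
    return [sum(xs[:index]) for index in range(1, len(xs) + 1)]

def running_sum(xs):
    # Same result in a single pass, carrying the accumulator forward.
    return list(accumulate(xs))

rows = [[1, 1, 0, 1], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1], [1, 0, 1]]
for x in rows:
    print(x, running_sum_by_slices(x))
```

The slice-based version re-sums a prefix for every index, so the work per array grows quadratically with its length; for short arrays like these that is unlikely to matter.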

Related

How to select rows from two different Numpy arrays conditionally?

I have two Numpy 2D arrays and I want to get a single 2D array by selecting rows from the original two arrays. The selection is done conditionally. Here is the simple Python way,
import numpy as np
a = np.array([4, 0, 1, 2, 4])
b = np.array([0, 4, 3, 2, 0])
y = np.array([[0, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 1, 0],
[0, 0, 1, 1],
[0, 0, 1, 0]])
x = np.array([[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 1, 0, 0],
[1, 1, 1, 1],
[0, 0, 1, 0]])
z = np.empty(shape=x.shape, dtype=x.dtype)
for i in range(x.shape[0]):
    z[i] = y[i] if a[i] >= b[i] else x[i]
print(z)
Looking at numpy.select, I tried, np.select([a >= b, a < b], [y, x], -1) but got ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (5,) and arg 1 with shape (5, 4).
Could someone help me write this in a more efficient Numpy manner?
This should do the trick, but it would be helpful if you could show an example of your expected output:
>>> np.where((a >= b)[:, None], y, x)
array([[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 1, 0, 0],
[0, 0, 1, 1],
[0, 0, 1, 0]])
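The [:, None] is what makes the shapes compatible: the comparison a >= b has shape (5,), and adding a trailing axis turns it into (5, 1), which broadcasts across the 4 columns of y and x. A small sketch that checks the result against the loop from the question (the names mask, col_mask and z_loop are only for illustration):

```python
import numpy as np

a = np.array([4, 0, 1, 2, 4])
b = np.array([0, 4, 3, 2, 0])
y = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 1], [0, 0, 1, 0]])
x = np.array([[0, 0, 0, 0], [1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0]])

mask = a >= b              # shape (5,): one boolean per row
col_mask = mask[:, None]   # shape (5, 1): broadcasts across the 4 columns
z = np.where(col_mask, y, x)

# Reference result built with the explicit per-row choice from the question.
z_loop = np.array([y[i] if a[i] >= b[i] else x[i] for i in range(x.shape[0])])
print(np.array_equal(z, z_loop))
```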

Hermitian Adjacency Matrix of Digraph

I am trying to find a pythonic way to calculate the Hermitian adjacency matrix in Python and I'm really struggling. The definition of a Hermitian Adjacency matrix is shown in this image:
It works as follows. Let's say we have two nodes named i and j. If there is a directed edge going both from i to j and from j to i, then the corresponding matrix value at location [i, j] should be set to 1. If there is only a directed edge from i to j, then the matrix element at location [i, j] should be set to +i. And if there is only a directed edge from j to i, then the matrix element at location [i, j] should be set to -i. All other matrix values are set to 0.
I cannot figure out a smart way to make this Hermitian Adjacency Matrix that doesn't involve iterating through my nodes one by one. Any advice?
I don't think there's a built-in for this, so I've cobbled together my own vectorised solution:
import numpy as np
import networkx as nx
# Create standard adjacency matrix and convert it from a sparse to a dense array
A = nx.linalg.graphmatrix.adjacency_matrix(G).toarray()
# Add it to its transpose
B = A + A.T
# Get row index matrix
I = np.indices(B.shape)[0] + 1
# Apply vectorised formula to get Hermitian adjacency matrix
H = np.multiply(B/2 * (2*I)**(B%2), 2*A-1).astype(int)
Explanation
Let's start with a directed graph:
We start by creating the normal adjacency matrix using nx.linalg.graphmatrix.adjacency_matrix(), giving us the following matrix:
>>> A = nx.linalg.graphmatrix.adjacency_matrix(G).toarray()
[[1, 1, 0, 1, 0, 1, 0, 0],
[1, 0, 0, 1, 0, 0, 1, 0],
[1, 1, 1, 1, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0, 1, 1],
[0, 1, 0, 0, 1, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0]]
We can then add this matrix to its transpose, giving us 2 in every location where there is a directed edge going from i to j and vice-versa, a 1 in every location where only one of these edges exists, and a 0 in every location where no edge exists:
>>> B = A + A.T
>>> B
[[2, 2, 1, 1, 1, 2, 0, 0],
[2, 0, 1, 2, 0, 1, 2, 0],
[1, 1, 2, 1, 0, 1, 0, 0],
[1, 2, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 1, 1, 1],
[2, 1, 1, 0, 1, 0, 1, 1],
[0, 2, 0, 0, 1, 1, 0, 1],
[0, 0, 0, 0, 1, 1, 1, 0]]
Now, we want to apply a function to the matrix so that 0 maps to 0, 2 maps to 1, and 1 maps to the row number i. We can use np.indices() to get the row number, and the following equation: x/2 * (2*i)**(x%2), where i is the row number and x is the element. Finally, we need to multiply elements in positions where no edge ij exists by -1. This can be vectorised as follows:
>>> I = np.indices(B.shape)[0] + 1
>>> H = np.multiply(B/2 * (2*I)**(B%2), 2*A-1).astype(int)
>>> H
[[ 1, 1, -1, 1, -1, 1, 0, 0],
[ 1, 0, -2, 1, 0, -2, 1, 0],
[ 3, 3, 1, 3, 0, 3, 0, 0],
[-4, 1, -4, 0, -4, 0, 0, 0],
[ 5, 0, 0, 5, 0, -5, -5, -5],
[ 1, 6, -6, 0, 6, 0, 6, 6],
[ 0, 1, 0, 0, 7, -7, 0, 7],
[ 0, 0, 0, 0, 8, -8, -8, 0]]
As required.
We can check that this is correct by using a naïve iterate-through-nodes approach:
>>> check = np.zeros([8,8])
>>> for i in G.nodes:
        for j in G.nodes:
            if (i, j) in G.edges:
                if (j, i) in G.edges:
                    check[i-1, j-1] = 1
                else:
                    check[i-1, j-1] = i
            else:
                if (j, i) in G.edges:
                    check[i-1, j-1] = -i
                else:
                    check[i-1, j-1] = 0
>>> (check == H).all()
True
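Note that in the literature on Hermitian adjacency matrices, the i in the definition usually denotes the imaginary unit rather than a row number. If that is the reading you need (an assumption, since the image with the definition is not reproduced here), a minimal sketch could look like the following. The 3-node matrix A is made up for illustration and stands in for the dense 0/1 adjacency matrix:

```python
import numpy as np

# Hypothetical 3-node digraph: 0 <-> 1 and 0 -> 2 (no self-loops).
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 0, 0]])

both = A * A.T               # 1 where edges run in both directions
H = both + 1j * (A - A.T)    # +i for an i->j edge only, -i for a j->i edge only

print(H)
print(np.array_equal(H, H.conj().T))  # a Hermitian matrix equals its conjugate transpose
```

With this reading the matrix is Hermitian by construction, which the last line checks.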

How to change only the diagonal elements of a 2D list?

So I am trying to create an NxN 2D array and then change its diagonal elements to 1. Here is my code:
arr=[1,1,1,2,2,2]
table=[[0]*len(arr)]*len(arr)
for i in range(0,len(arr)):
    table[i][i]=1
print(table)
However, whenever I run this code, I get this output:
[[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]]
I am looking to get this:
[[1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1]]
I have been staring at my code for hours and I cannot figure out what's wrong.
The interesting thing here is that the for loop only ever edits one list; table just holds six references to it. (In this case, that list starts out as [0, 0, 0, 0, 0, 0].) You can see this by printing the id of each list in table with id():
>>> for t in table:
        print(id(t))
2236544254464
2236544254464
2236544254464
2236544254464
2236544254464
2236544254464
Your numbers are likely different than mine, but they are all the same number, nevertheless. You also can see that the edits to one list are applied to the others in table by putting a print(table) statement after each index assignment statement.
To fix this, I would recommend building the table with a list comprehension instead, so that each row is a separate list. For example:
table = [[0]*len(arr) for _ in range(len(arr))]
If you check the ids of each list:
>>> for t in table:
        print(id(t))
2236544617664
2236544616064
2236544616320
2236544615872
2236544618368
2236544622720
Since they are different, you can now use the method for changing only the diagonals:
>>> for i in range(0,len(arr)):
        table[i][i]=1
>>> table
[[1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1]]
Your 2D "array" contains 6 lists which are the same list. Changes to any of those lists will also be reflected in the other lists. Consider this:
>>> l = [0] * 6
>>> x = [l]
>>> l[0] = 1
>>> l
[1, 0, 0, 0, 0, 0]
>>> x
[[1, 0, 0, 0, 0, 0]]
>>> x = [l, l, l]
>>> x
[[1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
>>> x[-1][-1] = 100
>>> x
[[1, 0, 0, 0, 0, 100], [1, 0, 0, 0, 0, 100], [1, 0, 0, 0, 0, 100]]
This is because the list x contains the list l, so any changes to l are also seen through the reference to the same list in x.
The problem arises when you multiply a list that contains a mutable object: the result holds multiple references to that same object rather than copies of it.
You should initialise your table like this:
table = [[0 for j in range(len(arr))] for i in range(len(arr))]
or
table = [[0] * len(arr) for i in range(len(arr))]
which, despite the use of multiplication, works because each list is different.
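The difference between the two ways of building the table can be checked directly with the is operator. A small sketch (the names shared and separate are only for illustration):

```python
n = 3
shared = [[0] * n] * n                  # n references to one inner list
separate = [[0] * n for _ in range(n)]  # n distinct inner lists

print(shared[0] is shared[1])      # True
print(separate[0] is separate[1])  # False

shared[0][0] = 1
separate[0][0] = 1
print(shared)    # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(separate)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The inner [0] * n is safe in both cases because integers are immutable; only the outer multiplication duplicates a reference to a mutable list.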
You can create your table and populate it simultaneously in nested loops:
arr=[1,1,1,2,2,2]
table = []
for i in range(len(arr)):
    table.append([0]*len(arr))
    for j in range(len(arr)):
        if i == j:
            table[i][j] = 1
print(table)
#[[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1]]
Interesting.
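Try using numpy to avoid the list trap. np.array copies the values into a new array, so the rows no longer share storage even though the input list repeats the same row: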
import numpy as np
org_row = [0]*5
l = [org_row]*5
x = np.array(l, np.int32)
for i in range(len(x)):
    x[i][i]=1
print(x)
Output:
[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 1 0 0]
[0 0 0 1 0]
[0 0 0 0 1]]
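If numpy is an option anyway, the loop can be skipped as well. A sketch using np.fill_diagonal and np.eye, with arr taken from the question:

```python
import numpy as np

arr = [1, 1, 1, 2, 2, 2]
n = len(arr)

table = np.zeros((n, n), dtype=int)
np.fill_diagonal(table, 1)        # sets the main diagonal in place

identity = np.eye(n, dtype=int)   # builds the same matrix directly
print(np.array_equal(table, identity))
```

Calling table.tolist() converts the result back to a plain list of lists if that is what the rest of the code expects.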

How to convert 3d array as string

I need to convert a 3d array I have in Python3 into a string with a specific format. My current 3d array is below:
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
I need this as a string, but also want to replace any instance of 0 and make it into the string '----'. If a value is not 0, then leave it.
I tried using join: ''.join(str(e) for e in myArray)
but the format did not come out as I wanted.
I expected my results to look like this:
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
1 ---- ---- ---- ---- ---- ----
But my format turned out like this using the join method:
[[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]][[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]][[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]][[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]]
You need to loop over all of the layers of lists in your nested list. This will require nested list comprehension. Try this:
'\n\n'.join('\n'.join(' '.join(str(x or '----')for x in y)for y in z)for z in myArray)
Note the x or '----' bit. That will evaluate to the first Truthy value. Since zeroes are Falsey, you'll get the dashes if x is zero, or the actual value if it's not.
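To see how the three levels of join fit together, here is the same expression spread over several lines and applied to a smaller nested list. The list small is made up for illustration:

```python
small = [[[1, 0], [0, 2]],
         [[0, 0], [3, 0]]]

text = '\n\n'.join(                       # blank line between layers
    '\n'.join(                            # one line per row
        ' '.join(str(x or '----') for x in row)   # values separated by spaces
        for row in layer
    )
    for layer in small
)
print(text)
```

Keep in mind that x or '----' replaces every falsy value, not just the integer 0, so it would also replace 0.0, None or an empty string.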
First, I have more formally defined your 3D array:
mylist = [[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]],
[[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]]
This way I have a single list of lists of lists, instead of 4 lists of lists that happen to be written one after another. With this format, we can loop through every element and build up a string as we go:
abcd = ""
for i in mylist:
    for j in i:
        for k in j:
            if k:  # equivalent to: if k != 0
                abcd = abcd + str(k) + " "
            else:
                abcd = abcd + "---- "  # four dashes, as in the expected output
        abcd = abcd + "\n"  # end of one row
    abcd = abcd + "\n"  # blank line between layers
print(abcd)
This will give the desired output, but @mypetlion's method is nicer and more pythonic.

How to Group list into sublist in a backward manner

What is the simplest and reasonably efficient way to slice a list into a list of the sliced sub-list sections in a reverse manner?
Here is the portion of my code that groups list into sublist:
binary1 = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
process1 = [binary1[i:i+4] for i in range(0, len(binary1), 4)]
print(process1)
Result: [[1, 0, 0, 1], [1, 0, 1, 0], [1, 0, 1, 1], [0, 1]]
However, this is not the grouping I want. I want the list grouped starting from the end, so that any short group comes first. Here is the result I expect:
Result: [[1, 0], [0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1]]
I hope you could help me. Thank you!
binary1 = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
# The leftover elements (len % 4) form the short first group
rest = len(binary1) % 4
print(([binary1[:rest]] if rest else []) + [binary1[i:i+4] for i in range(rest, len(binary1), 4)])
Will print:
[[1, 0], [0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1]]
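A sketch of a reusable version, assuming the short leading group should hold the len(items) % size leftover elements and should be omitted when the list divides evenly. The name group_from_end is hypothetical:

```python
def group_from_end(items, size):
    # Number of leftover elements that go into the short leading group
    # (0 when the list length is a multiple of `size`).
    rest = len(items) % size
    head = [items[:rest]] if rest else []
    return head + [items[i:i + size] for i in range(rest, len(items), size)]

binary1 = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
print(group_from_end(binary1, 4))
print(group_from_end(list(range(8)), 4))
print(group_from_end(list(range(5)), 4))
```

Trying a few different lengths is worthwhile here, because an offset that happens to work for 14 elements may not work for other sizes.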
