Replace indicator values with actual values - python-3.x

I have a numpy array like this
array([[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
and an array with values
array([1, 2, 3, 4])
I would like to replace the ones in the first two-dimensional array with the corresponding values in the second array. Each row of the first array has exactly one 1, and the second array holds exactly one replacement value per row.
Result:
array([[0, 0, 1],
[2, 0, 0],
[0, 3, 0],
[0, 0, 4]])
I would like an elegant solution to achieve this, without loops and such.

Let's say a is the 2D data array and b the second 1D array.
An elegant solution would be -
a[a==1] = b
For performance, leveraging the fact that there's exactly one 1 per row, we could also use indexing -
a[np.arange(len(a)),a.argmax(1)] = b
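As a quick check of both one-liners, here is a minimal sketch on the arrays from the question. The names a1 and a2 are only illustrative copies, so that each approach starts from the untouched input.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
b = np.array([1, 2, 3, 4])

# Approach 1: boolean mask. Masked cells are visited in row-major order,
# so with exactly one 1 per row the i-th masked cell receives b[i].
a1 = a.copy()
a1[a1 == 1] = b

# Approach 2: integer indexing. argmax(1) gives the column of the 1 in each row.
a2 = a.copy()
a2[np.arange(len(a2)), a2.argmax(1)] = b

print(a1.tolist())  # [[0, 0, 1], [2, 0, 0], [0, 3, 0], [0, 0, 4]]
```

Both approaches rely on the "exactly one 1 per row" guarantee; if a row had no 1 or several, the mask version would raise a shape error and the argmax version would silently write to the first maximum.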
Selectively assign per row
If we want to mask and assign values only for selected rows, we could use one more level of masking. So, let's say we have the rows to be selected as -
select_rows = np.array([1,3])
Then, we could do -
rowmask = np.isin(np.arange(len(a)),select_rows)
So, the replacement for the first approach would be -
a[(a==1) & rowmask[:,None]] = b[rowmask]
And for the second one -
a[np.arange(len(a))[rowmask],a.argmax(1)[rowmask]] = b[rowmask]
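The row-selective variants can be sketched the same way. This assumes the arrays from the question and select_rows = [1, 3] as above; masked and indexed are illustrative names for the two results.

```python
import numpy as np

a = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
b = np.array([1, 2, 3, 4])

select_rows = np.array([1, 3])
# True for the rows that should be updated, False elsewhere
rowmask = np.isin(np.arange(len(a)), select_rows)

# Mask-based: keep only the 1s that sit in a selected row
masked = a.copy()
masked[(masked == 1) & rowmask[:, None]] = b[rowmask]

# Index-based: restrict both the row and the column indices to the selected rows
indexed = a.copy()
indexed[np.arange(len(a))[rowmask], a.argmax(1)[rowmask]] = b[rowmask]

print(masked.tolist())  # [[0, 0, 1], [2, 0, 0], [0, 1, 0], [0, 0, 4]]
```

Rows 0 and 2 keep their original 1 because they are not in select_rows.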

Related

Pyspark: How to count the number of each equal distance interval in RDD

I have an RDD[Double]. I want to divide its range into k equal-width intervals and then count how many elements fall into each interval.
For example, the RDD is like [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10].
As you can see, each element of the RDD falls into exactly one of the intervals. Then I want to count the elements in each interval. Here, there is one element in each of [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), while [6,7) and [7,8) each have two elements, [8,9) has none, and [9,10] has one element.
Finally, I expect an array like array([1,1,1,1,1,1,2,2,0,1]).
Try this. I have assumed that first element of the range is inclusive and last exclusive. Please confirm on this. For example when considering the range [0,1] and element is 0 the condition is element >= 0 and element < 1.
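# array_range is assumed to be a list of [lower, upper] pairs, one per interval
countElementsWithinRange = []
values = rdd.collect()  # collect once instead of once per interval
for lower, upper in array_range:
    counter = 0
    for element in values:
        # lower bound inclusive, upper bound exclusive
        if lower <= element < upper:
            counter += 1
    countElementsWithinRange.append(counter)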
print(rdd.collect())
# [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
print(countElementsWithinRange)
# [1, 1, 1, 1, 1, 1, 2, 2, 0, 0]
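Note that the last count above is 0 rather than the expected 1, because the value 10 is excluded by the exclusive upper bound. A minimal sketch of the convention the question asks for (every interval half-open except the last, which is closed) can be written with NumPy on locally collected values. This is a swapped-in technique, not the loop above; PySpark's RDD.histogram is documented with the same bucket convention, so rdd.histogram(10) may be worth trying on the distributed data, but that call is not exercised here.

```python
import numpy as np

# Values from the question, standing in for rdd.collect()
values = [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]

# 10 equal-width bins over [0, 10]; np.histogram closes only the last bin,
# so the value 10 is counted in [9, 10]
counts, edges = np.histogram(values, bins=10, range=(0, 10))

print(counts.tolist())  # [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]
```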

Check if all list values in dataframe column are the same [duplicate]

If the type of a column in dataframe is int, float or string, we can get its unique values with columnName.unique().
But what if each value in this column is a list, e.g. [1, 2, 3]?
How could I get the unique values of this column?
I think you can convert values to tuples and then unique works nice:
df = pd.DataFrame({'col':[[1,1,2],[2,1,3,3],[1,1,2],[1,1,2]]})
print (df)
            col
0     [1, 1, 2]
1  [2, 1, 3, 3]
2     [1, 1, 2]
3     [1, 1, 2]
print (df['col'].apply(tuple).unique())
[(1, 1, 2) (2, 1, 3, 3)]
L = [list(x) for x in df['col'].apply(tuple).unique()]
print (L)
[[1, 1, 2], [2, 1, 3, 3]]
You cannot apply unique() on a non-hashable type such as list. You need to convert to a hashable type to do that.
A better option in recent versions of pandas is duplicated(), which avoids iterating over the values to convert them back to lists.
df[~df.col.apply(tuple).duplicated()]
That returns the rows holding the unique values, still stored as lists.
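To answer the title question directly (are all list values in the column the same?), one possible sketch builds on the same tuple conversion and counts the distinct values. The names as_tuples, all_same and unique_rows are illustrative, not from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'col': [[1, 1, 2], [2, 1, 3, 3], [1, 1, 2], [1, 1, 2]]})

# Lists are not hashable, so convert each one to a tuple first
as_tuples = df['col'].apply(tuple)

# The column holds a single repeated list only if there is exactly one distinct tuple
all_same = as_tuples.nunique() == 1

# Rows carrying the first occurrence of each distinct list
unique_rows = df[~as_tuples.duplicated()]

print(all_same)                     # False
print(unique_rows['col'].tolist())  # [[1, 1, 2], [2, 1, 3, 3]]
```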

How to choose multiple columns from a sympy matrix? Broken indexing?

I'm trying to pick multiple columns from a sympy matrix. However, the indexing does not work as expected. The code
import sympy as sp
stdA = sp.Matrix(
[
[-2, 1, 1, 0],
[1, 1, 0, 1]
]
)
b = sp.Matrix(
[
[3],
[2]
]
)
B1 = stdA[:, [0, 1]]
B2 = stdA[:, [0, 2]]
B3 = stdA[:, [0, 3]]
B4 = stdA[:, [1, 2]]
B5 = stdA[:, [1, 3]]
B6 = stdA[:, [2, 3]]
print("std A =", stdA)
print("b =", b)
print("B1 =", B1)
print("B2 =", B2)
print("B3 =", B3)
print("B4 =", B4)
print("B5 =", B5)
print("B6 =", B6)
prints
See the issue with B3, and the matrices after it? It's supposed to read B3 = Matrix([[-2, 1], [0, 1]]). I thought slicing SymPy matrices produces copies of them, so stdA shouldn't be altered in place.
What is causing this erroneous behaviour, and how can I choose specific columns from a matrix with simple indexing?
You requested all rows and columns 0 and 3. That is what you got:
>>> B3
Matrix([
[-2, 0],
[ 1, 1]])
Matrix presents its contents as a list of rows, so each selected column reads top to bottom: column 0 is (-2, 1) and column 3 is (0, 1).
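A small sketch, using the matrix from the question, shows that the slice is correct and that the expected output was the same data listed column by column, which is the transpose of B3.

```python
import sympy as sp

stdA = sp.Matrix([[-2, 1, 1, 0],
                  [1, 1, 0, 1]])

# All rows, columns 0 and 3; the result is a new matrix and stdA is left unchanged
B3 = stdA[:, [0, 3]]

print(B3.tolist())    # rows of B3:    [[-2, 0], [1, 1]]
print(B3.T.tolist())  # columns of B3: [[-2, 1], [0, 1]]
```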

python 3 double loop comprehension clarification

I'm curious about the double for-loop comprehension.
Comprehension:
multilist = [[row*col for col in range(colNum)] for row in range(rowNum)]
Normal double loop:
for row in range(rowNum):
    for col in range(colNum):
        multilist[row][col] = row*col
Both of the methods yield the same outcome. For instance, I insert 3 as my row and 5 as my col, they would produce
[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8]]
My question is why the col for-loop is placed as the outer loop in the comprehension instead of the row for-loop? I would welcome any explanation.
Thank you.
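In a nested list comprehension such as yours, the for loop of the outer comprehension (the one written last, over rowNum) is executed first, and the inner comprehension is evaluated once for each row.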
multilist = [[row*col for col in range(colNum)] for row in range(rowNum)]
Therefore, the col for-loop is still the inner loop in the comprehension.
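To make the evaluation order concrete, here is a short sketch that expands the comprehension into explicit loops, using 3 rows and 5 columns as in the question. The names expanded, inner and flat are illustrative.

```python
rowNum, colNum = 3, 5

multilist = [[row * col for col in range(colNum)] for row in range(rowNum)]

# Equivalent explicit loops: the row loop is the outer one
expanded = []
for row in range(rowNum):
    inner = []                     # built by the inner comprehension
    for col in range(colNum):
        inner.append(row * col)
    expanded.append(inner)

# In a flat comprehension the for clauses read left to right, outermost first
flat = [row * col for row in range(rowNum) for col in range(colNum)]

print(multilist)  # [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8]]
```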

Array formulas: nested ifs and same row calculation

There are 2 inputs: A1 and B1.
In column D, there are many types of objects A.
In column G, there are many types of objects B.
Here's what the formula is supposed to do:
If (D2 is 'A1' and G2 is 'B1') then, if (E2 is bigger than F2), subtract E2 and F2 (5 - 4, in this example), otherwise subtract F2 to E2 (like what happens in line 12).
If there is no match, don't do anything and just skip the row.
I would like to do this as an array formula (Ctrl+Shift+Enter), so it would sum everything in the end.
In this example, the output would be -1, because sum(and(5-4)(2-4)) .
So far, I have the following:
{=SUM(IF((D2:D12="A1")+(G2:G12="B1");E2:E12-F2:F12;0))}
But it doesn't work properly as I'm not sure how Excel reads the subtraction part. I want to be able to subtract the values for the row where the combination was found.
If all you need is to have Column E subtracted by Column F for all matches then consider the following Array Formula:
=SUM((D2:D12=$B$2)*(G2:G12=$B$3)*(E2:E12-F2:F12))
(This can be updated with extra checks on what to subtract if needed)
This will SUM all of the subtractions (Column E) - (Column F) that contain a match to your inputs.
Here is the breakdown:
D2:D12=$B$2 and G2:G12=$B$3 will produce arrays containing 1's for a match and 0's for non-match:
{A1,A2,A3, -,A1, -, -,A4,A5,A1,A1} {B1, -,B1, -,B4, -, -,B6,B5,B2,B1}
 V           V              V  V    V     V                       V
{1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1 } {1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1 }
E2:E12-F2:F12 will result in a 3rd array consisting of the subtracted values:
{5, 5, 3, 1, 3, 3, 7, 3, 9, 7, 4}
-{4, 3, 4, 5, 6, 5, 9, 6, 7, 8, 2}
={1, 2,-1,-4,-3,-2,-2,-3, 2,-1, 2}
Multiplying all of them will result like so:
{1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1}
x{1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1}
x{1, 2,-1,-4,-3,-2,-2,-3, 2,-1, 2}
={1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2}
Then of course SUM will do its job:
SUM({1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2}) = 3
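The same arithmetic can be mirrored outside Excel as a sanity check. The sketch below uses Python with NumPy purely to reproduce the breakdown above; the arrays d, g, e and f are transcribed from the example rows, with "-" standing for the non-matching cells shown in the diagram.

```python
import numpy as np

# Columns D, G, E and F for rows 2..12, as shown in the breakdown
d = np.array(["A1", "A2", "A3", "-", "A1", "-", "-", "A4", "A5", "A1", "A1"])
g = np.array(["B1", "-", "B1", "-", "B4", "-", "-", "B6", "B5", "B2", "B1"])
e = np.array([5, 5, 3, 1, 3, 3, 7, 3, 9, 7, 4])
f = np.array([4, 3, 4, 5, 6, 5, 9, 6, 7, 8, 2])

# Multiplying the two 0/1 match arrays is a logical AND
match = (d == "A1") & (g == "B1")

# Non-matching rows are multiplied by 0 and drop out of the sum
products = match * (e - f)
total = products.sum()

print(products.tolist())  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(total)              # 3
```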
If I understood correctly what you are asking for, then your answer would be:
=SUM((D2:D12="A1")*(G2:G12="B1")*ABS(E2:E12-F2:F12))
Remember that in these calculations Excel treats TRUE as the number 1 and FALSE as 0.
So in my formula, any row where either the D or the G column does not match is multiplied by 0.
Also your rule about E and F columns sounds to me like
Subtract the smaller from bigger number
this is same as:
|4-5|=1
Or in Excel formula notation:
ABS(4-5)
