Calculate the number of duplicates in a multidimensional numpy array - python-3.x

I am using python-3.x and I would like to calculate the number of duplicate rows in a numpy array. For example:
import numpy as np
my_array = np.array([[2, 3, 5],
                     [2, 3, 5],  # duplicate of row 0 (counted as 1)
                     [2, 3, 5],  # duplicate of row 0 (counted as 2)
                     [1, 0, 9],
                     [3, 6, 6],
                     [3, 6, 6],  # duplicate of row 4 (counted as 3)
                     [1, 0, 9]])
What I would like to get as the output is the number of duplicate rows in this array:
the number of duplicates is 3
Most methods, such as collections.Counter or np.unique with return_counts, return per-value counts rather than the total number of duplicate rows - at least they do not return what I want, if I am using them right.
Any advice would be much appreciated.

You can get the duplicate count by subtracting the number of unique rows from the total number of rows:
number_of_duplicates = len(my_array) - len(np.unique(my_array, axis=0))
And the result for your example is 4, not 3 ([1, 0, 9] is also duplicated).
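A minimal runnable check of this formula on the example array (7 rows minus 3 unique rows):

```python
import numpy as np

my_array = np.array([[2, 3, 5],
                     [2, 3, 5],
                     [2, 3, 5],
                     [1, 0, 9],
                     [3, 6, 6],
                     [3, 6, 6],
                     [1, 0, 9]])

# np.unique(..., axis=0) keeps one copy of each distinct row.
number_of_duplicates = len(my_array) - len(np.unique(my_array, axis=0))
print(number_of_duplicates)  # 4
```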

Here's a slight variation on @Anh Ngoc's answer, for older versions of numpy where np.unique does not support the axis argument:
number_of_duplicates = len(my_array) - len(set(map(tuple, my_array)))


How to create a list of arrays from multiple same-size vectors

I am attempting to create a list of arrays from 2 vectors.
I have a dataset I'm reading from a .csv file and need to pair each value with a 1 to create a list of arrays.
import numpy as np
Data = np.array([1, 2, 3, 4, 5]) #this is actually a column in a .csv file, but simplified it for the example
#do something here
output = ([1,1], [1,2], [1,3], [1,4], [1,5]) #2nd column in each array is the data, first is a 1
I've tried to use numpy concatenate and vstack, but they don't give me exactly what I'm looking for.
Any suggestions would be appreciated.
You can form the output using a list comprehension:
data = [1, 2, 3, 4, 5]
output = [[1, item] for item in data]
This will output:
[[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
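If a numpy array is acceptable as the final output, the same pairing can also be done with np.column_stack (an alternative sketch, not part of the original answer):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# First column is all ones, second column is the data.
output = np.column_stack((np.ones(len(data), dtype=int), data))
print(output.tolist())  # [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
```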

Using multiple filter on multiple columns of numpy array - more efficient way?

I have the following 2 arrays:
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5, 7, 8])
Each row in the array arr describes a 4-digit id. There are no redundant ids - neither in their values nor in their combination - so if [1, 2, 3, 4] exists, no other ordering of those 4 digits can exist. This will be important in a second.
The array ids contains a 4-digit id, but its digits might not be in the right order. I need to go through each row of arr and check whether this id exists. In this example, ids matches the second row from the top of arr, i.e. arr[1, :].
My current solution creates a filter for each column to check whether the values of ids appear in any of the 4 columns, and then applies all four filters to arr. This seems way too complicated.
So I pretty much do this:
filter_1 = np.in1d(arr[:, 0], ids)
filter_2 = np.in1d(arr[:, 1], ids)
filter_3 = np.in1d(arr[:, 2], ids)
filter_4 = np.in1d(arr[:, 3], ids)
result = arr[filter_1 & filter_2 & filter_3 & filter_4]
Does anyone know a simpler solution? Maybe using generators?
Use np.isin across all of arr and all()-reduce along each row to get the result -
In [15]: arr[np.isin(arr, ids).all(1)]
Out[15]: array([[5, 6, 7, 8]])
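A variation worth noting: since the question is really "does this row contain the same digits as ids, in any order", sorting both sides row-wise also works, and stays correct even if digits could repeat within an id (a sketch under that reading of the problem, not from the original answer):

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5, 7, 8])

# Compare each row's sorted digits against the sorted id.
mask = (np.sort(arr, axis=1) == np.sort(ids)).all(axis=1)
result = arr[mask]
print(result)  # [[5 6 7 8]]
```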

numpy 1D array: identify pairs of elements that sum to 0

My code generates numpy 1D arrays of integers. Here's an example.
arr = np.array([-8, 7, -5, 2, -7, 8, -6, 3, 5])
There are two steps I need to take with this array, but I'm new enough at Python that I'm at a loss how do this efficiently. The two steps are:
a) Identify the 1st element of pairs having sum == 0. For arr, we have (-8, 7, -5).
b) Now I need to find the difference in indices for each of the pairs identified in a).
The difference in indices for (-8,8) is 5, for (7,-7) is 3,
and for (-5,5) is 6.
Ideally, the output could be a 2D array, something like:
[[-8, 5],
 [ 7, 3],
 [-5, 6]]
Thank you for any assistance.
Here is my solution:
arr = np.array([-8, 7, -5, 2, -7, 8, -6, 3, 5])
output = list()
for i in range(len(arr)):
    for j in range(len(arr) - i):
        if arr[i] + arr[i + j] == 0:
            output.append([arr[i], j])
print(output)
[[-8, 5], [7, 3], [-5, 6]]
I have two further comments:
1) You can convert the list to a numpy array with np.asarray(output).
2) Imagine you have the list [8, -8, -8]. If you want to count only the distance of the first pair, simply add a break after the append.
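For larger arrays, the double loop can be replaced with broadcasting: build the full pairwise-sum matrix and keep each zero-sum pair once. This is an alternative sketch, not part of the original answer:

```python
import numpy as np

arr = np.array([-8, 7, -5, 2, -7, 8, -6, 3, 5])

# Entry (i, j) of the broadcasted matrix is arr[i] + arr[j].
i, j = np.where(arr[:, None] + arr[None, :] == 0)
keep = i < j  # keep each pair once, first element first
output = np.column_stack((arr[i[keep]], j[keep] - i[keep]))
print(output.tolist())  # [[-8, 5], [7, 3], [-5, 6]]
```

Note that this is O(n^2) in memory, so it trades space for speed relative to the loop.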

I'm trying to add lists in lists by column. Is there a way to sum them with missing variables in a list?

I have followed the book and can sum lists in lists by column, but one of the test cases has missing values in a list, and I'm unable to move forward because I keep getting an index error.
The first list_initial works as it should, giving [3, 6, 9].
The second one, though, should give me [3, 4, 6, 4].
list_initial = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]  # works, gives [3, 6, 9]
list_initial = [[1, 2, 3], [1], [1, 2, 3, 4]]     # raises IndexError
def column_sums(list_initial):
    column = 0
    list_new = []
    while column < len(list_initial):
        total = sum(row[column] for row in list_initial)
        list_new.append(total)
        column = column + 1
    print(list_new)

column_sums(list_initial)
You can effectively "transpose" your data so that rows become columns, and then use itertools.zip_longest with a fillvalue of 0, to sum across them, eg:
from itertools import zip_longest
list_initial = [[1, 2, 3], [1], [1, 2, 3, 4]]
summed = [sum(col) for col in zip_longest(*list_initial, fillvalue=0)]
# [3, 4, 6, 4]
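To see what the transposition does, here is the intermediate result of zip_longest before summing - each tuple is one column, with 0 filling the gaps left by the short row:

```python
from itertools import zip_longest

list_initial = [[1, 2, 3], [1], [1, 2, 3, 4]]

# Unpacking with * turns rows into columns; missing entries become 0.
cols = list(zip_longest(*list_initial, fillvalue=0))
print(cols)  # [(1, 1, 1), (2, 0, 2), (3, 0, 3), (0, 0, 4)]

summed = [sum(col) for col in cols]
print(summed)  # [3, 4, 6, 4]
```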

Returning the N largest values' indices in a multidimensional array (can find solutions for one dimension but not multi-dimension)

I have a numpy array X, and I'd like to return another array Y whose entries are the indices of the n largest values of X. For example, suppose I have:
a = np.array([[1, 3, 5], [4, 5, 6], [9, 1, 7]])
Then say I want the indices of the first 5 maxes - here 9, 7, 6, 5, 5 are the maxes, and their indices are:
b = np.array([[2, 0], [2, 2], [1, 2], [1, 1], [0, 2]])
I've been able to find some solutions and make this work for a one-dimensional array like c = np.array([1, 2, 3, 4, 5, 6]):
def f(a, N):
    return np.argsort(a)[::-1][:N]
But have not been able to generate something that works in more than one dimension. Thanks!
Approach #1
Get the argsort indices on its flattened version and select the last N indices. Then, get the corresponding row and column indices -
N = 5
idx = np.argsort(a.ravel())[-N:][::-1] # single-slice equivalent: `[:-N-1:-1]`
topN_val = a.ravel()[idx]
row_col = np.c_[np.unravel_index(idx, a.shape)]
Sample run -
# Input array
In [39]: a = np.array([[1,3,5],[4,5,6],[9,1,7]])
In [40]: N = 5
...: idx = np.argsort(a.ravel())[-N:][::-1]
...: topN_val = a.ravel()[idx]
...: row_col = np.c_[np.unravel_index(idx, a.shape)]
...:
In [41]: topN_val
Out[41]: array([9, 7, 6, 5, 5])
In [42]: row_col
Out[42]:
array([[2, 0],
       [2, 2],
       [1, 2],
       [1, 1],
       [0, 2]])
Approach #2
For performance, we can use np.argpartition to get top N indices without keeping sorted order, like so -
idx0 = np.argpartition(a.ravel(), -N)[-N:]
To get the sorted order, we need one more round of argsort -
idx = idx0[a.ravel()[idx0].argsort()][::-1]
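Putting Approach #2 together as a runnable sketch (ties between equal values can come back in either order from argpartition, so only the values are checked here):

```python
import numpy as np

a = np.array([[1, 3, 5], [4, 5, 6], [9, 1, 7]])
N = 5

# argpartition finds the top-N flat indices without fully sorting the array.
idx0 = np.argpartition(a.ravel(), -N)[-N:]
# One small argsort over just those N entries puts them in descending order.
idx = idx0[a.ravel()[idx0].argsort()][::-1]

topN_val = a.ravel()[idx]
row_col = np.c_[np.unravel_index(idx, a.shape)]
print(topN_val)  # [9 7 6 5 5]
```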
