Compare two matrices and create a matrix of their common values [duplicate] - python-3.x

I'm trying to compare two matrices in Python and collect their matching rows into an "intersection matrix". Both matrices hold numerical data, and I want to return the rows they have in common. (I have also tried building a matrix from the entries whose first-column values match, together with an accompanying tuple.) The two matrices do not necessarily have the same dimensions.
Let's say I have two matrices with the same number of columns but an arbitrary, possibly very large and different, number of rows:
23 3 4 5        23 3 4 5
12 6 7 8        45 7 8 9
45 7 8 9        34 5 6 7
67 4 5 6         3 5 6 7
For this small example, the "intersection" matrix I'd like to create is:
23 3 4 5
45 7 8 9
The matrices might instead have different numbers of rows:
1 2 3 4        2  4 6 7
2 4 6 7        4 10 6 9
4 6 7 8        5  6 7 8
5 6 7 8
in which case we only want:
2 4 6 7
5 6 7 8
I've tried things of this nature:
def compare(x):
    # This is a matrix I created with another function: purely numerical data
    # of arbitrary size with fixed column length D
    y = n_c(data_cleaner(x))
    # This is a second matrix that I'd like to compare it to. Note that the
    # sizes are probably not the same, but the column lengths are
    z = data_cleaner(x)
    # I initialized a list that will hold the matching values
    compare = []
    # Nested for loop that checks a single index in one matrix against all
    # entries in the second matrix
    for i in range(len(y)):
        for j in range(len(z)):
            if y[0][i] == z[0][i]:
                # I want the row, or the n-tuple (shown here), of those columns
                # with the matching first indexes as shown above
                c_vec = ([0][i],[15][i],[24][i],[0][25],[0][26])
                compare.append(c_vec)
            else:
                pass
    return compare
compare(c_i_w)
Sadly, I'm running into some errors. Specifically, it seems that I'm indexing the values incorrectly.

Consider the arrays a and b:
import numpy as np
import pandas as pd
a = np.array([
    [23, 3, 4, 5],
    [12, 6, 7, 8],
    [45, 7, 8, 9],
    [67, 4, 5, 6]
])
b = np.array([
    [23, 3, 4, 5],
    [45, 7, 8, 9],
    [34, 5, 6, 7],
    [ 3, 5, 6, 7]
])
print(a)
[[23 3 4 5]
[12 6 7 8]
[45 7 8 9]
[67 4 5 6]]
print(b)
[[23 3 4 5]
[45 7 8 9]
[34 5 6 7]
[ 3 5 6 7]]
Then we can use broadcasting to compare every row of a with every row of b; x[i, j] is True when row i of a equals row j of b:
x = (a[:, None] == b).all(-1)
print(x)
[[ True False False False]
[False False False False]
[False True False False]
[False False False False]]
Using np.where we can identify the indices
i, j = np.where(x)
Show which rows of a
print(a[i])
[[23 3 4 5]
[45 7 8 9]]
And which rows of b
print(b[j])
[[23 3 4 5]
[45 7 8 9]]
They are the same! That's good. That's what we wanted.
We can put the results into a pandas DataFrame with a MultiIndex whose first level is the row number in a and whose second level is the row number in b.
pd.DataFrame(a[i], [i, j])
      0  1  2  3
0 0  23  3  4  5
2 1  45  7  8  9
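Since the question mentions that the matrices can be very large, note that the broadcast comparison above builds an intermediate boolean array of shape (rows of a, rows of b, columns). A possible alternative, sketched here as an assumption and not part of the answer above, is an inner merge in pandas. The helper name common_rows is hypothetical, and duplicated rows are dropped first so that each common row appears once.

```python
import numpy as np
import pandas as pd


def common_rows(a, b):
    """Hypothetical helper: return the distinct rows that occur in both 2-D arrays."""
    left = pd.DataFrame(a).drop_duplicates()
    right = pd.DataFrame(b).drop_duplicates()
    # With no `on` argument, merge joins on every column label the two
    # frames share, which here is all of them (0 .. D-1).
    return left.merge(right, how="inner").to_numpy()


a = np.array([[23, 3, 4, 5], [12, 6, 7, 8], [45, 7, 8, 9], [67, 4, 5, 6]])
b = np.array([[23, 3, 4, 5], [45, 7, 8, 9], [34, 5, 6, 7], [3, 5, 6, 7]])
result = common_rows(a, b)
print(result)
```

Unlike the np.where approach, this sketch does not report which row numbers matched; it only returns the shared rows.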


How to aggregate n previous rows as list in Pandas DataFrame?

As the title says:
a = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
Given a DataFrame with 10 values, we want to aggregate, say, the last 5 rows at each position and put them as a list into a new column:
>>> a
    0          new_col
0   1
1   2
2   3
3   4
4   5      [1,2,3,4,5]
5   6      [2,3,4,5,6]
6   7      [3,4,5,6,7]
7   8      [4,5,6,7,8]
8   9      [5,6,7,8,9]
9  10     [6,7,8,9,10]
How?
Rolling aggregations are expected to return a single number per window, so rolling(5).apply cannot build the lists directly. You can still get the desired result by iterating over the windows and storing each full window as a list (here df is the question's frame with its values in a column named "column"):
>>> new_col_values = [
...     window.to_list() if len(window) == 5 else None
...     for window in df["column"].rolling(5)
... ]
>>> df["new_col"] = new_col_values
>>> df
column new_col
0 1 None
1 2 None
2 3 None
3 4 None
4 5 [1, 2, 3, 4, 5]
5 6 [2, 3, 4, 5, 6]
6 7 [3, 4, 5, 6, 7]
7 8 [4, 5, 6, 7, 8]
8 9 [5, 6, 7, 8, 9]
9 10 [6, 7, 8, 9, 10]
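The snippet above relies on a df that the answer never defines. As a self-contained sketch, assuming the values live in a column named "column" and the window length is 5 (the name n is introduced here for illustration), the same result can be built with plain list slicing, which avoids iterating over a rolling object:

```python
import pandas as pd

df = pd.DataFrame({"column": range(1, 11)})
n = 5  # window length (assumed from the question)

values = df["column"].tolist()
# For position i, take the n values ending at i; leave None until a full window exists
windows = [values[i - n + 1 : i + 1] if i >= n - 1 else None for i in range(len(values))]
df["new_col"] = pd.Series(windows, index=df.index, dtype=object)
print(df)
```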

pd.Series(pred).value_counts() how to get the first column in dataframe?

I apply pd.Series(pred).value_counts() and get this output:
0 2084
-1 15
1 13
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
dtype: int64
When I convert the result to a list, I get only the second column (the counts):
c_list = list(pd.Series(pred).value_counts())
[2084, 15, 13, 10, 7, 4, 3, 3, 3, 2, 2, 2, 2]
How do I ultimately get a DataFrame that looks like this, including a new column for each class's size as a percentage of the total?
df =
class  size  relative_size
    0  2084  x%
   -1    15  y%
    1    13  etc.
    3    10
    4     7
    6     4
   11     3
    8     3
    2     3
    9     2
    7     2
    5     2
   10     2
You are very nearly there. I'm typing this blind, since you didn't provide sample input:
df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum()
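Because no sample input was given, here is a small runnable sketch with made-up labels (the pred list below is hypothetical, not the asker's data). It builds the frame from the index and values of value_counts() directly, which sidesteps any dependence on how reset_index() names its columns, and it multiplies by 100 so that relative_size is a percentage, as the question asks, instead of a fraction.

```python
import pandas as pd

pred = [0, 0, 0, -1, 1, 0, 1, 0]  # hypothetical labels, not the asker's data

counts = pd.Series(pred).value_counts()  # index = class label, values = counts
df = pd.DataFrame({"class": counts.index, "size": counts.to_numpy()})
df["relative_size"] = 100 * df["size"] / df["size"].sum()  # percent of total
print(df)
```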

Best way to get the column filtered from a list in dataframe

I would like to see different approaches for matching a list against df.columns in pandas and selecting only the columns that are present.
The snippet below works perfectly, but I'm looking for other approaches.
A function would be fine too. Excuse my brevity; I'm just learning pandas.
# list of column names to be matched and checked
>>> matchObj = ['equity01', 'equity02', 'equity1', 'equity2']
# DataFrame construct
>>> df = pd.DataFrame({'equity01': [1, 2, 3], 'equity02': [4, 5, 6], 'equity03': [7, 8, 9], 'equity04': [2, 3, 4], 'equity05': [5, 6, 7]})
>>> df
equity01 equity02 equity03 equity04 equity05
0 1 4 7 2 5
1 2 5 8 3 6
2 3 6 9 4 7
# One way, using a list comprehension:
>>> print(df[[col for col in matchObj if col in df.columns]])
equity01 equity02
0 1 4
1 2 5
2 3 6
Thanks a lot in advance for any suggestions and solutions.
Yes, using pd.Index.intersection():
df[df.columns.intersection(matchObj)]
equity01 equity02
0 1 4
1 2 5
2 3 6
Using pd.Index.isin()
df.loc[:,df.columns.isin(matchObj)]
equity01 equity02
0 1 4
1 2 5
2 3 6
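Another option, offered here as a sketch and not taken from the answer above, is DataFrame.filter with the items argument, which keeps the requested labels that exist and skips the ones that do not. The variable name match_obj mirrors the question's matchObj.

```python
import pandas as pd

match_obj = ["equity01", "equity02", "equity1", "equity2"]
df = pd.DataFrame({
    "equity01": [1, 2, 3], "equity02": [4, 5, 6], "equity03": [7, 8, 9],
    "equity04": [2, 3, 4], "equity05": [5, 6, 7],
})

# filter(items=...) selects the listed column labels and ignores missing ones
subset = df.filter(items=match_obj)
print(subset)
```

The approaches can differ in column order when the list is not sorted like the frame: the isin() mask always follows the order of df.columns, while the list comprehension follows the order of the list.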

How to create a separate df after applying groupby?

I have a df as follows:
Product Step
1 1
1 3
1 6
1 6
1 8
1 1
1 4
2 2
2 4
2 8
2 8
2 3
2 1
3 1
3 3
3 6
3 6
3 8
3 1
3 4
What I would like to do is:
For each Product, every Step must be kept and the order must not change. For example, for Product 1 a 1 comes after Step 8, and that 1 must stay after the 8. So the expected output for Products 1 and 3 is the sequence 1, 3, 6, 8, 1, 4, and for Product 2 it is 2, 4, 8, 3, 1.
Update:
I only want one value of 6 for Products 1 and 3, since in the main df the two 6s are next to each other, but both values of 1 must be present since they are not next to each other.
Once the first step is done, the products with the same Steps must be grouped together into a new df (in the example below, Products 1 and 3 have the same Steps, so they must be grouped together).
What I have done:
import pandas as pd
sid = pd.DataFrame(data.groupby('Product').apply(lambda x: x['Step'].unique())).reset_index()
But it is yielding a result like:
Product 0
0 1 [1 3 6 8 4]
1 2 [2 4 8 3 1]
2 3 [1 3 6 8 4]
which is not the result I want. I would like the value for the first and third product to be [1 3 6 8 1 4].
IIUC, create a Newkey column using diff and cumsum:
df['Newkey'] = df.groupby('Product').Step.transform(lambda x: x.diff().ne(0).cumsum())
df.drop_duplicates(['Product','Newkey'],inplace=True)
s=df.groupby('Product').Step.apply(tuple)
s.reset_index().groupby('Step').Product.apply(list)
Step
(1, 3, 6, 8, 1, 4) [1, 3]
(2, 4, 8, 3, 1) [2]
Name: Product, dtype: object
groupby preserves the order of rows within a group, so there isn't much need to worry about the rows shifting.
A straightforward, though not especially fast, solution is to apply(tuple): tuples are hashable, so you can group on them to see which Products are identical. form_seq ensures that consecutive repeated values appear only once in the list of steps before the tuple is formed.
def form_seq(x):
    x = x[x != x.shift()]
    return tuple(x)
s = df.groupby('Product').Step.apply(form_seq)
s.groupby(s).groups
#{(1, 3, 6, 8, 1, 4): Int64Index([1, 3], dtype='int64', name='Product'),
# (2, 4, 8, 3, 1): Int64Index([2], dtype='int64', name='Product')}
Or, if you'd like a pandas object instead of a dictionary:
s.reset_index().groupby('Step').Product.apply(list)
#Step
#(1, 3, 6, 8, 1, 4) [1, 3]
#(2, 4, 8, 3, 1) [2]
#Name: Product, dtype: object
In the dictionary returned by s.groupby(s).groups, the values are the groups of products that share a step sequence (given by the dictionary keys). Products 1 and 3 are grouped together by the step sequence 1, 3, 6, 8, 1, 4.
Another very similar way:
df_no_dups=df[df.shift()!=df].dropna(how='all').ffill()
df_no_dups_grouped=df_no_dups.groupby('Product')['Step'].apply(list)
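To tie the steps together, here is a self-contained sketch that rebuilds the question's frame, drops consecutive repeats within each Product, and collects the products that share a sequence. It is one possible arrangement and not the exact code of any answer above; the names prev_step, deduped and groups are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": [1] * 7 + [2] * 6 + [3] * 7,
    "Step": [1, 3, 6, 6, 8, 1, 4,
             2, 4, 8, 8, 3, 1,
             1, 3, 6, 6, 8, 1, 4],
})

# Previous Step within the same Product (NaN for the first row of each group)
prev_step = df.groupby("Product")["Step"].shift()
# Keep a row unless it repeats the Step directly before it
deduped = df[df["Step"] != prev_step]

# Map each step sequence (as a hashable tuple) to the products that have it
groups = {}
for product, steps in deduped.groupby("Product")["Step"]:
    groups.setdefault(tuple(steps.tolist()), []).append(int(product))

print(groups)
```

From the resulting dictionary, a separate frame for one group can be taken with df[df["Product"].isin(products)], where products is one of the dictionary's values.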

Three-dimensional array processing

I want to turn
arr = np.array([[[1,2,3],[4,5,6],[7,8,9],[10,11,12]], [[2,2,2],[4,5,6],[7,8,9],[10,11,12]], [[3,3,3],[4,5,6],[7,8,9],[10,11,12]]])
into
arr = np.array([[[1,2,3],[7,8,9],[10,11,12]], [[2,2,2],[7,8,9],[10,11,12]], [[3,3,3],[7,8,9],[10,11,12]]])
Below is the code:
a = 0
b = 0
NewArr = []
while a < 3:
    c = arr[a, :, :]
    d = arr[a]
    print(d)
    if c[1, 2] == 6:
        c = np.delete(c, [1], axis=0)
    a += 1
    b += 1
    c = np.concatenate((d, c), axis=1)
    print(c)
But after deleting the row containing the number 6, I cannot stitch the array back together. Can someone help me?
Thank you very much for your help.
If you want a more automatic way of processing your input data, here is an answer using numpy functions:
arr[np.newaxis,~np.any(arr==6,axis=2)].reshape((3,-1,3))
np.any(arr==6,axis=2) outputs a boolean array that is True for every row containing the value 6. We invert it because those are the rows we want to remove. The inverted mask is then used as a boolean index into arr; the np.newaxis only adds a leading axis of length 1, which the final reshape absorbs.
Finally, the output is reshaped into a (3, x, 3) array, where x depends on the number of rows that were removed (hence the -1 in reshape).
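The masking approach can be sketched step by step as follows. This is an illustrative variant, not the answer's exact expression: it indexes with the mask directly (without np.newaxis) and adds a check, which is an assumption on my part, that every block keeps the same number of rows, since the reshape only makes sense in that case.

```python
import numpy as np

arr = np.array([
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    [[2, 2, 2], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    [[3, 3, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
])

# Shape (3, 4): True for every row that does NOT contain a 6
keep = ~np.any(arr == 6, axis=2)

# The reshape below needs each block to keep the same number of rows
kept_per_block = keep.sum(axis=1)
if not (kept_per_block == kept_per_block[0]).all():
    raise ValueError("blocks keep different numbers of rows; cannot reshape")

# Boolean indexing flattens the first two axes; reshape restores the blocks
result = arr[keep].reshape(arr.shape[0], -1, arr.shape[2])
print(result)
```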
Based on the input / output you provide, a simpler solution would be to just use index selection and slices:
import numpy as np
arr = np.array([[[1,2,3],[4,5,6],[7,8,9],[10,11,12]], [[2,2,2],[4,5,6],[7,8,9],[10,11,12]], [[3,3,3],[4,5,6],[7,8,9],[10,11,12]]])
print("arr=")
print(arr)
expected_result = np.array([[[1,2,3],[7,8,9],[10,11,12]], [[2,2,2],[7,8,9],[10,11,12]], [[3,3,3],[7,8,9],[10,11,12]]])
# select indices 0, 2 and 3 along the second axis (axis=1)
a = np.copy(arr[:,[0,2,3],:])
print("a=")
print(a)
print(np.array_equal(a, expected_result))
Output:
arr=
[[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
[[ 2 2 2]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
[[ 3 3 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]]
a=
[[[ 1 2 3]
[ 7 8 9]
[10 11 12]]
[[ 2 2 2]
[ 7 8 9]
[10 11 12]]
[[ 3 3 3]
[ 7 8 9]
[10 11 12]]]
True
