Inverse X.toarray into a CountVectorizer in sklearn

I'm following documentation here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Suppose I already have a term frequency matrix like the one given in X.toarray(), but I didn't use CountVectorizer to obtain it.
I want to apply TF-IDF weighting to this matrix. Is there a way for me to take a count array + a dictionary and apply some inverse of this function as a constructor to get a fit_transformed X?
I'm looking for...
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> V = CountVectorizerConstructorPrime(array=X.toarray(),
...                                     vocabulary=['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'])
such that:
>>> V == X
True

The X constructed by the CountVectorizer is a sparse matrix in SciPy's compressed sparse row (csr) format. So you can construct it directly from any word count matrix with the appropriate SciPy function:
from scipy.sparse import csr_matrix
V = csr_matrix(X.toarray())
Now V and X are equal, although this may not be obvious, because V == X will give you another sparse matrix (or rather a complaint that comparing sparse matrices with == is inefficient; see this question). But you can check it like this:
(V != X).toarray().any()
False
Note that the word list was not needed, because the matrix only encodes the frequencies of all distinct words, no matter what they are.
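Since the goal was to apply TF-IDF weighting to the count matrix, note that the resulting csr matrix (or the raw count array itself) can be passed straight to TfidfTransformer. A minimal sketch, assuming the X built above:
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer
counts = csr_matrix(X.toarray())  # any precomputed word count matrix works here
tfidf = TfidfTransformer().fit_transform(counts)  # row-normalized TF-IDF weights
print(tfidf.toarray().round(2))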

Related

Get all columns per id where the column is equal to a value

Say I have a pandas dataframe:
id A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 1
id4 0 0 0 0
I want to select, for each id, all the column names where the value is equal to 1; this list will then be a new column in the dataframe.
Expected output:
id A B C D Result
id1 0 1 0 1 [B,D]
id2 1 0 0 1 [A,D]
id3 0 0 0 1 [D]
id4 0 0 0 0 []
I tried df.apply(lambda row: row[row == 1].index, axis=1), but the output of 'Result' was not in the form specified above.
You can do what you are trying to do by adding .tolist():
df['Result'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
That said, your approach of using lists as values inside a single column seems contradictory to the Pandas approach of keeping data tabular (only one value per cell). It would probably be better to use nested lists instead of pandas for what you are trying to do.
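For completeness, a small self-contained sketch of this approach, reconstructing the example frame from the question:
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]],
    index=['id1', 'id2', 'id3', 'id4'],
    columns=['A', 'B', 'C', 'D']
)
# for each row, collect the column labels where the value equals 1
df['Result'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
# df['Result'] now holds [B, D], [A, D], [D] and [] for id1..id4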
Setup
I used a different set of ones and zeros to highlight skipping an entire row.
df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
    ['id1', 'id2', 'id3', 'id4'],
    ['A', 'B', 'C', 'D']
)
df
A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 0
id4 0 0 1 0
Not Your Grampa's Reverse Binarizer
import numpy as np
n = len(df)
i, j = np.nonzero(df.to_numpy())
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
df.assign(Result=[a.tolist() for a in result])
A B C D Result
id
id1 0 1 0 1 [B, D]
id2 1 0 0 1 [A, D]
id3 0 0 0 0 []
id4 0 0 1 0 [C]
Explanations
Ohh, the details!
np.nonzero on a 2-D array will return two arrays of equal length. The first array will have the 1st dimensional position of each element that is not zero. The second array will have the 2nd dimensional position of each element that is not zero. I'll call the first array i and the second array j.
In the figure below, I label the columns with what j they represent and correspondingly, I label the rows with what i they represent.
For each non-zero element of the dataframe, I place above the value a tuple with the (ith, jth) dimensional positions and in brackets the [kth] non-zero element in the dataframe.
# j → 0 1 2 3
# A B C D
# i
# ↓ (0,1)[0] (0,3)[1]
# 0 id1 0 1 0 1
#
# (1,0)[2] (1,3)[3]
# 1 id2 1 0 0 1
#
#
# 2 id3 0 0 0 0
#
# (3,2)[4]
# 3 id4 0 0 1 0
In the figure below, I show what i and j look like. I label each row with the same k in the brackets in the figure above
# i j
# ----
# 0 1 k=0
# 0 3 k=1
# 1 0 k=2
# 1 3 k=3
# 3 2 k=4
Now it's easy to slice the df.columns with the j array to get all the column labels in one big array.
# df.columns[j]
# B
# D
# A
# D
# C
The plan is to use np.split to chop up df.columns[j] into the sub arrays for each row. Turns out that the information is embedded in the array i. I'll use np.bincount to count how many non-zero elements are in each row. I'll need to tell np.bincount the minimum number of bins we are assuming to have. That minimum is the number of rows in the dataframe. We assign it to n with n = len(df)
# np.bincount(i, minlength=n)
# 2 ← Two non-zero elements in the first row
# 2 ← Two more in the second row
# 0 ← None in this row
# 1 ← And one more in the fourth row
Now if we take the cumulative sum of this array, we get the positions we need to split at.
# np.bincount(i, minlength=n).cumsum()
# 2
# 4
# 4 ← This repeated value results in an empty array for the 3rd row
# 5
Let's look at how this matches up with df.columns[j]. We see below that the column slice gets split exactly where we need.
# B D A D C ← df.columns[j]
# 2   4 4 5 ← np.bincount(i, minlength=n).cumsum()
One issue is that the 4 values in this array will result in splitting the df.columns[j] array into 5 sub-arrays. This isn't horrible, because the last array will always be empty, so we slice it to the appropriate size with np.bincount(i, minlength=n).cumsum()[:-1]
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
# result
# [B, D]
# [A, D]
# []
# [C]
The only thing left to do is assign it to a new column and make the individual sub-arrays lists instead.
df.assign(Result=[a.tolist() for a in result])
# A B C D Result
# id
# id1 0 1 0 1 [B, D]
# id2 1 0 0 1 [A, D]
# id3 0 0 0 0 []
# id4 0 0 1 0 [C]

Getting a dataframe of combinations from a list of dictionaries

I have the following list of dictionaries:
options = [{'A-1': ['x', 'y']},
           {'A-3': ['x', 'y', 'z']}]
The values of each dictionary (e.g. x and y) are basically the options that the keys (e.g. A-1) can have. How can I get the following dataframe of combinations? Only one value (e.g. either x or y) of a key (e.g. A-1) can take 1 at a time, and all values of a dictionary cannot be 0 at the same time.
I have been trying to use itertools.combinations(), but couldn't find a way to get the desired result.
This way I can find the number of combinations n_comb and the number of connections n_conn, which will be the number of rows and columns of the dataframe.
n_conn = 0
n_comb = 1
for dic in options:
    for key in dic:
        n_comb = n_comb * len(dic[key])
        n_conn = n_conn + len(dic[key])
One way using pandas.get_dummies and merge:
dfs = [pd.get_dummies(pd.DataFrame(o)).assign(merge=1) for o in options]
new_df = dfs[0].merge(dfs[1], on="merge").drop("merge", axis=1)
print(new_df)
Or make it more flexible using functools.reduce:
from functools import reduce
new_df = reduce(lambda x, y: x.merge(y, on="merge"), dfs).drop("merge", axis=1)
Output:
A-1_x A-1_y A-3_x A-3_y A-3_z
0 1 0 1 0 0
1 1 0 0 1 0
2 1 0 0 0 1
3 0 1 1 0 0
4 0 1 0 1 0
5 0 1 0 0 1
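Alternatively, since the question mentions itertools, here is a rough sketch (not from the original answer) that builds the same table with itertools.product, assuming the two dictionaries shown above:
from itertools import product
import pandas as pd

options = [{'A-1': ['x', 'y']}, {'A-3': ['x', 'y', 'z']}]
# one column label per (key, value) pair, e.g. 'A-1_x'
labels = [f'{k}_{v}' for d in options for k, vals in d.items() for v in vals]
# every combination picks exactly one value per key
per_key = [[(k, v) for k, vals in d.items() for v in vals] for d in options]
rows = []
for combo in product(*per_key):
    picked = {f'{k}_{v}' for k, v in combo}
    rows.append([int(lab in picked) for lab in labels])
new_df = pd.DataFrame(rows, columns=labels)
print(new_df)  # same six rows as in the output above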

Python, pandas dataframe, groupby column and known in advance values

Consider this example:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         ['X', 'R', 1],
...         ['X', 'G', 2],
...         ['X', 'R', 1],
...         ['X', 'B', 3],
...         ['X', 'R', 2],
...         ['X', 'B', 2],
...         ['X', 'G', 1],
...     ],
...     columns=['client', 'status', 'cnt']
... )
>>> df
client status cnt
0 X R 1
1 X G 2
2 X R 1
3 X B 3
4 X R 2
5 X B 2
6 X G 1
>>>
>>> df_gb = df.groupby(['client', 'status']).cnt.sum().unstack()
>>> df_gb
status B G R
client
X 5 3 4
>>>
>>> def color(row):
...     if 'R' in row:
...         red = row['R']
...     else:
...         red = 0
...     if 'B' in row:
...         blue = row['B']
...     else:
...         blue = 0
...     if 'G' in row:
...         green = row['G']
...     else:
...         green = 0
...     if red > 0:
...         return 'red'
...     elif blue > 0 and (red + green) == 0:
...         return 'blue'
...     elif green > 0 and (red + blue) == 0:
...         return 'green'
...     else:
...         return 'orange'
...
>>> df_gb.apply(color, axis=1)
client
X red
dtype: object
>>>
What this code does is group by client and status in order to get the counts of each category (red, green, blue).
Then apply is used to implement the logic for determining the color of each client (in this case there is only one).
The problem here is that the groupby result can contain any combination of the RGB values.
For example, I can have an R and a G column but not B, or I could have just an R column, or none of the RGB columns at all.
Because of that, in the apply function I had to introduce if statements for each column in order to have a count for each color, whether or not it appears in the groupby result.
Do I have any other option to enforce the logic from the color function, using something else instead of apply in such an (ugly) way?
For example, in this case I know in advance that I need counts for exactly three categories - R, G and B. I need something like grouping by a column and these three values.
Can I group the dataframe by these three categories (series, dict, function?) and always get zero or a sum for all three categories, whether or not they exist in the group?
Use:
#changed data for more combinations
df = pd.DataFrame(
    [
        ['W', 'R', 1],
        ['X', 'G', 2],
        ['Y', 'R', 1],
        ['Y', 'B', 3],
        ['Z', 'R', 2],
        ['Z', 'B', 2],
        ['Z', 'G', 1],
    ],
    columns=['client', 'status', 'cnt']
)
print (df)
client status cnt
0 W R 1
1 X G 2
2 Y R 1
3 Y B 3
4 Z R 2
5 Z B 2
6 Z G 1
Then the fill_value=0 parameter is added to replace non-matched (missing) values with 0:
df_gb = df.groupby(['client', 'status']).cnt.sum().unstack(fill_value=0)
#alternative
df_gb = df.pivot_table(index='client',
                       columns='status',
                       values='cnt',
                       aggfunc='sum',
                       fill_value=0)
print (df_gb)
status B G R
client
W 0 0 1
X 0 2 0
Y 3 0 1
Z 2 1 2
Instead of a function, a helper DataFrame is created with all combinations of 0 and 1, plus a new column for the output:
from itertools import product
df1 = pd.DataFrame(product([0,1], repeat=3), columns=['R','G','B'])
#change colors as needed
df1['output'] = ['no','blue','green','color2','red','red1','red2','all']
print (df1)
R G B output
0 0 0 0 no
1 0 0 1 blue
2 0 1 0 green
3 0 1 1 color2
4 1 0 0 red
5 1 0 1 red1
6 1 1 0 red2
7 1 1 1 all
Then DataFrame.clip is used to cap values above 1 at 1:
print (df_gb.clip(upper=1))
status B G R
client
W 0 0 1
X 0 1 0
Y 1 0 1
Z 1 1 1
And last, DataFrame.merge is used to add the new output column; there is no on parameter, so the join uses the intersection of the columns of both DataFrames, here R, G, B:
df2 = df_gb.clip(upper=1).merge(df1)
print (df2)
B G R output
0 0 0 1 red
1 0 1 0 green
2 1 0 1 red1
3 1 1 1 all
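As a small follow-up (not part of the original answer): the merge drops the client labels because they live in the index, so resetting the index before merging keeps them:
df2 = df_gb.clip(upper=1).reset_index().merge(df1)
print (df2)
client B G R output
0 W 0 0 1 red
1 X 0 1 0 green
2 Y 1 0 1 red1
3 Z 1 1 1 all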

Reshaping 1D array does not work properly

Assume I have the following array, where all binary values are assumed to be of the same length:
A = [10101010, 10011010, 10111101, 11110000]
This is a 1D array of size 4. I want to convert it to a 2D numpy array; using this example, I should get shape (4, 8). I use the following code, but it doesn't reshape it. Any suggestions?
import numpy as np
A = [10101010, 10011010, 10111101, 11110000]
A = np.asarray(A)
A = np.reshape(A, [-1,])
You got a list of integers that are > 10 million each - not binary values.
You can fix that by turning each one into a string, separating it into single digits, and converting those:
import numpy as np
A = [10101010, 10011010, 10111101, 11110000]
B = [list(map(int,t)) for t in list(map(str,A))]
npA = np.asarray(B)
print(npA)
Output:
[[1 0 1 0 1 0 1 0]
 [1 0 0 1 1 0 1 0]
 [1 0 1 1 1 1 0 1]
 [1 1 1 1 0 0 0 0]]
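If the fixed width is known (8 digits here), a purely arithmetic sketch with integer division avoids the string round-trip; it assumes every entry really has 8 digits:
import numpy as np
A = [10101010, 10011010, 10111101, 11110000]
arr = np.asarray(A)
powers = 10 ** np.arange(7, -1, -1)  # [10**7, ..., 10**0]
npA = (arr[:, None] // powers) % 10  # shape (4, 8), one decimal digit per cell
print(npA)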

create dummies from a column for a subset of data, which doesn't contain all the category values in that column

I am handling a subset of a large data set.
There is a column named "type" in the dataframe. The "type" column is expected to have values like [1,2,3,4].
In a certain subset, I find the "type" column only contains certain values, like [1,4]:
In [1]: df
Out[2]:
type
0 1
1 4
When I create dummies from column "type" on that subset, it turns out like this:
In [3]:import pandas as pd
In [4]:pd.get_dummies(df["type"], prefix = "type")
Out[5]: type_1 type_4
0 1 0
1 0 1
It doesn't have the columns named "type_2", "type_3". What I want is:
Out[6]: type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
Is there a solution for this?
What you need to do is make the column 'type' into a pd.Categorical and specify the categories
pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
Another solution with reindex_axis and add_prefix:
df1 = (pd.get_dummies(df["type"])
         .reindex_axis([1,2,3,4], axis=1, fill_value=0)
         .add_prefix('type'))
print (df1)
type1 type2 type3 type4
0 1 0 0 0
1 0 0 0 1
Or categorical solution:
df1 = pd.get_dummies(df["type"].astype('category', categories=[1, 2, 3, 4]), prefix='type')
print (df1)
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
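Note that reindex_axis and astype('category', categories=...) have since been removed from pandas. As a sketch for newer versions (pd.Categorical, as in the first answer, replaces the astype form), the reindex-based variant becomes:
df1 = (pd.get_dummies(df["type"], dtype=int)
         .reindex(columns=[1, 2, 3, 4], fill_value=0)
         .add_prefix('type'))
print (df1)
type1 type2 type3 type4
0 1 0 0 0
1 0 0 0 1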
Since you tagged your post as one-hot-encoding, you may find sklearn module's OneHotEncoder useful, in addition to pure Pandas solutions:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5
# one-hot encoding
encoder = OneHotEncoder(n_values=n_vals, sparse=False, dtype=int)
data = encoder.fit_transform(df.type.values.reshape(-1,1))
# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])
print(newdf)
type_0 type_1 type_2 type_3 type_4
0 0 1 0 0 0
1 0 0 0 0 1
One advantage of using this approach is that OneHotEncoder easily produces sparse vectors for very large class sets. (Just change to sparse=True in the OneHotEncoder() declaration.)
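For recent scikit-learn releases, where the n_values parameter no longer exists and sparse has been renamed sparse_output (1.2+), a rough equivalent sketch is:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'type': [1, 4]})
encoder = OneHotEncoder(categories=[[1, 2, 3, 4]], sparse_output=False, dtype=int)
data = encoder.fit_transform(df[['type']])
newdf = pd.DataFrame(data, columns=encoder.get_feature_names_out(['type']))
print(newdf)
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1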

Resources