I have an Excel matrix that looks like this:
ID A B C D E F G
XYZ 0 1 0 1 0 1 0
ZVC 1 0 1 0 1 0 1
...
ABC 0 1 0 1 0 1 1
I would like to transform this matrix into three columns:
XYZ A 0
XYZ B 1
XYZ C 0
...
ABC F 1
ABC G 1
What would be an efficient way to do that (possibly without macros)?
I actually found a simple solution to my problem: https://www.youtube.com/watch?v=N3wWQjRWkJc
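Outside Excel, the same reshaping can be done in pandas. This is a minimal sketch, assuming the sheet has already been loaded into a DataFrame. Here the frame is typed in by hand from the three rows shown in the question, and the names `wide`, `long`, `Column` and `Value` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the sheet: the three rows shown in the question.
wide = pd.DataFrame(
    [["XYZ", 0, 1, 0, 1, 0, 1, 0],
     ["ZVC", 1, 0, 1, 0, 1, 0, 1],
     ["ABC", 0, 1, 0, 1, 0, 1, 1]],
    columns=["ID", "A", "B", "C", "D", "E", "F", "G"],
)

# melt turns every (ID, column) cell into its own row. Keeping the original
# row index and sorting on it with a stable sort groups the rows by ID
# (XYZ A, XYZ B, ...) instead of by column.
long = (
    wide.melt(id_vars="ID", var_name="Column", value_name="Value",
              ignore_index=False)
        .sort_index(kind="mergesort")
        .reset_index(drop=True)
)
print(long.head(3))
```

In practice the frame would come from something like `pd.read_excel`, and the result could be written back with `to_excel`.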
Related
I am trying to convert a data frame into a 1,0 matrix format
data = pd.DataFrame({'Val1':['A','B','B'],
'Val2':['C','A','D'],
'Val3':['E','F','C'],
'Comb':['Comb1','Comb2','Comb3']})
data:
Val1 Val2 Val3 Comb
0 A C E Comb1
1 B A F Comb2
2 B D C Comb3
What I need is to convert to below data frame
Comb A C D E B F
0 Comb1 1 1 0 1 0 0
1 Comb2 1 0 0 0 1 1
2 Comb3 0 1 1 0 1 0
I was able to do it with a for loop, but the processing time grows as my dataframe gets larger. Is there a better way to do it?
header = set(data[['Val1','Val2','Val3']].values.ravel())
matrix = pd.DataFrame(columns=header)
for i in range(data.shape[0]):
    temp_dict = {data["Val1"].iloc[i]:1, data["Val2"].iloc[i]:1, data["Val3"].iloc[i]:1}
    matrix = matrix.append(temp_dict, ignore_index=True)
matrix = matrix.loc[:, matrix.columns.notnull()]
matrix = matrix.fillna(0)
matrix = pd.merge(data[["Comb"]],matrix, left_index=True, right_index=True, how= 'outer')
Thanks!
There may be a better solution, but this is what came to my mind: convert each row to a dictionary of "present" letters, build a Series from the dictionary, and combine the Series into a dataframe.
data.loc[:, 'Val1':'Val3'].apply(lambda row:
pd.Series({letter: 1 for letter in row}), axis=1)\
.fillna(0).astype(int).join(data.Comb)
# A B C D E F Comb
#0 1 0 1 0 1 0 Comb1
#1 1 1 0 0 0 1 Comb2
#2 0 1 1 1 0 0 Comb3
There are probably multiple ways to solve this; I used pd.crosstab for it:
import pandas as pd
data = pd.DataFrame({'Val1':['A','B','B'],
'Val2':['C','A','D'],
'Val3':['E','F','C'],
'Comb':['Comb1','Comb2','Comb3']})
data["lst"] = data[['Val1', 'Val2', 'Val3']].values.tolist()
data = data.explode("lst")
print(pd.crosstab(data["Comb"], data["lst"]))
Out[20]:
lst A B C D E F
Comb
Comb1 1 0 1 0 1 0
Comb2 1 1 0 0 0 1
Comb3 0 1 1 1 0 0
I think this will work. Please let me know if it does:
pd.get_dummies(data, columns=['Val1','Val2','Val3'],prefix="",prefix_sep="").groupby(axis=1,level=0).sum()
Here's another way:
data.melt('Comb').set_index('Comb')['value'].str.get_dummies().groupby(level=0).sum().reset_index()
Output:
Comb A B C D E F
0 Comb1 1 0 1 0 1 0
1 Comb2 1 1 0 0 0 1
2 Comb3 0 1 1 1 0 0
I have a pandas.DataFrame that looks like this:
A B C D E F
0 0 1 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 0 0 1 0 0
There are several rows that share a 1 in their columns, and each row contains only one 1. I want to merge the rows so that the resulting DataFrame consists of only one row that combines all the 1s of the DataFrame, like this:
A B C D E F
0 1 1 1 1 0
Is there a smart, easy way to do this with pandas?
Use DataFrame.sum, then test for greater than or equal to 1 with Series.ge, and finally convert to 0/1 with Series.view:
s = df.sum().ge(1).view('i1')
Another idea, if the values are only 0 and 1, is to use DataFrame.any and convert the boolean mask to 0/1:
s = df.any().view('i1')
print (s)
A 1
B 1
C 1
D 1
E 1
F 0
dtype: int8
We can do
df.sum().ge(1).astype(int)
Out[316]:
A 1
B 1
C 1
D 1
E 1
F 0
dtype: int32
I have following dataframe
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that lists, for each row, every value greater than zero together with its column name:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1
We can use mask and stack
s=df.mask(df==0).stack().\
astype(int).astype(str).\
reset_index(level=1).apply('-'.join,1).add(',').groupby(level=0).sum().str[:-1]
df['New']=s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1
Combine the column names with the df values that are not zero and then filter out the None values.
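import numpy as np
import pandas as pd

df = pd.read_clipboard()
# Label non-zero cells as "<column>-<value>"; zero cells become None.
arrays = np.where(df!=0, df.columns.values + '-' + df.values.astype('str'), None)
new = []
for array in arrays:
    # Drop the None placeholders, keeping only the labels.
    new.append(list(filter(None, array)))
df['New'] = new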
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
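If the new column should hold a single string per row, as in the desired output, rather than a list, a row-wise join is one option. This is a minimal sketch, assuming all value columns are numeric; the helper name `label_positive` and the list `value_cols` are made up for illustration.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"A": [1, 0, 0], "B": [0, 1, 0], "C": [2, 1, 0], "D": [1, 0, 1]})

def label_positive(row):
    """Join "<column>-<value>" for every value in the row that is above zero."""
    return ", ".join(f"{col}-{val}" for col, val in row.items() if val > 0)

# Restrict to the value columns so the function never sees the New column itself.
value_cols = ["A", "B", "C", "D"]
df["New"] = df[value_cols].apply(label_positive, axis=1)
print(df)
```

A row-wise apply is easy to read but runs in Python for every row, so on large frames the vectorised answers may be faster.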
I have a dataset like the following.
Example:
Name A B C
0 Jon 0 1 0
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
If there is a duplicate in my dataset I want the person who has column A as 1 to have precedence.
NB. Column A can only have values 1 or 0
Output:
Name A B C
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
IIUC, sort the values before dropping duplicates:
df.sort_values('A').drop_duplicates('Name',keep='last').sort_index()
Out[126]:
Name A B C
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
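To see why the sort-then-drop chain works, it can be split into its three steps. This is a minimal sketch on the example data; the intermediate names `by_a`, `deduped` and `result` are made up for illustration.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "Name": ["Jon", "Jon", "Alan", "Shaya"],
    "A": [0, 1, 1, 0],
    "B": [1, 0, 0, 1],
    "C": [0, 1, 0, 1],
})

# 1. Sort ascending on A, so rows with A == 1 come after rows with A == 0.
by_a = df.sort_values("A", kind="mergesort")

# 2. Keep the last row per Name, which is the one with the highest A.
deduped = by_a.drop_duplicates("Name", keep="last")

# 3. Restore the original row order.
result = deduped.sort_index()
print(result)
```

If two duplicate rows both have A equal to 1, this keeps whichever of them sorts last, so a second sort key would be needed to break such ties deliberately.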
What is an efficient method of doing this?
I have a column with name of buyer and a column with item names. Each item the person bought is on a new row
For example:
Person 1 Item 1
Person 1 Item 2
Person 1 Item 5
Person 1 Item 7
Person 2 Item 1
Person 2 Item 2
Person 2 Item 11
Person 2 Item 15
Person 2 Item 20
Person 2 Item 21
Person 2 Item 17
Person 3 Item 1
Person 3 Item 2
Person 3 Item 6
Person 3 Item 11
Person 3 Item 15
Person 4 Item 1
Person 4 Item 2
Person 4 Item 5
Person 4 Item 7
There are about 1000000 rows in total and each person has an average of 30 items.
I want to count how often two specific items are bought by a person.
I am picturing it something like this
Item1 Item2 Item3 Item4 Item5 Item6
Item1 xxxxx 0% 0% 5% 10% 90%
Item2
Item3
Item4
Item5
Item6
I have tried using a pivot table, putting item on row labels and person on column labels and then counting items. Then I can use a lookup formula and multiply the results from the pivot table, but this doesn't work with such a large file. Is there a more efficient method?
I am open to all kinds of solutions.
You can use a helper 'table' to do this. First create a table of purchases by person. The formula in this table is:
=SUMPRODUCT(--($A$1:$A$20=E$2),--($B$1:$B$20=$D3))
Which gives a 1/0 result if a person ever bought that item. Example:
Then create the grid of products like in your post and enter this formula:
=SUMPRODUCT($E3:$H3,INDEX($E$3:$H$12,MATCH(K$2,$D$3:$D$12,0),0))
Which multiplies instances of purchase of Item X and Item Y. Example:
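Since the question is open to all kinds of solutions, the same two-step idea (a 1/0 purchase table, then a product of its columns) can be sketched in pandas. This is a sketch under stated assumptions: the purchases sit in a two-column frame, and the names `purchases`, `bought`, `together` and `share` are made up for illustration.

```python
import pandas as pd

# The example purchases from the question.
purchases = pd.DataFrame({
    "Person": ["Person 1"] * 4 + ["Person 2"] * 7 + ["Person 3"] * 5 + ["Person 4"] * 4,
    "Item": ["Item 1", "Item 2", "Item 5", "Item 7",
             "Item 1", "Item 2", "Item 11", "Item 15", "Item 20", "Item 21", "Item 17",
             "Item 1", "Item 2", "Item 6", "Item 11", "Item 15",
             "Item 1", "Item 2", "Item 5", "Item 7"],
})

# Step 1: person x item table with 1 if the person ever bought the item.
bought = pd.crosstab(purchases["Person"], purchases["Item"]).clip(upper=1)

# Step 2: item x item table; each cell counts the persons who bought both items.
# The diagonal holds the number of buyers of each single item.
together = bought.T.dot(bought)

# Optional: express the counts as a share of all persons.
share = together / len(bought)
print(together.loc["Item 1", "Item 2"])
```

With about 1,000,000 rows the dense person x item table may become large if there are many distinct items; a sparse matrix would then be the more economical representation.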
Maybe I misunderstand you, but you are not interested in the person who buys, but in the items bought by the same person? I do not think you can do this in one step using only formulas (of course, it is easier in VBA).
To do it without VBA you could:
1. Sort by Person and Item.
2. Build an accumulating string with all (different) items bought by one person (untested: something like IF(A1=A2;B1;"")&B2).
3. Ignore all strings but the last one for each person (something like IF(A2=A3;"";B2)).
After this you have something like
P I Items_a All_Items
1 A A
1 B AB
1 E ABE
1 G ABEG ABEG
2 A A
2 B AB
2 K ABK
2 O ABKO
2 Q ABKOQ
2 T ABKOQT
2 U ABKOQTU ABKOQTU
3 A A
3 B AB
3 F ABF
3 K ABFK
3 O ABFKO ABFKO
4 A A
4 B AB
4 E ABE
4 G ABEG ABEG
In the next step you could copy the combined strings to a new table, build all pair combinations (in ascending order, because the items were sorted) as column headers, and mark a cell with 1 if the string contains both items.
To make this easier to explain, the items are named A, B, ... (corresponding to Item 1, Item 2, ... in your example).
The formula is something like
=IF(ISERROR(FIND(MID(B$1;1;1);$A2));0;1)*IF(ISERROR(FIND(MID(B$1;2;1);$A2));0;1)
In your case the possible combinations would be the following:
AB AE AF AG AK AO AQ AT AU BE BF BG BK BO BQ BT BU EF EG EK EO EQ ET EU FG FK FO FQ FT FU GK GO GQ GT GU KO KQ KT KU OQ OT OU QT QU TU
But in this example 66% exist, so I only show the beginning of the table:
XXXXX AB AE AF AG AK AO AQ AT AU BE BF
ABEG 1 1 0 1 0 0 0 0 0 1 0
ABEG 1 1 0 1 0 0 0 0 0 1 0
ABFKO 1 0 1 0 1 1 0 0 0 0 1
ABKOQTU 1 0 0 0 1 1 1 1 1 0 0
SUM ALL 4 2 1 2 2 2 1 1 1 2 1
And now you can count, whatever you want.
A simple HLOOKUP function (WVERWEIS in German Excel) would help to get this:
A B E F G K O Q T U
A 0 4 2 1 2 2 2 1 1 1
B 0 0 2 1 2 2 2 1 1 1
E 0 0 0 0 2 0 0 0 0 0
F 0 0 0 0 0 1 1 0 0 0
G 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 2 1 1 1
O 0 0 0 0 0 0 0 1 1 1
Q 0 0 0 0 0 0 0 0 1 1
T 0 0 0 0 0 0 0 0 0 1
U 0 0 0 0 0 0 0 0 0 0
In my opinion, though, this is only manageable for maybe 10 items: the number of helper columns is n*(n-1)/2, so 10 items need 45 columns (because AA, BB, ... are not evaluated).
In all other cases you should try to program that.
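As an illustration of what such a program could look like, here is a minimal Python sketch that counts the pairs directly instead of building helper columns. It assumes the per-person item strings from the All_Items column above; the names `baskets` and `pair_counts` are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Items per person, taken from the All_Items column (A = Item 1, B = Item 2, ...).
baskets = {
    1: "ABEG",
    2: "ABKOQTU",
    3: "ABFKO",
    4: "ABEG",
}

pair_counts = Counter()
for items in baskets.values():
    # Sorting makes every pair appear in ascending order (AB, never BA);
    # set() guards against an item listed twice for the same person.
    pair_counts.update(combinations(sorted(set(items)), 2))

print(pair_counts[("A", "B")])
```

Only pairs that actually occur are stored, so the n*(n-1)/2 helper columns are not needed.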