Count which combinations of items are bought most frequently - Excel

What is an efficient method of doing this?
I have a column with the buyer's name and a column with item names. Each item a person bought is on a new row.
For example:
Person 1 Item 1
Person 1 Item 2
Person 1 Item 5
Person 1 Item 7
Person 2 Item 1
Person 2 Item 2
Person 2 Item 11
Person 2 Item 15
Person 2 Item 20
Person 2 Item 21
Person 2 Item 17
Person 3 Item 1
Person 3 Item 2
Person 3 Item 6
Person 3 Item 11
Person 3 Item 15
Person 4 Item 1
Person 4 Item 2
Person 4 Item 5
Person 4 Item 7
There are about 1,000,000 rows in total and each person has an average of 30 items.
I want to count how often two specific items are bought by the same person.
I am picturing something like this:
Item1 Item2 Item3 Item4 Item5 Item6
Item1 xxxxx 0% 0% 5% 10% 90%
Item2
Item3
Item4
Item5
Item6
I have tried using a pivot table, putting the item on row labels and the person on column labels, then counting items. Then I can use a lookup formula and multiply the results from the pivot table, but this doesn't work with such a large file. Is there a more efficient method?
I am open to all kinds of solutions.

You can use a helper 'table' to do this. First create a table of purchases by person. The formula in this table is:
=SUMPRODUCT(--($A$1:$A$20=E$2),--($B$1:$B$20=$D3))
This gives a 1/0 result showing whether a person ever bought that item.
Then create the grid of products like in your post and enter this formula:
=SUMPRODUCT($E3:$H3,INDEX($E$3:$H$12,MATCH(K$2,$D$3:$D$12,0),0))
This multiplies the purchase flags of Item X and Item Y for each person and sums the results, which gives the number of persons who bought both.
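Since the question is open to solutions outside Excel, the same helper-table idea can be sketched in pandas. This is a minimal sketch, not a tuned implementation: the column names Person and Item and the function name pair_counts are assumptions, and the 1/0 table is dense, so memory use grows with persons times distinct items.

```python
import pandas as pd


def pair_counts(df, person_col="Person", item_col="Item"):
    """Return an item x item table: number of persons who bought both items."""
    # Helper table: one row per person, one column per item, 1 if ever bought.
    basket = (pd.crosstab(df[person_col], df[item_col]) > 0).astype(int)
    # Same idea as the SUMPRODUCT grid: multiply the 1/0 flags and sum per pair.
    return basket.T.dot(basket)


# Sample data from the question.
purchases = {
    "Person 1": [1, 2, 5, 7],
    "Person 2": [1, 2, 11, 15, 20, 21, 17],
    "Person 3": [1, 2, 6, 11, 15],
    "Person 4": [1, 2, 5, 7],
}
df = pd.DataFrame(
    [(p, f"Item {i}") for p, items in purchases.items() for i in items],
    columns=["Person", "Item"],
)
counts = pair_counts(df)
print(counts.loc["Item 1", "Item 2"])  # persons who bought both items
```

The diagonal of the result holds the number of buyers of each item, so dividing a row by its diagonal entry would give the percentages pictured in the question.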

Maybe I misunderstand you, but you are not interested in the person who buys, but in the items bought by the same person? I do not think you can do this in one step using only formulas (in VBA it is of course easier).
To do it without VBA you could:
Sort by Person and Item
Build an accumulating string with all (different) items bought by one person
(untested, with Person in column A, Item in column B and the accumulated string in column C: something like IF(A1=A2;C1;"")&B2)
Ignore all strings but the last of a person (something like IF(A2=A3;"";C2))
After this you have something like
P I Items_a All_Items
1 A A
1 B AB
1 E ABE
1 G ABEG ABEG
2 A A
2 B AB
2 K ABK
2 O ABKO
2 Q ABKOQ
2 T ABKOQT
2 U ABKOQTU ABKOQTU
3 A A
3 B AB
3 F ABF
3 K ABFK
3 O ABFKO ABFKO
4 A A
4 B AB
4 E ABE
4 G ABEG ABEG
In the next step you could copy the All_Items strings to a new table, put every possible pair of items in the header row (in ascending order, because the items were sorted) and mark a cell with 1 if the string contains both items of the pair.
To make this easier to explain, the items are named A, B, ... (corresponding to Item 1, Item 2, ... in your example).
The formula is something like this (MID is TEIL in German Excel; the semicolons are the argument separator of my locale):
=IF(ISERROR(FIND(MID(B$1;1;1);$A2));0;1)*IF(ISERROR(FIND(MID(B$1;2;1);$A2));0;1)
In your case these are the possible combinations:
AB AE AF AG AK AO AQ AT AU BE BF BG BK BO BQ BT BU EF EG EK EO EQ ET EU FG FK FO FQ FT FU GK GO GQ GT GU KO KQ KT KU OQ OT OU QT QU TU
In this example only about 66% of these combinations (30 of 45) actually occur, so I only show the beginning of the table:
XXXXX AB AE AF AG AK AO AQ AT AU BE BF
ABEG 1 1 0 1 0 0 0 0 0 1 0
ABEG 1 1 0 1 0 0 0 0 0 1 0
ABFKO 1 0 1 0 1 1 0 0 0 0 1
ABKOQTU 1 0 0 0 1 1 1 1 1 0 0
SUM ALL 4 2 1 2 2 2 1 1 1 2 1
And now you can count whatever you want.
A simple HLOOKUP (WVERWEIS in German Excel) would help to get this:
A B E F G K O Q T U
A 0 4 2 1 2 2 2 1 1 1
B 0 0 2 1 2 2 2 1 1 1
E 0 0 0 0 2 0 0 0 0 0
F 0 0 0 0 0 1 1 0 0 0
G 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 2 1 1 1
O 0 0 0 0 0 0 0 1 1 1
Q 0 0 0 0 0 0 0 0 1 1
T 0 0 0 0 0 0 0 0 0 1
U 0 0 0 0 0 0 0 0 0 0
In my opinion, though, this is only manageable for maybe 10 items: the number of helper columns is n*(n-1)/2, so 10 items need 45 columns (AA, BB, ... are not evaluated).
In all other cases you should write a program for it.
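As one way of programming it, the "build all ascending combinations per person" step can be sketched in plain Python. This is only a sketch under assumptions: the function name count_pairs and the (person, item) tuple input are made up for illustration, and the letters A, B, ... stand for the items as in the tables above.

```python
from collections import Counter
from itertools import combinations


def count_pairs(rows):
    """Count, for every pair of items, how many persons bought both."""
    # Collect the distinct items of each person.
    baskets = {}
    for person, item in rows:
        baskets.setdefault(person, set()).add(item)
    # Build every ascending pair once per person and tally it.
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts


# The four All_Items strings from the table above.
all_items = {1: "ABEG", 2: "ABKOQTU", 3: "ABFKO", 4: "ABEG"}
rows = [(person, item) for person, items in all_items.items() for item in items]
pairs = count_pairs(rows)
print(pairs.most_common(3))
```

Only pairs that actually occur are stored, so no column is needed for the combinations nobody bought.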

Related

Returning column header corresponding to matched value

I need some help here. I am looking to retrieve the gender from Sheet 2 corresponding to the name in Sheet 1.
Step 1 - Match the name in sheet 1 to sheet 2 (not all names in sheet 1 will be in sheet 2, mark NA for non matching names)
Step 2 - Look for the corresponding gender in sheet 2.
Step 3 - Retrieve the column header or the last number in the column header (1,2,3...6)
Sheet 1
Name  Gender
w     ???
e
r
t
y
u
i
q
w
e
r
Sheet 2
Name  Male 1  Female 2  other 3  other 4  other 5  Do not know 6
w     1       0         0        0        0        0
a     0       0         0        0        0        1
q     1       0         0        0        0        0
r     0       1         0        0        0        0
e     1       0         0        0        0        0
t     0       0         0        0        1        0
y     0       0         0        0        0        1
u     0       1         0        0        0        0
With Office 365 we can use FILTER (the formulas assume the Sheet 2 table is in E1:K9 and the name to look up is in A2):
=IFERROR(FILTER($F$1:$K$1,INDEX($F$2:$K$9,MATCH(A2,$E$2:$E$9,0),0)=1),"No Match")
With older versions we can use another INDEX/MATCH:
=IFERROR(INDEX($F$1:$K$1,MATCH(1,INDEX($F$2:$K$9,MATCH(A2,$E$2:$E$9,0),0),0)),"No Match")
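For anyone doing the same lookup outside Excel, here is a rough pandas equivalent of the INDEX/MATCH formula. It is only a sketch: the function name gender_header and the hard-coded copy of Sheet 2 are assumptions made for illustration.

```python
import pandas as pd

# Sheet 2 from the question: one row per name, one 1/0 column per header.
headers = ["Male 1", "Female 2", "other 3", "other 4", "other 5", "Do not know 6"]
sheet2 = pd.DataFrame(
    [
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
    ],
    index=["w", "a", "q", "r", "e", "t", "y", "u"],
    columns=headers,
)


def gender_header(name, table=sheet2, default="No Match"):
    """Return the header of the first column holding a 1 for this name."""
    if name not in table.index:
        return default  # name missing from Sheet 2
    row = table.loc[name]
    hits = row[row.eq(1)]
    return hits.index[0] if len(hits) else default


sheet1_names = ["w", "e", "r", "t", "y", "u", "i", "q", "w", "e", "r"]
genders = [gender_header(name) for name in sheet1_names]
```

To get only the trailing number of the header, the last whitespace-separated token of the returned string could be taken.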

How to return all rows that have equal number of values of 0 and 1?

I have a dataframe with 50 columns, and each column contains either 0 or 1. How do I return all rows that have an equal number (a tie) of 0s and 1s (25 "0" and 25 "1")?
An example with 4 columns:
A B C D
1 1 0 0
1 1 1 0
1 0 1 0
0 0 0 0
Based on the above example, it should return the first and the third row:
A B C D
1 1 0 0
1 0 1 0
A row is tied exactly when half of its values are 1, i.e. when its row mean is 0.5 (two 1s out of four columns in the example, 25 out of 50 in your data). So, please try
df[df.mean(axis=1).eq(0.5)]
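A minimal sketch of this on the four-column example from the question follows. The variable names ties and ties_by_sum are made up, and the second line is an alternative that compares integer sums instead of a float mean.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 1, 0], "B": [1, 1, 0, 0], "C": [0, 1, 1, 0], "D": [0, 0, 0, 0]}
)

# A row mean of 0.5 means exactly half of the values are 1.
ties = df[df.mean(axis=1).eq(0.5)]

# Equivalent integer check: twice the number of 1s equals the number of columns.
ties_by_sum = df[df.sum(axis=1).mul(2).eq(df.shape[1])]

print(ties)
```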

How to identify where a particular sequence in a row occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 0
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 0 0 1 0 0 1
E 1 1 1 1 1 1
As you can see, 1 and 0 occur randomly in different columns. It would be helpful if anyone could suggest Python code to find the column number where the 1 0 0 pattern occurs for the first time. For example, for member A the first 1 0 0 pattern starts at appear_1, so the first occurrence is 1. Similarly, for member B the first 1 0 0 pattern starts at appear_2, so the first occurrence is column 2. The resulting table should have a new column named 'first_occurrence'. If no 1 0 0 pattern occurs (as in row E), the value in the first_occurrence column should be the number of 1s in that row. The resulting table should look something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
A 1 0 0 1 0 0 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 4
D 0 0 1 0 0 1 3
E 1 1 1 1 1 1 6
Thank you in advance.
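I try not to reinvent the wheel, so I build on my answer to your previous question. On top of that answer, you need idxmax, np.where, and get_indexer:
import numpy as np
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
# True from the first 1 in each row onwards
m = df1[df1.eq(1)].ffill(axis=1).notna()
# True where a 0 occurs after the first 1
df2 = df1[m].bfill(axis=1).eq(0)
# True where such a 0 is followed by another 0 (or sits in the last column)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# 0-based index of the first flagged 0 = 1-based position of the 1 before it;
# rows without a match get the number of columns
df['first_occurrence'] = np.where(m2.any(axis=1), df1.columns.get_indexer(m2.idxmax(axis=1)),
                                  df1.shape[1])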
Out[540]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
0 A 1 0 0 1 0 0 1
1 B 1 1 0 0 1 0 2
2 C 1 0 1 1 0 0 4
3 D 0 0 1 0 0 1 3
4 E 1 1 1 1 1 1 6
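For cross-checking the vectorised result, a plain row-wise version can be sketched as below. It is an assumption-laden alternative, not the method above: the helper name first_100 is made up, it looks only for a strict 1, 0, 0 run, and it falls back to the number of 1s as the question describes, so it may differ from the vectorised code on edge cases such as a row ending in 1 0.

```python
import pandas as pd


def first_100(values):
    """1-based position of the first 1, 0, 0 run, else the count of 1s."""
    values = list(values)
    for i in range(len(values) - 2):
        if values[i:i + 3] == [1, 0, 0]:
            return i + 1
    return sum(values)  # fallback described in the question


cols = ["appear_1", "appear_2", "appear_3", "appear_4", "appear_5", "appear_6"]
df = pd.DataFrame(
    [[1, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 1, 0],
     [1, 0, 1, 1, 0, 0],
     [0, 0, 1, 0, 0, 1],
     [1, 1, 1, 1, 1, 1]],
    columns=cols,
)
df.insert(0, "Person", list("ABCDE"))

# Apply the helper to each row of the appear_* columns.
df["first_occurrence"] = df[cols].apply(first_100, axis=1)
print(df)
```

A row-wise apply is slower than the vectorised version, so this is mainly useful for checking results on a sample.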

pandas assign value in multiple columns based on value in one

I have a dataset like this,
sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
'cat1': [0,0,0],
'cat2': [0,0,0],
'cat3': [0,0,0],
'cat4': [0,0,0]
}
df = pd.DataFrame(sample,columns = ['Theme','cat1','cat2','cat3','cat4'])
Theme cat1 cat2 cat3 cat4
0 never give a ten 0 0 0 0
1 interaction speed 0 0 0 0
2 no feedback,premium 0 0 0 0
Now I need to replace the values in the cat columns based on the value in Theme. If the Theme column contains 'never give a ten', set cat1 to 1; similarly, if it contains 'interaction speed', set cat2 to 1; if it contains 'no feedback', set cat3 to 1; and for 'premium' set cat4 to 1.
In this sample I have provided 4 categories, but I have 21 categories in total. I could write "if word in string" 21 times for the 21 categories, but I am looking for an efficient way to put this in a function that goes through every row, applies the logic and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
Here it is possible to create the columns by category with Series.str.get_dummies (the column names come out sorted):
df1 = df['Theme'].str.get_dummies(',')
print (df1)
   interaction speed  never give a ten  no feedback  premium
0                  0                 1            0        0
1                  1                 0            0        0
2                  0                 0            1        1
If you need the first column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print (df11)
                 Theme  interaction speed  never give a ten  no feedback  \
0     never give a ten                  0                 1            0
1    interaction speed                  1                 0            0
2  no feedback,premium                  0                 0            1

   premium
0        0
1        0
2        1
If the order of the columns is important, add DataFrame.reindex:
# remove possible duplicates while keeping the original order
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print (df2)
   never give a ten  interaction speed  no feedback  premium
0                 1                  0            0        0
1                 0                  1            0        0
2                 0                  0            1        1
And with the Theme column joined:
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print (df2)
                 Theme  never give a ten  interaction speed  no feedback  \
0     never give a ten                 1                  0            0
1    interaction speed                 0                  1            0
2  no feedback,premium                 0                  0            1

   premium
0        0
1        0
2        1
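If the output columns have to be called cat1 ... cat4 as in the question, the dummy columns could be renamed with a lookup dictionary. This is a sketch under the assumption that the theme-to-category mapping is known up front; the name theme_to_cat is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Theme": ["never give a ten", "interaction speed", "no feedback,premium"]}
)

# Assumed mapping from theme text to category column (extend to all 21).
theme_to_cat = {
    "never give a ten": "cat1",
    "interaction speed": "cat2",
    "no feedback": "cat3",
    "premium": "cat4",
}

# One 1/0 column per theme, renamed and ordered by the mapping.
cats = (
    df["Theme"]
    .str.get_dummies(",")
    .rename(columns=theme_to_cat)
    .reindex(columns=list(theme_to_cat.values()), fill_value=0)
)
out = df.join(cats)
print(out)
```

Categories that never appear in Theme come out as all-zero columns thanks to fill_value=0, and themes missing from the mapping are dropped by the reindex.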

drop duplicate rows from dataframe based on column precedence - python

Suppose I have a dataset like this.
Example:
Name A B C
0 Jon 0 1 0
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
If there is a duplicate name in my dataset, I want the row that has column A as 1 to take precedence.
NB. Column A can only have values 1 or 0
Output:
Name A B C
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
IIUC, sort the values before dropping the duplicates:
df.sort_values('A').drop_duplicates('Name',keep='last').sort_index()
Out[126]:
Name A B C
1 Jon 1 0 1
2 Alan 1 0 0
3 Shaya 0 1 1
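The same one-liner, run on the sample data, might look like the sketch below. The only change is kind="mergesort", an assumption added here so that rows with equal A keep their original relative order (the default sort is not guaranteed to be stable).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Jon", "Jon", "Alan", "Shaya"],
        "A": [0, 1, 1, 0],
        "B": [1, 0, 0, 1],
        "C": [0, 1, 0, 1],
    }
)

result = (
    df.sort_values("A", kind="mergesort")   # rows with A == 1 move to the end
      .drop_duplicates("Name", keep="last")  # keep the last row, so A == 1 wins
      .sort_index()                          # restore the original row order
)
print(result)
```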
