MS Excel - obtaining and presenting results by combining logical operators, sums, pivot tables, and forms - excel

Before I present the question, I would like to state that I have spent time researching a solution to this problem, with only partial success.
So, here's the scenario:
There are X number of users, using Y number of products. Each user may use one or more products. Some of the users work from home and others from the office.
e.g.
ID Office Home Dept P1 P2 P3
1 loc1 1 dep1 1 1 0
2 loc1 0 dep1 0 0 1
3 loc2 1 dep1 1 1 1
4 loc4 1 dep2 1 0 1
5 loc3 0 dep2 1 1 0
6 loc3 0 dep1 1 1 0
7 loc1 0 dep3 1 0 0
What I want to achieve:
I would like to calculate and present, for various combinations of products, the number of users in common and the total number of users. I know that I can use the logical functions 'or' and 'and' to determine 'union' and 'intersection', and calculate a column sum to get the total number of users and number of users in common. However, I would like to be able to select various combinations of products without creating a new formula each time; something with check-boxes.
What I want to present:
As for the presentation of the data, would I be able to create something useful via a pivot table, or should I look at custom forms, which include check-boxes for P1, P2, P3, ...? Any hints and suggestions would be much appreciated.
Thank you,
amx13
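For illustration only, here is the counting logic described above as a minimal Python sketch rather than an Excel formula. The data is copied from the sample table; the function name `count_users` and the idea of passing the ticked products as a list are assumptions made for this example. "In common" means the user has every selected product (AND), and "total" means the user has at least one of them (OR).

```python
# Sample data copied from the table in the question (1 = user uses the product).
users = [
    {"ID": 1, "P1": 1, "P2": 1, "P3": 0},
    {"ID": 2, "P1": 0, "P2": 0, "P3": 1},
    {"ID": 3, "P1": 1, "P2": 1, "P3": 1},
    {"ID": 4, "P1": 1, "P2": 0, "P3": 1},
    {"ID": 5, "P1": 1, "P2": 1, "P3": 0},
    {"ID": 6, "P1": 1, "P2": 1, "P3": 0},
    {"ID": 7, "P1": 1, "P2": 0, "P3": 0},
]


def count_users(selected):
    """Return (in_common, total) for the selected product columns.

    in_common: users who use ALL selected products (intersection).
    total:     users who use AT LEAST ONE selected product (union).
    """
    in_common = sum(1 for u in users if all(u[p] for p in selected))
    total = sum(1 for u in users if any(u[p] for p in selected))
    return in_common, total


# The list plays the role of the ticked check-boxes.
print(count_users(["P1", "P2"]))
print(count_users(["P1", "P2", "P3"]))
```

The point of the sketch is that the selection is data (a list of column names), so no new formula is needed per combination.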

Related

Compare multiple data from rows

I'm looking for a way to compare multiple rows of data to each other, trying to find the best possible match. Each number in every column must approximately match the other numbers in the same column.
Example:
Customer #1: 1 5 10 9 7 7 8 2 3
Customer #2: 10 5 9 3 5 7 4 3 2
Customer #3: 1 4 10 9 8 7 6 2 2
Customer #4: 9 5 6 7 2 1 10 5 6
In this example customers #1 and #3 are quite similar, and I need to find a way to highlight or sort the rows so I can easily find the best match.
I've tried using conditional formatting to highlight the numbers that are similar, but that is quite confusing, because the amount of data is quite big.
Any ideas of how I could solve this?
Thanks!
The following formula entered in (say) L1 and pulled down gives the best match with the current row based on the sum of the absolute differences between corresponding cells:-
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(ABS($C1:$K1-$C$1:$K$4),TRANSPOSE(COLUMN($C$1:$K$4))^0))))
It is an array formula and must be entered with Ctrl+Shift+Enter.
You can then sort on column L to bring the customers with the lowest scores (the closest matches) to the top, or use conditional formatting to highlight rows with a certain score.
EDIT
If you wanted to penalise large differences in individual columns more heavily than small differences to try and avoid pairs of customers which are fairly similar except for having some columns very different, you could try something like the square of the differences:-
=MIN(IF(ROW($C$1:$K$4)<>ROW(),(MMULT(($C1:$K1-$C$1:$K$4)^2,TRANSPOSE(COLUMN($C$1:$K$4))^0))))
then the scores for your test data would come out as 7,127,7,127.
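To make the scoring easier to follow outside Excel, here is a rough Python sketch of the same idea: for each row, take the smallest distance to any other row, where the distance is the sum of absolute differences (`power=1`) or of squared differences (`power=2`). The function name `best_scores` and the `power` parameter are made up for this illustration.

```python
# Test data from the question, one list per customer.
customers = [
    [1, 5, 10, 9, 7, 7, 8, 2, 3],
    [10, 5, 9, 3, 5, 7, 4, 3, 2],
    [1, 4, 10, 9, 8, 7, 6, 2, 2],
    [9, 5, 6, 7, 2, 1, 10, 5, 6],
]


def best_scores(rows, power=1):
    """For each row, return the smallest distance to any OTHER row.

    power=1 sums absolute differences; power=2 sums squared differences,
    which penalises a few large gaps more than many small ones.
    """
    scores = []
    for i, row in enumerate(rows):
        distances = [
            sum(abs(a - b) ** power for a, b in zip(row, other))
            for j, other in enumerate(rows)
            if j != i  # skip comparing a row with itself
        ]
        scores.append(min(distances))
    return scores


print(best_scores(customers))           # absolute differences
print(best_scores(customers, power=2))  # squared differences
```

A lower score means a closer match, so sorting ascending puts the most similar customers first.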
I'm assuming you want to compare customers 2-4 with customer 1 and that you are comparing only within each column. In this case, you could implement a 'scoring system' using multiple IFs. For example:
     A           B  C  D  E
1    Customer 1  1  1  2
2    Customer 2  1  2  2
3    Customer 3  0  1  0
you could use in E2
=if(B2=$B$1,1,0)+if(C2=$C$1,1,0)+if(D2=$D$1,1,0)
This will return a 'score' of 1 when you have a match and a 'score' of 0 when you don't. It then adds up the scores and your highest value will be your best match. Copying down would then give
     A           B  C  D  E
1    Customer 1  1  1  2
2    Customer 2  1  2  2  2
3    Customer 3  0  1  0  1
so customer 2 is the best match.
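The same exact-match scoring can be sketched in a few lines of Python, in case it helps to see the logic without the nested IFs. The names `reference`, `others` and `match_score` are just for this example.

```python
# Customer 1 is the reference row; the others are scored against it.
reference = [1, 1, 2]
others = {
    "Customer 2": [1, 2, 2],
    "Customer 3": [0, 1, 0],
}


def match_score(row, ref):
    """Count the columns where the row has exactly the reference value."""
    return sum(1 for a, b in zip(row, ref) if a == b)


scores = {name: match_score(row, reference) for name, row in others.items()}
best = max(scores, key=scores.get)  # highest score = best match
print(scores, best)
```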

Group people based on their hobbies in Spark

I am working with PySpark on a case where I need to group people based on their interests. Let's say I have n persons:
person1, movies, sports, dramas
person2, sports, trekking, reading, sleeping, movies, dramas
person3, movies, trekking
person4, reading, trekking, sports
person5, movies, sports, dramas
.
.
.
Now I want to group people based on their interests.
Group people who have at least m common interests (m is user input, it could be 2, 3, 4...)
Let's assume m=3
Then the groups are:
(person1, person2, person5)
(person2, person4)
User who belongs to x groups (x is user input)
Let's assume x=2
Then
person2 is in two groups
My response will be algebraic and not Spark/Python specific, but can be implemented in Spark.
How can we express the data in your problem?
I will go with matrix - each row represents a person, each column represents interest. So following your example:
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 0 1 1 0 1
P5: 1 1 0 0 0 1
What if we would like to investigate the similarity of P2 and P3, that is, check how many interests they share? We could use the following formula:
(movies)+(sports)+(trekking)+(reading)+(sleeping)+(dramas)
1*1 + 1*0 + 1*1 + 1*0 + 1*0 + 1*0 = 2
It may look familiar to you - it looks like part of matrix multiplication calculation.
To get full usage of the fact that we observed, we have to transpose the matrix - it will look like that:
P1,P2,P3,P4,P5
movies 1 1 1 0 1
sports 1 1 0 0 1
trekking 0 1 1 1 0
reading 0 1 0 1 0
sleeping 0 1 0 0 0
dramas 1 1 0 1 1
Now if we multiply the matrices (original and transposed) you would get new matrix:
P1 P2 P3 P4 P5
P1 3 3 1 1 3
P2 3 6 2 3 3
P3 1 2 2 1 1
P4 1 3 1 3 1
P5 3 3 1 1 3
What you see here is the result you are looking for - check the value on the row/column crossing and you will get number of shared interests.
How many interests do P2 share with P4? Answer: 3
Who shares 3 interests with P1? Answer: P2 and P5
Who shares 2 interests with P3? Answer: P2
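As a small, non-Spark illustration of the multiplication, here is a plain Python sketch that builds the shared-interest matrix from the person/interest matrix used in this answer and then runs the lookups above. The helper names `shared_interest_counts` and `who_shares` are invented for the example; in Spark you would use its distributed matrix types instead of nested lists.

```python
people = ["P1", "P2", "P3", "P4", "P5"]

# Rows: people. Columns: movies, sports, trekking, reading, sleeping, dramas.
matrix = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
]


def shared_interest_counts(m):
    """Multiply m by its transpose: entry [i][j] is the dot product of
    row i and row j, i.e. the number of interests both people have."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


shared = shared_interest_counts(matrix)


def who_shares(person, k):
    """People (other than `person`) sharing exactly k interests with them."""
    i = people.index(person)
    return [people[j] for j, v in enumerate(shared[i]) if j != i and v == k]


print(shared[people.index("P2")][people.index("P4")])
print(who_shares("P1", 3))
print(who_shares("P3", 2))
```

The diagonal of the result is simply each person's own number of interests, and the matrix is symmetric, so only one triangle needs to be inspected.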
Some hints on how to apply this idea into Apache Spark
How to operate on matrices using Apache Spark?
Matrix Multiplication in Apache Spark
How to transpose matrix using Apache Spark?
Matrix Transpose on RowMatrix in Spark
EDIT 1: Adding more realistic method (after the comments)
We have a table/RDD/Dataset "UserHobby":
movies,sports,trekking,reading,sleeping,dramas
P1: 1 1 0 0 0 1
P2: 1 1 1 1 1 1
P3: 1 0 1 0 0 0
P4: 0 0 1 1 0 1
P5: 1 1 0 0 0 1
Now to find all the people that share 2 hobbies with P1 you would have to execute:
SELECT * FROM UserHobby
WHERE movies*1 + sports*1 + trekking*0 +
reading*0 + sleeping*0 + dramas*1 = 2
Now you would have to repeat this query for all the users (changing 0s and 1s to the actual values). The algorithm complexity is O(n^2 * m) - n number of users, m number of hobbies
What is nice about this method is that you don't have to generate subsets.
My answer is probably not the best, but it will do the job. If you know the full list of hobbies in advance, you can write a piece of code that computes the combinations before going into the Spark part.
For example:
Filter out the people whose hobbies_count < input_number at the very start to discard unwanted records.
If the full list of hobbies is {a,b,c,d,e,f} and the input_number is 2, the list of combinations in this case would be
{(ab)(ac)(ad)(ae)(af)(bc)(bd)(be)(bf)(cd)(ce)(cf)(de)(df)(ef)}
So you will need to generate the possible combinations in advance for the input_number.
Later, perform a filter for each combination and track the record count.
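A rough, non-Spark sketch of this combination idea in plain Python, using the sample data from the question. Instead of enumerating every combination of the global hobby list, it enumerates the m-sized combinations of each person's own hobbies, which yields the same groups for the people that qualify. The function name `groups_by_common_interests` is an assumption for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

interests = {
    "person1": {"movies", "sports", "dramas"},
    "person2": {"sports", "trekking", "reading", "sleeping", "movies", "dramas"},
    "person3": {"movies", "trekking"},
    "person4": {"reading", "trekking", "sports"},
    "person5": {"movies", "sports", "dramas"},
}


def groups_by_common_interests(interests, m):
    """Group people by every m-sized combination of hobbies they have.

    Only combinations shared by at least two people are kept.
    """
    groups = defaultdict(list)
    for person, hobbies in interests.items():
        if len(hobbies) < m:
            continue  # cannot share m interests with anyone
        for combo in combinations(sorted(hobbies), m):
            groups[combo].append(person)
    return {combo: members for combo, members in groups.items() if len(members) > 1}


groups = groups_by_common_interests(interests, 3)
for combo, members in groups.items():
    print(combo, members)

# How many groups each person belongs to (the "x groups" part of the question).
membership = Counter(p for members in groups.values() for p in members)
print([p for p, n in membership.items() if n == 2])
```

Keep in mind that the number of combinations grows quickly with the number of hobbies per person, so this only stays practical when that number is small.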
If the number of users is large, you can't possibly think about going for any User x User approach.
Step 0. As a first step, we should ignore all the users who don't have at least m interests (since they cannot have at least m common interests with anyone).
Step 1. Possible approaches:
i)Brute Force: If the maximum number of interests is small, say 10, you can generate all the possible interest combinations in a HashMap, and assign an interest group id to each of these. You will need just one pass over the interest set of a user to find out which interest groups do they qualify for. This will solve the problem in one pass.
ii) Locality Sensitive Hashing: After step 0 is done, we know that we only have users that have a minimum of m hobbies. If you are fine with an approximate answer, locality sensitive hashing can help. How to understand Locality Sensitive Hashing?
A sketch of the LSH approach:
a) First map each of the hobbies to an integer (which you should do anyways, if the dataset is large). So we have User -> Set of integers (hobbies)
b) Obtain a signature for each user by taking the min hash (hash each of the integers with a hash function k, and take the min). This gives User -> Signature
c) Explore the users with the same signature in more detail, if you want (or you can be done with an approximate answer).
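Steps a) to c) can be sketched roughly as follows in plain Python. Note one deviation, which is my assumption and not part of the outline above: the sketch uses several hash functions rather than one, which is how MinHash is usually set up, so that the fraction of matching signature positions gives an estimate of the Jaccard similarity of two hobby sets. The hash family `(a*x + b) mod PRIME` and all names here are illustrative choices.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the hash family

# Step a) map each hobby to an integer.
hobby_ids = {"movies": 0, "sports": 1, "trekking": 2,
             "reading": 3, "sleeping": 4, "dramas": 5}


def make_hash_functions(count, seed=0):
    """Create `count` random (a, b) pairs for h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def minhash_signature(ids, hash_functions):
    """Step b) for each hash function keep the minimum hash over the set."""
    return tuple(min((a * x + b) % PRIME for x in ids) for a, b in hash_functions)


def estimated_similarity(sig1, sig2):
    """Fraction of positions where two signatures agree."""
    return sum(1 for s, t in zip(sig1, sig2) if s == t) / len(sig1)


hashes = make_hash_functions(50)
p1 = minhash_signature({hobby_ids[h] for h in ("movies", "sports", "dramas")}, hashes)
p3 = minhash_signature({hobby_ids[h] for h in ("movies", "trekking")}, hashes)
p5 = minhash_signature({hobby_ids[h] for h in ("movies", "sports", "dramas")}, hashes)

# Step c) users with equal (or mostly equal) signatures are candidates
# to be compared in more detail.
print(estimated_similarity(p1, p5))
print(estimated_similarity(p1, p3))
```

Identical hobby sets always produce identical signatures; for other pairs the estimate is approximate and improves with more hash functions.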

Excel Pivot Drilldown on Sums of Flags

I've been hitting a wall on this today. I'm pulling my data from a SQL DB and in my query I'm setting flags so that I can use the Pivot table to count up all the different checks I'm trying to run on the data. A simple version of what I'm doing is below, with "greater than 90 days" and "error check" flags set depending on the data in Uptime.
Host Team Uptime >90 ErrChk
Srv1 A 15 0 0
Svr2 A 102 1 0
Srv3 B 95 1 0
Svr4 B 20 0 0
Srv5 B 21 0 0
Srv6 B ERROR 0 1
Srv7 A ERROR 0 1
Srv8 B 150 1 0
Srv9 A 100 1 0
Srv10 A 10 0 0
Srv11 A 125 1 0
Srv12 A 40 0 0
Srv13 B 111 1 0
Srv14 B 100 1 0
Srv15 A 15 0 0
If you were to plug this into Excel exactly the way that it is you will see what I'm getting at. Once you create a Pivot table on this data and use the Team as the Rows and the Sum of >90 and ErrChk the pivot comes out and looks correct.
Row Labels Sum of >90 Sum of ErrChk
A 3 1
B 4 1
Grand Total 7 2
But the next piece is where it gets wonky. With a Pivot table, when you double-click one of the values it filters down to what makes up that value; however, when you do it on the value of a SUM, it just filters down on the Row Label. So for this example you would click 3 under >90 for team A to see which 3 servers have been up for longer than 90 days. However, when you do this it drills down to just a filter of team A, showing you all of the servers: the good, the bad, the error.
My question is how can I drill down to just the items that make up that value? I've tried all I can think of: having the field be NULL, 0, calculated fields.
I would suggest adding an additional pivot table for the view you want. For your example you might want to add Team, then Host to Row Labels. Then add Sum of > 90 to Values. Then add > 90 to Report Filter and set the filter to 1.
These pivot tables could all be on the one worksheet pulling from the same data source as a kind of summary and the filter would be preset so there would be no need to provide instructions to those viewing the summary.
If you'd like to let users change the filter easily, you could add a slicer, which will present all the options by which the data can be filtered.
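To spell out what the extra pivot table with the filter set to 1 is doing, here is a small Python sketch of the same filtering on the sample data: keep only the rows of one team where the chosen flag is 1. The column names `over_90` and `err_chk` and the function `drill_down` are invented for this illustration.

```python
# (host, team, uptime, over_90, err_chk) copied from the sample data.
rows = [
    ("Srv1", "A", 15, 0, 0), ("Svr2", "A", 102, 1, 0), ("Srv3", "B", 95, 1, 0),
    ("Svr4", "B", 20, 0, 0), ("Srv5", "B", 21, 0, 0), ("Srv6", "B", "ERROR", 0, 1),
    ("Srv7", "A", "ERROR", 0, 1), ("Srv8", "B", 150, 1, 0), ("Srv9", "A", 100, 1, 0),
    ("Srv10", "A", 10, 0, 0), ("Srv11", "A", 125, 1, 0), ("Srv12", "A", 40, 0, 0),
    ("Srv13", "B", 111, 1, 0), ("Srv14", "B", 100, 1, 0), ("Srv15", "A", 15, 0, 0),
]

FLAG_INDEX = {"over_90": 3, "err_chk": 4}


def drill_down(team, flag):
    """Hosts of one team whose flag is set, i.e. the rows behind the SUM."""
    idx = FLAG_INDEX[flag]
    return [r[0] for r in rows if r[1] == team and r[idx] == 1]


print(drill_down("A", "over_90"))
print(drill_down("B", "err_chk"))
```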

Calculate percentage using functions

I have an Excel workbook (Office 2010) that contains multiple spreadsheets, one per office in our organization. We use this workbook to keep track of their "errors" in turned-in documents (example only). I'm trying to figure out a good way to determine, through automation (functions), the percentage of documents that have had errors in them. I want the number of documents with at least one error as a share of the documents looked at, completely ignoring how many errors each document has. So if I looked at 10 documents and 7 of those had at least one error, the office's percentage of errors would be 70%.
Is there any easy way to do this?
I've tried a few functions but I continue to get errors. I show a sample sheet (one office) below. This example is similar across multiple sheets and there is a dashboard that I would like to display all of these statistics based on offices.
workpaper DISCREPENCIES
Paper Spelling Grammar Punctuation Total Errors/Paper
A.36.7 1 0 1 2
A.36.8 0 1 1 2
A.36.9 0 1 0 1
A.36.10 0 0 0 0
A.36.11 1 0 0 1
A.36.12 1 1 1 3
A.36.13 2 3 0 5
A.36.14 0 0 0 0
A.36.15 0 0 0 0
A.36.16 1 1 1 3
Total Errors 17
Total Documents 10
Total Documents w/ errors 7
Percentage of Errors 70%
I can do all of this manually but I would like to find a way to do this across all sheets, since there are quite a few, and output them to a "dashboard" that has all offices listed in rows.
One way is to look at the number of papers with 0 errors in them, as a share of all papers. Then subtract that percentage from 100%. For example in G5:
=COUNTIF(E3:E12,0)/COUNT(E3:E12)
and in G6:
=100%-G5
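As a quick sanity check of the arithmetic, here is the same calculation as a short Python sketch on the "Total Errors/Paper" column from the sample sheet. The variable names are made up for the example.

```python
# "Total Errors/Paper" column from the sample sheet, one value per paper.
totals = [2, 2, 1, 0, 1, 3, 5, 0, 0, 3]

# Share of papers with zero errors (the G5 formula) ...
clean_share = sum(1 for t in totals if t == 0) / len(totals)
# ... and the complement, the share with at least one error (the G6 formula).
error_share = 1 - clean_share

docs_with_errors = sum(1 for t in totals if t > 0)
print(sum(totals), len(totals), docs_with_errors, f"{error_share:.0%}")
```

Counting the papers with more than zero errors directly and dividing by the number of papers gives the same percentage without the subtraction step.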

PivotTable will not group ANYTHING

I have done some Googling and found other people had issues when their DATE fields were not true dates, had blank values, or where date values were not actual dates (e.g. 1/33/2015). However, none of these are true for my table.
I am also unable to group regular text fields into a new group like I used to be able to. I am working with ticket closures and want to group "Cancelled" and "Completed" tickets into a Closed group and everything else into an Open group.
I was able to do this on another Excel sheet a few weeks ago but now the option is completely grayed out.
Count of ETC Column Labels
Row Labels Closed-AS IS Complete New Pending Grand Total
John Doe 2 365 367
1/2/2015 2 2
1/5/2015 1 1
1/8/2015 1 1
1/15/2015 1 1
1/16/2015 1 1
1/20/2015 2 2
1/21/2015 2 2
1/22/2015 2 2
1/23/2015 1 1
1/26/2015 2 2
1/30/2015 1 1
2/2/2015 1 1
2/3/2015 2 2
2/4/2015 3 3
2/5/2015 1 1
2/6/2015 1 1
EDIT:
AHA! Okay, so I've narrowed it down (still not resolved though). I am able to GROUP/UNGROUP when using an external data source (in this case it's a .iqy file, or SharePoint List/Export). I am UNABLE to GROUP/UNGROUP when referencing a table defined in the actual workbook or in the Data Model.
This is not much of an answer but here's what I found.
PowerPivot, apparently, does NOT allow for GROUPING/UNGROUPING so if your data is added to PowerPivot's data model these options will be grayed out. Instead, you will have to "force" the data by creating new columns with whatever grouping logic you can build in DAX and then create a hierarchy out of them.
