Find identical Excel files (except file name and some attributes) in a folder - excel

Students submit an assignment in Excel. Many students copy someone else's work and submit identical Excel files. The files are identical in every way except that the filename and date/time attributes might differ; the size may also differ slightly, for reasons unknown to me.
All the files are in a single folder.
How may I check to see which files are identical (except for filename, some date/time attributes, and minor file size differences)?

Have a look at Duplicate Files Finder.

In general, you should do a pair-wise comparison. For example, for 6 students, you would compare 15 pairs:
compare 1 to 2
compare 1 to 3
compare 1 to 4
compare 1 to 5
compare 1 to 6
compare 2 to 3
compare 2 to 4
compare 2 to 5
compare 2 to 6
compare 3 to 4
compare 3 to 5
compare 3 to 6
compare 4 to 5
compare 4 to 6
compare 5 to 6
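As a small illustration of the pair list above, here is a sketch in Python (the thread itself only uses Excel, so the language and the `students` list are assumptions made for the example). It enumerates every unordered pair once, which gives n(n-1)/2 comparisons for n students.

```python
from itertools import combinations

# Hypothetical student identifiers; any list of file names would work the same way.
students = [1, 2, 3, 4, 5, 6]

# combinations(..., 2) yields each unordered pair exactly once, in the order listed above.
pairs = list(combinations(students, 2))

for a, b in pairs:
    print(f"compare {a} to {b}")

print(len(pairs), "pairs in total")
```

The number of pairs grows quadratically, so 30 students would already need 435 comparisons.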

I do not know of any software that checks all pairs of spreadsheets in a folder and lists those that are identical. The tool suggested by @Serge does a byte-by-byte comparison, but that is too restrictive for your purposes. Two students may share a spreadsheet, and simply by saving it at different times or with different software versions the files may differ at the byte level while having no real differences at the level of cell contents.
However, if you have a small number of files and can manually compare each pairing, then the following formula may help you. Suppose the spreadsheets are Student1 and Student2, each has only one worksheet, and the meaningful content is restricted to the range A1:Z1000. Then this array formula returns TRUE if and only if every cell in the range is identical on the two sheets.
=AND([Student1]Sheet1!A1:Z1000=[Student2]Sheet1!A1:Z1000)
(Note that this is an array formula so must be entered using Ctrl-Shift-Enter.)
Once you get that to work for just two files, then you could set up a list of file pairs to compare, perhaps like this:
+---------------+---------------+-----------+
| Spreadsheet A | Spreadsheet B | Identical |
+---------------+---------------+-----------+
| Student1 | Student2 | FALSE |
| Student1 | Student3 | TRUE |
| Student2 | Student3 | FALSE |
+---------------+---------------+-----------+
where the formula in C2 is {=AND(INDIRECT(CONCAT("'[",A2,".xlsx]Sheet1'!A1:Z1000"))=INDIRECT(CONCAT("'[",B2,".xlsx]Sheet1'!A1:Z1000")))}
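For a folder with many files, a scripted first pass may be easier than maintaining the pair table by hand. The following is only a sketch in Python, not something proposed in the thread. It relies on the fact that an .xlsx file is a ZIP archive and hashes only the worksheet parts and the shared-strings part, ignoring metadata parts such as docProps/core.xml. The function names `content_fingerprint` and `group_identical` are made up for this example. Files that were re-saved by a different Excel version may still store the same cells differently, so this can miss some copies; it should not flag files whose sheet data differs.

```python
import hashlib
import zipfile
from collections import defaultdict
from pathlib import Path


def content_fingerprint(path):
    """Hash only the worksheet and shared-string parts of an .xlsx archive."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            if name.startswith("xl/worksheets/") or name == "xl/sharedStrings.xml":
                digest.update(name.encode("utf-8"))
                digest.update(archive.read(name))
    return digest.hexdigest()


def group_identical(folder):
    """Return lists of file names in `folder` whose fingerprints coincide."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.xlsx")):
        groups[content_fingerprint(path)].append(path.name)
    # Keep only fingerprints shared by two or more files.
    return [names for names in groups.values() if len(names) > 1]
```

Calling `group_identical("submissions")` on a hypothetical folder named submissions would return one list per group of matching files, which can then be checked by hand or with the array formula above.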

Related

How to merge data of two excel sheets into the third sheet with some cleansing operations

I have a homework assignment where I have to merge data of two excel sheets by performing some cleansing operations using formulas.
Sheet 1:
OrderID | Full Name | Customer Status
1001 | Waqar Hussain | Silver
2002 | Ali Moin | Gold
Sheet 2:
OrderID | First Name | Last Name | Customer Status
A1003 | Junaid | Ali | 2
A2004 | Kamran | Hussain | 1
Sheet 3 (combined sheet, expected):
OrderID | Full Name | Customer Status
1001 | Waqar Hussain | Silver
2002 | Ali Moin | Gold
1003 | Junaid Ali | Silver
2004 | Kamran Hussain | Gold
There are probably a lot of ways to do this. First make sure the data is clean. If you are already 100% positive the data is clean, you can skip this step; if you aren't sure, it's better to be safe than sorry. For each column, create a helper column that uses the CLEAN and TRIM functions to remove any non-printable characters and extra spaces, something like =TRIM(CLEAN(A2)). Then drag the formula down to fill every row.
After this, in order to merge the data we need something to join on. The full name seems to make the most sense. On Sheet 2 we'll write a formula that joins the first name and last name together. The CONCAT function should work:
=CONCAT(First Name, " ", Last Name), with cell references in place of the column names. Note the space inside the quotes; that way the result matches the Full Name format from Sheet 1. It looks like we'll also need to strip the letter from the Order ID in Sheet 2. I'm going to assume that all of those Order IDs are 5 characters long. If this isn't true then you'll need a different solution. You can use =RIGHT(A2,4), which grabs the rightmost 4 characters of the text string.
At this point let's create a distinct list. Copy the Full Names from Sheet 1 and paste them onto Sheet 3. Copy the Full Names we created on Sheet 2 and Paste VALUES onto Sheet 3 below the names from Sheet 1. Then select all the rows in the column, go to the Data tab, and click "Remove Duplicates". This leaves a distinct list of values.
We can now merge the data together using INDEX MATCH. There are lots of great tutorials on how to use INDEX and MATCH in combination. It's a little long to explain in this thread, but this is a great tutorial explaining how it works. It's worth taking 10 minutes to fully understand it because it is a formula you will use thousands of times throughout your life.
https://www.deskbright.com/excel/using-index-match/
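To make the cleaning steps concrete, here is a rough Python sketch of the same transformations applied to the sample rows from the question. It is not the INDEX MATCH approach: since the sample sheets share no names, it simply appends the normalised Sheet 2 rows to Sheet 1. The `STATUS_NAMES` mapping (1 is Gold, 2 is Silver) is an assumption inferred from the expected output in the question, and the helper names are made up for the example.

```python
# Assumption inferred from the expected output: status code 1 -> Gold, 2 -> Silver.
STATUS_NAMES = {"1": "Gold", "2": "Silver"}

sheet1 = [("1001", "Waqar Hussain", "Silver"), ("2002", "Ali Moin", "Gold")]
sheet2 = [("A1003", "Junaid", "Ali", "2"), ("A2004", "Kamran", "Hussain", "1")]


def clean(text):
    """Rough analogue of TRIM(CLEAN(...)): drop non-printable characters, collapse spaces."""
    printable = "".join(ch for ch in text if ch.isprintable())
    return " ".join(printable.split())


def normalise_sheet2(row):
    order_id, first, last, status = (clean(value) for value in row)
    # order_id[-4:] mirrors RIGHT(A2,4); the f-string mirrors CONCAT with a space.
    return (order_id[-4:], f"{first} {last}", STATUS_NAMES[status])


combined = [tuple(clean(value) for value in row) for row in sheet1]
combined += [normalise_sheet2(row) for row in sheet2]

for row in combined:
    print(" | ".join(row))
```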
Let me know if I can clarify anything.
Best,
Brett

Excel - Sort Columns of Differing Length That Adds Blank Cells when No Match Is Found

I have a spreadsheet with four total columns of data. One set is the product SKU and on hand quantities for a brick and mortar store and the other set is for the website. The two sets of product SKUs have different item count totals. I have googled to see if it was possible to sort the lists to make them even (SKU = SKU) by filling in blanks where no match is found. It seems to be but I have yet to find anything concrete on what to actually DO.
For example, if I have a group that looks like this in the spreadsheet:
SKU1 | 0 | SKU1 | 2
SKU2 | 2 | SKU3 | 2
SKU3 | 3 | SKU8 | 5
I need it to look like this:
SKU1 | 0 | SKU1 | 2
SKU2 | 2 |      |
SKU3 | 3 | SKU3 | 2
     |   | SKU8 | 5
I checked several Excel forums and several links here. I checked these: Compare columns of unequal length for matches and differences, http://www.excelforum.com/excel-formulas-and-functions/877911-sort-and-match-uneven-columns-of-data.html, http://www.excelforum.com/excel-general/608313-sorting-with-unequal-cell-size.html. This link was close: https://superuser.com/questions/307429/how-do-i-line-up-two-sets-of-data-in-excel. However, I don't think it quite achieves my goal, but it's possible I'm just not sure of the information given (I'm no Excel pro, which is why I'm asking).
If I need to provide the spreadsheet for further context, I can do so.
This seems to be a common type of question, but I have yet to find anything solid that fully explains it. If I follow the information right, "VLOOKUP" is what I want, but I don't fully understand it.
[EDIT] Forgot to mention, I do need both lists as I need to identify the differences in quantity between each list. That, I think I can do already, but I need the list sorted as desired for better comparison.
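For reference, the alignment described above amounts to a full outer join on the SKU. A minimal sketch in Python, using the sample data from the question (the `store` and `web` dictionaries and the `align` helper are names made up for this example), might look like this:

```python
# Sample data from the question: SKU -> on-hand quantity.
store = {"SKU1": 0, "SKU2": 2, "SKU3": 3}
web = {"SKU1": 2, "SKU3": 2, "SKU8": 5}


def align(left, right):
    """Line up two SKU lists, leaving blanks where a SKU is missing on one side."""
    rows = []
    for sku in sorted(set(left) | set(right)):
        rows.append((
            sku if sku in left else "",
            left.get(sku, ""),
            sku if sku in right else "",
            right.get(sku, ""),
        ))
    return rows


aligned = align(store, web)
for row in aligned:
    print(row)
```

Note that `sorted` compares the SKUs as text, so a SKU such as SKU10 would sort before SKU2; real SKU formats may need a different sort key.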

Searching in Excel for certain values, if found give text from cell to the left of where we found the value

First let me explain what I want to achieve.
I currently have an Excel like this:
Names | Standards
James | Standard 1
James | Standard 2
James | Standard 3
Francis | Standard 1
Francis | Standard 2
Francis | Standard 3
Leon | Standard 2
Leon | Standard 3
Peter | Standard 2
Michael | Standard 3
And I want to create something like this:
Standard | Name 1 | Name 2 | Name 3 | Name 4
Standard 1 | James | Francis | |
Standard 2 | James | Francis | Leon | Peter
Standard 3 | James | Francis | Leon | Michael
My real Excel has more than 300 standards, so I would like to automate this using Excel Formula. I know this is possible, but I haven't used Excel in a while, so I could use a push in the right direction.
Couple of things I need (I think):
I need to count how many times people in the names column mention a standard, so that I know I need 2 names for standard 1 and 4 for standard 3. I think I can do this with the COUNTIF function.
I need to find the location of the standards. I think I can do this with the MATCH function, which gives the location of the first match in my original sheet. By sorting my original sheet A-Z and combining this with the COUNTIF result, I know where all the matches are (first match + COUNTIF - 1 = location of the last match, and everything in between is also that standard).
For the first name that mentioned a standard, I will reference the cell to the left of the first match (because the names are in the cell to the left of the standard I found). For the second name I will reference the cell to the left of the cell below the first match. I keep doing this until I have as many names as COUNTIF reported. So I need an IF statement that makes sure that if 2 people mention standard 1, that row gets only 2 names and the remaining 2 cells contain "".
How will I reference the cells? By another IF statement that uses the technique from "Excel Reference To Current Cell". Correct me if I am wrong, but can't I then just say THIS.CELL = the cell location I found (probably should use INDIRECT here?).
This is just me brainstorming, but I would love to know if people have any other ideas for my problem or have some feedback for my current plan.
An important thing to mention is that I want to do this using Excel formulas. I do realise that this isn't always the best approach, but VBA is not an option at the moment. I am also not worried about performance, because I think I'll just copy all the values once the formulas have found all the names.
Thanks in advance!
Depending on how you want to have the layout, I think you should use a pivot table. Drag the 'Standards' and 'Names' fields to the 'rows' data box and then right-click on a standard, click 'Field Settings' - 'Layout and Print' - 'Show item labels in tabular form'.
If you definitely need the data in the format in your question, I would edit the pivot table by dragging the 'Names' field to the 'columns' data box. Then drag the 'Standards' field from the field list above a second time and duplicate it in the 'values' box.
In the space underneath the pivot table, use an IF formula to only copy the name if there is a 1. This kind of approach will obviously be quite fragile, so if you can make do with the first approach, I think you will run into fewer problems in the future.
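For comparison, the reshaping the question asks for is a group-by followed by padding. The sketch below does it in Python on the sample data, outside Excel, so it sidesteps the formula-only constraint in the question and is meant only to show the logic. The variable names are made up for the example.

```python
rows = [
    ("James", "Standard 1"), ("James", "Standard 2"), ("James", "Standard 3"),
    ("Francis", "Standard 1"), ("Francis", "Standard 2"), ("Francis", "Standard 3"),
    ("Leon", "Standard 2"), ("Leon", "Standard 3"),
    ("Peter", "Standard 2"), ("Michael", "Standard 3"),
]

# Collect the names for each standard, keeping the order in which they appear.
by_standard = {}
for name, standard in rows:
    by_standard.setdefault(standard, []).append(name)

# Pad every row with empty strings up to the longest list of names.
width = max(len(names) for names in by_standard.values())
table = [
    [standard] + names + [""] * (width - len(names))
    for standard, names in sorted(by_standard.items())
]

for line in table:
    print(" | ".join(line))
```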

Rank the top 5 entries in different criteria

I have a table in which I want to find the top X people in each of the different groups.
Unique Names | Number | Group
a | 30 | 1
b | 4 | 2
c | 19 | 3
d | 40 | 2
e | 1 | 1
f | 9 | 2
g | 15 | 3
I've ranked the top 5 people by number by using =index($A$2:$A$8,match(large($B$2:$B$8,1),$B$2:$B$8,0)). The 1 in the LARGE function I linked to a ranked range so that when I dragged down it changed up the number.
What I would like to do next is rank the top x number of people in each group. So top 3 in group 1.
I tried =index($A$2:$A$8,match("1"&large($B$2:$B$8,1),$C$2:$C$8&$B$2:$B$8,0)) but it didn't seem to work.
Thanks
EDIT: After looking at the answers below I have realised why they are not working for me. The actual data I want to use the formula with contain duplicate numbers. I have adjusted the example data to show this. The problem is that if there are duplicate numbers, the formula returns both of the names even if one of them is not in the group.
Unique Names | Number | Group
a | 30 | 1
b | 30 | 2
c | 19 | 3
d | 40 | 2
e | 1 | 1
f | 30 | 2
g | 15 | 3
Proof of Concept
Use the following formula in the example above in cell F2 and copy down and to the right as needed.
=IFERROR(INDEX($A$2:$A$8,MATCH(AGGREGATE(14,6,($C$2:$C$8=F$1)*($B$2:$B$8),ROW($A2)-1),$B$2:$B$8,0)),"")
In the header row, provide the group numbers, or come up with a formula to increment and reset the group number as you copy down, based on the X number in your question.
Explanation:
Unlike the LARGE function, the AGGREGATE function handles arrays without needing to be entered with CSE (Ctrl+Shift+Enter). As such, we can add criteria to what we want to use. In this case only one criterion was used, the group number. In the formula it is the following part:
($C$2:$C$8=F$1)
If there were multiple criteria, we would use a + operator as an OR or a * operator as an AND.
The 6 option in the AGGREGATE function allows us to ignore errors. This is useful when trying to get the smallest values (function 15, SMALL) rather than the largest. It is also useful for dealing with other information that may cause errors that do not need to be worried about.
As this is technically an array operation, avoid using full column/row references, as they can bog down your system.
The basics of what the overall formula is doing is building a list of the numbers that match the group number you are interested in. After filtering your numbers, it determines which is the largest, second largest, etc. according to the row you have copied down to. It then determines which row the nth largest number occurs in through the MATCH function, and finally it returns the name corresponding to that row with the INDEX function.
Building on all the other great answers.
Because you have the possibilities of duplicate values in each group we need to do this with two formulas.
First we need to get the numbers in order. I used the Aggregate, but this could be done with the array LARGE(IF()) also:
=IFERROR(AGGREGATE(14,6,$B$2:$B$8/($C$2:$C$8=E$1),ROW(1:1)),"")
Then, with those numbers in order, we can use a modified version of @ForwardEd's formula, with COUNTIF() ensuring that we get the correct name in return.
=IFERROR(INDEX($A$2:$A$8,AGGREGATE(15,6,(ROW($B$2:$B$8)-ROW($B$2)+1)/(($C$2:$C$8=F$1)*($B$2:$B$8=E3)),COUNTIF(E$2:E2,E3)+1)),"")
This will count the number in the results returned and then bring in the correct name.
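The logic of the two formulas (filter by group, order by number, then step through duplicates one at a time) can be sketched in Python as follows. This is an illustration of the idea rather than a translation of the formulas; `top_n` is a name made up for the example, and the data is the adjusted sample from the question's edit.

```python
# (name, number, group) rows from the adjusted example, including duplicate numbers.
data = [
    ("a", 30, 1), ("b", 30, 2), ("c", 19, 3), ("d", 40, 2),
    ("e", 1, 1), ("f", 30, 2), ("g", 15, 3),
]


def top_n(rows, group, n):
    """Return the n highest (name, number) pairs within one group."""
    # Filter first, so equal numbers in other groups can never be picked up.
    members = [(name, number) for name, number, g in rows if g == group]
    # Python's sort is stable, so tied numbers keep their original row order.
    members.sort(key=lambda item: item[1], reverse=True)
    return members[:n]


print(top_n(data, 2, 3))
print(top_n(data, 1, 3))
```

Filtering before ranking is what prevents the name "a" (30 in group 1) from appearing among the results for group 2.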
You could also solve this with array formulas - to filter a group whose name is stored in E1, your code
=INDEX($A$2:$A$8,MATCH(LARGE($B$2:$B$8,1),$B$2:$B$8,0))
would then be adapted to
=INDEX($A$2:$A$8,MATCH(LARGE(IF($C$2:$C$8<>E1,-1,$B$2:$B$8),1),$B$2:$B$8,0))
Note: After entering an array formula, you have to press CTRL+SHIFT+ENTER.
Thank you to everyone who offered help but for some reason none of your methods worked for me, which I am sure was to do with the quality of my data. I used an alternate method in the end which is slightly convoluted but seemed to work.
=IF($C2="1",RANK($B2,$B$2:$B$8,1)+ROW()/10000,-1)
Essentially using the rank function and adding a fraction to separate out duplicate values.

Excel VLOOKUP not finding correct row

I have the following table of two columns:
102-6956821-1091413 1
115-8766130-0234619 2
109-8688911-2954602 3
109-7731824-8641056 4
If I put in the following VLOOKUP:
=+VLOOKUP(B2,B$2:C$5,2)
I get the result of:
1
2
1
1
If I change it to =+VLOOKUP(B2,B$2:C$5,2, FALSE) I get the expected:
1
2
3
4
But why is this? There is an exact match available so why does it need to approximate? If it is, why is it generating the numbers it is? How is it reducing that text value to a number that it is approximating from? Thanks!
Transfer from comment for the sake of an answer:
If your search list (column B) were sorted, you would see the results you expect anyway (though in a different order). For speed, VLOOKUP uses a binary search when the fourth argument is omitted or TRUE, and an ordered list is necessary for that to give meaningful results. The search lands on exact matches only in the first half of the unsorted list (hence the results 1 and 2 for the first two rows are correct, but the 1 returned for each of the last two rows is not).
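To see how a binary search can go wrong on unsorted keys, here is a simplified model in Python. It is an assumption-laden sketch, not Excel's actual implementation: it halves the range, stops if it lands on an exact match, and otherwise falls back to the last position it saw with a smaller key. With the four keys from the question it happens to give the same results as the approximate-match VLOOKUP.

```python
keys = [
    "102-6956821-1091413",
    "115-8766130-0234619",
    "109-8688911-2954602",
    "109-7731824-8641056",
]
values = [1, 2, 3, 4]


def approx_lookup(keys, values, target):
    """Simplified binary search that assumes (wrongly, here) that keys are sorted."""
    lo, hi = 0, len(keys) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if keys[mid] == target:
            return values[mid]
        if keys[mid] < target:
            best = mid          # remember the last smaller key seen
            lo = mid + 1
        else:
            hi = mid - 1
    return None if best is None else values[best]


def exact_lookup(keys, values, target):
    """Linear scan, comparable to passing FALSE as the fourth argument."""
    for key, value in zip(keys, values):
        if key == target:
            return value
    return None


print([approx_lookup(keys, values, k) for k in keys])
print([exact_lookup(keys, values, k) for k in keys])
```

For the third and fourth keys the first probe lands on "115-...", which is larger, so the search discards the second half of the list where the real matches are.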
