Find the differences between 2 Excel worksheets? - excel

I have two excel files with the same structure: they both have 1 column with data. One has 800 records and the other has 805 records, but I am not sure which of the 5 in the 805 set are not in the 800 set. Can I find this out using Excel?

VLOOKUP is your friend!
Position your column, one value per row, in column A of each spreadsheet.
In column B of the larger sheet, type
=VLOOKUP(A1,'[Book2.xlsb]SheetName'!$A:$A,1,FALSE)
Then copy the formula down as far as your column of data runs.
Where the result of the formula is #N/A, that value is not in the other worksheet.

It might seem like a hack, but I personally prefer copying the cells as text (or exporting as a CSV) into Winmerge or any other diff tool. Assuming the two sheets contain mostly identical data, Winmerge will show the differences immediately.
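The same text-diff idea can be sketched without a GUI tool, using Python's standard difflib module. The two lists below are made-up stand-ins for the exported columns (in practice they would be read from the CSV files), and column_diff is a hypothetical helper name.

```python
import difflib

def column_diff(old_values, new_values):
    """Return (added, removed): lines only in new_values / only in old_values."""
    added, removed = [], []
    # ndiff prefixes each output line with "+ ", "- ", "  " or "? "
    for line in difflib.ndiff(old_values, new_values):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed

# Hypothetical data standing in for the two exported columns
small = ["alpha", "bravo", "delta"]
large = ["alpha", "bravo", "charlie", "delta", "echo"]
added, removed = column_diff(small, large)
print("only in larger file:", added)
print("only in smaller file:", removed)
```

Like any text diff, this is order-sensitive, so sorting both columns the same way first is probably a good idea.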

LibreOffice provides a Workbook Compare feature: Edit -> Compare Document

Easy way: Use a 3rd sheet to check.
Say you want to find differences between Sheet 1 and Sheet 2.
Go to Sheet 3, cell A1, enter
=IF(Sheet2!A1<>Sheet1!A1,"difference","")
Then select all cells of Sheet 3, fill down, fill right.
The cells that are different between Sheet 1 and Sheet 2 will now say "difference" in Sheet 3.
You could adjust the formula to show the actual values that were different.
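For anyone who prefers to script it, here is a rough Python sketch of the same cell-by-cell check. It assumes each sheet has already been loaded as a list of rows (the loading step is left out), and diff_cells and column_letter are hypothetical helper names.

```python
from itertools import zip_longest

def column_letter(index):
    """Convert a 0-based column index to an Excel-style letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters

def diff_cells(sheet1, sheet2):
    """Return (address, value1, value2) for every cell that differs."""
    differences = []
    # Missing rows and cells are treated as empty, like blank cells in a sheet
    for r, (row1, row2) in enumerate(zip_longest(sheet1, sheet2, fillvalue=[])):
        for c, (v1, v2) in enumerate(zip_longest(row1, row2, fillvalue="")):
            if v1 != v2:
                differences.append((f"{column_letter(c)}{r + 1}", v1, v2))
    return differences

# Made-up sample data
sheet1 = [["a", 1], ["b", 2]]
sheet2 = [["a", 1], ["b", 3], ["c", 4]]
for address, v1, v2 in diff_cells(sheet1, sheet2):
    print(address, repr(v1), "vs", repr(v2))
```

Returning the two values alongside the address covers the "show the actual values" variation as well.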

Excel has this built in if you have an Excel version with the Inquire add-in.
This page from the Office website describes how to enable the add-in, if it isn't activated, and how to compare two workbooks, among other things.
The comparison shows structural differences as well as editorial and many other changes.
http://office.microsoft.com/en-us/excel-help/what-you-can-do-with-spreadsheet-inquire-HA102835926.aspx

You should try this free online tool: www.cloudyexcel.com/compare-excel/
It works well most of the time, although sometimes the results are a little off.
It also gives good visual output.
You can also download the results in Excel format (you need to sign up for that).

COUNTIF works well for quick difference-checking. And it's easier to remember and simpler to work with than VLOOKUP.
=COUNTIF([Book1]Sheet1!$A:$A, A1)
will give you a column showing 1 if there's a match and zero if there's no match (with the bonus of showing >1 for duplicates within the list itself).
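If the columns are already in Python, a rough equivalent of the COUNTIF column can be sketched with collections.Counter. The lists and the count_matches name below are made up for illustration. Unlike COUNTIF, this comparison is case-sensitive.

```python
from collections import Counter

def count_matches(values, reference):
    """For each value, return how many times it occurs in reference."""
    counts = Counter(reference)
    # Counter returns 0 for keys it has never seen, just like COUNTIF
    return [(value, counts[value]) for value in values]

# Made-up sample data: the smaller list has a duplicate "b"
small = ["a", "b", "b", "d"]
large = ["a", "b", "c", "d", "e"]
result = count_matches(large, small)
missing = [value for value, n in result if n == 0]
print(result)
print("not in the smaller list:", missing)
```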

If you have Microsoft Office Professional Plus 2013, you can use Microsoft Spreadsheet Compare to run a report on the differences between two workbooks.
Launch Spreadsheet Compare:
In Windows 7: On the Windows Start menu, under Office 2013 Tools, click Spreadsheet Compare.
In Windows 8: On the Start screen, click Spreadsheet Compare. If you do not see a Spreadsheet Compare tile, begin typing the words Spreadsheet Compare, and then select its tile.
Compare two Excel workbooks:
Click Home > Compare Files.
a. Click the blue folder icon next to the Compare box to browse to the location of the earlier version of your workbook. (In addition to files saved on your computer or on a network, you can enter a web address to a site where your workbooks are saved.)
b. Click the green folder icon next to the To box to browse to the location of the workbook that you want to compare to the earlier version, and then click OK. (TIP You can compare two files with the same name if they're saved in different folders.)
In the left pane, choose the options you want to see in the results of the workbook comparison by checking or unchecking the options, such as Formulas, Macros, or Cell Format. Or, just Select All.
Reference:
https://support.office.com/en-us/article/Basic-tasks-in-Spreadsheet-Compare-f2b20af8-a6d3-4780-8011-f15b3229f5d8

I think your best option is a freeware app called Compare IT! .... absolutely brilliant utility and dead easy to use. http://www.grigsoft.com/wincmp3.htm

Use the vlookup function.
Put both sets of data in the same excel file, on different sheets. Then, in the column next to the 805 row set (which I assume is on sheet2), enter
=if(isna(vlookup(A1, Sheet1!$A$1:$A$800, 1, false)), 0, 1)
The column will contain 0 for values that are not found in the other sheet, and 1 for values that are. You can sort the sheet to find all the missing values.

Since you are using Excel, you can use Spreadsheet Compare from Microsoft. It is available from Office 2013. Yes, I know this question is more than 6 years old, but maybe someone needs this information today.

The Notepad++ compare plugin works perfectly for this. Just save your sheets as .csv files and compare them in Notepad++. Notepad++ gives you a nice visual diff.

Maybe this reply is too late, but I hope it will help someone looking for a solution.
What I did was save both Excel files as CSV files and compare them with Windiff.

ExcelDiff exports a HTML report in a Divided (Side-by-side) or Merged (Overlay) view highlighting the differences as well as the row and column.

I used Excel Compare. It is payware, but they do have a 15 day trial. It will report amended rows, added rows, and deleted rows. It will match based on the worksheet name (as an option):
http://www.formulasoft.com/excel-compare.html

Use conditional formatting to highlight the differences in excel.
Here's an example.

With just one column of data in each to compare a PivotTable may provide much more information. In the image below ColumnA is in Sheet1 (with a copy in Sheet2 for the sake of the image) and ColumnC in Sheet2. In each sheet a source flag has been added (Columns B and D in the image). The PT has been created with multiple consolidation ranges (Sheet1!$A$1:$B$15 and Sheet2!$C$1:$D$10):
The left hand numeric column shows what is present in Sheet1 (including q twice) and the right what is in Sheet2 (again with duplicates – of c and d). d-l are in Sheet1 but not Sheet2 and w and z are in Sheet2 (excluding those there just for the image) but not Sheet1. Enable Show grand totals for columns and control totals will appear.

I found this command line utility that doesn't show the GUI output but gave me what I needed: https://github.com/na-ka-na/ExcelCompare
Sample output (taken from the project's readme file):
> excel_cmp xxx.xlsx yyy.xlsx
DIFF Cell at Sheet1!A1 => 'a' v/s 'aa'
EXTRA Cell in WB1 Sheet1!B1 => 'cc'
DIFF Cell at Sheet1!D4 => '4.0' v/s '14.0'
EXTRA Cell in WB2 Sheet1!J10 => 'j'
EXTRA Cell in WB1 Sheet1!K11 => 'k'
EXTRA Cell in WB1 Sheet2!A1 => 'abc'
EXTRA Cell in WB2 Sheet3!A1 => 'haha'
----------------- DIFF -------------------
Sheets: [Sheet1]
Rows: [1, 4]
Cols: [A, D]
----------------- EXTRA WB1 -------------------
Sheets: [Sheet1, Sheet2]
Rows: [1, 11]
Cols: [B, K, A]
----------------- EXTRA WB2 -------------------
Sheets: [Sheet1, Sheet3]
Rows: [10, 1]
Cols: [J, A]
-----------------------------------------
Excel files xxx.xlsx and yyy.xlsx differ

Tried to find a tool that will help to extract only the different sheets with the cell difference highlighted. Could not find any, so ended up writing one for myself. I hope this helps someone who is looking for similar solution. It takes care of left/right unique sheets, identical/different size sheets.
import pandas as pd
import xlsxwriter
import numpy as np
from openpyxl import load_workbook
# Get a complete list of sheets from both WorkBook
BOOK1 = "Book_1.xlsx"
BOOK2 = "Book_2.xlsx"
xlBook1 = pd.ExcelFile(BOOK1)
sheetsBook1 = xlBook1.sheet_names
xlBook2 = pd.ExcelFile(BOOK2)
sheetsBook2 = xlBook2.sheet_names
sheets = list(set(sheetsBook1 + sheetsBook2))
with pd.ExcelWriter('Difference.xlsx', engine='xlsxwriter', mode='w') as writer:
    for sheet in sheets:
        print (sheet)
        book1 = None
        book2 = None
        book1Exists = True
        book2Exists = True
        try:
            book1 = pd.read_excel(BOOK1,sheet_name=sheet,header=None,index_col=False).fillna(' ')
        except ValueError as ve:
            book1Exists = False
        try:
            book2 = pd.read_excel(BOOK2,sheet_name=sheet,header=None,index_col=False).fillna(' ')
        except ValueError as ve:
            book2Exists = False
        # Case 1: Both sheet exist and they are identical size
        if ( (( (book1Exists == True) and (book2Exists == True) )) and
             ( (len(book1) == len(book2)) and (len(book1.columns) == len(book2.columns)) )):
            comparevalues = book1.values == book2.values
            if False in comparevalues:
                rows,cols = np.where(comparevalues==False)
                for item in zip(rows,cols):
                    book1.iloc[item[0],item[1]] = ' {} --> {} '.format(book1.iloc[item[0], item[1]], book2.iloc[item[0],item[1]])
                book1.to_excel(writer,sheet_name=sheet,index=False,header=False)
                # Get the xlsxwriter workbook and worksheet objects.
                workbook = writer.book
                worksheet = writer.sheets[sheet]
                # Add a format. Light red fill with dark red text.
                format1 = workbook.add_format({'bg_color': '#FFC7CE',
                                               'font_color': '#9C0006'})
                # Apply a conditional format to the cell range.
                worksheet.conditional_format('A1:AZ100',{'type': 'text',
                                                         'criteria': 'containing',
                                                         'value': '-->',
                                                         'format': format1})
        # Case 2: Left unique case
        elif (book1Exists == False):
            book2.to_excel(writer,sheet_name=sheet+" B2U",index=False,header=False)
        # Case 3: Right unique case
        elif (book2Exists == False):
            book1.to_excel(writer,sheet_name=sheet+" B1U",index=False,header=False)
        # Case 4: Both exist but different size
        elif (( (book1Exists == True) and (book2Exists == True) ) and
              ( (len(book1) != len(book2)) or (len(book1.columns) != len(book2.columns)) )):
            if (book1.size > book2.size):
                book1.to_excel(writer,sheet_name=sheet+" SD",index=False,header=False)
            elif (book2.size > book1.size):
                book2.to_excel(writer,sheet_name=sheet+" SD",index=False,header=False)

It is not clear from your question whether you want to list the values of the larger set that are missing from the smaller one, or to flag, for each value of the larger set, whether it is present in the smaller one. Here is a solution for both cases, with the larger set in B2:B11 and the smaller one in A2:A6:
Values in the larger set that are not in the smaller set
=FILTER(B2:B11,ISNA(MATCH(B2:B11,A2:A6,0)))
Check whether each value in the larger set is in the smaller set
=IF(ISNUMBER(MATCH(B2:B11,A2:A6,0)), TRUE, FALSE)

Excel Overlay will put both spreadsheets on top of each other (overlay them) and highlight the differences.
http://download.cnet.com/Excel-Overlay/3000-2077_4-10963782.html?tag=mncol

Related

Reading an Excel file with united cells in Python

I have an excel table of the following type (the problem described below is driven by the presence of the united cells).
I am using read_excel from pandas to read it.
What I want: I would like to use the values in the first column as an index, and to have the values in the third column combined in one cell, e.g. like here.
What I get from directly applying read_excel can be seen here.
If needed: please see the code used to read the file below (I am reading it from google drive in google colab):
path = '/content/drive/MyDrive/ExampleFile.xlsx'
pd.read_excel(path, header = 0, index_col = 0)
Could you please help?
Please let me know if anything in the question is unclear.
Here is one way to accomplish it. I created an xls similar to yours; the first column had a heading of sno.
# fill the null values with values from previous rows
df=df.ffill()
# combine the rows where class is the same and create a new column
df=df.assign(comb=df.groupby(['class'])['type'].transform(lambda x: ','.join(x)))
# drop the duplicated rows
df2=df.drop_duplicates(subset=['class','comb'])[['class','comb']]
class comb
0 fruit apple,orange
2 toys car,truck,train
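To make the two steps above easier to follow, here is a plain-Python sketch of what the forward fill and the group-and-join are doing. It is not pandas code. The rows are made-up stand-ins for what a sheet with merged cells looks like once read (only the first row of each merged block carries the value), and combine_merged_rows is a hypothetical name.

```python
def combine_merged_rows(rows):
    """Forward-fill blank keys, then join the values that share a key."""
    combined = {}
    current_key = None
    for key, value in rows:
        if key is not None:
            # A new merged block starts here; later blank rows inherit this key
            current_key = key
        combined.setdefault(current_key, []).append(value)
    return {key: ",".join(values) for key, values in combined.items()}

# Made-up rows: None marks the cells left blank by the merge
rows = [
    ("fruit", "apple"),
    (None, "orange"),
    ("toys", "car"),
    (None, "truck"),
    (None, "train"),
]
print(combine_merged_rows(rows))
```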

Making a vector out of excel columns using python

everyone...
I just started on python a couple of days ago because I require to handle some excel data in order to automatically update the data of certain cells from one file into another.
However, I'm kind of stuck since I have barely programmed before, and it's my first time using python as well, but my job required me to find a solution and I'm trying to make it work even though it's not my field of expertise.
I used the "xlrd library", imported my file and managed to print the columns I'm needing... However, I can't find a way to put those columns into a matrix in order to handle the data like this:
Matrix =[DataColumnA DataColumnG DataColumnH] in the size [nrows x 3]
As for now, I have 3 different outputs for the 3 different columns I need, but I'm trying to join them together into one big matrix.
So far my code looks like this:
import xlrd
workbook = xlrd.open_workbook("190219_serviciosWRAmanualV5.xls");
worksheet = workbook.sheet_by_name("ServiciosDWDM");
workbook2 = xlrd.open_workbook("Potencia2.xlsx");
worksheet2 = workbook2.sheet_by_name("Hoja1");
filas = worksheet.nrows
filas2 = worksheet2.nrows
columnas = worksheet.ncols
for row in range (2, filas):
    Equipo_A = worksheet.cell(row,12).value
    Client_A = worksheet.cell(row,13).value
    Line_A = worksheet.cell(row, 14).value
    print (Equipo_A, Line_A, Client_A)
So I have only gotten, as mentioned above, the data in the columns which is what I'm printing which you can see.
What I'm trying to do, or the main thing I need to do is to read the cell of the first row in Column A and look for it in the other excel file... if the names match, I would have to validate that for the same row (in file 1) the data in both the ColumnG and ColumnH is the same as the data in the second file.
If they match I would have to update Column J in the first file with the data from the second file.
My other approach is to retrieve the value of the cell in ColumnA and look for it in the column A of the second file, then I would make an if conditional to see if ColumnsG and H are equal to Column C of 2nd file and so on...
The thing here is, I have no idea how to pin point the position of the cell and extract the data to make the conditional for this second approach.
I'm not sure if by making that matrix my approach is okay or if the second way is better, so any suggestion would be absolutely appreciated.
Thank you in advance!
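One possible shape for the second approach (look up the name, check the two columns, then update) is sketched below in plain Python. It assumes each file has already been read into a list of dicts keyed by column letter, and that the second file uses the same letters A, G, H and J. That layout, the sample values and the update_from_second name are all assumptions for illustration, not taken from the real files.

```python
def update_from_second(file1_rows, file2_rows):
    """Copy column J from file 2 into file 1 where A, G and H all match.

    Returns the number of rows updated in file1_rows.
    """
    # Index the second file by its column A value for direct lookup
    lookup = {row["A"]: row for row in file2_rows}
    updated = 0
    for row in file1_rows:
        match = lookup.get(row["A"])
        if match is None:
            continue  # name not present in the second file
        if row["G"] == match["G"] and row["H"] == match["H"]:
            row["J"] = match["J"]
            updated += 1
    return updated

# Made-up sample rows
file1 = [
    {"A": "node1", "G": 10, "H": 20, "J": None},
    {"A": "node2", "G": 11, "H": 21, "J": None},
    {"A": "node3", "G": 12, "H": 22, "J": None},
]
file2 = [
    {"A": "node1", "G": 10, "H": 20, "J": -3.5},
    {"A": "node2", "G": 99, "H": 21, "J": -4.0},
]
print(update_from_second(file1, file2), "row(s) updated")
```

Building the dictionary once avoids rescanning the second file for every row, and there is no need to assemble a separate matrix first.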

Openpyxl to check for keywords, then modify next to cells to contain those keywords and total found

I'm using python 3.x and openpyxl to parse an excel .xlsx file.
For each row, I check a column (C) to see if any of those keywords match.
If so, I add them to a separate list variable and also determine how many keywords were matched.
I then want to add the actual keywords into the next cell, and the total of keywords into the cell after. This is where I am having trouble, actually writing the results.
contents of the keywords.txt and results.xlsx file
here
import openpyxl
# Here I read a keywords.txt file and input them into a keywords variable
# I throwaway the first line to prevent a mismatch due to the unicode BOM
with open("keywords.txt") as f:
    f.readline()
    keywords = [line.rstrip("\n") for line in f]
# Load the workbook
wb = openpyxl.load_workbook("results.xlsx")
ws = wb.get_sheet_by_name("Sheet")
# Iterate through every row, only looking in column C for the keyword match.
for row in ws.iter_rows("C{}:E{}".format(ws.min_row, ws.max_row)):
    # if there's a match, add to the keywords_found list
    keywords_found = [key for key in keywords if key in row[0].value]
    # if any keywords found, enter the keywords in column D
    # and how many keywords into column E
    if len(keywords_found):
        row[1].value = keywords_found
        row[2].value = len(keywords_found)
Now, I understand where I'm going wrong, in that ws.iter_rows(..) returns a tuple, which can't be modified. I figure I could use two for loops, one for each row and another for the columns in each row, but this test is a small example of a real-world scenario where the number of rows is in the tens of thousands.
I'm not quite sure which is the best way to go about this. Thank you in advance for any help that you can provide.
Use the ws['C'] and then the offset() method of the relevant cell.
Thanks Charlie for the offset() tip. I modified the code slightly and now it works a treat.
for row in ws.iter_rows("C{}:C{}"...)
    for cell in row:
        ....
        if len(keywords_found):
            cell.offset(0,1).value = str(keywords_found)
            cell.offset(0,2).value = str(len(keywords_found))

Create comma-delimited values in Excel (using PivotTable)?

Is there a way to generate comma-delimited values in Excel (optimally using a PivotTable)? Consider the following data:
Object Color
foo Red
foo Blue
bar Red
bar Blue
bar Green
baz Yellow
I'd like to get a table like the following:
Object Count of Color Colors
foo 2 Red,Blue
bar 3 Red,Blue,Green
baz 1 Yellow
Is this possible in Excel? The data is coming from a SQL query, so I could write a UDF with a recursive CTE to calculate, but this was for a single ad-hoc query, and I wanted a quick-and-dirty way to get the denormalized data. In the end, it's probably taken longer to post this than to write the UDF, but...
Here's a much simpler answer, adapted from this superuser answer (HT to @yioann for pointing it out and @F106dart for the original):
Assuming the data is in columns A (Category) and B (Value):
Create a new column (C), and name it "Values". Use this formula, starting in cell C2 and copying all the way down: =IF(A2=A1, C1&","&B2, B2)
Create a second new column (D), and name it "Count". Use this formula, starting in cell D2, and copying all the way down: =IF(A2=A1, D1+1, 1)
Create a third new column (E), and name it "Last Line?". Use this formula, starting in cell E2, and copying all of the way down: =A2<>A3
You can now hide column B (Value) and filter column E (Last Line?) for only the TRUE values.
In summary:
   A         B      C                        D                    E
 +--------- ------ ------------------------ -------------------- ----------
1| Category  Value  Values                   Count                Last Line?
2| foo       Red    =IF(A2=A1,C1&","&B2,B2)  =IF(A2=A1, D1+1, 1)  =A2<>A3
3| foo       Blue   =IF(A3=A2,C2&","&B3,B3)  =IF(A3=A2, D2+1, 1)  =A3<>A4
etc.
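For readers who want to check the logic of the three helper columns, here is a small Python sketch that walks the rows the same way. It assumes the rows are already sorted by category, as the formulas do, and concat_by_category is a hypothetical name.

```python
def concat_by_category(rows):
    """Emulate the Values / Count / Last Line? helper columns for sorted rows."""
    output = []
    for i, (category, value) in enumerate(rows):
        same_as_previous = i > 0 and rows[i - 1][0] == category
        if same_as_previous:
            values = output[-1][1] + "," + value  # column C: running list
            count = output[-1][2] + 1             # column D: running count
        else:
            values, count = value, 1
        # column E: TRUE when the next row starts a different category
        is_last = i + 1 == len(rows) or rows[i + 1][0] != category
        output.append((category, values, count, is_last))
    # Keep only the last line of each category, as the filter on column E does
    return [(cat, count, values) for cat, values, count, is_last in output if is_last]

# The sample data from the question
data = [
    ("foo", "Red"), ("foo", "Blue"),
    ("bar", "Red"), ("bar", "Blue"), ("bar", "Green"),
    ("baz", "Yellow"),
]
for line in concat_by_category(data):
    print(line)
```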
Yes, you would be much better off using the tools of whatever RDBMS you're running (MS SQL, MySQL, etc.).
Such a pivot table is possible in Excel. But, only if you write a cumbersome VBA module -- which I don't recommend.
However, the task is simpler in MS Access -- which usually comes bundled with Excel. Microsoft makes it "easy" to link Access and Excel and to use the former to run queries on the latter.
So, given the spreadsheet cells as stated:
For best results, sort the table by Object and then by Color.
Make sure the spreadsheet is saved.
Open up MS Access.
Select File --> Open (Ctrl+O)
Under Files of type, select Microsoft Excel
Navigate to and choose your existing spreadsheet.
Choose the worksheet or named range that contains your table.
Give the linked table the name MyPivot.
Open the Visual Basic Editor... Tools --> Macro --> Visual Basic Editor (Alt+F11)
Insert a module and paste in this UDF:
'Concat returns a comma-separated list of items
Public Function Concat (CategoryCol As String, _
                        ItemCol As String) As String
    Static LastCategory As String
    Static ItemList As String
    If CategoryCol = LastCategory Then
        ItemList = ItemList & ", " & ItemCol
    Else
        LastCategory = CategoryCol
        ItemList = ItemCol
    End If
    Concat = ItemList
End Function
Save the project and close the VB editor
Under Queries, Create a new query in design view.
Switch to the SQL View.
Paste in this SQL:
SELECT
    Object,
    COUNT (Color) AS [Count of Color],
    LAST (Concat (Object, Color)) AS [List 'O Colors]
FROM
    MyPivot
GROUP BY
    Object
Run the query (Press the red exclamation mark or just select the Datasheet View).
Voilà, done in 15 easy steps!   ;)
Results:
Object Count of Color List 'O Colors
bar 3 Blue, Green, Red
baz 1 Yellow
foo 2 Blue, Red
An even easier way is to add the data to the data model when you create the pivot table and then use a "measure" (called "Colours") as follows:
=CONCATENATEX(Table1,[Color],", ")
Then add the "Colours" field to the values portion of the pivot.

Multiply numbers in Excel or LibreOffice cell contents by a constant when they are mixed with text?

I have a long series of cells written like this (example text):
Example Number (3502, 456)
How would I multiply the numbers by 4 without having to delete the text?
I also have cells in the format [sic below]:
Example Number (3502,456) (4560,250) (2345,223)
et cetera, there are on average ten parentheses per text string.
Occasionally, the text might also be only one word long, e.g.
Example (3205, 456)
or
Example (3205,456) (4560,250) (2345,223)
et cetera.
(all above is [sic]).
As a sort of newbie to Excel (well, really Libre Office Calc but it's essentially the same), how would I do this? I don't want to go through and manually multiply all the numbers myself. The number I want to multiply by is 4. I've tried just running a find-and-replace to replace all ,'s and )'s with *4's, but the program I need these numbers for can't evaluate expressions, it needs single numbers.
There are some 110+ items on each list I need to change, and just one math error on any of the three lists (!) and the program won't run correctly (I'm resizing an image, and the points I plotted on the image didn't scale up with it). I don't want to risk it.
It should be possible to do this with a macro but unless I'm mistaken LibreOffice macro code is quite different from Excel VBA.
However if you can afford to use several columns of your spreadsheet to figure the values out, you can do so using formulae. If cell A1 contains
Example Number (3502,456) (4560,250) (2345,223)
and B1 contains
=MID(A1,FIND("(",A1)+1,9999)
then this formula will return the 3502 as a number:
=NUMBERVALUE(LEFT(B1,FIND(",",B1)-1))
(9999 is chosen to be much larger than the likely length of any line, so the MID function will always return the whole of the rest of the text after the search character).
You should be able to combine MID and FIND functions in further cells to isolate the other numbers, assuming these are always found in the format (xxx,yyy) as per your example. Then you can use a final formula to rebuild the string from the multiplied numbers:
="Example Number (" & 4*C1 & "," & 4*E1 & ")"
and so on.
If your data has a variable number of numbers to find, some of your FIND functions may return a #VALUE error. You may need to use an IF function to exclude these, for example:
=IF(ISERROR(G1),"",G1)
would return the value of G1 if it contains data, but blank if it contains an error.
Here is a Python LibreOffice macro that does what you want. It assumes all of the values are in column A, and it writes the results to column B.
import re
def do_calculations():
    document = XSCRIPTCONTEXT.getDocument()
    sheet = document.getSheets().getByIndex(0)
    cellrange = sheet.getCellRangeByName("A1:A10000")
    row_tuples = cellrange.getDataArray()
    row = 1
    for row_tuple in row_tuples:
        if row_tuple:
            row = output_values(row, row_tuple[0], sheet)
def output_values(row, pairs_string, sheet):
    """Multiply pairs of values by 4 and output each pair to B column.
    :param row: the row number in the B column
    :param pairs_string: a string like "Example Number (123, 456) (789, 1011)"
    :param sheet: the current spreadsheet
    Returns the next row number in the B column.
    """
    pairs = re.findall(r'\([^)]+\)', pairs_string)
    for pair in pairs:
        match_obj = re.match(r'\((\d+),\s*(\d+)\)', pair)
        x, y = match_obj.groups()
        result = "(%d,%d)" % (int(x) * 4, int(y) * 4)
        cell = sheet.getCellRangeByName("B" + str(row))
        cell.setString(result)
        row += 1
    return row
# Functions that can be called from Tools -> Macros -> Run Macro.
g_exportedScripts = do_calculations,
Save the code to a text file, for example calc_multiply_numbers.py. Put it in Scripts/python in your LibreOffice user directory. On my Windows system it is C:\Users\JimStandard\AppData\Roaming\LibreOffice\4\user\Scripts\python. If the python directory doesn't exist yet, create it.
To run it, open the spreadsheet and go to Tools -> Macros -> Run Macro. Under My Macros, click calc_multiply_numbers and then press the Run button.
EDIT:
I don't think you need to worry about the JRE error. On my system I can uncheck "Use a Java runtime environment" in Tools -> Options -> LibreOffice -> Advanced, and it still works. I just click "No" when it asks if I want to enable the use of a JRE now, and then it runs my python macro.
The reason it is not showing up under My Macros is because python is not able to interpret the file correctly. To find the error, test it with python using the following steps (assuming Windows):
Open a command prompt, for example by pressing Win, typing cmd, and clicking "Command Prompt" from the start menu.
Type cd "path-to-libreoffice/program". On my 64-bit system this is cd "C:\Program Files (x86)\LibreOffice 5\program" I use the normal Windows File Explorer to find the exact path.
Type "python.exe python-script". On my system it is python.exe "C:\Users\JimStandard\AppData\Roaming\LibreOffice\4\user\Scripts\python\calc_multiply_numbers.py"
The python interpreter will give an error message about the problem. If you are not able to figure out the message, write it in the comments below and I will help you.
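If keeping the surrounding text in place matters, the core of the transformation can also be tried outside LibreOffice. This is a minimal standalone sketch, assuming the pairs always look like (x,y) with whole numbers. scale_pairs is a hypothetical name and not part of the macro above.

```python
import re

# Matches "(digits, digits)" with optional whitespace after the comma
PAIR = re.compile(r"\((\d+),\s*(\d+)\)")

def scale_pairs(text, factor=4):
    """Multiply both numbers in every "(x, y)" pair, leaving other text untouched."""
    def replace(match):
        x, y = (int(group) * factor for group in match.groups())
        return f"({x},{y})"
    return PAIR.sub(replace, text)

print(scale_pairs("Example Number (3502, 456)"))
# prints: Example Number (14008,1824)
print(scale_pairs("Example (3205,456) (4560,250) (2345,223)"))
# prints: Example (12820,1824) (18240,1000) (9380,892)
```

Note that any space after the comma is dropped in the output. Adjust the f-string if the target program needs it.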
