Comparing two Excel sheets and writing the output - python-3.x

I am writing Python code using xlrd and xlwt to compare two Excel sheets and write the output to a third sheet.
For example
Sheet 1
nativeEMSName
HR_MEWT_XX5906_TR_I_HR10001
HR_LOHN_5811X_T01_C_X_HO55001
HR_PHKL_XX6541_TR_I_HR10001
HR_RWRI_XX3608_TR_I_HR10001
HR_KTHL_XX6382_AR_I_HR50001
ABC
HR_KURU_XX3714_TR_I_HR10001
HR_RWRI_XX1142_TR_I_HR10001
HR_SAHU_SAHUW_B01_C_X_EX10001
HR_KTHL_XX3622_TR_I_HR10001
Sheet2
nativeEMSName id
HR_KURU_XX3714_TR_I_HR10001 66
HR_PHKL_XX6541_TR_I_HR10001 999
HR_MEWT_XX5906_TR_I_HR10001 2
HR_KTHL_XX6382_AR_I_HR50001 7777
HR_KTHL_XX3622_TR_I_HR10001 4
HR_SAHU_SAHUW_B01_C_X_EX10001 3
HR_LOHN_5811X_T01_C_X_HO55001 111
HR_RWRI_XX1142_TR_I_HR10001 55
HR_RWRI_XX3608_TR_I_HR10001 888
I look up each nativeEMSName from Sheet1 in Sheet2 and write the nativeEMSName and its corresponding id to sheet3.
This is the code I am using:
colns = 0
colnd = 0
for rowsr in range(sheet1.nrows):
    test = sheet1.cell(rowsr, colns).value
    for rowdr in range(sheet2.nrows):
        test1 = sheet2.cell(rowdr, colnd).value
        if test == test1:
            ID = sheet2.cell(rowdr, colnd + 1).value
            sheet3.write(rowsr, colns, ID)
            sheet3.write(rowsr, colnd + 1, test1)
            wb.save('test.xls')
            break
The challenge is that when both sheets have around 30k rows, the code takes too long to execute. I want to reduce the execution time.
Any help optimizing this code, or another way to get the output in less time, would be appreciated.

Your look-up is O(m × n), where m is the number of rows in sheet1 and n is the number of rows in sheet2.
I assume there are no duplicates. I think the following will work better. It is also easier to find bugs because the data is held in a map.
1. Read sheet1 into a list
sheet1_list = ['HR_MEWT_XX5906_TR_I_HR10001',
               'HR_LOHN_5811X_T01_C_X_HO55001',
               'HR_PHKL_XX6541_TR_I_HR10001',
               'HR_RWRI_XX3608_TR_I_HR10001']  # and so on
2. Read sheet2 (the look-up sheet) into a map
sheet2_map = {
    'HR_KURU_XX3714_TR_I_HR10001': '66',
    'HR_PHKL_XX6541_TR_I_HR10001': '999',
    'HR_MEWT_XX5906_TR_I_HR10001': '2',
    'HR_KTHL_XX6382_AR_I_HR50001': '7777',
    # and so on
}
3. Loop over the list and find each id
Here each individual look-up is O(1) and the total time is O(n), where n is the number of entries in sheet1, which reduces the running time.
sheet3_map = {}
for key in sheet1_list:
    print(key, sheet2_map[key])  # check by printing; raises KeyError if key is not in sheet2
    sheet3_map[key] = sheet2_map[key]  # inserts key: id, like {'HR_MEWT_XX5906_TR_I_HR10001': '2'}
4. Write sheet3_map to Excel, e.g. with a helper such as convert_map_to_excel(sheet3_map).
There are libraries for this, but it is easy to do yourself: use xlwt to write the .xls file. https://xlwt.readthedocs.io/en/latest/
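The steps above can be put together roughly as follows. This is only a sketch: the function name match_ids is made up for illustration, and the sample values are copied from the question rather than read from a workbook. In practice the two inputs would probably come from xlrd (for example via sheet.col_values and sheet.row_values), and the result would be written with xlwt, saving the workbook once after the loop rather than on every match.

```python
def match_ids(sheet1_names, sheet2_rows):
    """Return (name, id) pairs for every sheet1 name that exists in sheet2."""
    # Build the look-up map once: a single pass over sheet2.
    # If a name appears more than once in sheet2, the last id wins.
    id_by_name = {name: id_ for name, id_ in sheet2_rows}

    matched = []
    for name in sheet1_names:
        # dict.get returns None for names missing from sheet2 (e.g. 'ABC'),
        # so those rows are skipped instead of raising KeyError.
        id_ = id_by_name.get(name)
        if id_ is not None:
            matched.append((name, id_))
    return matched


# Sample values taken from the question (hypothetical stand-ins for the
# real workbook contents).
sheet1_names = [
    "HR_MEWT_XX5906_TR_I_HR10001",
    "ABC",
    "HR_KURU_XX3714_TR_I_HR10001",
]
sheet2_rows = [
    ("HR_KURU_XX3714_TR_I_HR10001", 66),
    ("HR_PHKL_XX6541_TR_I_HR10001", 999),
    ("HR_MEWT_XX5906_TR_I_HR10001", 2),
]

result = match_ids(sheet1_names, sheet2_rows)
for name, id_ in result:
    print(name, id_)
```

With 30k rows in each sheet this does roughly 30k dictionary insertions plus 30k look-ups, instead of up to 30k × 30k cell comparisons.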

Related

How do I drop complete rows (including all values in it) that contain a certain value in my Pandas dataframe?

I'm trying to write a python script that finds unique values (names) and reports the frequency of their occurrence, making use of Pandas library. There's a total of around 90 unique names, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop any row (the full row) that contains a match for the regex "^monday.*", i.e. the word "monday" followed by any number of other characters. I want to drop every cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for the pattern "^monday.*" and then drop the rows, and all values in those rows, that match it. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!

Count Values for Each Number in a cell in a Column

I have an excel sheet like the following, and would like to go down each row and add 1 to each of the numbers listed under the L3 column. Eventually, I would like to output something like this:
L3s      Count   Attr Ids
4770     10      [370, 380, ...]
6420     8       [481, 490...]
21253    20      [580....290]
...      ...     ...
The count is derived by going through all of the rows and adding 1 to each L3 number whenever it is encountered. Attr Ids are the ids that contributed to the count. Is there any simple way to accomplish this in Excel without resorting to VBA or Python?
Thanks in advance!
If you have Excel for Windows (Office 365), you can use the following formulas:
(Note that I made the original data into a Table)
Sorted Unique list of the L3s:
=SORT(UNIQUE(FILTERXML("<t><s>" &SUBSTITUTE(SUBSTITUTE(TEXTJOIN("</s><s>",TRUE,Table1[L3s])," ",""),",","</s><s>")&"</s></t>","//s")))
Count of the L3s
=COUNT(FILTERXML("<t><s>" &SUBSTITUTE(SUBSTITUTE(TEXTJOIN("</s><s>",TRUE,Table1[L3s])," ",""),",","</s><s>")&"</s></t>","//s[.=" & F8 &"]"))
Associate Attr IDs
="[" &TEXTJOIN(",",TRUE,FILTER(Table1[attr],ISNUMBER(FIND(","&F8&",",SUBSTITUTE(","&Table1[L3s]& ","," ","")))))&"]"
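For readers who want to cross-check the formula results outside Excel, the same logic can be sketched in Python. This is an illustration only: the function name count_l3s and the sample rows are hypothetical, and the sketch assumes each row holds one attr id plus a comma-separated text of L3 numbers, which is what the SUBSTITUTE/TEXTJOIN handling in the formulas suggests.

```python
from collections import defaultdict


def count_l3s(rows):
    """Map each L3 number to the list of attr ids whose row mentions it."""
    attrs_by_l3 = defaultdict(list)
    for attr_id, l3_cell in rows:
        # Split text such as "4770, 6420" into numbers, ignoring spaces.
        for part in l3_cell.split(","):
            part = part.strip()
            if part:
                attrs_by_l3[int(part)].append(attr_id)
    # Return the entries ordered by L3 number, like SORT(UNIQUE(...)).
    return {l3: attrs_by_l3[l3] for l3 in sorted(attrs_by_l3)}


# Hypothetical rows: (attr id, comma-separated L3s text).
rows = [
    (370, "4770, 6420"),
    (380, "4770"),
    (481, "6420, 21253"),
]

summary = count_l3s(rows)
for l3, attr_ids in summary.items():
    # The count is simply the number of contributing attr ids.
    print(l3, len(attr_ids), attr_ids)
```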

Making a vector out of excel columns using python

everyone...
I just started with Python a couple of days ago because I need to handle some Excel data in order to automatically update certain cells in one file with data from another.
However, I'm kind of stuck since I have barely programmed before, and it's my first time using python as well, but my job required me to find a solution and I'm trying to make it work even though it's not my field of expertise.
I used the xlrd library, imported my file and managed to print the columns I need. However, I can't find a way to put those columns into a matrix so that I can handle the data like this:
Matrix = [DataColumnA DataColumnG DataColumnH], of size [nrows x 3]
For now, I have 3 separate outputs for the 3 columns I need, but I'm trying to join them into one big matrix.
So far my code looks like this:
import xlrd
workbook = xlrd.open_workbook("190219_serviciosWRAmanualV5.xls")
worksheet = workbook.sheet_by_name("ServiciosDWDM")
workbook2 = xlrd.open_workbook("Potencia2.xlsx")
worksheet2 = workbook2.sheet_by_name("Hoja1")
filas = worksheet.nrows
filas2 = worksheet2.nrows
columnas = worksheet.ncols
for row in range(2, filas):
    Equipo_A = worksheet.cell(row, 12).value
    Client_A = worksheet.cell(row, 13).value
    Line_A = worksheet.cell(row, 14).value
    print(Equipo_A, Line_A, Client_A)
So far, as mentioned above, I have only obtained the data in the columns, which is what I'm printing.
What I mainly need to do is read the cell in the first row of Column A and look for it in the other Excel file. If the names match, I have to check that, for the same row in file 1, the data in both Column G and Column H is the same as the data in the second file.
If they match, I have to update Column J in the first file with the data from the second file.
My other approach is to retrieve the value of the cell in Column A and look for it in Column A of the second file; then I would use an if conditional to see whether Columns G and H are equal to Column C of the second file, and so on.
The problem is that I have no idea how to pinpoint the position of the cell and extract the data to build the conditional for this second approach.
I'm not sure whether building that matrix is a good approach or whether the second way is better, so any suggestion would be greatly appreciated.
Thank you in advance!
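One way to get the [nrows x 3] structure described above is to zip the three columns into a list of rows, and then index those rows by the Column A value so that a name from one file can be found quickly in the other. The sketch below uses hypothetical column values and made-up helper names (build_matrix, index_by_first_column); with xlrd, each column list could presumably be obtained with something like worksheet.col_values(colx, start_rowx).

```python
def build_matrix(col_a, col_g, col_h):
    """Combine three equal-length columns into an nrows x 3 list of rows."""
    return [list(row) for row in zip(col_a, col_g, col_h)]


def index_by_first_column(matrix):
    """Map the first value of each row to the full row for fast look-up."""
    return {row[0]: row for row in matrix}


# Hypothetical column contents standing in for Columns A, G and H.
col_a = ["node1", "node2", "node3"]
col_g = [1.5, 2.0, 3.25]
col_h = ["x", "y", "z"]

matrix = build_matrix(col_a, col_g, col_h)
lookup = index_by_first_column(matrix)

print(matrix[1])        # second row of the matrix
print(lookup["node3"])  # row whose Column A value is "node3"

# Comparing a row from the first file with the matching row of the second
# file then reduces to a dictionary look-up followed by an equality test.
other_row = ["node3", 3.25, "z"]
same = lookup.get(other_row[0]) == other_row
```

Because zip stops at the shortest input, the three columns should be read over the same row range.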

Specifying a common range for the xlsread function in MATLAB

I am trying to figure out how to specify a common range for the xlsread() function in MATLAB.
Usually I use n=xlsread('filename','#sheet','A1:A10'), but I have quite a bit of data in the same sheet and I'd like to know if I can specify one row range for all of it, i.e. if all my data is in rows '1:10', I want to specify 1:10 as the range and pass only the letter of each column.
I was thinking to do it as follows:
function [a,b,c]=getdata(filename,'1:10')
a=xlsread(filename,1,'A:A'???)
b=xlsread(filename,1,'B:B'???)
c=xlsread(filename,1,'C:C'???)
end
After some research I could not find any information as to how this is done.
Thanks in advance,
Greg
If you want to read rows 1 to 10 of column A, use:
data = xlsread(filename, 1, 'A1:A10');
If you want to read rows 1 to 10 of all columns, use:
data = xlsread(filename, 1, '1:10');
If you want to read rows 1 to 10 of, say, the first three columns A, B, and C, use:
data = xlsread(filename, 1, 'A1:C10');
Using dynamic variable names is generally a bad idea. But if you still want to create a, b, c and so on depending on the number of columns in the Excel file, you can use:
for k=1:size(data,2)
    assignin('caller', char(96+k), data(:,k)); % or char(64+k) for uppercase letters
end
The above works only if the number of columns is at most 26, so it is feasible only when you're dealing with a few columns. I still recommend avoiding it.

Inserting data into a particular column of an Excel sheet [duplicate]

I have a .mat file which contains titles={'time','data'} and 2 column vectors:
time=[1;2;3;4;5] and data=[10;20;30;40;50].
I created a new cell array called table={'time','data';time data} and I used:
xlswrite(filename,table);
However, when I open the xlsx file it shows only the titles and not the numbers.
I saw that xlswrite writes an empty cell when I try to export more than one number in a cell.
Is there anything I can do to export the whole vector without writing each value to its cell individually?
The final result I am trying to get looks like this:
time data
1 10
2 20
3 30
4 40
5 50
You have a couple options. Usually what I do is break it into two xlswrite calls, one for the header and one for the data.
titles = {'time','data'};
time = [1;2;3;4;5];
data = [10;20;30;40;50];
xlswrite('myfile.xlsx', titles, 'Sheet1', 'A1');
xlswrite('myfile.xlsx', [time, data], 'Sheet1', 'A2');
Alternatively, if you have R2013b or newer you can also use the table builtin, which has its own method for writing out data. With the same sample data:
mytable = table(time, data, 'VariableNames', titles);
writetable(mytable, 'myfile.xlsx');

Resources