Pandas read_excel with Hyperlink - excel

I have an Excel spreadsheet that I am reading into a Pandas DataFrame:
df = pd.read_excel("file.xls")
However, one of the columns of the spreadsheet contains text that has a hyperlink associated with it. How do I access the underlying hyperlink in Pandas?

This can be done with openpyxl; I'm not sure it's possible with Pandas at all. Here's how I've done it:
import openpyxl
wb = openpyxl.load_workbook('yourfile.xlsm')
ws = wb['Sheet1']  # wb.get_sheet_by_name('Sheet1') is deprecated in newer openpyxl versions
print(ws.cell(row=2, column=1).hyperlink.target)
You can also use IPython and assign the hyperlink object to a variable:
t = ws.cell(row=2, column=1).hyperlink
Then type t. and press Tab to see everything you can do with, or access from, the object.
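If the goal is a DataFrame that carries the link targets next to the cell values, the same openpyxl calls can be looped over a whole column. The following is only a minimal sketch: the helper name read_with_links, the "_url" column suffix, and the demo workbook are assumptions made up for illustration, and it assumes the first row holds the headers.

```python
import os
import tempfile

import openpyxl
import pandas as pd


def read_with_links(path, sheet_name, link_column):
    """Return a DataFrame with an extra '<header>_url' column for one column.

    link_column is 1-based, like openpyxl's own column numbering.
    """
    wb = openpyxl.load_workbook(path)  # default read_only=False keeps hyperlinks
    ws = wb[sheet_name]
    rows = ws.iter_rows()
    header = [cell.value for cell in next(rows)]
    url_name = f"{header[link_column - 1]}_url"
    records = []
    for row in rows:
        record = {name: cell.value for name, cell in zip(header, row)}
        link = row[link_column - 1].hyperlink  # None when the cell has no link
        record[url_name] = link.target if link else None
        records.append(record)
    return pd.DataFrame(records)


# Build a tiny demo workbook so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_links.xlsx")
demo_wb = openpyxl.Workbook()
demo_ws = demo_wb.active
demo_ws.title = "Sheet1"
demo_ws.append(["Site", "Rank"])
demo_ws.append(["Example", 1])
demo_ws.append(["No link", 2])
demo_ws["A2"].hyperlink = "https://example.com/"
demo_wb.save(demo_path)

df = read_with_links(demo_path, "Sheet1", link_column=1)
print(df)
```

This reads the sheet once with openpyxl and hands the result to pandas, so read_excel is not involved at all; type conversion is therefore whatever openpyxl returns for each cell.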

Here is a quick monkey patch, without converters or anything like that, for when you want to treat ALL cells that have a hyperlink as hyperlinks. A more sophisticated approach would at least let you choose which columns to treat as hyperlinked, or keep both the value and the hyperlink for the same cell in the DataFrame; I don't know whether converters could do that. (BTW, I also played with data_only and keep_links, which did not help; only changing read_only worked, and I suppose that can slow your code down.)
P.S.: This works only with xlsx files, i.e., when the engine is openpyxl.
P.P.S.: If you are reading this in the future and the issue https://github.com/pandas-dev/pandas/issues/13439 is still open, remember to check for changes to _convert_cell and load_workbook in pandas.io.excel._openpyxl and update the patch accordingly.
import pandas
from pandas.io.excel._openpyxl import OpenpyxlReader
import numpy as np
from pandas._typing import FilePathOrBuffer, Scalar

def _convert_cell(self, cell, convert_float: bool) -> Scalar:
    from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    # here we add the hyperlink support:
    if cell.hyperlink and cell.hyperlink.target:
        return cell.hyperlink.target
        # just as an example, you can return both the value and the hyperlink:
        # comment out the return above and uncomment the return below
        # btw this may hurt you when parsing values, if the symbols "|||" appear in the value or hyperlink.
        # return f'{cell.value}|||{cell.hyperlink.target}'
    # here starts the original code, except that "if" became "elif"
    elif cell.is_date:
        return cell.value
    elif cell.data_type == TYPE_ERROR:
        return np.nan
    elif cell.data_type == TYPE_BOOL:
        return bool(cell.value)
    elif cell.value is None:
        return ""  # compat with xlrd
    elif cell.data_type == TYPE_NUMERIC:
        # GH5394
        if convert_float:
            val = int(cell.value)
            if val == cell.value:
                return val
        else:
            return float(cell.value)
    return cell.value

def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
    from openpyxl import load_workbook
    # had to change read_only to False:
    return load_workbook(
        filepath_or_buffer, read_only=False, data_only=True, keep_links=False
    )

OpenpyxlReader._convert_cell = _convert_cell
OpenpyxlReader.load_workbook = load_workbook
After adding the code above to your Python file, you can call df = pandas.read_excel(input_file) as usual.
After writing all this, it occurred to me that it might be easier and cleaner to just use openpyxl by itself ^_^

As slaw commented, it doesn't grab the hyperlink, only the text.
Here test.xlsx contains links in the 9th column:
from openpyxl import load_workbook
workbook = load_workbook('test.xlsx')
worksheet = workbook.active
column_indices = [9]
for row in range(2, worksheet.max_row + 1):
    for col in column_indices:
        filelocation = worksheet.cell(column=col, row=row)  # this cell holds the hyperlink
        text = worksheet.cell(column=col + 1, row=row)  # this cell holds your text
        # replace the text cell with a HYPERLINK formula that combines both
        worksheet.cell(column=col + 1, row=row).value = '=HYPERLINK("' + filelocation.value + '","' + text.value + '")'
workbook.save('test.xlsx')
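One thing to watch with the string concatenation above is that a double quote inside the link or the text would break the formula. A small helper can double such quotes, which is how Excel escapes them inside string literals. This is only a sketch; the function name hyperlink_formula is made up for illustration.

```python
def hyperlink_formula(url, text):
    """Build an Excel HYPERLINK formula, doubling any embedded double quotes."""
    def quote(value):
        # Excel escapes a double quote inside a string literal by doubling it.
        return '"' + str(value).replace('"', '""') + '"'

    return f"=HYPERLINK({quote(url)},{quote(text)})"


plain = hyperlink_formula("https://example.com/a", "Example")
quoted = hyperlink_formula("https://example.com/a", 'The "best" page')
print(plain)
print(quoted)
```

The result can be assigned to a cell's value in place of the hand-built string. Note that str() turns an empty cell's None into the text "None", so empty cells may need to be skipped first.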

You cannot do that in pandas. You can try other libraries designed to work with Excel files.

Related

Is there Python code that can access a cell in Excel and change its letters to their opposite form?

I'm new to Python and I need to write a program that changes the letters in each cell to their opposite form. It also needs to find out how many names are in the column and which row the name list is in, so that it can change all of the names. The goal is to be able to change the names without ever looking at the name list, for privacy reasons. I'm currently using PyCharm and openpyxl, if anyone is wondering. The picture shows the before and after of how it should look. I have made a few attempts, but I can't come up with a way to change the letters. I also tried a replacement dictionary (replacement = {'Danial': 'Wzmrzo'}), but that requires me to look at the name list before I can change the letters.
import openpyxl
from openpyxl import Workbook, load_workbook
from openpyxl.utils import get_column_letter
print("Type the file name:")
DF = input()
wb = load_workbook(DF + '.xlsx')
print("Sheet Name:")
sht = input()
ws = wb[sht]
NC = str(input("Where is the Name Column?"))
column = ws[ NC ]
column_list = [column[x].value for x in range(len(column))]
print(column_list)
wb.save(DF + '.xlsx')
[Images: "Before" and "After" screenshots of the name column]
Warning: I'm not too familiar with openpyxl and how it accesses rows and columns, and the API seems to have changed a lot in the last few years. This should give you an idea of how to make it work, but it might not work exactly as written depending on your version.
To find the name column you could use
name_col = False
# loop along the top row looking for "Name"
for i, x in enumerate(ws.iter_cols(max_row=1)):
    if x[0].value == "Name":
        name_col = i + 1  # enumerate is 0-indexed, Excel rows/cols are 1-indexed
        break
if name_col:
    pass  # insert name changing code here
else:
    print("'Name' column not found.")
To change the names you could use (insert this in the code above)
# loop down the name column
for i, x in enumerate(ws.iter_rows(min_col=name_col, max_col=name_col)):
    # we need to skip the header row
    if i == 0:
        continue
    name = x[0].value
    new_name = ""
    for c in name:
        # ord() gets the ASCII value of the char, which is mirrored within its alphabet; chr() then converts it back to a character
        if ord(c) > 90:
            new_c = chr(25 - (ord(c) - 97) + 97)
        else:
            new_c = chr(25 - (ord(c) - 65) + 65)
        new_name += new_c  # strings have no append(), so concatenate instead
    ws.cell(row=i+1, column=name_col).value = new_name  # enumerate is 0-indexed, Excel rows/cols are 1-indexed, hence i+1
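The letter arithmetic above assumes every character is an ASCII letter, so spaces, hyphens or digits in a name would come out garbled. As a rough sketch, the same mirroring can be wrapped in a small function that leaves everything else untouched; the function name mirror_letters is made up for illustration.

```python
def mirror_letters(name):
    """Swap each ASCII letter with its opposite (a<->z, b<->y, ...), keeping case.

    Characters that are not ASCII letters (spaces, hyphens, digits) are kept as-is.
    """
    result = []
    for c in name:
        if "a" <= c <= "z":
            result.append(chr(ord("a") + ord("z") - ord(c)))
        elif "A" <= c <= "Z":
            result.append(chr(ord("A") + ord("Z") - ord(c)))
        else:
            result.append(c)
    return "".join(result)


print(mirror_letters("Danial"))
print(mirror_letters("Mary-Ann Lee"))
```

Applying the function twice gives back the original name, so the same code can be used to undo the change later.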

AttributeError: 'Workbook' object has no attribute 'add_format' using openpyxl engine in append mode

I am using ExcelWriter with the openpyxl engine because I want to open the Excel file in append mode.
I am using append mode so that I can clear the previous sheets in the workbook on every re-run. But I'm getting this error when using the syntax below to add formats to the Excel file: AttributeError: 'Workbook' object has no attribute 'add_format'
How do I make it work with the openpyxl engine?
def write_dataframes_to_excel_sheet(dataframes, dir, name, writer):
    #with pd.ExcelWriter(f'{dir}/{name}.xlsx', engine='xlsxwriter') as writer:
    workbook = writer.book
    worksheet = workbook.create_sheet(str(id))
    writer.sheets[str(id)] = worksheet
    COLUMN = 0
    row = 0
    for df in dataframes:
        #worksheet.write_string(row, COLUMN, df.name)
        row += 1
        df.to_excel(writer, sheet_name=str(id),
                    startrow=row, startcol=COLUMN, index=False)
        header_format = workbook.add_format({'bold': True, 'fg_color': '00C0C0C0', 'border': 1})
        for col_num, value in enumerate(df.columns.values):
            worksheet.write(0, col_num, value, header_format)
            column_len = df[value].astype(str).str.len().max()
            column_len = max(column_len, len(value)) + 3
            worksheet.set_column(col_num, col_num, column_len)
        row += df.shape[0] + 3

with pd.ExcelWriter(input_filename, engine='openpyxl', mode='a') as writer:
    write_dataframes_to_excel_sheet(df_array, 'C:/Users/path', input_filename, writer)
AttributeError: 'Workbook' object has no attribute 'add_format'
The add_format() method is an xlsxwriter method so that won't work with openpyxl. You will need to use the equivalent openpyxl method.
You can find all the info here.
I was searching for the same thing, so let me give you a snippet. xlsxwriter seems so much easier though.
from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
from openpyxl.styles import Border, Font, Side

wb = Workbook(write_only=True)
ws = wb.create_sheet('test')

# a write-only sheet cannot be iterated; build styled cells and append them row by row
thin = Side(border_style='thin', color='FF000000')
for _ in range(2):  # two rows
    row = []
    for _ in range(3):  # three columns
        cell = WriteOnlyCell(ws, value="hello world")
        cell.font = Font(bold=True, color='00C0C0C0')
        cell.border = Border(left=thin, right=thin, top=thin, bottom=thin)
        row.append(cell)
    ws.append(row)
wb.save('test.xlsx')
You should find here all the border styles you would like to add.
Hope this helps!
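For the header formatting in the question specifically (bold, grey fill, border, column width), a rough openpyxl counterpart to add_format() and set_column() might look like the sketch below. The helper name style_header and the demo data are made up for illustration, and the width is only an approximation based on character counts.

```python
from openpyxl import Workbook
from openpyxl.styles import Border, Font, PatternFill, Side
from openpyxl.utils import get_column_letter


def style_header(ws, header_row=1):
    """Make the header cells bold, grey-filled and bordered, and size each column."""
    thin = Side(border_style="thin", color="FF000000")
    for cell in ws[header_row]:
        cell.font = Font(bold=True)
        cell.fill = PatternFill(fill_type="solid", fgColor="00C0C0C0")
        cell.border = Border(left=thin, right=thin, top=thin, bottom=thin)
        # Widest value in the column plus padding, similar to set_column() in xlsxwriter.
        letter = get_column_letter(cell.column)
        widest = max(len(str(c.value)) for c in ws[letter] if c.value is not None)
        ws.column_dimensions[letter].width = widest + 3


# Demo on an in-memory workbook with two columns.
wb = Workbook()
ws = wb.active
ws.append(["name", "description"])
ws.append(["a", "a fairly long value"])
style_header(ws)
print(ws["A1"].font.bold, ws.column_dimensions["B"].width)
```

With pd.ExcelWriter(..., engine='openpyxl'), writer.sheets[sheet_name] is an ordinary openpyxl worksheet, so a helper like this can be called on it after df.to_excel() has written the data.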

Extract some data from a text file

I am not so experienced in Python.
I have a “CompilerWarningsAllProtocol.txt” file that contains something like this:
" adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"
How can I extract these three paths (C:..., E:..., C:...) from the txt file and fill an Excel column named “Affected Item” with them?
Can I do it with the re.findall or re.search methods?
For now, the script checks whether the input txt file exists at my location and confirms it. After that, it creates a blank Excel file with headers, but I don't know how to populate the "Affected Item" column of the Excel file with these paths.
Thanks for the help. Here is the code:
import os
import os.path
import re
import xlsxwriter
import openpyxl
from jira import JIRA
import pandas as pd
import numpy as np

# Print error message if no "CompilerWarningsAllProtocol.txt" file exists in the folder
# (raw string so the backslashes in the Windows path are not treated as escapes)
inputpath = r'D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt'
if os.path.isfile(inputpath) and os.access(inputpath, os.R_OK):
    print(" 'CompilerWarningsAllProtocol.txt' exists and is readable")
else:
    print("Either the file is missing or not readable")

# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('CompilerWarningsFresh.xlsx')
worksheet = workbook.add_worksheet('Results')

# Widen the columns accordingly.
worksheet.set_column('A:A', 20)
worksheet.set_column('B:AZ', 45)

# Create the headers
headers = ('Module', 'Affected Item', 'Issue', 'Class of Issue', 'Issue Root Cause', 'Type of Issue',
           'Source of Issue', 'Test sequence', 'Current Issue appearances in module')

# Create the bold headers and font size
format1 = workbook.add_format({'bold': True, 'font_color': 'black'})
format1.set_font_size(14)
format1.set_border()
row = col = 0
for item in headers:
    worksheet.write(row, col, item, format1)
    col += 1
workbook.close()
I agree with @dantechguy that csv is probably easier (and more lightweight) than writing a real xlsx file, but if you want to stick to the Excel format, the code below will work. Also, based on the code you've provided, you don't need to import openpyxl, jira, pandas or numpy.
The regex here matches full paths with any drive letter A-Z, followed by "type:warning". If you don't need to check for the warning and simply want to get every path in the file, you can delete everything in the regex after \S+. And if you know you'll only ever want drives C and E, just change A-Z to CE.
warningPathRegex = r"[A-Z]:\\\S+(?=\s*type:warning)"
compilerWarningFile = r"D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt"
warningPaths = []
with open(compilerWarningFile, 'r') as f:
    fullWarningFile = f.read()
    warningPaths = re.findall(warningPathRegex, fullWarningFile)

# ... open Excel file, then before workbook.close():
pathColumn = 1  # Affected item
for num, warningPath in enumerate(warningPaths):
    worksheet.write(num + 1, pathColumn, warningPath)  # num + 1 to skip header row
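For comparison, the CSV route mentioned above needs only the standard library. The sketch below runs the same regex on the sample lines from the question and writes a one-column CSV into an in-memory buffer; the function name paths_to_csv and the SAMPLE constant are made up for illustration.

```python
import csv
import io
import re

# Sample lines copied from the question (raw string keeps the backslashes literal).
SAMPLE = r"""adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"""

PATH_RE = re.compile(r"[A-Z]:\\\S+(?=\s*type:warning)")


def paths_to_csv(text, out):
    """Write every matched path under an 'Affected Item' header; return the paths."""
    paths = PATH_RE.findall(text)
    writer = csv.writer(out)
    writer.writerow(["Affected Item"])
    writer.writerows([path] for path in paths)
    return paths


buffer = io.StringIO()
paths = paths_to_csv(SAMPLE, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of the buffer, pass open("some_name.csv", "w", newline="") as the second argument; Excel opens the resulting file directly.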

Pandas read excel and skip cells with strikethrough

I have to process some xlsx files received from an external source. Is there a more straightforward way to load an xlsx file in pandas while also skipping rows with strikethrough?
Currently I have to do something like this:
import pandas as pd, openpyxl
working_file = r"something.xlsx"
working_wb = openpyxl.load_workbook(working_file, data_only=True)
working_sheet = working_wb.active
empty = []
for row in working_sheet.iter_rows("B", row_offset=3):
    for cell in row:
        if cell.font.strike is True:
            p_id = working_sheet.cell(row=cell.row, column=37).value
            empty.append(p_id)
df = pd.read_excel(working_file, skiprows=3)
df = df[~df["ID"].isin(empty)]
...
Which works but only by going through every excel sheet twice.
Ended up subclassing pd.ExcelFile and _OpenpyxlReader. It was easier than I thought :)
import pandas as pd
from pandas.io.excel._openpyxl import _OpenpyxlReader
from pandas._typing import Scalar
from typing import List
from pandas.io.excel._odfreader import _ODFReader
from pandas.io.excel._xlrd import _XlrdReader

class CustomReader(_OpenpyxlReader):
    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
        data = []
        for row in sheet.rows:
            first = row[1]  # I need the strikethrough check on this cell only
            if first.value is not None and first.font.strike:
                continue
            data.append([self._convert_cell(cell, convert_float) for cell in row])
        return data

class CustomExcelFile(pd.ExcelFile):
    _engines = {"xlrd": _XlrdReader, "openpyxl": CustomReader, "odf": _ODFReader}
With the custom classes set, just pass the file like a normal ExcelFile, set the engine to openpyxl and voila! Rows with strikethrough cells are gone.
excel = CustomExcelFile(r"excel_file_name.xlsx", engine="openpyxl")
df = excel.parse()
print(df)
In this case I would not use Pandas. Just use openpyxl, work from the end of the worksheet and delete rows accordingly. Working backwards from the end of the worksheet means you don't suffer with side-effects when deleting rows.
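A minimal sketch of that backwards deletion, assuming the strikethrough only needs to be checked in one column; the helper name drop_struck_rows and the in-memory demo data are made up for illustration.

```python
from openpyxl import Workbook
from openpyxl.styles import Font


def drop_struck_rows(ws, check_column, first_data_row=2):
    """Delete rows whose cell in check_column is struck through, bottom to top."""
    for row_idx in range(ws.max_row, first_data_row - 1, -1):
        cell = ws.cell(row=row_idx, column=check_column)
        if cell.font.strike:
            # Deleting from the bottom up keeps the indices of unvisited rows valid.
            ws.delete_rows(row_idx)


# Demo on an in-memory sheet: header plus three rows, the middle one struck through.
wb = Workbook()
ws = wb.active
ws.append(["ID", "Name"])
ws.append([1, "keep"])
ws.append([2, "drop"])
ws.append([3, "keep too"])
ws["B3"].font = Font(strike=True)

drop_struck_rows(ws, check_column=2)
remaining = [row for row in ws.values]
print(remaining)
```

The remaining rows (ws.values) can then be processed directly or handed to pandas, so the file only has to be read once.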

Iteration error writing file to excel with python

import string
import xlrd
import xlsxwriter
workbook = xlsxwriter.Workbook('C:\T\file.xlsx')
worksheet = workbook.add_worksheet()
book = open_workbook(r'C:\T\test.xls','r')
sheet = book.sheet_by_index(0)
for row_index in range(sheet.nrows):
    for col_index in range(sheet.ncols):
        print sheet.cell(row_index,0).value
        x = sheet.cell(row_index,0).value
        worksheet.write_string(row_index,col_index,x)
workbook.close()
I'm a newbie to Python. Here I'm trying to read data from an xls file with xlrd and copy it to another xlsx file with the xlsxwriter module, but the data doesn't get written to the created xlsx sheet. Please guide me through this. Above is my exact code; please correct me if anything is wrong.
Thanks in advance.
Your example program almost works. Mainly, the open_workbook() function needs to be prefixed with its module name (xlrd.open_workbook()), and it is better to use XlsxWriter's write() instead of write_string() unless you are sure that all the data you are reading is of a string type. Also, the program was only reading values from column 0.
Here is the same example with those changes in place. I've also prefixed the variable names with in_ and out_ to make it clearer which module is calling which method:
import xlrd
import xlsxwriter

out_workbook = xlsxwriter.Workbook('file.xlsx')
out_worksheet = out_workbook.add_worksheet()

# xlrd.open_workbook() takes no file-mode argument, so the 'r' is dropped
in_workbook = xlrd.open_workbook(r'test.xls')
in_worksheet = in_workbook.sheet_by_index(0)

for row_index in range(in_worksheet.nrows):
    for col_index in range(in_worksheet.ncols):
        cell_value = in_worksheet.cell(row_index, col_index).value
        out_worksheet.write(row_index, col_index, cell_value)
        print(cell_value)

out_workbook.close()
