Read an Excel file from a URL

I am trying to read an excel file from the following URL: http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx
I used the code:
import pandas as pd

ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
indicadores = pd.read_excel(ruta_indicadores)
But when I run the code, the resulting dataframe is empty even though the file is not, so I don't know why it isn't reading the Excel file.
Here is a screenshot of the Excel file:

The problem is that pd.read_excel() reads the first sheet by default, but the table you want lives in a sheet with a specific name, "HOJA1".
Here is the code that worked:
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
indicadores = pd.read_excel(ruta_indicadores, sheet_name='HOJA1')
Furthermore, here is a more robust solution:
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
indicadores_dict = pd.read_excel(ruta_indicadores, sheet_name=None)
# remove the empty sheet
sheetname_list = list(filter(lambda x: not indicadores_dict[x].empty, indicadores_dict.keys()))
df_list = [indicadores_dict[s] for s in sheetname_list]
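If the remaining sheets share a compatible layout, you could then combine them into a single frame; a minimal sketch, assuming the columns line up across sheets:
# stack the non-empty sheets into one DataFrame (assumes compatible columns)
indicadores = pd.concat(df_list, ignore_index=True)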
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

First of all, let's discuss why your code does not produce any output, and then move on to how to resolve it. The issue is:
You are fetching the table directly from the URL, and the workbook contains a cache sheet. Because of this, your pd.read_excel() call does not find your primary sheet; it reads the cache sheet instead.
Here is how I found that there is another sheet in your data; follow the code given below:
# Import the required library
from openpyxl import load_workbook
# Load the downloaded 'Cuadro_17.xlsx' file as a workbook
indicadores = load_workbook(filename="Cuadro_17.xlsx")
# Print the sheet names of 'Cuadro_17.xlsx'
indicadores.sheetnames
# Output of the above cell:
['Cognos_Office_Connection_Cache', 'HOJA1']
As you can see, the first sheet is Cognos_Office_Connection_Cache, and there is no table data we can fetch from it.
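As a side note, you can also list the sheet names without saving the file locally by downloading it into memory first; a sketch using requests and io (assumed to be available in your environment):
import io
import requests
import pandas as pd

# Download the workbook into memory and list its sheet names
resp = requests.get('http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx')
xls = pd.ExcelFile(io.BytesIO(resp.content))
print(xls.sheet_names)  # expected: ['Cognos_Office_Connection_Cache', 'HOJA1']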
The appropriate solution in this scenario:
Now we know that the data is stored in the HOJA1 sheet, so we can fetch that specific sheet. Another important point is that your data uses multi-level column headers, so we have to read it accordingly. The code for this is below:
# Import the required library
import pandas as pd
# Store the URL in the 'ruta_indicadores' variable
ruta_indicadores = 'http://www.ssf.gob.sv/html_docs/boletinesweb/bdiciembre2020/III_Bancos/Cuadro_17.xlsx'
# Read the Excel file from the URL with pd.read_excel(), specifying the sheet name,
# the rows to skip before the table starts, and a two-row header for the multi-level columns
indicadores = pd.read_excel(ruta_indicadores, sheet_name='HOJA1', skiprows=7, header=[0, 1])
# Drop the unnecessary column
indicadores.drop('Unnamed: 0_level_0', axis=1, inplace=True)
# Rename the child-level column of 'Conceptos'
indicadores.rename(columns={'Unnamed: 1_level_1': ''}, inplace=True)
# Replace 'NaN' entries in the 'indicadores' data with empty strings
indicadores = indicadores.fillna('')
# Print the first few records of the 'indicadores' data
indicadores.head()
The full output is too big to print here, so I have attached a sample output of the above code as an image.
As you can see, we have fetched the table successfully. Hope this solution helps you.
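As an optional follow-up (not part of the original answer): if you prefer single-level column names after the read, one possible sketch is to join the two header levels into plain strings:
# Flatten the two header levels into single string column names
indicadores.columns = [' '.join(str(level) for level in col).strip()
                       for col in indicadores.columns]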

Related

How do I export a Google sheet as a CSV in Python without using pandas?

I'm using Python 3.9 and the following versions of the Google Sheets libraries:
gsheets==0.5.1
gspread==3.6.0
I'm trying to export my Google sheet as a CSV file. In older versions of Python, I was using the pandas module like so:
import sys
import gspread
import pandas as pd
...
client = gspread.authorize(creds)
sheet = client.open('My_Sheet_name')
# get the third sheet of the Spreadsheet. This
# contains the data we want
sheet_instance = sheet.get_worksheet(3)
records_data = sheet_instance.get_all_records()
records_df = pd.DataFrame.from_dict(records_data)
# write the records to stdout as CSV
records_df.to_csv(sys.stdout)
How would I export the CSV without using Pandas? I ask because it would seem newer versions of Python (e.g. 3.9) do not support the pandas module yet.
I believe your goal and situation are as follows.
You want to retrieve one of sheets in Google Spreadsheet as the CSV data.
You want to achieve this using gspread without using Pandas.
You have already been able to use gspread.
In this case, in order to achieve your goal, I would like to propose using the endpoint for exporting the sheet as CSV data. The access token can be retrieved from the client returned by client = gspread.authorize(creds). When this proposal is reflected in your script, it becomes as follows.
Modified script:
import requests

client = gspread.authorize(creds)
sheet = client.open('My_Sheet_name')
# get the third sheet of the Spreadsheet. This
# contains the data we want
sheet_instance = sheet.get_worksheet(2) # Modified
# I added the script below.
url = 'https://docs.google.com/spreadsheets/d/' + sheet.id + '/gviz/tq?tqx=out:csv&gid=' + str(sheet_instance.id)
headers = {'Authorization': 'Bearer ' + client.auth.token}
res = requests.get(url, headers=headers)
print(res.text)
When the above script is run, the 3rd sheet is exported as CSV data.
Note:
About sheet_instance = sheet.get_worksheet(3): your comment says "get the third sheet of the Spreadsheet", but get_worksheet() is zero-indexed, so this call actually retrieves the 4th sheet in the Spreadsheet. Please be careful about this.
In this case, I think that you can also use the endpoint as follows.
url = 'https://docs.google.com/spreadsheets/d/' + sheet.id + '/export?format=csv&gid=' + str(sheet_instance.id)
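For reference, a minimal sketch of using this alternative endpoint to save the sheet to a local file (the output filename is illustrative):
import requests

url = 'https://docs.google.com/spreadsheets/d/' + sheet.id + '/export?format=csv&gid=' + str(sheet_instance.id)
headers = {'Authorization': 'Bearer ' + client.auth.token}
res = requests.get(url, headers=headers)
# Save the exported CSV to a local file
with open('exported_sheet.csv', 'w', encoding='utf-8') as f:
    f.write(res.text)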
You can use the DictWriter from the csv module to add each dictionary as a separate line to the CSV result:
import sys
from csv import DictWriter

dict_writer = DictWriter(sys.stdout, records_data[0].keys())
dict_writer.writeheader()
for data in records_data:
    dict_writer.writerow(data)
If you want to write the csv to a file instead of stdout, you can use this snippet instead:
from csv import DictWriter

# newline='' is recommended by the csv module to avoid blank lines on Windows
with open('./path/to/the/file', 'w', newline='') as csvfile:
    dict_writer = DictWriter(csvfile, records_data[0].keys())
    dict_writer.writeheader()
    for data in records_data:
        dict_writer.writerow(data)
Example:
records_data contains the following values: [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}, {'a': 3, 'b': 4}]
Then the header is taken from the keys of an arbitrary element of the list (in this case the first one): a and b.
Then the values are added line by line to the csv:
a,b
1,2
2,3
3,4
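One caveat worth noting (not part of the original answer): DictWriter raises a ValueError when a row dictionary contains keys that are not among the field names; passing extrasaction='ignore' makes it skip them instead:
# Skip unknown keys instead of raising ValueError
dict_writer = DictWriter(sys.stdout, records_data[0].keys(), extrasaction='ignore')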

Appending data from multiple excel files into a single excel file without overwriting using python pandas

Here is my current code below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) excel files. I am trying to pull this information out of all these files to compile into a single new file appending to that file each time. I'm going to manually clean up the destination file for the time being as I will improve this script going forward.
What I currently have works fine for a single sheet, but I overwrite my destination file every time I add a new file to the read-in list.
I've tried adding mode='a' and a couple of different ways to concat at the end of my function.
import pandas as pd

def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows=20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks in advance for any help; I can't seem to find an answer to this exact situation on here. Cheers.
Ideally you want to loop through the files and read the data into a list, then concatenate the individual dataframes, and only then write the new dataframe out once. This assumes the data being pulled is the same size/shape and the sheet name is the same in every file. If the sheet name changes per file, look into the zip() function to pair each filename with its sheet name (see the sketch after the code below).
This should get you started:
import pandas as pd

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
sheet_name = 'specific_sheet_name'

# read all files first
df_hold_list = []
for name in names:
    xls = pd.ExcelFile(name)
    df = pd.read_excel(xls, sheet_name, nrows=20)
    df_hold_list.append(df)

# concatenate dfs; axis=0 stacks the frames vertically (appends rows),
# axis=1 would place them side by side instead
df1 = pd.concat(df_hold_list, axis=0)

# write the combined dataframe to the new file
df1.to_excel(destination, index=False)
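And here is the zip() idea mentioned above as a hypothetical sketch (the per-file sheet names are made up for illustration):
# Pair each filename with its own sheet name
names = ['sheet1.xlsx', 'sheet2.xlsx']
sheet_names = ['SheetA', 'SheetB']  # assumed per-file sheet names

df_hold_list = []
for name, sheet in zip(names, sheet_names):
    df_hold_list.append(pd.read_excel(name, sheet_name=sheet, nrows=20))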

Extract some data from a text file

I am not so experienced in Python.
I have a “CompilerWarningsAllProtocol.txt” file that contains something like this:
" adm_1 C:\Work\CompilerWarnings\adm_1.h type:warning Reason:wunused
adm_2 E:\Work\CompilerWarnings\adm_basic.h type:warning Reason:undeclared variable
adm_X C:\Work\CompilerWarnings\adm_X.h type:warning Reason: Unknown ID"
How can I extract these three paths (C:..., E:..., C:...) from the txt file and fill an Excel column named "Affected Item"?
Can I do it with the re.findall or re.search methods?
For now the script checks that the input txt file exists in my location and confirms it. After that it creates a blank Excel file with headers, but I don't know how to populate the Excel file with these paths, written in a column named "Affected Item", let's say.
Thanks for the help. I will copy-paste the code:
import os
import os.path
import re
import xlsxwriter
import openpyxl
from jira import JIRA
import pandas as pd
import numpy as np

# Print error message if no "CompilerWarningsAllProtocol.txt" file exists in the folder
inputpath = r'D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt'
if os.path.isfile(inputpath) and os.access(inputpath, os.R_OK):
    print(" 'CompilerWarningsAllProtocol.txt' exists and is readable")
else:
    print("Either the file is missing or not readable")

# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('CompilerWarningsFresh.xlsx')
worksheet = workbook.add_worksheet('Results')

# Widen the columns correspondingly.
worksheet.set_column('A:A', 20)
worksheet.set_column('B:AZ', 45)

# Create the headers
headers = ('Module', 'Affected Item', 'Issue', 'Class of Issue', 'Issue Root Cause', 'Type of Issue',
           'Source of Issue', 'Test sequence', 'Current Issue appearances in module')

# Write the headers in bold with a larger font size
format1 = workbook.add_format({'bold': True, 'font_color': 'black'})
format1.set_font_size(14)
format1.set_border()
row = col = 0
for item in headers:
    worksheet.write(row, col, item, format1)
    col += 1
workbook.close()
I agree with @dantechguy that csv is probably easier (and more lightweight) than writing a real xlsx file, but if you want to stick to the Excel format, the code below will work. Also, based on the code you've provided, you don't need to import openpyxl, jira, pandas or numpy.
The regex here matches full paths with any drive letter A-Z, followed by "type:warning". If you don't need to check for the warning and simply want to get every path in the file, you can delete everything in the regex after \S+. And if you know you'll only ever want drives C and E, just change A-Z to CE.
import re

warningPathRegex = r"[A-Z]:\\\S+(?=\s*type:warning)"
compilerWarningFile = r"D:\Work\Python\CompilerWarnings\Python_CompilerWarnings\CompilerWarningsAllProtocol.txt"

warningPaths = []
with open(compilerWarningFile, 'r') as f:
    fullWarningFile = f.read()
    warningPaths = re.findall(warningPathRegex, fullWarningFile)

# ... open Excel file, then before workbook.close():
pathColumn = 1  # Affected Item
for num, warningPath in enumerate(warningPaths):
    worksheet.write(num + 1, pathColumn, warningPath)  # num + 1 to skip header row

Updating excel sheet with Pandas without overwriting the file

I am trying to update an Excel sheet with Python code. I read a specific cell and update it accordingly, but Pandas overwrites the entire Excel sheet, so I lose the other pages as well as the formatting. Can anyone tell me how I can avoid this?
Record = pd.read_excel("Myfile.xlsx", sheet_name='Sheet1', index_col=False)
Record.loc[1, 'WORDS'] = int(self.New_Word_box.get())
Record.loc[1, 'STATUS'] = self.Stat.get()
Record.to_excel("Myfile.xlsx", sheet_name='Student_Data', index=False)
My code is above; as you can see, I only want to update a few cells, but it overwrites the entire Excel file. I tried to search for an answer but couldn't find a specific one.
Appreciate your help.
Update: Added more clarifications
Steps:
1) Read the sheet that needs changes into a dataframe and make the changes in that dataframe.
2) Now the changes are reflected in the dataframe but not in the sheet. Use the following function with the dataframe from step 1 and the name of the sheet to be modified. Use the truncate_sheet param to completely replace the sheet of concern.
The function call would be like so:
append_df_to_excel(filename, df, sheet_name, startrow=0, truncate_sheet=True)
from openpyxl import load_workbook
import pandas as pd

def append_df_to_excel(filename, df, sheet_name="Sheet1", startrow=None,
                       truncate_sheet=False,
                       **to_excel_kwargs):
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    Parameters:
      filename : File path or existing ExcelWriter
                 (Example: "/path/to/file.xlsx")
      df : dataframe to save to workbook
      sheet_name : Name of sheet which will contain DataFrame.
                   (default: "Sheet1")
      startrow : upper left cell row to dump data frame.
                 Per default (startrow=None) calculate the last row
                 in the existing DF and write to the next row...
      truncate_sheet : truncate (remove and recreate) [sheet_name]
                       before writing DataFrame to Excel file
      to_excel_kwargs : arguments which will be passed to `DataFrame.to_excel()`
                        [can be a dictionary]

    Returns: None
    """
    # ignore [engine] parameter if it was passed
    if "engine" in to_excel_kwargs:
        to_excel_kwargs.pop("engine")

    writer = pd.ExcelWriter(filename, engine="openpyxl")

    # Python 2.x: define [FileNotFoundError] exception if it doesn't exist
    try:
        FileNotFoundError
    except NameError:
        FileNotFoundError = IOError

    if "index" not in to_excel_kwargs:
        to_excel_kwargs["index"] = False

    try:
        # try to open an existing workbook
        if "header" not in to_excel_kwargs:
            to_excel_kwargs["header"] = True
        writer.book = load_workbook(filename)

        # get the last row in the existing Excel sheet
        # if it was not specified explicitly
        if startrow is None and sheet_name in writer.book.sheetnames:
            startrow = writer.book[sheet_name].max_row
            to_excel_kwargs["header"] = False

        # truncate sheet
        if truncate_sheet and sheet_name in writer.book.sheetnames:
            # index of [sheet_name] sheet
            idx = writer.book.sheetnames.index(sheet_name)
            # remove [sheet_name]
            writer.book.remove(writer.book.worksheets[idx])
            # create an empty sheet [sheet_name] using old index
            writer.book.create_sheet(sheet_name, idx)

        # copy existing sheets
        writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
    except FileNotFoundError:
        # file does not exist yet, we will create it
        to_excel_kwargs["header"] = True
        if startrow is None:
            startrow = 0

    # write out the new sheet
    df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()
Note that we can't swap the openpyxl engine for xlsxwriter here when writing the Excel file, as asked in the comments; see reference 2.
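For completeness, a minimal usage sketch applied to the question above (the file, sheet, and column names come from the question; the new cell values are placeholders):
import pandas as pd

# Read the sheet, change the cells, then replace just that sheet in place
Record = pd.read_excel("Myfile.xlsx", sheet_name='Sheet1', index_col=False)
Record.loc[1, 'WORDS'] = 42       # placeholder value
Record.loc[1, 'STATUS'] = 'Done'  # placeholder value
append_df_to_excel("Myfile.xlsx", Record, sheet_name='Sheet1',
                   startrow=0, truncate_sheet=True)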
References:
1) https://stackoverflow.com/a/38075046/6741053
2) xlsxwriter: is there a way to open an existing worksheet in my workbook?

Data from a table getting printed to csv in a single line

I've written a script to parse data from the first table of a website. I've used XPath to parse the table. By the way, I didn't use the "tr" tag because even without it I can still see the results in the console when printed. When I run my script, the data gets scraped but is written to a single line in the csv file. I can't find the mistake I'm making. Any input on this will be highly appreciated. Here is what I've tried:
import csv
import requests
from lxml import html

url = "https://fantasy.premierleague.com/player-list/"
response = requests.get(url).text

outfile = open('Data_tab.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Player", "Team", "Points", "Cost"])

tree = html.fromstring(response)
for titles in tree.xpath("//table[@class='ism-table']")[0]:
    # tab_r = titles.xpath('.//tr/text()')
    tab_d = titles.xpath('.//td/text()')
    writer.writerow(tab_d)
You might want to add a level of looping, examining each table row in turn.
Try this:
for titles in tree.xpath("//table[@class='ism-table']")[0]:
    for row in titles.xpath('./tr'):
        tab_d = row.xpath('./td/text()')
        writer.writerow(tab_d)
Or, perhaps this:
table = tree.xpath("//table[@class='ism-table']")[0]
for row in table.xpath('.//tr'):
    items = row.xpath('./td/text()')
    writer.writerow(items)
Or you could have the first XPath expression find the rows for you:
rows = tree.xpath("(.//table[@class='ism-table'])[1]//tr")
for row in rows:
    items = row.xpath('./td/text()')
    writer.writerow(items)
