I'm trying to write some code that begins by pulling data from an Excel workbook that is dropped daily into a folder or pulled from an email attachment.
The workbook follows a naming convention like this: Workbook 20190821 (tomorrow's workbook will be Workbook 20190822). I would like to make this process as touch-free as possible, so is there a way to call pandas.read_excel() (or some other function) that can handle rolling dates?
Barring some built-in method already available in Python, I wonder if a for loop that increments by business day and then saves the result as the path name would work?
This script searches the folder path for today's workbook.
Is this what you are looking for?
import datetime
import pandas as pd
import os
# File folder
file_path = 'your file path for the file'
# Get current date
num = datetime.datetime.today().strftime("%Y%m%d")
# Current date's workbook name
workbook_name = 'Workbook ' + num
# Complete path
path = os.path.join(file_path, workbook_name)
# Read workbook into a pandas DataFrame (use the joined path, and include the file extension if the daily file has one, e.g. '.xlsx')
df = pd.read_excel(path)
If you want to find the newest file (if the daily file has not arrived), check this: How to get the latest file in a folder using python
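As a rough sketch of that fallback (assuming all the daily files land in one folder), you could pick the most recently modified workbook when today's file has not arrived yet; the function name and glob pattern here are illustrative:

```python
import glob
import os

def newest_file(folder, pattern='*.xlsx'):
    """Return the most recently modified file in `folder` matching
    `pattern`, or None if no file matches."""
    candidates = glob.glob(os.path.join(folder, pattern))
    if not candidates:
        return None
    # os.path.getmtime gives the last-modification timestamp
    return max(candidates, key=os.path.getmtime)
```

You would try the exact name built from today's date first, and fall back to newest_file() only when that path does not exist.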
I have an Excel sheet stored in S3, and I want to read its sheet names.
I have read the Excel sheet with AWS Wrangler using awswrangler.s3.read_excel(path).
How can I read the sheet names with AWS Wrangler in Python?
According to the awswrangler docs for the read_excel() function:
This function accepts any Pandas’s read_excel() argument.
And in pandas:
sheet_name : str, int, list, or None, default 0
so you could try something like this:
import awswrangler as wr
wr.s3.read_excel(file_uri, sheet_name=your_sheet)
I am currently facing a similar problem in AWS Glue, but did not manage to get it working yet.
I'm not sure you can in Wrangler, or at least I haven't been able to figure it out. You can use Wrangler to download the sheet to a temporary file, then use pyxlsb/openpyxl (using both to cover all formats):
from openpyxl import load_workbook
from pyxlsb import open_workbook
import awswrangler as wr
import os
import pandas as pd
s3_src = 's3://bucket/folder/workbook.xlsb'
filename = os.path.basename(s3_src)
wr.s3.download(path=s3_src, local_file=filename)
if filename.endswith('.xlsb'):
    workbook = open_workbook(filename)
    sheets = workbook.sheets
else:
    workbook = load_workbook(filename)
    sheets = workbook.sheetnames
# Load all sheets into a list of dataframes
dfs = [pd.read_excel(filename, sheet_name=s) for s in sheets]
# Or now that you have the sheet names, load using Wrangler
dfs = [wr.s3.read_excel(s3_src, sheet_name=s) for s in sheets]
You could extract the names of the sheets & pass them as inputs to another process that does the extraction.
There is a list of Excel files in a directory. The input is a list of sheet names that have to be converted to PDF. So my code has to open each Excel file, look for that particular sheet, and convert that one sheet to PDF. Can anybody suggest which library to use and an approach for this? How can I use a variable that holds a list of all the required sheet names from all the Excel files as an argument to open the required sheets? Thank you.
INPUT: file1.xls file2.xls file3.xls
sheets in file1: Title, Contents, Summary
sheets in file2: Title, Contents, Summary
sheets in file3: Title, Contents, Summary
Required sheet in file1: Title
Required sheet in file2: Contents
Required sheet in file3: Summary
OUTPUT:
file1_Title.pdf
file2_Contents.pdf
file3_Summary.pdf
Approach: I have a Python list of all the sheets in each Excel file, and a Python list containing the required sheet to be converted.
import xlrd
book = xlrd.open_workbook(PathforInputFile)
AllSheets = book.sheet_names()
RequiredSheet = line.split("\t")
Code Output:
['Title', 'Contents', 'Summary']
['Title']
['Title', 'Contents', 'Summary']
['Contents']
['Title', 'Contents', 'Summary']
['Summary']
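Before touching any conversion library, the file-to-sheet pairing itself can be done in plain Python. A minimal sketch (the file and sheet names are taken from the example above; the actual PDF conversion step is left to whichever library you choose, and the helper name is illustrative):

```python
def pdf_names(files, required_sheets):
    """Pair each Excel file with its required sheet and build the output
    PDF name, e.g. 'file1.xls' + 'Title' -> 'file1_Title.pdf'."""
    outputs = []
    for fname, sheet in zip(files, required_sheets):
        base = fname.rsplit('.', 1)[0]  # strip the extension
        outputs.append(f"{base}_{sheet}.pdf")
    return outputs

print(pdf_names(['file1.xls', 'file2.xls', 'file3.xls'],
                ['Title', 'Contents', 'Summary']))
# -> ['file1_Title.pdf', 'file2_Contents.pdf', 'file3_Summary.pdf']
```

The conversion loop then only has to iterate over these pairs and hand each (file, sheet) to the library's save-as-PDF call.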
Openpyxl and aspose-cells seem to be the most relevant, or at least the best general Excel options, that I could find.
This is an article I found: https://blog.aspose.com/2021/04/02/convert-excel-files-to-pdf-in-python/
But, I would also recommend going to the documentation of the two libraries I suggested. I think they could get you on the right track.
For going through a directory of files, use glob (the standard-library glob works here; glob2 is only needed for its extras):
import glob
import os
dir_path = 'root directory path'  # the folder, without a trailing filename
for f_csv in glob.iglob(os.path.join(dir_path, '*.csv')):  # '*.csv' can be changed to the extension of choice, e.g. '*.xlsx'
    # run your ops here, once per file
    pass
Then you can build on that framework so you aren't repeating the same code for every file of the same type. I used openpyxl and pandas, but once you have the worksheet open (with sheet index 0 in xlrd) you can pick up right where I left off:
import pandas as pd
from openpyxl import load_workbook
dir_path = 'root directory path'
for f_csv in glob.iglob(os.path.join(dir_path, '*.xlsx')):
    wb = load_workbook(f_csv)
    # Access a worksheet named 'no_header'
    ws = wb['no_header']
    # Convert to DataFrame
    df = pd.DataFrame(ws.values)
Now the last part can be done differently, but I like to convert the sheet into pandas, then use df.to_html() to get it onto a website for download.
df.to_html(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, justify=None, max_rows=None, max_cols=None, show_dimensions=False, decimal='.', bold_rows=True, classes=None, escape=True, notebook=False, border=None, table_id=None, render_links=False, encoding=None)
I would read the docs on pandas.DataFrame.to_html() if the args don't make sense or you want to customize the output.
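For instance, a minimal call with the defaults (assuming pandas is installed; the part numbers below are made-up sample data) just returns an HTML table string you can serve for download:

```python
import pandas as pd

# Small sample frame standing in for a parsed worksheet
df = pd.DataFrame({'part': ['A-100', 'B-200'], 'qty': [3, 7]})

# to_html with no arguments uses all the defaults shown above;
# index=False drops the row-number column from the output
html = df.to_html(index=False)
print(html)
```

Every keyword in the long signature above keeps its default unless you override it, so in practice you only pass the handful you care about.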
I have a situation in which I want to read an Excel file on a daily basis, where the file name is written in the following format: file_name 08.20.2018 xyz.xlsx and gets updated daily, with the date changing each day.
Likewise, when this file is read, I need to extract the data from a sheet whose naming convention also changes daily with the date. An example sheet name is sheet1-08.20.2020-data.
How should I achieve this? I am using the following code, but it does not handle the changing dates:
df = pd.read_excel(r'file_name 08.20.2018 xyz.xlsx', sheet_name='sheet1-08.20.2020-data')
How do I update this line of code so that it picks up the data dynamically as new dates come in? To be clear, the date increments daily with no gaps.
You could use pathlib and the datetime module to automate the process (strftime zero-pads the month and day, which manual f-string formatting gets wrong once the month or day has two digits):
from pathlib import Path
from datetime import date
import pandas as pd
# assuming you have a directory of files:
folder = Path('directory of files')
stamp = date.today().strftime('%m.%d.%Y')  # e.g. '08.20.2020'
sheetname = f"sheet1-{stamp}-data"
date_string = f"file_name {stamp} xyz.xlsx"
xlsx_file = folder.glob(date_string)
# read in the data
df = pd.read_excel(io=next(xlsx_file), sheet_name=sheetname)
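A quick way to sanity-check the naming logic is to build both names with strftime for a fixed date; this reproduces the convention from the question (the 'xyz' suffix is copied from the example name and may differ in your files):

```python
from datetime import date

def daily_names(d):
    """Build the workbook file name and sheet name for a given date."""
    stamp = d.strftime('%m.%d.%Y')  # zero-padded month and day
    return f"file_name {stamp} xyz.xlsx", f"sheet1-{stamp}-data"

print(daily_names(date(2020, 8, 20)))
# -> ('file_name 08.20.2020 xyz.xlsx', 'sheet1-08.20.2020-data')
```

Feeding date.today() into the same helper gives you the current day's names, so the read_excel call never needs to be edited by hand.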
I am trying to pull some data from a stock market API and save it in different Excel files. Every stock trade process has different timeframes, like 1m, 3m, 5m, 15m and so on.
I want to create an Excel file for each stock and a separate sheet for each timeframe.
My code creates an Excel file for a stock (symbol), adds sheets to it (1m, 3m, 5m, ...), saves the file, and then pulls the data from the stock market API and saves it into the correct sheet. For example, for ETH/BTC it creates the file and sheets, pulls the "1m" data and saves it into the "1m" sheet.
The code creates the file and sheets; I tested that.
The problem is that after the dataframe is written to the Excel file, all the other sheets are deleted. I tried to pull all the data for each symbol, but when I opened the Excel file only the last timeframe (1w) had been written and all the other sheets were gone. So please help.
I checked other questions but didn't find the same problem. In the last part I am not trying to add a new sheet; I am trying to save the df to an existing sheet.
# get_bars function pulls the data
def get_bars(symbol, interval):
    .
    .
    .
    return df

...

timeseries = ['1m','3m','5m','15m','30m','1h','2h','4h','6h','12h','1d','1w']
import xlsxwriter
from pandas import ExcelWriter
from openpyxl import load_workbook

for symbol in symbols:
    file = ('C:/Users/mi/Desktop/Kripto/' + symbol + '.xlsx')
    # create the (empty) workbook file
    workbook = xlsxwriter.Workbook(file)
    workbook.close()
    # add one sheet per timeframe
    wb = load_workbook(file)
    for x in range(len(timeseries)):
        ws = wb.create_sheet(timeseries[x])
    print(wb.sheetnames)
    wb.save(file)

    # pull the data and write it -- this is the step that wipes the other sheets
    xrpusdt = get_bars(symbol, interval='1m')
    writer = pd.ExcelWriter(file, engine='xlsxwriter')
    xrpusdt.to_excel(writer, sheet_name='1m')
    writer.save()
I think instead of assigning the ExcelWriter to a variable, you need to use it in a with statement, in append mode, since you have already created the Excel file using xlsxwriter, like below:
for x in range(len(timeseries)):
    xrpusdt = get_bars(symbol, interval=timeseries[x])
    with pd.ExcelWriter(file, engine='openpyxl', mode='a') as writer:
        xrpusdt.to_excel(writer, sheet_name=timeseries[x])
And in your code above, you were using a static interval ("1m") when building xrpusdt; this code replaces it with the loop variable.
Resources:
Pandas ExcelWriter: here you can see the use-case of append mode https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html#pandas.ExcelWriter
Pandas df.to_excel: here you can see how to write to more than one sheet
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
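A minimal sketch of the append pattern (assuming pandas and openpyxl are installed; the file path and sample frames are made up for illustration): the first write creates the file, every later write appends, so earlier sheets survive.

```python
import pandas as pd

path = 'multi_sheet.xlsx'  # hypothetical output path
frames = {'1m': pd.DataFrame({'close': [1.0, 1.1]}),
          '3m': pd.DataFrame({'close': [1.2, 1.3]})}

first = True
for name, frame in frames.items():
    # 'w' creates the workbook; 'a' appends a new sheet to it
    mode = 'w' if first else 'a'
    with pd.ExcelWriter(path, engine='openpyxl', mode=mode) as writer:
        frame.to_excel(writer, sheet_name=name)
    first = False

print(pd.ExcelFile(path).sheet_names)  # both sheets are present
```

Opening the writer once per sheet is what the answer above suggests; opening a single writer and calling to_excel on it several times before closing works as well.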
I have downloaded a module called openpyxl, which I intend to use to extract data (part numbers) from our many Excel files and then write them into a single file. I have not used Python much and am wondering how I could alter the following code so that the script opens a spreadsheet, runs some code on it, and then moves on to the next one. If it were a list or a string I could write a for loop for it, but I don't know how this would be done for actual spreadsheets.
Can anyone offer any advice on how to loop through documents like this?
from openpyxl import load_workbook
>>> wb2 = load_workbook('test.xlsx')
>>> print wb2.get_sheet_names()
['Sheet2', 'New Title', 'Sheet1']
You could get a list of the spreadsheets with os.listdir() and then extract the data using a for loop, like so:
import os
from openpyxl import load_workbook

path = "path/to/folder"  # The folder containing the spreadsheets
sheets = os.listdir(path)
for sheet in sheets:
    wb2 = load_workbook(os.path.join(path, sheet))
    print(wb2.sheetnames)  # get_sheet_names() is deprecated in recent openpyxl
    wb2.close()  # public close() instead of the private _archive.close()
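One caveat: os.listdir() returns everything in the folder, including non-Excel files, which would make load_workbook fail on strays. A small standard-library sketch of filtering first (the helper name is illustrative):

```python
import glob
import os

def excel_files(folder):
    """Return the .xlsx files in `folder`, sorted for a stable order."""
    return sorted(glob.glob(os.path.join(folder, '*.xlsx')))
```

The loop above then becomes `for path in excel_files(folder): wb = load_workbook(path)`, and text files or temp files in the same folder are simply skipped.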