Custom Filepath Exporting a Pandas Dataframe - python-3.x

I am working with financial data, and I am cleaning the data in Python before exporting it as a CSV. I want this file to be reused, so I want to make sure that the exported files are not overwritten. I am including this piece of code to help with this:
# Fill this out; this will help identify the dataset after it is exported
latestFY = '21'
earliestFY = '19'
I want the user to change the earliest and latest fiscal year variables to reflect the data they are working with, so when the data is exported, it is called financialData_FY19_FY21, for example. How can I do this using the to_csv function?
Here is what I currently have:
mergedDF.to_csv("merged_financial_data_FY.csv", index = False)
Here is what I want the file path to look like: financialData_FY19_FY21 where the 19 and 21 can be changed based on the input above.

You can use an f-string to build the string that will be your file path:
latestFY = '21'
earliestFY = '19'
filename = f"financialData_FY{earliestFY}_FY{latestFY}.csv"
mergedDF.to_csv(filename, index=False)
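Since the stated goal is to make sure earlier exports are not overwritten, a guard before writing can help (a sketch; `mergedDF` here is a stand-in dataframe, and the filename follows the format requested in the question):

```python
import os
import pandas as pd

# Stand-in for the cleaned financial data
mergedDF = pd.DataFrame({'FY': ['19', '20', '21'], 'revenue': [100, 110, 125]})

latestFY = '21'
earliestFY = '19'
filename = f"financialData_FY{earliestFY}_FY{latestFY}.csv"

# Refuse to clobber an existing export with the same FY range
if os.path.exists(filename):
    raise FileExistsError(f"{filename} already exists; pick different FY values")
mergedDF.to_csv(filename, index=False)
```

This way a repeated run with the same fiscal-year inputs fails loudly instead of silently replacing the earlier file.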


Appending data from multiple excel files into a single excel file without overwriting using python pandas

Here is my current code below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) excel files. I am trying to pull this information out of all these files to compile into a single new file appending to that file each time. I'm going to manually clean up the destination file for the time being as I will improve this script going forward.
What I currently have works fine for a single sheet but I overwrite my destination every time I add a new file to the read in list.
I've tried adding the mode = 'a' and a couple different ways to concat at the end of my function.
import pandas as pd

def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows=20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks in advance for any help; I can't seem to find an answer to this exact situation on here. Cheers.
Ideally you want to loop through the files and read the data into a list, then concatenate the individual dataframes, and then write the combined dataframe once. This assumes the data being pulled is the same size/shape and the sheet name is the same in every file. If the sheet name changes per file, look into the zip() function to pair each filename with its sheet name.
This should get you started:
import pandas as pd

names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
sheet_name = 'specific_sheet_name'

# read all files first
df_hold_list = []
for name in names:
    df = pd.read_excel(name, sheet_name=sheet_name, nrows=20)
    df_hold_list.append(df)

# concatenate dfs; axis=0 stacks them vertically (appending rows),
# axis=1 would join them horizontally
df1 = pd.concat(df_hold_list, axis=0)

# write the combined data to the new file in one go
df1.to_excel(destination, index=False)
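If the sheet name does differ per file, the zip() idea above can be sketched like this (the sheet names are hypothetical, and the two tiny workbooks are generated so the snippet is self-contained; reading/writing .xlsx requires openpyxl):

```python
import pandas as pd

# Build two small example workbooks; in practice these are the ~30 existing files.
pd.DataFrame({'a': [1, 2]}).to_excel('sheet1.xlsx', sheet_name='Q1', index=False)
pd.DataFrame({'a': [3, 4]}).to_excel('sheet2.xlsx', sheet_name='Q2', index=False)

names = ['sheet1.xlsx', 'sheet2.xlsx']
sheets = ['Q1', 'Q2']  # hypothetical per-file sheet names

# zip() pairs each file with its own sheet so both can vary together
df_hold_list = [pd.read_excel(f, sheet_name=s) for f, s in zip(names, sheets)]
combined = pd.concat(df_hold_list, ignore_index=True)
print(combined['a'].tolist())  # [1, 2, 3, 4]
```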

Reading excel files in python using dynamic dates in the file and sheet name

I have a situation in which I want to read an Excel file on a daily basis, where the file name follows the format file_name 08.20.2018 xyz.xlsx and the date portion changes every day.
The same applies to the sheet: when the file is read, I need to extract the data from a sheet whose naming convention also changes daily with the date. An example sheet name is sheet1-08.20.2020-data
How should I achieve this? I am using the following code, but it does not work:
df = pd.read_excel(r'file_name 08.20.2018 xyz.xlsx', sheet_name='sheet1-08.20.2020-data')
How do I update this line of code so that it picks up the data dynamically as new dates come in? To be clear, the date advances daily with no gaps.
You could use pathlib and the datetime module to automate the process:
import pandas as pd
from pathlib import Path
from datetime import date

# assuming you have a directory of files:
folder = Path("directory of files")
today = date.today().strftime("%m.%d.%Y")  # zero-padded, e.g. 08.20.2020
sheetname = f"sheet1-{today}-data"
date_string = f"file_name {today}*.xlsx"   # wildcard covers the trailing 'xyz'
xlsx_file = folder.glob(date_string)

# read in data
df = pd.read_excel(io=next(xlsx_file), sheet_name=sheetname)
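To see the pattern being built, a fixed date can be substituted for date.today() (a sketch using the question's 08.20.2020 example; strftime keeps month and day zero-padded for any date):

```python
from datetime import date

d = date(2020, 8, 20)
stamp = d.strftime("%m.%d.%Y")          # month.day.year, zero-padded
sheetname = f"sheet1-{stamp}-data"
filename_pattern = f"file_name {stamp}*.xlsx"
print(sheetname)         # sheet1-08.20.2020-data
print(filename_pattern)  # file_name 08.20.2020*.xlsx
```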

Pandas Copy Values from Rows to other files without disturbing the existing data

I have 20 csv files pertaining to different individuals.
And I have a Main csv file, which is based on the final row values in specific columns. Below are the sample for both kinds of files.
All Individual Files look like this:
alex.csv
name,day,calls,closed,commision($)
alex,25-05-2019,68,6,15
alex,27-05-2019,71,8,20
alex,28-05-2019,65,7,17.5
alex,29-05-2019,68,8,20
stacy.csv
name,day,calls,closed,commision($)
stacy,25-05-2019,82,16,56.00
stacy,27-05-2019,76,13,45.50
stacy,28-05-2019,80,19,66.50
stacy,29-05-2019,79,18,63.00
But the Main File(single day report), which is the output file, looks like this:
name,day,designation,calls,weekly_avg_calls,closed,commision($)
alex,29-05-2019,rep,68,67,8,20
stacy,29-05-2019,sme,79,81,18,63
madhu,29-05-2019,rep,74,77,16,56
gabrielle,29-05-2019,rep,59,61,6,15
I need to copy the values from the columns (calls, closed, commision($)) of the last line of each file for the end-of-day report, and then populate them into the Main File (a template that already has some columns filled in, like name, day, designation).
How can I write a for or while loop that does this for all the csv files in the "Employee_performance_DB" list?
Employee_performance_DB = ['alex.csv', 'stacy.csv', 'poduzav.csv', 'ankit.csv', ..., 'gabrielle.csv']
for employee_db in Employee_performance_DB:
    read_object = pd.read_csv(employee_db)
    read_object2 = read_object.tail(1)
    read_object2.to_csv("Main_Report.csv", header=False, index=False,
                        columns=["calls", "closed", "commision($)"], mode='a')
How can I copy the values of {calls, closed, commision($)} from the 'Employee_performance_DB' list of files into the exact columns in 'Main_Report.csv' for those exact employees?
Well, as I had no answers for this, it took a while for me to find a solution.
The code below fixed my issue...
# Created a list of all the files in "employees_list"
employees_list = ['alex.csv', ......, 'stacy.csv']
for employees in employees_list:
    read_object = pd.read_csv(employees)
    read_object2 = read_object.tail(1)
    read_object2.to_csv("Employee_performance_DB.csv", index=False, mode='a', header=False)
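For the original goal of landing each value in the right employee's row of the main file, a name-based merge is one option (a sketch; the two generated csv files stand in for the real per-employee files):

```python
import pandas as pd

# Tiny stand-in files so the sketch runs end to end
pd.DataFrame({'name': ['alex', 'alex'], 'day': ['28-05-2019', '29-05-2019'],
              'calls': [65, 68], 'closed': [7, 8],
              'commision($)': [17.5, 20.0]}).to_csv('alex.csv', index=False)
pd.DataFrame({'name': ['stacy', 'stacy'], 'day': ['28-05-2019', '29-05-2019'],
              'calls': [80, 79], 'closed': [19, 18],
              'commision($)': [66.5, 63.0]}).to_csv('stacy.csv', index=False)

# Template with the pre-filled columns
main = pd.DataFrame({'name': ['alex', 'stacy'],
                     'day': ['29-05-2019', '29-05-2019'],
                     'designation': ['rep', 'sme']})

# Take the last row of each employee file and stack them
last_rows = pd.concat([pd.read_csv(f).tail(1) for f in ['alex.csv', 'stacy.csv']])

# Merge on name so each value lands in the right employee's row
report = main.merge(last_rows[['name', 'calls', 'closed', 'commision($)']], on='name')
report.to_csv('Main_Report.csv', index=False)
print(report['calls'].tolist())  # [68, 79]
```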

Add column and values to CSV or Dataframe

Brand new to Python and programming. I have a function that extracts a file creation date from .csv files (the date is included the file naming convention):
def get_filename_dates(self):
    """Extract date from filename and place it into a list"""
    for filename in self.file_list:
        try:
            date = re.search("([0-9]{2}[0-9]{2}[0-9]{2})",
                             filename).group(0)
            self.file_dates.append(date)
            self.file_dates.sort()
        except AttributeError:
            print("The following files have naming issues that prevented "
                  "date extraction:")
            print(f"\t{filename}")
    return self.file_dates
The data within these files are brought into a DataFrame:
def create_df(self):
    """Create DataFrame from list of files"""
    for i in range(0, len(self.file_dates)):
        self.agg_data = pd.read_csv(self.file_list[i])
        self.agg_data.insert(9, 'trade_date', self.file_dates[i],
                             allow_duplicates=False)
    return self.agg_data
As each file in file_list is worked with, I need to insert its corresponding date into a new column (trade_date).
As written here, the value of the last index in the list returned by get_filename_dates() is duplicated into every row of the trade_date column -- presumably because read_csv() opens and closes each file before the next line.
My questions:
Is there an advantage to inserting data into the csv file using with open() vs. trying to match each file and corresponding date while iterating through files to create the DataFrame?
If there is no advantage to with open(), is there a different Pandas method that would allow me to manipulate the data as the DataFrame is created? In addition to the data insertion, there's other clean-up that I need to do. As it stands, I wrote a separate function for the clean-up; it's not complex and would be great to run everything in this one function, if possible.
Hope this makes sense -- thank you
You could grab each csv as an intermediate dataframe, do whatever cleaning you need to do, and use pd.concat() to concatenate them all together as you go. Something like this:
def create_df(self):
    """Create DataFrame from list of files"""
    self.agg_data = pd.DataFrame()
    for i, date in enumerate(self.file_dates):
        df_part = pd.read_csv(self.file_list[i])
        df_part['trade_date'] = date
        # --- Any other individual file level cleanup here ---
        self.agg_data = pd.concat([self.agg_data, df_part], axis=0)
    # --- Any aggregate-level cleanup here ---
    return self.agg_data
It makes sense to do as much of the preprocessing/cleanup as possible at the aggregate level.
I also took the liberty of converting the for-loop to use the more pythonic enumerate.
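As a variant, the pieces can be collected in a list and concatenated once after the loop, which avoids re-copying the accumulated dataframe on every iteration (sketched here as a standalone snippet; the two files and dates are illustrative stand-ins for self.file_list and self.file_dates):

```python
import pandas as pd

# Illustrative stand-ins for self.file_list / self.file_dates
file_list = ['a.csv', 'b.csv']
file_dates = ['200101', '200102']
pd.DataFrame({'x': [1]}).to_csv('a.csv', index=False)
pd.DataFrame({'x': [2]}).to_csv('b.csv', index=False)

parts = []
for fname, date in zip(file_list, file_dates):
    df_part = pd.read_csv(fname)
    df_part['trade_date'] = date   # tag each file's rows with its date
    parts.append(df_part)

# Single concat at the end instead of one per iteration
agg_data = pd.concat(parts, ignore_index=True)
print(agg_data['trade_date'].tolist())  # ['200101', '200102']
```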

Removing levels from data frame read from CSV file - R

I tried loading the baseball statistics from this link. When I read it from the file using
data <- read.csv("MLB2011.csv")
it seems to be reading all fields as factor values. I tried dropping those factor values by doing:
read.csv("MLB2011.xls", as.is = FALSE)
...but it looks like the values are still being read as factors. What can I do to have them loaded as simple character values and not factors?
You aren't reading a csv file; it is an Excel spreadsheet (.xls format). It contains two worksheets, bat2011 and pitch2011.
You could use the XLConnect library to read it:
library(XLConnect)
# load the work book (connect to the file)
wb <- loadWorkbook("MLB2011.xls")
# read in the data from the bat2011 sheet
bat2011 <- readWorksheet(wb, sheet = 'bat2011')
readWorksheet has an argument colType which you could use to specify the column types.
Edit
If you have already saved the sheets as csv files then
as.is = TRUE or stringsAsFactors = FALSE will be the correct argument values
