Update Excel files via Python - python-3.x

I want to read an Excel file, sort the rows, remove duplicate rows, and re-save the file.
To do that, I have written this script:
import pandas as pd
data = pd.ExcelFile('FILE_NAME.xlsx')
df = data.parse('data')
df.sort_index()
df.drop_duplicates(subset = 'MAKAT', keep='first', inplace=False)
data.close()
print(pd.read_excel(data))
print('**** DONE ****')
In the result, I see the rows on the screen, but the file still contains the duplicated rows.
My question is: how do I save these changes back to the same file?

Change the two lines as below, assigning the results back (these methods return a new DataFrame unless you pass inplace=True):
df = df.sort_index()
df = df.drop_duplicates(subset='MAKAT', keep='first').sort_values(by=['MAKAT'])
df.to_csv('outputfile.csv')
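Since drop_duplicates and sort_values return new DataFrames rather than modifying in place, assigning the result back is the key fix. A small self-contained sketch (the column name 'MAKAT' is taken from the question; the sample values are made up):

```python
import pandas as pd

# Tiny in-memory stand-in for the spreadsheet, with a duplicated MAKAT value.
df = pd.DataFrame({'MAKAT': [102, 101, 101], 'QTY': [5, 3, 7]})

# Keep the first occurrence of each MAKAT, then sort by it;
# assign the returned frame back to df.
df = df.drop_duplicates(subset='MAKAT', keep='first').sort_values(by=['MAKAT'])

print(df['MAKAT'].tolist())  # [101, 102]
```

The same assignment pattern applies before writing the file out with to_csv or to_excel.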

Related

Reading multiple excel files into a pandas dataframe, but also storing the file name

I would like to read multiple Excel files and store them in a single pandas DataFrame, but I would like one of the columns in the DataFrame to be the file name, because the file name contains the date (this is monthly data) and I need that information. I can't seem to get the file name, but I am able to get the Excel files into a DataFrame. Please help.
import os
import pandas as pd
import fsspec
files = os.listdir("C://Users//6J2754897//Downloads//monthlydata")
paths = "C://Users//6J2754897//Downloads//monthlydata"
a = pd.DataFrame([2], index = None)
df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        df = df.append(pd.read_excel(paths + "//" + files[file], sheet_name = "information", skiprows=7), ignore_index=True)
        df['Month'] = str(files[file])
The order of operations here is incorrect. The line:
df['Month'] = str(files[file])
is going to overwrite the entire column with the most recent value.
Instead we should only add the value to the current DataFrame:
import os
import pandas as pd

paths = "C://Users//6J2754897//Downloads//monthlydata"
files = os.listdir(paths)
df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        # Read in the file
        file_df = pd.read_excel(paths + "//" + files[file],
                                sheet_name="information",
                                skiprows=7)
        # Add the month to just this DataFrame
        file_df['Month'] = str(files[file])
        # Update `df`
        df = df.append(file_df, ignore_index=True)
Alternatively we can use DataFrame.assign to chain the column assignment:
import os
import pandas as pd

paths = "C://Users//6J2754897//Downloads//monthlydata"
files = os.listdir(paths)
df = pd.DataFrame()
for file in range(len(files)):
    if files[file].endswith('.xlsx'):
        df = df.append(
            # Read in the file
            pd.read_excel(paths + "//" + files[file],
                          sheet_name="information",
                          skiprows=7)
            .assign(Month=str(files[file])),  # Add the month to just this DataFrame
            ignore_index=True
        )
For general overall improvements, we can use pd.concat with a list comprehension over the files. This avoids growing the DataFrame inside the loop (which can be extremely slow). Path.glob also helps with selecting the appropriate files:
from pathlib import Path
import pandas as pd

paths = "C://Users//6J2754897//Downloads//monthlydata"
df = pd.concat([
    pd.read_excel(file,
                  sheet_name="information",
                  skiprows=7)
    .assign(Month=file.stem)  # We may also want file.name here
    for file in Path(paths).glob('*.xlsx')
])
Some options for the Month column are:
file.stem gives "[t]he final path component, without its suffix":
'folder/folder/sample.xlsx' -> 'sample'
file.name gives "the final path component, excluding the drive and root":
'folder/folder/sample.xlsx' -> 'sample.xlsx'
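To illustrate the difference between the two options (using a throwaway path, not one of the actual data files):

```python
from pathlib import Path

# A sample path; only the final component matters for stem/name.
p = Path('folder/folder/sample.xlsx')

print(p.stem)  # sample
print(p.name)  # sample.xlsx
```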

How to edit columns in .CSV files using pandas

import urllib.request
import pandas as pd
# URL of the CSV file
url = 'https://......CSV'
# Download the file
urllib.request.urlretrieve(url, "F:\.....A.CSV")
csvFilePath = "F:\.....A.CSV"
df = pd.read_csv(csvFilePath, sep='\t')
rows = [0, 1, 2, 3]
df2 = df.drop(rows, axis=0, inplace=True)
df.to_csv(r'F:\....New_A.CSV')
I tried doing this in code, but it makes the columns merge into a single column. What I want to do is remove the top rows, as shown in the picture.
I found the problem: change sep='\t' to sep=','.
Replace:
df = pd.read_csv(csvFilePath, sep='\t')
with:
df = pd.read_csv(csvFilePath, sep='\t', skiprows=5)
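The effect of skiprows can be checked without the real file; a minimal sketch using an in-memory CSV (the five header lines and the column names here are made up, since the real file's layout isn't shown):

```python
import io
import pandas as pd

# Fake download whose first five lines are report headers rather than data,
# mirroring the skiprows=5 suggestion above.
raw = "header line\n" * 5 + "A,B\n1,2\n3,4\n"

# skiprows=5 discards the report headers so the real header row is parsed.
df = pd.read_csv(io.StringIO(raw), skiprows=5)
print(list(df.columns))  # ['A', 'B']
```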

Python3: Comparing two CSV files to identify what's new in the newer file, ignoring content that was in the old file

I am trying to compare two CSV files with pandas and identify changes.
My goal is to identify the new entries that are present in the new file but not in the old one, ignoring everything that was in the old file and is no longer present in the newer one.
an old file
NAME;DESCRIPTION;LINK;PRICE;IMAGE
Item4;something;https://example.com;10;https://example.com/image.jpg
Item3;something;https://example.com;10;https://example.com/image.jpg
Item2;something;https://example.com;10;https://example.com/image.jpg
Item1;something;https://example.com;10;https://example.com/image.jpg
a newer file
NAME;DESCRIPTION;LINK;PRICE;IMAGE
Item5;something;https://example.com;10;https://example.com/image.jpg
Item4;something;https://example.com;10;https://example.com/image.jpg
Item3;something;https://example.com;10;https://example.com/image.jpg
Item2;something;https://example.com;10;https://example.com/image.jpg
I have already gotten as far as identifying any changes between both files, but unfortunately it also displays what no longer exists in the new file.
import pandas as pd
a = pd.read_csv('csv/new.items.csv')
b = pd.read_csv('csv/old.items.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep=False, inplace=True)
c.reset_index(drop=True, inplace=False)
c.to_csv(r'csv/pd.items.csv', index=False, header=True)
Expected result should be a new file including only the new entry which wasn't found in the old file
NAME;DESCRIPTION;LINK;PRICE;IMAGE
Item5;something;https://example.com;10;https://example.com/image.jpg
haven't worked with python for years so don't be too hard on me :)
Try this, merging the new file a against the old file b so that unmatched rows are the new entries:
c = a.merge(b, how = 'left', on = 'NAME', suffixes = ("", "_y"))
You should be able to get the new ones using the command below:
c[c.DESCRIPTION_y.isnull()]
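A runnable sketch of the merge idea on the question's sample rows, with the frames trimmed to two columns; note the direction: the new frame is merged against the old one, so the rows with no match (null right-hand columns) are exactly the new entries:

```python
import pandas as pd

# Frames mirroring the sample files in the question (two columns kept).
old = pd.DataFrame({'NAME': ['Item4', 'Item3', 'Item2', 'Item1'],
                    'DESCRIPTION': ['something'] * 4})
new = pd.DataFrame({'NAME': ['Item5', 'Item4', 'Item3', 'Item2'],
                    'DESCRIPTION': ['something'] * 4})

# Left-merge new against old: rows of `new` with no NAME match in `old`
# get NaN in the suffixed columns.
c = new.merge(old, how='left', on='NAME', suffixes=('', '_y'))
new_entries = c[c.DESCRIPTION_y.isnull()]
print(new_entries.NAME.tolist())  # ['Item5']
```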
I solved it in the end by doing the following
import pandas as pd
a = pd.read_csv('csv/new.items.csv')
b = pd.read_csv('csv/old.items.csv')
d = b.merge(a, how='inner', on=None, suffixes=("", "_y"))
d.to_csv(r'csv/old.items.csv', index=False, header=True)
b = pd.read_csv('csv/old.items.csv')
c = pd.concat([a,b], axis=0)
c.drop_duplicates(keep=False, inplace=True) # Set keep to False if you don't want any
c.reset_index(drop=True, inplace=False)
c.to_csv(r'csv/pd.items.csv', index=False, header=True)

How to merge big data of csv files column wise into a single csv file using Pandas?

I have many big CSV files, one per country, and I want to merge their columns into a single CSV file. Each file has 'Year' as its index, with the same length and the same values. Below is an example of a Japan.csv file.
If anyone can help me, please let me know. Thank you!!
Try using:
import pandas as pd
import glob

l = []
path = 'path/to/directory/'
csvs = glob.glob(path + "/*.csv")
for i in csvs:
    df = pd.read_csv(i, index_col=None, header=0)
    l.append(df)
df = pd.concat(l, ignore_index=True)
This should work. It goes over each file name, reads the file, and combines everything into one df. You can export this df to CSV or do whatever else with it. Good luck.
import pandas as pd

def combine_csvs_into_one_df(names_of_files):
    one_big_df = pd.DataFrame()
    for file in names_of_files:
        try:
            content = pd.read_csv(file)
        except PermissionError:
            print(file, "could not be read (permission denied)")
            continue
        one_big_df = pd.concat([one_big_df, content])
        print(file, "added!")
    print("------")
    print("Finished")
    return one_big_df
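Both answers above concatenate row-wise; since the question asks for a column-wise merge keyed on a shared 'Year' index, pd.concat with axis=1 may be closer to the goal. A minimal in-memory sketch (the country frames and column names are invented stand-ins for the real CSVs):

```python
import pandas as pd

# Two country frames sharing the same 'Year' index.
japan = pd.DataFrame({'Year': [2000, 2001],
                      'GDP_Japan': [1.0, 1.1]}).set_index('Year')
france = pd.DataFrame({'Year': [2000, 2001],
                       'GDP_France': [2.0, 2.1]}).set_index('Year')

# axis=1 places the columns side by side, aligned on the shared Year index.
merged = pd.concat([japan, france], axis=1)
print(list(merged.columns))  # ['GDP_Japan', 'GDP_France']
```

In the real case, each frame would come from pd.read_csv(..., index_col='Year') before concatenating.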

Using xlsxwriter (or other packages) to create Excel tabs with specific naming, and write dataframe to the corresponding tab

I am trying to query based on different criteria, and then create individual tabs in Excel to store the query results.
For example, I want to query all the results that match criterion "A" and write the result to an Excel tab named "A". The query results are stored as pandas DataFrames.
My problem is, when I want to perform 4 different queries based on criteria "A", "B", "C", "D", the final Excel file only contains one tab, which corresponds to the last criteria in the list. It seems that all the previous tabs are over-written.
Here is sample code where I replace the SQL query part with a pre-set dataframe and the tab name is set to 0, 1, 2, 3 ... instead of the default Sheet1, Sheet2... in Excel.
import pandas as pd
import xlsxwriter
import datetime

def GCF_Refresh(fileCreatePath, inputName):
    currentDT = str(datetime.datetime.now())
    currentDT = currentDT[0:10]
    loadExcelName = currentDT + '_' + inputName + '_Load_File'
    fileCreatePath = fileCreatePath + '\\' + loadExcelName + '.xlsx'
    wb = xlsxwriter.Workbook(fileCreatePath)
    data = [['tom'], ['nick'], ['juli']]
    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=['Name'])
    writer = pd.ExcelWriter(fileCreatePath, engine='xlsxwriter')
    for iCount in range(5):
        #worksheet = writer.sheets[str(iCount)]
        #worksheet.write(0, 0, 'Name')
        df['Name'].to_excel(fileCreatePath, sheet_name=str(iCount), startcol=0, startrow=1, header=None, index=False)
    writer.save()
    writer.close()

# Change the file path here to store on your local computer
GCF_Refresh("H:\\", "Bulk_Load")
My goal for this sample code is to have five tabs named 0, 1, 2, 3, 4, each with 'tom', 'nick' and 'juli' printed to it. Right now I have just one tab (named 4), which is the last among all the tabs I expected.
There are a number of errors in the code:
The xlsx file is created using XlsxWriter directly and then overwritten by creating it again in Pandas.
The to_excel() method takes a reference to the writer object, not the file path.
The save() and close() methods are the same thing and shouldn't be in the loop.
Here is a simplified version of your code with these issues fixed:
import pandas as pd

fileCreatePath = 'test.xlsx'
data = [['tom'], ['nick'], ['juli']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name'])
writer = pd.ExcelWriter(fileCreatePath, engine='xlsxwriter')
for iCount in range(5):
    df['Name'].to_excel(writer,
                        sheet_name=str(iCount),
                        startcol=0,
                        startrow=1,
                        header=None,
                        index=False)
writer.save()
See Working with Python Pandas and XlsxWriter in the XlsxWriter docs for some details about getting Pandas and XlsxWriter working together.
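On newer pandas versions (1.5+), writer.save() is deprecated in favour of close(), and pd.ExcelWriter can be used as a context manager so the workbook is saved automatically when the block exits; an equivalent sketch of the loop above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['tom', 'nick', 'juli']})

# The context manager closes (and thereby saves) the workbook on exit,
# so no explicit save()/close() call is needed.
with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    for iCount in range(5):
        df['Name'].to_excel(writer, sheet_name=str(iCount),
                            startcol=0, startrow=1,
                            header=None, index=False)
```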
