Read from CSV and store in Excel tabs - excel

I am reading multiple CSVs (via URL) into multiple Pandas DataFrames and want to store the results of each CSV into separate excel worksheets (tabs). When I keep writer.save() inside the for loop, I only get the last result in a single worksheet. And when I move writer.save() outside the for loop, I only get the first result in a single worksheet. Both are wrong.
import requests
import pandas as pd
from pandas import ExcelWriter
work_statements = {
    'sheet1': 'URL1',
    'sheet2': 'URL2',
    'sheet3': 'URL3'
}
for sheet, statement in work_statements.items():
    writer = pd.ExcelWriter('B.xlsx', engine='xlsxwriter')
    r = requests.get(statement) # go to URL
    df = pd.read_csv(statement) # read from URL
    df.to_excel(writer, sheet_name= sheet)
writer.save()
How can I get all three results in three separate worksheets?

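You are re-initializing the writer object on every iteration of the loop, so each pass starts a new workbook and discards the sheets written before it. Initialize it once before the for loop and save the document once after the loop. Also, pd.read_csv() expects a path, URL, or file-like object, so it cannot take the raw response bytes directly: either pass the URL itself (statement) and drop the requests call, or wrap r.content in io.BytesIO as below. Note that writer.save() was removed in pandas 2.0; writer.close() saves the file in both old and new versions:
import io
writer = pd.ExcelWriter('B.xlsx', engine='xlsxwriter')
for sheet, statement in work_statements.items():
    r = requests.get(statement) # go to URL
    df = pd.read_csv(io.BytesIO(r.content)) # parse the downloaded bytes
    df.to_excel(writer, sheet_name=sheet)
writer.close() # writes the file; replaces writer.save()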
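A minimal sketch of the same idea with ExcelWriter used as a context manager, which saves the workbook when the with-block exits. The in-memory byte strings stand in for r.content and are made up for illustration, as are the helper names read_payloads and write_sheets; writing the .xlsx assumes an Excel engine such as openpyxl or xlsxwriter is installed.

```python
import io

import pandas as pd

# Hypothetical stand-ins for requests.get(url).content (raw CSV bytes).
payloads = {
    "sheet1": b"a,b\n1,2\n3,4\n",
    "sheet2": b"a,b\n5,6\n",
}


def read_payloads(payloads):
    """Parse each CSV byte string into a DataFrame, keyed by sheet name."""
    return {name: pd.read_csv(io.BytesIO(raw)) for name, raw in payloads.items()}


def write_sheets(frames, path):
    """Write every DataFrame to its own worksheet in a single workbook."""
    # One writer for the whole loop; leaving the with-block saves the file.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, df in frames.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)


frames = read_payloads(payloads)
```

Calling write_sheets(frames, 'B.xlsx') should then produce one tab per dictionary key.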

Related

How to extract a table from a website(url) using python

The NIST dataset website contains some data for copper. How can I grab the table on the left (titled “HTML table format”) from the website using a Python script, preserve only the numbers in the second and third columns, and store all the data in a .csv file? I tried the code below, but it failed to get the correct format of the table.
import pandas as pd
# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
# Read the table into a pandas dataframe
df = pd.read_html(url, header=0, index_col=0)[0]
# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)
You could use:
.droplevel([0,1]) to remove the unwanted header rows
.dropna(axis=1, how='all') to remove the empty columns
.iloc[:,1:] to select only specific 3 columns
Example
import pandas as pd
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
df = pd.read_html(url, header=[0,1,2,3])[1].droplevel([0,1], axis=1).dropna(axis=1, how='all').iloc[:,1:]
df
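To see what each step in that chain does without hitting the network, here is a small sketch on a made-up frame. The header labels and numbers are invented for illustration and are not taken from the NIST page; the frame only imitates the shape described above (several header levels, an index-like first column, and an empty spacer column).

```python
import numpy as np
import pandas as pd

# Made-up frame with four header levels, an index-like first column and an
# all-NaN spacer column, loosely imitating what read_html might return.
columns = pd.MultiIndex.from_tuples([
    ("t", "s", "row", "n"),
    ("t", "s", "Energy", "(MeV)"),
    ("t", "s", "spacer", "x"),
    ("t", "s", "mu/rho", "(cm2/g)"),
    ("t", "s", "mu_en/rho", "(cm2/g)"),
])
raw = pd.DataFrame(
    [[0, 1.0, np.nan, 10.0, 9.0],
     [1, 2.0, np.nan, 20.0, 18.0]],
    columns=columns,
)

cleaned = (
    raw.droplevel([0, 1], axis=1)   # drop the two outer header levels
       .dropna(axis=1, how="all")   # drop columns that are entirely NaN
       .iloc[:, 1:]                 # drop the first remaining column
)
```

Here cleaned keeps only the Energy, mu/rho and mu_en/rho columns, each with two header levels left.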
For parsing HTML documents, BeautifulSoup is a great Python package; combined with the requests library, it lets you extract the data you want.
The code below should extract the desired data:
# import packages/libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
# define URL link variable, get the response and parse the HTML dom contents
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
# declare table variable and use soup to find table in HTML dom
table = soup.find('table')
# iterate over table rows (tr) and append table data (td) to rows list
rows = []
for i, row in enumerate(table.find_all('tr')):
    # only append data if it's after the 3rd row -> (MeV),(cm2/g),(cm2/g)
    if i > 3:
        rows.append([value.text.strip() for value in row.find_all('td')])
# create DataFrame from the data appended to the rows list
df = pd.DataFrame(rows)
# export data to csv file called datafile
df.to_csv(r"datafile.csv")

pandas dataframe to a single sheet in multisheet excel file

Sometimes we open multi sheet excel file, do some operations in one sheet and then save it back in the same file or make a new file while saving. Given the operations are done in pandas dataframe, how can I copy back the result to the target sheet?
import openpyxl as op
from openpyxl.utils.dataframe import dataframe_to_rows
import pandas as pd
wbk=op.load_workbook("fileName.xlsx")
wsht=wbk['verbList']
#create dataframe with sheet data and operate
df = pd.read_excel("fileName.xlsx", sheet_name="verbList")
df.insert(0,"newCol2","") #sample operation
dataframe_to_rows(df, index=False, header=True) #dataframe converted to rows
#for loop from dataframe_to_rows moves back rows to excel file
#trying to avoid loops here
wsht["B1"].value="verbs"
wbk.save(basePath + "fileName-update.xlsx")
Any idea anyone?
If any other Python Excel library does the job, please let me know.
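One possible approach, sketched under the assumption that pandas 1.3 or later and openpyxl are installed: open the existing workbook with pd.ExcelWriter in append mode and let if_sheet_exists="replace" overwrite just the target sheet, which avoids the row loop. The helper name replace_sheet and the sample frame are made up for illustration. Since the sheet is replaced rather than edited, any cell formatting on it is probably not kept.

```python
import pandas as pd

# Sample frame standing in for the sheet contents read with pd.read_excel.
df = pd.DataFrame({"verbs": ["run", "jump"]})
df.insert(0, "newCol2", "")  # sample operation from the question


def replace_sheet(path, sheet_name, frame):
    """Overwrite one sheet of an existing workbook, leaving the others alone."""
    with pd.ExcelWriter(
        path, engine="openpyxl", mode="a", if_sheet_exists="replace"
    ) as writer:
        frame.to_excel(writer, sheet_name=sheet_name, index=False)
```

With the question's names this would be replace_sheet("fileName.xlsx", "verbList", df). It writes back into the same file, so to save under a new name you could copy the workbook first (for example with shutil.copy) and call it on the copy.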

Split CSV File into two files keeping header in both files

I am trying to split a large CSV file into two files. I am using the code below:
import pandas as pd
#csv file name to be read in
in_csv = 'Master_file.csv'
#get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
#size of rows of data to write to the csv,
#you can change the row size according to your need
rowsize = 600000
#start looping through data writing it to a new file for each set
for i in range(0,number_lines,rowsize):
    df = pd.read_csv(in_csv,
                     nrows = rowsize,#number of rows to read at each loop
                     skiprows = i)#skip rows that have been read
    #csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'File_Number' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=True,
              mode='a',#append data to csv file
              chunksize=rowsize)#size of data to append for each loop
It is splitting the file, but the header is missing in the second file. How can I fix it?
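In the original code, skiprows=i also skips the header line on every iteration after the first, so the first line read for that chunk is a data row that gets consumed as the header. When called with chunksize, pd.read_csv() returns an iterator that reads the header once and applies it to every chunk. The following is an example. It should also be much faster, since the original code reads the entire file to count the lines and then re-reads all previous lines on each iteration, whereas the version below reads through the file only once:
import pandas as pd
# chunksize matches rowsize from the question
with pd.read_csv('Master_file.csv', chunksize=600000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f'File_Number{i}.csv', index=False, header=True)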
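To check the header behaviour on something small, here is a sketch that runs the same pattern against a made-up five-row file in a temporary directory. The column names, values and the chunk size of 2 are only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

workdir = Path(tempfile.mkdtemp())
source = workdir / "Master_file.csv"
# Small made-up input: a header line plus five data rows.
source.write_text("id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n")

written = []
with pd.read_csv(source, chunksize=2) as reader:
    for i, chunk in enumerate(reader):
        out_path = workdir / f"File_Number{i}.csv"
        chunk.to_csv(out_path, index=False, header=True)
        written.append(out_path)

# First line of every output file, to confirm each one carries the header.
first_lines = [p.read_text().splitlines()[0] for p in written]
```

Each entry of first_lines should be the original header line, including the one for the last, shorter chunk.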

comparing two csv files in python that have different data sets

Using Python, I want to compare two CSV files, but only column 2 of the first CSV against column 0 of the second CSV, and write to a new CSV file only the lines that have no match.
Example....
currentstudents.csv contains the following information
Susan,Smith,susan.smith#mydomain.com,8
John,Doe,john.doe#mydomain.com,9
Cool,Guy,cool.guy#mydomain.com,3
Test,User,test.user#mydomain.com,5
previousstudents.csv contains the following information
susan.smith#mydomain.com
john.doe#mydomain.com
test.user#mydomain.com
After comparing the two csv files, a new csv called NewStudents.csv should be written with the following information:
Cool,Guy,cool.guy#mydomain.com,3
Here is what I have, but it fails to produce what I need. The code below works if I omit all data except the email address in the original currentstudents.csv file, but then I don't end up with the needed data in the final CSV file.
def newusers():
    for line in fileinput.input(r'C:\work\currentstudents.csv', inplace=1):
        print(line.lower(), end='')
    with open(r'C:\work\previousstudents.csv', 'r') as t1, open(r'C:\work\currentstudents.csv', 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()
    with open(r'C:\work\NewStudents.csv', 'w') as outFile:
        for (line[0]) in filetwo:
            if (line[0]) not in fileone:
                outFile.write(line)
Thanks in advance!
This script writes NewStudents.csv:
import csv
with open('currentstudents.csv', newline='') as csvfile1, \
     open('previousstudents.csv', newline='') as csvfile2, \
     open('NewStudents.csv', 'w', newline='') as csvfile3:
    reader1 = csv.reader(csvfile1)
    reader2 = csv.reader(csvfile2)
    csvwriter = csv.writer(csvfile3)
    # collect the known emails (column 0 of the previous-students file)
    emails = set(row[0] for row in reader2)
    for row in reader1:
        # keep rows whose email (column 2) is not in the previous list
        if row[2] not in emails:
            csvwriter.writerow(row)
The content of NewStudents.csv:
Cool,Guy,cool.guy#mydomain.com,3
With a pandas option
For small files it's not going to matter, but for larger files the vectorized operations of pandas may be faster than iterating over the rows with csv.
Read the data with pd.read_csv
Merge the data with pandas.DataFrame.merge
The columns do not have names in the question, so columns are selected by column index.
Select the desired new students with Boolean indexing with [all_students._merge == 'left_only'].
.iloc[:, :-2] selects all rows, and all but last two columns.
import pandas as pd
# read the two csv files
cs = pd.read_csv('currentstudents.csv', header=None)
ps = pd.read_csv('previousstudents.csv', header=None)
# merge the data
all_students = cs.merge(ps, left_on=2, right_on=0, how='left', indicator=True)
# select only data from left_only
new_students = all_students.iloc[:, :-2][all_students._merge == 'left_only']
# save the data without the index or header
new_students.to_csv('NewStudents.csv', header=False, index=False)
# NewStudents.csv
Cool,Guy,cool.guy#mydomain.com,3
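As an alternative to the merge, a shorter sketch using Series.isin on the same sample data. The CSV text is inlined with io.StringIO so the snippet runs on its own; with real files you would pass the file names to pd.read_csv instead.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the snippet is self-contained.
current_csv = """Susan,Smith,susan.smith#mydomain.com,8
John,Doe,john.doe#mydomain.com,9
Cool,Guy,cool.guy#mydomain.com,3
Test,User,test.user#mydomain.com,5
"""
previous_csv = """susan.smith#mydomain.com
john.doe#mydomain.com
test.user#mydomain.com
"""

cs = pd.read_csv(io.StringIO(current_csv), header=None)
ps = pd.read_csv(io.StringIO(previous_csv), header=None)

# Keep rows whose email (column 2) does not appear in the previous list.
new_students = cs[~cs[2].isin(ps[0])]
```

new_students can then be saved with new_students.to_csv('NewStudents.csv', header=False, index=False), as in the merge version.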

Appending dataframe to existing Excel worksheet using Openpyxl

I'm trying to create a new spreadsheet and worksheet containing the column headings from a dataframe. I then want to append new data to the worksheet every For loop iteration. I am likely to have a large amount of data and therefore thought it would be necessary to write it out to Excel after every iteration rather than writing the whole DF at the end.
The "Append data to existing worksheet" code in the For loop works correctly (i.e. gives me 3 rows of values) on its own if I am writing to a spreadsheet that already contains the column headings that I have created within Excel. But when I run the code as you see below, I only end up with the column headings and the values from the last For loop iteration. I'm obviously missing something simple but can't seem to work it out. Any help would be much appreciated.
import openpyxl as xl
import pandas as pd
import numpy as np
import datetime as dt
fn = '00test101.xlsx'
# Create new workbook
wb = xl.Workbook()
wb.save(fn)
book = xl.load_workbook(fn)
writer = pd.ExcelWriter(fn,engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Write DF column names to new worksheet
DF = pd.DataFrame(columns=['A','B','C'])
DF.to_excel(writer, 'ABC', header=True, startrow=0)
writer.save()
for i in range(3):
    a = np.array([1,3,6]) * i
    # Overwrite existing DF and add data
    DF = pd.DataFrame(columns=['A','B','C'])
    DF.loc[dt.datetime.now()] = a
    # Append data to existing worksheet
    book = xl.load_workbook(fn)
    writer = pd.ExcelWriter(fn,engine='openpyxl')
    writer.book = book
    writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
    DF.to_excel(writer, 'ABC', header=None, startrow=book.active.max_row)
    writer.save()
# Remove unwanted default worksheet
wb = xl.load_workbook(fn)
def_sheet = wb.get_sheet_by_name('Sheet')
wb.remove_sheet(def_sheet)
wb.save(fn)
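A likely cause is that book.active refers to the default sheet 'Sheet' rather than 'ABC'. That sheet stays empty, so book.active.max_row is the same on every iteration and each pass overwrites the previous row. A workbook created in Excel has the data sheet active, which would explain why that case works. A minimal sketch of one way around it, assuming pandas 1.4 or later with openpyxl: look up the target sheet by name and append with if_sheet_exists="overlay". The helper names append_frame and make_row are made up for illustration.

```python
import datetime as dt

import numpy as np
import pandas as pd


def append_frame(path, sheet_name, frame):
    """Append frame below the last used row of an existing sheet."""
    with pd.ExcelWriter(
        path, engine="openpyxl", mode="a", if_sheet_exists="overlay"
    ) as writer:
        # Look up the target sheet by name instead of relying on book.active.
        start = writer.sheets[sheet_name].max_row
        frame.to_excel(writer, sheet_name=sheet_name, header=False, startrow=start)


def make_row(i):
    """Build a one-row frame like the one in the question for iteration i."""
    return pd.DataFrame(
        [np.array([1, 3, 6]) * i],
        columns=["A", "B", "C"],
        index=[dt.datetime.now()],
    )


rows = [make_row(i) for i in range(3)]
```

Once the workbook exists with the header row in sheet 'ABC', calling append_frame(fn, 'ABC', frame) once per iteration should add one row each time. The sheet must already exist, otherwise the lookup by name raises a KeyError.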

Resources