How do you append rows to an xlsx file when using BeautifulSoup and pandas to scrape? - python-3.x

So, I've been looking all over and I can't seem to figure out why I can't get the results from my scrape to write to an xlsx file.
I'm running a list of URLs from a .csv file. I throw 10 URLs in there and BeautifulSoup scrapes them. If I just print the dataframe, it comes out right.
If I try to save the results as an xlsx (which is preferred) or csv, it only gives me the results from the last URL.
If I run this, it prints out perfectly:
with open('G-Sauce_Urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        r = requests.get(line[0]).text
        soup = BeautifulSoup(r, 'lxml')
        business = soup.find('title')
        companys = business.get_text()
        phones = soup.find_all(text=re.compile("Call (.*)"))
        Website = soup.select('head > link:nth-child(4)')
        profile = (Website[0].attrs['href'])
        data = {'Required': [companys], 'Required_no_Email': [phones], 'Business_Fax': [profile]}
        df = pd.DataFrame(data, columns=['Required', 'First', 'Last', 'Required_no_Email', 'Business_Fax'])
But I can't seem to get it to append to an xlsx file. I'm only getting the last result, which I figure is because it is just "writing" and not appending.
I've tried:
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter', mode='a')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
writer.save()
AND
with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
    writer.save()
AND
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
AND
I started reading into openpyxl, but at this point I am so confused that I don't understand it.
Any and all help is appreciated

You are iterating over your csv data line-by-line, but you are recreating your dataframe at every iteration, so you are losing the value of the previous one each time. You will need to create the df first outside of the loop, and add data in your for loop.
df = pd.DataFrame(columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
>>> df
Empty DataFrame
Columns: [Required, First, Last, Required_no_Email, Business_Fax]
Index: []
Your assumption of writing and not appending is correct, but you need to append to the dataframe and then write it to Excel, not append data to the Excel file (if I understood correctly).
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = df.append(data, ignore_index=True) # use this instead of this part of your original code below:
# df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
# this will not be required as you have already defined the df outside the loop
The pd.ExcelWriter will only produce the output when you run:
writer.save()
I have a similar code that opens the file with the following parameters and it works:
writer = pd.ExcelWriter(r'path_to_file.xlsx', engine='xlsxwriter')
... all my modifications ...
writer.save()
Note that according to the documentation, 'w' (write) is the default mode, even when modifying an existing object. Although not explained in great detail, append mode is meant only for adding entirely new Excel objects (sheets, etc.) or "extending" the document with another dataframe that has exactly the same format as the existing document structure.
For it to be reproducible, you could add a template xlsx, but I hope this helps. Please let me know.
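Putting the pieces above together, a minimal sketch of the corrected loop might look like the following. The column names and file names are taken from the question, the selectors are assumed to behave as in the original code, and on pandas 2.0+ df.append would need to be replaced with pd.concat:
import csv
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

columns = ['Required', 'First', 'Last', 'Required_no_Email', 'Business_Fax']
df = pd.DataFrame(columns=columns)  # create the dataframe once, outside the loop

with open('G-Sauce_Urls.csv', 'r') as csv_file:
    for line in csv.reader(csv_file):
        r = requests.get(line[0]).text
        soup = BeautifulSoup(r, 'lxml')

        companys = soup.find('title').get_text()
        phones = soup.find_all(text=re.compile("Call (.*)"))
        profile = soup.select('head > link:nth-child(4)')[0].attrs['href']

        # append one row per URL instead of rebuilding the dataframe
        row = {'Required': companys, 'Required_no_Email': phones, 'Business_Fax': profile}
        df = df.append(row, ignore_index=True)  # on pandas >= 2.0 use pd.concat([df, pd.DataFrame([row])])

# write the accumulated rows once, after the loop
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4)
writer.save()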

Related

Convert a pandas dataframe to tab separated list in Python

I have a dataframe like below:
import pandas as pd
data = {'Words': ['actually', 'he', 'came', 'from', 'home', 'and', 'played'],
        'Col2': ['2', '0', '0', '0', '1', '0', '3']}
data = pd.DataFrame(data)
The resulting dataframe has two columns, Words and Col2, with one row per word.
I write this dataframe to disk using the command below:
np.savetxt('/folder/file.txt', data.values,fmt='%s', delimiter='\t')
And the next script reads it with below line of code:
data = load_file('/folder/file.txt')
Below is the load_file function that reads a text file.
def load_file(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        data = f.readlines()
    return data
The data will be a tab separated list.
print(data)
gives me the following output:
['actually\t2\n', 'he\t0\n', 'came\t0\n', 'from\t0\n', 'home\t1\n', 'and\t0\n', 'played\t3\n']
I don't want to write the file to the drive and then read it back for processing. Instead, I want to convert the dataframe to a tab-separated list and process it directly. How can I achieve this?
I checked for existing answers, but most just convert list to dataframe and not other way around.
Thanks in advance.
Try using .to_csv()
df_list = data.to_csv(header=None, index=False, sep='\t').split('\n')
df_list:
['actually\t2',
'he\t0',
'came\t0',
'from\t0',
'home\t1',
'and\t0',
'played\t3'
]
df_list = [line + '\n' for line in data.to_csv(header=None, index=False, sep='\t').splitlines()]
df_list:
['actually\t2\n',
'he\t0\n',
'came\t0\n',
'from\t0\n',
'home\t1\n',
'and\t0\n',
'played\t3\n'
]
I think this achieves the same result without writing to the drive:
df_list = list(data.apply(lambda row: row['Words'] + '\t' + row['Col2'] + '\n', axis=1))
Try:
data.apply("\t".join, axis=1).tolist()

Python - Creating a for loop to build a single csv file with multiple dataframes

I am new to Python and trying various things to learn the fundamentals. One of the things I'm currently stuck on is for loops. I have the following code and am positive it can be built out more efficiently using a loop, but I'm not sure exactly how.
import pandas as pd
import numpy as np
url1 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=1'
url2 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=2'
url3 = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page=3'
df1 = pd.read_html(url1)
df1[0].to_csv('NFL_Receiving_Page1.csv', index=False)  # index=False drops the index listing that would appear as the first column in the csv
df2 = pd.read_html(url2)
df2[0].to_csv('NFL_Receiving_Page2.csv', index=False)
df3 = pd.read_html(url3)
df3[0].to_csv('NFL_Receiving_Page3.csv', index=False)
df_receiving_agg = pd.concat([df1[0], df2[0], df3[0]])
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False)
I'm ultimately trying to combine the data from the above URLs into a single table in a csv file.
You can try this:
urls = [url1, url2, url3]
df_receiving_agg = pd.DataFrame()
for url in urls:
    df = pd.read_html(url)[0]  # read_html returns a list of tables; take the first
    df_receiving_agg = pd.concat([df_receiving_agg, df])
df_receiving_agg.to_csv('filepath.csv', index=False)
You can do this:
base_url = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page='
dfs = []
for page in range(1, 4):
    url = f'{base_url}{page}'
    df = pd.read_html(url)[0]  # take the first table on the page
    df.to_csv(f'NFL_Receiving_Page{page}.csv', index=False)
    dfs.append(df)
df_receiving_agg = pd.concat(dfs)
df_receiving_agg.to_csv('NFL_Receiving_Combined.csv', index=False)
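For what it's worth, the same aggregation can be written a bit more compactly with a list comprehension; this is a sketch assuming the same CBS Sports URL pattern used above:
import pandas as pd

base_url = 'https://www.cbssports.com/nfl/stats/player/receiving/nfl/regular/qualifiers/?page='

# pd.read_html returns a list of tables per page; take the first table from each page
dfs = [pd.read_html(f'{base_url}{page}')[0] for page in range(1, 4)]
pd.concat(dfs, ignore_index=True).to_csv('NFL_Receiving_Combined.csv', index=False)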

Using xlsxwriter (or other packages) to create Excel tabs with specific naming, and write dataframe to the corresponding tab

I am trying to query based on different criteria, and then create individual tabs in Excel to store the query results.
For example, I want to query all the results that match criteria A and write the result to an Excel tab named "A". The query result is stored as a pandas data frame.
My problem is that when I perform 4 different queries based on criteria "A", "B", "C", "D", the final Excel file contains only one tab, corresponding to the last criterion in the list. It seems that all the previous tabs are overwritten.
Here is sample code where I replace the SQL query part with a preset dataframe, and the tab names are set to 0, 1, 2, 3... instead of the default Sheet1, Sheet2... in Excel.
import pandas as pd
import xlsxwriter
import datetime
def GCF_Refresh(fileCreatePath, inputName):
    currentDT = str(datetime.datetime.now())
    currentDT = currentDT[0:10]
    loadExcelName = currentDT + '_' + inputName + '_Load_File'
    fileCreatePath = fileCreatePath + '\\' + loadExcelName + '.xlsx'
    wb = xlsxwriter.Workbook(fileCreatePath)

    data = [['tom'], ['nick'], ['juli']]
    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=['Name'])

    writer = pd.ExcelWriter(fileCreatePath, engine='xlsxwriter')
    for iCount in range(5):
        # worksheet = writer.sheets[str(iCount)]
        # worksheet.write(0, 0, 'Name')
        df['Name'].to_excel(fileCreatePath, sheet_name=str(iCount), startcol=0, startrow=1, header=None, index=False)
        writer.save()
        writer.close()
# Change the file path here to store on your local computer
GCF_Refresh("H:\\", "Bulk_Load")
My goal for this sample code is to have 5 tabs named, 0, 1, 2, 3, 4 and each tab has 'tom', 'nick' and 'juli' printed to it. Right now, I just have one tab (named 4), which is the last tab among all the tabs I expected.
There are a number of errors in the code:
The xlsx file is created using XlsxWriter directly and then overwritten by creating it again with Pandas.
The to_excel() method takes a reference to the writer object, not the file path.
The save() and close() calls do the same thing and shouldn't be in the loop.
Here is a simplified version of your code with these issues fixed:
import pandas as pd
import xlsxwriter

fileCreatePath = 'test.xlsx'

data = [['tom'], ['nick'], ['juli']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name'])

writer = pd.ExcelWriter(fileCreatePath, engine='xlsxwriter')

for iCount in range(5):
    df['Name'].to_excel(writer,
                        sheet_name=str(iCount),
                        startcol=0,
                        startrow=1,
                        header=None,
                        index=False)

writer.save()
See Working with Python Pandas and XlsxWriter in the XlsxWriter docs for some details about getting Pandas and XlsxWriter working together.
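As a side note, pd.ExcelWriter can also be used as a context manager, which saves and closes the file automatically when the block exits, so no explicit save() call is needed. Here is a sketch of the same loop written that way:
import pandas as pd

data = [['tom'], ['nick'], ['juli']]
df = pd.DataFrame(data, columns=['Name'])

with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    for iCount in range(5):
        # one sheet per iteration, named '0' .. '4'
        df['Name'].to_excel(writer, sheet_name=str(iCount),
                            startcol=0, startrow=1, header=None, index=False)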

Adding custom Header and Footer in CSV generated using Pandas Dataframe

I am writing the results of a query to a CSV file. However, I am looking to add a custom header (H|10|27) and footer (F|<row_count>).
I read related posts on SO, but I couldn't find anything specific to Python and pandas, and the documentation doesn't cover this either.
I am not sure how to go about it:
My code:
cs = connect_snowflake().cursor()
try:
    cs.execute("select * from <Tables> where id in (20, 24, 61);")
    datas = cs.fetchall()
    df = pd.DataFrame(datas)
    print(df.head(10))
    df.to_csv('P202461.csv', sep='|', header=False)
finally:
    cs.close()
The footer here should be the total row count, which I can fetch in a separate variable and pass in.
Any help would be appreciated.
Doesn't seem like you have a real need for pandas in this case.
Try the standard csv module.
Something like this should work:
import csv
def output_query_to_csv(query, filename='P202461.csv'):
    # newline='' is recommended when handing a file object to the csv module
    with connect_snowflake() as conn, open(filename, 'w', encoding='utf8', newline='') as csv_out:
        cs = conn.cursor()
        cs.execute(query)
        datas = cs.fetchall()

        writer = csv.writer(csv_out, delimiter='|')
        header = ('H', '10', '27')
        footer = ('F', len(datas))

        writer.writerow(header)
        writer.writerows(datas)
        writer.writerow(footer)
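If you do want to stay in pandas (for example because the query result needs further processing), a minimal sketch of the same idea is to write the header and footer lines yourself around DataFrame.to_csv; the function name and file name here are illustrative assumptions:
import pandas as pd

def df_to_csv_with_header_footer(df: pd.DataFrame, filename: str = 'P202461.csv') -> None:
    # open the file once and let pandas write into the same handle
    with open(filename, 'w', encoding='utf8', newline='') as f:
        f.write('H|10|27\n')                                    # custom header line
        df.to_csv(f, sep='|', header=False, index=False)        # the query results
        f.write(f'F|{len(df)}\n')                               # footer with the row count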

Saving loop output to multiple excel sheets

I have a csv file full of multiple years of water data. I've broken up each water year into its own data frame. Now I want to do some math on those water years and then save each water year to its own Excel sheet.
The math part of the code is working, but I'm having trouble with the final step of naming and saving the output of the loop correctly. Right now it creates the Excel file and the sheet names correctly, but the loop saves only the final iteration to all the sheets. I've googled around, but I can't get the answers to similar questions to work. This is my first Python program, so advice would be appreciated.
import pandas as pd
with open(r'wft.csv') as csvfile:
    tdata = pd.read_csv(csvfile)

tdata['date'] = pd.to_datetime(tdata['date'], format='%m/%d/%Y %H:%M')
tdata = tdata.set_index(['date'])

wy2015 = tdata.loc['2014-10-1 00:00' : '2015-7-1 00:00']
wy2016 = tdata.loc['2015-10-1 00:00' : '2016-7-1 00:00']
wy2017 = tdata.loc['2016-10-1 00:00' : '2017-7-1 00:00']

writer = pd.ExcelWriter('WFT.xlsx', engine='xlsxwriter')
wyID = [wy2014, wy2015, wy2016, wy2017]
seq = ['wy2014', 'wy2015', 'wy2016', 'wy2017']

for df in wyID:
    df = df.sort_values(by=['turbidity'], ascending=False)
    df['rank'] = df['turbidity'].rank(method='first', ascending=0)
    df['cunnanes'] = (df['rank'] - 0.4) / (len(df['rank']) + 0.2) * 100
    for name in seq:
        df.to_excel(writer, sheet_name=name)
writer.save()
Issues in your code
writer = pd.ExcelWriter('WFT.xlsx', engine='xlsxwriter')
wyID = [wy2014, wy2015, wy2016, wy2017]
seq = ['wy2014', 'wy2015', 'wy2016', 'wy2017']

for df in wyID:  # outer loop that figures out wy20xx
    df = df.sort_values(by=['turbidity'], ascending=False)
    df['rank'] = df['turbidity'].rank(method='first', ascending=0)
    df['cunnanes'] = (df['rank'] - 0.4) / (len(df['rank']) + 0.2) * 100
    for name in seq:  # this loops through all the names and writes every sheet on every iteration; you want to write just one
        df.to_excel(writer, sheet_name=name)
writer.save()
Instead try this.
for i, df in enumerate(wyID):  # outer loop that figures out wy20xx
    df = df.sort_values(by=['turbidity'], ascending=False)
    df['rank'] = df['turbidity'].rank(method='first', ascending=0)
    df['cunnanes'] = (df['rank'] - 0.4) / (len(df['rank']) + 0.2) * 100
    df.to_excel(writer, sheet_name=seq[i])  # writes to the correct wy20xx sheet
writer.save()  # now you're done writing the Excel file
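Equivalently, you could pair each dataframe with its sheet name using zip, which avoids the index bookkeeping; this sketch assumes the same writer, wyID and seq as defined above:
for name, df in zip(seq, wyID):
    df = df.sort_values(by='turbidity', ascending=False)
    df['rank'] = df['turbidity'].rank(method='first', ascending=False)
    df['cunnanes'] = (df['rank'] - 0.4) / (len(df['rank']) + 0.2) * 100
    df.to_excel(writer, sheet_name=name)  # one sheet per water year
writer.save()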
