I am having trouble figuring out how to write this file to csv. I am parsing data from a table, and can print it just fine, but when I try to write to a csv file i get the error "TypeError: write() argument must be str, not list". I'm not sure how to make my data points into a string.
Code:
from bs4 import BeautifulSoup
import urllib.request
import csv
html = urllib.request.urlopen("https://markets.wsj.com/").read().decode('utf8')
soup = BeautifulSoup(html, 'html.parser') # parse your html
filename = "products.csv"
f = open(filename, "w")
t = soup.find('table', {'summary': 'Major Stock Indexes'}) # finds tag table with attribute summary equals to 'Major Stock Indexes'
tr = t.find_all('tr') # get all table rows from selected table
row_lis = [i.find_all('td') if i.find_all('td') else i.find_all('th') for i in tr if i.text.strip()] # construct list of data
f.write([','.join(x.text.strip() for x in i) for i in row_lis])
Any suggestions?
w.write() takes only a string as an argument, but your passing it a list of lists of strings.
csv.writerows() will write lists to a csv file.
Change your file handle f to be :
f = csv.writer(open(filename,'wb'))
and use it by replacing the last line with:
f.writerows([[x.text.strip() for x in i] for i in row_lis])
will produce a csv with contents:
Related
I'm trying to create a metadata scraper to enrich my e-book collection, but am experiencing some problems. I want to create a dict (or whatever gets the job done) to store the index (only while testing), the path and the series name. This is the code I've written so far:
from bs4 import BeautifulSoup
def get_opf_path():
opffile=variables.items
pathdict={'index':[],'path':[],'series':[]}
safe=[]
x=0
for f in opffile:
x+=1
pathdict['path']=f
pathdict['index']=x
with open(f, 'r') as fi:
soup=BeautifulSoup(fi, 'lxml')
for meta in soup.find_all('meta'):
if meta.get('name')=='calibre:series':
pathdict['series']=meta.get('content')
safe.append(pathdict)
print(pathdict)
print(safe)
this code is able to go through all the opf files and get the series, index and path, I'm sure of this, since the console output is this:
However, when I try to store the pathdict to the safe, no matter where I put the safe.append(pathdict) the output is either:
or
or
What do I have to do, so that the safe=[] has the data shown in image 1?
I have tried everything I could think of, but nothing worked.
Any help is appreciated.
I believe this is the correct way:
from bs4 import BeautifulSoup
def get_opf_path():
opffile = variables.items
pathdict = {'index':[], 'path':[], 'series':[]}
safe = []
x = 0
for f in opffile:
x += 1
pathdict['path'] = f
pathdict['index'] = x
with open(f, 'r') as fi:
soup = BeautifulSoup(fi, 'lxml')
for meta in soup.find_all('meta'):
if meta.get('name') == 'calibre:series':
pathdict['series'] = meta.get('content')
print(pathdict)
safe.append(pathdict.copy())
print(safe)
For two main reasons:
When you do:
pathdict['series'] = meta.get('content')
you are overwriting the last value in pathdict['series'] so I believe this is where you should save.
You also need to make a copy of it, if you donĀ“t it will change also in the list. When you store the dict you really are storing a reeference to it (in this case, a reference to the variable pathdict.
Note
If you want to print the elements of the list in separated lines you can do something like this:
print(*save, sep="\n")
I can see the results that I'm looking for in the Command Prompt, but I'm having trouble saving them to a CSV. The saved file is the location of the text and not the actual text.
import re, requests, csv
data = requests.get('http://www.spk-wc.usace.army.mil/fcgi-bin/getplottext.py?archive=true&plot=isbqr&length=wy&interval=d&wy=1995').text
pattern = re.compile(r'\d*\.00')
matches = pattern.finditer(data)
with open('1995.csv', 'w') as new_file:
csv_writer = csv.writer(new_file)
for success in matches:
export = [matches]
csv_writer.writerow(export)
I've written a script to parse data from the first table of a website. I've used xpath to parse the table. Btw, I didn't use "tr" tag cause without using it I can still see the results in the console when printed. When I run my script, the data are getting scraped but being printed in a single line in a csv file. I can't find out the mistake I'm making. Any input on this will be highly appreciated. Here is what I've tried with:
import csv
import requests
from lxml import html
url="https://fantasy.premierleague.com/player-list/"
response = requests.get(url).text
outfile=open('Data_tab.csv','w', newline='')
writer=csv.writer(outfile)
writer.writerow(["Player","Team","Points","Cost"])
tree = html.fromstring(response)
for titles in tree.xpath("//table[#class='ism-table']")[0]:
# tab_r = titles.xpath('.//tr/text()')
tab_d = titles.xpath('.//td/text()')
writer.writerow(tab_d)
You might want to add a level of looping, examining each table row in turn.
Try this:
for titles in tree.xpath("//table[#class='ism-table']")[0]:
for row in titles.xpath('./tr'):
tab_d = row.xpath('./td/text()')
writer.writerow(tab_d)
Or, perhaps this:
table = tree.xpath("//table[#class='ism-table']")[0]
for row in table.xpath('.//tr'):
items = row.xpath('./td/text()')
writer.writerow(items)
Or you could have the first XPath expression find the rows for you:
rows = tree.xpath("(.//table[#class='ism-table'])[1]//tr")
for row in rows:
items = row.xpath('./td/text()')
writer.writerow(items)
I've written a script which is scraping Address and Phone number of certain shops based on Name and Lid. The way it is searching is that It takes Name and Lid stored in column A and Column B respectively from a csv file. However, after fetching the result based on the search, I expected the parser to put that results in column C and column D respectively as it is shown in the second Image. At this point, I got stuck. I don't know how to manipulate Third and Fourth column using reading or writing method so that the data should be placed there. I'm trying with this now:
import csv
import requests
from lxml import html
Names, Lids = [], []
with open("mytu.csv", "r") as f:
reader = csv.DictReader(f)
for line in reader:
Names.append(line["Name"])
Lids.append(line["Lid"])
with open("mytu.csv", "r") as f:
reader = csv.DictReader(f)
for entry in reader:
Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
response = requests.get(Page)
tree = html.fromstring(response.text)
titles = tree.xpath('//article[contains(#class,"business-card")]')
for title in titles:
Address= title.xpath('.//p[#class="address"]/span/text()')[0]
Contact = title.xpath('.//p[#class="phone"]/text()')[0]
print(Address,Contact)
How my csv file looks like now:
My desired output is something like:
You can do it like this. Create a fresh output csv file whose header is based on the input csv, with the addition of the two columns. When you read a csv row it's available as a dictionary, in this case called entry. You can add the new values to this dictionary from the stuff you've gleaned on the 'net. Then write each newly created row out to file.
import csv
import requests
from lxml import html
with open("mytu.csv", "r") as f, open('new_mytu.csv', 'w', newline='') as g:
reader = csv.DictReader(f)
newfieldnames = reader.fieldnames + ['Address', 'Phone']
writer = csv.writer = csv.DictWriter(g, fieldnames=newfieldnames)
writer.writeheader()
for entry in reader:
Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
response = requests.get(Page)
tree = html.fromstring(response.text)
titles = tree.xpath('//article[contains(#class,"business-card")]')
#~ for title in titles:
title = titles[0]
Address= title.xpath('.//p[#class="address"]/span/text()')[0]
Contact = title.xpath('.//p[#class="phone"]/text()')[0]
print(Address,Contact)
new_row = entry
new_row['Address'] = Address
new_row['Phone'] = Contact
writer.writerow(new_row)
I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can print on xls and operate on the elements (For ex if (Earnings Yield [0] > Earnings Yield [1]).
So I write:
import html2text
text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(pr1)
list_Ear_yield = []
for i in Ear_yield_text :
list_Ear_yield.append(i)
Thinking that my web data has gone into list. I print the fourth item and check:
print(list_Ear_yield[3])
I expect the output as -0.33 but i get
n
That means the list takes in individual characters and not the full word:
Please let me know where I am doing wrong
That is because your Ear_yield_text is a string rather than a list. Assuming that the text have new lines you can do directly this:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield you will be given this result
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']