I am new to Python and this is my first practice code with BeautifulSoup. I have not yet learned creative solutions to specific data extraction problems.
This program prints just fine, but I have some difficulty writing the results to the CSV. It takes the first elements but leaves all the others behind. I can only guess there might be some whitespace, delimiter, or something else that causes the code to halt extraction after the initial texts.
I was trying to get the CSV extraction to write each item as its own row but obviously floundered. Thank you for any help and/or advice you can provide.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

price_page = 'http://www.harryrosen.com/footwear/c/boots'
page = urlopen(price_page)
soup = BeautifulSoup(page, 'html.parser')
product_data = soup.findAll('ul', attrs={'class': 'productInfo'})
for item in product_data:
    brand_name = item.contents[1].text.strip()
    shoe_type = item.contents[3].text.strip()
    shoe_price = item.contents[5].text.strip()
    print(brand_name)
    print(shoe_type)
    print(shoe_price)
    with open('shoeprice.csv', 'w') as shoe_prices:
        writer = csv.writer(shoe_prices)
        writer.writerow([brand_name, shoe_type, shoe_price])
Here is one way to approach the problem:
- collect the results into a list of dictionaries with a list comprehension
- write the results to a CSV file via the csv.DictWriter and a single .writerows() call
The implementation:
data = [{
    'brand': item.li.get_text(strip=True),
    'type': item('li')[1].get_text(strip=True),
    'price': item.find('li', class_='price').get_text(strip=True)
} for item in product_data]

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('shoeprice.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['brand', 'type', 'price'])
    writer.writerows(data)
If you want to also write the CSV headers, add the writer.writeheader() call before the writer.writerows(data).
Note that you could have as well used the regular csv.writer and a list of lists (or tuples), but I like the explicitness and the increased readability of using dictionaries in this case.
Also note that I've improved the locators used in the loop - I don't think using the .contents list and getting product children by indexes is a good and reliable idea.
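For comparison, here is a minimal sketch of the csv.writer alternative mentioned above, using a list of tuples. The rows are hypothetical placeholder values rather than scraped data, and io.StringIO stands in for a real file so the example runs without touching the disk.

```python
import csv
import io

# Hypothetical rows standing in for the scraped (brand, type, price) values.
rows = [
    ("Brand A", "Chelsea Boot", "$250.00"),
    ("Brand B", "Lace-Up Boot", "$395.00"),
]

# io.StringIO stands in for open('shoeprice.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["brand", "type", "price"])  # header row
writer.writerows(rows)                       # all data rows in one call

csv_text = buffer.getvalue()
print(csv_text)
```

With tuples, the column order is implied by position, which is why the dictionary version can be easier to read when there are many fields.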
with open('shoeprice.csv', 'w') as shoe_prices:
    writer = csv.writer(shoe_prices)
    for item in product_data:
        brand_name = item.contents[1].text.strip()
        shoe_type = item.contents[3].text.strip()
        shoe_price = item.contents[5].text.strip()
        print(brand_name, shoe_type, shoe_price, sep='\n')
        writer.writerow([brand_name, shoe_type, shoe_price])
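Move the open() call outside the loop so that the file is opened only once. Opening the file with 'w' on every iteration truncates it each time, so only the last row written survives.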
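The difference can be reproduced with a small self-contained sketch. The rows and the temporary file path are made up for illustration; only the placement of open() relative to the loop matters.

```python
import csv
import os
import tempfile

# Hypothetical rows; the real ones would come from the scraper.
rows = [("Brand A", "Boot", "$100"), ("Brand B", "Shoe", "$200")]
path = os.path.join(tempfile.mkdtemp(), "shoeprice.csv")

# Opening with 'w' inside the loop truncates the file on every pass.
for row in rows:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(row)
with open(path, newline="") as f:
    reopened_each_time = list(csv.reader(f))

# Opening once and looping inside the with-block keeps every row.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
with open(path, newline="") as f:
    opened_once = list(csv.reader(f))

print(len(reopened_each_time), len(opened_once))
```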
I'm trying to create a metadata scraper to enrich my e-book collection, but am experiencing some problems. I want to create a dict (or whatever gets the job done) to store the index (only while testing), the path and the series name. This is the code I've written so far:
from bs4 import BeautifulSoup

def get_opf_path():
    opffile = variables.items
    pathdict = {'index': [], 'path': [], 'series': []}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
        safe.append(pathdict)
        print(pathdict)
    print(safe)
This code is able to go through all the OPF files and get the series, index, and path. I'm sure of this because the console output is this:
[image 1: console output]
However, when I try to store the pathdict in safe, no matter where I put the safe.append(pathdict), the output is either:
[image: output]
or
[image: output]
or
[image: output]
What do I have to do, so that the safe=[] has the data shown in image 1?
I have tried everything I could think of, but nothing worked.
Any help is appreciated.
I believe this is the correct way:
from bs4 import BeautifulSoup

def get_opf_path():
    opffile = variables.items
    pathdict = {'index': [], 'path': [], 'series': []}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
                    print(pathdict)
                    safe.append(pathdict.copy())
    print(safe)
For two main reasons:
1. When you do:
pathdict['series'] = meta.get('content')
you are overwriting the previous value in pathdict['series'], so I believe this is the point where you should save it.
2. You also need to append a copy of the dict. If you don't, every later change to pathdict also shows up in the list, because what you store in the list is really a reference to the dict (in this case, the same object that the variable pathdict refers to).
Note
If you want to print the elements of the list on separate lines, you can do something like this:
print(*safe, sep="\n")
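The reference behaviour described above can be seen in isolation with a short sketch. The dict keys and the series names here are placeholders, not values from the OPF files.

```python
# One dict that is reused and mutated on every pass, as in the question.
record = {"index": 0, "series": None}
shared, copied = [], []

for i, name in enumerate(["Series A", "Series B"], start=1):
    record["index"] = i
    record["series"] = name
    shared.append(record)         # the same dict object is appended each time
    copied.append(record.copy())  # an independent snapshot of the current values

print(shared)   # both entries show the values from the last pass
print(copied)   # each entry keeps the values it had when it was appended
print(*copied, sep="\n")
```

An alternative with the same effect is to create a fresh dict inside the loop instead of copying a shared one.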
I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time

def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    # get the site
    r = requests.get(link)
    text = r.text
    # raw strings keep the \d escape intact for the regex engine
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]
    # get the digit number of views. It's returned in a list so I need to get that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)
    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    #df = df.append([todays_date, now_time, int(cleaned_views)],axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])
while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work passing the value of the dict (link) into the function. Inside the function, I just made sure to load the correct csv file at the beginning:
#Existing csv files
df=pd.read_csv(k+'.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't get that since it works just fine as the code written above. This is the part giving me an error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? It seems like a super messy way using this dictionary method (I only have 2 links I want to check but rather than just duplicate a function I wanted to experiment more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a csv and then trying to read back that csv later. When I saved the csv, I didn't use index=False with df.to_csv() so there was an extra column! When I was just testing with the dictionary, I was just reusing the df and even though I was saving it to a csv, the script kept using the df to do the actual adding of rows.
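A minimal sketch of that round trip, assuming pandas is installed. io.StringIO buffers stand in for the CSV files, and the row values are made up.

```python
import io
import pandas as pd

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])
df.loc[len(df)] = ['01-01', '12:00', 100]

# Saving with the default index=True writes the index as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_bad = pd.read_csv(with_index)
print(list(reloaded_bad.columns))  # four columns, so a three-item row no longer fits

# Saving with index=False keeps the three original columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
reloaded.loc[len(reloaded)] = ['01-01', '12:30', 120]
print(reloaded)
```

Reading the file back with pd.read_csv(..., index_col=0) would be another way to absorb the extra column, if the files have already been written with the index.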
As the output of my Python code, I get the marks of Randy and Shaw every time I run my program. I have been running this program a couple of times every month for many years.
I am storing their marks in lists in Python, but how do I save them in the format shown below? I am currently getting the output in the following format: [image: output in a row for two different persons]
import pandas
from openpyxl import load_workbook

# These lists come from a very complicated piece of code, so I am just creating new lists here
L1 = ('7/6/2016', 24, 24, 13)
L2 = ('5/8/2016', 25, 24, 16)
L3 = ('7/6/2016', 21, 16, 19)
L4 = ('5/8/2016', 23, 24, 21)
L5 = ('4/11/2016', 13, 12, 17)
print("Randy's grades")
print(L1)
print(L2)
print("Shaw's grades")
print(L3)
print(L4)
print(L5)
book = load_workbook('C:/Users/Desktop/Masterfile.xlsx')
writer = pandas.ExcelWriter('Masterfile.xlsx', engine='openpyxl')
Output at run no 1:
For Randy
7/6/2016, 24,24,13
5/8/2016, 25,24,16
For Shaw
7/6/2016, 21,16,19
5/8/2016, 23,24,21
4/11/2016, 13, 12,17
Output at run no 2:
For Randy
7/8/2016, 24,24,13
5/9/2016, 25,24,16
For Shaw
7/8/2016, 21,16,19
5/9/2016, 23,24,21
I will have many such output runs over a couple of years, so I want to save the data by appending to the same document.
I am using openpyxl to open the document, and I know I need to use the append() operation, but I am having a hard time saving my lists as rows. I am new here. Please help me with the syntax! I understand the logic but have difficulty with the syntax.
Thank you!
Since you said that you are willing to use csv format, I will show a csv solution.
with open('FileToWriteTo.csv', 'w') as outFile:
    outFile.write(','.join([str(item) for item in L1]))  # Take everything in L1 and put commas between them then write to file
    outFile.write('\n')  # Write newline
    outFile.write(','.join([str(item) for item in L2]))
    outFile.write('\n')
    outFile.write(','.join([str(item) for item in L3]))
    outFile.write('\n')
    outFile.write(','.join([str(item) for item in L4]))
    outFile.write('\n')
    outFile.write(','.join([str(item) for item in L5]))
    outFile.write('\n')
If you keep a list of lists instead of separate lists, this becomes easier with a for loop:
with open('FileToWriteTo.csv', 'w') as outFile:
    for row in listOfLists:
        outFile.write(','.join([str(item) for item in row]))
        outFile.write('\n')
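Since the question asks for new runs to be added to the same document, here is a hedged sketch of the same idea using the csv module with the file opened in append mode ('a'), which adds to the end of the file instead of overwriting it. The append_rows helper, the temporary file path, and the extra name column in front of each row are assumptions made for illustration.

```python
import csv
import os
import tempfile

def append_rows(path, rows):
    """Append rows to a CSV file, creating the file if it does not exist yet."""
    with open(path, "a", newline="") as out_file:
        csv.writer(out_file).writerows(rows)

# Hypothetical location; in practice this would be the master file.
path = os.path.join(tempfile.mkdtemp(), "grades.csv")

# Run 1 (a name column is added here so rows for different people can be told apart).
append_rows(path, [("Randy", "7/6/2016", 24, 24, 13),
                   ("Shaw", "7/6/2016", 21, 16, 19)])
# Run 2, some time later: earlier rows are kept.
append_rows(path, [("Randy", "7/8/2016", 24, 24, 13)])

with open(path, newline="") as in_file:
    saved = list(csv.reader(in_file))
print(saved)
```

Note that csv.reader returns every field as a string, so the marks would need converting back to int if they are used in calculations.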
I've written some code to scrape "Address" and "Phone" for some shop names, and it is working fine. However, it needs two parameters to be filled in to do its job. I would like to take them from a CSV file where "Name" is in the first column and "Lid" is in the second column, and have the harvested results placed in the third and fourth columns accordingly. At this point, I have no idea how to perform the search from a CSV file. Any suggestion will be greatly appreciated.
import requests
from lxml import html

Names = ["Literati Cafe", "Standard Insurance Co", "Suehiro Cafe"]
Lids = ["3221083", "497670909", "12183177"]
for Name in Names and Lids:
    Page_link = "https://www.yellowpages.com/los-angeles-ca/mip/"+Name.replace(" ","-")+"-"+Name
    response = requests.get(Page_link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//article[contains(@class,"business-card")]')
    for title in titles:
        Address = title.xpath('.//p[@class="address"]/span/text()')[0]
        Contact = title.xpath('.//p[@class="phone"]/text()')[0]
        print(Address, Contact)
You can get your Names and Lids lists from CSV like:
import csv

Names, Lids = [], []
with open("file_name.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        Names.append(line["Name"])
        Lids.append(line["Lid"])
(Never mind the PEP 8 violations for now ;).) Then you can use these lists in the rest of your code. I'm not sure what you are trying to achieve with your for Name in Names and Lids: loop, but it's not giving you what you think it is: Names and Lids is a boolean expression that evaluates to Lids whenever Names is non-empty, so the loop never iterates over the Names list, only over the Lids list.
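To make the point about the loop concrete, here is a small sketch using the lists from the question. Pairing the lists with zip() is one possible alternative, and building the link from the name plus the Lid is an assumption about what the URL should look like.

```python
Names = ["Literati Cafe", "Standard Insurance Co", "Suehiro Cafe"]
Lids = ["3221083", "497670909", "12183177"]

# `Names and Lids` is a boolean expression: Names is non-empty (truthy),
# so the expression evaluates to its second operand, Lids.
looped_over = Names and Lids
print(looped_over is Lids)

# zip() pairs the two lists element by element instead.
links = [
    "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(name.replace(" ", "-"), lid)
    for name, lid in zip(Names, Lids)
]
print(links[0])
```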
Also the first order of optimization should be to replace your loop with the loop over the CSV, like:
with open("file_name.csv", "r") as f:
    reader = csv.DictReader(f)
    for entry in reader:
        page_link = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        # rest of your scraping code...
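For the second half of the question (placing the results in the third and fourth columns), one possible sketch is to write each input row back out with two extra fields via csv.DictWriter. The lookup function below is a hypothetical stand-in for the scraping code, and io.StringIO buffers stand in for the input and output files.

```python
import csv
import io

def lookup(name, lid):
    """Hypothetical stand-in for the scraping code; returns (address, phone)."""
    return ("address for " + lid, "phone for " + lid)

# Stand-ins for the input file and a separate output file.
source = io.StringIO("Name,Lid\nLiterati Cafe,3221083\nSuehiro Cafe,12183177\n")
target = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(target, fieldnames=["Name", "Lid", "Address", "Phone"])
writer.writeheader()
for entry in reader:
    address, phone = lookup(entry["Name"], entry["Lid"])
    # Copy the two input columns and add the two harvested ones.
    writer.writerow({**entry, "Address": address, "Phone": phone})

result_rows = list(csv.DictReader(io.StringIO(target.getvalue())))
print(result_rows)
```

Writing to a separate output file avoids reading and overwriting the same file at once.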
I am new to python and I am trying to read data from URL. Basically I am reading the historical stock data, get the closing price and save the closing price in to a list. The closing price is available at the 4th index (5th column) of each line. And I want to do all of these within a list comprehension.
Code snippet:
from urllib.request import urlopen

URL = "http://ichart.yahoo.com/table.csv?s=AAPL&a=3&b=1&c=2016&d=9&e=30&f=2016"

def downloadClosingPrice():
    urlHandler = urlopen(URL)
    next(urlHandler)
    return [float(line.split(",")[4]) for line in urlHandler.read().decode("utf8").splitlines() if line]

closingPriceList = downloadClosingPrice()
The above code works just fine. I am able to read and fetch the required data. However, just out of curiosity, can the list comprehension be written in a simpler or easier way?
Thanks...
I did try out various ways and this is how I could do the same using different forms of list comprehension:
return [float(line.decode("utf8").split(",")[4]) for line in urlHandler if line]
# return [float(line.decode("utf8").split(",")[4]) for line in urlHandler.readlines() if line]
# return [float(line.split(",")[4]) for line in urlHandler.read().decode("utf8").splitlines() if line]
The first one is better because it reads the file line by line which saves memory. And of course it's simpler and easier to understand.
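The line-by-line form can be tried without a network call by feeding it a bytes buffer, which iterates over lines the same way the response object does. The sample data below is made up (only the column layout follows the description in the question), and closing_prices is a hypothetical helper name.

```python
import io

# Hypothetical sample; Close is the 5th column (index 4), as in the question.
sample = (
    b"Date,Open,High,Low,Close,Volume,Adj Close\n"
    b"2016-01-02,10.0,11.0,9.5,10.5,1000,10.5\n"
    b"2016-01-01,11.0,12.0,10.5,11.25,2000,11.25\n"
)

def closing_prices(handle):
    next(handle)  # skip the header line
    # Each line arrives as bytes, so decode it before splitting.
    return [float(line.decode("utf8").split(",")[4]) for line in handle if line.strip()]

prices = closing_prices(io.BytesIO(sample))
print(prices)
```

The filter uses line.strip() because a blank line read this way still contains its newline character and would otherwise pass a plain `if line` check.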