Can't perform reverse web search from a csv file - python-3.x

I've written some code that scrapes the "Address" and "Phone" for a few shop names, and it works fine. However, it needs two parameters to be filled in to do its job. I would like to read those from a csv file instead, where "Name" is in the first column and "Lid" is in the second, with the harvested results placed in the third and fourth columns. At this point, I have no idea how to perform the search from a csv file. Any suggestion will be greatly appreciated.
import requests
from lxml import html
Names=["Literati Cafe","Standard Insurance Co","Suehiro Cafe"]
Lids=["3221083","497670909","12183177"]
for Name in Names and Lids:
    Page_link="https://www.yellowpages.com/los-angeles-ca/mip/"+Name.replace(" ","-")+"-"+Name
    response = requests.get(Page_link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//article[contains(@class,"business-card")]')
    for title in titles:
        Address= title.xpath('.//p[@class="address"]/span/text()')[0]
        Contact = title.xpath('.//p[@class="phone"]/text()')[0]
        print(Address,Contact)

You can get your Names and Lids lists from the CSV like this:
import csv
Names, Lids = [], []
with open("file_name.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        Names.append(line["Name"])
        Lids.append(line["Lid"])
(Never mind the PEP 8 violations for now ;)). You can then use these lists in the rest of your code. Note, though, that your for Name in Names and Lids: loop is not giving you what you think it is: the expression Names and Lids evaluates to Lids whenever Names is non-empty, so the loop iterates only over the Lids list and never over the Names list.
Also, the first optimization should be to replace your loop with a loop over the CSV rows, like:
with open("file_name.csv", "r") as f:
    reader = csv.DictReader(f)
    for entry in reader:
        page_link = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        # rest of your scraping code...
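If you do want to keep the two separate lists, a minimal sketch of pairing them with zip() might look like the following. The build_links helper and the BASE_URL constant are names made up for this illustration; the URL pattern is the one from the question.

```python
names = ["Literati Cafe", "Standard Insurance Co", "Suehiro Cafe"]
lids = ["3221083", "497670909", "12183177"]

# `names and lids` is a boolean expression, not a pairing: because `names`
# is non-empty (truthy), the expression evaluates to `lids` itself.
assert (names and lids) is lids

# URL pattern taken from the question; BASE_URL is a hypothetical name.
BASE_URL = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"


def build_links(names, lids):
    """Pair each name with its lid and build one page link per pair."""
    return [
        BASE_URL.format(name.replace(" ", "-"), lid)
        for name, lid in zip(names, lids)
    ]


links = build_links(names, lids)
for link in links:
    print(link)
```

zip() stops at the shorter of the two lists, so a missing Lid silently drops the corresponding Name; reading both values from the same CSV row avoids that problem.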

Related

How to Append List in Python by reading csv file

I am trying to write a simple program that reads a csv file containing several email ids and gives the following output:
email_id = ['emailid1@xyz.com','emailid2@xyz.com','emailid3@xyz.com'] #required format
but the output I get looks like this:
[['emailid1@xyz.com']]
[['emailid1@xyz.com'], ['emailid2@xyz.com']]
[['emailid1@xyz.com'], ['emailid2@xyz.com'], ['emailid3@xyz.com']] #getting this wrong format
Here is the code I have written. Kindly suggest a correction that would give me the required format. Thanks in advance.
import csv
email_id = []
with open('contacts1.csv', 'r') as file:
    reader = csv.reader(file, delimiter = ',')
    for row in reader:
        email_id.append(row)
        print(email_id)
Note: my csv contains only one column, which holds the email ids, and it has no header. I also tried email_id.extend(row), but it did not work either.
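You need to move your print outside the loop and flatten the list of rows:
with open('contacts1.csv', 'r') as file:
    reader = csv.reader(file, delimiter = ',')
    for row in reader:
        email_id.append(row)
# sum(..., []) concatenates the one-element rows into a single flat list
print(sum(email_id, []))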
The loop can also be like this (if you only need one column from the csv):
for row in reader:
    email_id.append(row[0])
print(email_id)
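As a self-contained sketch of the single-column approach, the snippet below reads from an in-memory io.StringIO object in place of contacts1.csv so it can be run without any file on disk. The read_first_column helper is a name invented for this example.

```python
import csv
import io


def read_first_column(csv_file):
    """Return the first field of every non-empty row as a flat list."""
    reader = csv.reader(csv_file, delimiter=",")
    return [row[0] for row in reader if row]


# In-memory stand-in for contacts1.csv: one column, no header.
sample = io.StringIO("emailid1@xyz.com\nemailid2@xyz.com\nemailid3@xyz.com\n")

email_id = read_first_column(sample)
print(email_id)
```

The `if row` guard skips blank lines, which csv.reader yields as empty lists and which would otherwise raise an IndexError on row[0].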

compiling a regex query with a row in python

I have two csv files. I want to take each row of csv1 in turn, find its entry in csv2, and then pull out other information for that row from csv2.
My difficulty is that the items I am looking for are company names. In both csv1 and csv2 they may or may not have the suffix 'Ltd', 'LTD', 'Limited' or 'LIMITED' within the name, so I would like the query to ignore these substrings. My code is still only finding exact matches, rather than ignoring 'Ltd' etc. I'm guessing the problem is the way I'm combining row[0] with the regex query, but I can't figure it out.
Code
import re, csv
with open (r'c:\temp\noregcompanies.csv', 'rb') as q:
    readerM=csv.reader(q)
    for row in readerM:
        companySourcename = row[0]+"".join(r'.*(?!Ltd|Limited|LTD|LIMITED).*')
        IBcompanies = re.compile(companySourcename)
        IBcompaniesString = str(companySourcename)
        with open (r'c:\temp\chdata.csv', 'rb') as f:
            readerS = csv.reader(f)
            for row in readerS:
                companyCHname = row[0]+"".join(r'.*(?!Ltd|Limited|LTD|LIMITED)*')
                CHcompanies = re.compile(companyCHname)
                if CHcompanies.match(IBcompaniesString):
                    print ('Match is: ',row [0], row[1])
                    with open (r'c:\temp\outputfile.csv', 'ab') as o:
                        writer = csv.writer(o, delimiter=',')
                        writer.writerow(row)
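One possible way around this, sketched here and not taken from the original thread, is to avoid building a regex from each row at all: strip the suffix from both names first, then compare the normalized strings. The helper names, the sample company names, and the choice to also ignore case are all assumptions made for illustration.

```python
import re

# Matches a trailing "Ltd" / "Limited" (any case), with an optional
# leading comma or spaces and an optional trailing full stop.
SUFFIX_RE = re.compile(r"[\s,]*\b(?:ltd|limited)\b\.?\s*$", re.IGNORECASE)


def normalize_company(name):
    """Drop a trailing Ltd/Limited suffix and lower-case the rest."""
    return SUFFIX_RE.sub("", name).strip().lower()


def match_companies(source_names, reference_rows):
    """Return the reference rows whose first field matches a source name."""
    lookup = {normalize_company(row[0]): row for row in reference_rows}
    matches = []
    for name in source_names:
        row = lookup.get(normalize_company(name))
        if row is not None:
            matches.append(row)
    return matches


# Made-up sample data standing in for the two csv files.
source = ["Acme Widgets Ltd", "Bolt Supplies"]
reference = [
    ["ACME WIDGETS LIMITED", "01234567"],
    ["Bolt Supplies Ltd.", "07654321"],
    ["Other Co", "00000000"],
]

matches = match_companies(source, reference)
for row in matches:
    print("Match is:", row[0], row[1])
```

Building the lookup dictionary once also means csv2 is read a single time instead of once per row of csv1.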

Data from a table getting printed to csv in a single line

I've written a script to parse data from the first table of a website, using xpath. By the way, I didn't use the "tr" tag because I can still see the results in the console without it. When I run my script, the data are scraped but written to a single line in the csv file. I can't find the mistake I'm making. Any input on this will be highly appreciated. Here is what I've tried:
import csv
import requests
from lxml import html
url="https://fantasy.premierleague.com/player-list/"
response = requests.get(url).text
outfile=open('Data_tab.csv','w', newline='')
writer=csv.writer(outfile)
writer.writerow(["Player","Team","Points","Cost"])
tree = html.fromstring(response)
for titles in tree.xpath("//table[@class='ism-table']")[0]:
    # tab_r = titles.xpath('.//tr/text()')
    tab_d = titles.xpath('.//td/text()')
    writer.writerow(tab_d)
You might want to add a level of looping, examining each table row in turn.
Try this:
for titles in tree.xpath("//table[@class='ism-table']")[0]:
    for row in titles.xpath('./tr'):
        tab_d = row.xpath('./td/text()')
        writer.writerow(tab_d)
Or, perhaps this:
table = tree.xpath("//table[@class='ism-table']")[0]
for row in table.xpath('.//tr'):
    items = row.xpath('./td/text()')
    writer.writerow(items)
Or you could have the first XPath expression find the rows for you:
rows = tree.xpath("(.//table[@class='ism-table'])[1]//tr")
for row in rows:
    items = row.xpath('./td/text()')
    writer.writerow(items)
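To see the row-by-row idea in isolation, here is a rough sketch that runs without network access. It swaps lxml for the standard library's xml.etree.ElementTree (which supports only a subset of XPath) and uses a tiny made-up table, so the markup, the player data, and the table_rows helper are all assumptions and not the real page.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the scraped page.
SAMPLE = """
<html><body>
<table class="ism-table">
  <tbody>
    <tr><td>Player A</td><td>Team X</td><td>10</td><td>5.0</td></tr>
    <tr><td>Player B</td><td>Team Y</td><td>7</td><td>4.5</td></tr>
  </tbody>
</table>
</body></html>
"""


def table_rows(markup):
    """Return one list of cell texts per <tr> of the first matching table."""
    root = ET.fromstring(markup.strip())
    table = root.find(".//table[@class='ism-table']")
    return [[td.text for td in tr.findall("./td")] for tr in table.findall(".//tr")]


rows = table_rows(SAMPLE)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Player", "Team", "Points", "Cost"])
writer.writerows(rows)  # one csv line per table row
print(out.getvalue())
```

The key point carries over to lxml unchanged: collect the cells per tr element, then write one csv row per tr.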

Can't store the scraped results in third and fourth column in a csv file

I've written a script that scrapes the Address and Phone number of certain shops based on Name and Lid. It takes the Name and Lid stored in column A and column B, respectively, of a csv file. After fetching the results, I expected the parser to put them in column C and column D, as shown in the second image. At this point, I'm stuck: I don't know how to fill the third and fourth columns using the csv reading or writing methods. This is what I'm trying now:
import csv
import requests
from lxml import html
Names, Lids = [], []
with open("mytu.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        Names.append(line["Name"])
        Lids.append(line["Lid"])
with open("mytu.csv", "r") as f:
    reader = csv.DictReader(f)
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        for title in titles:
            Address= title.xpath('.//p[@class="address"]/span/text()')[0]
            Contact = title.xpath('.//p[@class="phone"]/text()')[0]
            print(Address,Contact)
What my csv file looks like now:
My desired output is something like:
You can do it like this. Create a fresh output csv file whose header is based on the input csv, with the addition of the two columns. When you read a csv row it's available as a dictionary, in this case called entry. You can add the new values to this dictionary from the stuff you've gleaned on the 'net. Then write each newly created row out to file.
import csv
import requests
from lxml import html
with open("mytu.csv", "r") as f, open('new_mytu.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Address', 'Phone']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        #~ for title in titles:
        title = titles[0]
        Address= title.xpath('.//p[@class="address"]/span/text()')[0]
        Contact = title.xpath('.//p[@class="phone"]/text()')[0]
        print(Address,Contact)
        new_row = entry
        new_row['Address'] = Address
        new_row['Phone'] = Contact
        writer.writerow(new_row)
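The read-extend-write pattern can be tried without touching the network. In the sketch below, fetch_details is a hypothetical stand-in for the scraping step and returns placeholder strings, and io.StringIO objects take the place of mytu.csv and new_mytu.csv.

```python
import csv
import io


def fetch_details(name, lid):
    """Hypothetical stand-in for the scraping step; returns placeholder values."""
    return {"Address": "address for {}".format(lid), "Phone": "phone for {}".format(lid)}


def add_columns(infile, outfile, lookup):
    """Copy every input row to outfile, extended with Address and Phone."""
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames) + ["Address", "Phone"]
    writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for entry in reader:
        # Each row is a dict, so the new values are just two more keys.
        entry.update(lookup(entry["Name"], entry["Lid"]))
        writer.writerow(entry)


# In-memory stand-ins for the input and output csv files.
source = io.StringIO("Name,Lid\nLiterati Cafe,3221083\nSuehiro Cafe,12183177\n")
result = io.StringIO()

add_columns(source, result, fetch_details)
print(result.getvalue())
```

Passing the lookup in as a function keeps the csv handling separate from the scraping, which makes it easier to handle pages where no address or phone is found.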

CSV Text Extraction Beautifulsoup

I am new to Python and this is my first practice code with BeautifulSoup. I have not learned creative solutions to specific data extraction problems yet.
This program prints just fine, but there is some difficulty in writing to the CSV. It takes the first elements but leaves all the others behind. I can only guess there might be some whitespace, a delimiter, or something else that causes the code to halt extraction after the initial texts.
I was trying to write each item to the CSV as its own row, but obviously floundered. Thank you for any help and/or advice you can provide.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
price_page = 'http://www.harryrosen.com/footwear/c/boots'
page = urlopen(price_page)
soup = BeautifulSoup(page, 'html.parser')
product_data = soup.findAll('ul', attrs={'class': 'productInfo'})
for item in product_data:
    brand_name=item.contents[1].text.strip()
    shoe_type=item.contents[3].text.strip()
    shoe_price = item.contents[5].text.strip()
    print (brand_name)
    print (shoe_type)
    print (shoe_price)
    with open('shoeprice.csv', 'w') as shoe_prices:
        writer = csv.writer(shoe_prices)
        writer.writerow([brand_name, shoe_type, shoe_price])
Here is one way to approach the problem:
- collect the results into a list of dictionaries with a list comprehension
- write the results to a CSV file via csv.DictWriter and a single .writerows() call
The implementation:
data = [{
    'brand': item.li.get_text(strip=True),
    'type': item('li')[1].get_text(strip=True),
    'price': item.find('li', class_='price').get_text(strip=True)
} for item in product_data]
with open('shoeprice.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=['brand', 'type', 'price'])
    writer.writerows(data)
If you want to also write the CSV headers, add the writer.writeheader() call before the writer.writerows(data).
Note that you could have as well used the regular csv.writer and a list of lists (or tuples), but I like the explicitness and the increased readability of using dictionaries in this case.
Also note that I've improved the locators used in the loop - I don't think using the .contents list and getting product children by indexes is a good and reliable idea.
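For comparison, here is a small sketch of both variants mentioned above, csv.DictWriter with a header and the regular csv.writer with a list of tuples. The product values are invented sample data, and io.StringIO replaces shoeprice.csv so the snippet runs on its own.

```python
import csv
import io

# Invented sample data standing in for the scraped products.
products = [
    ("Brand A", "Chelsea boot", "$100"),
    ("Brand B", "Hiking boot", "$200"),
]
fieldnames = ["brand", "type", "price"]

# Variant 1: csv.DictWriter with a header row and a single writerows() call.
dict_out = io.StringIO()
dict_writer = csv.DictWriter(dict_out, fieldnames=fieldnames, lineterminator="\n")
dict_writer.writeheader()
dict_writer.writerows([dict(zip(fieldnames, product)) for product in products])

# Variant 2: the regular csv.writer with a list of tuples.
plain_out = io.StringIO()
plain_writer = csv.writer(plain_out, lineterminator="\n")
plain_writer.writerow(fieldnames)
plain_writer.writerows(products)

print(dict_out.getvalue())
```

Both writers produce identical output here; the dictionary version just makes the column-to-value mapping explicit.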
with open('shoeprice.csv', 'w') as shoe_prices:
    writer = csv.writer(shoe_prices)
    for item in product_data:
        brand_name=item.contents[1].text.strip()
        shoe_type=item.contents[3].text.strip()
        shoe_price = item.contents[5].text.strip()
        print (brand_name, shoe_type, shoe_price, sep='\n')
        writer.writerow([brand_name, shoe_type, shoe_price])
Move the open() call outside the loop, so the file is opened only once instead of on every iteration.
