Web scrape table to excel using selenium, python - python-3.x

I'm trying to put this information on an excel file, but can't seem to figure out how to use import csv for it. I looked at other posts as reference but I can't seem to apply it for what I'm doing. I'm sort of new to selenium. Thank you.
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import csv
driver = webdriver.Chrome()
driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")
driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()
#scraping the caption of the tables
all_sub_head = driver.find_elements_by_class_name("tableSubHeaderForWsrDetail")
#scraping all the headers of the tables
all_headers = driver.find_elements_by_class_name("tableHeaderForWsrDetail")
#filtering the desired headers
required_headers = all_headers[5:]
#scraoing all the table data
all_contents = driver.find_elements_by_class_name("tdTinyFontForWsrDetail")
#filtering the desired tabla data
required_contents = all_contents[45:]

all_contents is a list of objects.
List comprehension is a quick and common way to gather property values from objects into another list.
Add these lines to the bottom of your script
lstdata = [e.text for e in required_contents]
with open('out.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow(lstdata)
Note that you have newlines in your data, so they will appear in the csv file (school address column)

Related

Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time
def check_views(link):
todays_date = datetime.now().strftime('%d-%m')
now_time = datetime.now().strftime('%H:%M')
#get the site
r = requests.get(link)
text = r.text
tag = re.compile('\d+ views')
views = re.findall(tag,text)[0]
#get the digit number of views. It's returned in a list so I need to get that item out
cleaned_views=re.findall('\d+',views)[0]
print(cleaned_views)
#append to the df
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
#df = df.append([todays_date, now_time, int(cleaned_views)],axis=0)
df.to_csv('views.csv')
return df
df = pd.DataFrame(columns=['Date','Time','Views'])
while True:
df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1':'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
'link2':'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
#this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
df = check_views(value)
That seems to work passing the value of the dict (link) into the function. Inside the function, I just made sure to load the correct csv file at the beginning:
#Existing csv files
df=pd.read_csv(k+'.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't get that since it works just fine as the code written above. This is the part giving me an error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? It seems like a super messy way using this dictionary method (I only have 2 links I want to check but rather than just duplicate a function I wanted to experiment more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a csv and then trying to read back that csv later. When I saved the csv, I didn't use index=False with df.to_csv() so there was an extra column! When I was just testing with the dictionary, I was just reusing the df and even though I was saving it to a csv, the script kept using the df to do the actual adding of rows.

Export multiple csv file using different filename in python

I am new to python.
After researching some code based on my idea which is extracting historical stock data,
I have now working code(see below) when extracting individual name and exporting it to a csv file
import investpy
import sys
sys.stdout = open("extracted.csv", "w")
df = investpy.get_stock_historical_data(stock='JFC',
country='philippines',
from_date='25/11/2020',
to_date='18/12/2020')
print(df)
sys.stdout.close()
Now,
I'm trying to make it more advance.
I want to run this code multiple times automatically with different stock name(about 300 plus name) and export it respectively.
I know it is possible but I cannot search the exact terminology to this problem.
Hoping for your help.
Regards,
you can store the stock's name as a list and then iterate through the list and save all the dataframes into separate files.
import investpy
import sys
stocks_list = ['JFC','AAPL',....] # your stock lists
for stock in stocks_list:
df = investpy.get_stock_historical_data(stock=stock,
country='philippines',
from_date='25/11/2020',
to_date='18/12/2020')
print(df)
file_name = 'extracted_'+stock+'.csv'
df.to_csv(file_name,index=False)

Data from a table getting printed to csv in a single line

I've written a script to parse data from the first table of a website. I've used xpath to parse the table. Btw, I didn't use "tr" tag cause without using it I can still see the results in the console when printed. When I run my script, the data are getting scraped but being printed in a single line in a csv file. I can't find out the mistake I'm making. Any input on this will be highly appreciated. Here is what I've tried with:
import csv
import requests
from lxml import html
url="https://fantasy.premierleague.com/player-list/"
response = requests.get(url).text
outfile=open('Data_tab.csv','w', newline='')
writer=csv.writer(outfile)
writer.writerow(["Player","Team","Points","Cost"])
tree = html.fromstring(response)
for titles in tree.xpath("//table[#class='ism-table']")[0]:
# tab_r = titles.xpath('.//tr/text()')
tab_d = titles.xpath('.//td/text()')
writer.writerow(tab_d)
You might want to add a level of looping, examining each table row in turn.
Try this:
for titles in tree.xpath("//table[#class='ism-table']")[0]:
for row in titles.xpath('./tr'):
tab_d = row.xpath('./td/text()')
writer.writerow(tab_d)
Or, perhaps this:
table = tree.xpath("//table[#class='ism-table']")[0]
for row in table.xpath('.//tr'):
items = row.xpath('./td/text()')
writer.writerow(items)
Or you could have the first XPath expression find the rows for you:
rows = tree.xpath("(.//table[#class='ism-table'])[1]//tr")
for row in rows:
items = row.xpath('./td/text()')
writer.writerow(items)

Can't perform reverse web search from a csv file

I've written some code to scrape "Address" and "Phone" against some shop names which is working fine. However, it has got two parameters to be filled in to perform it's activity. I expected to do the same from a csv file where "Name" will be in first column and "Lid" will be in second column and the harvested results will be placed across third and fourth column accordingly. At this point, I can't get any idea as to how I can perform the search from a csv file. Any suggestion will be vastly appreciated.
import requests
from lxml import html
Names=["Literati Cafe","Standard Insurance Co","Suehiro Cafe"]
Lids=["3221083","497670909","12183177"]
for Name in Names and Lids:
Page_link="https://www.yellowpages.com/los-angeles-ca/mip/"+Name.replace(" ","-")+"-"+Name
response = requests.get(Page_link)
tree = html.fromstring(response.text)
titles = tree.xpath('//article[contains(#class,"business-card")]')
for title in titles:
Address= title.xpath('.//p[#class="address"]/span/text()')[0]
Contact = title.xpath('.//p[#class="phone"]/text()')[0]
print(Address,Contact)
You can get your Names and Lids lists from CSV like:
import csv
Names, Lids = [], []
with open("file_name.csv", "r") as f:
reader = csv.DictReader(f)
for line in reader:
Names.append(line["Name"])
Lids.append(line["Lid"])
(nevermind PEP violations for now ;)). Then you can use it in the rest of your code, although I'm not sure what you are trying to achieve with your for Name in Names and Lids: loop but it's not giving you what you think it is - it will not loop through the Names list but only through the Lids list.
Also the first order of optimization should be to replace your loop with the loop over the CSV, like:
with open("file_name.csv", "r") as f:
reader = csv.DictReader(f)
for entry in reader:
page_link = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
# rest of your scraping code...

CSV Text Extraction Beautifulsoup

I am new to python and this is my first practice code with Beautifulsoup. I have not learned creative solutions to specific data extract problems yet.
This program prints just fine but there is some difficult in extracting to the CSV. It takes the first elements but leaves all others behind. I can only guess there might be some whitespace, delimiter, or something that causes the code to halt extraction after initial texts???
I was trying to get the CSV extraction to happen to each item by row but obviously floundered. Thank you for any help and/or advice you can provide.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
price_page = 'http://www.harryrosen.com/footwear/c/boots'
page = urlopen(price_page)
soup = BeautifulSoup(page, 'html.parser')
product_data = soup.findAll('ul', attrs={'class': 'productInfo'})
for item in product_data:
brand_name=item.contents[1].text.strip()
shoe_type=item.contents[3].text.strip()
shoe_price = item.contents[5].text.strip()
print (brand_name)
print (shoe_type)
print (shoe_price)
with open('shoeprice.csv', 'w') as shoe_prices:
writer = csv.writer(shoe_prices)
writer.writerow([brand_name, shoe_type, shoe_price])
Here is one way to approach the problem:
collect the results into a list of dictionaries with a list comprehension
write the results to a CSV file via the csv.DictWriter and a single .writerows() call
The implementation:
data = [{
'brand': item.li.get_text(strip=True),
'type': item('li')[1].get_text(strip=True),
'price': item.find('li', class_='price').get_text(strip=True)
} for item in product_data]
with open('shoeprice.csv', 'w') as f:
writer = csv.DictWriter(f, fieldnames=['brand', 'type', 'price'])
writer.writerows(data)
If you want to also write the CSV headers, add the writer.writeheader() call before the writer.writerows(data).
Note that you could have as well used the regular csv.writer and a list of lists (or tuples), but I like the explicitness and the increased readability of using dictionaries in this case.
Also note that I've improved the locators used in the loop - I don't think using the .contents list and getting product children by indexes is a good and reliable idea.
with open('shoeprice.csv', 'w') as shoe_prices:
writer = csv.writer(shoe_prices)
for item in product_data:
brand_name=item.contents[1].text.strip()
shoe_type=item.contents[3].text.strip()
shoe_price = item.contents[5].text.strip()
print (brand_name, shoe_type, shoe_price, spe='\n')
writer.writerow([brand_name, shoe_type, shoe_price])
Change the open file to the outer loop, so you do not need to open file each loop.

Resources