Scraping addresses using Selenium and Python - python-3.x

First of all, I have to say that I am just starting with Python. I would like to grab addresses from a webpage that is built with a script, using Python 3.x and Selenium. The simple code below generates a full list of shops, but I want to split it up to build a table with named columns (name, street, zip code, etc.). I hope there is a smart solution.
from selenium import webdriver
browser = webdriver.Chrome(executable_path="E:/Dysk Google/Dokumenty/chromedriver")
browser.get("http://hilding.pl/materace-mazowieckie.html")
shops=browser.find_element_by_id('div_province')
print(shops)
browser.close()

Try the below script. Here is how you can get the name, street, zip code, etc.
from selenium import webdriver
Browser = webdriver.Chrome() ##If necessary, include the path
Browser.get("http://hilding.pl/materace-mazowieckie.html")
for items in Browser.find_elements_by_css_selector("#div_province .shop"):
    name = items.find_element_by_css_selector(".name").text
    street = items.find_element_by_css_selector(".streat").text
    zip_code = items.find_element_by_css_selector(".zipcode").text
    print(name, street, zip_code)
Browser.quit()
Partial Output:
SALON NAP ul. Jagielska 73 02-886
SALON NAP/ DOMOTEKA ul. Malborska 41 03-286
SALON NAP ul. Mysia 3 00 - 496
SKLEP ECCELENT DOMOTEKA ul. Malborska 41 03-286
SKLEP ECCELENT, C.H. MEGA MEBLE al. Jerozolimskie 200 02-486
SKLEP ECCELENT, CH JUPITER ul. Towarowa 22 00-839
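If you want the results in a table with named columns rather than printed lines, here is a minimal sketch that collects the same fields into a pandas DataFrame; it assumes the same selectors as above and that pandas is installed:
import pandas as pd
from selenium import webdriver

browser = webdriver.Chrome()  # include the path to chromedriver if necessary
browser.get("http://hilding.pl/materace-mazowieckie.html")

rows = []
for shop in browser.find_elements_by_css_selector("#div_province .shop"):
    rows.append({
        "Name": shop.find_element_by_css_selector(".name").text,
        "Street": shop.find_element_by_css_selector(".streat").text,  # the site's own class name
        "Zip code": shop.find_element_by_css_selector(".zipcode").text,
    })
browser.quit()

df = pd.DataFrame(rows, columns=["Name", "Street", "Zip code"])
print(df)
df.to_csv("shops.csv", index=False)  # optional: save the table to a CSV file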

Related

Beautiful Soup Scraping

I'm having issues with previously working code that no longer functions correctly.
My Python code scrapes a website using Beautiful Soup and extracts event data (date, event, link).
My code pulls all of the events located in the tbody. Each event is stored in a <tr class="Box">. The issue is that my scraper seems to stop once it reaches a <tr style="box-shadow: none;"> (a section containing 3 advertisements on the site for events that I don't want to scrape); after that point the code stops pulling event data from the <tr class="Box"> rows. Is there a way to skip this tr style / ignore future cases?
import pandas as pd
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
source = urllib.request.urlopen('https://10times.com/losangeles-us/technology/conferences').read()
soup = bs.BeautifulSoup(source,'html.parser')
#---Get Event Data---
test1=[]
table = soup.find('tbody')
table_rows = table.find_all('tr') #find table rows (tr)
for x in table_rows:
    data = x.find_all('td')  # find table data
    row = [td.text for td in data]
    if len(row) > 2:  # excludes rows with only event name/link, but no data
        test1.append(row)
test1
The data is loaded dynamically via JavaScript, so you don't see more results. You can use this example to load more pages:
import requests
from bs4 import BeautifulSoup
url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
params = {"page": 1, "ajax": 1}
headers = {"X-Requested-With": "XMLHttpRequest"}
for params["page"] in range(1, 4):  # <-- increase number of pages here
    print("Page {}..".format(params["page"]))
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params=params).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        print(tds)
Prints:
Page 1..
['Tue, 29 Sep - Thu, 01 Oct 2020', 'Lens Los Angeles', 'Intercontinental Los Angeles Downtown, Los Angeles', 'LENS brings together the entire Degreed community - our clients, invited prospective clients, thought leaders, partners, employees, executives, and industry experts for two days of discussion, workshops,...', 'Business Services IT & Technology', 'Interested']
['Wed, 30 Sep - Sat, 03 Oct 2020', 'FinCon', 'Long Beach Convention & Entertainment Center, Long Beach 20.1 Miles from Los Angeles', 'FinCon will be helping financial influencers and brands create better content, reach their audience, and make more money. Collaborate with other influencers who share your passion for making personal finance...', 'Banking & Finance IT & Technology', 'Interested 7 following']
['Mon, 05 - Wed, 07 Oct 2020', 'NetDiligence Cyber Risk Summit', 'Loews Santa Monica Beach Hotel, Santa Monica 14.6 Miles from Los Angeles', 'NetDiligence Cyber Risk Summit will conference are attended by hundreds of cyber risk insurance, legal/regulatory and security/privacy technology leaders from all over the world. Connect with leaders in...', 'IT & Technology', 'Interested']
... etc.
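Since the original code builds a pandas DataFrame, here is a minimal sketch of how the scraped rows could be collected into one instead of printed; it assumes the same AJAX endpoint and row structure as above, and the link extraction assumes each row contains an anchor pointing at the event page:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
headers = {"X-Requested-With": "XMLHttpRequest"}

rows = []
for page in range(1, 4):  # <-- increase number of pages here
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params={"page": page, "ajax": 1}).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        link = tr.a.get("href") if tr.a else None  # assumption: first anchor in the row is the event link
        rows.append(tds + [link])

df = pd.DataFrame(rows)
print(df.head())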

Web scraping: not able to scrape text and href for a given div and class, and to skip the span tag

Trying to get the text and href for the top news, but not able to scrape it.
Website: News site
My code:
import requests
from bs4 import BeautifulSoup
import psycopg2
import time
def checkResponse(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        return None

def getTitleURL():
    url = 'http://sandesh.com/'
    response = checkResponse(url)
    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        for values in html.find_all('div', class_='d-top-news-latest'):
            headline = values.find(class_='d-s-NSG-regular').text
            url = values.find(class_='d-s-NSG-regular')['href']
            print(headline + "->" + url)

if __name__ == '__main__':
    print('Getting the list of names....')
    names = getTitleURL()
    print('... done.\n')
Output:
Getting the list of names....
Corona live
મેડિકલ સ્ટાફ પર હુમલા અંગે અમિત શાહે ડોક્ટર્સ સાથે કરી ચર્ચા, સુરક્ષાની ખાતરી આપતા કરી અપીલ
Ahmedabad
ગુજરાતમાં કૂદકેને ભૂસકે વધ્યો કોરોના વાયરસનો કહેર, આજે નવા 94 કેસ નોંધાયા, જાણો કયા- કેટલા કેસ નોંધાયા
Corona live
જીવન અને મોત વચ્ચે સંઘર્ષ કરી રહ્યો છે દુનિયાનો સૌથી મોટો તાનાશાહ કિમ જોંગ! ટ્રમ્પે કહી આ વાત
Ahmedabad
અમદાવાદમાં નર્સિંગ સ્ટાફનો ગુસ્સો ફૂટ્યો, ‘અમારું કોઈ સાંભળતું નથી, અમારો કોરોના ટેસ્ટ જલદી કરાવો’
Business
ભારતીય ટેલિકોમ જગતમાં સૌથી મોટી ડીલ, ફેસબુક બની જિયોની સૌથી મોટી શેરહોલ્ડર
->http://sandesh.com/amit-shah-talk-with-ima-and-doctors-through-video-conference-on-attack/
... done.
I want to skip the text inside the span tag, and I am only able to get 1 href. Also, the headline is a list.
How do I get each title and URL?
I am trying to scrape the part highlighted in red.
First, in for values in html.find_all('div', class_='d-top-news-latest') you don't need a for loop, because the DOM has just one element with the class d-top-news-latest.
Second, to get the title you can use select('span'), because your title is inside the span tag.
Third, as you noted, the headline is a list, so you need a for loop to get each title and URL.
values = html.find('div', class_='d-top-news-latest')
for i in values.find_all('a', href=True):
    print(i.select('span'))
    print(i['href'])
OUTPUT
Getting the list of names....
[<span>
Corona live
</span>]
http://sandesh.com/maharashtra-home-minister-anil-deshmukh-issue-convicts-list-of-palghar-case/
[<span>
Corona live
</span>]
http://sandesh.com/two-doctors-turn-black-after-treatment-of-coronavirus-in-china/
[<span>
Corona live
</span>]
http://sandesh.com/bihar-asi-gobind-singh-suspended-for-holding-home-guard-jawans-after-stopping-officers-car-asi/
[<span>
Ahmedabad
</span>]
http://sandesh.com/jayanti-ravi-surprise-statement-sparks-outcry-big-decision-taken-despite-more-patients-in-gujarat/
[<span>
Corona live
</span>]
http://sandesh.com/amit-shah-talk-with-ima-and-doctors-through-video-conference-on-attack/
... done.
To remove the "span" part:
values = html.find('div', class_='d-top-news-latest')
for i in values.find_all('a', href=True):
    i.span.decompose()
    print(i.text)
    print(i['href'])
Output:
Getting the list of names....
ગુજરાતમાં કોરોનાનો કહેરઃ રાજ્યમાં આજે કોરોનાના 135 નવા કેસ, વધુ 8 લોકોનાં મોત
http://sandesh.com/gujarat-corona-update-206-new-cases-and-18-deaths/
ચીનના વૈજ્ઞાનિકોએ જ ખોલી જીનપિંગની પોલ, કોરોના વાયરસને લઈને કર્યો સનસની ખુલાસો
http://sandesh.com/chinese-scientists-claim-over-corona-virus/
શું લોકડાઉન ફરી વધારાશે? PM મોદી 27મીએ ફરી એકવાર તમામ CM સાથે કરશે ચર્ચા
http://sandesh.com/pm-modi-to-hold-video-conference-with-cms-on-april-27-lockdown-extension/
કોરોના વાયરસને લઈ મોટી ભવિષ્યવાણી, દુનિયાના 30 દેશો પર ઉભુ થશે ભયંકર સંકટ
http://sandesh.com/after-corona-attack-now-hunger-will-kill-many-people-in-the-world/
દેશમાં 24 કલાકમાં 1,486 કોરોનાનાં નવા કેસ, પરંતુ મળ્યા સૌથી મોટા રાહતનાં સમાચાર
http://sandesh.com/recovery-rate-increased-in-corona-patients-says-health-ministry/
... done.
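If you want the category, headline, and URL together in one structure, here is a minimal sketch that combines the two snippets above; it assumes the same page layout, i.e. each a tag inside d-top-news-latest contains a span with the category followed by the headline text:
import requests
from bs4 import BeautifulSoup

html = BeautifulSoup(requests.get('http://sandesh.com/').content, 'html.parser')
values = html.find('div', class_='d-top-news-latest')

news = []
for a in values.find_all('a', href=True):
    span = a.find('span')
    category = span.get_text(strip=True) if span else None  # e.g. "Corona live"
    if span:
        span.decompose()  # drop the span so only the headline text remains
    news.append({
        'category': category,
        'headline': a.get_text(strip=True),
        'url': a['href'],
    })

for item in news:
    print(item['category'], '->', item['headline'], '->', item['url'])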

How to pull out data when it shows <!--Content End--> after the JavaScript?

I've just started to learn Python. I want to extract grass temperature data every three hours for academic purposes. The website is below:
https://www.weather.gov.hk/wxinfo/ts/display_graph_grass_e.htm?kp&
I tried to use BeautifulSoup to pull out the data with a script, but before I reach the data I want, there is a <!--Content End--> comment after the JavaScript, and I can't scrape the script behind it. Why does that happen, and is there any solution for it?
The data is stored as a JavaScript array in the HTML page. We can use re and ast.literal_eval to retrieve it:
import re
import requests
from ast import literal_eval
url = 'https://www.weather.gov.hk/wxinfo/ts/display_graph_grass_e.htm?kp&'
html_text = requests.get(url).text
station_code = literal_eval(re.findall(r'StationCode\s*=.*?(\(.*?\))', html_text)[0])
station_name = literal_eval(re.findall(r'stnname\s*=.*?(\(.*?\))', html_text)[0])
station_height = literal_eval(re.findall(r'stn_height\s*=.*?(\(.*?\))', html_text)[0])
grass_temp = literal_eval(re.findall(r'grasstemp\s*=.*?(\(.*?\))', html_text)[0])
min_since_17 = literal_eval(re.findall(r'minSince17\s*=.*?(\(.*?\))', html_text)[0])
min_hour = literal_eval(re.findall(r'minHour\s*=.*?(\(.*?\))', html_text)[0])
min_minute = literal_eval(re.findall(r'minMinute\s*=.*?(\(.*?\))', html_text)[0])
rows = [*zip(station_code, station_name, station_height, grass_temp, min_since_17, min_hour, min_minute)]
headers = ['Station Code', 'Station Name', 'Station Height', 'Grass Temp', 'Min_since_17', 'Min Hour', 'Min Minute']
print(''.join('{: <20}'.format(d) for d in headers))
for row in rows:
    print(''.join('{: <20}'.format(d) for d in row))
Prints:
Station Code        Station Name        Station Height      Grass Temp          Min_since_17        Min Hour            Min Minute
kp                  King's Park         65                  25.4                25.3                23                  35
tkl                 Ta Kwu Ling         15                  25.4                24.8                17                  00
tms                 Tai Mo Shan         955                 21.3                21.3                07                  19
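Since the goal is to collect the grass temperature every three hours, here is a minimal sketch of how the extracted rows could be timestamped and appended to a CSV log; it reuses the same regex extraction and assumes the script is run on an external schedule (e.g. cron or Task Scheduler), with grass_temp_log.csv as an illustrative file name:
import os
import re
import requests
import pandas as pd
from ast import literal_eval
from datetime import datetime

url = 'https://www.weather.gov.hk/wxinfo/ts/display_graph_grass_e.htm?kp&'
html_text = requests.get(url).text

def extract(name):
    # pull the parenthesised argument list of the JavaScript array with the given name
    return literal_eval(re.findall(rf'{name}\s*=.*?(\(.*?\))', html_text)[0])

df = pd.DataFrame({
    'Station Code': extract('StationCode'),
    'Station Name': extract('stnname'),
    'Grass Temp': extract('grasstemp'),
})
df['Scraped At'] = datetime.now().isoformat(timespec='minutes')

# append to a running log; write the header only the first time
log_file = 'grass_temp_log.csv'
df.to_csv(log_file, mode='a', header=not os.path.exists(log_file), index=False)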

Why am I getting <searchconsole.query.Report(rows=1)> instead of numbers/strings?

I'm working with the Search Console API and made it through the basics.
Now I'm stuck on splitting and arranging the data:
When trying to split, I'm getting NaN; nothing I try works.
46 ((174.0, 3753.0, 0.04636290967226219, 7.816147...
47 ((93.0, 2155.0, 0.0431554524361949, 6.59025522...
48 ((176.0, 4657.0, 0.037792570324243074, 6.90251...
49 ((20.0, 1102.0, 0.018148820326678767, 7.435571...
50 ((31.0, 1133.0, 0.02736098852603707, 8.0935569...
Name: test, dtype: object
When trying to manipulate the data like this (and similar interactions):
data=source['test'].tolist()
data
It's clear that the data is not really available...
[<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>,
<searchconsole.query.Report(rows=1)>]
Does anyone have an idea how I can interact with my data?
Thanks.
For reference, this is the code and the library I work with:
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']
def APIsc(date, keyword):
    results = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    return results
source['test']=source.apply(lambda x: APIsc(x.date, x.keyword), axis=1)
source
made by: https://github.com/joshcarty/google-searchconsole
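Each cell in source['test'] is a Report object rather than a plain value, which is why .tolist() only shows the object representations. Below is a minimal sketch of one way to unpack them, reusing webproperty and source from the code above; it assumes the Report object exposes a rows attribute whose entries carry clicks, impressions, ctr, and position fields, as described in the library's documentation, and the helper name is hypothetical:
import pandas as pd

def APIsc_values(date, keyword):
    # hypothetical helper: return plain numbers instead of the Report object
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    if report.rows:  # assumption: Report exposes .rows as a list of named tuples
        row = report.rows[0]  # single aggregated row for this query
        return pd.Series([row.clicks, row.impressions, row.ctr, row.position],
                         index=['clicks', 'impressions', 'ctr', 'position'])
    return pd.Series(index=['clicks', 'impressions', 'ctr', 'position'], dtype=float)

metrics = source.apply(lambda x: APIsc_values(x.date, x.keyword), axis=1)
source = source.join(metrics)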

Is there a way to iterate over a list to add into the selenium code?

I am trying to iterate over a large list of dealership names and cities. I want the script to loop over each entry in the list and get the results for each one separately.
# this is only a portion of the dealers; the rest are in a file
Dealers= ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord', 'Advantage Audi' ]
driver=webdriver.Chrome("C:\\Users\\kevin\\Anaconda3\\chromedriver.exe")
driver.set_page_load_timeout(30)
driver.get("https://www.bbb.org/")
driver.maximize_window()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/div[2]/button").click()
driver.find_element_by_xpath("""//*[@id="findTypeaheadInput"]""").send_keys("Mossy Ford")
driver.find_element_by_xpath("""//*[@id="nearTypeaheadInput"]""").send_keys("San Diego, CA")
driver.find_element_by_xpath("""/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/button""").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div[6]/div").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/a').click()
#contact_names= driver.find_elements_by_xpath('/html/body/div[1]/div/div/div/div[2]/div/div[5]/div/div[1]/div[1]/div/div/ul[1]')
#print(contact_names)
#print("Query Link: ", driver.current_url)
#driver.quit()
from selenium import webdriver
dealers= ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord']
cities = ['San Diego, CA', 'Rio Vista, CA', 'Concord, CA']
driver=webdriver.Chrome("C:\\Users\\kevin\\Anaconda3\\chromedriver.exe")
driver.set_page_load_timeout(30)
driver.get("https://www.bbb.org/")
driver.maximize_window()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/div[2]/button").click()
for d in dealers:
    driver.find_element_by_xpath("""//*[@id="findTypeaheadInput"]""").send_keys("dealers")
    for c in cities:
        driver.find_element_by_xpath("""//*[@id="nearTypeaheadInput"]""").send_keys("cities")
        driver.find_element_by_xpath("""/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/button""").click()
        driver.implicitly_wait(10)
        driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div[6]/div").click()
        driver.implicitly_wait(10)
        driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/a').click()
        contact_names = driver.find_elements_by_class_name('styles__UlUnstyled-sc-1fixvua-1 ehMHcp')
        print(contact_names)
        print("Query Link: ", driver.current_url)
driver.quit()
I want to be able to go to each of these different dealerships' pages and pull all of their details, then loop through the rest. I am just struggling with the idea of for loops within Selenium.
It's better to create a dictionary mapping each dealer to its city and loop through it:
Dealers_Cities_Dict = {
    "Mossy Ford": "San Diego, CA",
    "Abel Chevrolet Pontiac Buick": "City",
    "Acura of Concord": "City",
    "Advantage Audi": "City"
}
for dealer, city in Dealers_Cities_Dict.items():
    # This is where the rest of the code sits
    driver.find_element_by_id("findTypeaheadInput").send_keys(dealer)
    driver.find_element_by_id("nearTypeaheadInput").send_keys(city)
